UP | HOME

Abstract Syntax Trees and (Un)parsing

Table of Contents

\( %% Your math definitions here % \newcommand{\alphaequiv}{{\underset{\raise 0.7em\alpha}{=}}} \newcommand{\yields}{\Rightarrow} \newcommand{\derives}{\overset{*}{\yields}} \newcommand{\alphaequiv}{=_{\alpha}} \newcommand{\tto}[2]{{\overset{#1}{\underset{#2}{\longrightarrow}}}} \newcommand{\transitsto}[2]{{\overset{#1}{\underset{#2}{\longrightarrow}}}} \newcommand{\xtransitsto}[2]{{\underset{#2}{\xrightarrow{#1}}}} \newcommand{\xtransitsfrom}[2]{{\underset{#2}{\xleftarrow{#1}}}} \newcommand{\xto}[2]{{\xtransitsto{#1}{#2}}} \newcommand{\xfrom}[2]{{\xtransitsfrom{#1}{#2}}} \newcommand{\xreaches}[2]{{\underset{#2}{\xtwoheadrightarrow{#1}}}} \newcommand{\reaches}[2]{{\underset{#2}{\xtwoheadrightarrow{#1}}}} %\newcommand{\reaches}[2]{{\overset{#1}{\underset{#2}{\twoheadrightarrow}}}} %\newcommand{\goesto}[2]{\transitsto{#1}{#2}} %\newcommand{\betareducesto}{{\underset{\beta}{\rightarrow}}} \newcommand{\betareducesto}{\rightarrow_{\beta}} %\newcommand{\etareducesto}{{\underset{\eta}{\rightarrow}}} \newcommand{\etareducesto}{\rightarrow_{\eta}} %\newcommand{\betaetareducesto}{{\underset{\beta\ \eta}{\rightarrow}}} \newcommand{\betaetareducesto}{\rightarrow_{\beta\eta}} \newcommand{\preducesto}{\rhd} \newcommand{\psimplifiesto}{\stackrel{\scriptstyle{*}}{\rhd}} \newcommand{\lreducesto}{\rightsquigarrow} \newcommand{\lsimplifiesto}{\stackrel{\scriptstyle{*}}{\lreducesto}} \newcommand{\rewritesto}{\hookrightarrow} \newcommand{\goesto}[1]{\stackrel{#1}{\rightarrow}} \newcommand{\xgoesto}[1]{\xrightarrow{#1}} \newcommand{\reducesto}{\stackrel{}{\rightarrow}} \newcommand{\simplifiesto}{\stackrel{\scriptstyle{*}}{\rightarrow}} \newcommand{\connected}[1]{\stackrel{#1}{\leftrightarrow}} \newcommand{\joins}{\downarrow} \newcommand{\evaluatesto}{\Longrightarrow} %\newcommand{\lit}[1]{\hbox{\sf{#1}}} \newcommand{\lit}[1]{{\sf{#1}}} \newcommand{\true}{\lit{true}} \newcommand{\false}{\lit{false}} \def\Z{\mbox{${\mathbb Z}$}} \def\N{\mbox{${\mathbb N}$}} \def\P{\mbox{${\mathbb P}$}} \def\R{\mbox{${\mathbb R}$}} \newcommand{\Rp}{{\mathbb{R}}^+} \def\Bool{\mbox{${\mathbb B}$}} \def\sA{\mbox{${\cal A}$}} \def\sB{\mbox{${\cal B}$}} \def\sC{\mbox{${\cal C}$}} \def\sD{\mbox{${\cal D}$}} \def\sF{\mbox{${\cal F}$}} \def\sG{\mbox{${\cal G}$}} \def\sL{\mbox{${\cal L}$}} \def\sP{\mbox{${\cal P}$}} \def\sM{\mbox{${\cal M}$}} \def\sN{\mbox{${\cal N}$}} \def\sR{\mbox{${\cal R}$}} \def\sS{\mbox{${\cal S}$}} \def\sO{\mbox{${\cal O}$}} \def\sT{\mbox{${\cal T}$}} \def\sU{\mbox{${\cal U}$}} \def\th{\mbox{$\widetilde{h}$}} \def\tg{\mbox{$\widetilde{g}$}} \def\tP{\mbox{$\widetilde{P}$}} \def\norm{\mbox{$\parallel$}} \def\osum{${{\bigcirc}}\!\!\!\!{\rm s}~$} \def\pf{\noindent {\bf Proof}~~} \def\exec{\mathit{exec}} \def\Act{\mathit{A\!ct}} \def\Traces{\mathit{Traces}} \def\Spec{\mathit{Spec}} \def\uns{\mathit{unless}} \def\ens{\mathit{ensures}} \def\lto{\mathit{leads\!\!-\!\!to}} \def\a{\alpha} \def\b{\beta} \def\c{\gamma} \def\d{\delta} \def\sP{\mbox{${\cal P}$}} \def\sM{\mbox{${\cal M}$}} \def\sA{\mbox{${\cal A}$}} \def\sB{\mbox{${\cal B}$}} \def\sC{\mbox{${\cal C}$}} \def\sI{\mbox{${\cal I}$}} \def\sS{\mbox{${\cal S}$}} \def\sD{\mbox{${\cal D}$}} \def\sF{\mbox{${\cal F}$}} \def\sG{\mbox{${\cal G}$}} \def\sR{\mbox{${\cal R}$}} \def\tg{\mbox{$\widetilde{g}$}} \def\ta{\mbox{$\widetilde{a}$}} \def\tb{\mbox{$\widetilde{b}$}} \def\tc{\mbox{$\widetilde{c}$}} \def\tx{\mbox{$\widetilde{x}$}} \def\ty{\mbox{$\widetilde{y}$}} \def\tz{\mbox{$\widetilde{z}$}} \def\tI{\mbox{$\widetilde{I}$}} \def\norm{\mbox{$\parallel$}} \def\sL{\mbox{${\cal L}$}} \def\sM{\mbox{${\cal M}$}} \def\sN{\mbox{${\cal N}$}} \def\th{\mbox{$\widetilde{h}$}} \def\tg{\mbox{$\widetilde{g}$}} \def\tP{\mbox{$\widetilde{P}$}} \def\norm{\mbox{$\parallel$}} \def\to{\rightarrow} \def\ov{\overline} \def\gets{\leftarrow} \def\too{\longrightarrow} \def\To{\Rightarrow} \def\points{\mapsto} %\def\yields{\mapsto^{*}} \def\un{\underline} \def\vep{$\varepsilon$} \def\ep{$\epsilon$} \def\tri{$\bigtriangleup$} \def\Fi{$F^{\infty}$} \def\Di{\Delta^{\infty}} \def\ebox\Box \def\emp{\emptyset} \def\leadsto{\rightharpoondown^{*}} \newcommand{\benum}{\begin{enumerate}} \newcommand{\eenum}{\end{enumerate}} \newcommand{\bdes}{\begin{description}} \newcommand{\edes}{\end{description}} \newcommand{\bt}{\begin{theorem}} \newcommand{\et}{\end{theorem}} \newcommand{\bl}{\begin{lemma}} \newcommand{\el}{\end{lemma}} %\newcommand{\bp}{\begin{prop}} %\newcommand{\ep}{\end{prop}} \newcommand{\bd}{\begin{defn}} \newcommand{\ed}{\end{defn}} \newcommand{\brem}{\begin{remark}} \newcommand{\erem}{\end{remark}} \newcommand{\bxr}{\begin{exercise}} \newcommand{\exr}{\end{exercise}} \newcommand{\bxm}{\begin{example}} \newcommand{\exm}{\end{example}} \newcommand{\beqa}{\begin{eqnarray*}} \newcommand{\eeqa}{\end{eqnarray*}} \newcommand{\bc}{\begin{center}} \newcommand{\ec}{\end{center}} \newcommand{\bcent}{\begin{center}} \newcommand{\ecent}{\end{center}} \newcommand{\la}{\langle} \newcommand{\ra}{\rangle} \newcommand{\bcor}{\begin{corollary}} \newcommand{\ecor}{\end{corollary}} \newcommand{\bds}{\begin{defns}} \newcommand{\eds}{\end{defns}} \newcommand{\brems}{\begin{remarks}} \newcommand{\erems}{\end{remarks}} \newcommand{\bxrs}{\begin{exercises}} \newcommand{\exrs}{\end{exercises}} \newcommand{\bxms}{\begin{examples}} \newcommand{\exms}{\end{examples}} \newcommand{\bfig}{\begin{figure}} \newcommand{\efig}{\end{figure}} \newcommand{\set}[1]{\{#1\}} \newcommand{\pair}[1]{\langle #1\rangle} \newcommand{\tuple}[1]{\langle #1\rangle} \newcommand{\union}{\cup} \newcommand{\Union}{\bigcup} \newcommand{\intersection}{\cap} \newcommand{\Intersection}{\bigcap} \newcommand{\B}{\textbf{B}} %\newcommand{\be}[2]{\begin{equation} \label{#1} \tag{#2} \end{equation}} \newcommand{\abs}[1]{{\lvert}#1{\rvert}} \newcommand{\id}[1]{\mathit{#1}} \newcommand{\pfun}{\rightharpoonup} %\newcommand{\ra}[1]{\kern-1.5ex\xrightarrow{\ \ #1\ \ }\phantom{}\kern-1.5ex} %\newcommand{\ras}[1]{\kern-1.5ex\xrightarrow{\ \ \smash{#1}\ \ }\phantom{}\kern-1.5ex} \newcommand{\da}[1]{\bigg\downarrow\raise.5ex\rlap{\scriptstyle#1}} \newcommand{\ua}[1]{\bigg\uparrow\raise.5ex\rlap{\scriptstyle#1}} % \newcommand{\lift}[1]{#1_{\bot}} \newcommand{\signal}[1]{\tilde{#1}} \newcommand{\ida}{\stackrel{{\sf def}}{=}} \newcommand{\eqn}{\doteq} \newcommand{\deduce}[1]{\sststile{#1}{}} %% \theoremstyle{plain}%default %% \newtheorem{thm}{Theorem}[section] %% \newtheorem{lem}[thm]{Lemma} %% \newtheorem{cor}[thm]{Corollary} %% % %% \theoremstyle{definition} %% \newtheorem{defn}[thm]{Definition} %% % %% \theoremstyle{remark} %% \newtheorem{remark}[thm]{Remark} %% \newtheorem{exercise}[thm]{Exercise} %% See http://u.cs.biu.ac.il/~tsaban/Pdf/LaTeXCommonErrs.pdf %% \newtheorem{prop}[thm]{Proposition} \newcommand{\less}[1]{#1_{<}} \newcommand{\pfn}{\rightharpoonup} \newcommand{\ffn}{\stackrel{{\sf fin}}{\rightharpoonup}} \newcommand{\stkout}[1]{\ifmmode\text{\sout{\ensuremath{#1}}}\else\sout{#1}\fi} % \DeclareMathSymbol{\shortminus}{\mathbin}{AMSa}{"39} \newcommand{\mbf}[1]{\mathbf{#1}} \)

1 Introduction

1.1 How is a program represented?

Welcome
We continue with our study of the principles of programming languages.
Problem addressed here
We start with the following question: How are programs represented?
Multiple answers
This question has multiple answers. In this lesson, we will examine a couple of ways to answer this question.
Start small
During the course of this semester, we will examine many small programming languages. But it's best to start with something small, almost, but not quite trivial.

2 ADDITION Language

Addition language
We will consider the language we simply call ADDITION.
Addition expressions
Programs are arithmetic expressions involving natural numbers and a binary addition operator +.
Ignore Algebra of +
We will ignore the other algebraic properties (viz., commutativty and associativity) of + for the moment.

3 Program Representation

3.1 Concrete Syntax

Concrete Syntax
the structure of programs as they are directly seen or entered by the programmer. Usually the concrete syntax is a linear sequence of characters grouped into tokens.
C or Python examples
Here are possible representations of ADDITION expressions:
5
3+4
(3+4)+2

3.2 Strings vs Trees

Programs as strings
There is an obvious linear structure of a a program as a string, or sequence of characters.
Programs as Trees
There is a deeper, more abstract structure embedded within the string, which is a tree.

3.3 Abstract Syntax

Abstract Syntax Trees
Rather than consider concrete syntax, which usually involves writing down expressions linearly, we will assume that expressions are denoted abstractly as trees.
ASTs
We call these trees Abstract Syntax Trees, or ASTs for short.

3.4 Drawing ASTs

Leaves are numbers nodes
Notice that the leaves of these trees are numbers
Internal nodes are +
and each internal vertex is the operator +.
(3+4)+2:
       +
      / \
     +  2
    / \
   3  4



(3+1)+(2+3):
        *
       / \
      /   +
     +   / \
    / \  2  3 
   3  1

3.5 Lisp concrete syntax

Lisp concrete syntax
In the Lisp Family of languages, it is easier to identify the tree stucture from the concrete syntax: The matching parentheses enclose non-leaf subtrees.
(+ 3 4)

(+ (+ 3 4)
   2)

(+ (+ 3 1)
   (+ 2 3)

(+ (+ 3 
      (+ 2
         (+ 4 5)))
   (+ (+ 1 8)
      (+ 2 5)))

4 More complex languages

For the following code in Python,

     while (i <= x):
         i=g(i);
     return i;

An abstract syntax would look something like this:

        seq
       /   \
      while return
      / \     \
     /   \     i
    <=    @
   / \   / \
  i  x   g  i

Can you guess what the seq and @ stands for?

4.1 Practical notational conventions

Drawing trees
Drawing trees is a slightly more elaborate process
Notation
We will continue to write programs in the concrete syntax as strings, with the awareness that these really denote trees.
Example
we write
3+2
AST
but switch to the AST representation whenever necessary.
   +
  / \
 3   2

5 Syntax and Inference Rules

We are now interested in specifying the structure of abstract syntax trees.

5.1 Abstract Grammar

The abstract grammar specifies trees, not strings.

<ast> ::=  <number> | + <ast> <ast> 

5.2 ADDITION syntax as Inference Rules

Alternatively, we can use inference rules to specify the syntax

\begin{align*} \genfrac{}{}{4}{0}{}{n\ Exp} & n\in \N & \hspace{6em}{NUM}\\ \\ \\ % \genfrac{}{}{4}{0}{{e_1\ Exp}\hspace{2em}{e_2\ Exp}}{e_1 + e_2\ Exp} & & \hspace{2em}{PLUS} \end{align*}

6 AST's in Racket

6.1 Boilerplate

#lang racket
(require eopl)
(require rackunit)
;;; the only-in so that we import only what we need and we don't clash
;;; with bindings defined in the eopl module.
(require (only-in racket/list
                  first
                  second
                  third))
(provide (all-defined-out))

6.2 define-datatype

define-datatype is used to define a variant (or sum) datatype, i.e., a datatype that has multiple variants.

(define-datatype ast ast?
  [num (n number?)]  ; base case
  [plus (left? ast?) (right? ast?)]) ; inductive case

6.3 Constructors and Type predicates auto-defined

These constructors get automatically defined

;;; num : number? -> ast?
;;; plus : [ast?, ast?] -> ast?

These type predicates get automatically defined

;;; ast?    : any/c -> boolean?

6.4 Testing the AST functions

(check-true (ast? (num 5)))
(check-true (ast? (plus (num 5) (num 3))))

6.5 cases

cases can be used to identify the appropriate variant

(cases ast (plus (num 5) (num 4))
  [plus (l r) "it's a plus"]
  [else "it's a number"])

7 Exercise: A simple unparser

We now consider two problems

  • Choosing a concrete representation of addition expressions.
  • Writing an unparser to convert AST's to expressions in the concrete syntax.

7.1 Choosing a concrete representation

We will experiment with two choices:

  1. A datum (s-exp) based prefix syntax like Racket's own.
  2. An infix syntax, à la Python or C.

7.2 A Datum based prefix syntax

Here is the grammar for the concrete syntax. The quote marks enclose literal symbols.

<exp> ::= <num> | '(+ ' <exp> <exp>')'

7.3 Unparser and Parser

Unparser
An unparser takes an ast and returns an element of concrete syntax.
Parser
A parser does the opposite: it takes a string, determines if the string is an instance of the concrete syntax. If it is, the parser converts the string to an ast. If not, it raises an error.

7.4 Unparsing to a datum syntax: Racket implementation

(define (unparse a)
  (cases ast a
    [num (n) n]
    [plus (left right)
          (list '+ 
                (unparse left)
                (unparse right))]))

7.5 Testing unparse

(check-equal? 5 (unparse (num 5)))
(check-equal? '(+ 2 3) (unparse (plus (num 2) (num 3))))

7.6 An unparser using quasiquote

Quasiquote (or backquote) allows controlled evaluation inside an expression. Evaluation inside the expression is triggered in the context of a quasiquote and an unquote (comma).

`<exp> is short for (quasiquote <exp>). ,<exp> is short for (unquote <exp>).

(define (unparse/quasiquote a)
  (cases ast a
    [num (n) n]
    [plus (left right)
          `(+ 
            ,(unparse left)
            ,(unparse right))]))

7.7 An unparser that generates infix syntax

(define (unparse-infix a)
  (cases ast a
    [num (n) (format "~a" n)]
    [plus (left right)
          (format "(~a + ~a)" 
                  (unparse-infix left)
                  (unparse-infix right))]))

8 Exercise: A datum parser

8.1 A datum vs string parser

Datum parser
The concrete syntax is built using datum elements. It is much easier to build a datum parser because the conversion of strings of characters to datum elements, the heavy lifting, is already done by Racket.
String parser
The concrete syntax is a string. The parsing to ASTs involves more complex parsing methods, usually the stuff of a compilers course.

8.2 Does the concrete syntax determine the meaning of programs?

  • It is profitable to think of programs as trees, i.e., as Abstract Syntax trees.
  • The concrete syntax is relevant only insofar as how it appeals to a human reader of the program.
  • Therefore, we will pick a Racket-like syntax for the languages we build in this course. One advantage is that it is already familiar to us by now. The other advantage is that parsers are very easy to build for such concrete syntax in Racket.

8.3 A datum parser

The datum parser for the ADDITION language is quite simple.

;;; parse :: any/c -> ast? + error
(define parse
  (lambda (d)
    (cond [(number? d) (num d)]
          [(and (list? d) (= (length d) 3)
                (eq? (first d) '+))
           (plus (parse (second d)) 
                 (parse (third d)))]
          [else (error 'parse 
                       "invalid syntax ~a" d)])))

8.4 Testing parse

(check-equal? (num 5) (parse 5))
(check-equal? (plus (num 4) (num 3)) (parse '(+  4 3)))
(check-exn exn:fail? (lambda () (parse '(+ "hello" 3))))

8.5 Datum parser using match

match is a construct in Racket that allows pattern-matching. A given expression may be matched with different patterns. match is a derived keyword, but the implementation of match is not necessary to understand how to use it.

;;; parse/match :: any/c -> ast? + error
;;;
(require racket/match)
(define parse/match
  (lambda (d)
    (match d
      [(? number? n) (num n)]         ; is d a number, say n?
      [(list '+ left right)           ; is d a list consisting of '+, 
                                      ; and two other parts, 
                                      ; say, left and right?
       (plus (parse/match left) 
             (parse/match right))]
      [else (error 'parse/match
                   "invalid syntax ~a" d)])))

8.6 Testing parse/match

(check-equal? (num 5) (parse/match 5))
(check-equal? (plus (num 4) (num 3)) (parse/match '(+  4 3)))
(check-exn exn:fail? (lambda () (parse/match '(+ "hello" 3))))

9 Summary

9.1 Key concepts in this lesson

Concrete Syntax
Programs have concrete syntax.
Abstract Syntax Trees
The abstract syntax is the hidden hierarchical structure beneath the concrete syntax.
Syntax via inference rules
The syntax of a programming language may be specified via judgements and inference rules.
Racket implementation of AST as a datatype
An abstract syntax tree is realised a data type consisting of variants for each syntactic category.
Unparser
Converts an AST of a program to its concrete syntax.
Parser
Converts a concrete syntax of a program into its AST.

9.2 Key Racket concepts in this lesson

define-datatype
A way of defining variant abstract data types (inherited from EOPL).
quasiquote and unquote
Ways of controlling evaluation within an expression.
match
A pattern matcher for datums.

Venkatesh Choppella

2020-11-18 Wed 12:51

CC-BY-NC-ND 4.0