Abstract Syntax Trees and (Un)parsing
Table of Contents
\( %% Your math definitions here % \newcommand{\alphaequiv}{{\underset{\raise 0.7em\alpha}{=}}} \newcommand{\yields}{\Rightarrow} \newcommand{\derives}{\overset{*}{\yields}} \newcommand{\alphaequiv}{=_{\alpha}} \newcommand{\tto}[2]{{\overset{#1}{\underset{#2}{\longrightarrow}}}} \newcommand{\transitsto}[2]{{\overset{#1}{\underset{#2}{\longrightarrow}}}} \newcommand{\xtransitsto}[2]{{\underset{#2}{\xrightarrow{#1}}}} \newcommand{\xtransitsfrom}[2]{{\underset{#2}{\xleftarrow{#1}}}} \newcommand{\xto}[2]{{\xtransitsto{#1}{#2}}} \newcommand{\xfrom}[2]{{\xtransitsfrom{#1}{#2}}} \newcommand{\xreaches}[2]{{\underset{#2}{\xtwoheadrightarrow{#1}}}} \newcommand{\reaches}[2]{{\underset{#2}{\xtwoheadrightarrow{#1}}}} %\newcommand{\reaches}[2]{{\overset{#1}{\underset{#2}{\twoheadrightarrow}}}} %\newcommand{\goesto}[2]{\transitsto{#1}{#2}} %\newcommand{\betareducesto}{{\underset{\beta}{\rightarrow}}} \newcommand{\betareducesto}{\rightarrow_{\beta}} %\newcommand{\etareducesto}{{\underset{\eta}{\rightarrow}}} \newcommand{\etareducesto}{\rightarrow_{\eta}} %\newcommand{\betaetareducesto}{{\underset{\beta\ \eta}{\rightarrow}}} \newcommand{\betaetareducesto}{\rightarrow_{\beta\eta}} \newcommand{\preducesto}{\rhd} \newcommand{\psimplifiesto}{\stackrel{\scriptstyle{*}}{\rhd}} \newcommand{\lreducesto}{\rightsquigarrow} \newcommand{\lsimplifiesto}{\stackrel{\scriptstyle{*}}{\lreducesto}} \newcommand{\rewritesto}{\hookrightarrow} \newcommand{\goesto}[1]{\stackrel{#1}{\rightarrow}} \newcommand{\xgoesto}[1]{\xrightarrow{#1}} \newcommand{\reducesto}{\stackrel{}{\rightarrow}} \newcommand{\simplifiesto}{\stackrel{\scriptstyle{*}}{\rightarrow}} \newcommand{\connected}[1]{\stackrel{#1}{\leftrightarrow}} \newcommand{\joins}{\downarrow} \newcommand{\evaluatesto}{\Longrightarrow} %\newcommand{\lit}[1]{\hbox{\sf{#1}}} \newcommand{\lit}[1]{{\sf{#1}}} \newcommand{\true}{\lit{true}} \newcommand{\false}{\lit{false}} \def\Z{\mbox{${\mathbb Z}$}} \def\N{\mbox{${\mathbb N}$}} \def\P{\mbox{${\mathbb P}$}} \def\R{\mbox{${\mathbb R}$}} \newcommand{\Rp}{{\mathbb{R}}^+} \def\Bool{\mbox{${\mathbb B}$}} \def\sA{\mbox{${\cal A}$}} \def\sB{\mbox{${\cal B}$}} \def\sC{\mbox{${\cal C}$}} \def\sD{\mbox{${\cal D}$}} \def\sF{\mbox{${\cal F}$}} \def\sG{\mbox{${\cal G}$}} \def\sL{\mbox{${\cal L}$}} \def\sP{\mbox{${\cal P}$}} \def\sM{\mbox{${\cal M}$}} \def\sN{\mbox{${\cal N}$}} \def\sR{\mbox{${\cal R}$}} \def\sS{\mbox{${\cal S}$}} \def\sO{\mbox{${\cal O}$}} \def\sT{\mbox{${\cal T}$}} \def\sU{\mbox{${\cal U}$}} \def\th{\mbox{$\widetilde{h}$}} \def\tg{\mbox{$\widetilde{g}$}} \def\tP{\mbox{$\widetilde{P}$}} \def\norm{\mbox{$\parallel$}} \def\osum{${{\bigcirc}}\!\!\!\!{\rm s}~$} \def\pf{\noindent {\bf Proof}~~} \def\exec{\mathit{exec}} \def\Act{\mathit{A\!ct}} \def\Traces{\mathit{Traces}} \def\Spec{\mathit{Spec}} \def\uns{\mathit{unless}} \def\ens{\mathit{ensures}} \def\lto{\mathit{leads\!\!-\!\!to}} \def\a{\alpha} \def\b{\beta} \def\c{\gamma} \def\d{\delta} \def\sP{\mbox{${\cal P}$}} \def\sM{\mbox{${\cal M}$}} \def\sA{\mbox{${\cal A}$}} \def\sB{\mbox{${\cal B}$}} \def\sC{\mbox{${\cal C}$}} \def\sI{\mbox{${\cal I}$}} \def\sS{\mbox{${\cal S}$}} \def\sD{\mbox{${\cal D}$}} \def\sF{\mbox{${\cal F}$}} \def\sG{\mbox{${\cal G}$}} \def\sR{\mbox{${\cal R}$}} \def\tg{\mbox{$\widetilde{g}$}} \def\ta{\mbox{$\widetilde{a}$}} \def\tb{\mbox{$\widetilde{b}$}} \def\tc{\mbox{$\widetilde{c}$}} \def\tx{\mbox{$\widetilde{x}$}} \def\ty{\mbox{$\widetilde{y}$}} \def\tz{\mbox{$\widetilde{z}$}} \def\tI{\mbox{$\widetilde{I}$}} \def\norm{\mbox{$\parallel$}} \def\sL{\mbox{${\cal L}$}} \def\sM{\mbox{${\cal M}$}} \def\sN{\mbox{${\cal N}$}} \def\th{\mbox{$\widetilde{h}$}} \def\tg{\mbox{$\widetilde{g}$}} \def\tP{\mbox{$\widetilde{P}$}} \def\norm{\mbox{$\parallel$}} \def\to{\rightarrow} \def\ov{\overline} \def\gets{\leftarrow} \def\too{\longrightarrow} \def\To{\Rightarrow} \def\points{\mapsto} %\def\yields{\mapsto^{*}} \def\un{\underline} \def\vep{$\varepsilon$} \def\ep{$\epsilon$} \def\tri{$\bigtriangleup$} \def\Fi{$F^{\infty}$} \def\Di{\Delta^{\infty}} \def\ebox\Box \def\emp{\emptyset} \def\leadsto{\rightharpoondown^{*}} \newcommand{\benum}{\begin{enumerate}} \newcommand{\eenum}{\end{enumerate}} \newcommand{\bdes}{\begin{description}} \newcommand{\edes}{\end{description}} \newcommand{\bt}{\begin{theorem}} \newcommand{\et}{\end{theorem}} \newcommand{\bl}{\begin{lemma}} \newcommand{\el}{\end{lemma}} %\newcommand{\bp}{\begin{prop}} %\newcommand{\ep}{\end{prop}} \newcommand{\bd}{\begin{defn}} \newcommand{\ed}{\end{defn}} \newcommand{\brem}{\begin{remark}} \newcommand{\erem}{\end{remark}} \newcommand{\bxr}{\begin{exercise}} \newcommand{\exr}{\end{exercise}} \newcommand{\bxm}{\begin{example}} \newcommand{\exm}{\end{example}} \newcommand{\beqa}{\begin{eqnarray*}} \newcommand{\eeqa}{\end{eqnarray*}} \newcommand{\bc}{\begin{center}} \newcommand{\ec}{\end{center}} \newcommand{\bcent}{\begin{center}} \newcommand{\ecent}{\end{center}} \newcommand{\la}{\langle} \newcommand{\ra}{\rangle} \newcommand{\bcor}{\begin{corollary}} \newcommand{\ecor}{\end{corollary}} \newcommand{\bds}{\begin{defns}} \newcommand{\eds}{\end{defns}} \newcommand{\brems}{\begin{remarks}} \newcommand{\erems}{\end{remarks}} \newcommand{\bxrs}{\begin{exercises}} \newcommand{\exrs}{\end{exercises}} \newcommand{\bxms}{\begin{examples}} \newcommand{\exms}{\end{examples}} \newcommand{\bfig}{\begin{figure}} \newcommand{\efig}{\end{figure}} \newcommand{\set}[1]{\{#1\}} \newcommand{\pair}[1]{\langle #1\rangle} \newcommand{\tuple}[1]{\langle #1\rangle} \newcommand{\union}{\cup} \newcommand{\Union}{\bigcup} \newcommand{\intersection}{\cap} \newcommand{\Intersection}{\bigcap} \newcommand{\B}{\textbf{B}} %\newcommand{\be}[2]{\begin{equation} \label{#1} \tag{#2} \end{equation}} \newcommand{\abs}[1]{{\lvert}#1{\rvert}} \newcommand{\id}[1]{\mathit{#1}} \newcommand{\pfun}{\rightharpoonup} %\newcommand{\ra}[1]{\kern-1.5ex\xrightarrow{\ \ #1\ \ }\phantom{}\kern-1.5ex} %\newcommand{\ras}[1]{\kern-1.5ex\xrightarrow{\ \ \smash{#1}\ \ }\phantom{}\kern-1.5ex} \newcommand{\da}[1]{\bigg\downarrow\raise.5ex\rlap{\scriptstyle#1}} \newcommand{\ua}[1]{\bigg\uparrow\raise.5ex\rlap{\scriptstyle#1}} % \newcommand{\lift}[1]{#1_{\bot}} \newcommand{\signal}[1]{\tilde{#1}} \newcommand{\ida}{\stackrel{{\sf def}}{=}} \newcommand{\eqn}{\doteq} \newcommand{\deduce}[1]{\sststile{#1}{}} %% \theoremstyle{plain}%default %% \newtheorem{thm}{Theorem}[section] %% \newtheorem{lem}[thm]{Lemma} %% \newtheorem{cor}[thm]{Corollary} %% % %% \theoremstyle{definition} %% \newtheorem{defn}[thm]{Definition} %% % %% \theoremstyle{remark} %% \newtheorem{remark}[thm]{Remark} %% \newtheorem{exercise}[thm]{Exercise} %% See http://u.cs.biu.ac.il/~tsaban/Pdf/LaTeXCommonErrs.pdf %% \newtheorem{prop}[thm]{Proposition} \newcommand{\less}[1]{#1_{<}} \newcommand{\pfn}{\rightharpoonup} \newcommand{\ffn}{\stackrel{{\sf fin}}{\rightharpoonup}} \newcommand{\stkout}[1]{\ifmmode\text{\sout{\ensuremath{#1}}}\else\sout{#1}\fi} % \DeclareMathSymbol{\shortminus}{\mathbin}{AMSa}{"39} \newcommand{\mbf}[1]{\mathbf{#1}} \)
1 Introduction
1.1 How is a program represented?
- Welcome
- We continue with our study of the principles of programming languages.
- Problem addressed here
- We start with the following question: How are programs represented?
- Multiple answers
- This question has multiple answers. In this lesson, we will examine a couple of ways to answer this question.
- Start small
- During the course of this semester, we will examine many small programming languages. But it's best to start with something small, almost, but not quite trivial.
2 ADDITION Language
- Addition language
- We will consider the language we simply call ADDITION.
- Addition expressions
- Programs are arithmetic expressions involving natural numbers and a binary addition operator +.
- Ignore Algebra of +
- We will ignore the other algebraic properties (viz., commutativty and associativity) of + for the moment.
3 Program Representation
3.1 Concrete Syntax
- Concrete Syntax
- the structure of programs as they are directly seen or entered by the programmer. Usually the concrete syntax is a linear sequence of characters grouped into tokens.
- C or Python examples
- Here are possible representations of ADDITION expressions:
5 3+4 (3+4)+2
3.2 Strings vs Trees
- Programs as strings
- There is an obvious linear structure of a a program as a string, or sequence of characters.
- Programs as Trees
- There is a deeper, more abstract structure embedded within the string, which is a tree.
3.3 Abstract Syntax
- Abstract Syntax Trees
- Rather than consider concrete syntax, which usually involves writing down expressions linearly, we will assume that expressions are denoted abstractly as trees.
- ASTs
- We call these trees Abstract Syntax Trees, or ASTs for short.
3.4 Drawing ASTs
- Leaves are numbers nodes
- Notice that the leaves of these trees are numbers
- Internal nodes are +
- and each internal vertex is the operator +.
(3+4)+2: + / \ + 2 / \ 3 4 (3+1)+(2+3): * / \ / + + / \ / \ 2 3 3 1
3.5 Lisp concrete syntax
- Lisp concrete syntax
- In the Lisp Family of languages, it is easier to identify the tree stucture from the concrete syntax: The matching parentheses enclose non-leaf subtrees.
(+ 3 4) (+ (+ 3 4) 2) (+ (+ 3 1) (+ 2 3) (+ (+ 3 (+ 2 (+ 4 5))) (+ (+ 1 8) (+ 2 5)))
4 More complex languages
For the following code in Python,
while (i <= x): i=g(i); return i;
An abstract syntax would look something like this:
seq / \ while return / \ \ / \ i <= @ / \ / \ i x g i
Can you guess what the seq
and @
stands for?
4.1 Practical notational conventions
- Drawing trees
- Drawing trees is a slightly more elaborate process
- Notation
- We will continue to write programs in the concrete syntax as strings, with the awareness that these really denote trees.
- Example
- we write
3+2
- AST
- but switch to the AST representation whenever necessary.
+ / \ 3 2
5 Syntax and Inference Rules
We are now interested in specifying the structure of abstract syntax trees.
5.1 Abstract Grammar
The abstract grammar specifies trees, not strings.
<ast> ::= <number> | + <ast> <ast>
5.2 ADDITION syntax as Inference Rules
Alternatively, we can use inference rules to specify the syntax
\begin{align*} \genfrac{}{}{4}{0}{}{n\ Exp} & n\in \N & \hspace{6em}{NUM}\\ \\ \\ % \genfrac{}{}{4}{0}{{e_1\ Exp}\hspace{2em}{e_2\ Exp}}{e_1 + e_2\ Exp} & & \hspace{2em}{PLUS} \end{align*}6 AST's in Racket
6.1 Boilerplate
#lang racket (require eopl) (require rackunit) ;;; the only-in so that we import only what we need and we don't clash ;;; with bindings defined in the eopl module. (require (only-in racket/list first second third)) (provide (all-defined-out))
6.2 define-datatype
define-datatype
is used to define a variant (or sum)
datatype, i.e., a datatype that has multiple variants.
(define-datatype ast ast? [num (n number?)] ; base case [plus (left? ast?) (right? ast?)]) ; inductive case
6.3 Constructors and Type predicates auto-defined
These constructors get automatically defined
;;; num : number? -> ast? ;;; plus : [ast?, ast?] -> ast?
These type predicates get automatically defined
;;; ast? : any/c -> boolean?
6.4 Testing the AST functions
(check-true (ast? (num 5))) (check-true (ast? (plus (num 5) (num 3))))
6.5 cases
cases
can be used to identify the appropriate variant
(cases ast (plus (num 5) (num 4)) [plus (l r) "it's a plus"] [else "it's a number"])
7 Exercise: A simple unparser
We now consider two problems
- Choosing a concrete representation of addition expressions.
- Writing an unparser to convert AST's to expressions in the concrete syntax.
7.1 Choosing a concrete representation
We will experiment with two choices:
- A datum (s-exp) based prefix syntax like Racket's own.
- An infix syntax, à la Python or C.
7.2 A Datum based prefix syntax
Here is the grammar for the concrete syntax. The quote marks enclose literal symbols.
<exp> ::= <num> | '(+ ' <exp> <exp>')'
7.3 Unparser and Parser
- Unparser
- An unparser takes an ast and returns an element of concrete syntax.
- Parser
- A parser does the opposite: it takes a string, determines if the string is an instance of the concrete syntax. If it is, the parser converts the string to an ast. If not, it raises an error.
7.4 Unparsing to a datum syntax: Racket implementation
(define (unparse a) (cases ast a [num (n) n] [plus (left right) (list '+ (unparse left) (unparse right))]))
7.5 Testing unparse
(check-equal? 5 (unparse (num 5))) (check-equal? '(+ 2 3) (unparse (plus (num 2) (num 3))))
7.6 An unparser using quasiquote
Quasiquote (or backquote) allows controlled evaluation inside an expression. Evaluation inside the expression is triggered in the context of a quasiquote and an unquote (comma).
`<exp>
is short for (quasiquote <exp>)
. ,<exp>
is
short for (unquote <exp>)
.
(define (unparse/quasiquote a) (cases ast a [num (n) n] [plus (left right) `(+ ,(unparse left) ,(unparse right))]))
7.7 An unparser that generates infix syntax
(define (unparse-infix a) (cases ast a [num (n) (format "~a" n)] [plus (left right) (format "(~a + ~a)" (unparse-infix left) (unparse-infix right))]))
8 Exercise: A datum parser
8.1 A datum vs string parser
- Datum parser
- The concrete syntax is built using datum elements. It is much easier to build a datum parser because the conversion of strings of characters to datum elements, the heavy lifting, is already done by Racket.
- String parser
- The concrete syntax is a string. The parsing to ASTs involves more complex parsing methods, usually the stuff of a compilers course.
8.2 Does the concrete syntax determine the meaning of programs?
- It is profitable to think of programs as trees, i.e., as Abstract Syntax trees.
- The concrete syntax is relevant only insofar as how it appeals to a human reader of the program.
- Therefore, we will pick a Racket-like syntax for the languages we build in this course. One advantage is that it is already familiar to us by now. The other advantage is that parsers are very easy to build for such concrete syntax in Racket.
8.3 A datum parser
The datum parser for the ADDITION language is quite simple.
;;; parse :: any/c -> ast? + error (define parse (lambda (d) (cond [(number? d) (num d)] [(and (list? d) (= (length d) 3) (eq? (first d) '+)) (plus (parse (second d)) (parse (third d)))] [else (error 'parse "invalid syntax ~a" d)])))
8.4 Testing parse
(check-equal? (num 5) (parse 5)) (check-equal? (plus (num 4) (num 3)) (parse '(+ 4 3))) (check-exn exn:fail? (lambda () (parse '(+ "hello" 3))))
8.5 Datum parser using match
match
is a construct in Racket that allows pattern-matching.
A given expression may be matched with different patterns.
match
is a derived keyword, but the implementation of match
is not necessary to understand how to use it.
;;; parse/match :: any/c -> ast? + error ;;; (require racket/match) (define parse/match (lambda (d) (match d [(? number? n) (num n)] ; is d a number, say n? [(list '+ left right) ; is d a list consisting of '+, ; and two other parts, ; say, left and right? (plus (parse/match left) (parse/match right))] [else (error 'parse/match "invalid syntax ~a" d)])))
8.6 Testing parse/match
(check-equal? (num 5) (parse/match 5)) (check-equal? (plus (num 4) (num 3)) (parse/match '(+ 4 3))) (check-exn exn:fail? (lambda () (parse/match '(+ "hello" 3))))
9 Summary
9.1 Key concepts in this lesson
- Concrete Syntax
- Programs have concrete syntax.
- Abstract Syntax Trees
- The abstract syntax is the hidden hierarchical structure beneath the concrete syntax.
- Syntax via inference rules
- The syntax of a programming language may be specified via judgements and inference rules.
- Racket implementation of AST as a datatype
- An abstract syntax tree is realised a data type consisting of variants for each syntactic category.
- Unparser
- Converts an AST of a program to its concrete syntax.
- Parser
- Converts a concrete syntax of a program into its AST.
9.2 Key Racket concepts in this lesson
define-datatype
- A way of defining variant abstract data types (inherited from EOPL).
quasiquote
andunquote
- Ways of controlling evaluation within an expression.
match
- A pattern matcher for datums.