Abstract Syntax Trees and (Un)parsing

1. Introduction
- 1.1. How is a program represented?
2. ADDITION Language
3. Program Representation
4. More complex languages
- 4.1. Practical notational conventions
5. Syntax and Inference Rules
- 5.1. Abstract Grammar
- 5.2. ADDITION syntax as Inference Rules
6. AST's in Racket
7. Exercise: A simple unparser
8. Exercise: A datum parser
9. Summary
- 9.1. Key concepts in this lesson
- 9.2. Key Racket concepts in this lesson

\( %% Your math definitions here % \newcommand{\alphaequiv}{{\underset{\raise 0.7em\alpha}{=}}} \newcommand{\yields}{\Rightarrow} \newcommand{\derives}{\overset{*}{\yields}} \newcommand{\alphaequiv}{=_{\alpha}} \newcommand{\tto}[2]{{\overset{#1}{\underset{#2}{\longrightarrow}}}} \newcommand{\transitsto}[2]{{\overset{#1}{\underset{#2}{\longrightarrow}}}} \newcommand{\xtransitsto}[2]{{\underset{#2}{\xrightarrow{#1}}}} \newcommand{\xtransitsfrom}[2]{{\underset{#2}{\xleftarrow{#1}}}} \newcommand{\xto}[2]{{\xtransitsto{#1}{#2}}} \newcommand{\xfrom}[2]{{\xtransitsfrom{#1}{#2}}} \newcommand{\xreaches}[2]{{\underset{#2}{\xtwoheadrightarrow{#1}}}} \newcommand{\reaches}[2]{{\underset{#2}{\xtwoheadrightarrow{#1}}}} %\newcommand{\reaches}[2]{{\overset{#1}{\underset{#2}{\twoheadrightarrow}}}} %\newcommand{\goesto}[2]{\transitsto{#1}{#2}} %\newcommand{\betareducesto}{{\underset{\beta}{\rightarrow}}} \newcommand{\betareducesto}{\rightarrow_{\beta}} %\newcommand{\etareducesto}{{\underset{\eta}{\rightarrow}}} \newcommand{\etareducesto}{\rightarrow_{\eta}} %\newcommand{\betaetareducesto}{{\underset{\beta\ \eta}{\rightarrow}}} \newcommand{\betaetareducesto}{\rightarrow_{\beta\eta}} \newcommand{\preducesto}{\rhd} \newcommand{\psimplifiesto}{\stackrel{\scriptstyle{*}}{\rhd}} \newcommand{\lreducesto}{\rightsquigarrow} \newcommand{\lsimplifiesto}{\stackrel{\scriptstyle{*}}{\lreducesto}} \newcommand{\rewritesto}{\hookrightarrow} \newcommand{\goesto}[1]{\stackrel{#1}{\rightarrow}} \newcommand{\xgoesto}[1]{\xrightarrow{#1}} \newcommand{\reducesto}{\stackrel{}{\rightarrow}} \newcommand{\simplifiesto}{\stackrel{\scriptstyle{*}}{\rightarrow}} \newcommand{\connected}[1]{\stackrel{#1}{\leftrightarrow}} \newcommand{\joins}{\downarrow} \newcommand{\evaluatesto}{\Longrightarrow} %\newcommand{\lit}[1]{\hbox{\sf{#1}}} \newcommand{\lit}[1]{{\sf{#1}}} \newcommand{\true}{\lit{true}} \newcommand{\false}{\lit{false}} \def\Z{\mbox{${\mathbb Z}$}} \def\N{\mbox{${\mathbb N}$}} \def\P{\mbox{${\mathbb P}$}} \def\R{\mbox{${\mathbb R}$}} \newcommand{\Rp}{{\mathbb{R}}^+} \def\Bool{\mbox{${\mathbb B}$}} \def\sA{\mbox{${\cal A}$}} \def\sB{\mbox{${\cal B}$}} \def\sC{\mbox{${\cal C}$}} \def\sD{\mbox{${\cal D}$}} \def\sF{\mbox{${\cal F}$}} \def\sG{\mbox{${\cal G}$}} \def\sL{\mbox{${\cal L}$}} \def\sP{\mbox{${\cal P}$}} \def\sM{\mbox{${\cal M}$}} \def\sN{\mbox{${\cal N}$}} \def\sR{\mbox{${\cal R}$}} \def\sS{\mbox{${\cal S}$}} \def\sO{\mbox{${\cal O}$}} \def\sT{\mbox{${\cal T}$}} \def\sU{\mbox{${\cal U}$}} \def\th{\mbox{$\widetilde{h}$}} \def\tg{\mbox{$\widetilde{g}$}} \def\tP{\mbox{$\widetilde{P}$}} \def\norm{\mbox{$\parallel$}} \def\osum{${{\bigcirc}}\!\!\!\!{\rm s}~$} \def\pf{\noindent {\bf Proof}~~} \def\exec{\mathit{exec}} \def\Act{\mathit{A\!ct}} \def\Traces{\mathit{Traces}} \def\Spec{\mathit{Spec}} \def\uns{\mathit{unless}} \def\ens{\mathit{ensures}} \def\lto{\mathit{leads\!\!-\!\!to}} \def\a{\alpha} \def\b{\beta} \def\c{\gamma} \def\d{\delta} \def\sP{\mbox{${\cal P}$}} \def\sM{\mbox{${\cal M}$}} \def\sA{\mbox{${\cal A}$}} \def\sB{\mbox{${\cal B}$}} \def\sC{\mbox{${\cal C}$}} \def\sI{\mbox{${\cal I}$}} \def\sS{\mbox{${\cal S}$}} \def\sD{\mbox{${\cal D}$}} \def\sF{\mbox{${\cal F}$}} \def\sG{\mbox{${\cal G}$}} \def\sR{\mbox{${\cal R}$}} \def\tg{\mbox{$\widetilde{g}$}} \def\ta{\mbox{$\widetilde{a}$}} \def\tb{\mbox{$\widetilde{b}$}} \def\tc{\mbox{$\widetilde{c}$}} \def\tx{\mbox{$\widetilde{x}$}} \def\ty{\mbox{$\widetilde{y}$}} \def\tz{\mbox{$\widetilde{z}$}} \def\tI{\mbox{$\widetilde{I}$}} \def\norm{\mbox{$\parallel$}} \def\sL{\mbox{${\cal L}$}} \def\sM{\mbox{${\cal M}$}} \def\sN{\mbox{${\cal N}$}} \def\th{\mbox{$\widetilde{h}$}} \def\tg{\mbox{$\widetilde{g}$}} \def\tP{\mbox{$\widetilde{P}$}} \def\norm{\mbox{$\parallel$}} \def\to{\rightarrow} \def\ov{\overline} \def\gets{\leftarrow} \def\too{\longrightarrow} \def\To{\Rightarrow} \def\points{\mapsto} %\def\yields{\mapsto^{*}} \def\un{\underline} \def\vep{$\varepsilon$} \def\ep{$\epsilon$} \def\tri{$\bigtriangleup$} \def\Fi{$F^{\infty}$} \def\Di{\Delta^{\infty}} \def\ebox\Box \def\emp{\emptyset} \def\leadsto{\rightharpoondown^{*}} \newcommand{\benum}{\begin{enumerate}} \newcommand{\eenum}{\end{enumerate}} \newcommand{\bdes}{\begin{description}} \newcommand{\edes}{\end{description}} \newcommand{\bt}{\begin{theorem}} \newcommand{\et}{\end{theorem}} \newcommand{\bl}{\begin{lemma}} \newcommand{\el}{\end{lemma}} %\newcommand{\bp}{\begin{prop}} %\newcommand{\ep}{\end{prop}} \newcommand{\bd}{\begin{defn}} \newcommand{\ed}{\end{defn}} \newcommand{\brem}{\begin{remark}} \newcommand{\erem}{\end{remark}} \newcommand{\bxr}{\begin{exercise}} \newcommand{\exr}{\end{exercise}} \newcommand{\bxm}{\begin{example}} \newcommand{\exm}{\end{example}} \newcommand{\beqa}{\begin{eqnarray*}} \newcommand{\eeqa}{\end{eqnarray*}} \newcommand{\bc}{\begin{center}} \newcommand{\ec}{\end{center}} \newcommand{\bcent}{\begin{center}} \newcommand{\ecent}{\end{center}} \newcommand{\la}{\langle} \newcommand{\ra}{\rangle} \newcommand{\bcor}{\begin{corollary}} \newcommand{\ecor}{\end{corollary}} \newcommand{\bds}{\begin{defns}} \newcommand{\eds}{\end{defns}} \newcommand{\brems}{\begin{remarks}} \newcommand{\erems}{\end{remarks}} \newcommand{\bxrs}{\begin{exercises}} \newcommand{\exrs}{\end{exercises}} \newcommand{\bxms}{\begin{examples}} \newcommand{\exms}{\end{examples}} \newcommand{\bfig}{\begin{figure}} \newcommand{\efig}{\end{figure}} \newcommand{\set}[1]{\{#1\}} \newcommand{\pair}[1]{\langle #1\rangle} \newcommand{\tuple}[1]{\langle #1\rangle} \newcommand{\union}{\cup} \newcommand{\Union}{\bigcup} \newcommand{\intersection}{\cap} \newcommand{\Intersection}{\bigcap} \newcommand{\B}{\textbf{B}} %\newcommand{\be}[2]{\begin{equation} \label{#1} \tag{#2} \end{equation}} \newcommand{\abs}[1]{{\lvert}#1{\rvert}} \newcommand{\id}[1]{\mathit{#1}} \newcommand{\pfun}{\rightharpoonup} %\newcommand{\ra}[1]{\kern-1.5ex\xrightarrow{\ \ #1\ \ }\phantom{}\kern-1.5ex} %\newcommand{\ras}[1]{\kern-1.5ex\xrightarrow{\ \ \smash{#1}\ \ }\phantom{}\kern-1.5ex} \newcommand{\da}[1]{\bigg\downarrow\raise.5ex\rlap{\scriptstyle#1}} \newcommand{\ua}[1]{\bigg\uparrow\raise.5ex\rlap{\scriptstyle#1}} % \newcommand{\lift}[1]{#1_{\bot}} \newcommand{\signal}[1]{\tilde{#1}} \newcommand{\ida}{\stackrel{{\sf def}}{=}} \newcommand{\eqn}{\doteq} \newcommand{\deduce}[1]{\sststile{#1}{}} %% \theoremstyle{plain}%default %% \newtheorem{thm}{Theorem}[section] %% \newtheorem{lem}[thm]{Lemma} %% \newtheorem{cor}[thm]{Corollary} %% % %% \theoremstyle{definition} %% \newtheorem{defn}[thm]{Definition} %% % %% \theoremstyle{remark} %% \newtheorem{remark}[thm]{Remark} %% \newtheorem{exercise}[thm]{Exercise} %% See http://u.cs.biu.ac.il/~tsaban/Pdf/LaTeXCommonErrs.pdf %% \newtheorem{prop}[thm]{Proposition} \newcommand{\less}[1]{#1_{<}} \newcommand{\pfn}{\rightharpoonup} \newcommand{\ffn}{\stackrel{{\sf fin}}{\rightharpoonup}} \newcommand{\stkout}[1]{\ifmmode\text{\sout{\ensuremath{#1}}}\else\sout{#1}\fi} % \DeclareMathSymbol{\shortminus}{\mathbin}{AMSa}{"39} \newcommand{\mbf}[1]{\mathbf{#1}} \)

1 Introduction

1.1 How is a program represented?

Welcome: We continue with our study of the principles of programming languages.

Problem addressed here: We start with the following question: How are programs represented?

Multiple answers: This question has multiple answers. In this lesson, we will examine a couple of ways to answer this question.

Start small: During the course of this semester, we will examine many small programming languages. But it's best to start with something small, almost, but not quite trivial.

2 ADDITION Language

Addition language: We will consider the language we simply call ADDITION.

Addition expressions: Programs are arithmetic expressions involving natural numbers and a binary addition operator +.

Ignore Algebra of +: We will ignore the other algebraic properties (viz., commutativty and associativity) of + for the moment.

3 Program Representation

3.1 Concrete Syntax

Concrete Syntax: the structure of programs as they are directly seen or entered by the programmer. Usually the concrete syntax is a linear sequence of characters grouped into tokens.

C or Python examples: Here are possible representations of ADDITION expressions:

5
3+4
(3+4)+2

3.2 Strings vs Trees

Programs as strings: There is an obvious linear structure of a a program as a string, or sequence of characters.

Programs as Trees: There is a deeper, more abstract structure embedded within the string, which is a tree.

3.3 Abstract Syntax

Abstract Syntax Trees: Rather than consider concrete syntax, which usually involves writing down expressions linearly, we will assume that expressions are denoted abstractly as trees.

ASTs: We call these trees Abstract Syntax Trees, or ASTs for short.

3.4 Drawing ASTs

Leaves are numbers nodes: Notice that the leaves of these trees are numbers

Internal nodes are +: and each internal vertex is the operator +.

(3+4)+2:
       +
      / \
     +  2
    / \
   3  4



(3+1)+(2+3):
        *
       / \
      /   +
     +   / \
    / \  2  3 
   3  1

3.5 Lisp concrete syntax

Lisp concrete syntax: In the Lisp Family of languages, it is easier to identify the tree stucture from the concrete syntax: The matching parentheses enclose non-leaf subtrees.

(+ 3 4)

(+ (+ 3 4)
   2)

(+ (+ 3 1)
   (+ 2 3)

(+ (+ 3 
      (+ 2
         (+ 4 5)))
   (+ (+ 1 8)
      (+ 2 5)))

4 More complex languages

For the following code in Python,

     while (i <= x):
         i=g(i);
     return i;

An abstract syntax would look something like this:

        seq
       /   \
      while return
      / \     \
     /   \     i
    <=    @
   / \   / \
  i  x   g  i

Can you guess what the seq and @ stands for?

4.1 Practical notational conventions

Drawing trees: Drawing trees is a slightly more elaborate process

Notation: We will continue to write programs in the concrete syntax as strings, with the awareness that these really denote trees.

Example: we write

3+2

AST: but switch to the AST representation whenever necessary.

   +
  / \
 3   2

5 Syntax and Inference Rules

We are now interested in specifying the structure of abstract syntax trees.

5.1 Abstract Grammar

The abstract grammar specifies trees, not strings.

<ast> ::=  <number> | + <ast> <ast>

5.2 ADDITION syntax as Inference Rules

Alternatively, we can use inference rules to specify the syntax

\begin{align*} \genfrac{}{}{4}{0}{}{n\ Exp} & n\in \N & \hspace{6em}{NUM}\\ \\ \\ % \genfrac{}{}{4}{0}{{e_1\ Exp}\hspace{2em}{e_2\ Exp}}{e_1 + e_2\ Exp} & & \hspace{2em}{PLUS} \end{align*}

6 AST's in Racket

6.1 Boilerplate

#lang racket
(require eopl)
(require rackunit)
;;; the only-in so that we import only what we need and we don't clash
;;; with bindings defined in the eopl module.
(require (only-in racket/list
                  first
                  second
                  third))
(provide (all-defined-out))

6.2 `define-datatype`

define-datatype is used to define a variant (or sum) datatype, i.e., a datatype that has multiple variants.

(define-datatype ast ast?
  [num (n number?)]  ; base case
  [plus (left? ast?) (right? ast?)]) ; inductive case

6.3 Constructors and Type predicates auto-defined

These constructors get automatically defined

;;; num : number? -> ast?
;;; plus : [ast?, ast?] -> ast?

These type predicates get automatically defined

;;; ast?    : any/c -> boolean?

6.4 Testing the AST functions

(check-true (ast? (num 5)))
(check-true (ast? (plus (num 5) (num 3))))

6.5 `cases`

cases can be used to identify the appropriate variant

(cases ast (plus (num 5) (num 4))
  [plus (l r) "it's a plus"]
  [else "it's a number"])

7 Exercise: A simple unparser

We now consider two problems

Choosing a concrete representation of addition expressions.
Writing an unparser to convert AST's to expressions in the concrete syntax.

7.1 Choosing a concrete representation

We will experiment with two choices:

A datum (s-exp) based prefix syntax like Racket's own.
An infix syntax, � la Python or C.

7.2 A Datum based prefix syntax

Here is the grammar for the concrete syntax. The quote marks enclose literal symbols.

<exp> ::= <num> | '(+ ' <exp> <exp>')'

7.3 Unparser and Parser

Unparser: An unparser takes an ast and returns an element of concrete syntax.

Parser: A parser does the opposite: it takes a string, determines if the string is an instance of the concrete syntax. If it is, the parser converts the string to an ast. If not, it raises an error.

7.4 Unparsing to a datum syntax: Racket implementation

(define (unparse a)
  (cases ast a
    [num (n) n]
    [plus (left right)
          (list '+ 
                (unparse left)
                (unparse right))]))

7.5 Testing `unparse`

(check-equal? 5 (unparse (num 5)))
(check-equal? '(+ 2 3) (unparse (plus (num 2) (num 3))))

7.6 An unparser using quasiquote

Quasiquote (or backquote) allows controlled evaluation inside an expression. Evaluation inside the expression is triggered in the context of a quasiquote and an unquote (comma).

`<exp> is short for (quasiquote <exp>). ,<exp> is short for (unquote <exp>).

(define (unparse/quasiquote a)
  (cases ast a
    [num (n) n]
    [plus (left right)
          `(+ 
            ,(unparse left)
            ,(unparse right))]))

7.7 An unparser that generates infix syntax

(define (unparse-infix a)
  (cases ast a
    [num (n) (format "~a" n)]
    [plus (left right)
          (format "(~a + ~a)" 
                  (unparse-infix left)
                  (unparse-infix right))]))

8 Exercise: A datum parser

8.1 A datum vs string parser

Datum parser: The concrete syntax is built using datum elements. It is much easier to build a datum parser because the conversion of strings of characters to datum elements, the heavy lifting, is already done by Racket.
String parser: The concrete syntax is a string. The parsing to ASTs involves more complex parsing methods, usually the stuff of a compilers course.

8.2 Does the concrete syntax determine the meaning of programs?

It is profitable to think of programs as trees, i.e., as Abstract Syntax trees.
The concrete syntax is relevant only insofar as how it appeals to a human reader of the program.
Therefore, we will pick a Racket-like syntax for the languages we build in this course. One advantage is that it is already familiar to us by now. The other advantage is that parsers are very easy to build for such concrete syntax in Racket.

8.3 A datum parser

The datum parser for the ADDITION language is quite simple.

;;; parse :: any/c -> ast? + error
(define parse
  (lambda (d)
    (cond [(number? d) (num d)]
          [(and (list? d) (= (length d) 3)
                (eq? (first d) '+))
           (plus (parse (second d)) 
                 (parse (third d)))]
          [else (error 'parse 
                       "invalid syntax ~a" d)])))

8.4 Testing `parse`

(check-equal? (num 5) (parse 5))
(check-equal? (plus (num 4) (num 3)) (parse '(+  4 3)))
(check-exn exn:fail? (lambda () (parse '(+ "hello" 3))))

8.5 Datum parser using match

match is a construct in Racket that allows pattern-matching. A given expression may be matched with different patterns. match is a derived keyword, but the implementation of match is not necessary to understand how to use it.

;;; parse/match :: any/c -> ast? + error
;;;
(require racket/match)
(define parse/match
  (lambda (d)
    (match d
      [(? number? n) (num n)]         ; is d a number, say n?
      [(list '+ left right)           ; is d a list consisting of '+, 
                                      ; and two other parts, 
                                      ; say, left and right?
       (plus (parse/match left) 
             (parse/match right))]
      [else (error 'parse/match
                   "invalid syntax ~a" d)])))

8.6 Testing `parse/match`

(check-equal? (num 5) (parse/match 5))
(check-equal? (plus (num 4) (num 3)) (parse/match '(+  4 3)))
(check-exn exn:fail? (lambda () (parse/match '(+ "hello" 3))))

9 Summary

9.1 Key concepts in this lesson

Concrete Syntax: Programs have concrete syntax.

Abstract Syntax Trees: The abstract syntax is the hidden hierarchical structure beneath the concrete syntax.

Syntax via inference rules: The syntax of a programming language may be specified via judgements and inference rules.

Racket implementation of AST as a datatype: An abstract syntax tree is realised a data type consisting of variants for each syntactic category.

Unparser: Converts an AST of a program to its concrete syntax.

Parser: Converts a concrete syntax of a program into its AST.

9.2 Key Racket concepts in this lesson

define-datatype: A way of defining variant abstract data types (inherited from EOPL).

quasiquote and unquote: Ways of controlling evaluation within an expression.

match: A pattern matcher for datums.