Let’s start our investigation with an interpreter for a simple language of calculations involving +
and -
over numbers. We’ll use this first discussion to define an interpreter that calculates a value from such expressions. The language is basic, but we can start discussing important concepts of abstract and concrete syntax, abstract syntax trees, parsing, interpretation and testing. Pay less attention to the language than to the underlying methodology and concepts used to define an interpreter.
Concrete Syntax is the textual language written by programmers as input to compilers and interpreters. When you think of a programming language, you think first of its concrete syntax. The block structure of C, s-expressions defining Lisp, and the functional syntax of Haskell are examples of concrete syntax that immediately associates with a particular language.
Concrete syntax is always described by a grammar consisting of an alphabet and a collection of grammar rules. As discussed in the introduction, the alphabet defines basic symbols or tokens in the concrete syntax while grammar rules define how those tokens are sequenced defining elements of the language. A term is any text string defined by the grammar. Similarly, a language is the smallest set of terms that satisfy a set of grammar rules. It is useful to think of a language as a set and a grammar as a specifier for that set. Using traditional set comprehension we can define a language, $L$, as:
\[L = \{s:string\; \mid\; G(s)\}\]where $G$ is a predicate that is true when $s$ satisfies $L$’s grammar rules.
Let’s define the concrete syntax for our first language that we will call AE
for Arithmetic Expressions. Terms in AE
are defined by the following grammar:
Hopefully this is familiar. Terms, $t$, in AE
are either numbers or sums and differences nested arbitrarily.
The language AE
is the set of all strings that can be created with the grammar. The strings:
3
1+5
3+5-1
are all terms in AE
. While the strings:
A
1*5
A+B-C
are not terms in AE
.
Let’s also define values in AE
as numbers:
The set of values is a subset of the set of terms and represents terms that result from interpretation. The set of values define what we should get when we run an interpreter. We will define this more formally later, but for now remember that $v$ is an interpretation result.
If we define the predicate $ae(t)$ to be true when $t$ satisfies AE
’s grammar rules, then the language can be defined as a set:
In AE
we can express three things: (i) a number; (ii) adding two terms; and (iii) subtracting two terms. Not the most useful of languages, but we have to start somewhere.
This discussion is hopefully review of your formal language theory or theory of computing studies. If it isn’t, I suggest stopping now and studying some formal language theory. In addition to being foundational, it is beautiful and useful.
Knowing what AE
looks like, let’s now define how terms in AE
are interpreted. Before writing a Haskell interpreter, we should define formally the meaning of terms. We will use inference rules for this purpose. $\eval$ is the evaluation relation and $x \eval y$ is read x evaluates to y. As the name implies the relation states that evaluating $x$ results in $y$.
Our first inference rule tells us how to interpret numbers buy defining $\eval$ for $v$:
\[\frac{}{\underline{v} \eval v}\; [NumE]\]The $NumE$ rule says evaluating numeral syntax in gives us the corresponding number. Remember that $v$ is a number and what we’re saying is interpreting a constant number value gives back the constant number value. Note that when an ambiguity exists between syntax and terms used to define meaning, the underscore notation indicates syntax. For example, $\underline{v}$ is number syntax while $v$ is number value.
Addition and subtraction are more interesting and hint at how all our interpreters will be structured. The rule, $PlusE$ defines the interpretation of terms of the form $t_1+t_2$:
\[\frac{t_1 \eval v_1,\; t_2 \eval v_2}{t_1 \underline{+} t_2 \eval v_1+v_2}\; [PlusE]\]$PlusE$’s antecedents and consequent work together to define a recursive interpreter. The first antecedent states that $t_1 \eval v_1$ must be true before the consequent can be true. But $v_1$ is a variable whose value is the result of evaluating $t_1$. In effect, this antecedent says that $v_1$ must be the result of that evaluation. The second antecedent behaves similarly for $t_2$ and $v_2$. Both antecedents name the results of interpreting $t_1$ and $t_2$ as $v_1$ and $v_2$ respectively.
Now that we know the results of evaluating $t_1$ and $t_2$, defining their sum is simple. Because values in AE
are numbers we use Haskell’s notion of addition to define the sum. Thus, the consequent
is $t_1 \underline{+} t_2 \eval v_1 + v_2$.
We define subtraction similarly in the $MinusE$ rule:
\[\frac{t_1 \eval v_1,\; t_2 \eval v_2}{t_1 \underline{-} t_2 \eval v_1-v_2}\; [MinusE]\]Using this rule $2\underline{-}1$ evaluates to $1$ as it should. However, $1\underline{-}2$ evaluates to $-1$. As $\NUM$ values are defined to be non-negative, this is a problem. It’s easy enough to fix the $MinusE$ rule to only generate positive values:
\[\frac{t_1 \eval v_1,\; t_2 \eval v_2,\; v_1\geq v_2}{t_1 \underline{-} t_2 \eval v_1-v_2}\; [MinusE+]\]The new rule $MinusE+$ adds a third antecedent that specifies $v_1$ must be greater than or equal to $v_2$ The $MinusE+$ applies when: $v_1\geq v_2$ thus $v_1 - v_2$ will always be non-negative as required.
What does happen when $v_2 > v_1$? The new $MinusE+$ does not apply nor does any other evaluation rule. Subtraction when $v_2 > v_1$ is undefined. There is no rule that applies and thus no evaluation results. When evaluation is not complete in this way and the term being evaluated is not a value, we say that interpretation is stuck and the evaluation relation defined by $\eval$ is not total.
One way to fix this problem is to return $0$ when the result would be negative. This fix is captured with the rule $MinusEZero$:
\[\frac{t_1 \eval v_1,\; t_2 \eval v_2,\; v_1 < v_2}{t_1 \underline{-} t_2 \eval 0}\; [MinusEZero]\]Evaluation is now total, but subtraction in $AE$ has properties that might prove problematic. $0$ could be a legitimate subtraction result or it could indicate a problem. We call these designated values “magic values” and they prove problematic. An alternative is returning an error value:
\[\frac{t_1 \eval v_1,\; t_2 \eval v_2,\; v_1 < v_2}{t_1 \underline{-} t_2 \eval \bot}\; [MinusEBottom]\]This introduces a new value, $\bot$, called bottom that represents an error value. As a result, $\bot$ must be included in all other inference rules. What is $5+\bot$? Simply defining the result of any expression involving $\bot$ as $\bot$ requires new rules for each operator.
Often the best approach is to leave the relation as is and wait until implementation time to deal with errors. For now, this is the approach we will take.
Understanding the structure of these rules before moving forward is vital. They both define antecedents that name the results of other calculations. More specifically, other recursive calculations. When writing and defining interpreters, recursion is your best friend. We needn’t think now about calculating the values of $t_1$ and $t_2$, only that their values are calculated the same way all other values are calculated.
Now we have both a concrete syntax and an evaluation semantics for AE
defined using mathematical techniques. We need to transform both definitions into Haskell structures to build our first interpreter for AE
. To do this, we will first define an abstract syntax for AE
.
Concrete syntax is nice on the user side, but painful to work with directly when writing interpreters and compilers. If you want to try an experiment, write a Haskell program that will interpret the AE
language directly. It can be done, but making changes or extending the language is laborious and potentially painful.
Abstract Syntax is a data structure representing parsed terms. This data structure is far more useful than concrete syntax when writing interpreters and compilers. Programs are data and abstract syntax is the data structure that represents them. An abstract syntax for AE
written using a Haskell data type is:
data AE where
Num :: Int -> AE
Plus :: AE -> AE -> AE
Minus :: AE -> AE -> AE
deriving (Show,Eq)
where Num
, Plus
and Minus
are the constructors of the data type AE
that correspond with numbers, sums and differences in the AE
concrete syntax. We use the term constructor because all values of type AE
are constructed with one of these operations. By definition, the AE
type contains all constructions using Plus
, Minus
and Num
, and no more.
All terms in AE have associated data structures in the abstract syntax. For example (Num 1)
is the abstract syntax for 1
. (Plus (Num 1) (Num 3))
is the abstract syntax for 1+3
. (Minus (Plus (Num 3 (Num 5)) (Num 1))
is the abstract syntax for 3+5-1
. For the abstract syntax to be effective, every term in the concrete syntax must have an associated term in the abstract syntax. Remember the properties of relations you learned in your discrete math class? They come in handy right now. The relationship between concrete syntax and associated abstract syntax should be a total function. Concrete syntax terms should have exactly one abstract syntax value and all concrete syntax terms should be associated with an abstract syntax value. Always remember that errors are outputs.
From this point forward I will use the TLA1 AST when referring to abstract syntax data structures. AST stands for abstract syntax tree. Haskell data types naturally form trees and trees are perfect representations for abstract syntax. I’ll come back to this later, but for now remember that AST, abstract syntax, and abstract syntax tree refer to the same Haskell data type.
A parser is a program that translates concrete syntax into an AST. It checks the syntax of its input, generates error messages if the syntax is bad, and generates an AST if the syntax is good. The signature of a parser for the AE language is:
parseAE :: String -> AE
Give parseAE
a string and it will return an AE
if it does not throw an error. A parser’s signature for any language will be a string to the datatype for that language’s abstract syntax.
A pretty printer is the inverse of a parser. It translates an AE
data structure into a string representing the concrete syntax. The signature of a pretty printer for the AE language is:
pprintAE :: AE -> String
Give pprintAE
an AE
data structure and it will return a string. The pretty printer’s signature for any language will be the datatype for that language’s abstract syntax to a string.
If parseAE
and pprintAE
are correct, then the following equivalence should hold for any string representing legal concrete syntax:
pprintAE (parseAE s) == s
If we parse a string and pretty print the result, we back the string. A useful relationship for testing these tools.
This course is not about building parsers and pretty printers. This task is largely automated, Haskell provides tools for building parsers, and parser generators are widely available for other languages. I will provide examples using Parsec, a monadic parser combinator tool for writing and composing parsers. After some setup, Parsec is not difficult to extend to the languages we would like to discuss. For more details there are a number of online tutorials and chapters in major Haskell texts. For now, let’s assume we have the parseAE
and pprintAE
functions available.
An evaluator converts an abstract syntax term into a value. As noted earlier, values represent valid interpretation results. If an evaluator produces anything besides a value, something went wrong. Either the input is invalid, the evaluator is written wrong, or the language definition is problematic. We’ll talk more about these issues later. For now, let’s look at an evaluator where everything works as it should.
How should the evaluator be constructed? The data type defined for the abstract syntax gives us a big clue. If the constructors from the data type define every possible AST element, then defining an evaluator for each element of the data type should do the trick. We need to define how the interpreter behaves on each constructor from the AE
datatype.
First, let’s get the eval
signature down. Our evaluator will take an element of AE
and produce a value, where a value is defined as a number. In the AST, numbers are of the form (Num n)
where n
is a Haskell Int
. Unfortunately, we can’t capture value-ness in our signature, so we’ll say that eval
returns Maybe AE
. We’ll use Maybe
to allow eval
to return a value or an error. The signature for eval
is:
eval :: AE -> Maybe AE
We now define eval
’s cases, one for each AE
constructor starting with the Num
constructor. (Num 3)
, for example, represents a constant 3
in the AE
abstract syntax. Without much thought it should be clear that as a constant, (Num 3)
evaluates to (Num 3)
. Numbers are values in AE
, thus they should not be evaluated further making this consistent with our formal definition. Thankfully, this is exactly what our inference rule for evaluating numbers says if we remember that $v$ represents $\NUM$:
The value associated with an AE
number is the Haskell equivalent. Thus, the eval
case for (Num n)
just returns its argument:
(Num n) = return (Num n)
Because return = Just
, eval (Num 3)
returns (Just (Num 3))
We now have an evaluator for literal numbers, but nothing more.
The next constructor, Plus
, represents a more interesting case. We have a rule named $PlusE$ that defines evaluation of t1+t2
:
The inference rule is defined in terms of concrete rather than abstract syntax, so we have to do a bit of translation work. $t_1\underline{+}t_2$ translates quickly into (Plus t1 t2)
. If t1
evaluates to (Num v1)
and t2
evaluates to (Num v2)
then (Plus t1 t2)
evaluates to (Num v1)+(Num v2)
. There is no +
operation in Haskell for (Num n)
constructions, this we need to write one to define addition in AE
. We can define addition in AE
in terms of addition in Haskell. Specifically:
(Num n) + (Num m) == (Num n+m)
There are many ways to construct this operation in Haskell. Let’s use a lifting function that will allow us to define any operation in AE
using an operation from Haskell:
liftNum :: (Int -> Int -> Int) -> AE -> AE -> AE
liftNum f (Num l) (Num r) = (Num (f l r))
liftNum
takes a binary function over integers and two Num
constructions. It lifts the binary operation into Num
by extracting the Int
values, applying the operation, and putting the result back in Num
. While we could not use +
on Num
constructions directly, we can use liftNum
to define addition in AE
:
(Num n) + (Num m) == (liftNum (+) n m)
We can define other AE
operations similarly.
Now we have everything needed to define Plus
. This inference rule can be almost directly translated into Haskell using do
to evaluate antecedents, calculate the and return the value defined by the consequent:
(Plus t1 t2) = do v1 <- eval t1
v2 <- eval t2
return (liftNum (+) v1 v2)
While the lifting function is pretty cool, there is a potentially better way. We can use Haskell’s pattern matching capability to extract good values and filter out bad values. Consider this use of the do notation:
(Plus t1 t2) = do (Num v1) <- eval t1
(Num v2) <- eval t2
return (Num v1+v2)
It turns out that the <-
notation is not a simple assignment operator. eval t1
and eval t2
behave as they always have returning values of type AE
. However, the notation (Num v1) <- eval t1
causes the result from eval t1
to be matched with (Num v1)
. v1
is a variable, but Num
is a constructor. If eval t1
returns a value (Num 3)
pattern matching succeeds and v1
is bound to 3
. If a match fails, the result is an error and the monadic value Nothing
gets returned.
The translation from inference rule to Haskell follows standard rule of thumb. The do
bindings manage the rule antecedents, performing evaluation and binding variables. The return
is the consequent and evaluates the term. While only a rule-of-thumb, it will prove useful.
Finally, The Minus
constructor case is identical to the Plus
constructor case except values are subtracted rather than added together. For completeness, here is the subtraction case:
(Minus t1 t2) = do (Num v1) <- eval t1
(Num v2) <- eval t2
return (Num v1-v2)
Putting the cases together in the following evaluator for AE
that reduces every abstract syntax term to a value:
eval :: AE -> Maybe AE
eval (Num x) = return (Num x)
eval (Plus t1 t2) = do (Num v1) <- (eval t1)
(Num v2) <- (eval t2)
return (Num v1+v2)
eval (Minus t1 t2) = do (Num v1) <- (eval t1)
(Num v2) <- (eval t2)
return (Num v1-v2)
The evaluator follows a pattern that every evaluator we write will follow. Each constructor from the AST definition has a case in the eval
function that evaluates that constructor. This pattern and the accompanying data
construct gives us three nice language properties - completeness, determinicity, and normalization
Completeness says that every term constructed in AE
will be evaluated by eval
. Can we prove that? The Haskell data
construct gives us exceptionally nice properties for free, without direct proof. By definition, every value in the abstract syntax for AE
is constructed with Num
, Plus
, and Minus
. There are no other AE
values. This is true of any type defined using data
in Haskell. All values of the type are constructed with its constructors.
We know AE
is complete because there is one inference rule for every syntax construct and one case in eval
for each inference rule. The eval
function has one case for each constructor from AE
. Because all AE
values are built with those constructors, there is a case for every AE
value in the eval
function. While not a proof, this gives strong evidence that our language definition is complete.
Determinicity says that if we call eval
on any term, we will get the same result. This is an exceptionally important property as we do bit want the same call to eval
resulting in different values. We know AE
is deterministic because there is one inference rule for interpreting each element of the concrete syntax. In turn we know that eval
is deterministic because there is precisely one case in the definition for each AE
constructor. Given (Num n)
there is one rule and one eval
case. Given Plus
there is one rule and one eval
case.
Normalization says that a call to eval
on any term constructed in AE
will terminate. Again, a pretty bold statement. Can we prove it? The elements of a Haskell data
type specifically and algebraic types generally have are well-ordered
. In mathematics, well-ordered means that every subset of a set has a least element. Getting from well-ordered to guaranteed termination takes a bit of work, but the gist is that components of a constructor are smaller than the constructor itself. Each recursive call on the parts of a term is made on a smaller term. Consider eval (Plus t1 t2)
where t1
and t2
can be viewed as smaller than (Plus t1 t2)
. Because every set of AE
terms has a least element, pulling a term apart into its pieces will eventually get to that least element.
The least elements of AE
are those constructed by (Num n)
. When the evaluator reaches the least element, it cannot go further and terminates. Every call to eval
gets closer to the least element. Note that those least elements are what we defined as values. This is not a coincidence. Not only does AE
terminate, it always terminates with a value. Said in terms we all understand, the AE
evaluator never crashes for any element of AE
.
Before we say more about AE
, let’s put all the pieces together into a single definition. Definition of AE
is now complete syntactically, semantically and operationally:
AE
grammarAE
inference ruleseval
and parse
.First we defined a concrete syntax for terms:
\[\begin{align*} t ::= & \NUM \mid t \underline{+} t \mid t \underline{-} t \\ \end{align*}\]and syntactic definition of values:
\[\begin{align*} v ::= & \NUM \\ \end{align*}\]Then we defined basic inference rules formally defining how AE
is interpreted:
We implemented a parser from the grammar and an evaluator from the inference rules:
eval :: AE -> Maybe AE
eval (Num x) = return (Num x)
eval (Plus t1 t2) = do (Num v1) <- (eval t1)
(Num v2) <- (eval t2)
return (Num v1+v2)
eval (Minus t1 t2) = do (Num v1) <- (eval t1)
(Num v2) <- (eval t2)
return (Num v1-v2)
Given the parser and evaluator for AE
, we can now define a language interpreter, interp
, that puts everything together:
interp :: String -> Maybe AE
interp = eval . parseAE
In words, interp
is the composition of parseAE
and eval
. This notation uses Haskell’s function composition to define that parse
will be called first and the output passed to eval
. If parseAE
throws an error, interp
will terminate without passing a value to eval
.
If you are unfamiliar with the point-free style and the use of function composition, interp
can be defined in a more traditional manner as:
interp x = eval (parseAE x)
Use the notation you are comfortable with. There is no advantage to either, but function composition more closely resembles the mathematics of interpreter construction.
AE
is a silly language that is less powerful than one of those bank calculators you get when you open a checking account. It adds and subtracts numbers. However, we need to start somewhere and AE
is good for that.
We said earlier that eval
is complete, deterministic and normalizing. Complete in that any element of AE
can be interpreted, deterministic in that there is one way to interpret any element of AE
, and normalizing in that every interpretation of and AE
element terminates.
Unfortunately, these nice properties result from AE
being a trivial language. Completeness and deterministic are properties we will seek to preserve as we add features. However, normalizing will prove problematic when we add recursion and looping.
Another important question, one might say the most important question is whether eval
is correct. How do we define correctness? We have the evaluation relation, $t\eval v$ that relates terms to values and we have eval :: AE -> AE
that is a function from terms to values. The simplest correctness definition is:
This states that evaluating any $t$ results in a value that is related to $t$ by the evaluation relation, $t\eval v$. By this definition, is our interpreter correct? How would you go about proving this condition?
There is one property of AE
structures that underlies our discussion. The AE
abstract syntax and algebraic types have both inductive and extensionality principles. The inductive principle allows us to prove properties over AE
and the extensionality principle allows us to determine if two AE
structures are equivalent. We won’t use either principle yet, so let’s define them informally for now.
Induction over AE
says that some property, $p$, is true for all elements of AE
if we can:
What are the base and inductive cases? For AE
, Num
is the base case while Plus
and Minus
are inductive cases. Num
is a base case because it is not constructed from other elements of AE
. We can see this in the data
definition. It is also reflected in the definition of eval
where there is no recursive call for Num
. Plus
and Minus
both depend on other AE
constructions in their definition and eval
cases. Stated in terms of the AE
concrete syntax, the induction principle states that $p$ is true for any AE
term $t$ when:
Extensionality is simpler to state. Two terms in AE
are equivalent if they use the same constructor and their parts are equivalent. For any $t$ and $t’$, $t=t’$ if one of the following cases applies:
Thus if no case applies, then the terms are not equal. So, $t_1+t_2$ will never be equal to $t_1’-t_2’$.
Lots of definitions to get us started:
data
system defines algebraic typesAE
. Do we lose any of our nice properties by doing this?eval
function to return a Haskell Int
rather than an AE
value. This is an alternative to returning the Haskell datatype.AE
replacing the Plus
and Minus
constructors in the AST with a single constructor Binop op t1 t2
where op
is the represented binary operation. You will need to change the parser to generate Binop
rather than Plus
and Minus
. Think carefully about what op
should be. If you do it right, you should be able to add any operator to AE
by simply changing the parser.Documentation for the Parsec Expression Parser
Three Letter Acronym ↩