Table of Contents
From Haskell to LLVM
Introduction
In this presentation, we illustrate LHC (the Little Haskell Compiler), which recognizes a strict subset of the Haskell Programming Language, compiles it and produces LLVM code as output.
In the last section, you can find how to install and use the tool.
The High-end Language
Haskell is an advanced, Purely Functional Programming Language. Born in 1990, it has become one of the most popular Functional Programming Languages. Typically, Functional Languages are more confined to the academic context as developing programs is more complex than using an Imperative Language (C, Java, Python). Nonetheless, Haskell is increasingly used in the industrial domain because of its useful properties, e.g. it is much easier to prove Correctness.
Being a Pure Functional Language, Haskell has many interesting properties:
- Referential Transparency
- Single Assignment
- Non-Strictness
- Type Inference
- Functions are first-class entities
- Lexical Scoping
- …
LHC tries to follow in the footsteps of this Language by implementing some of its peculiar features (Referential Transparency, Single Assignment, Lexical Scoping).
The Core
LHC is based on:
- a Scanner written in JFlex
- a Parser written in Cup
- additional classes for Semantic Analysis and IR Generation, written in Java
The Target Language
The Target Language of LHC is LLVM, a language mainly used as an intermediate code representation.
Born as a research project at the University of Illinois, it has developed into an umbrella project consisting of several subprojects, including the LLVM Language itself and a C++ API to build and optimize LLVM code at a higher level of abstraction.
Features
Currently, LHC supports the following subset of Haskell:
- Supported Types:
- Basic Types: Int, Double, Char, Bool
- Aggregate Types: Lists (not Recursive), Strings
- Other Types: Functions (not first-class)
- Supported operations:
- Addition (+)
- Subtraction (-)
- Multiplication (*)
- Division (/)
- Integer Division (div)
- Remainder (rem)
- Logical operators:
- Logical and (&&)
- Logical or (||)
- Logical not (not)
- Relational operators:
- Greater than (>)
- Greater or equal to (>=)
- Less than (<)
- Less or equal to (<=)
- Equal to (==)
- Not equal to (/=)
- Element extraction from List/String (!!)
- Function call
- Constructs:
- If-Then-Else (the only possible Control Flow Statement: loops do not exist in Haskell)
- Let Bindings (both in the Functional and Imperative Part)
- Do Block
- I/O:
- Print Function (accepts Int, Double, Char, String)
- Language Properties:
- Indentation-based Parsing
- Referential Transparency
- Single Assignment (“Everything is a constant”)
- Lexical Scoping
Partial/Missing Features
- Functions cannot return Lists/Strings
- Global Lists/Strings are not supported
The Compiler
Note: debug code is omitted entirely; we use "[...]" for meaningful compiler code that is covered in other sections, to focus on what is most important.
Scanner
In this section, we illustrate how Tokens are scanned in LHC using a Scanner written in JFlex.
General Scanning Algorithm
The Scanning Algorithm implemented in LHC uses a FIFO `tokenQueue` where all tokens are inserted once scanned. All Tokens are sent to the Parser as Symbols to provide additional information for Semantic Analysis.
In particular, every time a Token is requested by the Parser:

- if the `tokenQueue` is not empty, return the next element of the queue (in the pre-scanning method `next_token_custom()`)
- otherwise:
  - look for the next token(s) (launches the default scanning method `next_token()`, called by the Parser by default)
  - manage the token(s) (in the post-scanning method `manageToken()`)
  - return the first token to the Parser (in `manageToken()`, too)

Managing basic Tokens, i.e. all those not related to indentation, only consists of inserting them into the `tokenQueue`. For indentation-related token management, read this section.

About code: when a method is reported, it assumes a form like `method_name()`, regardless of the number of arguments.
Snippet of `next_token_custom()`:

```java
public Symbol next_token_custom() throws IOException {
    // if the token Queue is not empty, extract the next item
    if (this.tokenQueue.size() > 0) {
        return this.tokenQueue.remove();
    }
    [...] // Additional code for Indentation-based Parsing, we will cover this later
    return this.next_token();
}
```
Snippet of `manageToken()`:

```java
private Symbol manageToken(Symbol... symbols) {
    [...] // Additional code for Indentation-based Parsing
    for (Symbol sym : symbols) {
        tokenQueue.add(sym);
    }
    return this.tokenQueue.remove();
}
```
It is always up to the Parser to ask the Scanner for a new token; by default, the Parser calls the `next_token()` method of the Scanner and a new Token is eventually received. In this case, there is a pre-scanning method to be executed: the `scan with` code section of the Parser needs to be tuned accordingly:

```java
scan with {:
    if (true) { // Required because of a bug in this enhanced implementation of Cup
        Scanner scanner = (Scanner) this.getScanner();
        Symbol s = scanner.next_token_custom();
        [...] // Additional code for tracking current line and column and for drawing the Parse Tree
        return s;
    }
:}
```
Basic Token management
The Scanner can recognize several keywords and regular expressions that identify specific terminals of the Grammar.
Here is a snippet of the Scanner code for identifying those Tokens:
```
int = 0|[1-9][0-9]*              //unsigned int
rational = {int}\.[0-9]+         //unsigned rational
expform = {rational}[eE]-?{int}
double = {rational}|{expform}
bool = True|False
char = \'.\'
string = \"~\"
id = [a-z][a-zA-Z0-9_\']*
nl = \r|\n|\r\n
ws = [ \t]

%%

"="    {return manageToken(createSymbol(sym.eq));}
"::"   {return manageToken(createSymbol(sym.clns));}
","    {return manageToken(createSymbol(sym.cm));}
"("    {return manageToken(createSymbol(sym.ro));}
")"    {return manageToken(createSymbol(sym.rc));}
"["    {return manageToken(createSymbol(sym.bo));}
"]"    {return manageToken(createSymbol(sym.bc));}
"->"   {return manageToken(createSymbol(sym.arrow));}
"+"    {return manageToken(createSymbol(sym.plus));}
"-"    {return manageToken(createSymbol(sym.minus));}
"*"    {return manageToken(createSymbol(sym.times));}
"/"    {return manageToken(createSymbol(sym.div));}
div    {return manageToken(createSymbol(sym.intdiv));}
rem    {return manageToken(createSymbol(sym.rem));}
"&&"   {return manageToken(createSymbol(sym.and));}
"||"   {return manageToken(createSymbol(sym.or));}
not    {return manageToken(createSymbol(sym.not));}
"/="   {return manageToken(createSymbol(sym.relnoteq));}
"=="   {return manageToken(createSymbol(sym.releq));}
">="   {return manageToken(createSymbol(sym.relge));}
">"    {return manageToken(createSymbol(sym.relgt));}
"<="   {return manageToken(createSymbol(sym.relle));}
"<"    {return manageToken(createSymbol(sym.rellt));}
";"    {return manageToken(createSymbol(sym.sep));}
in     {[...] return manageToken(createSymbol(sym.in));}
main   {return manageToken(createSymbol(sym.main));}
do     {[...] return manageToken(createSymbol(sym.do_begin));}
"!!"   {return manageToken(createSymbol(sym.index));}
if     {return manageToken(createSymbol(sym.if_begin));}
then   {return manageToken(createSymbol(sym.then));}
else   {return manageToken(createSymbol(sym.else_begin));}
let    {[...] return manageToken(createSymbol(sym.let));}
print  {return manageToken(createSymbol(sym.print));}
Int    {return manageToken(createSymbol(sym.type_int));}
Double {return manageToken(createSymbol(sym.type_double));}
Bool   {return manageToken(createSymbol(sym.type_bool));}
Char   {return manageToken(createSymbol(sym.type_char));}
String {return manageToken(createSymbol(sym.type_string));}

{int}    {return manageToken(createSymbol(sym.val_int, Integer.valueOf(yytext())));}
{double} {return manageToken(createSymbol(sym.val_double, Double.valueOf(yytext())));}
{bool}   {return manageToken(createSymbol(sym.val_bool, Boolean.valueOf(yytext())));}
{char}   {return manageToken(createSymbol(sym.val_char, yytext().charAt(1)));}
{string} {return manageToken(createSymbol(sym.val_string, yytext().substring(1, yytext().length()-1)));}

//LLVM does not support the single quote as a valid character for identifiers, while dots are:
//for this reason, every single tick is replaced with a dot
{id}   {return manageToken(createSymbol(sym.id, yytext().replace('\'', '.')));}

{ws}   {;}
{nl}   {[...]}

/* Single-line Comment */
"--" ~ {nl} {[...]}
```
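The numeric macros above can be checked with ordinary Java regular expressions. The following sketch (illustrative only, not JFlex code; the class name is hypothetical) shows what the `int` and `double` macros accept:

```java
import java.util.regex.Pattern;

public class LexPatterns {
    // Java equivalents of the JFlex macros above, with {int} inlined
    static final Pattern INT      = Pattern.compile("0|[1-9][0-9]*");
    static final Pattern RATIONAL = Pattern.compile("(0|[1-9][0-9]*)\\.[0-9]+");
    // double = rational with an optional [eE]-?{int} exponent
    static final Pattern DOUBLE   =
        Pattern.compile("(0|[1-9][0-9]*)\\.[0-9]+([eE]-?(0|[1-9][0-9]*))?");

    public static void main(String[] args) {
        System.out.println(INT.matcher("42").matches());         // true
        System.out.println(INT.matcher("042").matches());        // false: no leading zeros
        System.out.println(DOUBLE.matcher("3.14e-2").matches()); // true
    }
}
```

Note how the alternation `0|[1-9][0-9]*` rules out leading zeros: `042` is not a valid integer literal for this scanner.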
Indentation-based Parsing
For nested blocks, Haskell supports both a coding style that uses brackets and an indentation-based one, which is what LHC supports. However, Haskell uses off-side rules only for certain blocks, like Let Bindings and Do Blocks, while others, such as the If-Then-Else block, are free-form.
How can we recognize Indentation?
A possible approach, used in Python, consists of defining three dedicated tokens: Indent, Dedent and Separator; Indent starts a block, Dedent terminates it, and Separator divides statements inside the same block.
We also define additional variables and data structures:
- an `indentStack`, where each entry contains the column index of a block
- an `indentEnable` flag, raised when an indentation-sensitive block is recognized, i.e. a "let" or "do" token is scanned
- a `scanIndent` flag, raised when `indentEnable` is active, so that the next Token will define the column index for the block
- a `foundNewline` flag, raised when a newline is recognized
- a `dedentForce` flag, to force a Dedent in the very peculiar case in which we scan an "in" Token.
The concept is the following: when an indentation-sensitive block is recognized, i.e. its keyword is scanned ("let", "do"), an Indent is scanned and `indentEnable` is raised. Before scanning the following Token, `scanIndent` is raised, too: that Token will define the column index for the block, which is stored in the `indentStack`.
Beware: nested blocks can only be more to the right than the current one.
In case we are in an indentation-sensitive block:
- if a newline is recognized and the following line is more to the left than the current block, a Dedent is scanned: the block is closed. Actually, this is a loop: we may scan multiple Dedents at once; the loop halts when a block is aligned more to the left than the line. In the particular case where a Dedent is scanned because `dedentForce` was raised, only one Dedent is accepted, i.e. the Let Binding corresponding to the "in" is closed, nothing else
- if the following and the current line are aligned the same, then a Separator is scanned
- if the following line is more to the right than the block, no additional symbol is scanned: we are still in that block
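The rules above can be condensed into a small, self-contained sketch. This is a hypothetical demo class, not the LHC scanner: it leaves out error handling and the `dedentForce` special case, and describes each statement simply by its starting column and by whether it opens a new indentation-sensitive block (a "let"/"do"):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class IndentDemo {
    public static List<String> tokenize(int[] columns, boolean[] opensBlock) {
        Deque<Integer> indentStack = new ArrayDeque<>();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < columns.length; i++) {
            int col = columns[i];
            if (opensBlock[i]) {
                // a new block starts here: remember its column, emit an Indent
                indentStack.push(col);
                out.add("INDENT");
            } else {
                // close every block lying to the right of this line
                while (!indentStack.isEmpty() && indentStack.peek() > col) {
                    indentStack.pop();
                    out.add("DEDENT");
                }
                // same column as the enclosing block: a new statement inside it
                if (!indentStack.isEmpty() && indentStack.peek() == col) {
                    out.add("SEP");
                }
                // further to the right: still inside the block, emit nothing
            }
        }
        while (!indentStack.isEmpty()) { // end of input closes all open blocks
            indentStack.pop();
            out.add("DEDENT");
        }
        return out;
    }

    public static void main(String[] args) {
        // a block opening at column 3, a nested one at column 5,
        // then a statement back at column 3
        System.out.println(tokenize(new int[]{3, 5, 3},
                                    new boolean[]{true, true, false}));
    }
}
```

Running the example produces `[INDENT, INDENT, DEDENT, SEP, DEDENT]`: the nested block is closed by the dedent, and the line at column 3 becomes a new statement of the enclosing block.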
Having defined the principle, we can illustrate the code.
Here are the additional instructions for indentation control, for specific Tokens:
```
in   {dedentForce = true; return manageToken(createSymbol(sym.in));}
[...]
do   {indentEnable = true; return manageToken(createSymbol(sym.do_begin));}
[...]
let  {indentEnable = true; return manageToken(createSymbol(sym.let));}
[...]
{nl} {foundNewline = true;}

/* Single-line Comment */
"--" ~ {nl} {foundNewline = true;}
```
Code snippet to enable `scanIndent` (in `next_token_custom()`):
```java
if (indentEnable) {
    indentEnable = false;
    scanIndent = true;
}
```
Code snippet to scan a new Indent, define the column level of the nested block and store it (in `manageToken()`):
```java
if (!endOfCode && scanIndent) {
    indentColumn = yycolumn + 1;
    if (indentStack.size() == 0 || indentColumn > indentStack.peek()) {
        indentStack.push(indentColumn);
        tokenQueue.add(createSymbol(sym.indent));
    } else {
        report_error("Nested block is not indented further in than the enclosing expression");
        tokenQueue.add(createSymbol(sym.error));
    }
    scanIndent = false;
    foundNewline = false;
}
```
Code snippet to scan one or more Dedents (in `manageToken()`):
```java
if (!endOfCode && !indentStack.empty() && (foundNewline || dedentForce)) {
    dedentColumn = yycolumn + 1;
    boolean thereIsDedentToInsert = true;
    while (thereIsDedentToInsert && !indentStack.empty()) {
        indentColumn = indentStack.peek(); // take the column of the innermost block
        if (indentColumn > dedentColumn || dedentForce) { // check if we are out of the innermost block
            dedentForce = false;
            indentStack.pop();
            [...]
            tokenQueue.add(createSymbol(sym.dedent));
        } else {
            thereIsDedentToInsert = false;
            if (indentColumn.equals(dedentColumn)) {
                // the following statement is not outside of the block: add only a Separator
                tokenQueue.add(createSymbol(sym.sep));
            }
        }
    }
}
```
Error Handling
If no Token can be recognized, a Lexical Error is notified by calling the `report_error()` method:
```java
public void report_error(String msg) {
    System.err.print("ERROR: Lexical error");
    System.err.print(" (line " + (yyline+1) + ", column " + (yycolumn+1) + "): " + msg);
    System.err.println(" (Token \"" + yytext() + "\")");
}
```
If no Token can be recognized, then the text matches with the following catch-all regular expression, defined at the end of the code:
```
.    {report_error("Not recognized token"); return this.manageToken(createSymbol(sym.error));}
```
Parser
The Parser, written in Cup, recognizes the syntax of the program and handles possible occurring errors.
Here is the full Grammar of the Language recognized by LHC (Semantic Rules are explored in this Section).
```
PROGRAM ::= /* empty program */
          | indent FUNCT_PART IMPER_PART dedent

/// ////////////////// ///
///  IMPERATIVE PART   ///
/// ////////////////// ///

IMPER_PART ::= main eq IO_ACTION

IO_ACTION ::= PRINT
            | DO_BLOCK
            | IF_BLOCK_IMPER

IO_ACTIONS ::= IO_ACTIONS sep IO_ACTION
             | IO_ACTIONS sep LET_BLOCK_IMPER
             | IO_ACTION
             | LET_BLOCK_IMPER

PRINT ::= print ACTARG

DO_BLOCK ::= do_begin indent IO_ACTIONS dedent

IF_BLOCK_IMPER ::= if_begin COND then IO_ACTION else_begin IO_ACTION

LET_BLOCK_IMPER ::= let indent LET_STMTS dedent

LET_STMTS ::= LET_STMTS sep DECL_TYPE
            | LET_STMTS sep DEF_VALUE
            | DECL_TYPE
            | DEF_VALUE

/// ////////////////// ///
///  FUNCTIONAL PART   ///
/// ////////////////// ///

FUNCT_PART ::= GLOBAL_STMTS

GLOBAL_STMTS ::= /* empty List of Statements section */
               | GLOBAL_STMTS STMT sep

/// STATEMENTS ///

STMT ::= DECL_TYPE
       | DEF_VALUE
       | DEF_FUNCT

DECL_TYPE ::= id cm DECL_TYPE
            | id clns TYPE

DEF_VALUE ::= id eq EXPR

/* no nullary functions */
DEF_FUNCT ::= id LFORMARG eq EXPR

LFORMARG ::= LFORMARG id
           | id

/* special expression management for boolean conditions */
COND ::= EXPR

/// EXPRESSIONS ///

EXPR ::= EXPR plus EXPR
       | EXPR minus EXPR
       | EXPR times EXPR
       | EXPR div EXPR
       | EXPR intdiv EXPR
       | EXPR rem EXPR
       | EXPR and EXPR
       | EXPR or EXPR
       | EXPR relnoteq EXPR
       | EXPR releq EXPR
       | EXPR relgt EXPR
       | EXPR relge EXPR
       | EXPR rellt EXPR
       | EXPR relle EXPR
       | EXPR index EXPR
       | ro EXPR rc
       | not EXPR
       | minus EXPR %prec uminus
       | LET_BLOCK_FUNC
       | IF_BLOCK_FUNC
       | FUNCT_CALL
       | VALUE
       | EXPR_LIST

FUNCT_CALL ::= id LACTARG

LET_BLOCK_FUNC ::= let indent LET_STMTS dedent in EXPR

IF_BLOCK_FUNC ::= if_begin COND then EXPR else_begin EXPR

/// args for function call ///
ACTARG ::= id
         | VALUE
         | ro EXPR rc

// list of input arguments for function call
LACTARG ::= /* empty list of args: we call a Value */
          | LACTARG ACTARG

/// VALUES ///

// Represents constants
VALUE ::= VALUE_BASIC

// VALUE_BASIC represents constants
VALUE_BASIC ::= val_int
              | val_double
              | val_bool
              | val_char
              | val_string

EXPR_LIST ::= bo LEXPR bc

LEXPR ::= EXPR
        | LEXPR cm EXPR

/// TYPES ///

TYPE ::= TYPE_VALUE
       | TYPE_FUNC

// both basic types and compound
TYPE_VALUE ::= TYPE_LIST
             | TYPE_BASIC

TYPE_BASIC ::= type_int
             | type_double
             | type_bool
             | type_char

TYPE_LIST ::= bo TYPE_BASIC bc
            | type_string

// no higher order functions
TYPE_FUNC ::= TYPE_VALUE arrow TYPE_FUNC
            | TYPE_VALUE arrow TYPE_VALUE
```
To avoid Shift-Reduce conflicts, operators are given precedences, shown below:
```
/// ///////////////////// ///
///   PRECEDENCE RULES    ///
/// ///////////////////// ///

precedence left or;
precedence left and;
precedence nonassoc releq, relnoteq, relgt, relge, rellt, relle;
precedence left plus, minus;
precedence left times, div, intdiv, rem;
precedence nonassoc index;
precedence nonassoc not, uminus;
```
We can notice that Haskell has some peculiarities, especially in comparison with typical imperative languages, like Java.
Any Haskell program can be split into two parts: a functional part and an imperative one; the latter corresponds to the main and is the only region of the program where it is possible to access I/O devices via IO Actions; in general, it is where we can modify the state of the machine. Instead, all functions in the first part are stateless: this property is fundamental for making Haskell a Purely Functional programming language and is also known as referential transparency.
To declare local variables, the language defines Let Bindings that can be placed inside the function body; here, they are represented by `LET_BLOCK_FUNC` and `LET_BLOCK_IMPER`.
Perhaps surprisingly, the If-Then-Else construct and the Let Bindings defined in the functional part are Expressions. In particular, If-Then-Else is an expression whose return value corresponds to that of the Then/Else branch; since an expression always has one and only one type, the two branches must return a value of the same type. Let Bindings, instead, have the same type and return value as the expression following the `in` token (the expression is part of the construct).
In Haskell, the If-Then-Else construct requires defining both branches; this is due to a particular rule of the language:
Expressions must always return something
The If-Then-Else construct is an expression, so it must always return something regardless of the result of the condition: every If must define an expression for both the branches.
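Java's conditional operator behaves analogously: it is an expression, the else branch is mandatory, and both branches must yield a value of the same type. A small illustrative sketch (hypothetical class, not part of LHC):

```java
public class IfExpr {
    // Like Haskell's if-then-else, ?: is an expression: the else branch is
    // mandatory and both branches must produce a value of the same type.
    static int absolute(int x) {
        return (x >= 0) ? x : -x;
    }

    public static void main(String[] args) {
        System.out.println(absolute(-7)); // prints 7
    }
}
```

An `if` statement without `else` cannot appear here, for the same reason Haskell rejects it: the expression must always produce a value.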
Calling a value is like calling a nullary function, which is, in fact, how Haskell interprets values.
An n-ary function, i.e. a function taking n arguments, is equivalent to a sequence of unary functions taking one argument at a time. This concept is known as currying and is extensively used in Haskell for defining higher-order functions.
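Currying can be sketched in Java with nested `Function` objects: an n-ary function becomes a chain of unary ones, and partially applying it simply returns the rest of the chain (an illustrative sketch, not LHC code):

```java
import java.util.function.Function;

public class CurryDemo {
    // A binary function Int -> Int -> Int expressed as a chain of unary
    // functions, mirroring how Haskell curries every n-ary function.
    static Function<Integer, Function<Integer, Integer>> add =
        x -> y -> x + y;

    public static void main(String[] args) {
        Function<Integer, Integer> addFive = add.apply(5); // partial application
        System.out.println(addFive.apply(3)); // prints 8
    }
}
```

Applying the first argument yields a new unary function; only after the last argument is supplied is the result computed.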
Finally, strings are lists of chars: there exists no Array type.
Error Handling
The Parser is also able to recognize syntax errors and, in some cases, recover from them. For this purpose, Cup allows using an `error` token for error recovery.
Every time an error is detected, the Parser launches the `syntax_error()` method by default, which forwards the error to `report_error()`, whose body is reported here:
```java
public void report_error(Object info) {
    noCompileErrors = false;
    System.err.print("ERROR: Syntax error");
    if (info instanceof Symbol)
        if (((Symbol)info).left != -1) {
            int line = (((Symbol)info).left);
            int column = (((Symbol)info).right);
            System.err.print(" (line " + line + ", column " + column + "): ");
        } else
            System.err.print(": ");
    else
        System.err.print(": ");
}
```
The `noCompileErrors` flag is true at the beginning and set to `false` when the first error of any kind occurs.
We can observe that `report_error()` builds an incomplete error message: no information about the kind of error has been provided, since the error has not been recovered yet!

To complete the message, we need to wait until the Parser recovers from an error, i.e. reduces a rule containing the `error` token. At that point, `report_syntax_error()` is called and the error is diagnosed.
These are the additional rules, included in the compiler, to handle some syntax errors:
```
IMPER_PART ::= main eq error

IO_ACTIONS ::= IO_ACTIONS:io_actions sep error

DO_BLOCK ::= do_begin indent error dedent

IF_BLOCK_IMPER ::= if_begin error then IO_ACTION else_begin IO_ACTION
                 | if_begin COND then error else_begin IO_ACTION
                 | if_begin COND error else_begin IO_ACTION

GLOBAL_STMTS ::= GLOBAL_STMTS error sep

LET_BLOCK_FUNC ::= let error in EXPR

IF_BLOCK_FUNC ::= if_begin error else_begin EXPR
                | if_begin COND then error else_begin EXPR
                | if_begin COND error else_begin EXPR
```
This is the list of possible syntax errors that LHC can recognize:
- Error in Main
- Error in Imperative Part Statement
- Error in Do Block
- Error in Condition of If-Then-Else block
- Error in Then Block of If-Then-Else block
- Missing 'then' in If-Then-Else block
- Error in Functional Statement
- Error in Statement inside a Let Block
Semantic Analysis
Together with lexicon and syntax, semantics is one of the main components of a program. Thanks to semantics, it is possible to define context-sensitive information, like variable scoping and typing, empowering the language.
Note:
In Cup, it is possible to define intermediate actions, like:
A ::= x { something } y
which is equivalent to
A ::= M y
M ::= x { something }
In particular, Cup can define implicit rules for the markers and translate the grammar from the first representation to the second one; however, this may raise conflicts if the parser cannot be sure that it will reduce exactly that rule.
The typing system
Like any other programming language, Haskell has a typing system. It is very complex and uses very high-level concepts such as typeclasses, polymorphic data structures, type variables, the kind system, and so on.
In LHC, only pretty simple types are supported: Int, Double, Bool, Char, non-recursive Lists, Strings and Functions.
Each of these types has a standard representation as a type tree. Basic types are represented as a simple tree node, whose label corresponds to the name of the type itself. For aggregate types, it is different: Lists are polymorphic, so they are represented as a root node and a child node, holding the type of the elements of the list; strings are simple lists of chars. For functions, it is even more interesting: thanks to currying, we can consider functions as a sequence of unary functions.
So, we can represent them as a binary tree where:
- the root node is labelled as “Function”
- the left child holds the type of the first argument
- the right child holds the type of the current function without the first argument: the tree depth is equal to the number of arguments
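This curried representation can be sketched as a tiny binary tree. The class and method names below are hypothetical (a simplified stand-in for the actual `Type` class): arity and return type fall out of a walk along the right spine:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TypeTree {
    final String name;
    final List<TypeTree> children = new ArrayList<>();

    TypeTree(String name, TypeTree... kids) {
        this.name = name;
        children.addAll(Arrays.asList(kids));
    }

    // Each "Function" node on the right spine contributes one argument.
    int arity() {
        return name.equals("Function") ? 1 + children.get(1).arity() : 0;
    }

    // The return type is the rightmost leaf of the spine.
    String returnType() {
        return name.equals("Function") ? children.get(1).returnType() : name;
    }

    public static void main(String[] args) {
        // the tree for Int -> Int -> Double
        TypeTree t = new TypeTree("Function", new TypeTree("Int"),
                         new TypeTree("Function", new TypeTree("Int"),
                             new TypeTree("Double")));
        System.out.println(t.arity() + " " + t.returnType()); // 2 Double
    }
}
```

The depth of the right spine equals the number of arguments, exactly as described above.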
In this compiler, type trees are represented by the `Type` class, composed of the following methods and fields:

- constructors, setters, getters
- `typeName`: the label of the root node
- `typeParams`: the child nodes of the tree
- `typeMap`: contains all the "standard" representations of Types, the same described some lines above; it is used as a sort of lookup table for type trees
- `widen()`: method for type widening, i.e. it widens Int to Double if required
- `isSubtype()`: a method for checking subtypes; it takes the name of a certain type as argument and checks if the current type is a subtype of the argument
- `isEquivalent()`: method for type equivalence checking; it works like `isSubtype()`, but checks equivalence
- `returnType()`: given the type tree of a function, returns its return type
- `hasArity()`: a method for arity checking, i.e. it verifies whether a function takes a given number of arguments, provided as argument of the method
- `createTypeMap()`: creates the type map during parser initialization
- `addType()`: adds a type to the type map
- `typeExists()`: checks if a type is defined in the type map, i.e. if it is recognized by the language
- `getType()`: gets the type tree of a given type from the type map
- `dumpTypeMap()`, `dumpType()`: used for debugging purposes
In the type map, two additional types are used:

- `Any`, for polymorphic types like functions and lists
- `error`, for error recovery
The symbol table stack
During semantic analysis, one of the most important data structures is the symbol table: it contains all the information about declared values and functions, providing all those context-sensitive functionalities that empower the language.
Every symbol table is a hashmap containing many symbol table entries. Each symbol table entry has a name, a type and a flag to check whether it has been already assigned.
In this compiler, we do not use a unique symbol table; instead, we use a symbol table stack: in LHC, it is possible to define nested blocks with different contexts, e.g. nested let bindings.
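A simplified sketch of such a stack of symbol tables (hypothetical class, not LHC's actual `SymTableStack`): lookup walks from the innermost table down to the global one, and an `isAssigned` flag enforces single assignment:

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

public class SymStackDemo {
    static class Entry {
        final String type;
        boolean isAssigned = false;
        Entry(String type) { this.type = type; }
    }

    final LinkedList<Map<String, Entry>> stack = new LinkedList<>();

    void push() { stack.addFirst(new HashMap<>()); }
    void pop()  { stack.removeFirst(); }
    void declare(String name, String type) { stack.peekFirst().put(name, new Entry(type)); }

    // Search every table, innermost first: this is why a plain Stack is not enough.
    Entry lookup(String name) {
        for (Map<String, Entry> table : stack)
            if (table.containsKey(name)) return table.get(name);
        return null;
    }

    // Single assignment: a second assignment to the same name is rejected.
    boolean assign(String name) {
        Entry e = lookup(name);
        if (e == null || e.isAssigned) return false;
        e.isAssigned = true;
        return true;
    }

    public static void main(String[] args) {
        SymStackDemo s = new SymStackDemo();
        s.push();                 // global scope
        s.declare("x", "Int");
        s.push();                 // nested let binding
        s.declare("y", "Double");
        System.out.println(s.lookup("x").type);  // found in the outer table
        System.out.println(s.assign("x"));       // true: first assignment
        System.out.println(s.assign("x"));       // false: already assigned
    }
}
```

Popping the inner table when the nested block closes makes `y` invisible again, while `x` remains reachable from any deeper scope.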
The symbol table stack is implemented via the `SymTableStack` class, which defines an inner `SymTableEntry` class for the symbol table entries.
The `SymTableEntry` class defines objects with the following information:

- `type`: the type of the value/function, as a type tree
- `isAssigned`: boolean flag for checking assignment; initially, it is false
Thanks to this flag, values and functions can be assigned only once. Single assignment is a peculiar characteristic of functional languages, where values and functions are intended in a mathematical way. If you define cos(x) as a certain function, why should you define it again?
Where is the entry name? Since a symbol table is a hashmap, we use the name of the entry as a key and a `SymTableEntry` instance as its value.
This is the information contained in a `SymTableStack` instance:

- `symTableStack`: a LinkedList containing the stack of symbol tables itself; it is deliberately not defined as a Stack: there are cases where it is necessary to look in the symbol tables below (as explained for `getEntry()`)
- `pushSymTable()`: pushes a new symbol table onto the stack
- `popSymTable()`: pops a symbol table from the stack
- `peekSymTable()`: gets the symbol table on top of the stack, without popping it
- `putEntry()`: inserts an entry in the symbol table on top of the stack
- `containsEntry()`: checks if a value/function is present in the whole stack; search starts at the uppermost symbol table and ends at the bottom
- `getEntry()`: looks for an entry like `containsEntry()` and returns it, if present
- `getLevel()`: looks for an entry like `containsEntry()` and returns its level, where level is a positive integer and 0 corresponds to the bottom
- `isGlobal()`: checks whether a value/function is global, i.e. declared in the bottom symbol table
- `isLocallyDeclared()`: checks whether a value/function has been declared in the top-level symbol table
- `isAssigned()`: checks if a variable has already been assigned; used for the assignment uniqueness check
- `isDeclared()`: equivalent of `containsEntry()`
- additional fields and methods for debug and IR Generation (explored here)
In the Parser, the symbol table stack is allocated here:
```
action code {:
    /* The Symbol Table Stack to be used */
    private SymTableStack symTableStack = new SymTableStack();
:};
```
The AST
The abstract syntax tree is a tree representation of the abstract syntactic structure of the source code. Each node of the tree denotes a construct occurring in the text. If we define the programming language in Backus-Naur form, as Cup allows, then each symbol of the grammar represents a node of the tree; in particular, the node is a leaf if the symbol is terminal.
AST is very useful for top-down IR generation, as reported here.
In this compiler, the AST nodes are defined in the wrapper `AST` class, which contains all of them — in practice, one per non-terminal symbol:

- `BasicNode`: represents any node of the AST; every other node is a subtype of `BasicNode`; this class defines several methods and fields for code generation, as shown later
- `Program`: the root of the whole syntax tree, represents the whole program
- `FunctPart`: represents the functional part of the program
- `ImperPart`: represents the imperative part of the program
- `DoStmt`: abstract class, parent of all accepted statements inside a do block
- `IOAction`: abstract class, parent of all nodes that are also IO actions
- `DoBlock`: represents a do block
- `IfBlockImper`: represents an if-then-else in the imperative part
- `LetBlockImper`: represents a let binding in the imperative part
- `Stmt`: abstract class, parent of all statements, i.e. declarations and definitions
- `DeclType`: represents a type declaration
- `DefValue`: represents a value definition
- `DefFunct`: represents a function definition
- `BasicExpr`: abstract class, parent of all sorts of expressions; it also holds a type
- `Expr`: class for almost all expressions; basically, all operations are represented by this node
- `LetBlockFunc`: expression node for a let binding in the functional part
- `IfBlockFunc`: expression node for an if-then-else in the functional part
- `FunctCall`: expression node for function calls
- `Value`: expression node for constants
- `ExprList`: expression node for a list of expressions
- `LFormArg`: node containing the list of all formal arguments of a function
Now that we have defined the AST nodes, we may ask how they are interconnected. It is rather simple: they are connected according to the production rules of the corresponding non-terminal; e.g. a `Program` node always has two children, a `FunctPart` and an `ImperPart` node, as we can see in the production rule:

```
PROGRAM ::= indent FUNCT_PART IMPER_PART dedent
```
Here is the relation between non-terminals and AST nodes:
```
non terminal AST.Program            PROGRAM;
non terminal AST.FunctPart          FUNCT_PART;
non terminal AST.ImperPart          IMPER_PART;
non terminal LinkedList<AST.DoStmt> IO_ACTIONS;
non terminal AST.IOAction           IO_ACTION;
non terminal AST.Print              PRINT;
non terminal AST.Expr               COND, EXPR;
non terminal AST.ExprList           EXPR_LIST, LEXPR;
non terminal AST.FunctCall          FUNCT_CALL;
non terminal AST.Expr               ACTARG;
non terminal LinkedList<AST.Expr>   LACTARG;
non terminal AST.LFormArg           LFORMARG;
non terminal LinkedList<AST.Stmt>   LET_STMTS, GLOBAL_STMTS;
non terminal AST.DoBlock            DO_BLOCK;
non terminal AST.LetBlockFunc       LET_BLOCK_FUNC;
non terminal AST.LetBlockImper      LET_BLOCK_IMPER;
non terminal AST.IfBlockFunc        IF_BLOCK_FUNC;
non terminal AST.IfBlockImper       IF_BLOCK_IMPER;
non terminal AST.Stmt               STMT;
non terminal AST.Stmt               DEF_VALUE;
non terminal AST.DefFunct           DEF_FUNCT;
non terminal AST.DeclType           DECL_TYPE;
non terminal Type                   TYPE, TYPE_VALUE, TYPE_FUNC, TYPE_LIST, TYPE_BASIC;
non terminal AST.Value              VALUE, VALUE_BASIC;
```
Semantic rules
After exploring the additional classes for semantic analysis, we can explore the semantic rules, at least the most interesting ones. In particular, we focus extensively on how the symbol table stack is used and much less on building the AST, which is rather straightforward. Error detection and recovery are covered in the following section about error handling.
The start and the end of parsing
The `PROGRAM` non-terminal is the start symbol of the grammar. When parsing begins, we need a symbol table to store all the global variables and functions; when parsing terminates, we need to pop that symbol table, as we do not need it anymore. There is a particular case: empty programs; in this case, no symbol table is pushed or popped: the compiler recognizes the program as valid and terminates. Finally, we need to inform the user whether parsing was successful or not.
This is a code snippet of the semantic rules related to the `PROGRAM` symbol. There are still some lines about code generation, which are explored here.
```
PROGRAM ::= /* empty program */
            {:
                if (noCompileErrors) {
                    System.out.println("CODE COMPILED SUCCESSFULLY");
                }
                RESULT = new AST.Program(new AST.FunctPart(), new AST.ImperPart(null));
                [...] // Code generation
            :}
          | indent
            {:
                // push the top-level symtable
                symTableStack.pushSymTable();
            :}
            FUNCT_PART:funct_part IMPER_PART:imper_part
            {:
                // pop the top-level symtable
                symTableStack.popSymTable();
            :}
            dedent
            {:
                if (noCompileErrors) {
                    System.out.println("CODE COMPILED SUCCESSFULLY");
                    RESULT = new AST.Program(funct_part, imper_part);
                    [...] // Code generation
                } else {
                    System.out.println("CODE NOT COMPILED: FAILED");
                    RESULT = new AST.Program(null, null);
                }
            :}
          ;
```
The Print function
The `print()` function is the only action allowed to interact with I/O; it accepts numbers, chars and strings and prints them with a trailing newline. Since the only semantic constraint on this function is the type of the argument, this is also the only check to perform in the semantic rule. Its code is reported below.
```
PRINT ::= print ACTARG:actarg
          {:
              Type argType = actarg.getType();
              if (argType.isSubtype("Double") || argType.isSubtype("String") || argType.isSubtype("Char")) {
                  RESULT = new AST.Print(actarg);
              } else {
                  [...] // Error handling
              }
          :}
        ;
```
Do blocks
A do block consists of a sequence of IO Actions; this construct also defines its own context, i.e. its own symbol table. As for all other constructs that create symbol tables, the semantic rules both push and pop the symbol table, so that contexts are well defined. Here is the code:
```
DO_BLOCK ::= do_begin indent
             {: symTableStack.pushSymTable(); :}
             IO_ACTIONS:io_actions dedent
             {:
                 symTableStack.popSymTable();
                 RESULT = new AST.DoBlock(io_actions);
             :}
           | [...] // additional rule for detecting syntax errors
           ;
```
Declarations and definitions
Declarations and definitions are among the most important constructs: they allow declaring and instantiating variables. There is a difference between the two concepts:
- Declarations only declare that a value/function exists, but do not allocate it
- Definitions allocate them
There are three constructs for this purpose:
- Type Declaration
- Value definition
- Function definition
Type Declarations are responsible for inserting new entries in the current symbol table; given an id and a type, the rule checks that the id has not already been declared in the current symbol table and that, if it denotes a function, it does not return a list, as that is not possible in this language. Since the type is on the right of the rule, it is easy to propagate the type to multiple declarations using a synthesized attribute.
Here is the code:
```
DECL_TYPE ::= id:id cm DECL_TYPE:decl_type
              {:
                 Type type = decl_type.getType();
                 if (!symTableStack.isLocallyDeclared(id)
                     && !(type.isSubtype("Function") && type.returnType().isSubtype("List"))) {
                    symTableStack.putEntry(id, symTableStack.new SymTableEntry(type));
                 } else [...] // Error handling
                 RESULT = new AST.DeclType(new Type(type));
              :}
            | id:id clns TYPE:type
              {:
                 if (!symTableStack.isLocallyDeclared(id)
                     && !(type.isSubtype("Function") && type.returnType().isSubtype("List"))) {
                    symTableStack.putEntry(id, symTableStack.new SymTableEntry(type));
                 } else [...] // Error handling
                 RESULT = new AST.DeclType(type);
              :}
            ;
```
Value definitions allow defining local and global values. Once we recognize that we are parsing a value definition, an intermediate action marks the value as assigned, which allows recursion on the value (possible in Haskell). Before that, the action checks that the value is locally declared and not yet assigned. When the rule is reduced, we first verify that the type of the expression matches the type of the value; then, we check whether the value is global or local. If it is global, the value is treated as a nullary function, since global values are here implemented as functions with no arguments (we will see this during IR generation); otherwise, it is treated as a plain value.
Here is the code:
```
DEF_VALUE ::= id:id
              {:
                 if (symTableStack.isDeclared(id) && symTableStack.isLocallyDeclared(id)
                     && !symTableStack.isAssigned(id)) {
                    symTableStack.getEntry(id).setIsAssigned(true);
                 } else [...] // Error handling
              :}
              eq EXPR:expr
              {:
                 if (symTableStack.isDeclared(id) && symTableStack.isLocallyDeclared(id)
                     && symTableStack.isAssigned(id)
                     && !expr.type.isEquivalent(symTableStack.getEntry(id).getType().getTypeName())) {
                    [...] // Error handling
                 }
                 if (symTableStack.isGlobal(id)) {
                    // Global Values are actually Nullary Functions
                    ArrayList<String> argNames = new ArrayList<>();
                    ArrayList<Type> argTypes = new ArrayList<>();
                    AST.LFormArg lformarg = new AST.LFormArg(argNames, argTypes, expr.getType());
                    RESULT = new AST.DefFunct(id, lformarg, expr);
                 } else {
                    RESULT = new AST.DefValue(symTableStack.createUniqueId(id), expr);
                 }
              :}
            ;
```
Finally, function definitions allow defining functions. Once we recognize that we are parsing a function definition, an intermediate action defines the function as assigned, allowing recursion. Before that, the action checks if the function is locally declared and not assigned yet. Functions define their own context, i.e. their own symbol table, so we need to push a symbol table before parsing the rest of the function. Finally, the symbol table is popped and the AST node for the function definition is created.
```
DEF_FUNCT ::= id:id
              {:
                 if (symTableStack.isDeclared(id) && symTableStack.isLocallyDeclared(id)
                     && !symTableStack.isAssigned(id)) {
                    symTableStack.getEntry(id).setIsAssigned(true);
                 } else [...] // Error handling
                 symTableStack.pushSymTable();
              :}
              LFORMARG:lformarg eq EXPR:expr
              {:
                 [...] // Error handling
                 symTableStack.popSymTable();
                 RESULT = new AST.DefFunct(symTableStack.createUniqueId(id), lformarg, expr);
              :}
            ;
```
To handle formal arguments, represented by the LFORMARG symbol, we propagate the type of the arguments through the function type tree, assigning it and defining the arguments in the symbol table. Meanwhile, the parser checks that the number of arguments is correct and that no two arguments have the same name.
Here is the code:
```
LFORMARG ::= LFORMARG:lformarg id:id
             {:
                Type propType = lformarg.getPropType();
                if (!symTableStack.isLocallyDeclared(id) && propType != null && propType.isSubtype("Function")) {
                   Type argTree = new Type(propType.getTypeParam(0));
                   symTableStack.putEntry(id, symTableStack.new SymTableEntry(argTree, true));
                   lformarg.getArgNames().add(symTableStack.createUniqueId(id));
                   lformarg.getArgTypes().add(argTree);
                   RESULT = new AST.LFormArg(lformarg.getArgNames(), lformarg.getArgTypes(), propType.getTypeParam(1));
                } else [...] // Error handling
             :}
           | id:id
             {:
                String funcName = (String) parser.stack(-2);
                if (symTableStack.isDeclared(funcName)) {
                   Type type = symTableStack.getEntry(funcName).getType();
                   if (!symTableStack.isLocallyDeclared(id) && type != null && type.isSubtype("Function")) {
                      Type argTree = new Type(type.getTypeParam(0));
                      symTableStack.putEntry(id, symTableStack.new SymTableEntry(argTree, true));
                      ArrayList<String> argNames = new ArrayList<>();
                      ArrayList<Type> argTypes = new ArrayList<>();
                      argNames.add(symTableStack.createUniqueId(id));
                      argTypes.add(argTree);
                      RESULT = new AST.LFormArg(argNames, argTypes, type.getTypeParam(1));
                   } else [...] // Error handling
                } else [...] // Error handling
             :}
           ;
```
Expressions and conditions
Expressions are fundamental for performing computation, as they represent operations which return some value. In this compiler, we define two types of expressions: operations and special constructs. Operations are basic expressions like addition, subtraction and so on, while special constructs represent expressions that need special management, like if-then-else blocks and let bindings. In particular, special constructs are:
- let bindings;
- if-then-else blocks;
- function calls;
- constants;
- lists of expressions.
For all these expressions, we use special rules to handle them.
Instead, basic expressions are all handled in the same manner: we check the type of the operands and then produce the type of the result. The type is the only information that we can use for semantic analysis, in this case.
Conditions are special expressions, which must be of boolean type.
Example of operation: addition.
```
EXPR ::= EXPR:expr1 plus EXPR:expr2
         {:
            String opType = "+";
            AST.BasicExpr[] subExpressions = { expr1, expr2 };
            if (expr1.type.isSubtype("Double") && expr2.type.isSubtype("Double")
                && expr1.type.isEquivalent(expr2.type.getTypeName())) {
               RESULT = new AST.Expr(new Type(expr1.type), AST.ExprKind.PLUS, subExpressions);
            } else [...] // Error handling
         :}
```
Function calls
In this compiler, function calls refer both to functions and to values, as in Haskell. The semantic rule proceeds as follows: first, we check that the function/value is declared and assigned; then, we look at the number of actual arguments and distinguish two cases: if there are no arguments, we refer to a value, otherwise to a function. If we refer to a value, we check that the id is not a function and finally build the AST node. In the other case, we check that the types of the actual arguments match those of the formal arguments, then we build the AST node.
```
FUNCT_CALL ::= id:id LACTARG:lactarg
               {:
                  int nArgs = lactarg.size();
                  Type funcType = null;
                  boolean failedArgsTypeChecking = false;
                  if (symTableStack.isDeclared(id) && symTableStack.isAssigned(id)) {
                     funcType = symTableStack.getEntry(id).getType();
                     if (nArgs == 0) {
                        // This is not a Function Call, but a reference to a Value
                        if (funcType != null && !funcType.isSubtype("Function")) {
                           RESULT = new AST.FunctCall(new Type(funcType), symTableStack.createUniqueId(id), lactarg);
                        } else [...] // Error handling
                     } else if (nArgs > 0) {
                        // This is a Function Call
                        if (funcType != null && funcType.isSubtype("Function") && funcType.hasArity(nArgs)) {
                           for (AST.Expr actarg : lactarg) {
                              if (!actarg.type.isEquivalent(funcType.getTypeParam(0)))
                                 failedArgsTypeChecking = true;
                              funcType = funcType.getTypeParam(1); // extract the right child of the funcType tree
                           }
                           if (!failedArgsTypeChecking) {
                              RESULT = new AST.FunctCall(new Type(funcType), symTableStack.createUniqueId(id), lactarg);
                           } else [...] // Error handling
                        } else [...] // Error handling
                     }
                  } else [...] // Error handling
               :}
             ;
```
An important point: the symTableStack.getEntry() method extracts the symbol table entry of the value/function and, if it is not found in the top-level symbol table, it searches the symbol tables below. This implements lexical scoping: every value/function is available to all the blocks nested inside the one where it has been declared.
This means that the following Haskell code works:
```haskell
value :: Int
value = let x :: Int; x = 3
        in let y :: Int; y = 4
           in x + y
```
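The stack walk behind this lookup can be sketched as follows. This is a minimal model with invented names (not the actual LHC classes), where entries are reduced to plain type strings:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of lexically scoped lookup: search the top symbol table first,
// then fall back to the symbol tables below it on the stack.
class ScopedLookup {
    private final Deque<Map<String, String>> stack = new ArrayDeque<>();

    void push() { stack.push(new HashMap<>()); }          // enter a nested context
    void pop()  { stack.pop(); }                          // leave it

    void put(String id, String entry) { stack.peek().put(id, entry); }

    // Walk the stack from the innermost table outwards.
    String getEntry(String id) {
        for (Map<String, String> table : stack) {         // head-first = top-down
            String entry = table.get(id);
            if (entry != null) return entry;
        }
        return null;                                      // undeclared
    }
}
```

With this rule, the outer `x` is still visible while the inner let-binding for `y` is on top of the stack, which is exactly what the Haskell example above relies on.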
Let bindings (functional part)
Let bindings exist for one purpose: declaring and defining new variables in a nested context. For this reason, the semantic rules only create and destroy the symbol table in which the new declarations are stored.
```
LET_BLOCK_FUNC ::= let indent
                   {: symTableStack.pushSymTable(); :}
                   LET_STMTS:let_stmts dedent in EXPR:expr
                   {:
                      symTableStack.popSymTable();
                      RESULT = new AST.LetBlockFunc(new Type(expr.type), let_stmts, expr);
                   :}
                 | [...] // Additional rule for syntax errors
                 ;
```
The if-then-else block (functional part)
As with let bindings, if-then-else blocks are rather simple to analyze: the parser only needs to check that the two branches return values of the same type, otherwise it is an error.
```
IF_BLOCK_FUNC ::= if_begin COND:cond then EXPR:thenBody else_begin EXPR:elseBody
                  {:
                     if (thenBody.type.isEquivalent(elseBody.type)) {
                        RESULT = new AST.IfBlockFunc(thenBody.type, cond, thenBody, elseBody);
                     } else [...] // Error handling
                  :}
                | [...] // Additional rules for syntax errors
                ;
```
Building types
How are type trees built? By mounting the subtrees according to the production rules, as the code shows.
```
TYPE ::= TYPE_VALUE:type_value {: RESULT = type_value; :}
       | TYPE_FUNC:type_func   {: RESULT = type_func; :}
       ;

// both basic types and compound
TYPE_VALUE ::= TYPE_LIST:type_list   {: RESULT = type_list; :}
             | TYPE_BASIC:type_basic {: RESULT = type_basic; :}
             ;

TYPE_BASIC ::= type_int    {: RESULT = Type.getType("Int"); :}
             | type_double {: RESULT = Type.getType("Double"); :}
             | type_bool   {: RESULT = Type.getType("Bool"); :}
             | type_char   {: RESULT = Type.getType("Char"); :}
             ;

TYPE_LIST ::= bo TYPE_BASIC:type_basic bc
              {:
                 Type type = Type.getType("List");
                 type.setTypeParam(0, type_basic);
                 RESULT = type;
              :}
            | type_string {: RESULT = Type.getType("String"); :}
            ;

// no higher order functions
TYPE_FUNC ::= TYPE_VALUE:type_value arrow TYPE_FUNC:type_func
              {:
                 Type type = Type.getType("Function");
                 type.setTypeParam(0, type_value);
                 type.setTypeParam(1, type_func);
                 RESULT = type;
              :}
            | TYPE_VALUE:type_valueL arrow TYPE_VALUE:type_valueR
              {:
                 Type type = Type.getType("Function");
                 type.setTypeParam(0, type_valueL);
                 type.setTypeParam(1, type_valueR);
                 RESULT = type;
              :}
            ;
```
Error handling
As with lexical and syntax analysis, the semantics of a program can be wrong: these errors must be handled, too.
Errors are recognized inside the semantic rules, so error detection and recovery are defined inside the Parser.
Every time an error is detected, the parser calls the report_semantic_error() method, reported below:
```java
public void report_semantic_error(String msg) {
    noCompileErrors = false;
    System.err.print("ERROR: Semantic error");
    System.err.println(" (line " + currLine + ", column " + currColumn + "): " + msg);
}
```
Then, the parser creates a valid AST node for the production rule, so it recovers from the error and can move forward.
Example: semantic error in a functional if-then-else block:
```
IF_BLOCK_FUNC ::= if_begin COND:cond then EXPR:thenBody else_begin EXPR:elseBody
                  {:
                     if (thenBody.type.isEquivalent(elseBody.type)) {
                        RESULT = new AST.IfBlockFunc(thenBody.type, cond, thenBody, elseBody);
                     } else {
                        report_semantic_error("Multiple Return Types for If-Then-Else");
                        RESULT = new AST.IfBlockFunc(Type.getType("error"), cond, thenBody, elseBody);
                     }
                  :}
                | [...] // Additional rules for syntax errors
```
This is the list of all the possible semantic errors that LHC can recognize:
- Print does not support the argument type
- No support for local functions
- Global lists not supported
- Lists as a return value are not supported
- Missing value declaration
- Missing function declaration
- Multiple value declaration
- Value is not locally declared
- Function is not locally declared
- Multiple value assignment
- Multiple function assignment
- Mismatching type on assignment
- Not a function (but you expect one)
- Not a value (but you expect one)
- Mismatching arity in function (wrong number of arguments)
- Return type is not equivalent to the expression type (in Function Definition)
- Multiple argument declaration
- Wrong type in condition (Expected type: Bool)
- Wrong type in LHS/RHS of [expression_kind] expression (expression_kind represents “addition”, “subtraction” and so on, typically by using the operator symbol)
- Mismatching operand type on [expression_kind] Expression
- Wrong type in arguments applied to function
- Value declared but not assigned (when using an uninitialized variable)
- Multiple return types for if-then-else
- Wrong type in list expression (subexpressions in a list expression must have the same type)
IR Generation
After lexical, syntactical and semantic analysis, we can assume that the program is correct and recognized so we can proceed with the next step: generating the intermediate representation. For this purpose, we need additional fields and methods, especially in the AST.
We said that the BasicNode class defines fields and methods for code generation. They are:

- code: the LLVM code for the node;
- basicBlock: the basic block we are currently in; used for if-then-else code generation;
- codegen(): the method where code is generated; it returns the identifier of the result value if the node refers to an expression;
- mountCode(): merges the code of two AST nodes.
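These members can be sketched as follows. This is a minimal model (assuming code is a simple list of instruction strings; the real class has more state):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the code-building members of an AST node.
abstract class BasicNode {
    final List<String> code = new ArrayList<>(); // LLVM code for this node
    String basicBlock;                           // current basic block label

    // Generates this node's code; returns the result register, if any.
    abstract String codegen();

    // Merges a child's generated code into this node's code.
    void mountCode(BasicNode child) {
        code.addAll(child.code);
    }
}
```

The typical pattern, visible in all the codegen() methods below, is: call the child's codegen(), then mountCode(child) to splice its instructions into the parent's list.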
The LLVM class supports the codegen() method by providing static methods that build any LLVM instruction required, like createAlloca().

The LLVM class also provides two counters, counterSSA and counterIf, to handle static single assignment and label uniqueness across the different if-then-else blocks.
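A sketch of how such counters might look (the %n register naming and the starting index are assumptions, not taken from the LHC sources):

```java
// Hypothetical sketch of the two code-generation counters:
// counterSSA hands out fresh SSA register indices, while counterIf
// numbers if-then-else blocks so their labels never collide.
class Counters {
    private static int counterSSA = 0;
    private static int counterIf = 0;

    static int getCounterSSA()   { return counterSSA++; } // fresh SSA index
    static void resetCounterSSA() { counterSSA = 0; }     // per-function reset
    static int getCounterIf()    { return counterIf++; }  // fresh if-block index

    // Turn an SSA index into an LLVM virtual register name.
    static String createVariable(int n) { return "%" + n; }
}
```

Note that resetCounterSSA() is called at the start of every function definition (see DefFunct below), since LLVM register numbering is local to each function.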
In this section, we focus on the same constructs seen in the previous section with respect to IR generation.
The start and the end of parsing
The Program node has a very simple codegen() method for building the output program:
```java
public static class Program extends BasicNode {
    private FunctPart functPart;
    private ImperPart imperPart;

    Program(FunctPart functPart, ImperPart imperPart) {
        this.functPart = functPart;
        this.imperPart = imperPart;
    }

    @Override
    String codegen() {
        functPart.codegen();
        mountCode(functPart);
        imperPart.codegen();
        mountCode(imperPart);
        return null;
    }
}
```
The Print function
The print() function, like any other function call, requires its argument to be evaluated first; after that, we can emit the call.
```java
public static class Print extends IOAction {
    private Expr actarg;

    Print(Expr actarg) {
        this.actarg = actarg;
    }

    @Override
    String codegen() {
        String actargIndex = actarg.codegen();
        mountCode(actarg);
        String result = LLVM.createVariable(LLVM.getCounterSSA());
        code.add(LLVM.createPrintCall(result, actarg.getType(), actargIndex));
        return null;
    }
}
```
Do blocks
Do blocks are lists of statements, so it is sufficient to generate the code of all the statements composing the block.
```java
public static class DoBlock extends IOAction {
    private LinkedList<DoStmt> ioActions;

    DoBlock(LinkedList<DoStmt> ioActions) {
        this.ioActions = ioActions;
    }

    @Override
    String codegen() {
        for (DoStmt ioAction : ioActions) {
            ioAction.codegen();
            mountCode(ioAction);
        }
        return null;
    }
}
```
Declarations and definitions
Type declarations do not generate code: since in Haskell we cannot use a value/function until it is assigned, it will be allocated and assigned all at once.
Value definitions are rather simple: first, we evaluate the expression to assign; then we allocate the value and store the expression result in it. Values here are local: global values are treated as functions and use a DefFunct node.
```java
public static class DefValue extends Stmt {
    private String id; // must be a "unique" id
    private Expr expr;

    DefValue(String id, Expr expr) {
        this.id = id;
        this.expr = expr;
    }

    @Override
    String codegen() {
        String exprIndex = expr.codegen();
        mountCode(expr);
        code.add(LLVM.createAlloca(LLVM.createRegister(id), expr.getType()));
        code.add(LLVM.createStore(LLVM.createRegister(id), exprIndex, expr.getType()));
        return null;
    }
}
```
Function definitions define global functions. To generate a function we need to:
- create its definition (return type, name and arguments);
- open the function; the first basic block of a function is always called entry, so we call setBasicBlock to update the current basic block;
- store all the argument values and give them their names as in the original Haskell code;
- generate the code of the body;
- create the return statement;
- close the function.
All these steps can be recognized in the code:
```java
public static class DefFunct extends Stmt {
    private String id; // must be a "unique" id
    private LFormArg lformarg;
    private Expr expr;

    DefFunct(String id, LFormArg lformarg, Expr expr) {
        this.id = id;
        this.lformarg = lformarg;
        this.expr = expr;
    }

    @Override
    String codegen() {
        ArrayList<String> argNames = lformarg.getArgNames();
        ArrayList<Type> argTypes = lformarg.getArgTypes();
        LLVM.resetCounterSSA();
        code.add(LLVM.createFunctionDefinition(id, expr.getType(), lformarg));
        code.addAll(LLVM.openFunction());
        setBasicBlock("entry");
        for (int i = 0; i < argNames.size(); i++) {
            code.add(LLVM.createAlloca(LLVM.createRegister(argNames.get(i)), argTypes.get(i)));
            code.add(LLVM.createStore(LLVM.createRegister(argNames.get(i)), LLVM.createRegister(i), argTypes.get(i)));
            LLVM.getCounterSSA(); // update the register index
        }
        String result = expr.codegen();
        mountCode(expr);
        code.add(LLVM.createReturn(result, expr.getType()));
        code.add(LLVM.closeFunction());
        return null;
    }
}
```
Expressions and conditions
To evaluate an expression, it is necessary to evaluate its subexpressions first; their results are stored in an array. Then, we can build the code for the expression itself: first, we create a new register for the result; then, we create the operation instruction; finally, we use the result register as the return value of the codegen() method itself. In this way, the register is available to the rest of the program, which can use it as an operand in other instructions.
The same distinction between operations and special constructs is still true during code generation. While operations build code using LLVM instructions for arithmetic and more complex operations, special constructs are managed on their own.
Conditions are boolean expressions, so they are treated as expressions.
```java
public enum ExprKind {
    PLUS, MINUS, TIMES, DIV, INTDIV, REM, AND, OR,
    RELNOTEQ, RELEQ, RELGE, RELGT, RELLE, RELLT,
    INDEX, NOT, UMINUS,
    LET_BLOCK_FUNC, IF_BLOCK_FUNC, FUNCT_CALL, VALUE, EXPR_LIST
}

public static class Expr extends BasicExpr {
    private ExprKind exprKind;
    private BasicExpr[] subExpressions; // the subexpressions

    Expr(Type type, ExprKind exprKind, BasicExpr[] subExpressions) {
        this.type = type;
        this.exprKind = exprKind;
        this.subExpressions = subExpressions;
    }

    @Override
    String codegen() {
        String[] subExpResults = new String[subExpressions.length];
        String result = null;
        for (int i = 0; i < subExpressions.length; i++) {
            subExpResults[i] = subExpressions[i].codegen();
            mountCode(subExpressions[i]);
        }
        switch (exprKind) {
            case PLUS:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createAdd(result, subExpResults, subExpressions[0].getType()));
                break;
            case MINUS:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createSub(result, subExpResults, subExpressions[0].getType()));
                break;
            case TIMES:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createMult(result, subExpResults, subExpressions[0].getType()));
                break;
            case DIV:
                // promote Int operands to Double before dividing
                String[] oldExpResults = new String[subExpressions.length];
                for (int i = 0; i < subExpressions.length; i++) {
                    oldExpResults[i] = subExpResults[i];
                    if (subExpressions[i].getType().isEquivalent("Int")) {
                        subExpResults[i] = LLVM.createVariable(LLVM.getCounterSSA());
                        code.add(LLVM.createSItoFP(subExpResults[i], oldExpResults[i], type, subExpressions[i].getType()));
                    }
                }
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createDiv(result, subExpResults, type));
                break;
            case INTDIV:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createIntDiv(result, subExpResults, subExpressions[0].getType()));
                break;
            case REM:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRem(result, subExpResults, subExpressions[0].getType()));
                break;
            case AND:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createAnd(result, subExpResults, subExpressions[0].getType()));
                break;
            case OR:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createOr(result, subExpResults, subExpressions[0].getType()));
                break;
            case RELNOTEQ:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelNotEQ(result, subExpResults, subExpressions[0].getType()));
                break;
            case RELEQ:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelEQ(result, subExpResults, subExpressions[0].getType()));
                break;
            case RELGE:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelGE(result, subExpResults, subExpressions[0].getType()));
                break;
            case RELGT:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelGT(result, subExpResults, subExpressions[0].getType()));
                break;
            case RELLE:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelLE(result, subExpResults, subExpressions[0].getType()));
                break;
            case RELLT:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelLT(result, subExpResults, subExpressions[0].getType()));
                break;
            case INDEX:
                String elemPtr = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createGEP(elemPtr, subExpResults[0], subExpressions[0].getType(), subExpResults[1]));
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createLoad(result, type, elemPtr));
                break;
            case NOT:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createNot(result, subExpResults, subExpressions[0].getType()));
                break;
            case UMINUS:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createUMinus(result, subExpResults, subExpressions[0].getType()));
                break;
            case LET_BLOCK_FUNC:
            case IF_BLOCK_FUNC:
            case FUNCT_CALL:
            case VALUE:
            case EXPR_LIST:
                // special constructs generate their own code; just forward the result
                result = subExpResults[0];
                break;
        }
        return result;
    }
}
```
Function calls
To perform a function call, we first need to evaluate all its arguments. If the call refers to a value, we create a register into which we load its content and return that register. Otherwise, we create a register for the result, create a function call using the function name and all its arguments one after the other, and finally return the result register.
```java
public static class FunctCall extends BasicExpr {
    private String id; // must be a unique id
    private LinkedList<Expr> actargs;

    FunctCall(Type type, String id, LinkedList<Expr> actargs) {
        this.type = type;
        this.id = id;
        this.actargs = actargs;
    }

    @Override
    String codegen() {
        ArrayList<String> argIds = new ArrayList<>(); // ids of all the input arguments of the function
        ArrayList<Type> argTypes = new ArrayList<>();
        String result;
        for (Expr actarg : actargs) {
            argIds.add(actarg.codegen());
            argTypes.add(actarg.type);
            mountCode(actarg);
        }
        if (id.contains("$")) { // only local values have a '$' appended to their name
            result = LLVM.createRegister(LLVM.getCounterSSA());
            code.add(LLVM.createLoad(result, this.type, LLVM.createRegister(id)));
        } else {
            result = LLVM.createRegister(LLVM.getCounterSSA());
            code.add(LLVM.createFunctionCall(result, this.type, id, argIds, argTypes));
        }
        return result;
    }
}
```
About lexical scoping: if we have two values with the same name, how can we refer to the right one? We could distinguish them by observing in which symbol table each value is defined, but symbol tables are destroyed when parsing terminates. So, we append to the name of a value an identifier of its symbol table, which preserves this information.
In particular, SymTableStack defines additional fields and methods for this purpose:

- symTableStackId: an ArrayList that contains the numeric id for every level of the stack, used expressly for intermediate code generation. Every time a new symbol table is pushed onto the stack at a given level, the id corresponding to that level increments; the id of a given symbol table is the concatenation of all the ids, from the bottom level up to that of the symbol table we are considering, separated by "$". For example, suppose we have two symbol tables, one on top of the other, with ids "1" (bottom) and "1$1" (top). If we pop a symbol table, only the bottom one remains, with id "1"; if we then push another symbol table, its id will be "1$2". We can safely use "$" as a separator because it is not a valid character in Haskell identifiers, so there is no risk of homonymy.
- updateSymTableStackId(): updates symTableStackId when a symbol table is pushed;
- getSymTableIds(): gets a symbol table id, following the specification defined above;
- createUniqueId(): creates a unique id for a variable (for IR Generation).
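The id scheme can be sketched as follows. This is a hypothetical, simplified model of symTableStackId and createUniqueId() (not the actual LHC code); it reproduces the "1" / "1$1" / "1$2" behaviour described above:

```java
import java.util.ArrayList;

// Sketch of the symbol-table id scheme: one counter per stack level;
// pushing at a level increments that level's counter, and a table's id
// is the "$"-joined run of counters from the bottom up to its level.
class SymTableIds {
    private final ArrayList<Integer> ids = new ArrayList<>();
    private int depth = 0;

    void push() {
        if (depth == ids.size()) ids.add(0);     // first push at this level
        ids.set(depth, ids.get(depth) + 1);      // increment this level's counter
        depth++;
    }

    void pop() { depth--; }                      // counters are kept, not reset

    // Id of the symbol table currently on top of the stack, e.g. "1$2".
    String currentId() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            if (i > 0) sb.append('$');
            sb.append(ids.get(i));
        }
        return sb.toString();
    }

    // Unique variable name: original name plus its table's id.
    String uniqueId(String name) { return name + "$" + currentId(); }
}
```

Because popped counters are never reset, two tables that ever coexisted at different times still get distinct ids, so the derived variable names never collide.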
When we reduce a function call, such a unique id is stored in the AST node of the expression, thus storing the information and preserving name uniqueness. In this way, the unique identifier can be used during code generation.
Let bindings (functional part)
A let binding consists of a sequence of statements, together with a final expression. So, to generate its code we need to generate all the statements and then evaluate the expression.
```java
public static class LetBlockFunc extends BasicExpr {
    private LinkedList<Stmt> letStmts;
    private Expr expr;

    LetBlockFunc(Type type, LinkedList<Stmt> letStmts, Expr expr) {
        this.type = type;
        this.letStmts = letStmts;
        this.expr = expr;
    }

    @Override
    String codegen() {
        for (Stmt letStmt : letStmts) {
            letStmt.codegen();
            mountCode(letStmt);
        }
        String result = expr.codegen();
        mountCode(expr);
        return result;
    }
}
```
If-then-else block (functional part)
The if-then-else block is one of the most peculiar in terms of IR generation because it exploits the basic block concept, which is used in LLVM to define the control flow graph.
Before generating code, we define an index for this if-then-else block: every if-then-else has a different index so that it is impossible to have homonymy conflicts on the labels.
To generate an if-then-else block, we evaluate its condition first. Then, we create a conditional branch depending on the condition result. After this, the body of the then branch can be generated. Since we define a new label, we define a new basic block; for this reason, we call setBasicBlock() to update this information. After generating the body of the then branch, it is very important to read back the current basic block: nested if-then-else blocks may have changed it.

Then, we create an unconditional branch towards the end of the block, to skip the else expression. At this point, we can generate the else expression and branch to the exit of the block: this latter step is mandatory because in LLVM every basic block must end with either a branch or a return statement.

Finally, we create a register for the result and define a phi node: the phi node is a sort of merge operator that takes the results of the two branches and selects which one to use, depending on which branch has been executed. For phi nodes to work correctly, it is mandatory to state the exact basic block where the result of each branch is defined, which explains why it is necessary to call getBasicBlock() in both branches. In the end, the register containing the result is returned, so that other parts of the program can use it.
```java
public static class IfBlockFunc extends BasicExpr {
    private Expr cond, thenBody, elseBody;

    IfBlockFunc(Type type, Expr cond, Expr thenBody, Expr elseBody) {
        this.type = type;
        this.cond = cond;
        this.thenBody = thenBody;
        this.elseBody = elseBody;
    }

    @Override
    String codegen() {
        final int ifIndex = LLVM.getCounterIf();
        String condIndex, thenIndex, elseIndex;
        String thenLabel, elseLabel, exitLabel, thenBB, elseBB, result;

        condIndex = cond.codegen();
        mountCode(cond);
        code.add(LLVM.createBranchCond(condIndex, "if.then$" + ifIndex, "if.else$" + ifIndex));

        thenLabel = "if.then$" + ifIndex;
        code.add(LLVM.createLabel(thenLabel));
        setBasicBlock(thenLabel);
        thenIndex = thenBody.codegen();
        mountCode(thenBody);
        thenBB = getBasicBlock(); // nested if-then-else may have changed the basic block
        code.add(LLVM.createBranchNotCond("if.exit$" + ifIndex));

        elseLabel = "if.else$" + ifIndex;
        code.add(LLVM.createLabel(elseLabel));
        setBasicBlock(elseLabel);
        elseIndex = elseBody.codegen();
        mountCode(elseBody);
        elseBB = getBasicBlock();
        code.add(LLVM.createBranchNotCond("if.exit$" + ifIndex));

        exitLabel = "if.exit$" + ifIndex;
        code.add(LLVM.createLabel(exitLabel));
        setBasicBlock(exitLabel);
        result = LLVM.createVariable(LLVM.getCounterSSA());
        code.add(LLVM.createPHINode(result, this.type, thenBB, thenIndex, elseBB, elseIndex));
        return result;
    }
}
```
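To make the shape of the output concrete: for a function like `f c = if c then 1 else 2`, the emitted IR follows the pattern below. This fragment is illustrative (the register name and constants are invented, not actual compiler output); only the label scheme follows the text above:

```llvm
define i32 @f(i1 %c) {
entry:
  br i1 %c, label %if.then$0, label %if.else$0

if.then$0:                                        ; then branch
  br label %if.exit$0

if.else$0:                                        ; else branch
  br label %if.exit$0

if.exit$0:                                        ; merge point
  %r = phi i32 [ 1, %if.then$0 ], [ 2, %if.else$0 ]
  ret i32 %r
}
```

Note how every basic block ends with a terminator (br or ret), and how the phi node pairs each incoming value with the label of the block it comes from.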
To conclude, the parser needs additional fields and methods, too; in particular, it defines a code field to store the generated code, plus a setter for that field. It is up to the parser to start IR generation, since it knows when parsing has finished: we need the full AST before generating the code. But how and when does code generation start? At the last reduction: the parser calls the codegen() method of the root node of the AST when the production rule for the start symbol is reduced.
```
PROGRAM ::= /* empty program */
            {:
               if (noCompileErrors) { System.out.println("CODE COMPILED SUCCESSFULLY"); }
               RESULT = new AST.Program(new AST.FunctPart(), new AST.ImperPart(null));
               // Code generation
               RESULT.codegen();
               // Retrieve the code
               setCode(RESULT.getCode());
            :}
          | indent
            {:
               // push the top-level symtable
               symTableStack.pushSymTable();
            :}
            FUNCT_PART:funct_part IMPER_PART:imper_part
            {:
               symTableStack.popSymTable(); // pop the top-level symtable
            :}
            dedent
            {:
               if (noCompileErrors) {
                  System.out.println("CODE COMPILED SUCCESSFULLY");
                  RESULT = new AST.Program(funct_part, imper_part);
                  // Code generation
                  RESULT.codegen();
                  // Retrieve the code
                  setCode(RESULT.getCode());
               } else {
                  System.out.println("CODE NOT COMPILED: FAILED");
                  RESULT = new AST.Program(null, null);
               }
            :}
          ;
```
Installation and usage
LHC is compatible with:
- Ubuntu 20.04 LTS (but it should work with less recent versions or other Linux distributions, too).
Prerequisites
- JDK (Java Development Kit) - Version 16 or above (successfully tested with JDK 16, but it should work with less recent releases, too)
- Cup - This is not the original implementation of Cup, but an enhanced one that displays a graphical representation of the Parse Tree
- LLVM (use sudo apt install llvm to install the package) - Version 10 or above (less recent versions may work, too)
Download and Commands
Download link: LHC
After downloading and accessing the main folder, you can use the following commands to play with the compiler:
- make install: install everything
- make uninstall: uninstall everything
- make %.ll: build the corresponding %.hs file and produce the resulting LLVM code
- lli %.ll: run the output code
In the ./programs folder, you can already find some examples of buildable and working Haskell programs.
Example
Let us suppose that we have just downloaded and unzipped the compiler and that we want to compile the GCD program stored in ./programs/gcd.hs.
First, go to the main folder (LHC).
Then, run the following commands to install the compiler, build the code and run it:
- to install everything: make install
- to build the code: make ./programs/gcd.ll (notice that the target is an LLVM file)
- to run it: lli ./programs/gcd.ll (the output code is in the same folder as the input one)
- finally, to uninstall everything: make uninstall
Other links
Github link to the original repo: https://github.com/palmleon/LHC
If you find any errors, or if you want to participate in the editing of this wiki, please contact: admin [at] skenz.it
You can reuse, distribute or modify the content of this page, but you must cite in any document (or webpage) this url: https://www.skenz.it/compilers/haskell_to_llvm