Table of Contents
From Haskell to LLVM
Introduction
In this presentation, we illustrate LHC (the Little Haskell Compiler), which recognizes a strict subset of the Haskell Programming Language, compiles it and produces LLVM code as output.
In the last section, you can find how to install and use the tool.
The High-end Language
Haskell is an advanced, Purely Functional Programming Language. Born in 1990, it has become one of the most popular Functional Programming Languages. Typically, Functional Languages are more confined to the academic context as developing programs is more complex than using an Imperative Language (C, Java, Python). Nonetheless, Haskell is increasingly used in the industrial domain because of its useful properties, e.g. it is much easier to prove Correctness.
Being a Pure Functional Language, Haskell has many interesting properties:
- Referential Transparency
- Single Assignment
- Non-Strictness
- Type Inference
- Functions are first-class entities
- Lexical Scoping
- …
LHC tries to follow in the footsteps of this Language by implementing some of its peculiar features (Referential Transparency, Single Assignment, Lexical Scoping).
The Core
LHC is based on:
- a Scanner written in JFlex
- a Parser written in Cup
- additional classes for Semantic Analysis and IR Generation, written in Java
The Target Language
The Target Language of LHC is LLVM, a language mainly used as an intermediate code representation.
Born as a research project at the University of Illinois, it has developed into an umbrella project consisting of several subprojects, including the LLVM Language itself and a C++ API to build and optimize LLVM code at a higher level of abstraction.
Features
Currently, LHC supports the following subset of Haskell:
- Supported Types:
- Basic Types: Int, Double, Char, Bool
- Aggregate Types: Lists (not Recursive), Strings
- Other Types: Functions (not first-class)
- Supported operations:
- Addition (+)
- Subtraction (-)
- Multiplication (*)
- Division (/)
- Integer Division (div)
- Remainder (rem)
- Logical operators:
- Logical and (&&)
- Logical or (||)
- Logical not (not)
- Relational operators:
- Greater than (>)
- Greater or equal to (>=)
- Less than (<)
- Less or equal to (<=)
- Equal to (==)
- Not equal to (/=)
- Element extraction from List/String (!!)
- Function call
- Constructs:
- If-Then-Else (the only possible Control Flow Statement: loops do not exist in Haskell)
- Let Bindings (both in the Functional and Imperative Part)
- Do Block
- I/O:
- Print Function (accepts Int, Double, Char, String)
- Language Properties:
- Indentation-based Parsing
- Referential Transparency
- Single Assignment (“Everything is a constant”)
- Lexical Scoping
Partial/Missing Features
- Functions cannot return Lists/Strings
- Global Lists/Strings are not supported
The Compiler
Note: debug code is omitted entirely; we use "[...]" for meaningful compiler code that is covered in other sections, to focus on what is most important.
Scanner
In this section, we illustrate how Tokens are scanned in LHC using a Scanner written in JFlex.
General Scanning Algorithm
The Scanning Algorithm implemented in LHC uses a FIFO `tokenQueue` where all tokens are inserted once scanned. All Tokens are sent to the Parser as Symbols to provide additional information for Semantic Analysis.
In particular, every time a Token is requested by the Parser:

- if the `tokenQueue` is not empty, return the next element of the queue (in the pre-scanning method `next_token_custom()`)
- otherwise:
  - look for the next token(s) (launches the default scanning method `next_token()`, called by the Parser by default)
  - manage the token(s) (in the post-scanning method `manageToken()`)
  - return the first token to the Parser (in `manageToken()`, too)

Managing basic Tokens, i.e. all those not related to indentation, only consists of inserting them into the `tokenQueue`. For indentation-related token management, read this section.

About code: when a method is reported, it assumes a form like `method_name()`, regardless of the number of arguments.
Snippet of `next_token_custom()`:

```java
public Symbol next_token_custom() throws IOException {
    // if the token Queue is not empty, extract the next item
    if (this.tokenQueue.size() > 0) {
        return this.tokenQueue.remove();
    }
    [...] // Additional code for Indentation-based Parsing, we will cover this later
    return this.next_token();
}
```
Snippet of `manageToken()`:

```java
private Symbol manageToken(Symbol... symbols) {
    [...] // Additional code for Indentation-based Parsing
    for (Symbol sym : symbols) {
        tokenQueue.add(sym);
    }
    return this.tokenQueue.remove();
}
```
It is always up to the Parser to ask the Scanner for a new token; by default, the Parser calls the `next_token()` method of the Scanner and a new Token is eventually received. In this case, there is a pre-scanning method to be executed: the `scan with` code section of the Parser needs to be tuned accordingly:

```java
scan with {:
    if (true) { // Required because of a bug in this enhanced implementation of Cup
        Scanner scanner = (Scanner) this.getScanner();
        Symbol s = scanner.next_token_custom();
        [...] // Additional code for tracking current line and column and for drawing the Parse Tree
        return s;
    }
:}
```
Basic Token management
The Scanner can recognize several keywords and regular expressions that identify specific terminals of the Grammar.
Here is a snippet of the Scanner code for identifying those Tokens:
```
int = 0|[1-9][0-9]*              //unsigned int
rational = {int}\.[0-9]+         //unsigned rational
expform = {rational}[eE]-?{int}
double = {rational}|{expform}
bool = True|False
char = \'.\'
string = \"~\"
id = [a-z][a-zA-Z0-9_\']*
nl = \r|\n|\r\n
ws = [ \t]

%%

"="    {return manageToken(createSymbol(sym.eq));}
"::"   {return manageToken(createSymbol(sym.clns));}
","    {return manageToken(createSymbol(sym.cm));}
"("    {return manageToken(createSymbol(sym.ro));}
")"    {return manageToken(createSymbol(sym.rc));}
"["    {return manageToken(createSymbol(sym.bo));}
"]"    {return manageToken(createSymbol(sym.bc));}
"->"   {return manageToken(createSymbol(sym.arrow));}
"+"    {return manageToken(createSymbol(sym.plus));}
"-"    {return manageToken(createSymbol(sym.minus));}
"*"    {return manageToken(createSymbol(sym.times));}
"/"    {return manageToken(createSymbol(sym.div));}
div    {return manageToken(createSymbol(sym.intdiv));}
rem    {return manageToken(createSymbol(sym.rem));}
"&&"   {return manageToken(createSymbol(sym.and));}
"||"   {return manageToken(createSymbol(sym.or));}
not    {return manageToken(createSymbol(sym.not));}
"/="   {return manageToken(createSymbol(sym.relnoteq));}
"=="   {return manageToken(createSymbol(sym.releq));}
">="   {return manageToken(createSymbol(sym.relge));}
">"    {return manageToken(createSymbol(sym.relgt));}
"<="   {return manageToken(createSymbol(sym.relle));}
"<"    {return manageToken(createSymbol(sym.rellt));}
";"    {return manageToken(createSymbol(sym.sep));}
in     {[...] return manageToken(createSymbol(sym.in));}
main   {return manageToken(createSymbol(sym.main));}
do     {[...] return manageToken(createSymbol(sym.do_begin));}
"!!"   {return manageToken(createSymbol(sym.index));}
if     {return manageToken(createSymbol(sym.if_begin));}
then   {return manageToken(createSymbol(sym.then));}
else   {return manageToken(createSymbol(sym.else_begin));}
let    {[...] return manageToken(createSymbol(sym.let));}
print  {return manageToken(createSymbol(sym.print));}
Int    {return manageToken(createSymbol(sym.type_int));}
Double {return manageToken(createSymbol(sym.type_double));}
Bool   {return manageToken(createSymbol(sym.type_bool));}
Char   {return manageToken(createSymbol(sym.type_char));}
String {return manageToken(createSymbol(sym.type_string));}

{int}    {return manageToken(createSymbol(sym.val_int, Integer.valueOf(yytext())));}
{double} {return manageToken(createSymbol(sym.val_double, Double.valueOf(yytext())));}
{bool}   {return manageToken(createSymbol(sym.val_bool, Boolean.valueOf(yytext())));}
{char}   {return manageToken(createSymbol(sym.val_char, yytext().charAt(1)));}
{string} {return manageToken(createSymbol(sym.val_string, yytext().substring(1, yytext().length()-1)));}

//LLVM does not support the single quote as a valid character for identifiers, while dots are:
//for this reason, every single tick is replaced with a dot
{id}   {return manageToken(createSymbol(sym.id, yytext().replace('\'', '.')));}

{ws}   {;}
{nl}   {[...]}

/* Single-line Comment */
"--" ~ {nl} {[...]}
```
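The numeric macros above can be checked with ordinary Java regular expressions. The following sketch (illustrative only, not JFlex code; the class name is hypothetical) shows what the `int` and `double` macros accept:

```java
import java.util.regex.Pattern;

public class LexPatterns {
    // Java equivalents of the JFlex macros above, with {int} inlined
    static final Pattern INT      = Pattern.compile("0|[1-9][0-9]*");
    static final Pattern RATIONAL = Pattern.compile("(0|[1-9][0-9]*)\\.[0-9]+");
    // double = rational with an optional [eE]-?{int} exponent
    static final Pattern DOUBLE   =
        Pattern.compile("(0|[1-9][0-9]*)\\.[0-9]+([eE]-?(0|[1-9][0-9]*))?");

    public static void main(String[] args) {
        System.out.println(INT.matcher("42").matches());         // true
        System.out.println(INT.matcher("042").matches());        // false: no leading zeros
        System.out.println(DOUBLE.matcher("3.14e-2").matches()); // true
    }
}
```

Note how the alternation `0|[1-9][0-9]*` rules out leading zeros: `042` is not a valid integer literal for this scanner.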
Indentation-based Parsing
For nested blocks, Haskell supports both a coding style that uses brackets and an indentation-based one, which is what LHC supports. However, Haskell uses off-side rules only for certain blocks, like Let Bindings and Do Blocks, while others, such as the If-Then-Else block, are free-form.
How can we recognize Indentation?
A possible approach, used in Python, consists of defining three dedicated tokens: Indent, Dedent and Separator; Indent starts a block, Dedent terminates it, and Separator divides statements inside the same block.
We also define additional variables and data structures:
- an `indentStack`, where each entry contains the column index of a block
- an `indentEnable` flag, raised when an indentation-sensitive block is recognized, i.e. a "let" or "do" token is scanned
- a `scanIndent` flag, raised when `indentEnable` is active, so that the next Token will define the column index for the block
- a `foundNewline` flag, raised when a newline is recognized
- a `dedentForce` flag, to force a Dedent in the very peculiar case in which we scan an "in" Token.
The concept is the following: when an indentation-sensitive block is recognized, i.e. its keyword is scanned ("let", "do"), an Indent is scanned and `indentEnable` is raised. Before scanning the following Token, `scanIndent` is raised, too: that Token will define the column index for the block, which is stored in the `indentStack`.
Beware: nested blocks can only be more to the right than the current one.
In case we are in an indentation-sensitive block:
- if a newline is recognized and the following line is more to the left than the current block, a Dedent is scanned: the block is closed. Actually, this is a loop: we may scan multiple Dedents at once; the loop halts when a block is aligned more to the left than the line. In the particular case where a Dedent is scanned because `dedentForce` was raised, only one Dedent is accepted, i.e. the Let Binding corresponding to the "in" is closed, nothing else
- if the following and the current line are aligned the same, then a Separator is scanned
- if the following line is more to the right than the block, no additional symbol is scanned: we are still in that block
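The rules above can be condensed into a small, self-contained sketch. This is a hypothetical demo class, not the LHC scanner: it leaves out error handling and the `dedentForce` special case, and describes each statement simply by its starting column and by whether it opens a new indentation-sensitive block (a "let"/"do"):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class IndentDemo {
    public static List<String> tokenize(int[] columns, boolean[] opensBlock) {
        Deque<Integer> indentStack = new ArrayDeque<>();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < columns.length; i++) {
            int col = columns[i];
            if (opensBlock[i]) {
                // a new block starts here: remember its column, emit an Indent
                indentStack.push(col);
                out.add("INDENT");
            } else {
                // close every block lying to the right of this line
                while (!indentStack.isEmpty() && indentStack.peek() > col) {
                    indentStack.pop();
                    out.add("DEDENT");
                }
                // same column as the enclosing block: a new statement inside it
                if (!indentStack.isEmpty() && indentStack.peek() == col) {
                    out.add("SEP");
                }
                // further to the right: still inside the block, emit nothing
            }
        }
        while (!indentStack.isEmpty()) { // end of input closes all open blocks
            indentStack.pop();
            out.add("DEDENT");
        }
        return out;
    }

    public static void main(String[] args) {
        // a block opening at column 3, a nested one at column 5,
        // then a statement back at column 3
        System.out.println(tokenize(new int[]{3, 5, 3},
                                    new boolean[]{true, true, false}));
    }
}
```

Running the example produces `[INDENT, INDENT, DEDENT, SEP, DEDENT]`: the nested block is closed by the dedent, and the line at column 3 becomes a new statement of the enclosing block.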
Having defined the principle, we can illustrate the code.
Here are the additional instructions for indentation control, for specific Tokens:
```
in   {dedentForce = true; return manageToken(createSymbol(sym.in));}
[...]
do   {indentEnable = true; return manageToken(createSymbol(sym.do_begin));}
[...]
let  {indentEnable = true; return manageToken(createSymbol(sym.let));}
[...]
{nl} {foundNewline = true;}

/* Single-line Comment */
"--" ~ {nl} {foundNewline = true;}
```
Code snippet to enable `scanIndent` (in `next_token_custom()`):
```java
if (indentEnable) {
    indentEnable = false;
    scanIndent = true;
}
```
Code snippet to scan a new Indent, define the column level of the nested block and store it (in `manageToken()`):
```java
if (!endOfCode && scanIndent) {
    indentColumn = yycolumn + 1;
    if (indentStack.size() == 0 || indentColumn > indentStack.peek()) {
        indentStack.push(indentColumn);
        tokenQueue.add(createSymbol(sym.indent));
    } else {
        report_error("Nested block is not indented further in than the enclosing expression");
        tokenQueue.add(createSymbol(sym.error));
    }
    scanIndent = false;
    foundNewline = false;
}
```
Code snippet to scan one or more Dedents (in `manageToken()`):
```java
if (!endOfCode && !indentStack.empty() && (foundNewline || dedentForce)) {
    dedentColumn = yycolumn + 1;
    boolean thereIsDedentToInsert = true;
    while (thereIsDedentToInsert && !indentStack.empty()) {
        indentColumn = indentStack.peek(); // take the column of the innermost block
        if (indentColumn > dedentColumn || dedentForce) { // check if we are out of the innermost block
            dedentForce = false;
            indentStack.pop();
            [...]
            tokenQueue.add(createSymbol(sym.dedent));
        } else {
            thereIsDedentToInsert = false;
            if (indentColumn.equals(dedentColumn)) {
                // the following statement is not outside of the block: add only a Separator
                tokenQueue.add(createSymbol(sym.sep));
            }
        }
    }
}
```
Error Handling
If no Token can be recognized, a Lexical Error is notified by calling the `report_error()` method:
```java
public void report_error(String msg) {
    System.err.print("ERROR: Lexical error");
    System.err.print(" (line " + (yyline+1) + ", column " + (yycolumn+1) + "): " + msg);
    System.err.println(" (Token \"" + yytext() + "\")");
}
```
If no Token can be recognized, then the text matches with the following catch-all regular expression, defined at the end of the code:
```
.    {report_error("Not recognized token"); return this.manageToken(createSymbol(sym.error));}
```
Parser
The Parser, written in Cup, recognizes the syntax of the program and handles possible occurring errors.
Here is the full Grammar of the Language recognized by LHC (Semantic Rules are explored in this Section).
```
PROGRAM ::= /* empty program */
          | indent FUNCT_PART IMPER_PART dedent

/// ////////////////// ///
///  IMPERATIVE PART   ///
/// ////////////////// ///

IMPER_PART ::= main eq IO_ACTION

IO_ACTION ::= PRINT
            | DO_BLOCK
            | IF_BLOCK_IMPER

IO_ACTIONS ::= IO_ACTIONS sep IO_ACTION
             | IO_ACTIONS sep LET_BLOCK_IMPER
             | IO_ACTION
             | LET_BLOCK_IMPER

PRINT ::= print ACTARG

DO_BLOCK ::= do_begin indent IO_ACTIONS dedent

IF_BLOCK_IMPER ::= if_begin COND then IO_ACTION else_begin IO_ACTION

LET_BLOCK_IMPER ::= let indent LET_STMTS dedent

LET_STMTS ::= LET_STMTS sep DECL_TYPE
            | LET_STMTS sep DEF_VALUE
            | DECL_TYPE
            | DEF_VALUE

/// ////////////////// ///
///  FUNCTIONAL PART   ///
/// ////////////////// ///

FUNCT_PART ::= GLOBAL_STMTS

GLOBAL_STMTS ::= /* empty List of Statements section */
               | GLOBAL_STMTS STMT sep

/// STATEMENTS ///

STMT ::= DECL_TYPE
       | DEF_VALUE
       | DEF_FUNCT

DECL_TYPE ::= id cm DECL_TYPE
            | id clns TYPE

DEF_VALUE ::= id eq EXPR

/* no nullary functions */
DEF_FUNCT ::= id LFORMARG eq EXPR

LFORMARG ::= LFORMARG id
           | id

/* special expression management for boolean conditions */
COND ::= EXPR

/// EXPRESSIONS ///

EXPR ::= EXPR plus EXPR
       | EXPR minus EXPR
       | EXPR times EXPR
       | EXPR div EXPR
       | EXPR intdiv EXPR
       | EXPR rem EXPR
       | EXPR and EXPR
       | EXPR or EXPR
       | EXPR relnoteq EXPR
       | EXPR releq EXPR
       | EXPR relgt EXPR
       | EXPR relge EXPR
       | EXPR rellt EXPR
       | EXPR relle EXPR
       | EXPR index EXPR
       | ro EXPR rc
       | not EXPR
       | minus EXPR %prec uminus
       | LET_BLOCK_FUNC
       | IF_BLOCK_FUNC
       | FUNCT_CALL
       | VALUE
       | EXPR_LIST

FUNCT_CALL ::= id LACTARG

LET_BLOCK_FUNC ::= let indent LET_STMTS dedent in EXPR

IF_BLOCK_FUNC ::= if_begin COND then EXPR else_begin EXPR

/// args for function call ///
ACTARG ::= id
         | VALUE
         | ro EXPR rc

// list of input arguments for function call
LACTARG ::= /* empty list of args: we call a Value */
          | LACTARG ACTARG

/// VALUES ///

// Represents constants
VALUE ::= VALUE_BASIC

// VALUE_BASIC represents constants
VALUE_BASIC ::= val_int
              | val_double
              | val_bool
              | val_char
              | val_string

EXPR_LIST ::= bo LEXPR bc

LEXPR ::= EXPR
        | LEXPR cm EXPR

/// TYPES ///

TYPE ::= TYPE_VALUE
       | TYPE_FUNC

// both basic types and compound
TYPE_VALUE ::= TYPE_LIST
             | TYPE_BASIC

TYPE_BASIC ::= type_int
             | type_double
             | type_bool
             | type_char

TYPE_LIST ::= bo TYPE_BASIC bc
            | type_string

// no higher order functions
TYPE_FUNC ::= TYPE_VALUE arrow TYPE_FUNC
            | TYPE_VALUE arrow TYPE_VALUE
```
To avoid Shift-Reduce conflicts, operators are given precedences, shown below:
```
/// ///////////////////// ///
///   PRECEDENCE RULES    ///
/// ///////////////////// ///

precedence left or;
precedence left and;
precedence nonassoc releq, relnoteq, relgt, relge, rellt, relle;
precedence left plus, minus;
precedence left times, div, intdiv, rem;
precedence nonassoc index;
precedence nonassoc not, uminus;
```
We can notice that Haskell has some peculiarities, especially in comparison with typical imperative languages, like Java.
Any Haskell program can be split into two parts: a functional part and an imperative one; the latter corresponds to the main and is the only region of the program where it is possible to access I/O devices via IO Actions; in general, it is where we can modify the state of the machine. Instead, all functions in the first part are stateless: this property is fundamental for making Haskell a Purely Functional programming language and is also known as referential transparency.
To declare local variables, the language defines Let Bindings that can be placed inside the function body; here, they are represented by `LET_BLOCK_FUNC` and `LET_BLOCK_IMPER`.
Perhaps surprisingly, the If-Then-Else construct and the Let Bindings defined in the functional part are Expressions. In particular, If-Then-Else is an expression whose return value corresponds to that of the Then/Else branch; since an expression always has one and only one type, the two branches must return a value of the same type. Let Bindings, instead, have the same type and return value as the expression following the `in` token (the expression is part of the construct).
In Haskell, the If-Then-Else construct requires defining both branches; this is due to a particular rule of the language:
Expressions must always return something
The If-Then-Else construct is an expression, so it must always return something regardless of the result of the condition: every If must define an expression for both the branches.
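Java's conditional operator behaves analogously: it is an expression, the else branch is mandatory, and both branches must yield a value of the same type. A small illustrative sketch (hypothetical class, not part of LHC):

```java
public class IfExpr {
    // Like Haskell's if-then-else, ?: is an expression: the else branch is
    // mandatory and both branches must produce a value of the same type.
    static int absolute(int x) {
        return (x >= 0) ? x : -x;
    }

    public static void main(String[] args) {
        System.out.println(absolute(-7)); // prints 7
    }
}
```

An `if` statement without `else` cannot appear here, for the same reason Haskell rejects it: the expression must always produce a value.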
Calling a value is like calling a nullary function, which is, in fact, how Haskell interprets values.
An n-ary function, i.e. a function taking n arguments, is equivalent to a sequence of unary functions taking one argument at a time. This concept is known as currying and is extensively used in Haskell for defining higher-order functions.
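Currying can be sketched in Java with nested `Function` objects: an n-ary function becomes a chain of unary ones, and partially applying it simply returns the rest of the chain (an illustrative sketch, not LHC code):

```java
import java.util.function.Function;

public class CurryDemo {
    // A binary function Int -> Int -> Int expressed as a chain of unary
    // functions, mirroring how Haskell curries every n-ary function.
    static Function<Integer, Function<Integer, Integer>> add =
        x -> y -> x + y;

    public static void main(String[] args) {
        Function<Integer, Integer> addFive = add.apply(5); // partial application
        System.out.println(addFive.apply(3)); // prints 8
    }
}
```

Applying the first argument yields a new unary function; only after the last argument is supplied is the result computed.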
Finally, strings are lists of chars: there exists no Array type.
Error Handling
The Parser is also able to recognize syntax errors and, in some cases, recover from them. For this purpose, Cup allows using an `error` token for error recovery.
Every time an error is detected, the Parser launches the `syntax_error()` method by default, which forwards the error to `report_error()`, whose body is reported here:
```java
public void report_error(Object info) {
    noCompileErrors = false;
    System.err.print("ERROR: Syntax error");
    if (info instanceof Symbol)
        if (((Symbol)info).left != -1) {
            int line = (((Symbol)info).left);
            int column = (((Symbol)info).right);
            System.err.print(" (line " + line + ", column " + column + "): ");
        } else
            System.err.print(": ");
    else
        System.err.print(": ");
}
```
The `noCompileErrors` flag is true at the beginning and set to `false` when the first error of any kind occurs.
We can observe that `report_error()` builds an incomplete error message: no information about the kind of error has been provided, since the error has not been recovered yet!

To complete the message, we need to wait until the Parser recovers from an error, i.e. reduces a rule containing the `error` token. At that point, `report_syntax_error()` is called and the error is diagnosed.
These are the additional rules, included in the compiler, to handle some syntax errors:
```
IMPER_PART ::= main eq error

IO_ACTIONS ::= IO_ACTIONS:io_actions sep error

DO_BLOCK ::= do_begin indent error dedent

IF_BLOCK_IMPER ::= if_begin error then IO_ACTION else_begin IO_ACTION
                 | if_begin COND then error else_begin IO_ACTION
                 | if_begin COND error else_begin IO_ACTION

GLOBAL_STMTS ::= GLOBAL_STMTS error sep

LET_BLOCK_FUNC ::= let error in EXPR

IF_BLOCK_FUNC ::= if_begin error else_begin EXPR
                | if_begin COND then error else_begin EXPR
                | if_begin COND error else_begin EXPR
```
This is the list of possible syntax errors that LHC can recognize:
- Error in Main
- Error in Imperative Part Statement
- Error in Do Block
- Error in Condition of If-Then-Else block
- Error in Then Block of If-Then-Else block
- Missing 'then' in If-Then-Else block
- Error in Functional Statement
- Error in Statement inside a Let Block
Semantic Analysis
Together with lexicon and syntax, semantics is one of the main components of a program. Thanks to semantics, it is possible to define context-sensitive information, like variable scoping and typing, empowering the language.
Note:
In Cup, it is possible to define intermediate actions, like:
A ::= x { something } y
which is equivalent to
A ::= M y
M ::= x { something }
In particular, Cup can define implicit rules for the markers and translate the grammar from the first representation to the second one; however, this may raise conflicts if the parser cannot be sure that it will reduce exactly that rule.
The typing system
Like any other programming language, Haskell has a typing system. It is very complex and uses very high-level concepts such as typeclasses, polymorphic data structures, type variables, the kind system, and so on.
In LHC, only pretty simple types are supported: Int, Double, Bool, Char, non-recursive Lists, Strings and Functions.
Each of these types has a standard representation as a type tree. Basic types are represented as a simple tree node, whose label corresponds to the name of the type itself. For aggregate types, it is different: Lists are polymorphic, so they are represented as a root node and a child node, holding the type of the elements of the list; strings are simple lists of chars. For functions, it is even more interesting: thanks to currying, we can consider functions as a sequence of unary functions.
So, we can represent them as a binary tree where:
- the root node is labelled as “Function”
- the left child holds the type of the first argument
- the right child holds the type of the current function without the first argument: the tree depth is equal to the number of arguments
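This curried representation can be sketched as a tiny binary tree. The class and method names below are hypothetical (a simplified stand-in for the actual `Type` class): arity and return type fall out of a walk along the right spine:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TypeTree {
    final String name;
    final List<TypeTree> children = new ArrayList<>();

    TypeTree(String name, TypeTree... kids) {
        this.name = name;
        children.addAll(Arrays.asList(kids));
    }

    // Each "Function" node on the right spine contributes one argument.
    int arity() {
        return name.equals("Function") ? 1 + children.get(1).arity() : 0;
    }

    // The return type is the rightmost leaf of the spine.
    String returnType() {
        return name.equals("Function") ? children.get(1).returnType() : name;
    }

    public static void main(String[] args) {
        // the tree for Int -> Int -> Double
        TypeTree t = new TypeTree("Function", new TypeTree("Int"),
                         new TypeTree("Function", new TypeTree("Int"),
                             new TypeTree("Double")));
        System.out.println(t.arity() + " " + t.returnType()); // 2 Double
    }
}
```

The depth of the right spine equals the number of arguments, exactly as described above.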
In this compiler, type trees are represented by the `Type` class, composed of the following methods and fields:

- constructors, setters, getters
- `typeName`: the label of the root node
- `typeParams`: the child nodes of the tree
- `typeMap`: contains all the "standard" representations of Types, the same described some lines above; it is used as a sort of lookup table for type trees
- `widen()`: method for type widening, i.e. it widens Int to Double if required
- `isSubtype()`: a method for checking subtypes; it takes the name of a certain type as argument and checks if the current type is a subtype of the argument
- `isEquivalent()`: method for type equivalence checking; it works like `isSubtype()`, but checks equivalence
- `returnType()`: given the type tree of a function, returns its return type
- `hasArity()`: a method for arity checking, i.e. it verifies whether a function takes a given number of arguments, provided as argument of the method
- `createTypeMap()`: creates the type map during parser initialization
- `addType()`: adds a type to the type map
- `typeExists()`: checks if a type is defined in the type map, i.e. if it is recognized by the language
- `getType()`: gets the type tree of a given type from the type map
- `dumpTypeMap()`, `dumpType()`: used for debugging purposes
In the type map, two additional types are used:

- `Any`, for polymorphic types like functions and lists
- `error`, for error recovery
The symbol table stack
During semantic analysis, one of the most important data structures is the symbol table: it contains all the information about declared values and functions, providing all those context-sensitive functionalities that empower the language.
Every symbol table is a hashmap containing many symbol table entries. Each symbol table entry has a name, a type and a flag to check whether it has been already assigned.
In this compiler, we do not use a unique symbol table; instead, we use a symbol table stack: in LHC, it is possible to define nested blocks with different contexts, e.g. nested let bindings.
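A simplified sketch of such a stack of symbol tables (hypothetical class, not LHC's actual `SymTableStack`): lookup walks from the innermost table down to the global one, and an `isAssigned` flag enforces single assignment:

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

public class SymStackDemo {
    static class Entry {
        final String type;
        boolean isAssigned = false;
        Entry(String type) { this.type = type; }
    }

    final LinkedList<Map<String, Entry>> stack = new LinkedList<>();

    void push() { stack.addFirst(new HashMap<>()); }
    void pop()  { stack.removeFirst(); }
    void declare(String name, String type) { stack.peekFirst().put(name, new Entry(type)); }

    // Search every table, innermost first: this is why a plain Stack is not enough.
    Entry lookup(String name) {
        for (Map<String, Entry> table : stack)
            if (table.containsKey(name)) return table.get(name);
        return null;
    }

    // Single assignment: a second assignment to the same name is rejected.
    boolean assign(String name) {
        Entry e = lookup(name);
        if (e == null || e.isAssigned) return false;
        e.isAssigned = true;
        return true;
    }

    public static void main(String[] args) {
        SymStackDemo s = new SymStackDemo();
        s.push();                 // global scope
        s.declare("x", "Int");
        s.push();                 // nested let binding
        s.declare("y", "Double");
        System.out.println(s.lookup("x").type);  // found in the outer table
        System.out.println(s.assign("x"));       // true: first assignment
        System.out.println(s.assign("x"));       // false: already assigned
    }
}
```

Popping the inner table when the nested block closes makes `y` invisible again, while `x` remains reachable from any deeper scope.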
The symbol table stack is implemented via the `SymTableStack` class, which defines an inner `SymTableEntry` class for the symbol table entries.
The `SymTableEntry` class defines objects with the following information:

- `type`: the type of the value/function, as a type tree
- `isAssigned`: boolean flag for checking assignment; initially, it is false
Thanks to this flag, values and functions can be assigned only once. Single assignment is a peculiar characteristic of functional languages, where values and functions are intended in a mathematical way. If you define cos(x) as a certain function, why should you define it again?
Where is the entry name? Since a symbol table is a hashmap, we use the name of the entry as a key and a `SymTableEntry` instance as its value.
This is the information contained in a `SymTableStack` instance:

- `symTableStack`: a LinkedList containing the stack of symbol tables itself; it is deliberately not defined as a Stack: there are cases where it is necessary to look in the symbol tables below (as explained for `getEntry()`)
- `pushSymTable()`: pushes a new symbol table onto the stack
- `popSymTable()`: pops a symbol table from the stack
- `peekSymTable()`: gets the symbol table on top of the stack, without popping it
- `putEntry()`: inserts an entry in the symbol table on top of the stack
- `containsEntry()`: checks if a value/function is present in the whole stack; search starts at the uppermost symbol table and ends at the bottom
- `getEntry()`: looks for an entry like `containsEntry()` and returns it, if present
- `getLevel()`: looks for an entry like `containsEntry()` and returns its level, where level is a positive integer and 0 corresponds to the bottom
- `isGlobal()`: checks whether a value/function is global, i.e. declared in the bottom symbol table
- `isLocallyDeclared()`: checks whether a value/function has been declared in the top-level symbol table
- `isAssigned()`: checks if a variable has already been assigned; used for the assignment uniqueness check
- `isDeclared()`: equivalent of `containsEntry()`
- additional fields and methods for debug and IR Generation (explored here)
In the Parser, the symbol table stack is allocated here:
```
action code {:
    /* The Symbol Table Stack to be used */
    private SymTableStack symTableStack = new SymTableStack();
:};
```
The AST
The abstract syntax tree is a tree representation of the abstract syntactic structure of the source code. Each node of the tree denotes a construct occurring in the text. If we define the programming language in Backus-Naur form, as Cup allows, then each symbol of the grammar represents a node of the tree; in particular, the node is a leaf if the symbol is terminal.
AST is very useful for top-down IR generation, as reported here.
In this compiler, the AST nodes are defined in the wrapper `AST` class, which contains all of them — in practice, one per non-terminal symbol:

- `BasicNode`: represents any node of the AST; every other node is a subtype of `BasicNode`; this class defines several methods and fields for code generation, as shown later
- `Program`: the root of the whole syntax tree, represents the whole program
- `FunctPart`: represents the functional part of the program
- `ImperPart`: represents the imperative part of the program
- `DoStmt`: abstract class, parent of all accepted statements inside a do block
- `IOAction`: abstract class, parent of all nodes that are also IO actions
- `DoBlock`: represents a do block
- `IfBlockImper`: represents an if-then-else in the imperative part
- `LetBlockImper`: represents a let binding in the imperative part
- `Stmt`: abstract class, parent of all statements, i.e. declarations and definitions
- `DeclType`: represents a type declaration
- `DefValue`: represents a value definition
- `DefFunct`: represents a function definition
- `BasicExpr`: abstract class, parent of all sorts of expressions; it also holds a type
- `Expr`: class for almost all expressions; basically, all operations are represented by this node
- `LetBlockFunc`: expression node for a let binding in the functional part
- `IfBlockFunc`: expression node for an if-then-else in the functional part
- `FunctCall`: expression node for function calls
- `Value`: expression node for constants
- `ExprList`: expression node for a list of expressions
- `LFormArg`: node containing the list of all formal arguments of a function
Now that we have defined the AST nodes, we may ask how they are interconnected. It is rather simple: they are connected according to the production rules of the corresponding non-terminal; e.g. a `Program` node always has two children, a `FunctPart` and an `ImperPart` node, as we can see in the production rule:

```
PROGRAM ::= indent FUNCT_PART IMPER_PART dedent
```
Here is the relation between non-terminals and AST nodes:
```
non terminal AST.Program            PROGRAM;
non terminal AST.FunctPart          FUNCT_PART;
non terminal AST.ImperPart          IMPER_PART;
non terminal LinkedList<AST.DoStmt> IO_ACTIONS;
non terminal AST.IOAction           IO_ACTION;
non terminal AST.Print              PRINT;
non terminal AST.Expr               COND, EXPR;
non terminal AST.ExprList           EXPR_LIST, LEXPR;
non terminal AST.FunctCall          FUNCT_CALL;
non terminal AST.Expr               ACTARG;
non terminal LinkedList<AST.Expr>   LACTARG;
non terminal AST.LFormArg           LFORMARG;
non terminal LinkedList<AST.Stmt>   LET_STMTS, GLOBAL_STMTS;
non terminal AST.DoBlock            DO_BLOCK;
non terminal AST.LetBlockFunc       LET_BLOCK_FUNC;
non terminal AST.LetBlockImper      LET_BLOCK_IMPER;
non terminal AST.IfBlockFunc        IF_BLOCK_FUNC;
non terminal AST.IfBlockImper       IF_BLOCK_IMPER;
non terminal AST.Stmt               STMT;
non terminal AST.Stmt               DEF_VALUE;
non terminal AST.DefFunct           DEF_FUNCT;
non terminal AST.DeclType           DECL_TYPE;
non terminal Type                   TYPE, TYPE_VALUE, TYPE_FUNC, TYPE_LIST, TYPE_BASIC;
non terminal AST.Value              VALUE, VALUE_BASIC;
```
Semantic rules
After exploring the additional classes for semantic analysis, we can explore the semantic rules, at least the most interesting ones. In particular, we focus extensively on how the symbol table stack is used and much less on building the AST, which is rather straightforward. Error detection and recovery are covered in the following section about error handling.
The start and the end of parsing
The `PROGRAM` non-terminal is the start symbol of the grammar. When parsing begins, we need a symbol table to store all the global variables and functions; when parsing terminates, we need to pop that symbol table, as we do not need it anymore. There is a particular case: empty programs; in this case, no symbol table is pushed or popped: the compiler recognizes the program as valid and terminates. Finally, we need to inform the user whether parsing was successful or not.
This is a code snippet of the semantic rules related to the `PROGRAM` symbol. There are still some lines about code generation, which are explored here.
```
PROGRAM ::= /* empty program */
            {:
                if (noCompileErrors) {
                    System.out.println("CODE COMPILED SUCCESSFULLY");
                }
                RESULT = new AST.Program(new AST.FunctPart(), new AST.ImperPart(null));
                [...] // Code generation
            :}
          | indent
            {:
                // push the top-level symtable
                symTableStack.pushSymTable();
            :}
            FUNCT_PART:funct_part IMPER_PART:imper_part
            {:
                // pop the top-level symtable
                symTableStack.popSymTable();
            :}
            dedent
            {:
                if (noCompileErrors) {
                    System.out.println("CODE COMPILED SUCCESSFULLY");
                    RESULT = new AST.Program(funct_part, imper_part);
                    [...] // Code generation
                } else {
                    System.out.println("CODE NOT COMPILED: FAILED");
                    RESULT = new AST.Program(null, null);
                }
            :}
          ;
```
The Print function
The `print()` function is the only action allowed to interact with I/O; it accepts numbers, chars and strings and prints them with a trailing newline. Since the only semantic constraint on this function is the type of the argument, this is also the only check to perform in the semantic rule. Its code is reported below.
```
PRINT ::= print ACTARG:actarg
          {:
              Type argType = actarg.getType();
              if (argType.isSubtype("Double") || argType.isSubtype("String") || argType.isSubtype("Char")) {
                  RESULT = new AST.Print(actarg);
              } else {
                  [...] // Error handling
              }
          :}
        ;
```
Do blocks
A do block consists of a sequence of IO Actions; this construct also defines its own context, i.e. its own symbol table. As for all other constructs that create symbol tables, the semantic rules both push and pop the symbol table, so that contexts are well defined. Here is the code:
```
DO_BLOCK ::= do_begin indent
             {: symTableStack.pushSymTable(); :}
             IO_ACTIONS:io_actions dedent
             {:
                 symTableStack.popSymTable();
                 RESULT = new AST.DoBlock(io_actions);
             :}
           | [...] // additional rule for detecting syntax errors
           ;
```
Declarations and definitions
Declarations and definitions are among the most important constructs: they allow declaring and instantiating variables. There is a difference between the two concepts:
- Declarations only declare that a value/function exists, but do not allocate it
- Definitions allocate them
There are three constructs for this purpose:
- Type Declaration
- Value definition
- Function definition
Type Declarations are responsible for inserting new entries in the current symbol table; given an id and a type, the rule checks that the id has not already been declared in the current symbol table and that, if it denotes a function, it does not return a list, as that is not possible in this language. Since the type is on the right of the rule, it is easy to propagate the type to multiple declarations using a synthesized attribute.
Here is the code:
```
DECL_TYPE ::= id:id cm DECL_TYPE:decl_type
              {:
                 Type type = decl_type.getType();
                 if (!symTableStack.isLocallyDeclared(id)
                     && !(type.isSubtype("Function") && type.returnType().isSubtype("List"))) {
                    symTableStack.putEntry(id, symTableStack.new SymTableEntry(type));
                 } else [...] // Error handling
                 RESULT = new AST.DeclType(new Type(type));
              :}
            | id:id clns TYPE:type
              {:
                 if (!symTableStack.isLocallyDeclared(id)
                     && !(type.isSubtype("Function") && type.returnType().isSubtype("List"))) {
                    symTableStack.putEntry(id, symTableStack.new SymTableEntry(type));
                 } else [...] // Error handling
                 RESULT = new AST.DeclType(type);
              :}
            ;
```
Value definitions allow defining local and global values. Once we recognize that we are parsing a value definition, an intermediate action marks the value as assigned, which allows recursion on the value (possible in Haskell). Before that, the action checks that the value is locally declared and not yet assigned. When the rule is reduced, we first verify that the type of the expression matches the type of the value; then, we check whether the value is global or local. If it is global, the value is treated as a nullary function, since global values are here implemented as functions with no arguments (we will see this during IR generation); otherwise, it is treated as a plain value.
Here is the code:
```
DEF_VALUE ::= id:id
              {:
                 if (symTableStack.isDeclared(id) && symTableStack.isLocallyDeclared(id)
                     && !symTableStack.isAssigned(id)) {
                    symTableStack.getEntry(id).setIsAssigned(true);
                 } else [...] // Error handling
              :}
              eq EXPR:expr
              {:
                 if (symTableStack.isDeclared(id) && symTableStack.isLocallyDeclared(id)
                     && symTableStack.isAssigned(id)
                     && !expr.type.isEquivalent(symTableStack.getEntry(id).getType().getTypeName())) {
                    [...] // Error handling
                 }
                 if (symTableStack.isGlobal(id)) {
                    // Global Values are actually Nullary Functions
                    ArrayList<String> argNames = new ArrayList<>();
                    ArrayList<Type> argTypes = new ArrayList<>();
                    AST.LFormArg lformarg = new AST.LFormArg(argNames, argTypes, expr.getType());
                    RESULT = new AST.DefFunct(id, lformarg, expr);
                 } else {
                    RESULT = new AST.DefValue(symTableStack.createUniqueId(id), expr);
                 }
              :}
            ;
```
Finally, function definitions allow defining functions. Once we recognize that we are parsing a function definition, an intermediate action defines the function as assigned, allowing recursion. Before that, the action checks if the function is locally declared and not assigned yet. Functions define their own context, i.e. their own symbol table, so we need to push a symbol table before parsing the rest of the function. Finally, the symbol table is popped and the AST node for the function definition is created.
```
DEF_FUNCT ::= id:id
              {:
                 if (symTableStack.isDeclared(id) && symTableStack.isLocallyDeclared(id)
                     && !symTableStack.isAssigned(id)) {
                    symTableStack.getEntry(id).setIsAssigned(true);
                 } else [...] // Error handling
                 symTableStack.pushSymTable();
              :}
              LFORMARG:lformarg eq EXPR:expr
              {:
                 [...] // Error handling
                 symTableStack.popSymTable();
                 RESULT = new AST.DefFunct(symTableStack.createUniqueId(id), lformarg, expr);
              :}
            ;
```
To handle formal arguments, represented by the LFORMARG symbol, we propagate the type of the arguments through the function type tree, assigning it and defining the arguments in the symbol table. Meanwhile, the parser checks that the number of arguments is correct and that no two arguments have the same name.
Here is the code:
```
LFORMARG ::= LFORMARG:lformarg id:id
             {:
                Type propType = lformarg.getPropType();
                if (!symTableStack.isLocallyDeclared(id) && propType != null && propType.isSubtype("Function")) {
                   Type argTree = new Type(propType.getTypeParam(0));
                   symTableStack.putEntry(id, symTableStack.new SymTableEntry(argTree, true));
                   lformarg.getArgNames().add(symTableStack.createUniqueId(id));
                   lformarg.getArgTypes().add(argTree);
                   RESULT = new AST.LFormArg(lformarg.getArgNames(), lformarg.getArgTypes(), propType.getTypeParam(1));
                } else [...] // Error handling
             :}
           | id:id
             {:
                String funcName = (String) parser.stack(-2);
                if (symTableStack.isDeclared(funcName)) {
                   Type type = symTableStack.getEntry(funcName).getType();
                   if (!symTableStack.isLocallyDeclared(id) && type != null && type.isSubtype("Function")) {
                      Type argTree = new Type(type.getTypeParam(0));
                      symTableStack.putEntry(id, symTableStack.new SymTableEntry(argTree, true));
                      ArrayList<String> argNames = new ArrayList<>();
                      ArrayList<Type> argTypes = new ArrayList<>();
                      argNames.add(symTableStack.createUniqueId(id));
                      argTypes.add(argTree);
                      RESULT = new AST.LFormArg(argNames, argTypes, type.getTypeParam(1));
                   } else [...] // Error handling
                } else [...] // Error handling
             :}
           ;
```
Expressions and conditions
Expressions are fundamental for performing computation, as they represent operations which return some value. In this compiler, we define two types of expressions: operations and special constructs. Operations are basic expressions like addition, subtraction and so on, while special constructs represent expressions that need special management, like if-then-else blocks and let bindings. In particular, special constructs are:
- let bindings;
- if-then-else blocks;
- function calls;
- constants;
- lists of expressions.
For all these expressions, we use special rules to handle them.
Instead, basic expressions are all handled in the same manner: we check the type of the operands and then produce the type of the result. The type is the only information that we can use for semantic analysis, in this case.
Conditions are special expressions, which must be of boolean type.
Example of operation: addition.
```
EXPR ::= EXPR:expr1 plus EXPR:expr2
         {:
            String opType = "+";
            AST.BasicExpr[] subExpressions = { expr1, expr2 };
            if (expr1.type.isSubtype("Double") && expr2.type.isSubtype("Double")
                && expr1.type.isEquivalent(expr2.type.getTypeName())) {
               RESULT = new AST.Expr(new Type(expr1.type), AST.ExprKind.PLUS, subExpressions);
            } else [...] // Error handling
         :}
```
Function calls
In this compiler, function calls refer both to functions and to values, as in Haskell. The semantic rule proceeds as follows: first, we check that the function/value is declared and assigned; then, we look at the number of actual arguments and distinguish two cases: if there are no arguments, we refer to a value, otherwise to a function. If we refer to a value, we check that the id is not a function and finally build the AST node. In the other case, we check that the types of the actual arguments match those of the formal arguments, then we build the AST node.
```
FUNCT_CALL ::= id:id LACTARG:lactarg
               {:
                  int nArgs = lactarg.size();
                  Type funcType = null;
                  boolean failedArgsTypeChecking = false;
                  if (symTableStack.isDeclared(id) && symTableStack.isAssigned(id)) {
                     funcType = symTableStack.getEntry(id).getType();
                     if (nArgs == 0) {
                        // This is not a Function Call, but a reference to a Value
                        if (funcType != null && !funcType.isSubtype("Function")) {
                           RESULT = new AST.FunctCall(new Type(funcType), symTableStack.createUniqueId(id), lactarg);
                        } else [...] // Error handling
                     } else if (nArgs > 0) {
                        // This is a Function Call
                        if (funcType != null && funcType.isSubtype("Function") && funcType.hasArity(nArgs)) {
                           for (AST.Expr actarg : lactarg) {
                              if (!actarg.type.isEquivalent(funcType.getTypeParam(0)))
                                 failedArgsTypeChecking = true;
                              funcType = funcType.getTypeParam(1); // extract the right child of the funcType tree
                           }
                           if (!failedArgsTypeChecking) {
                              RESULT = new AST.FunctCall(new Type(funcType), symTableStack.createUniqueId(id), lactarg);
                           } else [...] // Error handling
                        } else [...] // Error handling
                     }
                  } else [...] // Error handling
               :}
             ;
```
An important point: the symTableStack.getEntry() method extracts the symbol table entry of the value/function and, if it is not found in the top-level symbol table, it searches the symbol tables below. This implements lexical scoping: every value/function is available to all the blocks nested inside the one where it has been declared.
This means that the following Haskell code works:
```haskell
value :: Int
value = let x :: Int; x = 3
        in let y :: Int; y = 4
           in x + y
```
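The stack walk behind this lookup can be sketched as follows. This is a minimal model with invented names (not the actual LHC classes), where entries are reduced to plain type strings:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of lexically scoped lookup: search the top symbol table first,
// then fall back to the symbol tables below it on the stack.
class ScopedLookup {
    private final Deque<Map<String, String>> stack = new ArrayDeque<>();

    void push() { stack.push(new HashMap<>()); }          // enter a nested context
    void pop()  { stack.pop(); }                          // leave it

    void put(String id, String entry) { stack.peek().put(id, entry); }

    // Walk the stack from the innermost table outwards.
    String getEntry(String id) {
        for (Map<String, String> table : stack) {         // head-first = top-down
            String entry = table.get(id);
            if (entry != null) return entry;
        }
        return null;                                      // undeclared
    }
}
```

With this rule, the outer `x` is still visible while the inner let-binding for `y` is on top of the stack, which is exactly what the Haskell example above relies on.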
Let bindings (functional part)
Let bindings exist for one purpose: declaring and defining new variables in a nested context. For this reason, the semantic rules only create and destroy the symbol table in which the new declarations are stored.
```
LET_BLOCK_FUNC ::= let indent
                   {: symTableStack.pushSymTable(); :}
                   LET_STMTS:let_stmts dedent in EXPR:expr
                   {:
                      symTableStack.popSymTable();
                      RESULT = new AST.LetBlockFunc(new Type(expr.type), let_stmts, expr);
                   :}
                 | [...] // Additional rule for syntax errors
                 ;
```
The if-then-else block (functional part)
As with let bindings, if-then-else blocks are rather simple to analyze: the parser only needs to check that the two branches return values of the same type, otherwise it is an error.
```
IF_BLOCK_FUNC ::= if_begin COND:cond then EXPR:thenBody else_begin EXPR:elseBody
                  {:
                     if (thenBody.type.isEquivalent(elseBody.type)) {
                        RESULT = new AST.IfBlockFunc(thenBody.type, cond, thenBody, elseBody);
                     } else [...] // Error handling
                  :}
                | [...] // Additional rules for syntax errors
                ;
```
Building types
How are type trees built? By mounting the subtrees according to the production rules, as the code shows.
```
TYPE ::= TYPE_VALUE:type_value {: RESULT = type_value; :}
       | TYPE_FUNC:type_func   {: RESULT = type_func; :}
       ;

// both basic types and compound
TYPE_VALUE ::= TYPE_LIST:type_list   {: RESULT = type_list; :}
             | TYPE_BASIC:type_basic {: RESULT = type_basic; :}
             ;

TYPE_BASIC ::= type_int    {: RESULT = Type.getType("Int"); :}
             | type_double {: RESULT = Type.getType("Double"); :}
             | type_bool   {: RESULT = Type.getType("Bool"); :}
             | type_char   {: RESULT = Type.getType("Char"); :}
             ;

TYPE_LIST ::= bo TYPE_BASIC:type_basic bc
              {:
                 Type type = Type.getType("List");
                 type.setTypeParam(0, type_basic);
                 RESULT = type;
              :}
            | type_string {: RESULT = Type.getType("String"); :}
            ;

// no higher order functions
TYPE_FUNC ::= TYPE_VALUE:type_value arrow TYPE_FUNC:type_func
              {:
                 Type type = Type.getType("Function");
                 type.setTypeParam(0, type_value);
                 type.setTypeParam(1, type_func);
                 RESULT = type;
              :}
            | TYPE_VALUE:type_valueL arrow TYPE_VALUE:type_valueR
              {:
                 Type type = Type.getType("Function");
                 type.setTypeParam(0, type_valueL);
                 type.setTypeParam(1, type_valueR);
                 RESULT = type;
              :}
            ;
```
Error handling
As with lexical and syntax analysis, the semantics of a program can be wrong: these errors must be handled, too.
Errors are recognized inside the semantic rules, so error detection and recovery are defined inside the Parser.
Every time an error is detected, the parser calls the report_semantic_error() method, reported below:
```java
public void report_semantic_error(String msg) {
    noCompileErrors = false;
    System.err.print("ERROR: Semantic error");
    System.err.println(" (line " + currLine + ", column " + currColumn + "): " + msg);
}
```
Then, the parser creates a valid AST node for the production rule, so it recovers from the error and can move forward.
Example: semantic error in a functional if-then-else block:
```
IF_BLOCK_FUNC ::= if_begin COND:cond then EXPR:thenBody else_begin EXPR:elseBody
                  {:
                     if (thenBody.type.isEquivalent(elseBody.type)) {
                        RESULT = new AST.IfBlockFunc(thenBody.type, cond, thenBody, elseBody);
                     } else {
                        report_semantic_error("Multiple Return Types for If-Then-Else");
                        RESULT = new AST.IfBlockFunc(Type.getType("error"), cond, thenBody, elseBody);
                     }
                  :}
                | [...] // Additional rules for syntax errors
```
This is the list of all the possible semantic errors that LHC can recognize:
- Print does not support the argument type
- No support for local functions
- Global lists not supported
- Lists as a return value are not supported
- Missing value declaration
- Missing function declaration
- Multiple value declaration
- Value is not locally declared
- Function is not locally declared
- Multiple value assignment
- Multiple function assignment
- Mismatching type on assignment
- Not a function (but you expect one)
- Not a value (but you expect one)
- Mismatching arity in function (wrong number of arguments)
- Return type is not equivalent to the expression type (in Function Definition)
- Multiple argument declaration
- Wrong type in condition (Expected type: Bool)
- Wrong type in LHS/RHS of [expression_kind] expression (expression_kind represents “addition”, “subtraction” and so on, typically by using the operator symbol)
- Mismatching operand type on [expression_kind] Expression
- Wrong type in arguments applied to function
- Value declared but not assigned (when using an uninitialized variable)
- Multiple return types for if-then-else
- Wrong type in list expression (subexpressions in a list expression must have the same type)
IR Generation
After lexical, syntactical and semantic analysis, we can assume that the program is correct and recognized so we can proceed with the next step: generating the intermediate representation. For this purpose, we need additional fields and methods, especially in the AST.
We said that the BasicNode class defines fields and methods for code generation. They are:

- code: the LLVM code for the node;
- basicBlock: the basic block we are currently in; used for if-then-else code generation;
- codegen(): the method where code is generated; it returns the identifier of the result value if the node refers to an expression;
- mountCode(): merges the code of two AST nodes.
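These members can be sketched as follows. This is a minimal model (assuming code is a simple list of instruction strings; the real class has more state):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the code-building members of an AST node.
abstract class BasicNode {
    final List<String> code = new ArrayList<>(); // LLVM code for this node
    String basicBlock;                           // current basic block label

    // Generates this node's code; returns the result register, if any.
    abstract String codegen();

    // Merges a child's generated code into this node's code.
    void mountCode(BasicNode child) {
        code.addAll(child.code);
    }
}
```

The typical pattern, visible in all the codegen() methods below, is: call the child's codegen(), then mountCode(child) to splice its instructions into the parent's list.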
The LLVM class supports the codegen() method by providing static methods that build any LLVM instruction required, like createAlloca().

The LLVM class also provides two counters, counterSSA and counterIf, to handle static single assignment and label uniqueness across the different if-then-else blocks.
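A sketch of how such counters might look (the %n register naming and the starting index are assumptions, not taken from the LHC sources):

```java
// Hypothetical sketch of the two code-generation counters:
// counterSSA hands out fresh SSA register indices, while counterIf
// numbers if-then-else blocks so their labels never collide.
class Counters {
    private static int counterSSA = 0;
    private static int counterIf = 0;

    static int getCounterSSA()   { return counterSSA++; } // fresh SSA index
    static void resetCounterSSA() { counterSSA = 0; }     // per-function reset
    static int getCounterIf()    { return counterIf++; }  // fresh if-block index

    // Turn an SSA index into an LLVM virtual register name.
    static String createVariable(int n) { return "%" + n; }
}
```

Note that resetCounterSSA() is called at the start of every function definition (see DefFunct below), since LLVM register numbering is local to each function.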
In this section, we focus on the same constructs seen in the previous section with respect to IR generation.
The start and the end of parsing
The Program node has a very simple codegen() method for building the output program:
```java
public static class Program extends BasicNode {
    private FunctPart functPart;
    private ImperPart imperPart;

    Program(FunctPart functPart, ImperPart imperPart) {
        this.functPart = functPart;
        this.imperPart = imperPart;
    }

    @Override
    String codegen() {
        functPart.codegen();
        mountCode(functPart);
        imperPart.codegen();
        mountCode(imperPart);
        return null;
    }
}
```
The Print function
The print() function, like any other function call, requires its argument to be evaluated first; after that, we can emit the call.
```java
public static class Print extends IOAction {
    private Expr actarg;

    Print(Expr actarg) {
        this.actarg = actarg;
    }

    @Override
    String codegen() {
        String actargIndex = actarg.codegen();
        mountCode(actarg);
        String result = LLVM.createVariable(LLVM.getCounterSSA());
        code.add(LLVM.createPrintCall(result, actarg.getType(), actargIndex));
        return null;
    }
}
```
Do blocks
Do blocks are lists of statements, so it is sufficient to generate the code of all the statements composing the block.
```java
public static class DoBlock extends IOAction {
    private LinkedList<DoStmt> ioActions;

    DoBlock(LinkedList<DoStmt> ioActions) {
        this.ioActions = ioActions;
    }

    @Override
    String codegen() {
        for (DoStmt ioAction : ioActions) {
            ioAction.codegen();
            mountCode(ioAction);
        }
        return null;
    }
}
```
Declarations and definitions
Type declarations do not generate code: since in Haskell we cannot use a value/function until it is assigned, it will be allocated and assigned all at once.
Value definitions are rather simple: first, we evaluate the expression to assign; then we allocate the value and store the expression result in it. Values here are local: global values are treated as functions and use a DefFunct node.
```java
public static class DefValue extends Stmt {
    private String id; // must be a "unique" id
    private Expr expr;

    DefValue(String id, Expr expr) {
        this.id = id;
        this.expr = expr;
    }

    @Override
    String codegen() {
        String exprIndex = expr.codegen();
        mountCode(expr);
        code.add(LLVM.createAlloca(LLVM.createRegister(id), expr.getType()));
        code.add(LLVM.createStore(LLVM.createRegister(id), exprIndex, expr.getType()));
        return null;
    }
}
```
Function definitions define global functions. To generate a function we need to:
- create its definition (return type, name and arguments);
- open the function; the first basic block of a function is always called entry, so we call setBasicBlock to update the current basic block;
- store all the argument values and give them their names as in the original Haskell code;
- generate the code of the body;
- create the return statement;
- close the function.
All these steps can be recognized in the code:
```java
public static class DefFunct extends Stmt {
    private String id; // must be a "unique" id
    private LFormArg lformarg;
    private Expr expr;

    DefFunct(String id, LFormArg lformarg, Expr expr) {
        this.id = id;
        this.lformarg = lformarg;
        this.expr = expr;
    }

    @Override
    String codegen() {
        ArrayList<String> argNames = lformarg.getArgNames();
        ArrayList<Type> argTypes = lformarg.getArgTypes();
        LLVM.resetCounterSSA();
        code.add(LLVM.createFunctionDefinition(id, expr.getType(), lformarg));
        code.addAll(LLVM.openFunction());
        setBasicBlock("entry");
        for (int i = 0; i < argNames.size(); i++) {
            code.add(LLVM.createAlloca(LLVM.createRegister(argNames.get(i)), argTypes.get(i)));
            code.add(LLVM.createStore(LLVM.createRegister(argNames.get(i)), LLVM.createRegister(i), argTypes.get(i)));
            LLVM.getCounterSSA(); // update the register index
        }
        String result = expr.codegen();
        mountCode(expr);
        code.add(LLVM.createReturn(result, expr.getType()));
        code.add(LLVM.closeFunction());
        return null;
    }
}
```
Expressions and conditions
To evaluate an expression, it is necessary to evaluate its subexpressions first; their results are stored in an array. Then, we can build the code for the expression itself: first, we create a new register for the result; then, we create the operation instruction; finally, we use the result register as the return value of the codegen() method itself. In this way, the register is available to the rest of the program, which can use it as an operand in other instructions.
The same distinction between operations and special constructs is still true during code generation. While operations build code using LLVM instructions for arithmetic and more complex operations, special constructs are managed on their own.
Conditions are boolean expressions, so they are treated as expressions.
```java
public enum ExprKind {
    PLUS, MINUS, TIMES, DIV, INTDIV, REM, AND, OR,
    RELNOTEQ, RELEQ, RELGE, RELGT, RELLE, RELLT,
    INDEX, NOT, UMINUS,
    LET_BLOCK_FUNC, IF_BLOCK_FUNC, FUNCT_CALL, VALUE, EXPR_LIST
}

public static class Expr extends BasicExpr {
    private ExprKind exprKind;
    private BasicExpr[] subExpressions; // the subexpressions

    Expr(Type type, ExprKind exprKind, BasicExpr[] subExpressions) {
        this.type = type;
        this.exprKind = exprKind;
        this.subExpressions = subExpressions;
    }

    @Override
    String codegen() {
        String[] subExpResults = new String[subExpressions.length];
        String result = null;
        for (int i = 0; i < subExpressions.length; i++) {
            subExpResults[i] = subExpressions[i].codegen();
            mountCode(subExpressions[i]);
        }
        switch (exprKind) {
            case PLUS:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createAdd(result, subExpResults, subExpressions[0].getType()));
                break;
            case MINUS:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createSub(result, subExpResults, subExpressions[0].getType()));
                break;
            case TIMES:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createMult(result, subExpResults, subExpressions[0].getType()));
                break;
            case DIV:
                // promote Int operands to Double before dividing
                String[] oldExpResults = new String[subExpressions.length];
                for (int i = 0; i < subExpressions.length; i++) {
                    oldExpResults[i] = subExpResults[i];
                    if (subExpressions[i].getType().isEquivalent("Int")) {
                        subExpResults[i] = LLVM.createVariable(LLVM.getCounterSSA());
                        code.add(LLVM.createSItoFP(subExpResults[i], oldExpResults[i], type, subExpressions[i].getType()));
                    }
                }
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createDiv(result, subExpResults, type));
                break;
            case INTDIV:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createIntDiv(result, subExpResults, subExpressions[0].getType()));
                break;
            case REM:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRem(result, subExpResults, subExpressions[0].getType()));
                break;
            case AND:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createAnd(result, subExpResults, subExpressions[0].getType()));
                break;
            case OR:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createOr(result, subExpResults, subExpressions[0].getType()));
                break;
            case RELNOTEQ:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelNotEQ(result, subExpResults, subExpressions[0].getType()));
                break;
            case RELEQ:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelEQ(result, subExpResults, subExpressions[0].getType()));
                break;
            case RELGE:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelGE(result, subExpResults, subExpressions[0].getType()));
                break;
            case RELGT:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelGT(result, subExpResults, subExpressions[0].getType()));
                break;
            case RELLE:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelLE(result, subExpResults, subExpressions[0].getType()));
                break;
            case RELLT:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createRelLT(result, subExpResults, subExpressions[0].getType()));
                break;
            case INDEX:
                String elemPtr = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createGEP(elemPtr, subExpResults[0], subExpressions[0].getType(), subExpResults[1]));
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createLoad(result, type, elemPtr));
                break;
            case NOT:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createNot(result, subExpResults, subExpressions[0].getType()));
                break;
            case UMINUS:
                result = LLVM.createVariable(LLVM.getCounterSSA());
                code.add(LLVM.createUMinus(result, subExpResults, subExpressions[0].getType()));
                break;
            case LET_BLOCK_FUNC:
            case IF_BLOCK_FUNC:
            case FUNCT_CALL:
            case VALUE:
            case EXPR_LIST:
                // special constructs generate their own code; just forward the result
                result = subExpResults[0];
                break;
        }
        return result;
    }
}
```
Function calls
To perform a function call, we first need to evaluate all its arguments. If the call refers to a value, we create a register into which we load its content and return that register. Otherwise, we create a register for the result, create a function call using the function name and all its arguments one after the other, and finally return the result register.
```java
public static class FunctCall extends BasicExpr {
    private String id; // must be a unique id
    private LinkedList<Expr> actargs;

    FunctCall(Type type, String id, LinkedList<Expr> actargs) {
        this.type = type;
        this.id = id;
        this.actargs = actargs;
    }

    @Override
    String codegen() {
        ArrayList<String> argIds = new ArrayList<>(); // ids of all the input arguments of the function
        ArrayList<Type> argTypes = new ArrayList<>();
        String result;
        for (Expr actarg : actargs) {
            argIds.add(actarg.codegen());
            argTypes.add(actarg.type);
            mountCode(actarg);
        }
        if (id.contains("$")) { // only local values have a '$' appended to their name
            result = LLVM.createRegister(LLVM.getCounterSSA());
            code.add(LLVM.createLoad(result, this.type, LLVM.createRegister(id)));
        } else {
            result = LLVM.createRegister(LLVM.getCounterSSA());
            code.add(LLVM.createFunctionCall(result, this.type, id, argIds, argTypes));
        }
        return result;
    }
}
```
About lexical scoping: if we have two values with the same name, how can we refer to the right one? We could distinguish them by observing in which symbol table each value is defined, but symbol tables are destroyed when parsing terminates. So, we append to the name of a value an identifier of its symbol table, which preserves this information.
In particular, SymTableStack defines additional fields and methods for this purpose:

- symTableStackId: an ArrayList that contains the numeric id for every level of the stack, used expressly for intermediate code generation. Every time a new symbol table is pushed onto the stack at a given level, the id corresponding to that level increments; the id of a given symbol table is the concatenation of all the ids, from the bottom level up to that of the symbol table we are considering, separated by "$". For example, suppose we have two symbol tables, one on top of the other, with ids "1" (bottom) and "1$1" (top). If we pop a symbol table, only the bottom one remains, with id "1"; if we then push another symbol table, its id will be "1$2". We can safely use "$" as a separator because it is not a valid character in Haskell identifiers, so there is no risk of homonymy.
- updateSymTableStackId(): updates symTableStackId when a symbol table is pushed;
- getSymTableIds(): gets a symbol table id, following the specification defined above;
- createUniqueId(): creates a unique id for a variable (for IR Generation).
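The id scheme can be sketched as follows. This is a hypothetical, simplified model of symTableStackId and createUniqueId() (not the actual LHC code); it reproduces the "1" / "1$1" / "1$2" behaviour described above:

```java
import java.util.ArrayList;

// Sketch of the symbol-table id scheme: one counter per stack level;
// pushing at a level increments that level's counter, and a table's id
// is the "$"-joined run of counters from the bottom up to its level.
class SymTableIds {
    private final ArrayList<Integer> ids = new ArrayList<>();
    private int depth = 0;

    void push() {
        if (depth == ids.size()) ids.add(0);     // first push at this level
        ids.set(depth, ids.get(depth) + 1);      // increment this level's counter
        depth++;
    }

    void pop() { depth--; }                      // counters are kept, not reset

    // Id of the symbol table currently on top of the stack, e.g. "1$2".
    String currentId() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            if (i > 0) sb.append('$');
            sb.append(ids.get(i));
        }
        return sb.toString();
    }

    // Unique variable name: original name plus its table's id.
    String uniqueId(String name) { return name + "$" + currentId(); }
}
```

Because popped counters are never reset, two tables that ever coexisted at different times still get distinct ids, so the derived variable names never collide.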
When we reduce a function call, such a unique id is stored in the AST node of the expression, thus storing the information and preserving name uniqueness. In this way, the unique identifier can be used during code generation.
Let bindings (functional part)
A let binding consists of a sequence of statements, together with a final expression. So, to generate its code we need to generate all the statements and then evaluate the expression.
```java
public static class LetBlockFunc extends BasicExpr {
    private LinkedList<Stmt> letStmts;
    private Expr expr;

    LetBlockFunc(Type type, LinkedList<Stmt> letStmts, Expr expr) {
        this.type = type;
        this.letStmts = letStmts;
        this.expr = expr;
    }

    @Override
    String codegen() {
        for (Stmt letStmt : letStmts) {
            letStmt.codegen();
            mountCode(letStmt);
        }
        String result = expr.codegen();
        mountCode(expr);
        return result;
    }
}
```
If-then-else block (functional part)
The if-then-else block is one of the most peculiar in terms of IR generation because it exploits the basic block concept, which is used in LLVM to define the control flow graph.
Before generating code, we define an index for this if-then-else block: every if-then-else has a different index so that it is impossible to have homonymy conflicts on the labels.
To generate an if-then-else block, we evaluate its condition first. Then, we create a conditional branch depending on the condition result. After this, the body of the then branch can be generated. Since we define a new label, we define a new basic block; for this reason, we call setBasicBlock() to update this information. After generating the body of the then branch, it is very important to read back the current basic block: nested if-then-else blocks may have changed it.

Then, we create an unconditional branch towards the end of the block, to skip the else expression. At this point, we can generate the else expression and branch to the exit of the block: this latter step is mandatory because in LLVM every basic block must end with either a branch or a return statement.

Finally, we create a register for the result and define a phi node: the phi node is a sort of merge operator that takes the results of the two branches and selects which one to use, depending on which branch has been executed. For phi nodes to work correctly, it is mandatory to state the exact basic block where the result of each branch is defined, which explains why it is necessary to call getBasicBlock() in both branches. In the end, the register containing the result is returned, so that other parts of the program can use it.
```java
public static class IfBlockFunc extends BasicExpr {
    private Expr cond, thenBody, elseBody;

    IfBlockFunc(Type type, Expr cond, Expr thenBody, Expr elseBody) {
        this.type = type;
        this.cond = cond;
        this.thenBody = thenBody;
        this.elseBody = elseBody;
    }

    @Override
    String codegen() {
        final int ifIndex = LLVM.getCounterIf();
        String condIndex, thenIndex, elseIndex;
        String thenLabel, elseLabel, exitLabel, thenBB, elseBB, result;

        condIndex = cond.codegen();
        mountCode(cond);
        code.add(LLVM.createBranchCond(condIndex, "if.then$" + ifIndex, "if.else$" + ifIndex));

        thenLabel = "if.then$" + ifIndex;
        code.add(LLVM.createLabel(thenLabel));
        setBasicBlock(thenLabel);
        thenIndex = thenBody.codegen();
        mountCode(thenBody);
        thenBB = getBasicBlock(); // nested if-then-else may have changed the basic block
        code.add(LLVM.createBranchNotCond("if.exit$" + ifIndex));

        elseLabel = "if.else$" + ifIndex;
        code.add(LLVM.createLabel(elseLabel));
        setBasicBlock(elseLabel);
        elseIndex = elseBody.codegen();
        mountCode(elseBody);
        elseBB = getBasicBlock();
        code.add(LLVM.createBranchNotCond("if.exit$" + ifIndex));

        exitLabel = "if.exit$" + ifIndex;
        code.add(LLVM.createLabel(exitLabel));
        setBasicBlock(exitLabel);
        result = LLVM.createVariable(LLVM.getCounterSSA());
        code.add(LLVM.createPHINode(result, this.type, thenBB, thenIndex, elseBB, elseIndex));
        return result;
    }
}
```
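To make the shape of the output concrete: for a function like `f c = if c then 1 else 2`, the emitted IR follows the pattern below. This fragment is illustrative (the register name and constants are invented, not actual compiler output); only the label scheme follows the text above:

```llvm
define i32 @f(i1 %c) {
entry:
  br i1 %c, label %if.then$0, label %if.else$0

if.then$0:                                        ; then branch
  br label %if.exit$0

if.else$0:                                        ; else branch
  br label %if.exit$0

if.exit$0:                                        ; merge point
  %r = phi i32 [ 1, %if.then$0 ], [ 2, %if.else$0 ]
  ret i32 %r
}
```

Note how every basic block ends with a terminator (br or ret), and how the phi node pairs each incoming value with the label of the block it comes from.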
To conclude, the parser needs additional fields and methods, too; in particular, it defines a code field to store the generated code, plus a setter for that field. It is up to the parser to start IR generation, since it knows when parsing has finished: we need the full AST before generating the code. But how and when does code generation start? At the last reduction: the parser calls the codegen() method of the root node of the AST when the production rule for the start symbol is reduced.
```
PROGRAM ::= /* empty program */
            {:
               if (noCompileErrors) { System.out.println("CODE COMPILED SUCCESSFULLY"); }
               RESULT = new AST.Program(new AST.FunctPart(), new AST.ImperPart(null));
               // Code generation
               RESULT.codegen();
               // Retrieve the code
               setCode(RESULT.getCode());
            :}
          | indent
            {:
               // push the top-level symtable
               symTableStack.pushSymTable();
            :}
            FUNCT_PART:funct_part IMPER_PART:imper_part
            {:
               symTableStack.popSymTable(); // pop the top-level symtable
            :}
            dedent
            {:
               if (noCompileErrors) {
                  System.out.println("CODE COMPILED SUCCESSFULLY");
                  RESULT = new AST.Program(funct_part, imper_part);
                  // Code generation
                  RESULT.codegen();
                  // Retrieve the code
                  setCode(RESULT.getCode());
               } else {
                  System.out.println("CODE NOT COMPILED: FAILED");
                  RESULT = new AST.Program(null, null);
               }
            :}
          ;
```
Installation and usage
LHC is compatible with:
- Ubuntu 20.04 LTS (but it should work with less recent versions or other Linux distributions, too).
Prerequisites
- JDK (Java Development Kit) - Version 16 or above (successfully tested with JDK 16, but it should work with less recent releases, too)
- Cup - This is not the original implementation of Cup, but an enhanced one that displays a graphical representation of the Parse Tree
- LLVM (use sudo apt install llvm to install the package) - Version 10 or above (less recent versions may work, too)
Download and Commands
Download link: LHC
After downloading and accessing the main folder, you can use the following commands to play with the compiler:
- make install: install everything
- make uninstall: uninstall everything
- make %.ll: build the corresponding %.hs file and produce the resulting LLVM code
- lli %.ll: run the output code
In the ./programs folder, you can already find some examples of buildable and working Haskell programs.
Example
Let us suppose that we have just downloaded and unzipped the compiler and that we want to compile the GCD program stored in ./programs/gcd.hs.
First, go to the main folder (LHC).
Then, run the following commands to install the compiler, build the code and run it:
- to install everything: make install
- to build the code: make ./programs/gcd.ll (notice that the target is an LLVM file)
- to run it: lli ./programs/gcd.ll (the output code is in the same folder as the input one)
- finally, to uninstall everything: make uninstall
Other links
Github link to the original repo: https://github.com/palmleon/LHC
If you find any errors, or if you want to participate in the editing of this wiki, please contact: admin [at] skenz.it
You can reuse, distribute or modify the content of this page, but you must cite in any document (or webpage) this url: https://www.skenz.it/compilers/haskell_to_llvm