Links
git: link
compiling techniques: link
Description
Compiling Techniques is a course made up entirely of coursework - no exams! 75% of the course is building a compiler entirely from scratch, and this is the part that enticed me so much.
The course was cancelled right as I was about to take it in my 3rd year, so naturally I wasn't very happy. I self-enrolled onto the course to access the materials and started watching the lectures so I could hack away.
Mini-C
Mini-C is the professors' own subset of C. It's described by this grammar:
# # comment
# () grouping
# [] optional
# * zero or more
# + one or more
# | alternative
program ::= (include)* (structdecl)* (vardecl)* (fundecl)* EOF
include ::= "#include" STRING_LITERAL
structdecl ::= structtype "{" (vardecl)+ "}" ";" # structure declaration
vardecl ::= type IDENT ";" # normal declaration, e.g. int a;
| type IDENT "[" INT_LITERAL "]" ";" # array declaration, e.g. int a[2];
fundecl ::= type IDENT "(" params ")" block # function declaration
type ::= ("int" | "char" | "void" | structtype) ["*"]
structtype ::= "struct" IDENT
params ::= [ type IDENT ("," type IDENT)* ]
stmt ::= block
| "while" "(" exp ")" stmt # while loop
| "if" "(" exp ")" stmt ["else" stmt] # if then else
| "return" [exp] ";" # return
| exp "=" exp ";" # assignment
| exp ";" # expression statement, e.g. a function call
block ::= "{" (vardecl)* (stmt)* "}"
exp ::= "(" exp ")"
| (IDENT | INT_LITERAL)
| "-" exp
| CHAR_LITERAL
| STRING_LITERAL
| exp (">" | "<" | ">=" | "<=" | "!=" | "==" | "+" | "-" | "/" | "*" | "%" | "||" | "&&") exp # binary operators
| arrayaccess | fieldaccess | valueat | funcall | sizeof | typecast
funcall ::= IDENT "(" [ exp ("," exp)* ] ")"
arrayaccess ::= exp "[" exp "]" # array access
fieldaccess ::= exp "." IDENT # structure field member access
valueat ::= "*" exp # Value at operator (pointer indirection)
sizeof ::= "sizeof" "(" type ")" # size of type
typecast ::= "(" type ")" exp # type casting
As you can see, the language supports a fair chunk of C:
- int, char, void base types
- arrays and structures
- pointers
- binary operators - >, <, >=, <=, !=, ==, +, -, /, *, %, ||, &&
- type casting
- if and while statements
and so on.
There is also a little standard library that we're asked to implement, consisting of these functions:
- read_i
- read_c
- print_i
- print_c
- print_s
- mcmalloc
It's not a small amount of functionality!
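To get a feel for the language, here is a small program of my own which I believe the grammar accepts (the header name is made up; the library functions are the ones listed above):
#include "minic-stdlib.h"
struct point {
    int x;
    int y;
};
int square(int n) {
    return n * n;
}
void main() {
    struct point p;
    int i;
    p.x = 2;
    p.y = 3;
    i = 0;
    while (i < 10) {
        if (i % 2 == 0) {
            print_i(square(i) + p.x * p.y);
            print_c('\n');
        }
        i = i + 1;
    }
}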
The first step in writing the compiler was to make the lexer. The goal of the lexer is to go from raw text to a stream of tokens according to the grammar.
Since the grammar is LL(k) parse-able, we can do just that with some lookahead.
Lexer
Scanner
First off I needed a scanner. I used the scanner given by the coursework as a template and wrote a similar one (pretty much identical, with minor but important-to-me changes).
The scanner looked like this in the end:
import java.io.BufferedReader;
import java.io.EOFException;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
public class Scanner {
/** the file reader */
private BufferedReader input;
/** the peeked character if peeked and -1 otherwise */
private int peekedVal = -1;
/** current line in the file */
private int line = 1;
/** current column in the file */
private int column = 0;
public Scanner(File sourceFile) throws FileNotFoundException {
input = new BufferedReader(new FileReader(sourceFile));
}
/**
* Returns the next character in the file
* @return
*/
public char next() throws IOException, EOFException {
// if we peeked before, return the peeked char
boolean havePeeked = peekedVal != -1;
char nextChar;
if (havePeeked){
nextChar = (char)peekedVal;
peekedVal = -1;
} else {
int nextVal = input.read();
if(nextVal == -1){
throw new EOFException();
}
nextChar = (char)nextVal;
}
//advance line and column numbers,
// \n should work for counting lines on both Unix and non-Unix systems
if (nextChar == '\n') {
line++;
column = 0;
} else {
column++;
}
return nextChar;
}
/**
* Peeks the next character without advancing the scanner
* @return next character
*/
public char peek() throws IOException{
boolean havePeeked = peekedVal != -1;
char peekedChar;
if (havePeeked){
peekedChar = (char) peekedVal;
} else {
int readVal = input.read();
if(readVal == -1){
throw new EOFException();
}
peekedChar = (char) readVal;
peekedVal = readVal;
}
return peekedChar;
}
/**
* Returns the current line within the file of the scanner starting at 1.
* @return Current Line
*/
public int getLine(){
return line;
}
/**
* Returns the current column within the file of the scanner starting at 0.
* @return Current Column
*/
public int getColumn(){
return column;
}
/**
* Closes the scanner.
* @throws IOException
*/
public void close() throws IOException{
input.close();
}
}
Token
Next comes the most important class - Token. Tokens are enumerations with some optional data inside.
For example a char literal token, 'a', would have the enum value CHAR_LITERAL and its data would be "a".
We were also given the Token class, but I decided to add a constructor which builds certain tokens slightly differently, i.e. it removes the quotation marks from around strings and characters:
public class Token {
public enum TokenClass {
// the \ (backslash) is used as an escape character in the regular expression below
// ' is used to enclose character while " is used to enclose strings
IDENTIFIER, // ('a'|...|'z'|'A'|...|'Z'|'_')('0'|...|'9'|'a'|...|'z'|'A'|...|'Z'|'_')*
// delimiters
LBRA, // '{' // left brace
RBRA, // '}' // right brace
LPAR, // '(' // left parenthesis
RPAR, // ')' // right parenthesis
LSBR, // '[' // left square brace
RSBR, // ']' // right square brace
SC, // ';' // semicolon
COMMA, // ','
//// KEYWORDS
// types
INT, // "int"
VOID, // "void"
CHAR, // "char"
// keywords
IF, // "if"
ELSE, // "else"
WHILE, // "while"
RETURN, // "return"
STRUCT, // "struct"
SIZEOF, // "sizeof"
////
// include
INCLUDE, // "#include"
// literals
STRING_LITERAL, // \".*\" any sequence of characters enclosed within two double quote " (please be aware of the escape character backslash \)
INT_LITERAL, // ('0'|...|'9')+
CHAR_LITERAL, // \'('a'|...|'z'|'A'|...|'Z'|'\t'|'\b'|'\n'|'\r'|'\f'|'\''|'\"'|'\\'|'\0'|'.'|','|'_'|...)\' a character starts and end with a single quote '
//// OPERATORS
// logical operators
AND, // "&&"
OR, // "||"
// comparisons
EQ, // "=="
NE, // "!="
LT, // '<'
GT, // '>'
LE, // "<="
GE, // ">="
ASSIGN, // '='
// operators
PLUS, // '+'
MINUS, // '-'
ASTERIX, // '*' // can be used for multiplication or pointers
DIV, // '/'
REM, // '%'
// struct member access
DOT, // '.'
//// special tokens
EOF, // signal end of file
INVALID // in case we cannot recognise a character as part of a valid token
}
public final TokenClass tokenClass;
public final String data;
public final Position position;
public Token(TokenClass type, int lineNum, int colNum) {
this(type, "", lineNum, colNum);
}
public Token (TokenClass tokenClass, String data, int lineNum, int colNum) {
assert (tokenClass != null);
this.tokenClass = tokenClass;
switch (tokenClass){
case IDENTIFIER:
this.data = data;
break;
case CHAR_LITERAL:
case STRING_LITERAL:
this.data = data.substring(1,data.length()-1);
break;
case INT_LITERAL:
this.data = data;
break;
default:
this.data = "";
}
this.position = new Position(lineNum, colNum);
}
@Override
public String toString() {
if (data.equals(""))
return tokenClass.toString();
else
return tokenClass.toString()+"("+data+")";
}
}
Tokeniser
The tokeniser is the main part of the lexer: it uses the scanner to navigate the file and spits out tokens on request through its nextToken method. This way the parser can start parsing tokens before the tokeniser even reaches the end of the file. The parser, tokeniser and scanner all work in unison later on to pre-process and validate our code.
I had to write quite a few utility functions and even a Match class (sketched at the end of this section) to help identify which token was the longest possible match when several were candidates. The most important methods were these, however:
/**
* returns the next token in the stream.
* @return
*/
public Token nextToken() {
clearTokenString();
Token result;
try {
result = next();
} catch (final EOFException eof) {
// end of file, nothing to worry about, just return EOF token
return new Token(TokenClass.EOF,null, scanner.getLine(), scanner.getColumn());
} catch (final IOException ioe) {
ioe.printStackTrace();
// something went horribly wrong, abort
System.exit(-1);
return null;
}
//always pop in between tokens to avoid stupid errors
popCurrentTokenString();
if(!clearedFirstToken)
clearedFirstToken = true;
return result;
}
private final Scanner scanner;
private int error = 0;
private char currChar;
private boolean reachedEOF = false;
private boolean clearedFirstToken = false;
/** holds the characters belonging to the current token */
private final StringBuilder tokenStringSoFar = new StringBuilder();
private Token next() throws IOException {
final int line = scanner.getLine();
final int column = scanner.getColumn() - 1; // the scanner tells us the current character is x chars in; we report that character's position as x - 1
// skip white spaces
if (Character.isWhitespace(currChar)){
currChar = scanner.next();
return next();
}
try {
// tokens identifiable by first character
switch(currChar){
case '#':
expectFullString("#include");
return new Token(TokenClass.INCLUDE,popCurrentTokenString(),line,column);
case '"':
expectStringLiteral();
return new Token(TokenClass.STRING_LITERAL,popCurrentTokenString(),line,column);
case '\'':
expectCharLiteral();
return new Token(TokenClass.CHAR_LITERAL,popCurrentTokenString(),line,column);
}
// operators
if(isOperatorStartChar(currChar)){
switch(currChar){
case '&':
expectFullString("&&");
return new Token(TokenClass.AND,popCurrentTokenString(),line,column);
case '|':
expectFullString("||");
return new Token(TokenClass.OR,popCurrentTokenString(),line,column);
case '+':
expectSingleChar('+');
return new Token(TokenClass.PLUS,popCurrentTokenString(),line,column);
case '-':
expectSingleChar('-');
return new Token(TokenClass.MINUS,popCurrentTokenString(),line,column);
case '*':
expectSingleChar('*');
return new Token(TokenClass.ASTERIX,popCurrentTokenString(),line,column);
case '/':
Match rightMatch = expectLongestMatch(new ArrayList<Match>(Arrays.asList(
new Match("/",TokenClass.DIV),
new Match((s)->isACommentSoFar(s),null)
)), "expected division or comment");
if(rightMatch.getTokenClass() == null){
// it's a comment, skip the whole token
popCurrentTokenString();
return nextToken();
} else {
// it's division
return new Token(TokenClass.DIV,popCurrentTokenString(),line,column);
}
case '%':
expectSingleChar('%');
return new Token(TokenClass.REM,popCurrentTokenString(),line,column);
case '.':
expectSingleChar('.');
return new Token(TokenClass.DOT,popCurrentTokenString(),line,column);
case '!':
expectFullString("!=");
return new Token(TokenClass.NE,popCurrentTokenString(),line,column);
case '=':
int matched = expectEither("=", "==");
if(matched == 0){
return new Token(TokenClass.ASSIGN,popCurrentTokenString(),line,column);
} else {
return new Token(TokenClass.EQ,popCurrentTokenString(),line,column);
}
case '<':
matched = expectEither("<","<=");
if(matched == 0){
return new Token(TokenClass.LT,popCurrentTokenString(),line,column);
} else {
return new Token(TokenClass.LE,popCurrentTokenString(),line,column);
}
case '>':
matched = expectEither(">",">=");
if(matched == 0){
return new Token(TokenClass.GT,popCurrentTokenString(),line,column);
} else {
return new Token(TokenClass.GE,popCurrentTokenString(),line,column);
}
}
}
// delimiters
if(isDelimeter(currChar)){
expectDelimeter();
String data = popCurrentTokenString();
return new Token(getDelimeterTokenClass((data).charAt(0)),data,line,column);
}
// int literals
if (Character.isDigit(currChar)){
expectIntLiteral();
return new Token(TokenClass.INT_LITERAL,popCurrentTokenString(),line,column);
}
// Identifiers or keywords, last since everything else is disjoint from this set
if(isIdentifierStartChar(currChar)){
final Match rightMatch = expectLongestMatch(new ArrayList<Match>(
Arrays.asList(
new Match("if",TokenClass.IF),
new Match("else", TokenClass.ELSE),
new Match("while",TokenClass.WHILE),
new Match("return",TokenClass.RETURN),
new Match("struct", TokenClass.STRUCT),
new Match("sizeof",TokenClass.SIZEOF),
new Match("int",TokenClass.INT),
new Match("void",TokenClass.VOID),
new Match("char",TokenClass.CHAR),
new Match((s)->isValidIdentifier(s),TokenClass.IDENTIFIER))
)
, "Identifier or keyword");
return rightMatch.buildToken(popCurrentTokenString(), line, column);
}
} catch(EOFException e){
// next token will throw EOF at call to next(), meaning we get INVALID EOF at the end of the token list
error("Unexpected EOF in the middle of token",line,column);
return new Token(TokenClass.INVALID,popCurrentTokenString(), line, column);
}
catch (UnrecognizedCharacterException e) {
error(e,line,column);
return new Token(TokenClass.INVALID,popCurrentTokenString(),line,column);
}
// if we reach this point, it means we did not recognise a valid token
currChar = scanner.next();
error(currChar, line, column);
return new Token(TokenClass.INVALID,popCurrentTokenString(), line, column);
}
As you can see, without the utility classes the tokeniser is quite a simple piece of code: essentially we just look at the possible tokens and pick the longest one that matches.
We also throw useful errors so that programmers know what's wrong when their program doesn't compile.
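Since I kept mentioning it, here is roughly what the Match helper looks like. This is a sketch, not the exact class: only the constructors and methods called above come from the real code, the rest is assumption.
import java.util.function.Predicate;
// pairs a test on the token string with the TokenClass to emit;
// a null token class marks matches that produce no token (e.g. comments)
public class Match {
    private final Predicate<String> matcher;
    private final TokenClass tokenClass;
    public Match(String exact, TokenClass tokenClass) {
        this(exact::equals, tokenClass);
    }
    public Match(Predicate<String> matcher, TokenClass tokenClass) {
        this.matcher = matcher;
        this.tokenClass = tokenClass;
    }
    public boolean matches(String candidate) {
        return matcher.test(candidate);
    }
    public TokenClass getTokenClass() {
        return tokenClass;
    }
    public Token buildToken(String data, int line, int column) {
        return new Token(tokenClass, data, line, column);
    }
}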
Parser
The parser is one of the bigger parts of the compiler. It parses the tokens, making sure our grammar allows them to appear in the order they do, and at the same time it processes the input into an AST - an Abstract Syntax Tree, which is essentially a stripped-down version of the parse tree which arises naturally from the code.
Before I could write the parser I had to clean up the grammar a little:
Removing Left Recursion
Left recursion looks like this: A := A + B. Why is it bad? Well, a simple top-down recursive parser that encounters a rule like this will do something along the lines of:
void parseA(){
parseA();
expect('+');
parseB();
}
As you can see, this recurses before consuming any input, so it will overflow the stack. There are parsers you can write which are not bothered by left recursion, but they are not as simple.
Therefore we have to rewrite the grammar!
One rule you can use is that every occurrence of:
A := Aa | b
where b is anything that doesn't start with A, is transformed into:
A := b A'
A' := ["a" A' ]
This essentially makes A consume the expression it will eventually have to bottom out at (b) before letting it recurse further down. Notice how both these grammars accept the language ba*.
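In parser code the transformed rules become something like this (a sketch in the style of my parser, where accept peeks at the current token and nextToken consumes it):
// A := b A'
private void parseA() {
    parseB();
    parseAPrime();
}
// A' := ["a" A'] - the whole rule is optional
private void parseAPrime() {
    if (accept(TokenClass.A_TERMINAL)) { // hypothetical token standing in for "a"
        nextToken(); // we always consume input before recursing, so no infinite descent
        parseAPrime();
    }
}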
I am not going to lie to you, this was very confusing to me at the start, but eventually I got the hang of it.
Operator precedence & Associativity
The next step in the grammar clean-up was to deal with operator precedence. Say we have an operator, ||, which has lower precedence than the operator &&, i.e. the expression TERM || TERM && TERM should be parsed as TERM || (TERM && TERM).
Then the grammar will look something like:
BinOp := or
or := and or'
or' := ["||" and or']
and := TERM and'
and' := ["&&" TERM and' ]
TERM := '1' | '2' ... '9'
And that's it!
Tackling associativity is done in the actual parser, where we have to make sure that every time there is an expression like TERM1 OP TERM2 OP ..., after each new OP occurrence we first create a new BIN_OP node from the two terms parsed so far, and then pass it as the lhs to the next BIN_OP node formed.
This makes the operators left-associative, meaning that the expression TERM || TERM || TERM will be parsed as (TERM || TERM) || TERM, i.e. the brackets 'group' on the left-hand side of same-precedence operators.
Of course you could make the associativity different, it all depends on your needs, for me all the binary operators are left-associative.
The code for BinOp parsing ended up looking like this:
/**
* helper function for binary ops
* @param prefixExpr
* @param postfixExpr
* @param opTokens
* @return
*/
private Expr binaryOpRecFunc(final Expr lhs, Supplier<Expr> prefixExpr, Function<Expr,Expr> postfixExpr, TokenClass... opTokens){
// binaryOpRecFunc = ["Op" prefixExpr binaryOpRecFunc]
if(accept(opTokens)){
TokenClass firstOpTokenClass = currToken.tokenClass;
nextToken();
Expr rhs;
if((rhs = prefixExpr.get()) == null) return null;
// we check if there are more tail operators
if(accept(opTokens)){
Expr secondExpr;
// if so we pass new expr as lhs deeper
if((secondExpr = postfixExpr.apply(new BinOp(lhs,opTokenToASTOp(firstOpTokenClass),rhs))) == null) return null;
else
return secondExpr;
} else {
// if no tail just end the recursion
return new BinOp(lhs,opTokenToASTOp(firstOpTokenClass),rhs);
}
}
// no operator found
return null;
}
private Expr binaryOpFunc(Supplier<Expr> prefixExpr, Function<Expr,Expr> postfixExpr, TokenClass... opTokens){
// binaryop = prefixExpr binaryOpRecFunc
Expr lhs;
if(( lhs = prefixExpr.get()) == null) return null;
TokenClass opToken = currToken.tokenClass;
// if we don't find a bin op sign we just return lhs, this is ok
if(!accept(opTokens)) return lhs;
nextToken();
Expr rhs;
// we then parse the prefix to the next op as the rhs to this bin op, for left associativity of all bin operators
if((rhs = prefixExpr.get()) == null){
// shouldn't happen, we should always have a rhs to an op
return null;
} else {
// a binary op
Expr tail;
// check we have a tail of operators ( another OP ) if so we pass this new exp as the lhs itself
if(accept(opTokens)){
if((tail = postfixExpr.apply(new BinOp(lhs,opTokenToASTOp(opToken),rhs))) == null) return null;
else
// we create a bin op and pass it on as the lhs to the recursion
return tail;
} else {
// otherwise just return self
return new BinOp(lhs,opTokenToASTOp(opToken),rhs);
}
}
}
private Expr parseExp(){
return parseOr();
}
private Expr parseOr(){
return binaryOpFunc(()->parseAnd(),(l)-> parseOrRec(l), TokenClass.OR);
}
private Expr parseOrRec(final Expr lhs){
return binaryOpRecFunc(lhs,()->parseAnd(), (l)->parseOrRec(l), TokenClass.OR);
}
private Expr parseAnd(){
return binaryOpFunc(()->parseRelEq(), (l)->parseAndRec(l), TokenClass.AND);
}
private Expr parseAndRec(Expr lhs){
return binaryOpRecFunc(lhs,()->parseRelEq(), (l)->parseAndRec(l), TokenClass.AND);
}
private Expr parseRelEq(){
return binaryOpFunc(()->parseRelComp(), (l)->parseRelEqRec(l), TokenClass.EQ,TokenClass.NE);
}
private Expr parseRelEqRec(Expr lhs){
return binaryOpRecFunc(lhs,()->parseRelComp(), (l)->parseRelEqRec(l), TokenClass.EQ,TokenClass.NE);
}
private Expr parseRelComp(){
return binaryOpFunc(()->parseSummative(), (l)->parseRelCompRec(l), TokenClass.LT,TokenClass.LE,TokenClass.GT,TokenClass.GE);
}
private Expr parseRelCompRec(Expr lhs){
return binaryOpRecFunc(lhs,()->parseSummative(), (l)->parseRelCompRec(l), TokenClass.LT,TokenClass.LE,TokenClass.GT,TokenClass.GE);
}
private Expr parseSummative(){
return binaryOpFunc(()->parseMulti(), (l)->parseSummativeRec(l), TokenClass.PLUS,TokenClass.MINUS);
}
private Expr parseSummativeRec(Expr lhs){
return binaryOpRecFunc(lhs,()->parseMulti(), (l)->parseSummativeRec(l), TokenClass.PLUS,TokenClass.MINUS);
}
private Expr parseMulti(){
return binaryOpFunc(()->parseUnarySecondary(), (l)->parseMultiRec(l), TokenClass.ASTERIX,TokenClass.DIV,TokenClass.REM);
}
private Expr parseMultiRec(Expr lhs){
return binaryOpRecFunc(lhs,()->parseUnarySecondary(), (l)->parseMultiRec(l), TokenClass.ASTERIX,TokenClass.DIV,TokenClass.REM);
}
DRY at work right here, real proud of this one :)
Other non-terminals
For the rest of the grammar it was all mostly identical work. Seeing a definition like this:
program := (include)* (structdecl)* (vardecl)* (fundecl)* EOF
you'd write something along these lines:
private Program parseProgram(){
parseIncludes();
List<StructTypeDecl> structTypeDecls = parseStructDecls();
List<VarDecl> varDecls = parseVarDecls();
List<FunDecl> funDecls = parseFunDecls();
expect(TokenClass.EOF);
return new Program(structTypeDecls, varDecls, funDecls);
}
Notice how at the same time I am building up the AST nodes. These essentially keep only the important information from each combination of tokens. For example, LPAR 2 + 3 RPAR would be boiled down to BIN_OP(2,+,3).
De-cluttering helps in the later stages when we have to semantically analyse the code and then produce assembly.
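An AST node itself is just a small data holder plus an accept method for the visitors coming up in the next section. Here's a minimal sketch of one such node, assuming Expr is an abstract class in the AST hierarchy (the real classes carry a bit more, e.g. a type field filled in later):
// sketch of an AST node
public class BinOp extends Expr {
    public final Expr lhs;
    public final Op op;
    public final Expr rhs;
    public BinOp(Expr lhs, Op op, Expr rhs) {
        this.lhs = lhs;
        this.op = op;
        this.rhs = rhs;
    }
    @Override
    public <T> T accept(ASTVisitor<T> v) {
        return v.visitBinOp(this);
    }
}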
Semantic Analysis
This is where the grammar is not enough. Yes, you can have 10 declarations of "i" in a row, but that doesn't mean the code makes any sense. Semantic analysis makes sure we're only compiling code which follows the context-dependent rules of the language we're compiling.
Name Analysis
This part makes sure all variables are declared before they're used, and marks up the AST with some data. The analysis is done using the Visitor design pattern, the king of compiler design patterns.
Each ASTNode has an accept method which accepts any and all visitors. Each visitor can return something from each node it accepts, and different visitors may return different types. Every visitor must have a method for every existing type of AST node. Here is some example code:
///////////ASTVisitor.java
public interface ASTVisitor<T> {
// decls
public T visitStructTypeDecl(StructTypeDecl st);
public T visitFunDecl(FunDecl p);
public T visitVarDecl(VarDecl vd);
// structures
public T visitBlock(Block b);
public T visitProgram(Program p);
// Stmts
public T visitExprStmt(ExprStmt exprStmt);
public T visitWhile(While w);
public T visitIf(If i);
public T visitAssign(Assign a);
public T visitReturn(Return r);
// Expression
public T visitVarExpr(VarExpr v);
public T visitFunCallExpr(FunCallExpr fce);
public T visitArrayAccessExpr(ArrayAccessExpr aae);
public T visitFieldAccessExpr(FieldAccessExpr fae);
public T visitValueAtExpr(ValueAtExpr vae);
public T visitSizeOfExpr(SizeOfExpr sizeOfExpr);
public T visitTypecastExpr(TypecastExpr typecastExpr);
public T visitIntLiteral(IntLiteral il);
public T visitStrLiteral(StrLiteral sl);
public T visitChrLiteral(ChrLiteral cl);
public T visitBinOp(BinOp bo);
public T visitOp(Op o);
// types
public T visitBaseType(BaseType bt);
public T visitPointerType(PointerType pt);
public T visitStructType(StructType st);
public T visitArrayType(ArrayType at);
}
/////////// ASTNode.java
public interface ASTNode {
public <T> T accept(ASTVisitor<T> v);
}
/////////// example name analysis visitor method
public class NameAnalysisVisitor extends BaseSemanticVisitor<Void> {
int errCount;
Scope currentScope;
...
...
...
@Override
public Void visitVarDecl(VarDecl vd) {
// check var is not declared already
// possibly shadow that declaration if it's in another scope
// first check that all is good with the type (say it's a struct: its type needs to have been declared)
vd.varType.accept(this);
Symbol s = currentScope.lookup(vd.varName);
if(s == null){
currentScope.put(new VarSymbol(vd));
}else {
// check that it doesn't exist in the current scope
Symbol s2 = currentScope.lookupCurrent(vd.varName);
if(s2 == null){
// good to go with shadowing
currentScope.put(new VarSymbol(vd));
} else {
error("Cannot declare variable, Identifier " + vd.varName + " is already defined.");
}
}
return null;
}
}
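The Scope class used above isn't shown; a minimal version covering just the methods the visitor calls might look like this (it assumes each Symbol exposes its name):
import java.util.HashMap;
import java.util.Map;
// scopes form a chain: lookup walks outwards through enclosing scopes,
// lookupCurrent checks this scope only (used for the shadowing check above)
public class Scope {
    private final Scope outer;
    private final Map<String, Symbol> symbolTable = new HashMap<>();
    public Scope(Scope outer) {
        this.outer = outer;
    }
    public Symbol lookup(String name) {
        Symbol s = lookupCurrent(name);
        if (s != null)
            return s;
        return (outer == null) ? null : outer.lookup(name);
    }
    public Symbol lookupCurrent(String name) {
        return symbolTable.get(name);
    }
    public void put(Symbol s) {
        symbolTable.put(s.name, s);
    }
}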
Type Analysis
The type analysis follows the exact same structure, but this time our visitor returns a Type and assigns a type to each expression (statements do not have types!).
This visitor also uses the data that the previous one marked up onto the AST.
An example type visitor method:
public class TypeCheckVisitor extends BaseSemanticVisitor<Type> {
...
...
...
@Override
public Type visitValueAtExpr(ValueAtExpr vae) {
Type ptr = vae.ptr.accept(this);
if(ptr != null && ptr.isPointerType()){ // guard against null from an earlier type error
return vae.type = ((PointerType)ptr).pointedToType;
} else {
error("Type mismatch, expected a pointer type but got: " + ptr + " as the lhs to the value at expression");
return null;
}
}
}
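Chaining the two passes is then just a matter of running the visitors over the AST in order. A sketch (how the parser is invoked and the error-count accessor are assumptions):
// run name analysis first, only type check if it found no errors
Program program = parser.parseProgram();
NameAnalysisVisitor nameAnalysis = new NameAnalysisVisitor();
program.accept(nameAnalysis);
if (nameAnalysis.getErrorCount() == 0) { // assumed accessor
    TypeCheckVisitor typeCheck = new TypeCheckVisitor();
    program.accept(typeCheck);
}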
After doing all this, and of course writing hundreds of unit tests, I was ready to move on to the last step...
Pretty Printing AST Trees
The AST in the thumbnail is output straight from the compiler: I used the DOT graph description language to generate these on the fly with yet another visitor class. It was quite simple, but damn does it look good!
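The idea is just one more ASTVisitor where every visit method emits a DOT node plus edges to its children and returns the node's name. A sketch of one such method (the class and field names here are illustrative, not the real ones):
import java.io.PrintWriter;
public class DotPrinterVisitor implements ASTVisitor<String> {
    private final PrintWriter out;
    private int nodeCounter = 0;
    public DotPrinterVisitor(PrintWriter out) {
        this.out = out;
    }
    @Override
    public String visitBinOp(BinOp bo) {
        // create a uniquely named node, then connect it to its children
        String id = "node" + nodeCounter++;
        out.println(id + " [label=\"BinOp(" + bo.op + ")\"];");
        out.println(id + " -> " + bo.lhs.accept(this) + ";");
        out.println(id + " -> " + bo.rhs.accept(this) + ";");
        return id;
    }
    // ... one method like this per node type, all wrapped in "digraph ast { ... }"
}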
Code generation
My compiler compiles to MIPS instructions. This is useful since we can run them anywhere on the MARS simulator, and MIPS is quite a simple instruction set. I set out my own calling convention, which is pretty similar to the standard one.
This step is by far the most complicated, and the hardest part was definitely staring at hundreds of lines of generated MIPS code trying to find mistakes. This is the module I wish I had more time for.
I generated the code in a single visitor pass, which is not ideal, as the code starts to turn into a bowl of spaghetti. But I was in too deep to change my approach, and by the time I realised how I could do this better it was too late to re-factor the entire code generator.
I had to stick with it.
Dealing with registers
The compiler outlined in the coursework doesn't do "register spilling" at all, so programs which need many registers at once simply will not compile. Of course, this compiler is not meant for the real world; it's nothing but a toy to learn about compilers!
Because of this, all the registers we are allowed to use are backed up to the stack at the start of every function definition as well as at every return.
This bloats the code insanely, but it makes sure we aren't overwriting registers between function calls, and it also lets every function use the entire set of registers.
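A register allocator in this scheme can afford to be dead simple: hand out free registers and fail hard when there are none left. Here's a sketch matching the getRegister/freeRegister calls you'll see below (the real class isn't shown):
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
// free-list allocator with no spilling: running out of registers is fatal
public class RegisterAllocator {
    private final Deque<Register> freeRegisters = new ArrayDeque<>();
    public RegisterAllocator(Collection<Register> usableRegisters) {
        freeRegisters.addAll(usableRegisters);
    }
    public Register getRegister() {
        if (freeRegisters.isEmpty())
            throw new RuntimeException("out of registers - this compiler does no spilling");
        return freeRegisters.pop();
    }
    public void freeRegister(Register register) {
        freeRegisters.push(register);
    }
}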
MIPS writer
I created a class dedicated just to printing out the code, and doing so in a pretty manner to aid debugging. It has methods such as writeAdd(rt, rd, rs, comment).
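Each such method is a thin wrapper that formats one instruction plus a comment. A sketch (only the writeAdd signature comes from the text above; everything else is an assumption, with the first register as the destination):
import java.io.PrintWriter;
public class MipsWriter {
    private final PrintWriter out;
    public MipsWriter(PrintWriter out) {
        this.out = out;
    }
    // one tiny formatting method per instruction, e.g.:
    public void writeAdd(Register rt, Register rd, Register rs, String comment) {
        out.printf("\tadd %s, %s, %s\t# %s%n", rt, rd, rs, comment);
    }
}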
Function call convention
The convention I am following is to:
- Use the $s1-$s3, $a and $t registers as temporaries
- Use the $s4-$s7 registers as "dumping grounds" for swap operations which cannot disturb temporaries
- Place all arguments above the frame pointer of the function being called, such that the argument closest to the frame pointer is the last argument
- Place all the other saved registers below the frame pointer at the start of each function definition
- Use $v0 for all return values
Storage directory
The following rules need to be respected:
- All global variables are stored in the .data segment
- All local variables are stored on the stack
- String literals are stored in the .data segment
- Char literals and integers are inlined into the instructions
I used "Memory" classes with Stack and Data sub-types which served like "scopes" but for the storage directory, they dealt with the offsets from the frame pointer, retrievals and stores.
Example
Let's see the function call visitor method:
public class CodeGenerator extends BaseASTVisitor<Register> {
...
...
...
@Override
public Register visitFunCallExpr(FunCallExpr fce) {
int idx = 0;
// the call declares parameters on the stack
//
// we need to store callee saved registers
// store frame pointer value just before pushing fp and sp
writer.writeCommentNl(WriteTarget.TEXT,"FUNCALL");
writer.writeCommentNl(WriteTarget.TEXT,"STORE SP");
Memory oldMemory = currMemory;
currMemory = new StackMemory(oldMemory, writer, registerAllocator,oldMemory.getStackWordSizeSoFar());
String labelSP = "$sp" + getUniqueNum();
StoreStackPointer(labelSP);
writer.writeCommentNl(WriteTarget.TEXT,"ARGUMENTS PUSH TO STACK");
for (Expr val : fce.args) {
// we can safely assume we're not going to be calling functions in the global scope in mini-c
VarDecl correspondingDecl = fce.funDecl.params.get(idx);
// declare (in the scope of the declaration)
// if it already exists this means that we're actually recursing!
// we have to "re-declare" it manually
if(currMemory.containsVariable(correspondingDecl)){
Register register = val.accept(this);
writer.writeSw(register, Register.sp, 0, "re-declare variable");
int extraStackExpansionWords = ((int)Math.ceil((float)correspondingDecl.varType.sizeOfType() / 4f));
currMemory.expandStack(extraStackExpansionWords);
registerAllocator.freeRegister(register);
} else {
currMemory.declareVariable(StorageDirectory.STACK, correspondingDecl, 1);
// fill value from parameter passed
Register register = val.accept(this);
currMemory.putVariable(StorageDirectory.STACK, correspondingDecl, register,null);
registerAllocator.freeRegister(register);
}
idx++;
}
writer.writeCommentNl(WriteTarget.TEXT,"CALL");
// perform jump
writer.writeJal(fce.funName, "call");
writer.writeCommentNl(WriteTarget.TEXT,"RETRIEVE REGS");
Register retReg = registerAllocator.getRegister();
writer.writeMove(retReg, Register.v0, "move register to permanent");
currMemory.retrieveRegister(labelSP, Register.sp);
currMemory = oldMemory;
return retReg;
}
}
As you can see, the code requires lots of attention to detail, and you can easily make a mistake that will cost you light-years in debugging.
Conclusion
But after a good while of hacking away and debugging, I finally managed to get a working compiler!
Seeing it pump out working assembly is honestly one of the most satisfying things I've ever experienced. I would highly recommend going through the pain at least once, just to get a better understanding of what's happening under the hood!
I will never look down on the design of a language again after seeing how difficult the process is!