目录

Project Summary:

This is a complete and functional compiler development project.

This compiler is built for a new language MicroC (or uC), which is a new language I developed. MicroC is a simplified version of C language, which combines the grammars from several different programming languages, such as Java, Python, and C. MicroC is an Object-oriented programming language, and it supports many critical components of a programming language, such as control structures (if, while), functions, type checking, explicit/implicit type conversions, arrays, pointers, etc.

This compiler development project uses a parser generator: ANTLR, to generate a parser in Java that can build and walk parse trees from MicroC grammar. Then, ANTLR can be configured to automatically generate ASTs (Abstract Syntax Tree). Finally, we perform a walk of the generated ASTs and generate Risc-V assembly codes.

Besides, this compiler also performs register allocation to minimize the number of loads and stores by doing basic block splitting and global/local variables liveness analysis.

Many testing uC programs are provided. To test this compiler by generating Risc-V assembly codes using these testing programs, please use the following codes. To learn the details of these codes, please refer to the ‘Build and Test the Compiler’ section.

$ make clean; make compiler
rm -rf build classes
mkdir build classes
antlr -o build MicroC.g4
javac -cp $CLASSPATH -d classes java/*.java build/*.java
$ ./runme AllTests/testsX/testY.uc out

Background of Compiler

The first step of a compiler is checking whether a program is valid: 1. Does it use the correct “vocabulary” (keywords, operators, variable names); 2. Does it use proper “grammar” (do for and while loops have the correct structure, etc.)? A compiler uses two tools to accomplish the validation check:

  1. A scanner that reads in a stream of characters and tokenizes them into the constituent words of the language – the keywords, operators, variable names, etc.
  2. A parser that consumes a stream of tokens (as identified by the scanner) into a parse tree, which is a representation of the structure of the program.

Example:

cur = cur + 4;

The scanner will convert the code above into 6 tokens:

IDENT(cur) OP(=) IDENT(cur) OP(+) LIT(4) SEMI

The parser will then convert this stream of tokens (generated by scanner) into a parse tree to capture the code’s structure:

parse tree for cur = cur + 4;

After parsing the program, the next step is to add semantic actions to the parser to build up an intermediate representation for the program. Two major intermediate representations are used by the compiler, Symbol Table and Abstract Syntax Tree (AST). The Symbol Table is used to record the information about all the symbols (variables, function names, etc.) in the program. The Abstract Syntax Tree is used to capture the structure of the program, which is a cleaned up form of the parse tree that provides a straightforward capture of the program structure (see java/ast for details).

Since the information in the Abstract Syntax Tree (AST) is associated with various nodes in the parse tree, we can pass partially constructed abstract syntax tree nodes, instead of passing information about a declaration. Please see the ‘Building an Abstract Syntax Tree (AST)’ section for details about the codes to build the AST.

Once Abstract Syntax Trees (AST) are generated, the final step is to perform a walk of the AST to generate Risc-V assembly codes. Please see the ‘Generating Risc-V Assembly Codes’ section for details.

Background of ANTLR

ANTLR (ANother Tool for Language Recognition) is a recursive descent parser generator tool, by reading grammar and token definitions. It can also be configured to automatically generate ASTs (Abstract Syntax Tree), which are hierarchical representations of the syntactic structure of a program.

ANTLR program has a .g4 file to define the structure of the programming language. In this project, the structure of MicroC is defined in java/MicroC.g4, the details of tokens and grammar definitions can be found on the ANTLR website.

More Information of developing the compiler

Examples of definitions in MicroC.g4:

Here are the variable declaration and function declaration in MicroC.g4 as examples:

  1. Variable Declaration:
var_decl : type ident ';' {st.addVariable($type.t, $ident.text);};
  1. Function Declaration:
func_decl : func_type ident '(' params ')' ';' {st.addFunction($func_type.t, $ident.text, $params.types);};

Building an Abstract Syntax Tree (AST)

In java/MicroC.g4, we use the following codes to construct AST Nodes for primary expressions. The primary returns an ExpressionNode, which is an abstract AST Node type that has subclasses for particular primary expressions.

primary returns [ExpressionNode node] : ident {$node = new VarNode($ident.text);}
        | '(' expr ')' {$node = $expr.node;}
        | unaryminus_expr {$node = $unaryminus_expr.node;}
        | il = INT_LITERAL {$node = new IntLitNode($il.text);};

In MicroC.g4, we use the following codes as a structure to build up the lists of statements. If statements is rewritten to lambda, we create an “empty” StatementListNode. If statements is rewritten to statement statements, we create a new StatementListNode by combining the StatementNode returned by statement with the StatementListNode returned by `statements.

statements returns [StatementListNode node] : statement s=statements {$node = new StatementListNode($statement.node, $s.node);}
            | /* empty */ {$node = new StatementListNode();};

statement returns [StatementNode node] : base_stmt ';' {$node = $base_stmt.node;}

Here are two examples of building up the lists of statements in MicroC.g4: if/Else statement and while statement.

  1. If/Else Statement:
if_stmt returns [IfStatementNode node] : 'if' '(' cond ')' '{' statements '}' else_stmt {$node = new IfStatementNode($cond.node, $statements.node, $else_stmt.node);};

else_stmt returns [StatementListNode node] : 'else' '{' statements '}' {$node = $statements.node;}
    | /* empty */ {$node = new StatementListNode();};
  1. While Statement:
while_stmt returns [WhileNode node] : 'while' '(' cond ')' '{' statements '}' {$node = new WhileNode($cond.node, $statements.node);};

Generating Risc-V Assembly Codes

The java/ast/visitor/AbstractASTVisitor class is provided to simplify the task of walking the AST. The preprocess and postprocess methods for the AST nodes need to be defined in any visitor (see java/ast/visitor/PrintVisitor as an example).

Then, the java/assembly/CodeGenerator class is used to build up CodeObjects, which is a structure that holds code. CodeObject holds the complete code for a particular AST sub-tree. To process a node in an AST, the CodeObjects from its children nodes need to be combined to form a new CodeObject. Starting this combination process from the head node of an AST, a new CodeObject can be generated to represent the entire AST.

Each Risc-V instruction is defined in a class in the java/assembly/instructions folder. Each CodeObject contains a list of these instructions. Printing this list out will print out the assembly codes.

Build and Test the Compiler

Environment

Please refer to the environment documentations to setup the environment.

Build Compiler

The following codes show how to use ANTLR to translate the MicroC.g4 file to generate a parser:

$ make
using Java
rm -rf build
mkdir build
antlr -o build MicroC.g4

After running the above codes, the following Java files will be generated, which include a Scanner class (MicroCLexer.java) and a Parser class (MicroCParser.java).

MicroC.interp        MicroCLexer.interp    MicroCListener.java
MicroC.tokens        MicroCLexer.java    MicroCParser.java
MicroCBaseListener.java    MicroCLexer.tokens

The following codes in java/compiler/Compiler.java show how to use the Scanner to turn an input stream into a stream of tokens; and then how to pass this stream of tokens to the Parser to build a parse tree.

MicroCLexer lexer = new MicroCLexer(CharStreams.fromFileName(args[0]));

MicroCParser parser = new MicroCParser(new CommonTokenStream(lexer));

ParseTree pt = parser.program();

The following code uses java/compiler/Compiler.java to compile the parser and lexer together to create a program. This program is used to parse any program and check if it matches the grammar.

javac -cp $CLASSPATH -d classes java/*.java build/*.java

Test Compiler

Many testing uC programs are provided. You may test this compiler by generating Risc-V assembly codes using these programs. To simplify the testing, a shell script called runme is provided.

runme is a shell script that runs the Scanner. This script takes two arguments: 1. the input file name; 2. the output file name. Please make sure to run make clean; make compiler before running runme.

There are 7 groups of tests in 7 different folders. Each group contains several tests that can be run. To run the Y test in the X group of tests, use the command below:

$ ./runme AllTests/testsX/testY.uc out

After the assembly files are generated by runme, the RISC simulator can be used to run the assembly codes. After setting up the container environment, the simulator file path should be ~/RiscSim/driver.py.

If you don’t want to generate the assembly files by yourself, I have provided the assembly files corresponding to each testing program in the AllOutputs folder. You may run the assembly file by the command below:

python3 ~/RiscSim/driver.py [assembly file]

Thank you for Exploring my Compiler project!!!

Hooray!!! You have finished reading this README document!

I sincerely hope you enjoyed exploring this compiler! Hope you have learned something new!

If you have any questions or feedback, please reach out to me at: yfzhou23@gmail.com

邀请码
    Gitlink(确实开源)
  • 加入我们
  • 官网邮箱:gitlink@ccf.org.cn
  • QQ群
  • QQ群
  • 公众号
  • 公众号

版权所有:中国计算机学会技术支持:开源发展技术委员会
京ICP备13000930号-9 京公网安备 11010802032778号