This is a complete and functional compiler development project.
This compiler is built for a new language MicroC (or uC), which is a new language I developed. MicroC is a simplified version of C language, which combines the grammars from several different programming languages, such as Java, Python, and C. MicroC is an Object-oriented programming language, and it supports many critical components of a programming language, such as control structures (if, while), functions, type checking, explicit/implicit type conversions, arrays, pointers, etc.
This compiler development project uses a parser generator: ANTLR, to generate a parser in Java that can build and walk parse trees from MicroC grammar. Then, ANTLR can be configured to automatically generate ASTs (Abstract Syntax Tree). Finally, we perform a walk of the generated ASTs and generate Risc-V assembly codes.
Besides, this compiler also performs register allocation to minimize the number of loads and stores by doing basic block splitting and global/local variables liveness analysis.
Many testing uC programs are provided. To test this compiler by generating Risc-V assembly codes using these testing programs, please use the following codes. To learn the details of these codes, please refer to the ‘Build and Test the Compiler’ section.
$ make clean; make compiler
rm -rf build classes
mkdir build classes
antlr -o build MicroC.g4
javac -cp $CLASSPATH -d classes java/*.java build/*.java
$ ./runme AllTests/testsX/testY.uc out
Background of Compiler
The first step of a compiler is checking whether a program is valid: 1. Does it use the correct “vocabulary” (keywords, operators, variable names); 2. Does it use proper “grammar” (do for and while loops have the correct structure, etc.)? A compiler uses two tools to accomplish the validation check:
A scanner that reads in a stream of characters and tokenizes them into the constituent words of the language – the keywords, operators, variable names, etc.
A parser that consumes a stream of tokens (as identified by the scanner) into a parse tree, which is a representation of the structure of the program.
Example:
cur = cur + 4;
The scanner will convert the code above into 6 tokens:
IDENT(cur) OP(=) IDENT(cur) OP(+) LIT(4) SEMI
The parser will then convert this stream of tokens (generated by scanner) into a parse tree to capture the code’s structure:
After parsing the program, the next step is to add semantic actions to the parser to build up an intermediate representation for the program. Two major intermediate representations are used by the compiler, Symbol Table and Abstract Syntax Tree (AST). The Symbol Table is used to record the information about all the symbols (variables, function names, etc.) in the program. The Abstract Syntax Tree is used to capture the structure of the program, which is a cleaned up form of the parse tree that provides a straightforward capture of the program structure (see java/ast for details).
Since the information in the Abstract Syntax Tree (AST) is associated with various nodes in the parse tree, we can pass partially constructed abstract syntax tree nodes, instead of passing information about a declaration. Please see the ‘Building an Abstract Syntax Tree (AST)’ section for details about the codes to build the AST.
Once Abstract Syntax Trees (AST) are generated, the final step is to perform a walk of the AST to generate Risc-V assembly codes. Please see the ‘Generating Risc-V Assembly Codes’ section for details.
ANTLR (ANother Tool for Language Recognition) is a recursive descent parser generator tool, by reading grammar and token definitions. It can also be configured to automatically generate ASTs (Abstract Syntax Tree), which are hierarchical representations of the syntactic structure of a program.
ANTLR program has a .g4 file to define the structure of the programming language. In this project, the structure of MicroC is defined in java/MicroC.g4, the details of tokens and grammar definitions can be found on the ANTLR website.
More Information of developing the compiler
Examples of definitions in MicroC.g4:
Here are the variable declaration and function declaration in MicroC.g4 as examples:
Variable Declaration:
var_decl : type ident ';' {st.addVariable($type.t, $ident.text);};
In java/MicroC.g4, we use the following codes to construct AST Nodes for primary expressions. The primary returns an ExpressionNode, which is an abstract AST Node type that has subclasses for particular primary expressions.
primary returns [ExpressionNode node] : ident {$node = new VarNode($ident.text);}
| '(' expr ')' {$node = $expr.node;}
| unaryminus_expr {$node = $unaryminus_expr.node;}
| il = INT_LITERAL {$node = new IntLitNode($il.text);};
In MicroC.g4, we use the following codes as a structure to build up the lists of statements. If statements is rewritten to lambda, we create an “empty” StatementListNode. If statements is rewritten to statement statements, we create a new StatementListNode by combining the StatementNode returned by statement with the StatementListNode returned by `statements.
The java/ast/visitor/AbstractASTVisitor class is provided to simplify the task of walking the AST. The preprocess and postprocess methods for the AST nodes need to be defined in any visitor (see java/ast/visitor/PrintVisitor as an example).
Then, the java/assembly/CodeGenerator class is used to build up CodeObjects, which is a structure that holds code. CodeObject holds the complete code for a particular AST sub-tree. To process a node in an AST, the CodeObjects from its children nodes need to be combined to form a new CodeObject. Starting this combination process from the head node of an AST, a new CodeObject can be generated to represent the entire AST.
Each Risc-V instruction is defined in a class in the java/assembly/instructions folder. Each CodeObject contains a list of these instructions. Printing this list out will print out the assembly codes.
The following codes show how to use ANTLR to translate the MicroC.g4 file to generate a parser:
$ make
using Java
rm -rf build
mkdir build
antlr -o build MicroC.g4
After running the above codes, the following Java files will be generated, which include a Scanner class (MicroCLexer.java) and a Parser class (MicroCParser.java).
The following codes in java/compiler/Compiler.java show how to use the Scanner to turn an input stream into a stream of tokens; and then how to pass this stream of tokens to the Parser to build a parse tree.
MicroCLexer lexer = new MicroCLexer(CharStreams.fromFileName(args[0]));
MicroCParser parser = new MicroCParser(new CommonTokenStream(lexer));
ParseTree pt = parser.program();
The following code uses java/compiler/Compiler.java to compile the parser and lexer together to create a program. This program is used to parse any program and check if it matches the grammar.
Many testing uC programs are provided. You may test this compiler by generating Risc-V assembly codes using these programs. To simplify the testing, a shell script called runme is provided.
runme is a shell script that runs the Scanner. This script takes two arguments: 1. the input file name; 2. the output file name. Please make sure to run make clean; make compiler before running runme.
There are 7 groups of tests in 7 different folders. Each group contains several tests that can be run. To run the Y test in the X group of tests, use the command below:
$ ./runme AllTests/testsX/testY.uc out
After the assembly files are generated by runme, the RISC simulator can be used to run the assembly codes. After setting up the container environment, the simulator file path should be ~/RiscSim/driver.py.
If you don’t want to generate the assembly files by yourself, I have provided the assembly files corresponding to each testing program in the AllOutputs folder. You may run the assembly file by the command below:
python3 ~/RiscSim/driver.py [assembly file]
Thank you for Exploring my Compiler project!!!
Hooray!!! You have finished reading this README document!
I sincerely hope you enjoyed exploring this compiler! Hope you have learned something new!
If you have any questions or feedback, please reach out to me at: yfzhou23@gmail.com
Project Summary:
This is a complete and functional compiler development project.
This compiler is built for a new language
MicroC(oruC), which is a new language I developed.MicroCis a simplified version of C language, which combines the grammars from several different programming languages, such as Java, Python, and C.MicroCis an Object-oriented programming language, and it supports many critical components of a programming language, such as control structures (if, while), functions, type checking, explicit/implicit type conversions, arrays, pointers, etc.This compiler development project uses a parser generator: ANTLR, to generate a parser in Java that can build and walk parse trees from
MicroCgrammar. Then, ANTLR can be configured to automatically generate ASTs (Abstract Syntax Tree). Finally, we perform a walk of the generated ASTs and generate Risc-V assembly codes.Besides, this compiler also performs register allocation to minimize the number of loads and stores by doing basic block splitting and global/local variables liveness analysis.
Many testing
uCprograms are provided. To test this compiler by generating Risc-V assembly codes using these testing programs, please use the following codes. To learn the details of these codes, please refer to the ‘Build and Test the Compiler’ section.Background of Compiler
The first step of a compiler is checking whether a program is valid: 1. Does it use the correct “vocabulary” (keywords, operators, variable names); 2. Does it use proper “grammar” (do for and while loops have the correct structure, etc.)? A compiler uses two tools to accomplish the validation check:
Example:
The scanner will convert the code above into 6 tokens:
The parser will then convert this stream of tokens (generated by scanner) into a parse tree to capture the code’s structure:
After parsing the program, the next step is to add semantic actions to the parser to build up an intermediate representation for the program. Two major intermediate representations are used by the compiler, Symbol Table and Abstract Syntax Tree (AST). The Symbol Table is used to record the information about all the symbols (variables, function names, etc.) in the program. The Abstract Syntax Tree is used to capture the structure of the program, which is a cleaned up form of the parse tree that provides a straightforward capture of the program structure (see
java/astfor details).Since the information in the Abstract Syntax Tree (AST) is associated with various nodes in the parse tree, we can pass partially constructed abstract syntax tree nodes, instead of passing information about a declaration. Please see the ‘Building an Abstract Syntax Tree (AST)’ section for details about the codes to build the AST.
Once Abstract Syntax Trees (AST) are generated, the final step is to perform a walk of the AST to generate Risc-V assembly codes. Please see the ‘Generating Risc-V Assembly Codes’ section for details.
Background of ANTLR
ANTLR (ANother Tool for Language Recognition) is a recursive descent parser generator tool, by reading grammar and token definitions. It can also be configured to automatically generate ASTs (Abstract Syntax Tree), which are hierarchical representations of the syntactic structure of a program.
ANTLR program has a
.g4file to define the structure of the programming language. In this project, the structure ofMicroCis defined injava/MicroC.g4, the details of tokens and grammar definitions can be found on the ANTLR website.More Information of developing the compiler
Examples of definitions in
MicroC.g4:Here are the variable declaration and function declaration in
MicroC.g4as examples:Building an Abstract Syntax Tree (AST)
In
java/MicroC.g4, we use the following codes to construct AST Nodes forprimaryexpressions. Theprimaryreturns anExpressionNode, which is an abstract AST Node type that has subclasses for particularprimaryexpressions.In
MicroC.g4, we use the following codes as a structure to build up the lists of statements. Ifstatementsis rewritten to lambda, we create an “empty”StatementListNode. Ifstatementsis rewritten tostatement statements, we create a newStatementListNodeby combining theStatementNodereturned bystatementwith theStatementListNodereturned by `statements.Here are two examples of building up the lists of statements in
MicroC.g4:if/Elsestatement andwhilestatement.Generating Risc-V Assembly Codes
The
java/ast/visitor/AbstractASTVisitorclass is provided to simplify the task of walking the AST. Thepreprocessandpostprocessmethods for the AST nodes need to be defined in any visitor (seejava/ast/visitor/PrintVisitoras an example).Then, the
java/assembly/CodeGeneratorclass is used to build upCodeObjects, which is a structure that holds code.CodeObjectholds the complete code for a particular AST sub-tree. To process a node in an AST, theCodeObjects from its children nodes need to be combined to form a newCodeObject. Starting this combination process from the head node of an AST, a newCodeObjectcan be generated to represent the entire AST.Each Risc-V instruction is defined in a class in the
java/assembly/instructionsfolder. EachCodeObjectcontains a list of these instructions. Printing this list out will print out the assembly codes.Build and Test the Compiler
Environment
Please refer to the environment documentations to setup the environment.
Build Compiler
The following codes show how to use ANTLR to translate the
MicroC.g4file to generate a parser:After running the above codes, the following Java files will be generated, which include a Scanner class (
MicroCLexer.java) and a Parser class (MicroCParser.java).The following codes in
java/compiler/Compiler.javashow how to use the Scanner to turn an input stream into a stream of tokens; and then how to pass this stream of tokens to the Parser to build a parse tree.The following code uses
java/compiler/Compiler.javato compile the parser and lexer together to create a program. This program is used to parse any program and check if it matches the grammar.Test Compiler
Many testing uC programs are provided. You may test this compiler by generating Risc-V assembly codes using these programs. To simplify the testing, a shell script called
runmeis provided.runmeis a shell script that runs the Scanner. This script takes two arguments: 1. the input file name; 2. the output file name. Please make sure to runmake clean; make compilerbefore runningrunme.There are 7 groups of tests in 7 different folders. Each group contains several tests that can be run. To run the
Ytest in theXgroup of tests, use the command below:After the assembly files are generated by
runme, theRISCsimulator can be used to run the assembly codes. After setting up the container environment, the simulator file path should be~/RiscSim/driver.py.If you don’t want to generate the assembly files by yourself, I have provided the assembly files corresponding to each testing program in the
AllOutputsfolder. You may run the assembly file by the command below:Thank you for Exploring my Compiler project!!!
Hooray!!! You have finished reading this
READMEdocument!I sincerely hope you enjoyed exploring this compiler! Hope you have learned something new!
If you have any questions or feedback, please reach out to me at:
yfzhou23@gmail.com