IR - Lightweight JIT Compilation Framework

The IR Framework is a practical solution for implementing a JIT in medium-size projects. It defines an Intermediate Representation (IR), provides a simple API for IR construction, and supplies a set of algorithms for optimization, scheduling, register allocation, and code generation. The generated in-memory code may be executed directly.
A presentation about the IR Framework design and implementation details is available at ResearchGate.
IR has been used as the basis of the PHP JIT since PHP 8.4. It is also used as a back-end for the experimental Rational C Compiler.
Note, however, that this is not a stable, finished product yet. It is still under development.
IR - Intermediate Representation
The Framework uses a single medium-level Intermediate Representation during all phases of optimization, register allocation, and code generation. It is inspired by the Sea-of-Nodes IR introduced by Cliff Click [1]. Sea-of-Nodes is used in the Java HotSpot Server Compiler, the V8 TurboFan JavaScript compiler, the Java Graal compiler, and others.
This representation unifies data and control dependencies into a single graph, where each instruction is represented as a Node and each dependency as an Edge between Nodes. There is no classical CFG (Control Flow Graph) with Basic Blocks. Instead, IR uses special Control Nodes that start and finish code Regions. The data part of the IR is very similar to SSA (Static Single Assignment) form. Each value may be assigned only once, but unlike SSA, our IR has no variables, variable versions, or names. Everything is represented by computation Nodes and the Edges between them. Of course, there is a special PHI Node that represents the SSA Phi() function.
Internally, our graph-based IR is represented as a plain two-way growable array of Nodes. Dependency Edges are represented as indexes of the other Node. This physical representation almost completely repeats the LuaJIT IR designed by Mike Pall [3].
IR Generation
The IR Framework provides a simple IR Builder API (ir_builder.h). It is implemented as a set of C preprocessor macros, where each macro creates a corresponding IR node and ties it to its sources.
IR Optimization
In comparison to classical optimizing compilers (like GCC and LLVM), the IR Framework uses a very short optimization pipeline. Together with the compact IR representation, this makes it extremely fast and allows it to generate quite good machine code in reasonable time.
Folding
Folding is done on the fly, during IR generation. It performs a set of local transformations, but because of the graph nature of the IR, where most data operations (like ADD) are "floating" (not "pinned" to a Basic Block), the scope of the transformations is not limited to a single Basic Block. It is important to generate IR in a proper (Reverse Post) order that emits all Nodes before their first use. (With a different order, the scope of the folding would have to be limited.)
All the folding rules are written in a declarative style in a single file, ir_fold.h.
The Folding Engine performs Constant Folding, Copy Propagation, Algebraic Simplification, Algebraic Re-Association, and Common Sub-Expression Elimination. The simple and fast declarative implementation is borrowed from LuaJIT [3].
Sparse Conditional Constant Propagation
This pass implements the classical algorithm originally designed by M. N. Wegman and F. K. Zadeck [4] for SSA form. The unification of data and control dependencies makes its implementation even simpler. Besides constant propagation itself, this pass also performs global Copy Propagation and re-applies the folding rules. At the end, all "dead" instructions (those whose results go unused) are replaced with NOPs.
Global Code Motion
Now we have to "fix" the places of the "floating" instructions. This pass builds a CFG (Control Flow Graph) skeleton and then "pins" each "floating" instruction to the best Basic Block. The algorithm was developed by Cliff Click [2].
Local Scheduling
As the final IR transformation pass, we reorder the instructions inside each Basic Block to satisfy their dependencies. Currently this is done by a simple topological sort.
Target Instruction Selection
This is the first target-dependent step of compilation. It aims to combine instruction Nodes into tiles that allow better instruction fusion. For example, 10 + a + b * 4 may be calculated by a single x86 instruction: lea 10(%eax, %ebx, 4), %ecx. The selection is done by constrained tree pattern matching. The current implementation uses a simple Max-Munch approach. (This may be replaced by a smarter BURS method.)
Register Allocation
A CPU-independent implementation of Linear Scan Register Allocation for SSA form with second-chance bin-packing [5] [6].
Machine Code Generation
The IR Framework implements x86_64, x86, and AArch64 back-ends. The current implementation uses DynASM. (In the future, this should be replaced with a faster "binary" encoder.) The code generator walks through all instructions of each basic block and emits code according to the "rules" selected during the instruction selection pass. It uses the registers chosen by the register allocator and inserts the necessary spill load/store and SSA deconstruction code.
Tooling
- Ability to load and save IR in a textual form
- Ability to visualize the IR graph through graphviz dot
- Target CPU disassembler for the generated code (uses libcapstone)
- GDB/JIT interface to allow debugging of JIT-ed code
- Linux perf interface to analyze the code performance
Building and Playing with IR
The IR Framework is under active development and doesn't provide any stable libraries yet. If you would like to use IR, it is better to embed the necessary sources into your project (like PHP does).
However, we provide a simple driver that may be built to run tests and play with IR.
git clone https://github.com/dstogov/ir.git
cd ir
make
make test
./ir bench/mandelbrot.ir --run
./ir bench/mandelbrot.ir -S
./ir --help
IR Example
The complete example described below may be found in examples/mandelbrot.c and built using make examples.
It generates code for the following C function:
int32_t mandelbrot(double x, double y)
{
    double cr = y - 0.5;
    double ci = x;
    double zi = 0.0;
    double zr = 0.0;
    int i = 0;
    while (1) {
        i++;
        double temp = zr * zi;
        double zr2 = zr * zr;
        double zi2 = zi * zi;
        zr = zr2 - zi2 + cr;
        zi = temp + temp + ci;
        if (zi2 + zr2 > 16)
            return i;
        if (i > 1000)
            return 0;
    }
}
This is done through the IR Builder API by the following code:
LLVM interoperability
IR is partially compatible with LLVM. It is possible to convert IR into LLVM and then compile it. It is also possible to read an LLVM file and convert it to IR, but this requires an LLVM loader, so IR has to be rebuilt with LLVM support. You may then compile a C file into LLVM and convert it to IR. Note that the LLVM code produced by clang is already optimized; for benchmarking, it may be more honest to avoid LLVM optimizations.
Performance
The following table shows the benchmark execution times in comparison to the same benchmarks compiled by gcc -O2 (higher is better). The C benchmarks were compiled by clang into LLVM code (without any LLVM optimizations; only SSA construction is needed) and then loaded, compiled, and executed by the IR Framework.
Benchmark        Performance relative to GCC -O2 (higher is better)
array            0.93
binary-trees     0.99
funnkuch-reduce  1.01
hash             1.13
hash2            0.87
heapsort         1.03
lists            0.98
matrix           1.12
method-call      1.00
mandelbrot       0.95
nbody            0.98
sieve            0.93
spectral-norm    0.94
strcat           0.97
oggenc           0.88
minilua          0.99
gzip             0.80
bzip2            0.81
AVERAGE          0.96
GEOMEAN          0.96
Most of the benchmarks are very simple (a few screens of code), but oggenc, minilua, gzip, and bzip2 are real applications.
As you can see, IR produces code that is on average 5% slower than GCC -O2, ~20% slower in the worst case, and on some benchmarks even faster. The chart shows the same data graphically.
The next table shows the compilation speed relative to gcc -O2 (higher is better).
Benchmark   Compilation speed relative to GCC -O2 (higher is better)
oggenc      35.22
minilua     40.50
gzip        37.50
bzip2       49.67
AVERAGE     40.69
GEOMEAN     40.34
This comparison is not completely fair, because GCC compiles C source while IR takes precompiled LLVM asm.
Still, the IR Framework produces code that is on average only 5% slower, but does so up to ~40 times faster.
PHP JIT based on IR
The IR-based JIT is integrated into PHP 8.4.0 and above. See JIT in action.
References
[1] C. Click, M. Paleczny. "A Simple Graph-Based Intermediate Representation." In ACM SIGPLAN Workshop on Intermediate Representations (IR '95), pages 35-49, Jan. 1995.
[2] C. Click. "Global Code Motion / Global Value Numbering." In ACM SIGPLAN Notices, Volume 30, Issue 6, pages 246-257, June 1995.
[3] M. Pall. "LuaJIT 2.0 SSA IR." http://wiki.luajit.org/SSA-IR-2.0
[4] M. N. Wegman and F. K. Zadeck. "Constant Propagation with Conditional Branches." ACM Transactions on Programming Languages and Systems, 13(2):181-210, April 1991.
[5] C. Wimmer. "Optimized Interval Splitting in a Linear Scan Register Allocator." In VEE '05: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, pages 132-141, June 2005.
[6] C. Wimmer and M. Franz. "Linear Scan Register Allocation on SSA Form." In CGO '10: Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 170-179, April 2010.