triton-shared
A shared middle-layer for the Triton Compiler.
The middle layer is not yet complete, but it has enough functionality to demonstrate how it can work. The general idea is that Triton IR is lowered into MLIR core dialects so that it can be shared across Triton targets and so that back-ends can be shared with other languages.
The basic intended architecture looks like this:
[Triton IR] -> [Middle Layer] -> [HW specific IR]
The middle layer uses MLIR's Linalg and Tensor dialects for operations on Triton block values. Operations on Triton pointers use the Memref dialect.
Motivation
This talk at the 2023 Triton Developer Conference gives some background on the project and its goals.
Usage
This repo now includes triton as a submodule and builds as an out-of-tree backend.
To build this repo, clone triton-shared into a folder called triton_shared (note the underscore).
Triton will use this folder name to create a module under triton.runtime for the reference CPU backend.
You need to set the TRITON_PLUGIN_DIRS environment variable to the location of your triton-shared directory so that triton can find it, for example: export TRITON_PLUGIN_DIRS=/path/to/triton_shared.
The resulting triton-shared binaries will be placed under triton/build/{current_cmake_version}/third_party/triton_shared.
1. Stand-Alone
The middle layer can be used as a stand-alone component to convert the Triton dialect to the middle-layer dialects. This is intended for testing and validation purposes, but it could also be used before sending the IR to another MLIR compiler.
Stand-alone example:
triton-shared-opt --triton-to-linalg %file
2. Backend Component
The Triton middle layer is intended to be used as a component in a Triton back-end. This can be accomplished by adding the cmake targets it produces, along with its header files, to that back-end. An example back-end will be published at a later date.
3. Reference CPU Backend
We also include an experimental reference CPU backend that leverages existing MLIR passes. After building, the CPU backend can be used by setting triton's active driver:
import triton
from triton.backends.triton_shared.driver import CPUDriver
triton.runtime.driver.set_active(CPUDriver())
For more examples, please refer to python/examples.
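For instance, a minimal end-to-end run on the CPU backend might look like the sketch below; the kernel, the sizes, and the use of torch CPU tensors are illustrative assumptions rather than part of the backend's API, and python/examples contains the maintained versions.
import torch
import triton
import triton.language as tl
from triton.backends.triton_shared.driver import CPUDriver

# Route compilation and kernel launches through the reference CPU backend.
triton.runtime.driver.set_active(CPUDriver())

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    x = tl.load(x_ptr + offsets)
    y = tl.load(y_ptr + offsets)
    tl.store(out_ptr + offsets, x + y)

# Plain CPU tensors; 1024 is an exact multiple of BLOCK_SIZE, so this sketch skips the usual bounds mask.
x = torch.rand(1024)
y = torch.rand(1024)
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 256),)
add_kernel[grid](x, y, out, BLOCK_SIZE=256)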
Implementation details
Even though a valid triton program can load from and store to arbitrary memory locations, the prototype only supports lowering programs that have structured memory access patterns.
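As a rough illustration of that restriction, the two hypothetical kernels below differ only in how addresses are formed: the first loads a contiguous block and fits the structured pattern, while the second gathers through indices loaded at runtime, which cannot be described as a strided access.
import triton
import triton.language as tl

@triton.jit
def structured_copy(x_ptr, out_ptr, BLOCK: tl.constexpr):
    offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    # Contiguous block: a structured access the prototype can lower.
    x = tl.load(x_ptr + offsets)
    tl.store(out_ptr + offsets, x)

@triton.jit
def gather_copy(idx_ptr, x_ptr, out_ptr, BLOCK: tl.constexpr):
    offsets = tl.arange(0, BLOCK)
    idx = tl.load(idx_ptr + offsets)  # integer indices read from memory
    # Data-dependent addresses: not a structured access pattern.
    x = tl.load(x_ptr + idx)
    tl.store(out_ptr + offsets, x)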
Analyses
As part of the conversion process, there are three important analyses:
Pointer analysis:
This analysis is responsible for extracting structured memory access patterns from a triton program's loads and stores; it walks the IR and visits the relevant instructions to build strided memory accesses in the memref dialect. The analysis is still in its early stages and does not support all scenarios.
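For example, given pointer arithmetic like the hypothetical tile load below, the analysis can recover the base pointer, the per-dimension strides (stride_row and 1), and the block sizes (BLOCK_M, BLOCK_N), which map directly onto a strided memref access.
import triton
import triton.language as tl

@triton.jit
def load_tile(x_ptr, out_ptr, stride_row, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    rows = tl.arange(0, BLOCK_M)
    cols = tl.arange(0, BLOCK_N)
    # Address = base + row * stride_row + col * 1: a 2-D strided access.
    ptrs = x_ptr + rows[:, None] * stride_row + cols[None, :]
    tile = tl.load(ptrs)
    tl.store(out_ptr + rows[:, None] * BLOCK_N + cols[None, :], tile)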
Use analysis:
After Pointer analysis runs, instructions that are part of memory address calculation are no longer necessary in a triton program because their semantics have been captured by memref operations representing strided memory accesses. To remove these instructions safely, we perform Use analysis to mark whether each instruction is used only in address calculation (MetaUse) or in both address calculation and data manipulation (MixedUse). MixedUse operations are cloned and their users adjusted accordingly, with the goal of separating out the MetaUse ops so that they can be safely deleted.
Mask analysis:
This analysis is responsible for handling masked loads and stores.
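The canonical case is the bounds mask guarding a final partial block, as in the hypothetical kernel below; the analysis derives the in-bounds extent from the comparison so the masked load and store can be lowered to accesses over only the valid portion of the block.
import triton
import triton.language as tl

@triton.jit
def masked_copy(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    # The comparison below is what Mask analysis inspects to compute how many
    # elements of this block are actually in bounds.
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    tl.store(out_ptr + offsets, x, mask=mask)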
Conversion strategy
We introduce the TritonToLinalg pass that converts the triton dialect to the linalg dialect on tensors. This means the resulting IR is fully compatible with linalg tiling and fusion transformation passes. As described for Pointer analysis above, we do, however, have to deal with memref instructions at the load and store boundaries and convert them to tensors using bufferization.to_tensor. Important details to note:
tt.load (together with all of its related address calculation instructions such as tt.addptr and tt.splat) are lowered to a combination of memref.reinterpret_cast, memref.alloc, and memref.copy. After the initialization of the local buffer, we convert the memref back to a tensor using bufferization.to_tensor; this op is automatically removed during bufferization.
tt.store lowers to a combination of memref.reinterpret_cast and either affine.store or memref.tensor_store:
%reinterpret_cast = memref.reinterpret_cast %arg2 to offset: [...] memref<*xf32> to memref<1024xf32>
%extracted_slice = tensor.extract_slice %15[0] [%21] [1] : tensor<1024xf32> to tensor<?xf32>
%subview = memref.subview %reinterpret_cast[0] [%21] [1] : memref<1024xf32> to memref<?xf32>
bufferization.materialize_in_destination %extracted_slice in writable %subview
element-wise arith and math operators are converted to their corresponding linalg.generic versions.
tt.dot becomes linalg.matmul.
tt.reduce becomes linalg.reduce; known limitation: only addf and maxf reductions are supported in the reduction body for now (see the sketch below).
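As a sketch of those last two lowerings, a hypothetical kernel like the following exercises both: the tl.dot becomes a linalg.matmul and the tl.sum reduction becomes a linalg.reduce with an addf body.
import triton
import triton.language as tl

@triton.jit
def dot_then_rowsum(a_ptr, b_ptr, out_ptr, BLOCK: tl.constexpr):
    # BLOCK is assumed to be at least 16 so that tl.dot is legal.
    offs = tl.arange(0, BLOCK)
    a = tl.load(a_ptr + offs[:, None] * BLOCK + offs[None, :])
    b = tl.load(b_ptr + offs[:, None] * BLOCK + offs[None, :])
    c = tl.dot(a, b)             # lowers to linalg.matmul
    row_sum = tl.sum(c, axis=1)  # lowers to linalg.reduce (addf body)
    tl.store(out_ptr + offs, row_sum)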
Testing
The prototype was tested on the Triton tutorial kernel examples.
The Python tests are set up to run with pytest; running them requires setting a few environment variables.
In addition to testing on the tutorial kernels, there are many lit tests covering various scenarios.
Intermediate Representation (IR) Dumps
To facilitate debugging and analysis, the triton-shared project now supports emitting all intermediate representations (IRs) generated during the compilation process. This functionality is controlled via the environment variable TRITON_SHARED_DUMP_PATH.
How It Works
By setting the TRITON_SHARED_DUMP_PATH environment variable, you specify a directory where all intermediate representations will be saved. The Triton compiler will emit IR dumps at various stages of compilation into the specified folder, allowing developers to inspect and analyze the transformations applied to the code.
How to Use
Create a directory where the IR dumps will be stored (e.g., /path/to/dump_dir).
Set the TRITON_SHARED_DUMP_PATH environment variable to the directory path:
export TRITON_SHARED_DUMP_PATH=/path/to/dump_dir
Run your Triton compilation as usual. The compiler will emit IR dumps into the specified directory.
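Assuming the variable is read at compile time in the same process, you can also set it from Python before the first kernel is compiled; the path below is just an example.
import os
# Must be set before any Triton kernel is compiled in this process.
os.environ["TRITON_SHARED_DUMP_PATH"] = "/tmp/ir_dumps"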
Example
Suppose your dump directory is /tmp/ir_dumps. Before running your code, set the environment variable:
export TRITON_SHARED_DUMP_PATH=/tmp/ir_dumps
After the compilation process completes, you can explore the /tmp/ir_dumps directory to find all the intermediate representation files.
$ ls /tmp/ir_dumps
ll.ir ll.mlir tt.mlir ttshared.mlir
Debugging Triton Programs
Triton-shared includes a build option that enables the LLVM sanitizers AddressSanitizer (ASan) and ThreadSanitizer (TSan) to help detect memory-safety and concurrency issues in Triton programs. These sanitizers analyze the program dynamically during execution, identifying bugs such as buffer overflows and data races, respectively. For more details and setup instructions, refer here.
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.