InfiniStore is an open-source high-performance KV store. It’s designed to support LLM Inference clusters, whether the cluster is in prefill-decoding disaggregation mode or not. InfiniStore provides high-performance and low-latency KV cache transfer and KV cache reuse among inference nodes in the cluster.
In addition to inference clusters, InfiniStore can also be used as a standalone KV store to integrate with any other LLM training or inference services.
Usage
There are two major scenarios how InfiniStore supports :
Prefill-Decoding disaggregation clusters: in such mode inference workloads are separated into two node pools: prefill nodes and decoding nodes. InfiniStore enables KV cache transfer among these two types of nodes, and also KV cache reuse.
Non-disaggregated clusters: in such mode prefill and decoding workloads are mixed on every node. Infinistore serves as an extra large KV cache pool in addition to GPU cache and local CPU cache, and also enables cross-node KV cache reuse.
Currently InfiniStore has been integrated with vLLM. The integration is done via LMCache for the flexibility purpose.
Integration with SGLang and other inference engines are in progress.
Installation
Install from PIP
Most users just need to deploy and run InfiniStore, and they don’t need to understand how InfiniStore works internally. For these users, PIP is the recommended way to install:
pip install infinistore
Development Installation
For users who need to understand how InfiniStore code works or make code contributions to InfiniStore, it’s recommended to install from source code:
The first step is to start an InfiniStore server. The server can be running on a GPU machine or a CPU machine.
Your server machine may be equipped with TCP network or RDMA network. The command line to start a server varies depending on the network configurations:
Check the following example code to run an InfiniStore client:
infinistore/example/client.py
infinistore/example/client_async.py
infinistore/example/client_async_single.py
Run Within a vLLM Cluster
As illustrated in the previous section, InfiniStore enables different functionalities in a vLLM cluster: KV cache transfer between prefill nodes and decoding nodes, extended KV cache pool, cross-node KV cache reuse, etc.
The setup will varies depending on the specific vLLM cluster configurations. But usually it requires the following installations:
Install vLLM on all nodes
Install LMCache on all nodes
Install InfiniStore on all nodes
Contribute to InfiniStore
InfiniStore is an open-source project and community. We welcome anyone who is interested in helping improve InfiniStore, whether code contributions, document contributions or any other contributions.
If you are submitting a code change, run the following unit test and pre-commit check to ensure your code change doesn’t break existing features before submitting the PR:
Make your code changes
Just clone this repo, make code changes according to your feature design.
Run Unit Tests
pytest infinistore/test_infinistore.py
Run Pre-commit Checks
pre-commit run --all-files
Submit PR
If you code change passes both unit tests and pre-commit checks, submit the PR.
What’s InfiniStore
InfiniStore is an open-source high-performance KV store. It’s designed to support LLM Inference clusters, whether the cluster is in prefill-decoding disaggregation mode or not. InfiniStore provides high-performance and low-latency KV cache transfer and KV cache reuse among inference nodes in the cluster.
In addition to inference clusters, InfiniStore can also be used as a standalone KV store to integrate with any other LLM training or inference services.
Usage
There are two major scenarios how InfiniStore supports :
Currently InfiniStore has been integrated with vLLM. The integration is done via LMCache for the flexibility purpose.
Integration with SGLang and other inference engines are in progress.
Installation
Install from PIP
Most users just need to deploy and run InfiniStore, and they don’t need to understand how InfiniStore works internally. For these users, PIP is the recommended way to install:
Development Installation
For users who need to understand how InfiniStore code works or make code contributions to InfiniStore, it’s recommended to install from source code:
Verify Your Installation
After installation, either from PIP or from source code, run the following command to verify your installation is successful:
Run InfiniStore
Run As a Standalone Service
The first step is to start an InfiniStore server. The server can be running on a GPU machine or a CPU machine.
Your server machine may be equipped with TCP network or RDMA network. The command line to start a server varies depending on the network configurations:
For TCP/IP Network:
For RDMA(RoCE):
For RDMA(Infiniband):
Check the following example code to run an InfiniStore client:
infinistore/example/client.pyinfinistore/example/client_async.pyinfinistore/example/client_async_single.pyRun Within a vLLM Cluster
As illustrated in the previous section, InfiniStore enables different functionalities in a vLLM cluster: KV cache transfer between prefill nodes and decoding nodes, extended KV cache pool, cross-node KV cache reuse, etc.
The setup will varies depending on the specific vLLM cluster configurations. But usually it requires the following installations:
Contribute to InfiniStore
InfiniStore is an open-source project and community. We welcome anyone who is interested in helping improve InfiniStore, whether code contributions, document contributions or any other contributions.
If you are submitting a code change, run the following unit test and pre-commit check to ensure your code change doesn’t break existing features before submitting the PR:
Make your code changes
Just clone this repo, make code changes according to your feature design.
Run Unit Tests
If you code change passes both unit tests and pre-commit checks, submit the PR.