PPL LLM Serving
Overview
ppl.llm.serving is a part of the PPL.LLM system. We recommend that users who are new to this project read the Overview of system.
ppl.llm.serving is a serving framework based on ppl.nn.llm for various Large Language Models (LLMs). This repository contains a server based on gRPC and inference support for LLaMA.
Prerequisites
Quick Start
Here is a brief tutorial; refer to the LLaMA Guide for more details.
Installing Prerequisites (on Debian or Ubuntu, for example)
Cloning Source Code
Building from Source
NCCL is required if multiple GPU devices are used.
Exporting Models
Refer to ppl.pmx for details.
Running Server
Server config examples can be found in src/models/llama/conf. Set the following values correctly before running the server:
model_dir: path of models exported by ppl.pmx.
model_param_path: params of models: $model_dir/params.json.
tokenizer_path: tokenizer files for sentencepiece.
Running client: send requests through gRPC to query the model
See tools/client_sample.cc for more details.
Benchmarking
See tools/client_qps_measure.cc for more details.
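The benchmark client's --request_rate option takes either a number of requests per second or inf. As a rough illustration of how such a value could translate into pacing, here is a minimal sketch. RequestInterval is a hypothetical helper that is not taken from tools/client_qps_measure.cc, and it assumes evenly spaced requests; the real tool may pace requests differently.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (an assumption for illustration, not the project's code).
// Converts a --request_rate value into the delay between consecutive requests,
// assuming requests are evenly spaced.
std::chrono::duration<double> RequestInterval(const std::string& request_rate) {
    if (request_rate == "inf") {
        // "inf" means no pause: all client requests are sent back to back.
        return std::chrono::duration<double>(0.0);
    }
    std::size_t pos = 0;
    // std::stod throws std::invalid_argument on non-numeric input.
    const double rate = std::stod(request_rate, &pos);
    if (pos != request_rate.size() || !std::isfinite(rate) || rate <= 0.0) {
        throw std::invalid_argument(
            "request_rate must be a positive number or \"inf\"");
    }
    // N requests per second means one request every 1/N seconds.
    return std::chrono::duration<double>(1.0 / rate);
}
```

With this sketch, RequestInterval("4") yields 0.25 seconds and RequestInterval("inf") yields zero. A sender loop could pass the result to std::this_thread::sleep_for between requests.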
--request_rate is the number of requests per second; the value inf means all client requests are sent with no interval between them.
Running inference offline:
See tools/offline_inference.cc for more details.
License
This project is distributed under the Apache License, Version 2.0.