models : kda chunk size = 16 (#19827)
models : add llm_build_delta_net_base
cont : keep qwen35 and qwen35moe graphs intact
cont : add comments [no ci]
add kimi linear to delta-net-base
removed unnecessary ggml_cont from g_exp_t
removed ggml_cont from g_diff_exp_t. moved ggml_cont for o to kimi-linear.cpp
removed unnecessary diag mask
cont : simplify
cont : avoid graph splits
scale q after mul instead of beginning
scale q after mul instead of beginning
identical ppl
cont : fix scale and decay mask
minor : remove TODO
block implementation for kda
remove space at the end of line 101
concat+pad
pad+binary row concat
chunk size 16 for kda
removed minor differences to master
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
llama.cpp
Manifesto / ggml / ops
LLM inference in C/C++
Recent API changes

- libllama API
- llama-server REST API

Hot topics

- gpt-oss model with native MXFP4 format has been added | PR | Collaboration with NVIDIA | Comment
- llama-server: #12898 | documentation

Quick start
Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine:
Install llama.cpp using brew, nix or winget. Once installed, you'll need a model to work with. Head to the Obtaining and quantizing models section to learn more.
Example command:
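As a minimal sketch of a first run (the Homebrew install is one of several options, and the model repo name is illustrative - any llama.cpp-compatible GGUF repo on Hugging Face works):

```shell
# Install via Homebrew (nix and winget are alternatives)
brew install llama.cpp

# Download a model from Hugging Face and start an interactive session
# (ggml-org/gemma-3-1b-it-GGUF is an illustrative repo name)
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```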
Description
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud.

The llama.cpp project is the main playground for developing new features for the ggml library.

Models
Typically finetunes of the base models below are supported as well.
Instructions for adding support for new models: HOWTO-add-model.md
Text-only
Multimodal
Bindings
UIs
(to have a project listed here, it should clearly state that it depends on llama.cpp)

Tools
Infrastructure
Games
Supported backends
Obtaining and quantizing models
The Hugging Face platform hosts a number of LLMs compatible with llama.cpp:

You can either manually download the GGUF file or directly use any llama.cpp-compatible model from Hugging Face or other model hosting sites, such as ModelScope, by using this CLI argument: -hf <user>/<model>[:quant]. For example:

By default, the CLI downloads from Hugging Face; you can switch to other sources with the environment variable MODEL_ENDPOINT. For example, to download model checkpoints from ModelScope, set MODEL_ENDPOINT=https://www.modelscope.cn/.

After downloading a model, use the CLI tools to run it locally - see below.
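For instance, pulling a model by repository name might look like this (the repository and quantization names are illustrative):

```shell
# Download a specific quantization from Hugging Face and run it
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M

# Fetch the same model from ModelScope instead of Hugging Face
MODEL_ENDPOINT=https://www.modelscope.cn/ llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF
```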
llama.cpp requires the model to be stored in the GGUF file format. Models in other data formats can be converted to GGUF using the convert_*.py Python scripts in this repo.

The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with llama.cpp:

- llama.cpp in the cloud (more info: https://github.com/ggml-org/llama.cpp/discussions/9669)

To learn more about model quantization, read this documentation.
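A local conversion run might look like the following sketch (the checkpoint path and output file names are placeholders):

```shell
# Convert a local Hugging Face checkpoint directory to GGUF
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf --outtype f16

# Optionally quantize the result, e.g. to 4-bit Q4_K_M
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```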
llama-cli

A CLI tool for accessing and experimenting with most of llama.cpp's functionality.

Run in conversation mode
Models with a built-in chat template will automatically activate conversation mode. If this doesn't occur, you can manually enable it by adding -cnv and specifying a suitable chat template with --chat-template NAME.

Run in conversation mode with custom chat template
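As a sketch (model.gguf is a placeholder for any local GGUF file; chatml is one of the built-in template names):

```shell
# Conversation mode activates automatically for models with a chat template
llama-cli -m model.gguf

# Force conversation mode and pick a template explicitly
llama-cli -m model.gguf -cnv --chat-template chatml
```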
Constrain the output with a custom grammar
The grammars/ folder contains a handful of sample grammars. To write your own, check out the GBNF Guide.
For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/
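For example, using one of the bundled sample grammars to force valid JSON output (the model file and prompt are illustrative):

```shell
# Constrain generation to valid JSON using a sample grammar from grammars/
llama-cli -m model.gguf --grammar-file grammars/json.gbnf \
    -p 'Request: schedule a call at 8pm; Command:'
```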
llama-server

A lightweight, OpenAI API compatible, HTTP server for serving LLMs.
Start a local HTTP server with default configuration on port 8080
Support multiple users and parallel decoding
Enable speculative decoding
Serve an embedding model
Serve a reranking model
Constrain all outputs with a grammar
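The cases above can be sketched as follows (all model file names are placeholders):

```shell
# Start a local HTTP server on port 8080
llama-server -m model.gguf --port 8080

# Allow 4 parallel clients sharing the context
llama-server -m model.gguf -np 4

# Speculative decoding with a small draft model
llama-server -m model-7b.gguf -md model-1b-draft.gguf

# Serve an embedding model / a reranking model
llama-server -m embedding-model.gguf --embedding
llama-server -m rerank-model.gguf --reranking

# Constrain all outputs with a grammar
llama-server -m model.gguf --grammar-file grammars/json.gbnf
```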
llama-perplexity

A tool for measuring the perplexity ^1 (and other quality metrics) of a model over a given text.
Measure the perplexity over a text file
Measure KL divergence
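A sketch of both measurements (file names are placeholders; KL divergence is a two-step run that first records base logits from a higher-precision model, then compares a quantized model against them):

```shell
# Perplexity over a text file (lower is better)
llama-perplexity -m model.gguf -f wiki.test.raw

# Step 1: save base logits from the full-precision model
llama-perplexity -m model-f16.gguf -f wiki.test.raw --kl-divergence-base wiki.kld

# Step 2: measure KL divergence of a quantized model against the base
llama-perplexity -m model-q4.gguf -f wiki.test.raw --kl-divergence-base wiki.kld --kl-divergence
```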
llama-bench

Benchmark the performance of the inference for various parameters.
Run default benchmark
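For example (model file name is a placeholder):

```shell
# Default benchmark: prompt processing and token generation throughput
llama-bench -m model.gguf
```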
llama-simple

A minimal example for implementing apps with llama.cpp. Useful for developers.

Basic text completion
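A sketch of a basic completion run (model file and prompt are illustrative):

```shell
# Complete a prompt with the minimal example program
llama-simple -m model.gguf "Hello my name is"
```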
Contributing
Collaborators can push to branches in the llama.cpp repo and merge PRs into the master branch

Other documentation
Development documentation
Seminal papers and background on the models
If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
XCFramework
The XCFramework is a precompiled version of the library for iOS, visionOS, tvOS, and macOS. It can be used in Swift projects without the need to compile the library from source. For example:
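A Package.swift binary target along these lines can pull in the precompiled framework (this is a sketch: the package name, release URL pattern, and checksum are placeholders to replace with real values for the build you want):

```swift
// swift-tools-version: 5.9
import PackageDescription

let package = Package(
    name: "MyLlamaApp",
    targets: [
        .target(name: "MyLlamaApp", dependencies: ["llama"]),
        // Precompiled XCFramework; replace URL and checksum with the
        // values published for the release you are targeting
        .binaryTarget(
            name: "llama",
            url: "https://github.com/ggml-org/llama.cpp/releases/download/b5046/llama-b5046-xcframework.zip",
            checksum: "<checksum-of-the-zip>"
        )
    ]
)
```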
The above example is using an intermediate build b5046 of the library. This can be modified to use a different version by changing the URL and checksum.

Completions
Command-line completion is available for some environments.
Bash Completion
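A sketch of enabling it, assuming llama-cli is on your PATH (the script location is illustrative):

```shell
# Generate the completion script and load it into the current shell
llama-cli --completion-bash > ~/.llama-completion.bash
source ~/.llama-completion.bash
```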
Optionally this can be added to your .bashrc or .bash_profile to load it automatically. For example:

Dependencies
llama-server - MIT license