nanochat is the simplest experimental harness for training LLMs. It is designed to run on a single GPU node, the code is minimal and hackable, and it covers all major LLM stages: tokenization, pretraining, finetuning, evaluation, inference, and a chat UI. For example, you can train your own GPT-2-capability LLM (which cost ~$43,000 to train in 2019) for only ~$48 (2 hours on an 8XH100 GPU node) and then talk to it in a familiar ChatGPT-like web UI. On a spot instance, the total cost can be closer to ~$15. More generally, nanochat is configured out of the box to train an entire miniseries of compute-optimal models by setting one single complexity dial: --depth, the number of layers in the GPT transformer model (GPT-2 capability happens to be approximately depth 26). All other hyperparameters (the width of the transformer, number of heads, learning rate adjustments, training horizons, weight decays, ...) are calculated automatically in an optimal way.
For questions about the repo, I recommend asking DeepWiki from Devin/Cognition, using the Discussions tab, or coming by the #nanochat channel on Discord.
## Time-to-GPT-2 Leaderboard
Presently, the main focus of development is on tuning the pretraining stage, which consumes the most compute. Inspired by the modded-nanogpt repo, and to incentivize progress and community collaboration, nanochat maintains a leaderboard for a "GPT-2 speedrun": the wall-clock time required to train a nanochat model to GPT-2-grade capability, as measured by the DCLM CORE score. The runs/speedrun.sh script always reflects the reference way to train a GPT-2-grade model and talk to it. The current leaderboard looks as follows:
| # | time (hours) | val_bpb | CORE | Description | Date | Commit | Contributors |
|---|--------------|---------|------|-------------|------|--------|--------------|
| 0 | 168 | - | 0.2565 | Original OpenAI GPT-2 checkpoint | 2019 | - | OpenAI |
| 1 | 3.04 | 0.74833 | 0.2585 | d24 baseline, slightly overtrained | Jan 29 2026 | 348fbb3 | @karpathy |
| 2 | 2.91 | 0.74504 | 0.2578 | d26 slightly undertrained +fp8 | Feb 2 2026 | a67eba3 | @karpathy |
| 3 | 2.76 | 0.74645 | 0.2602 | bump total batch size to 1M tokens | Feb 5 2026 | 2c062aa | @karpathy |
| 4 | 2.02 | 0.71854 | 0.2571 | change dataset to NVIDIA ClimbMix | Mar 4 2026 | 324e69c | @ddudek, @karpathy |
The primary metric we care about is "time to GPT-2": the wall-clock time needed to outperform the GPT-2 (1.6B) CORE metric on an 8XH100 GPU node. The GPT-2 CORE score is 0.256525. In 2019, training GPT-2 cost approximately $43,000, so it is incredible that, thanks to many advances across the stack over 7 years, we can now do the same much faster and for well below $100 (e.g. at the current ~$3/GPU/hr, an 8XH100 node is ~$24/hr, so 2 hours is ~$48).
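The dollar figures above are plain unit arithmetic, using the approximate rates quoted in this section (real prices vary by provider):

```python
# Approximate rates quoted in this README; actual prices vary by provider.
gpu_hour_usd = 3.0                       # ~$3 per GPU per hour
num_gpus = 8                             # one 8XH100 node
node_hour_usd = gpu_hour_usd * num_gpus  # ~$24/hr for the whole node
run_cost_usd = node_hour_usd * 2         # a ~2-hour speedrun
print(run_cost_usd)                      # 48.0
```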
See dev/LEADERBOARD.md for more docs on how to interpret and contribute to the leaderboard.
## Getting started
### Reproduce and talk to GPT-2
The most fun you can have is to train your own GPT-2 and talk to it. The entire pipeline to do so is contained in the single file runs/speedrun.sh, which is designed to be run on an 8XH100 GPU node. Boot up a new 8XH100 GPU box from your favorite provider (e.g. I use and like Lambda), and kick off the training script:
```bash
bash runs/speedrun.sh
```
You may wish to do so in a screen session as this will take ~3 hours to run. Once it’s done, you can talk to it via the ChatGPT-like web UI. Make sure again that your local uv virtual environment is active (run source .venv/bin/activate), and serve it:
```bash
python -m scripts.chat_web
```
And then visit the URL shown. Make sure to access it correctly, e.g. on Lambda use the public IP of the node you’re on, followed by the port, so for example http://209.20.xxx.xxx:8000/, etc. Then talk to your LLM as you’d normally talk to ChatGPT! Get it to write stories or poems. Ask it to tell you who you are to see a hallucination. Ask it why the sky is blue. Or why it’s green. The speedrun is a 4e19 FLOPs capability model so it’s a bit like talking to a kindergartener :).
A few more notes:
- The code will run just fine on an Ampere 8XA100 GPU node as well, just a bit slower.
- All of the code will also run on even a single GPU by omitting torchrun, and will produce ~identical results (the code automatically switches to gradient accumulation), but you'll have to wait 8 times longer.
- If your GPU(s) have less than 80GB of VRAM, you'll have to tune some of the hyperparameters or you will OOM. Look for --device_batch_size in the scripts and reduce it until things fit, e.g. from 32 (the default) to 16, 8, 4, 2, or even 1. Below that, you'll have to know a bit more about what you're doing and get more creative.
- Most of the code is fairly vanilla PyTorch, so it should run on anything that supports it (xpu, mps, etc.), but I haven't personally exercised all of these code paths, so there may be sharp edges.
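As a sanity check on the gradient accumulation note above, here is a tiny dependency-free sketch showing why single-GPU results come out ~identical: for a mean-reduced loss, averaging the gradients of equal-sized micro-batches reproduces the full-batch gradient exactly.

```python
# Plain-Python sketch of gradient accumulation for a 1-parameter linear
# model with MSE loss: d/dw of mean((w*x - y)^2) over a batch.
def grad_mse(w, batch):
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]

full = grad_mse(w, data)                          # one big batch

micro = [data[:2], data[2:]]                      # two equal micro-batches
accum = sum(grad_mse(w, mb) for mb in micro) / len(micro)

assert abs(full - accum) < 1e-12                  # identical gradient
```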
## Research
If you are a researcher and wish to help improve nanochat, two scripts of interest are runs/scaling_laws.sh and runs/miniseries.sh. See Jan 7 miniseries v1 for related documentation. For quick experimentation (~5 min pretraining runs), my favorite scale is a 12-layer model (GPT-1 sized), trained by passing --depth=12 to the base training script.
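A sketch of what such an invocation might look like. The `torchrun --nproc_per_node=8 -m scripts.base_train` form and `--depth` are taken from elsewhere in this README; `--run` and any eval/checkpoint-frequency flags are assumptions, so check `scripts/base_train.py` for the exact names:

```shell
# Hypothetical quick-iteration run; verify flag names against scripts/base_train.py
torchrun --nproc_per_node=8 -m scripts.base_train -- --depth=12 --run=d12
```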
Such a run logs to wandb (run name "d12"), only evaluates the CORE metric on the last step, and skips sampling and saving intermediate checkpoints. My iteration loop is to change something in the code, re-run a d12 (or a d16, etc.), and see if it helped. To judge whether a run helps, I like to monitor the wandb plots for:
- val_bpb (validation loss in vocab-size-invariant units of bits per byte) as a function of step, total_training_time, and total_training_flops
- core_metric (the DCLM CORE score)
- train/mfu (model FLOPs utilization) and train/tok_per_sec (training throughput)
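For intuition, bits per byte is just cross-entropy re-expressed in bits and normalized by the raw text bytes rather than tokens, which is what makes it invariant to vocabulary size. A sketch of the conversion (the exact accounting in nanochat/loss_eval.py may differ):

```python
import math

def bits_per_byte(mean_loss_nats, total_tokens, total_bytes):
    """Convert mean cross-entropy (nats/token) to bits per byte of raw text."""
    total_bits = mean_loss_nats * total_tokens / math.log(2)  # 1 nat = 1/ln(2) bits
    return total_bits / total_bytes

# A loss of ln(2) nats/token with a tokenizer packing 1 byte/token is exactly 1 bpb:
bpb = bits_per_byte(math.log(2), total_tokens=100, total_bytes=100)  # -> 1.0
```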
The important thing to note is that nanochat is written and configured around one single dial of complexity: the depth of the transformer. This single integer automatically determines all other hyperparameters (the width of the transformer, number of heads, learning rate adjustments, training horizons, weight decays, ...) so that the trained model comes out compute-optimal. The idea is that the user doesn't have to think about or set any of this; they simply ask for a smaller or bigger model using --depth, and everything "just works". By sweeping depth, you obtain the nanochat miniseries of compute-optimal models at various sizes. The GPT-2-capability model (of most interest at the moment) happens to land somewhere in the d24-d26 range with the current code. Any candidate change to the repo has to be principled enough that it works for all settings of depth.
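As a toy illustration of the single-dial idea: everything can be derived from depth alone. The multipliers below are illustrative assumptions, not the actual rules in nanochat/gpt.py:

```python
# Hypothetical sketch: derive all model hyperparameters from depth alone.
# The 64s below are common transformer conventions, used here for illustration.
def model_config(depth, head_dim=64, width_per_layer=64):
    width = depth * width_per_layer  # model grows wider as it grows deeper
    return {
        "depth": depth,
        "width": width,
        "num_heads": width // head_dim,
        "params_approx": 12 * depth * width**2,  # standard non-embedding estimate
    }

cfg = model_config(12)  # the quick-iteration GPT-1-ish scale: ~85M params
```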
## Running on CPU / MPS
The script runs/runcpu.sh shows a very simple example of running on CPU or Apple Silicon. It dramatically shrinks the LLM being trained so that training fits into a reasonable interval of a few tens of minutes. You will not get strong results this way.
## Precision / dtype
nanochat does not use torch.amp.autocast. Instead, precision is managed explicitly through a single global COMPUTE_DTYPE (defined in nanochat/common.py). By default this is auto-detected based on your hardware:
| Hardware | Default dtype | Why |
|----------|---------------|-----|
| CUDA SM 80+ (A100, H100, ...) | bfloat16 | Native bf16 tensor cores |
| CUDA SM < 80 (V100, T4, ...) | float32 | No bf16; fp16 available via NANOCHAT_DTYPE=float16 (uses GradScaler) |
| CPU / MPS | float32 | No reduced-precision tensor cores |
You can override the default with the NANOCHAT_DTYPE environment variable:
```bash
NANOCHAT_DTYPE=float32 python -m scripts.chat_cli -p "hello"               # force fp32
NANOCHAT_DTYPE=bfloat16 torchrun --nproc_per_node=8 -m scripts.base_train  # force bf16
```
How it works: model weights are stored in fp32 (for optimizer precision), but our custom Linear layer casts them to COMPUTE_DTYPE during the forward pass. Embeddings are stored directly in COMPUTE_DTYPE to save memory. This gives us the same mixed-precision benefit as autocast but with full explicit control over what runs in which precision.
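The pattern above can be sketched in a few lines. This is an illustrative minimal version, not the actual Linear layer in the repo (names and the dtype-detection logic are assumptions):

```python
# Sketch: fp32 master weights, cast to the compute dtype only in forward.
import torch
import torch.nn as nn
import torch.nn.functional as F

# illustrative auto-detection; nanochat/common.py's logic may differ
COMPUTE_DTYPE = torch.bfloat16 if (
    torch.cuda.is_available() and torch.cuda.is_bf16_supported()
) else torch.float32

class CastLinear(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        # fp32 master copy: this is what the optimizer updates
        self.weight = nn.Parameter(torch.randn(n_out, n_in) * n_in ** -0.5)

    def forward(self, x):
        # the matmul itself runs in the (possibly reduced) compute dtype
        return F.linear(x.to(COMPUTE_DTYPE), self.weight.to(COMPUTE_DTYPE))

layer = CastLinear(16, 32)
out = layer(torch.randn(2, 16))
assert layer.weight.dtype == torch.float32  # optimizer still sees fp32
assert out.dtype == COMPUTE_DTYPE           # activations in compute dtype
```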
Note: float16 training automatically enables a GradScaler in base_train.py to prevent gradient underflow. SFT supports this too, but RL currently does not. Inference in fp16 works fine everywhere.
## Guides
I’ve published a number of guides that might contain helpful information, most recent to least recent:
- To customize your nanochat, see Guide: infusing identity to your nanochat in Discussions, which describes how you can tune your nanochat's personality through synthetic data generation and mixing that data into the SFT stage.
- Oct 13 2025: the original nanochat post introducing the project. It now contains some deprecated information, and the model it describes is a lot older (with worse results) than current master.
## File structure
```
.
├── LICENSE
├── README.md
├── dev
│   ├── gen_synthetic_data.py       # Example synthetic data for identity
│   ├── generate_logo.html
│   ├── nanochat.png
│   └── repackage_data_reference.py # Pretraining data shard generation
├── nanochat
│   ├── __init__.py                 # empty
│   ├── checkpoint_manager.py       # Save/Load model checkpoints
│   ├── common.py                   # Misc small utilities, quality of life
│   ├── core_eval.py                # Evaluates base model CORE score (DCLM paper)
│   ├── dataloader.py               # Tokenizing Distributed Data Loader
│   ├── dataset.py                  # Download/read utils for pretraining data
│   ├── engine.py                   # Efficient model inference with KV Cache
│   ├── execution.py                # Allows the LLM to execute Python code as tool
│   ├── gpt.py                      # The GPT nn.Module Transformer
│   ├── logo.svg
│   ├── loss_eval.py                # Evaluate bits per byte (instead of loss)
│   ├── optim.py                    # AdamW + Muon optimizer, 1GPU and distributed
│   ├── report.py                   # Utilities for writing the nanochat Report
│   ├── tokenizer.py                # BPE Tokenizer wrapper in style of GPT-4
│   └── ui.html                     # HTML/CSS/JS for nanochat frontend
├── pyproject.toml
├── runs
│   ├── miniseries.sh               # Miniseries training script
│   ├── runcpu.sh                   # Small example of how to run on CPU/MPS
│   ├── scaling_laws.sh             # Scaling laws experiments
│   └── speedrun.sh                 # Train the ~$100 nanochat d20
├── scripts
│   ├── base_eval.py                # Base model: CORE score, bits per byte, samples
│   ├── base_train.py               # Base model: train
│   ├── chat_cli.py                 # Chat model: talk to over CLI
│   ├── chat_eval.py                # Chat model: eval tasks
│   ├── chat_rl.py                  # Chat model: reinforcement learning
│   ├── chat_sft.py                 # Chat model: train SFT
│   ├── chat_web.py                 # Chat model: talk to over WebUI
│   ├── tok_eval.py                 # Tokenizer: evaluate compression rate
│   └── tok_train.py                # Tokenizer: train it
├── tasks
│   ├── arc.py                      # Multiple choice science questions
│   ├── common.py                   # TaskMixture | TaskSequence
│   ├── customjson.py               # Make Task from arbitrary jsonl convos
│   ├── gsm8k.py                    # 8K Grade School Math questions
│   ├── humaneval.py                # Misnomer; Simple Python coding task
│   ├── mmlu.py                     # Multiple choice questions, broad topics
│   ├── smoltalk.py                 # Conglomerate dataset of SmolTalk from HF
│   └── spellingbee.py              # Task teaching model to spell/count letters
├── tests
│   └── test_engine.py
└── uv.lock
```
## Contributing
The goal of nanochat is to improve the state of the art in micro models that are accessible to work with end to end on budgets under $1,000. Accessibility is about overall cost, but also about cognitive complexity: nanochat is not an exhaustively configurable LLM "framework"; there are no giant configuration objects, model factories, or if-then-else monsters in the code base. It is a single, cohesive, minimal, readable, hackable, maximally-forkable "strong baseline" codebase designed to run start to end and produce a ChatGPT model you can talk to. The part most interesting to me personally right now is reducing the time to GPT-2 (i.e. reaching a CORE score above 0.256525). This currently takes ~3 hours, but improving the pretraining stage can push it lower.
Current AI policy: disclosure. When submitting a PR, please declare any parts that had substantial LLM contribution and that you have not written or that you do not fully understand.
## Acknowledgements
- The name (nanochat) derives from my earlier project nanoGPT, which only covered pretraining.
- nanochat is also inspired by modded-nanoGPT, which gamified the nanoGPT repo with clear metrics and a leaderboard, and borrows a lot of its ideas and some implementation for pretraining.
- Thank you to HuggingFace for fineweb and smoltalk.
- Thank you to Lambda for the compute used in developing this project.
- Thank you to chief LLM whisperer 🧙‍♂️ Alec Radford for advice/guidance.
- Thank you to repo czar Sofie @svlandeg for help with managing issues, pull requests and discussions of nanochat.
## Cite
If you find nanochat helpful in your research, cite it simply as:
```bibtex
@misc{nanochat,
  author = {Andrej Karpathy},
  title = {nanochat: The best ChatGPT that \$100 can buy},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/karpathy/nanochat}
}
```
## License
MIT