Checkpointless training on Amazon SageMaker HyperPod

Checkpointless training on Amazon SageMaker HyperPod eliminates disruptive checkpoint-restart cycles, maintaining forward training momentum despite failures and reducing recovery time from hours to minutes.

Key Features

  • In-Process Recovery: Recover from node failures in minutes without losing training progress by using redundant model copies stored in GPU memory
  • Fast Initialization: Accelerate training restarts by bypassing expensive communication (NCCL/Gloo) setup processes
  • Smart Data Caching: Pre-load and cache training data batches to eliminate delays when resuming training after failures
  • Built-in Redundancy: Leverage distributed optimizer instances for checkpointless recovery
  • NeMo Integration: Works seamlessly with PyTorch Lightning and NVIDIA NeMo toolkit for large language model training
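To make the in-process recovery idea concrete, here is a minimal conceptual sketch in plain Python. It is a hypothetical simulation, not the HyperPod implementation or API: it only illustrates the principle that each rank's state is mirrored in a peer's GPU memory, so a failed rank can be restored from a live replica instead of reloading a checkpoint from disk. The `Rank`, `mirror`, and `recover` names are invented for illustration.

```python
import copy

class Rank:
    """A simulated training rank holding model state in memory (illustrative only)."""
    def __init__(self, rank_id, state):
        self.rank_id = rank_id
        self.state = state      # primary copy of this rank's shard
        self.replica = None     # mirrored copy of a peer rank's shard
        self.alive = True

def mirror(ranks):
    """Each rank's state gets a redundant copy on its neighbor."""
    n = len(ranks)
    for i, r in enumerate(ranks):
        peer = ranks[(i + 1) % n]
        peer.replica = copy.deepcopy(r.state)

def recover(ranks, failed_id):
    """Restore a failed rank from the peer holding its replica (no disk I/O)."""
    n = len(ranks)
    holder = ranks[(failed_id + 1) % n]   # neighbor that mirrors the failed rank
    ranks[failed_id].state = copy.deepcopy(holder.replica)
    ranks[failed_id].alive = True
    return ranks[failed_id].state

ranks = [Rank(i, {"step": 100, "shard": i}) for i in range(4)]
mirror(ranks)
ranks[2].alive = False      # simulate a node failure on rank 2
ranks[2].state = None
restored = recover(ranks, 2)
print(restored["step"])     # training resumes at step 100, no restart from disk
```

The same principle underlies the distributed-optimizer redundancy listed above: as long as at least one replica of each shard survives a failure, recovery is a memory copy rather than a checkpoint load.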

Getting Started Examples

| Model | Method | Size | Nodes | Instance | Accelerator | Recipe | Script |
|---|---|---|---|---|---|---|---|
| GPT OSS | Full finetune example | 120b | 16 | p5.48xlarge | GPU H100 | link | link |
| GPT OSS | LoRA example | 120b | 2 | p5.48xlarge | GPU H100 | link | link |
| Llama3 | Pretrain example | 70b | 16 | p5.48xlarge | GPU H100 | link | link |
| Llama3 | LoRA example | 70b | 2 | p5.48xlarge | GPU H100 | link | link |

User Guide

For comprehensive documentation including installation steps, environment setup, configuration options, and detailed usage examples, see the tutorials at Amazon SageMaker HyperPod Checkpointless training.

Quick Start Guide

Launch Training

HyperPod Recipe Launcher

You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating k8s.yaml and config.yaml, then running the launch script.

bash launcher_scripts/gpt_oss/run_checkpointless_nemo_gpt_oss_120b_fine_tuning.sh

Launch Using kubectl

Alternatively, you can deploy the training job directly using kubectl:

kubectl apply -f <path_to_config>.yaml

Monitor Job Status

kubectl get pods
kubectl logs <pod-name>

For detailed installation steps, environment setup, and configuration options, see the tutorials at Amazon SageMaker HyperPod Checkpointless training.

| Component | Version |
|---|---|
| Python | >=3.12 |
| PyTorch | >=2.6.0 |
| NeMo Toolkit | 2.6.0rc0 |
| CUDA | 12.5+ |
| Infrastructure | AWS HyperPod Kubernetes cluster |
| Storage | Shared storage (FSx/NFS) |
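A simple pre-flight check against the minimums in the table above can save a failed job submission. The sketch below is illustrative and not part of any HyperPod tooling; the `meets_minimum` helper and the `minimums` mapping are assumptions introduced here, covering only numeric dotted versions.

```python
def meets_minimum(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically, e.g. '2.10.0' >= '2.6.0'."""
    def parse(v):
        # Keep only the leading digits of each component ('12.5+' -> (12, 5)).
        parts = []
        for p in v.rstrip("+").split("."):
            digits = "".join(ch for ch in p if ch.isdigit())
            parts.append(int(digits) if digits else 0)
        return tuple(parts)
    return parse(installed) >= parse(required)

# Minimums taken from the version table above.
minimums = {"python": "3.12", "torch": "2.6.0", "cuda": "12.5"}

print(meets_minimum("2.10.0", minimums["torch"]))  # True: 2.10 > 2.6 numerically
print(meets_minimum("12.4", minimums["cuda"]))     # False: CUDA too old
```

Note that a plain string comparison would get this wrong ("2.10.0" < "2.6.0" lexicographically), which is why the components are compared as integer tuples.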

Security

See CONTRIBUTING for more information. Note: This repository is temporarily not accepting pull requests.

License

This project is licensed under the Apache-2.0 License.
