Checkpointless training on Amazon SageMaker HyperPod
Checkpointless training on Amazon SageMaker HyperPod eliminates disruptive checkpoint-restart cycles. Training keeps moving forward through failures, and recovery time drops from hours to minutes.
Key Features
In-Process Recovery: Recover from node failures in minutes without losing training progress by using redundant model copies stored in GPU memory
Fast Initialization: Accelerate training restarts by bypassing expensive communication setup (NCCL/Gloo)
Smart Data Caching: Pre-load and cache training data batches to eliminate delays when resuming training after failures
Built-in Redundancy: Leverage distributed optimizer instances for checkpointless recovery
NeMo Integration: Works seamlessly with PyTorch Lightning and NVIDIA NeMo toolkit for large language model training
For comprehensive documentation including installation steps, environment setup, configuration options, and detailed usage examples, see the tutorials at Amazon SageMaker HyperPod Checkpointless training.
Quick Start Guide
Launch Training
HyperPod Recipe Launcher
You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating k8s.yaml and config.yaml, then running the launch script.
Launch Using kubectl
Alternatively, you can deploy the training job directly using kubectl:
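As a minimal sketch of that step, the helper below wraps `kubectl apply`. The manifest file name `checkpointless-training-job.yaml` and the `default` namespace are hypothetical placeholders, not names taken from this repository; substitute the job manifest and namespace you actually use.

```shell
#!/usr/bin/env bash
# Sketch only: MANIFEST and NAMESPACE defaults are hypothetical placeholders.
MANIFEST="${MANIFEST:-checkpointless-training-job.yaml}"
NAMESPACE="${NAMESPACE:-default}"

submit_job() {
  # Create or update the resources described in the job manifest.
  kubectl apply --namespace "$NAMESPACE" --filename "$MANIFEST"
}

# Usage (requires kubectl configured for your HyperPod cluster):
#   submit_job
```

Wrapping the command in a function keeps the manifest path and namespace in one place, so the same script can be reused across jobs by overriding the two variables.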
Monitor Job Status
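One way to check on a submitted job is with standard `kubectl` commands, sketched below. The `default` namespace and the pod name passed to `follow_logs` are assumptions for illustration; the actual pod names depend on your job manifest.

```shell
#!/usr/bin/env bash
# Sketch only: the NAMESPACE default is a hypothetical placeholder.
NAMESPACE="${NAMESPACE:-default}"

list_pods() {
  # Show the pods in the namespace along with their node placement and status.
  kubectl get pods --namespace "$NAMESPACE" --output wide
}

follow_logs() {
  # Stream logs from a single pod; $1 is the pod name reported by list_pods.
  kubectl logs --follow --namespace "$NAMESPACE" "$1"
}

# Usage (requires kubectl configured for your HyperPod cluster):
#   list_pods
#   follow_logs <pod-name>
```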
For detailed installation steps, environment setup, and configuration options, see the tutorials at Amazon SageMaker HyperPod Checkpointless training.
Recommended Requirements
Security
See CONTRIBUTING for more information. Note: This repository is temporarily not accepting pull requests.
License
This project is licensed under the Apache-2.0 License.