Checkpointless training on Amazon SageMaker HyperPod

Checkpointless training on Amazon SageMaker HyperPod eliminates disruptive checkpoint-restart cycles, maintaining forward training momentum despite failures and reducing recovery time from hours to minutes.

Key Features

  • In-Process Recovery: Recover from node failures in minutes without losing training progress by using redundant model copies stored in GPU memory
  • Fast Initialization: Accelerate training restarts by bypassing expensive communication (NCCL/Gloo) setup processes
  • Smart Data Caching: Pre-load and cache training data batches to eliminate delays when resuming training after failures
  • Built-in Redundancy: Leverage distributed optimizer instances for checkpointless recovery
  • NeMo Integration: Works seamlessly with PyTorch Lightning and NVIDIA NeMo toolkit for large language model training
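To make the in-process recovery idea concrete, here is a minimal conceptual sketch in plain Python. It is a hypothetical simulation, not the HyperPod implementation or API: it only illustrates the principle that each rank's state is mirrored in a peer's GPU memory, so a failed rank can be restored from a live replica instead of reloading a checkpoint from disk. The `Rank`, `mirror`, and `recover` names are invented for illustration.

```python
import copy

class Rank:
    """A simulated training rank holding model state in memory (illustrative only)."""
    def __init__(self, rank_id, state):
        self.rank_id = rank_id
        self.state = state      # primary copy of this rank's shard
        self.replica = None     # mirrored copy of a peer rank's shard
        self.alive = True

def mirror(ranks):
    """Each rank's state gets a redundant copy on its neighbor."""
    n = len(ranks)
    for i, r in enumerate(ranks):
        peer = ranks[(i + 1) % n]
        peer.replica = copy.deepcopy(r.state)

def recover(ranks, failed_id):
    """Restore a failed rank from the peer holding its replica (no disk I/O)."""
    n = len(ranks)
    holder = ranks[(failed_id + 1) % n]   # neighbor that mirrors the failed rank
    ranks[failed_id].state = copy.deepcopy(holder.replica)
    ranks[failed_id].alive = True
    return ranks[failed_id].state

ranks = [Rank(i, {"step": 100, "shard": i}) for i in range(4)]
mirror(ranks)
ranks[2].alive = False      # simulate a node failure on rank 2
ranks[2].state = None
restored = recover(ranks, 2)
print(restored["step"])     # training resumes at step 100, no restart from disk
```

The same principle underlies the distributed-optimizer redundancy listed above: as long as at least one replica of each shard survives a failure, recovery is a memory copy rather than a checkpoint load.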

Getting Started Examples

| Model | Method | Size | Nodes | Instance | Accelerator | Recipe | Script |
|---|---|---|---|---|---|---|---|
| GPT OSS | Full finetune example | 120b | 16 | p5.48xlarge | GPU H100 | link | link |
| GPT OSS | LoRA example | 120b | 2 | p5.48xlarge | GPU H100 | link | link |
| Llama3 | Pretrain example | 70b | 16 | p5.48xlarge | GPU H100 | link | link |
| Llama3 | LoRA example | 70b | 2 | p5.48xlarge | GPU H100 | link | link |

User Guide

For comprehensive documentation including installation steps, environment setup, configuration options, and detailed usage examples, see the tutorials at Amazon SageMaker HyperPod Checkpointless training.

Quick Start Guide

Launch Training

HyperPod Recipe Launcher

You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating k8s.yaml and config.yaml, then running the launch script.

bash launcher_scripts/gpt_oss/run_checkpointless_nemo_gpt_oss_120b_fine_tuning.sh

Launch Using kubectl

Alternatively, you can deploy the training job directly using kubectl:

kubectl apply -f <path_to_config>.yaml

Monitor Job Status

kubectl get pods
kubectl logs <pod-name>

For detailed installation steps, environment setup, and configuration options, see the tutorials at Amazon SageMaker HyperPod Checkpointless training.

| Component | Version |
|---|---|
| Python | >=3.12 |
| PyTorch | >=2.6.0 |
| NeMo Toolkit | 2.6.0rc0 |
| CUDA | 12.5+ |
| Infrastructure | AWS HyperPod Kubernetes cluster |
| Storage | Shared storage (FSx/NFS) |
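A simple pre-flight check against the minimums in the table above can save a failed job submission. The sketch below is illustrative and not part of any HyperPod tooling; the `meets_minimum` helper and the `minimums` mapping are assumptions introduced here, covering only numeric dotted versions.

```python
def meets_minimum(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically, e.g. '2.10.0' >= '2.6.0'."""
    def parse(v):
        # Keep only the leading digits of each component ('12.5+' -> (12, 5)).
        parts = []
        for p in v.rstrip("+").split("."):
            digits = "".join(ch for ch in p if ch.isdigit())
            parts.append(int(digits) if digits else 0)
        return tuple(parts)
    return parse(installed) >= parse(required)

# Minimums taken from the version table above.
minimums = {"python": "3.12", "torch": "2.6.0", "cuda": "12.5"}

print(meets_minimum("2.10.0", minimums["torch"]))  # True: 2.10 > 2.6 numerically
print(meets_minimum("12.4", minimums["cuda"]))     # False: CUDA too old
```

Note that a plain string comparison would get this wrong ("2.10.0" < "2.6.0" lexicographically), which is why the components are compared as integer tuples.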

Security

See CONTRIBUTING for more information. Note: This repository is temporarily not accepting pull requests.

License

This project is licensed under the Apache-2.0 License.
