Amazon SageMaker HyperPod Recipes
Overview
Amazon SageMaker HyperPod recipes help customers get started with training and fine-tuning popular publicly available foundation models in just minutes, with state-of-the-art performance. They provide a pre-configured training stack that is tested and validated on Amazon SageMaker. See the Amazon SageMaker HyperPod recipes documentation for full documentation.
The recipes support the following infrastructure (unless otherwise specified in documentation):
Amazon SageMaker HyperPod with Amazon EKS for workload orchestration
Amazon SageMaker HyperPod with Slurm for workload orchestration
Amazon SageMaker training jobs (SMTJ)
Version History
This repository contains v2.0.0 of Amazon SageMaker HyperPod recipes, which includes recipes built on the latest training frameworks.
Looking for v1 recipes? Please refer to the v1 branch. We recommend using v2 recipes for new projects as they provide improved performance and additional features.
Supported Models and Techniques
Supported techniques include SFT and DPO with full fine-tuning, LoRA (low-rank adaptation for parameter efficiency), and QLoRA (quantized LoRA for reduced memory).
Advanced Training Frameworks
LLMFT (LLM Fine-Tuning Framework)
Advanced fine-tuning framework with optimized implementations for:
DeepSeek R1 Distilled models (Llama and Qwen variants)
GPT-OSS models (20B, 120B)
Llama models (3.1, 3.2, 3.3, 4)
Qwen models (2.5, 3)
Techniques: SFT (Full Fine-Tuning and LoRA), DPO (Full Fine-Tuning and LoRA)
VERL (Versatile Reinforcement Learning)
Reinforcement learning framework using the GRPO algorithm for:
Llama models (3.1, 3.2, 3.3)
Qwen models (2.5, 3)
DeepSeek R1 Distilled models
GPT-OSS models
Techniques: RLAIF and RLVR, both available with Full Fine-Tuning or LoRA
Checkpointless Training
Memory-efficient training that eliminates traditional checkpoint storage during training, significantly reducing memory overhead and storage requirements. Particularly beneficial for large-scale models where checkpoint sizes can be substantial.
Elastic Training
Dynamic resource scaling that enables automatic adjustment of training resources based on cluster availability. Workloads can scale up or down to optimize resource utilization and reduce training costs.
Installation
Amazon SageMaker HyperPod recipes should be installed on the head node of your HyperPod cluster or on your local machine in a Python virtual environment.
Usage Guide
When using the SageMaker HyperPod recipes, you can either create your own training script or use the provided recipes, which cover popular publicly available models. Depending on your needs, you might have to modify the parameters defined in the recipes for pre-training or fine-tuning. Once your configuration is set up, you can run training on SageMaker HyperPod (with Amazon EKS for workload orchestration) or on SageMaker training jobs using the Amazon SageMaker Python SDK. Note that Amazon Nova model recipes are only compatible with SageMaker HyperPod with Amazon EKS and with SageMaker training jobs.
Container Images
The following container images are available for different recipe types:
For LLMFT recipes: 327873000638.dkr.ecr.us-east-1.amazonaws.com/hyperpod-recipes:llmft-v1.0.0
For VERL recipes (EKS): 327873000638.dkr.ecr.us-east-1.amazonaws.com/hyperpod-recipes:verl-v1.0.0-eks
For VERL recipes (SageMaker Training Jobs): 327873000638.dkr.ecr.us-east-1.amazonaws.com/hyperpod-recipes:verl-v1.0.0-smtj
To use a container image for training, modify the recipes_collection/config.yaml file with your chosen container image:
container: <your_container_image>
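For example, with the LLMFT image from the list above, the line in recipes_collection/config.yaml might read:

```yaml
# recipes_collection/config.yaml (fragment)
container: 327873000638.dkr.ecr.us-east-1.amazonaws.com/hyperpod-recipes:llmft-v1.0.0
```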
The launcher scripts have variables such as TRAIN_DIR which need to be set either by modifying the launcher script, or by setting environment variables. For example:
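The directories can be supplied as environment variables before invoking a launcher script; the paths below are placeholders, not paths from this repository:

```shell
#!/bin/bash
# Placeholder locations -- substitute your own FSx/NFS paths.
export TRAIN_DIR="/fsx/datasets/my-dataset/train"   # training dataset
export VAL_DIR="/fsx/datasets/my-dataset/val"       # validation dataset
export EXP_DIR="/fsx/experiments/my-run"            # logs, checkpoints, etc.
echo "TRAIN_DIR=${TRAIN_DIR}"
```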
Running a recipe on a SageMaker HyperPod cluster orchestrated by Amazon EKS
Before starting training on your cluster, configure your local environment by following the installation instructions. You also need to install kubectl and Helm on your local machine; refer to their respective documentation for installation steps.
Using the recipes involves updating k8s.yaml and config.yaml, and running the launch script.
In k8s.yaml, update persistent_volume_claims. This mounts the Amazon FSx claim to the /data directory of each compute pod:
persistent_volume_claims:
- claimName: fsx-claim
mountPath: data
Update your launcher script (e.g., launcher_scripts/deepseek/run_llmft_deepseek_r1_distilled_llama_8b_seq4k_gpu_sft_lora.sh)
your_container: Use the LLMFT container image: 327873000638.dkr.ecr.us-east-1.amazonaws.com/hyperpod-recipes:llmft-v1.0.0
(Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:
recipes.model.hf_access_token=<your_hf_token>
#!/bin/bash
#Users should setup their cluster type in /recipes_collection/config.yaml
IMAGE="327873000638.dkr.ecr.us-east-1.amazonaws.com/hyperpod-recipes:llmft-v1.0.0"
SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
EXP_DIR="<your_exp_dir>" # Location to save experiment info including logging, checkpoints, etc
TRAIN_DIR="<your_training_data_dir>" # Location of training dataset
VAL_DIR="<your_val_data_dir>" # Location of validation dataset
HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
recipes=training/deepseek/llmft_deepseek_r1_distilled_llama_8b_seq4k_gpu_sft_lora \
base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
recipes.run.name="llmft-deepseek-r1" \
recipes.exp_manager.exp_dir="$EXP_DIR" \
cluster=k8s \
cluster_type=k8s \
container="${IMAGE}" \
recipes.model.data.train_dir=$TRAIN_DIR \
recipes.model.data.val_dir=$VAL_DIR
After you submit the training job, verify the submission with kubectl get pods. If the STATUS is PENDING or ContainerCreating, inspect the pod (for example, with kubectl describe pod <pod-name>) to get more details. After the job STATUS changes to Running, you can examine the logs (for example, with kubectl logs <pod-name>). The STATUS turns to Completed when the job finishes. For more information about the k8s cluster configuration, see Running a training job on HyperPod k8s.
To run an Amazon Nova recipe on a SageMaker HyperPod cluster orchestrated by Amazon EKS, you need to create a Restricted Instance Group in your cluster. Refer to the following documentation to learn more.
Running a recipe on a SageMaker HyperPod cluster orchestrated by Slurm
Note: Only LLMFT recipes are supported on Slurm clusters. VERL recipes are not supported on Slurm but are available on EKS and SageMaker training jobs.
To run a recipe on a HyperPod cluster with Slurm, SSH into the head node and clone the HyperPod recipes repository onto a shared filesystem (FSx or NFS). Follow the installation instructions to set up a Python virtual environment with the required dependencies.
Configuring the Recipe
Update the recipes_collection/config.yaml file with the LLMFT container image.
Running the Training Job
Set the required environment variables and launch the training script for your chosen recipe, for example an LLMFT recipe for a Llama or DeepSeek R1 Distilled model. The launcher scripts submit Slurm jobs to your cluster; you can monitor job status with standard Slurm commands such as squeue and scontrol.
Running a recipe on SageMaker training jobs
SageMaker training jobs automatically spin up a resilient distributed training cluster, monitor the infrastructure, and auto-recover from faults to ensure a smooth training experience. You can use the SageMaker Python SDK to execute your recipes on SageMaker training jobs.
The following Python code snippet demonstrates how to submit a recipe to run on SageMaker training jobs using the PyTorch estimator from the SageMaker Python SDK.
For example, to run the llama3-8b recipe on SageMaker training jobs, set the training_recipe argument to indicate which recipe to use: it can be one of the available recipes, or a URL or local YAML file containing a modified recipe. Also modify the local directory paths and the HF access token, either by providing recipe_overrides or by editing the recipe YAML file directly (the URL or local file).
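The original snippet is not preserved here; as a sketch of what it could look like, the recipe name, instance type, role ARN, and paths below are illustrative assumptions, and the SDK import is kept inside the helper so the pure override-builder works without the SDK installed:

```python
# Sketch of submitting a recipe via the SageMaker Python SDK's PyTorch estimator.
# Recipe name, instance type, role ARN, and S3 URIs below are placeholders.

def build_recipe_overrides(train_dir: str, val_dir: str) -> dict:
    """Redirect the recipe's data paths to the training job's input channels."""
    return {
        "run": {"results_dir": "/opt/ml/model"},
        "model": {"data": {"train_dir": train_dir, "val_dir": val_dir}},
    }

def make_estimator(role: str):
    # Imported here so build_recipe_overrides stays usable without the SDK.
    from sagemaker.pytorch import PyTorch

    return PyTorch(
        base_job_name="llmft-deepseek-r1",
        role=role,
        instance_count=1,
        instance_type="ml.p5.48xlarge",
        training_recipe=(
            "training/deepseek/"
            "llmft_deepseek_r1_distilled_llama_8b_seq4k_gpu_sft_lora"
        ),
        recipe_overrides=build_recipe_overrides(
            "/opt/ml/input/data/train", "/opt/ml/input/data/val"
        ),
    )

# Usage (requires AWS credentials and the sagemaker SDK):
#   est = make_estimator("arn:aws:iam::111122223333:role/SageMakerRole")
#   est.fit(inputs={"train": "s3://my-bucket/train", "val": "s3://my-bucket/val"})
```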
Running the above code creates a PyTorch estimator object with the specified training recipe and then trains the model using the fit() method. The training_recipe parameter enables you to specify the recipe you want to use.
To learn more about running Amazon Nova recipes on SageMaker training jobs, refer to this documentation.
Troubleshooting
During training, if GPU memory usage approaches its limit, attempting to save sharded checkpoints to S3 storage may result in a core dump. To address this issue, you can:
Reduce the overall memory consumption of the model training:
Increase the number of compute nodes for the training process
Decrease the batch size
Increase the sharding degrees
Use FSx as the shared file system
By taking one of the above approaches, you can alleviate the memory pressure and prevent a core dump from occurring during checkpoint saving.
Testing
Follow the instructions in the Installation section, then use the following commands to install the testing dependencies:
pip install pytest
pip install pytest-cov
Unit Tests
To run the unit tests, navigate to the root directory and use the command python -m pytest plus any desired flags.
The pyproject.toml file defines additional options that are always appended to the pytest command:
[tool.pytest.ini_options]
...
addopts = [
"--cache-clear",
"--quiet",
"--durations=0",
"--cov=launcher/",
# uncomment this line to see a detailed HTML test coverage report instead of the usual summary table output to stdout.
# "--cov-report=html",
"tests/",
]
For the golden tests including the launch JSON ones, the golden outputs can be updated by running GOLDEN_TEST_WRITE=1 python -m pytest.
Contributing
We use pre-commit to unify our coding format. Setup steps are as follows:
Install pre-commit, which runs formatters before each commit: pip install pre-commit
Set up hooks from our pre-commit hook configs in .pre-commit-config.yaml: pre-commit install
When you commit, the pre-commit hooks are applied. If for some reason you need to skip the checks, you can run git commit ... --no-verify, but make sure to include the reason for skipping pre-commit in the commit message.
Security
See CONTRIBUTING for more information.
License
This project is licensed under the Apache-2.0 License.