Amazon Nova Forge SDK
A comprehensive Python SDK for fine-tuning and customizing Amazon Nova models. This SDK provides a unified interface for training, evaluation, deployment, and monitoring of Nova models across both SageMaker Training Jobs and SageMaker HyperPod.
Installation
The SDK depends on the sagemaker package, which pip installs automatically as a dependency.
Setup
In most cases, the SDK will tell you if your environment lacks the setup required to run a Nova customization job.
The following are common requirements that you can configure in advance, before trying to run a job.
Supported Python Versions
Nova Forge SDK is tested on:
Python 3.12
IAM Roles/Policies
You will need an IAM role with sufficient permissions to use the Nova Forge SDK. The required permissions are listed in the docs/iam_setup.md file.
Instances
Nova customization jobs also require access to a sufficient number of instances of a supported type:
The requested instance type and count should be compatible with the requested job. The SDK will validate your instance configuration for you.
The SageMaker account quotas for using the requested instance type in training jobs (for SMTJ) or HyperPod clusters (for SMHP) should allow the requested number of instances.
(For SMHP) The selected HyperPod cluster should have a Restricted Instance Group with enough instances of the right type to run the requested job. The SDK will validate that your cluster contains a valid instance group.
See the docs/instance_type_spec.md file for the supported instance types and combinations for specific jobs and methods.
HyperPod CLI
For HyperPod-based customization jobs, the SDK uses the SageMaker HyperPod CLI to connect to HyperPod Clusters and start jobs.
Prerequisites (required for both Forge and Non-Forge customers)
Install Helm 3 and verify the installation with helm version. If you are using a Python virtual environment, activate it before installing the CLI. Non-Forge customers should install the release_v2 branch of the HyperPod CLI.
Supported Models and Training Methods
Models:
NOVA_MICRO - amazon.nova-micro-v1:0:128k
NOVA_LITE - amazon.nova-lite-v1:0:300k
NOVA_LITE_2 - amazon.nova-2-lite-v1:0:256k
NOVA_PRO - amazon.nova-pro-v1:0:300k
Training Methods:
CPT, DPO_LORA, DPO_FULL, SFT_LORA, SFT_FULL, RFT_LORA, RFT_FULL, RFT_MULTITURN_LORA, RFT_MULTITURN_FULL, EVALUATION
Platform Support:
SMTJ, SMTJServerless, SMHP, BEDROCK
Data Preparation
Before launching a training job, your data needs to be in the right format. The SDK’s dataset module handles loading, transforming, validating, filtering, and saving training data — supporting JSONL, JSON, CSV, Parquet, and Arrow formats from local files or S3.
For the complete guide, including column mappings, dataset splitting, filtering, chaining operations, and end-to-end examples, see the Data Preparation Guide. For a hands-on notebook walkthrough, see samples/dataprep_quickstart.ipynb.
Core Modules Overview
The Nova Forge SDK is organized into the following modules: Dataset, Manager, Model, Monitor, and RFT Multiturn.
Dataset Module
Handles data loading, transformation, validation, filtering, and persistence for training datasets. Supports JSONL, JSON, CSV, Parquet, and Arrow formats from local files or S3.
Key Classes:
JSONLDatasetLoader, JSONDatasetLoader, CSVDatasetLoader
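To illustrate the record shape a loader configured with question="input", answer="output" mappings would expect, here is a minimal standard-library sketch of JSONL validation. The SDK's loaders perform this (and much richer) checking for you; the field names here are examples only.

```python
import json

def validate_jsonl_records(lines, question_field="input", answer_field="output"):
    """Checks that each JSONL line parses and carries the mapped columns.

    Loosely mirrors what a loader configured with
    JSONLDatasetLoader(question="input", answer="output") expects.
    """
    errors = []
    for i, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {i}: not valid JSON")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {i}: not a JSON object")
            continue
        for field in (question_field, answer_field):
            if field not in record:
                errors.append(f"line {i}: missing field '{field}'")
    return errors
```

Running the check before upload catches malformed rows early, instead of at job-launch time.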
Manager Module
Manages runtime infrastructure for executing training and evaluation jobs.
For the allowed instance types for each model/method combination, see docs/instance_type_spec.md.
Main Methods:
execute() - Start a training or evaluation job
cleanup() - Stop and clean up a running job
scale_cluster() - (SMHP only) Scale HyperPod cluster instance groups up or down
get_instance_groups() - (SMHP only) View instance groups and current instance counts
Key Classes:
SMTJRuntimeManager - For SageMaker Training Jobs
SMTJServerlessRuntimeManager - For serverless SageMaker Training Jobs (SageMaker manages compute)
SMHPRuntimeManager - For SageMaker HyperPod clusters
BedrockRuntimeManager - For Amazon Bedrock managed service
Cluster Scaling (SMHP):
The SMHPRuntimeManager provides a scale_cluster() method to dynamically adjust the number of instances in a HyperPod cluster instance group:
from amzn_nova_forge.manager import SMHPRuntimeManager

# Create a runtime manager for your cluster
manager = SMHPRuntimeManager(
    instance_type="ml.p4d.24xlarge",
    instance_count=4,
    cluster_name="my-hyperpod-cluster",
    namespace="default"
)

# View the available instance groups to update
available_instance_groups = manager.get_instance_groups()

# Scale up the worker group from 4 to 8 instances
result = manager.scale_cluster(
    instance_group_name="worker-group",
    target_instance_count=8
)
For more cluster scaling documentation, see docs/spec.md.
Model Module
Provides the main SDK entrypoint for orchestrating model customization workflows.
Main Methods:
train() - Launch a training job
evaluate() - Launch an evaluation job
deploy() - Deploy trained model to Amazon SageMaker or Bedrock
batch_inference() - Run batch inference on trained model
get_logs() - Retrieve CloudWatch logs for current job
get_data_mixing_config() - Get data mixing configuration
set_data_mixing_config() - Set data mixing configuration
Key Class:
NovaModelCustomizer - Main orchestration class
Monitor Module
Provides job monitoring and experiment tracking capabilities.
Main Methods:
show_logs() - Display CloudWatch logs
get_logs() - Retrieve logs as list
from_job_result() - Create monitor from job result
from_job_id() - Create monitor from job ID
Key Classes:
CloudWatchLogMonitor - For viewing job logs
MLflowMonitor - For experiment tracking with presigned URL generation
RFT Multiturn Module
Manages infrastructure for reinforcement fine-tuning with multi-turn conversational tasks.
Main Methods:
setup() - Deploy SAM stack and validate platform
start_training_environment() - Start training environment
start_evaluation_environment() - Start evaluation environment
get_logs() - Retrieve environment logs
kill_task() - Stop running task
cleanup() - Clean up infrastructure resources
check_all_queues() - Check message counts in all queues
flush_all_queues() - Purge all messages from queues
Key Classes:
RFTMultiturnInfrastructure - Main infrastructure management class
CustomEnvironment - For creating custom reward environments
Supported Platforms:
LOCAL - Local development environment
EC2 - Amazon EC2 instances
ECS - Amazon ECS clusters
Built-in Environments:
VFEnvId.WORDLE - Wordle game environment
VFEnvId.TERMINAL_BENCH - Terminal benchmark environment
Iterative Training
The Nova Forge SDK supports iterative fine-tuning of Nova models.
This is done by progressively running fine-tuning jobs on the output checkpoint from the previous job:
# Stage 1: Initial training on base model
stage1_customizer = NovaModelCustomizer(
    model=Model.NOVA_LITE,
    method=TrainingMethod.SFT_LORA,
    infra=infra,
    data_s3_path="s3://bucket/stage1-data.jsonl",
    output_s3_path="s3://bucket/stage1-output"
)
stage1_result = stage1_customizer.train(job_name="stage1-training")

# Wait for completion...
stage1_checkpoint = stage1_result.model_artifacts.checkpoint_s3_path

# Stage 2: Continue training from Stage 1 checkpoint
stage2_customizer = NovaModelCustomizer(
    model=Model.NOVA_LITE,
    method=TrainingMethod.SFT_LORA,
    infra=infra,
    data_s3_path="s3://bucket/stage2-data.jsonl",
    output_s3_path="s3://bucket/stage2-output",
    model_path=stage1_checkpoint  # Use previous checkpoint
)
stage2_result = stage2_customizer.train(job_name="stage2-training")
Note: Iterative fine-tuning requires using the same model and training method (LoRA vs Full-Rank) across all stages.
Dry Run
The Nova Forge SDK supports dry_run mode for the following functions: train(), evaluate(), and batch_inference().
When calling any of the above functions, you can set the dry_run parameter to True.
The SDK still generates your recipe and validates your inputs, but it does not start a job. This is useful for testing a configuration, with the recipe generated for inspection, before committing compute resources.
# Training dry run
customizer.train(
    job_name="train_dry_run",
    dry_run=True,
    ...
)

# Evaluation dry run
customizer.evaluate(
    job_name="evaluate_dry_run",
    dry_run=True,
    ...
)
Data Mixing
Data mixing allows you to blend your custom training data with Nova’s high-quality curated datasets, helping maintain the model’s broad capabilities while adding your domain-specific knowledge.
Key Features:
Available for CPT and SFT training for Nova 1 and Nova 2 (both LoRA and Full-Rank) on SageMaker HyperPod
Mix customer data (0-100%) with Nova’s curated data
Nova data categories include general knowledge and code
Nova data percentages must sum to 100%
Example Usage:
# Initialize with data mixing enabled
customizer = NovaModelCustomizer(
    model=Model.NOVA_LITE_2,
    method=TrainingMethod.SFT_LORA,
    infra=SMHPRuntimeManager(...),  # Must use HyperPod
    data_s3_path="s3://bucket/data.jsonl",
    output_s3_path="s3://bucket/output",  # Optional
    data_mixing_enabled=True
)

# Configure data mixing percentages
customizer.set_data_mixing_config({
    "customer_data_percent": 50,  # 50% your data
    "nova_code_percent": 30,      # 30% Nova code data (30% of Nova's 50%)
    "nova_general_percent": 70    # 70% Nova general data (70% of Nova's 50%)
})

# Or use 100% customer data (no Nova mixing)
customizer.set_data_mixing_config({
    "customer_data_percent": 100,
    "nova_code_percent": 0,
    "nova_general_percent": 0
})
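To make the arithmetic concrete: the Nova-side percentages split whatever share remains after customer data, so they compose multiplicatively with the customer percentage. A plain-Python sanity check, independent of the SDK:

```python
customer_data_percent = 50
nova_code_percent = 30      # share of the Nova portion, not of the whole mix
nova_general_percent = 70   # share of the Nova portion, not of the whole mix

# The Nova-side split must always sum to 100
assert nova_code_percent + nova_general_percent == 100

nova_share = 100 - customer_data_percent                      # 50% of the final mix
effective_code = nova_share * nova_code_percent / 100         # 15.0% of the final mix
effective_general = nova_share * nova_general_percent / 100   # 35.0% of the final mix
```

So with the 50/30/70 configuration above, the final mix is 50% customer data, 15% Nova code data, and 35% Nova general data.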
Important Notes:
The dataset_catalog field is system-managed and cannot be set by users
Data mixing is only available on SageMaker HyperPod platform for Forge customers.
Refer to the Get Forge Subscription page to enable a Nova subscription in your account and use this feature.
Job Notifications
Get email notifications when your training jobs complete, fail, or are stopped. The SDK automatically sets up the required AWS infrastructure (CloudFormation, DynamoDB, SNS, Lambda, EventBridge) to monitor job status and send notifications.
Features:
Automatic AWS infrastructure setup and management
Email notifications for terminal job states (Completed, Failed, Stopped)
Email notifications for SMHP master pods that are stuck in a crash loop
Output artifact validation for successful jobs (manifest.json)
Optional customer key KMS encryption for SNS topics
Platform Support:
SMTJ (SageMaker Training Jobs): Minimal configuration required
SMTJServerless (SageMaker Serverless): No instance type needed — SageMaker manages compute automatically
SMHP (SageMaker HyperPod): Requires kubectl Lambda layer + additional parameters (see docs/spec.md for more details)
Bedrock (Amazon Bedrock): Fully managed, no infrastructure configuration required
Quick Example:
# Start a training job
result = customizer.train(job_name="my-job")

# Enable notifications (SMTJ)
result.enable_job_notifications(
    emails=["user@example.com"]
)

# Enable notifications (SMHP)
result.enable_job_notifications(
    emails=["user@example.com"],
    namespace="kubeflow",  # Required for SMHP
    kubectl_layer_arn="arn:aws:lambda:<region>:123456789012:layer:kubectl:1"  # Required for SMHP
)
Important Notes:
Users must confirm their email subscription by clicking the link in the AWS SNS confirmation email
See docs/job_notifications.md for detailed setup instructions, troubleshooting, and advanced usage, and docs/spec.md for the complete job notifications API documentation.
Telemetry
The Nova Forge SDK has telemetry enabled to help us better understand user needs, diagnose issues, and deliver new features. This telemetry tracks the usage of various SDK functions. If you prefer to opt out of telemetry, you can do so by setting the TELEMETRY_OPT_OUT environment variable to true:
export TELEMETRY_OPT_OUT=true
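The variable can also be set from Python before the SDK is imported (assuming, as with most opt-out flags, that it is read at import time):

```python
import os

# Opt out of Nova Forge SDK telemetry for this process and its children;
# set this before importing the SDK
os.environ["TELEMETRY_OPT_OUT"] = "true"
```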
Getting Started
This comprehensive SDK enables end-to-end customization of Amazon Nova models with support for multiple training methods, deployment platforms, and monitoring capabilities. Each module is designed to work together seamlessly while providing flexibility for advanced use cases.
To get started customizing Nova models, please see the following files:
A notebook with quick-start examples at samples/nova_quickstart.ipynb
RFT quick-start notebooks at samples/rft_singleturn_quickstart.ipynb and samples/rft_multiturn_quickstart.ipynb (see also docs/rft_multiturn.md)
The specification document with detailed information about each module at docs/spec.md
Security Best Practices for SDK Users
1. IAM and Access Management
Execution Roles
Use dedicated execution roles for SageMaker training jobs with minimal required permissions
Avoid using admin roles - follow the principle of least privilege
Regularly audit role permissions and remove unused policies
# Good: Explicit execution role
runtime = SMTJRuntimeManager(
    instance_type="ml.p5.48xlarge",
    instance_count=2,
    execution_role="arn:aws:iam::123456789012:role/SageMakerNovaTrainingRole"
)
# Avoid: Using default role without validation
Required Permissions
The SDK requires specific IAM permissions. Review the IAM section and:
Grant only the minimum permissions needed for your use case
Use condition statements to restrict resource access
Regularly review and rotate access keys
2. Credential Management
AWS Credentials
Never hardcode credentials in code or configuration files
Use IAM roles instead of access keys when possible
Rotate credentials regularly
Use AWS Secrets Manager for application secrets
Enable credential monitoring through AWS Config
MLflow Integration
Secure MLflow tracking URIs with proper authentication
Use encrypted connections to MLflow servers
Implement access controls on experiment data
Regularly audit MLflow access logs
3. Data Security and Privacy
Training Data Protection
Encrypt data at rest in S3 using KMS keys
Use S3 bucket policies to restrict access
Validate data sources before processing
# Ensure your S3 buckets have proper encryption and access controls
customizer = NovaModelCustomizer(
    model=Model.NOVA_LITE_2,
    method=TrainingMethod.SFT_LORA,
    infra=runtime,
    data_s3_path="s3://secure-training-bucket/encrypted-data/data.jsonl",
    output_s3_path="s3://secure-output-bucket/results"
)
4. Network Security
VPC Configuration
Deploy in private subnets when possible
Use VPC endpoints for AWS service access
Implement security groups with minimal required ports
Enable VPC Flow Logs for network monitoring
5. Secure Communication
Always use HTTPS endpoints
Never disable SSL certificate verification
Keep TLS libraries updated
6. Input Validation
Always validate user inputs before passing to SDK
Sanitize data that will be stored or processed
Check resource quotas before job submission
Sanitize job names and resource identifiers
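As one concrete case of sanitizing identifiers: SageMaker training job names are limited to 63 characters matching ^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}. A small hypothetical helper (not part of the SDK) can enforce that before submission:

```python
import re

# SageMaker training job names: 1-63 chars, alphanumerics and hyphens,
# starting and ending with an alphanumeric character
JOB_NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")

def sanitize_job_name(raw: str) -> str:
    """Coerces an arbitrary string into a valid job name, raising if impossible."""
    # Replace disallowed characters with hyphens, trim edge hyphens, cap length
    candidate = re.sub(r"[^a-zA-Z0-9-]", "-", raw).strip("-")[:63]
    if not JOB_NAME_RE.match(candidate):
        raise ValueError(f"cannot derive a valid job name from {raw!r}")
    return candidate
```

Rejecting or normalizing names up front gives a clearer error than a failed API call deep inside job submission.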
# The SDK includes built-in validation
loader = JSONLDatasetLoader(question="input", answer="output")
loader.load("s3://your-bucket/training-data.jsonl")
# Always validate your data format
loader.validate(method=TrainingMethod.SFT_LORA, model=Model.NOVA_LITE_2)
7. Monitoring & Logging
Enable CloudTrail for API audit logs
Use CloudWatch for operational monitoring
Never log sensitive data (tokens, credentials, PII)
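One way to reduce the risk of leaking secrets into logs is a redaction filter. The sketch below is illustrative and not part of the SDK; the patterns are examples only and a real deployment would use a vetted, broader set:

```python
import logging
import re

# Patterns that commonly indicate secrets in log lines (illustrative, not exhaustive)
SECRET_PATTERNS = [
    re.compile(r"(?i)(aws_secret_access_key\s*[=:]\s*)\S+"),
    re.compile(r"(?i)(token\s*[=:]\s*)\S+"),
]

class RedactingFilter(logging.Filter):
    """Masks secret-looking values before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        for pattern in SECRET_PATTERNS:
            msg = pattern.sub(r"\1***", msg)
        # Store the redacted message back on the record
        record.msg, record.args = msg, None
        return True

# Attach the filter to the logger used around job submission
logger = logging.getLogger("nova-training")
logger.addFilter(RedactingFilter())
```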
Security Monitoring
8. Deployment Security
Bedrock Deployment
9. Validation
The SDK includes built-in validation, which is enabled by default.