The Amazon SageMaker HyperPod command-line interface (HyperPod CLI) is a tool that helps manage clusters, training jobs, and inference endpoints on the SageMaker HyperPod clusters orchestrated by Amazon EKS.
The SageMaker HyperPod CLI helps create training jobs and inference endpoint deployments on SageMaker HyperPod clusters orchestrated by Amazon EKS. It provides commands for managing the full lifecycle of jobs (create, describe, list, and delete) as well as for accessing pod and operator logs where applicable. The CLI abstracts away the complexity of working directly with Kubernetes for these core actions.
Prerequisites
Region Configuration
Important: For commands that accept the --region option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration.
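As a rough sketch of how that default is resolved, the snippet below reads the region from the standard AWS config file, after checking the environment variables that real AWS credential resolution consults first. The exact lookup order used by the CLI is an assumption; this is for illustration only.

```python
import configparser
import os

def default_region(config_path=os.path.expanduser("~/.aws/config"), profile="default"):
    # Environment variables take precedence in AWS credential resolution
    region = os.environ.get("AWS_REGION") or os.environ.get("AWS_DEFAULT_REGION")
    if region:
        return region
    parser = configparser.ConfigParser()
    parser.read(config_path)  # a missing file simply yields an empty config
    if parser.has_section(profile) and parser.has_option(profile, "region"):
        return parser.get(profile, "region")
    return None
```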
Prerequisites for Training
The HyperPod CLI currently supports starting PyTorchJobs. To start a job, you need to install the Training Operator first.
This command lists the available SageMaker HyperPod clusters and their capacity information.
```shell
hyp list-cluster
```

| Option | Type | Description |
| --- | --- | --- |
| `--region <region>` | Optional | The region where the SageMaker HyperPod and EKS clusters are located. If not specified, the region from your current AWS credentials is used. |
| `--namespace <namespace>` | Optional | The namespace to check the quota against. Only SageMaker-managed namespaces are supported. |
| `--output <json\|table>` | Optional | The output format. Available values are `table` and `json`. The default value is `json`. |
| `--debug` | Optional | Enable debug mode for detailed logging. |
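For example, to list clusters in a specific Region in table form (the Region value is a placeholder):

```shell
hyp list-cluster --region us-west-2 --output table
```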
Connecting to a Cluster
This command configures the local kubectl environment to interact with the specified SageMaker HyperPod cluster and namespace.

```shell
hyp set-cluster-context --cluster-name <cluster-name>
```

| Option | Type | Description |
| --- | --- | --- |
| `--cluster-name <cluster-name>` | Required | The SageMaker HyperPod cluster name to configure with. |
| `--namespace <namespace>` | Optional | The namespace to connect to. If not specified, HyperPod CLI commands auto-discover an accessible namespace. |
| `--region <region>` | Optional | The AWS region where the HyperPod cluster resides. |
| `--debug` | Optional | Enable debug mode for detailed logging. |
Getting Cluster Context
Get all the context related to the currently set cluster.

```shell
hyp get-cluster-context
```

| Option | Type | Description |
| --- | --- | --- |
| `--debug` | Optional | Enable debug mode for detailed logging. |
CLI
Cluster Management
Important: For commands that accept the --region option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration.
Cluster stack names must be unique within each AWS region. If you attempt to create a cluster stack with a name that already exists in the same region, the deployment will fail.
Initialize Cluster Configuration
Initialize a new cluster configuration in the current directory:
```shell
hyp init cluster-stack
```
Important: The resource_name_prefix parameter in the generated config.yaml file serves as the primary identifier for all AWS resources created during deployment. Each deployment must use a unique resource name prefix to avoid conflicts. This prefix is automatically appended with a unique identifier during cluster creation to ensure resource uniqueness.
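The uniqueness scheme described above can be pictured as a user-chosen prefix plus a generated suffix. The snippet below is only an illustration; HyperPod's actual suffix format is an internal detail.

```python
import uuid

# A user-chosen resource_name_prefix plus a generated suffix keeps resource
# names unique per deployment (the real suffix format is internal to HyperPod).
resource_name_prefix = "my-hyperpod"
unique_name = f"{resource_name_prefix}-{uuid.uuid4().hex[:8]}"
print(unique_name)
```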
Configure Cluster Parameters
Configure cluster parameters interactively or via command line:
The following options apply when creating a space:

| Option | Type | Required | Description |
| --- | --- | --- | --- |
| `--desired-status` | TEXT | No | DesiredStatus specifies the desired operational status. |
| `--ownership-type` | TEXT | No | OwnershipType specifies who can modify the space. `Public` means anyone with RBAC permissions can update/delete the space; `OwnerOnly` means only the creator can update/delete the space. |
| `--node-selector` | TEXT | No | NodeSelector specifies node selection constraints for the space pod (JSON string). |
| `--affinity` | TEXT | No | Affinity specifies node affinity and anti-affinity rules for the space pod (JSON string). |
| `--tolerations` | TEXT | No | Tolerations specifies tolerations for the space pod to schedule on nodes with matching taints (JSON string). |
| `--lifecycle` | TEXT | No | Lifecycle specifies actions that the management system should take in response to container lifecycle events (JSON string). |
| `--app-type` | TEXT | No | AppType specifies the application type for this workspace. |
| `--service-account-name` | TEXT | No | ServiceAccountName specifies the name of the ServiceAccount to use for the workspace pod. |
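Since `--node-selector`, `--affinity`, and `--tolerations` take JSON strings, it is easy to make quoting mistakes on the command line. One way to build them reliably is with `json.dumps`; the label and taint values below are illustrative, not required values.

```python
import json

# Build JSON-string option values programmatically to avoid quoting mistakes
# (the instance-type label and GPU taint below are only examples).
node_selector = json.dumps({"node.kubernetes.io/instance-type": "ml.g5.xlarge"})
tolerations = json.dumps([
    {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
])

print(node_selector)
print(tolerations)
```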
```shell
# List spaces in default namespace
hyp list hyp-space

# List spaces in specific namespace
hyp list hyp-space --namespace my-namespace

# List spaces across all namespaces
hyp list hyp-space --all-namespaces

# List spaces with JSON output
hyp list hyp-space --output json
```

Port forward to access a space from your local machine:

```shell
# Port forward with default port (8888)
hyp portforward hyp-space --name myspace

# Port forward with custom local port
hyp portforward hyp-space --name myspace --local-port 8080
```
Access the space via http://localhost:<local-port> after port forwarding is established. Press Ctrl+C to stop port forwarding.
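To confirm the forwarded local port is actually accepting connections before opening the browser, a small check like the following works; the port number matches the `--local-port` example above and is an assumption.

```python
import socket

# Returns True if something is listening on host:port, False otherwise.
def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_open("localhost", 8080))
```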
Listing Cluster Stacks

```python
from sagemaker.hyperpod.cluster_management.hp_cluster_stack import HpClusterStack

# List all cluster stacks
stacks = HpClusterStack.list(region="us-east-2")
print(f"Found {len(stacks['StackSummaries'])} stacks")
```

Describing a Cluster Stack

```python
from sagemaker.hyperpod.cluster_management.hp_cluster_stack import HpClusterStack

# Describe a specific cluster stack
stack_info = HpClusterStack.describe("my-stack-name", region="us-east-2")
print(f"Stack status: {stack_info['Stacks'][0]['StackStatus']}")
```
Monitoring Cluster Status

```python
from sagemaker.hyperpod.cluster_management.hp_cluster_stack import HpClusterStack

stack = HpClusterStack()
response = stack.create(region="us-west-2")

status = stack.get_status(region="us-west-2")
print(status)
```
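Stack creation is asynchronous, so a common pattern is to poll until a terminal CloudFormation state is reached. This is a hedged sketch: with the SDK above you might pass something like `lambda: stack.get_status(region="us-west-2")`, but the exact return shape of `get_status` is an assumption, so adapt the callable accordingly.

```python
import time

# Terminal CloudFormation stack states (subset; extend as needed).
TERMINAL_STATES = {
    "CREATE_COMPLETE", "CREATE_FAILED", "ROLLBACK_COMPLETE", "DELETE_COMPLETE",
}

def wait_for_stack(get_status, poll_seconds=15, max_polls=120):
    # Call get_status() until it returns a terminal state or we give up.
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("stack did not reach a terminal state")
```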
Deleting a Cluster Stack
```python
import logging

from sagemaker.hyperpod.cluster_management.hp_cluster_stack import HpClusterStack

# Delete with custom logger
logger = logging.getLogger(__name__)
HpClusterStack.delete("my-stack-name", region="us-west-2", logger=logger)

# Delete with retained resources (only works on DELETE_FAILED stacks)
HpClusterStack.delete("my-stack-name", retain_resources=["S3Bucket", "EFSFileSystem"])
```
Training SDK
Creating a Training Job
```python
from sagemaker.hyperpod.training.hyperpod_pytorch_job import HyperPodPytorchJob
from sagemaker.hyperpod.training.config.hyperpod_pytorch_job_unified_config import (
    ReplicaSpec, Template, Spec, Containers, Resources, RunPolicy
)
from sagemaker.hyperpod.common.config.metadata import Metadata

# Define job specifications
nproc_per_node = "1"  # Number of processes per node

replica_specs = [
    ReplicaSpec(
        name="pod",  # Replica name
        template=Template(
            spec=Spec(
                containers=[
                    Containers(
                        # Container name
                        name="container-name",
                        # Training image
                        image="123456789012.dkr.ecr.us-west-2.amazonaws.com/my-training-image:latest",
                        # Always pull image
                        image_pull_policy="Always",
                        resources=Resources(
                            # No GPUs requested
                            requests={"nvidia.com/gpu": "0"},
                            # No GPU limit
                            limits={"nvidia.com/gpu": "0"},
                        ),
                        # Command to run
                        command=["python", "train.py"],
                        # Script arguments
                        args=["--epochs", "10", "--batch-size", "32"],
                    )
                ]
            )
        ),
    )
]

# Keep pods after completion
run_policy = RunPolicy(clean_pod_policy="None")

# Create and start the PyTorch job
pytorch_job = HyperPodPytorchJob(
    metadata=Metadata(name="demo"),   # Job name
    nproc_per_node=nproc_per_node,    # Processes per node
    replica_specs=replica_specs,      # Replica specifications
    run_policy=run_policy,            # Run policy
)

# Launch the job
pytorch_job.create()
```
List Training Jobs

```python
from sagemaker.hyperpod.training import HyperPodPytorchJob
import yaml

# List all PyTorch jobs
jobs = HyperPodPytorchJob.list()
print(yaml.dump(jobs))
```

Describe a Training Job

```python
from sagemaker.hyperpod.training import HyperPodPytorchJob

# Get an existing job
job = HyperPodPytorchJob.get(name="my-pytorch-job")
print(job)
```

List Pods for a Training Job

```python
from sagemaker.hyperpod.training import HyperPodPytorchJob

# List pods for an existing job
job = HyperPodPytorchJob.get(name="my-pytorch-job")
print(job.list_pods())
```

Get Logs from a Pod

```python
from sagemaker.hyperpod.training import HyperPodPytorchJob

# Get pod logs for a job
job = HyperPodPytorchJob.get(name="my-pytorch-job")
print(job.get_logs_from_pod("pod-name"))
```

Get Training Operator Logs

```python
from sagemaker.hyperpod.training import HyperPodPytorchJob

# Get training operator logs
job = HyperPodPytorchJob.get(name="my-pytorch-job")
print(job.get_operator_logs(since_hours=0.1))
```

Delete a Training Job

```python
from sagemaker.hyperpod.training import HyperPodPytorchJob

# Get an existing job and delete it
job = HyperPodPytorchJob.get(name="my-pytorch-job")
job.delete()
```
Inference SDK

List Endpoints

```python
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# List JumpStart endpoints
jumpstart_endpoints = HPJumpStartEndpoint.list()
print(jumpstart_endpoints)

# List custom endpoints
custom_endpoints = HPEndpoint.list()
print(custom_endpoints)
```
Describe an Endpoint
```python
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# Get JumpStart endpoint details
jumpstart_endpoint = HPJumpStartEndpoint.get(name="js-endpoint-name", namespace="test")
print(jumpstart_endpoint)

# Get custom endpoint details
custom_endpoint = HPEndpoint.get(name="endpoint-custom")
print(custom_endpoint)
```
Invoke an Endpoint
```python
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

data = '{"inputs":"What is the capital of USA?"}'

# Invoke a JumpStart endpoint
jumpstart_endpoint = HPJumpStartEndpoint.get(name="endpoint-jumpstart")
response = jumpstart_endpoint.invoke(body=data).body.read()
print(response)

# Invoke a custom endpoint
custom_endpoint = HPEndpoint.get(name="endpoint-custom")
response = custom_endpoint.invoke(body=data).body.read()
print(response)
```
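The invoke body is a JSON document, so it is safer to build it with `json.dumps` than to hand-write the string. The response schema sketched below is illustrative only; real output depends on the model behind the endpoint.

```python
import json

# Build the request payload programmatically (same schema as the example above).
payload = json.dumps({"inputs": "What is the capital of USA?"})

# Responses come back as bytes; decode and parse them.
raw = b'[{"generated_text": "Washington, D.C."}]'  # example response bytes
parsed = json.loads(raw.decode("utf-8"))
print(parsed[0]["generated_text"])
```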
List Pods
```python
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# List pods
js_pods = HPJumpStartEndpoint.list_pods()
print(js_pods)

c_pods = HPEndpoint.list_pods()
print(c_pods)
```
Get Logs
```python
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# Get logs from a pod (replace <pod-name> with an actual pod name)
js_logs = HPJumpStartEndpoint.get_logs(pod="<pod-name>")
print(js_logs)

c_logs = HPEndpoint.get_logs(pod="<pod-name>")
print(c_logs)
```
Get Operator Logs
```python
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# Operator logs for JumpStart endpoints
print(HPJumpStartEndpoint.get_operator_logs(since_hours=0.1))

# Operator logs for custom endpoints
print(HPEndpoint.get_operator_logs(since_hours=0.1))
```
Observability - Getting Monitoring Information

```python
from sagemaker.hyperpod.observability.utils import get_monitoring_config

monitor_config = get_monitoring_config()
```
Space SDK
Creating a Space
```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace
from hyperpod_space_template.v1_0.model import SpaceConfig

# Create space configuration
space_config = SpaceConfig(
    name="myspace",
    namespace="default",
    display_name="My Space",
)

# Create and start the space
space = HPSpace(config=space_config)
space.create()
```

List Spaces

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# List all spaces in default namespace
spaces = HPSpace.list()
for space in spaces:
    print(f"Space: {space.config.name}, Status: {space.status}")

# List spaces in specific namespace
spaces = HPSpace.list(namespace="your-namespace")
```

Get a Space

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get specific space
space = HPSpace.get(name="myspace", namespace="default")
print(f"Space name: {space.config.name}")
print(f"Display name: {space.config.display_name}")
```

Update a Space

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get existing space
space = HPSpace.get(name="myspace")

# Update space configuration
space.update(
    display_name="Updated Space Name",
)
```

Start/Stop a Space

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get existing space
space = HPSpace.get(name="myspace")

# Start the space
space.start()

# Stop the space
space.stop()
```

Get Space Logs

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get space and retrieve logs from the default pod and container
space = HPSpace.get(name="myspace")
logs = space.get_logs()
print(logs)
```

List Space Pods

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get space and list associated pods
space = HPSpace.get(name="myspace")
pods = space.list_pods()
for pod in pods:
    print(f"Pod: {pod}")
```

Create Space Access

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get existing space
space = HPSpace.get(name="myspace")

# Create VS Code remote access
vscode_access = space.create_space_access(connection_type="vscode-remote")
print(f"VS Code URL: {vscode_access['SpaceConnectionUrl']}")

# Create web UI access
web_access = space.create_space_access(connection_type="web-ui")
print(f"Web UI URL: {web_access['SpaceConnectionUrl']}")
```

Delete a Space

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get existing space and delete it
space = HPSpace.get(name="myspace")
space.delete()
```

Port Forward to a Space

```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get existing space
space = HPSpace.get(name="myspace")

# Port forward with default remote port (8888)
space.portforward_space(local_port="8080")

# Port forward with custom remote port
space.portforward_space(local_port="8080", remote_port="8888")
```
Access the space via http://localhost:<local-port> after port forwarding is established. Press Ctrl+C to stop port forwarding.
Space Template Management
```python
from sagemaker.hyperpod.space.hyperpod_space_template import HPSpaceTemplate

# Create space template from YAML file
template = HPSpaceTemplate(file_path="template.yaml")
template.create()

# List all space templates
templates = HPSpaceTemplate.list()
for template in templates:
    print(f"Template: {template.name}")

# Get specific space template
template = HPSpaceTemplate.get(name="my-template")
print(template.to_yaml())

# Update space template
template.update(file_path="updated-template.yaml")

# Delete space template
template.delete()
```
Examples
This repository provides both a full end-to-end example walkthrough of using the CLI for real-world training and inference workloads as well as standalone example notebooks for individual features.
The CLI and SDK require access to the user's file system to set and get context and to function properly. They need to read configuration files, such as the kubeconfig, to establish the necessary environment settings.
Working behind a proxy server?
Follow the steps here to set up HTTP proxy connections.
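One common setup is to export the standard proxy environment variables that most CLI tooling honors; the proxy host and port below are placeholders for your own proxy endpoint.

```shell
export HTTP_PROXY=http://proxy.example.com:8080
export HTTPS_PROXY=http://proxy.example.com:8080
# Bypass the proxy for local traffic
export NO_PROXY=localhost,127.0.0.1
```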
SageMaker HyperPod command-line interface
This documentation serves as a reference for the available HyperPod CLI commands. For a comprehensive user guide, see Orchestrating SageMaker HyperPod clusters with Amazon EKS in the Amazon SageMaker Developer Guide.
Note: The old HyperPod CLI V2 has been moved to the `release_v2` branch. Please refer to the `release_v2` branch for usage.

Table of Contents
Overview
Prerequisites
Region Configuration
Important: For commands that accept the `--region` option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration.

Prerequisites for Training
Prerequisites for Inference
Platform Support
SageMaker HyperPod CLI currently supports Linux and macOS. Windows is not currently supported.
ML Framework Support
SageMaker HyperPod CLI currently supports starting training jobs with:
Installation
Make sure your local Python version is 3.8, 3.9, 3.10, or 3.11.

Install the `sagemaker-hyperpod-cli` package.

Verify that the installation succeeded by running the following command.
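A minimal install sketch, assuming the package name given above and a Python 3.8-3.11 environment; `hyp --help` is used here only as a smoke test that the entry point is on your PATH:

```shell
pip install sagemaker-hyperpod-cli
# Confirm the `hyp` entry point is available
hyp --help
```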
Usage
The HyperPod CLI provides the following commands:
Getting Started
Getting Cluster information
Options: `--region <region>`, `--namespace <namespace>`, `--output <json|table>` (`table` or `json`; default `json`), `--debug`

Connecting to a Cluster
Options: `--cluster-name <cluster-name>`, `--namespace <namespace>`, `--region <region>`, `--debug`

Getting Cluster Context
Options: `--debug`

CLI
Cluster Management
Important: For commands that accept the `--region` option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration.

Cluster stack names must be unique within each AWS region. If you attempt to create a cluster stack with a name that already exists in the same region, the deployment will fail.
Initialize Cluster Configuration
Initialize a new cluster configuration in the current directory:
Important: The `resource_name_prefix` parameter in the generated `config.yaml` file serves as the primary identifier for all AWS resources created during deployment. Each deployment must use a unique resource name prefix to avoid conflicts. This prefix is automatically appended with a unique identifier during cluster creation to ensure resource uniqueness.

Configure Cluster Parameters
Configure cluster parameters interactively or via command line:
Validate Configuration
Validate the configuration file syntax:
Create Cluster Stack
Create the cluster stack using the configured parameters:
Note: The region flag is optional. If not provided, the command will use the default region from your AWS credentials configuration.
List Cluster Stacks
Options: `--region <region>`, `--status` (e.g. `"['CREATE_COMPLETE', 'UPDATE_COMPLETE']"`), `--debug`

Describe Cluster Stack
Options: `--region <region>`, `--debug`

Delete Cluster Stack
Delete a HyperPod cluster stack. Removes the specified CloudFormation stack and all associated AWS resources. This operation cannot be undone.
Options: `--region <region>`, `--retain-resources` (comma-separated logical resource IDs to retain, e.g. `S3Bucket-TrainingData,EFSFileSystem-Models`; resource names are shown in failed-deletion output, or use `aws cloudformation list-stack-resources STACK_NAME --region REGION`), `--debug`

Update Existing Cluster
Reset Configuration
Reset configuration to default values:
Training
Option 1: Create PyTorch job through init experience

Initialize PyTorch Job Configuration

Initialize a new PyTorch job configuration in the current directory:

Configure PyTorch Job Parameters

Configure PyTorch job parameters interactively or via command line:
Validate Configuration
Validate the configuration file syntax:
Create PyTorch Job

Create the PyTorch job using the configured parameters:

Option 2: Create PyTorch job through create command

Example with accelerator partitions:
Options: `--job-name`, `--image`, `--namespace`, `--command`, `--args`, `--environment`, `--pull-policy`, `--instance-type`, `--node-count`, `--tasks-per-node`, `--label-selector`, `--deep-health-check-passed-nodes-only`, `--scheduler-type`, `--queue-name`, `--priority`, `--max-retry`, `--volume`, `--service-account-name`, `--accelerators`, `--vcpu`, `--memory`, `--accelerators-limit`, `--vcpu-limit`, `--memory-limit`, `--accelerator-partition-type`, `--accelerator-partition-count`, `--accelerator-partition-limit`, `--preferred-topology`, `--required-topology`, `--max-node-count`, `--elastic-replica-increment-step`, `--elastic-graceful-shutdown-timeout-in-seconds`, `--elastic-scaling-timeout-in-seconds`, `--elastic-scale-up-snooze-time-in-seconds`, `--elastic-replica-discrete-values`, `--debug`

List Available Accelerator Partition Types
This command lists the available accelerator partition types on the cluster for a specific instance type.
List Training Jobs
Describe a Training Job
Listing Pods
This command lists all the pods associated with a specific training job.
`job-name` (string) - Required. The name of the job to list pods for.

Accessing Logs
This command retrieves the logs for a specific pod within a training job.
Options: `--job-name`, `--pod-name`, `--namespace`, `--container`

Get Operator Logs
Delete a Training Job
Inference
JumpStart Endpoint Creation

Option 1: Create JumpStart endpoint through init experience

Initialize JumpStart Endpoint Configuration

Initialize a new JumpStart endpoint configuration in the current directory:

Configure JumpStart Endpoint Parameters

Configure JumpStart endpoint parameters interactively or via command line:
Validate Configuration
Validate the configuration file syntax:
Create JumpStart Endpoint

Create the JumpStart endpoint using the configured parameters:

Option 2: Create JumpStart endpoint through create command

Pre-trained JumpStart models are listed at https://sagemaker.readthedocs.io/en/v2.82.0/doc_utils/jumpstart.html and can be passed to the endpoint-creation call.
Options: `--model-id`, `--instance-type`, `--namespace`, `--metadata-name`, `--accept-eula`, `--model-version`, `--endpoint-name`, `--tls-certificate-output-s3-uri`, `--debug`

Invoke a JumpStartModel Endpoint
Managing an Endpoint
List Pods
Get Logs
Get Operator Logs
Deleting an Endpoint
Custom Endpoint Creation
Option 1: Create custom endpoint through init experience
Initialize Custom Endpoint Configuration
Initialize a new custom endpoint configuration in the current directory:
Configure Custom Endpoint Parameters
Configure custom endpoint parameters interactively or via command line:
Validate Configuration
Validate the configuration file syntax:
Create Custom Endpoint
Create the custom endpoint using the configured parameters:
Option 2: Create custom endpoint through create command
Options: `--instance-type`, `--model-name`, `--model-source-type`, `--image-uri`, `--container-port`, `--model-volume-mount-name`, `--namespace`, `--metadata-name`, `--endpoint-name`, `--env`, `--metrics-enabled`, `--model-version`, `--model-location`, `--prefetch-enabled`, `--tls-certificate-output-s3-uri`, `--fsx-dns-name`, `--fsx-file-system-id`, `--fsx-mount-name`, `--s3-bucket-name`, `--s3-region`, `--model-volume-mount-path`, `--resources-limits`, `--resources-requests`, `--dimensions`, `--metric-collection-period`, `--metric-collection-start-time`, `--metric-name`, `--metric-stat`, `--metric-type`, `--min-value`, `--cloud-watch-trigger-name`, `--cloud-watch-trigger-namespace`, `--target-value`, `--use-cached-metrics`, `--invocation-endpoint`, `--debug`

Invoke a Custom Inference Endpoint
Managing an Endpoint
List Pods
Get Logs
Get Operator Logs
Deleting an Endpoint
Space
Create a Space
Options: `--name`, `--display-name`, `--namespace`, `--image`, `--desired-status`, `--ownership-type`, `--node-selector`, `--affinity`, `--tolerations`, `--lifecycle`, `--app-type`, `--service-account-name`, `--idle-shutdown`, `--template-ref`, `--container-config`, `--storage`, `--volume`, `--accelerator-partition-count`, `--accelerator-partition-type`, `--gpu-limit`, `--gpu`, `--memory-limit`, `--memory`, `--cpu-limit`, `--cpu`

List Spaces
Describe a Space
Update a Space
Start/Stop a Space
Get Logs
Delete a Space
Port Forward to a Space
Port forward to access a space from your local machine:
Access the space via `http://localhost:<local-port>` after port forwarding is established. Press Ctrl+C to stop port forwarding.

Space Template Management
Create reusable space templates:
Space Access
Create remote access to spaces:
SDK
Along with the CLI, SDKs are available that provide the same cluster management, training, and inference functionality as the CLI.
Cluster Management SDK
Creating a Cluster Stack
Listing Cluster Stacks
Describing a Cluster Stack
Monitoring Cluster Status
Deleting a Cluster Stack
Training SDK
Creating a Training Job
List Training Jobs
Describe a Training Job
List Pods for a Training Job
Get Logs from a Pod
Get Training Operator Logs
Delete a Training Job
Inference SDK
Creating a JumpStartModel Endpoint
Creating a Custom Inference Endpoint (with S3)
List Endpoints
Describe an Endpoint
Invoke an Endpoint
List Pods
Get Logs
Get Operator Logs
Delete an Endpoint
Observability - Getting Monitoring Information
Space SDK
Creating a Space
List Spaces
Get a Space
Update a Space
Start/Stop a Space
Get Space Logs
List Space Pods
Create Space Access
Delete a Space
Port Forward to a Space
Access the space via `http://localhost:<local-port>` after port forwarding is established. Press Ctrl+C to stop port forwarding.

Space Template Management
Examples
End-to-End Walkthrough
End-to-End Walkthrough Example
Standalone Examples
Cluster Management Example Notebooks
CLI Cluster Management Example
SDK Cluster Management Example
Training Example Notebooks
CLI Training Init Experience Example
CLI Training Example
SDK Training Example
Inference Example Notebooks
CLI
CLI Inference Jumpstart Model Init Experience Example
CLI Inference JumpStart Model Example
CLI Inference FSX Model Example
CLI Inference S3 Model Init Experience Example
CLI Inference S3 Model Example
SDK
SDK Inference JumpStart Model Example
SDK Inference FSX Model Example
SDK Inference S3 Model Example
Disclaimer
Working behind a proxy server?