SageMaker HyperPod command-line interface
The Amazon SageMaker HyperPod command-line interface (HyperPod CLI) is a tool that helps manage clusters, training jobs, and inference endpoints on the SageMaker HyperPod clusters orchestrated by Amazon EKS.
This documentation serves as a reference for the available HyperPod CLI commands. For a comprehensive user guide, see Orchestrating SageMaker HyperPod clusters with Amazon EKS in the Amazon SageMaker Developer Guide.
Note: The old hyperpod CLI V2 has been moved to the release_v2 branch. Please refer to the release_v2 branch for usage.
Table of Contents
Overview
The SageMaker HyperPod CLI helps you create training jobs and deploy inference endpoints on Amazon SageMaker HyperPod clusters orchestrated by Amazon EKS. It provides commands for managing the full job lifecycle (create, describe, list, and delete) and for accessing pod and operator logs where applicable. The CLI abstracts away the complexity of working directly with Kubernetes for these core job-management actions.
Prerequisites
Region Configuration
Important: For commands that accept the --region option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration.
Prerequisites for Training
Prerequisites for Inference
Platform Support
The SageMaker HyperPod CLI currently supports Linux and macOS. Windows is not currently supported.
ML Framework Support
The SageMaker HyperPod CLI currently supports starting training jobs with:
Installation
Make sure that your local Python version is 3.8, 3.9, 3.10, or 3.11.
Install the sagemaker-hyperpod-cli package.
Verify that the installation succeeded by running the following command.
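Separately from the installation steps, the Python version requirement stated in the first step can be checked programmatically before installing. The following is a minimal sketch that assumes only the version list given above; the helper name is hypothetical and is not part of the HyperPod CLI.

```python
import sys

# (major, minor) versions listed in the installation step above.
SUPPORTED_VERSIONS = {(3, 8), (3, 9), (3, 10), (3, 11)}


def is_supported_python(version=None):
    """Return True if the (major, minor) part of `version` is supported.

    `version` may be any sequence starting with (major, minor), such as
    sys.version_info. Hypothetical helper, for illustration only.
    """
    if version is None:
        version = sys.version_info
    return (version[0], version[1]) in SUPPORTED_VERSIONS


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    status = "supported" if is_supported_python() else "not supported"
    print(f"Python {major}.{minor} is {status} according to the list above")
```

Running the script with the interpreter you plan to install into reports whether that interpreter falls inside the documented range.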
Usage
The HyperPod CLI provides the following commands:
Getting Started
Getting Cluster information
This command lists the available SageMaker HyperPod clusters and their capacity information.
--region <region>
--namespace <namespace>
--output <json|table>: Accepted values are table and json. The default value is json.
--debug
Connecting to a Cluster
This command configures the local kubectl environment to interact with the specified SageMaker HyperPod cluster and namespace.
--cluster-name <cluster-name>
--namespace <namespace>
--region <region>
--debug
Getting Cluster Context
Get all the context related to the currently set cluster.
--debug
CLI
Cluster Management
Important: For commands that accept the --region option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration.
Cluster stack names must be unique within each AWS region. If you attempt to create a cluster stack with a name that already exists in the same region, the deployment will fail.
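The per-region uniqueness rule can be illustrated with a small pre-flight check. This is only a sketch of the rule as stated above: the function name and the sample data are hypothetical, and in practice the existing stack names would come from listing the stacks in the target region rather than from a hard-coded list.

```python
def ensure_unique_stack_name(new_name, region, existing_stacks):
    """Raise ValueError if `new_name` is already used in `region`.

    `existing_stacks` is an iterable of (stack_name, region) pairs.
    Hypothetical helper: it only models the rule that names must be
    unique within a region, not how the CLI enforces it.
    """
    if (new_name, region) in set(existing_stacks):
        raise ValueError(
            f"Stack name {new_name!r} already exists in region {region!r}"
        )
    return new_name


# Sample data (made up for illustration).
existing = [("demo-stack", "us-west-2"), ("demo-stack", "us-east-1")]

# The same name in a different region is allowed.
ensure_unique_stack_name("demo-stack", "eu-west-1", existing)

# The same name in the same region is rejected.
try:
    ensure_unique_stack_name("demo-stack", "us-west-2", existing)
    conflict_detected = False
except ValueError:
    conflict_detected = True
```

Keying the check on the (name, region) pair rather than on the name alone reflects that the same stack name may be reused across regions.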
Initialize Cluster Configuration
Initialize a new cluster configuration in the current directory:
Important: The resource_name_prefix parameter in the generated config.yaml file serves as the primary identifier for all AWS resources created during deployment. Each deployment must use a unique resource name prefix to avoid conflicts. A unique identifier is automatically appended to this prefix during cluster creation to ensure resource uniqueness.
Configure Cluster Parameters
Configure cluster parameters interactively or via command line:
Validate Configuration
Validate the configuration file syntax:
Create Cluster Stack
Create the cluster stack using the configured parameters:
Note: The region flag is optional. If not provided, the command will use the default region from your AWS credentials configuration.
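The fallback described in this note amounts to a simple precedence rule: an explicit region wins, otherwise the configured default is used. A minimal sketch follows, assuming hypothetical names; it is not the CLI's actual implementation, and the error raised when neither value is available is an assumption of this sketch rather than documented behavior.

```python
from typing import Optional


def resolve_region(explicit_region: Optional[str],
                   default_region: Optional[str]) -> str:
    """Pick the region a command would use under the rule described above.

    `explicit_region` stands for the value of the region flag (None when
    the flag is omitted). `default_region` stands for the default region
    from the AWS credentials configuration. Both names are hypothetical.
    """
    if explicit_region:
        return explicit_region
    if default_region:
        return default_region
    # Assumption: with no flag and no configured default there is
    # nothing to fall back to, so this sketch raises an error.
    raise ValueError("No region provided and no default region configured")
```

For example, `resolve_region(None, "us-west-2")` falls back to the configured default, while `resolve_region("us-east-1", "us-west-2")` uses the explicit value.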
List Cluster Stacks
--region <region>
--status "['CREATE_COMPLETE', 'UPDATE_COMPLETE']"
--debug
Describe Cluster Stack
--region <region>
--debug
Delete Cluster Stack
Delete a HyperPod cluster stack. Removes the specified CloudFormation stack and all associated AWS resources. This operation cannot be undone.
--region <region>
--retain-resources S3Bucket-TrainingData,EFSFileSystem-Models (resource names can be listed with aws cloudformation list-stack-resources --stack-name STACK_NAME --region REGION)
--debug
Update Existing Cluster
Reset Configuration
Reset configuration to default values:
Training
Option 1: Create PyTorch job through init experience
Initialize PyTorch Job Configuration
Initialize a new PyTorch job configuration in the current directory:
Configure PyTorch Job Parameters
Configure PyTorch job parameters interactively or via command line:
Validate Configuration
Validate the configuration file syntax:
Create PyTorch Job
Create the PyTorch job using the configured parameters:
Option 2: Create PyTorch job through create command
Example with accelerator partitions:
--job-name
--image
--namespace
--command
--args
--environment
--pull-policy
--instance-type
--node-count
--tasks-per-node
--label-selector
--deep-health-check-passed-nodes-only
--scheduler-type
--queue-name
--priority
--max-retry
--volume
--service-account-name
--accelerators
--vcpu
--memory
--accelerators-limit
--vcpu-limit
--memory-limit
--accelerator-partition-type
--accelerator-partition-count
--accelerator-partition-limit
--preferred-topology
--required-topology
--max-node-count
--elastic-replica-increment-step
--elastic-graceful-shutdown-timeout-in-seconds
--elastic-scaling-timeout-in-seconds
--elastic-scale-up-snooze-time-in-seconds
--elastic-replica-discrete-values
--debug
List Available Accelerator Partition Types
This command lists the available accelerator partition types on the cluster for a specific instance type.
List Training Jobs
Describe a Training Job
Listing Pods
This command lists all the pods associated with a specific training job.
job-name (string) - Required. The name of the job to list pods for.
Accessing Logs
This command retrieves the logs for a specific pod within a training job.
--job-name
--pod-name
--namespace
--container
Get Operator Logs
Delete a Training Job
Recipe Job
Use hyp-recipe-job to submit fine-tuning and evaluation jobs using pre-built recipes from SageMaker JumpStart Hub; no YAML authoring is required.
Initialize Recipe Job Configuration
Supported job types: SFT, DPO, CPT, PPO, RLAIF, RLVR, deterministic, LLMAJ
Configure Recipe Job Parameters
Validate Configuration
Reset Configuration
To reset config.yaml back to its default values:
Submit Recipe Job
List Recipe Jobs
Describe a Recipe Job
List Pods for a Recipe Job
Get Logs from a Recipe Job Pod
Get Operator Logs
Delete a Recipe Job
Inference
JumpStart Endpoint Creation
Option 1: Create JumpStart endpoint through init experience
Initialize JumpStart Endpoint Configuration
Initialize a new JumpStart endpoint configuration in the current directory:
Configure JumpStart Endpoint Parameters
Configure JumpStart endpoint parameters interactively or via command line:
Validate Configuration
Validate the configuration file syntax:
Create JumpStart Endpoint
Create the JumpStart endpoint using the configured parameters:
Option 2: Create JumpStart endpoint through create command
Pre-trained JumpStart models can be found at https://sagemaker.readthedocs.io/en/v2.82.0/doc_utils/jumpstart.html and passed to the call that creates the endpoint.
--model-id
--instance-type
--namespace
--metadata-name
--accept-eula
--model-version
--endpoint-name
--tls-certificate-output-s3-uri
--debug
--version
--accelerator-partition-type
--accelerator-partition-validation
--replicas
--max-deploy-time-in-seconds
--execution-role
--env '{"KEY":"value"}'
--metrics-enabled
--metrics-scrape-interval-seconds
--model-metrics-path
--model-metrics-port
--additional-configs
--gated-model-download-role
--model-hub-name
--intelligent-routing-enabled
--routing-strategy
--enable-l1-cache
--enable-l2-cache
--l2-cache-backend
--l2-cache-local-url
--cache-config-file
--load-balancer-health-check-path
--load-balancer-routing-algorithm
--custom-certificate-acm-arn
--custom-certificate-domain-name
--auto-scaling-spec
--dns-hosted-zone-id
--data-capture
Invoke a JumpstartModel Endpoint
Managing an Endpoint
List Pods
Get Logs
Get Operator Logs
Deleting an Endpoint
Custom Endpoint Creation
Option 1: Create custom endpoint through init experience
Initialize Custom Endpoint Configuration
Initialize a new custom endpoint configuration in the current directory:
Configure Custom Endpoint Parameters
Configure custom endpoint parameters interactively or via command line:
Validate Configuration
Validate the configuration file syntax:
Create Custom Endpoint
Create the custom endpoint using the configured parameters:
Option 2: Create custom endpoint through create command
--model-name
--model-source-type
--image-uri
--container-port
--model-volume-mount-name
--namespace
--metadata-name
--endpoint-name
--version
--instance-type
--instance-types
--env '{"KEY":"value"}'
--metrics-enabled
--metrics-scrape-interval-seconds
--model-metrics-path
--model-metrics-port
--model-version
--model-location
--prefetch-enabled
--tls-certificate-output-s3-uri
--fsx-dns-name
--fsx-file-system-id
--fsx-mount-name
--s3-bucket-name
--s3-region
--huggingface-model-id
--huggingface-commit-sha
--huggingface-token-secret-name
--huggingface-token-secret-key
--model-volume-mount-path
--resources-limits '{"nvidia.com/gpu":"1"}'
--resources-requests '{"cpu":"1","memory":"2Gi"}'
--replicas
--initial-replica-count
--max-deploy-time-in-seconds
--worker-args
--worker-command
--working-dir
--invocation-endpoint
--intelligent-routing-enabled
--routing-strategy
--enable-l1-cache
--enable-l2-cache
--l2-cache-backend
--l2-cache-local-url
--cache-config-file
--load-balancer-health-check-path
--load-balancer-routing-algorithm
--max-concurrent-requests
--max-queue-size
--overflow-status-code
--custom-certificate-acm-arn
--custom-certificate-domain-name
--kubernetes
--node-affinity
--tags
--probes
--auto-scaling-spec
--dns-hosted-zone-id
--data-capture
--dimensions
--metric-collection-period
--metric-collection-start-time
--metric-name
--metric-stat
--metric-type
--min-value
--cloud-watch-trigger-name
--cloud-watch-trigger-namespace
--target-value
--use-cached-metrics
--debug
Invoke a Custom Inference Endpoint
Managing an Endpoint
List Pods
Get Logs
Get Operator Logs
Deleting an Endpoint
Space
Create a Space
--name
--display-name
--namespace
--image
--desired-status
--ownership-type
--node-selector
--affinity
--tolerations
--lifecycle
--app-type
--service-account-name
--idle-shutdown
--template-ref
--container-config
--storage
--volume
--accelerator-partition-count
--accelerator-partition-type
--gpu-limit
--gpu
--memory-limit
--memory
--cpu-limit
--cpu
List Spaces
Describe a Space
Update a Space
Start/Stop a Space
Get Logs
Delete a Space
Port Forward to a Space
Port forward to access a space from your local machine:
Access the space via http://localhost:<local-port> after port forwarding is established. Press Ctrl+C to stop port forwarding.
Space Template Management
Create reusable space templates:
Space Access
Create remote access to spaces. The --connection-type option accepts web-ui or any {ide}-remote pattern (e.g. vscode-remote, kiro-remote, cursor-remote):
SDK
Along with the CLI, SDKs are available that provide the same cluster management, training, and inference functionality.
Cluster Management SDK
Creating a Cluster Stack
Listing Cluster Stacks
Describing a Cluster Stack
Monitoring Cluster Status
Deleting a Cluster Stack
Training SDK
Creating a Training Job
List Training Jobs
Describe a Training Job
List Pods for a Training Job
Get Logs from a Pod
Get Training Operator Logs
Delete a Training Job
Inference SDK
Creating a JumpstartModel Endpoint
Pre-trained JumpStart models can be found at https://sagemaker.readthedocs.io/en/v2.82.0/doc_utils/jumpstart.html and passed to the call that creates the endpoint.
Creating a Custom Inference Endpoint (with S3)
List Endpoints
Describe an Endpoint
Invoke an Endpoint
List Pods
Get Logs
Get Operator Logs
Delete an Endpoint
Observability - Getting Monitoring Information
Space SDK
Creating a Space
List Spaces
Get a Space
Update a Space
Start/Stop a Space
Get Space Logs
List Space Pods
Create Space Access
Delete a Space
Port Forward to a Space
Access the space via http://localhost:<local-port> after port forwarding is established. Press Ctrl+C to stop port forwarding.
Space Template Management
Examples
This repository provides both a full end-to-end example walkthrough of using the CLI for real-world training and inference workloads as well as standalone example notebooks for individual features.
End-to-End Walkthrough
End-to-End Walkthrough Example
Standalone Examples
Cluster Management Example Notebooks
CLI Cluster Management Example
SDK Cluster Management Example
Training Example Notebooks
CLI Training Init Experience Example
CLI Training Example
SDK Training Example
Inference Example Notebooks
CLI
CLI Inference Jumpstart Model Init Experience Example
CLI Inference JumpStart Model Example
CLI Inference FSX Model Example
CLI Inference S3 Model Init Experience Example
CLI Inference S3 Model Example
SDK
SDK Inference JumpStart Model Example
SDK Inference FSX Model Example
SDK Inference S3 Model Example
Disclaimer
Working behind a proxy server?