Checkpointless training in Amazon SageMaker HyperPod

Checkpointless training on Amazon SageMaker HyperPod enables faster recovery from training infrastructure faults. The following documentation helps you get started with checkpointless training and fine-tuning for NeMo-supported models.

Checkpointless training has the following prerequisites:

Getting started with Amazon EKS support in SageMaker HyperPod
Installing the training operator. You must install version 1.2.0 or later.
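
The version requirement above can be checked programmatically before launching a job. The following is a minimal sketch, assuming you have already obtained the installed operator version as a string by whatever means your cluster exposes it. The helper names `parse_version` and `meets_minimum` are hypothetical and are not part of any SageMaker HyperPod tooling.

```python
# Hypothetical helpers for comparing a training operator version string
# against the documented minimum (1.2.0). Not part of any AWS SDK or CLI.

MINIMUM_OPERATOR_VERSION = "1.2.0"


def parse_version(text: str) -> tuple[int, ...]:
    """Turn 'v1.2.0', '1.2.0', or '1.2' into a comparable tuple such as (1, 2, 0).

    Assumes plain numeric versions; a suffix such as '-rc1' raises ValueError.
    """
    parts = [int(part) for part in text.strip().lstrip("vV").split(".")]
    # Pad short versions so that '1.2' compares equal to '1.2.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MINIMUM_OPERATOR_VERSION) -> bool:
    """Return True when the installed version is the minimum or newer."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    for candidate in ("v1.1.9", "1.2.0", "1.10.0"):
        status = "ok" if meets_minimum(candidate) else "too old"
        print(f"{candidate}: {status}")
```

Comparing integer tuples rather than raw strings avoids the common mistake of ranking "1.10.0" below "1.2.0".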

Checkpointless training on SageMaker HyperPod is built on top of the NVIDIA NeMo framework. You can run checkpointless training with pre-created SageMaker HyperPod recipes. If you're familiar with NeMo, the process of using the checkpointless training recipes is similar. With minor changes, you can start training a model using checkpointless training features that enable you to recover quickly from training faults.

The following HyperPod recipes are pre-configured with checkpointless training optimizations. You can specify your data paths as part of the recipe and use the associated launch script to run training (see the quick start guide below):

Model	Method	Size	Nodes	Instance	Accelerator	Recipe	Script
GPT OSS	Full fine-tune example	120b	16	p5.48xlarge	GPU H100	link	link
GPT OSS	LoRA example	120b	2	p5.48xlarge	GPU H100	link	link
Llama3	Pretrain example	70b	16	p5.48xlarge	GPU H100	link	link
Llama3	LoRA example	70b	2	p5.48xlarge	GPU H100	link	link
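
For capacity planning, the node counts in the table translate into total GPU counts. The sketch below assumes 8 H100 GPUs per p5.48xlarge instance, which is the published instance specification rather than something stated on this page. The recipe tuples simply restate the table rows, and `total_gpus` is a hypothetical helper.

```python
# Rough capacity arithmetic for the recipes listed in the table above.
# Assumption: each p5.48xlarge instance provides 8 NVIDIA H100 GPUs.

GPUS_PER_P5_48XLARGE = 8

# (model, method, size, nodes) restated from the recipe table.
RECIPES = [
    ("GPT OSS", "Full fine-tune", "120b", 16),
    ("GPT OSS", "LoRA", "120b", 2),
    ("Llama3", "Pretrain", "70b", 16),
    ("Llama3", "LoRA", "70b", 2),
]


def total_gpus(nodes: int, gpus_per_node: int = GPUS_PER_P5_48XLARGE) -> int:
    """Total accelerators a recipe occupies across all of its nodes."""
    return nodes * gpus_per_node


if __name__ == "__main__":
    for model, method, size, nodes in RECIPES:
        print(f"{model} {size} ({method}): {nodes} nodes, {total_gpus(nodes)} GPUs")
```

Under this assumption, the 16-node recipes occupy 128 GPUs and the 2-node LoRA recipes occupy 16.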

The following quick-start guide provides tutorials for using the checkpointless training recipes:

Getting started examples

You can also use checkpointless training with your own custom models. To get started, see the checkpointless training GitHub page.

Note

We collect certain routine aggregated and anonymized operational metrics to provide essential service availability. The creation of these metrics is fully automated and does not involve human review of the underlying model training workload. These metrics relate to job operations, resource management, and essential service functionality.
