# Cluster Management

TensorPool makes it easy to deploy and manage GPU clusters of any size, from single GPUs to large multi-node configurations.

## Core Commands

- `tp cluster create` - Deploy a new GPU cluster
- `tp cluster list` - View all your clusters
- `tp cluster info <cluster_id>` - Get detailed information about a cluster
- `tp cluster edit <cluster_id>` - Edit cluster settings
- `tp cluster destroy <cluster_id>` - Terminate a cluster
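These commands compose into a simple lifecycle. The sketch below uses a placeholder `<cluster_id>` and omits create options (covered under Creating Clusters):

```shell
# Deploy a cluster, inspect it, and eventually tear it down.
# The cluster ID is returned by `tp cluster create` and
# shown by `tp cluster list`.
tp cluster create
tp cluster list
tp cluster info <cluster_id>
tp cluster destroy <cluster_id>
```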
## Creating Clusters
Deploy GPU clusters with simple commands. TensorPool supports both single-node and multi-node cluster configurations.

### Single-Node Clusters
Single-node clusters are ideal for development, experimentation, and smaller training workloads. They provide direct access to GPU resources without the complexity of distributed training.

#### Supported Instance Types
Single-node clusters support a wide variety of GPU configurations; see the instance types reference for the full list.

#### Accessing Single-Node Clusters
Single-node clusters provide direct SSH access. Once your cluster is ready, use `tp cluster info <cluster_id>` to retrieve its connection details.

### Multi-Node Clusters
Multi-node clusters are designed for distributed training workloads that require scaling across multiple machines. All multi-node clusters come with SLURM preinstalled for job scheduling and resource management.

#### Supported Instance Types
Multi-node support is currently available for:

- 8xH200 - 2 or more nodes, each with 8 H200 GPUs
- 8xB200 - 2 or more nodes, each with 8 B200 GPUs
#### Creating Multi-Node Clusters
Create multi-node clusters by specifying the number of nodes with the `-n` flag:
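For example, a two-node cluster (the `-n` flag comes from this guide; any other create options, such as instance type selection, are not shown here and are documented in the CLI reference):

```shell
# Create a multi-node cluster with 2 nodes.
# Additional create options (e.g. instance type) are omitted.
tp cluster create -n 2
```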
Multi-node support is currently available for 8xH200 and 8xB200 instance types only.
#### Accessing Multi-Node Clusters
All multi-node clusters come with SLURM preinstalled and configured. For detailed information about using SLURM for distributed training, see the Multi-Node Training Guide.

#### Cluster Architecture
Multi-node clusters use a jumphost architecture for network access and consist of:

- Jumphost: `{cluster_id}-jumphost` - The SLURM login/controller node with a public IP address
- Worker Nodes: `{cluster_id}-0`, `{cluster_id}-1`, etc. - Compute nodes with private IP addresses only
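Because only the jumphost is publicly reachable, OpenSSH's standard `-J` (ProxyJump) option can reach a worker in a single command. The `<user>` placeholder is an assumption; use the username and hostnames reported by `tp cluster info`:

```shell
# Hop through the jumphost to a private worker node in one command.
# <user> and <cluster_id> are placeholders, not TensorPool-defined values.
ssh -J <user>@<cluster_id>-jumphost <user>@<cluster_id>-0
```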
#### Accessing Your Cluster
Follow these steps to access your multi-node cluster:

1. Get cluster information with `tp cluster info <cluster_id>` to see all nodes and their instance IDs.
2. SSH into the jumphost (this is the only node with direct public access).
3. From the jumphost, access worker nodes using either the instance name or the private IP.
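Concretely, the steps above might look like the following. The username, IP addresses, and `<cluster_id>` are placeholders; the real values come from the `tp cluster info` output:

```shell
# Step 1: list all nodes and their instance IDs
tp cluster info <cluster_id>

# Step 2: SSH into the jumphost, the only node with a public IP
ssh <user>@<jumphost_public_ip>

# Step 3: from the jumphost, hop to a worker by instance name
# (or substitute the worker's private IP)
ssh <cluster_id>-0
```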
## Cluster and Instance Statuses
A cluster’s status is derived from the statuses of its individual instances. Each instance within a cluster progresses through its own lifecycle, and the cluster’s displayed status reflects the highest-priority status among all its instances.

### Instance Status Lifecycle
Each instance in a cluster follows this lifecycle:

PENDING → PROVISIONING → CONFIGURING → RUNNING → DESTROYING → DESTROYED

An instance that encounters a system-level problem moves to FAILED.

### Status Definitions
| Status | Description |
|---|---|
| PENDING | Instance creation request has been submitted and is being queued for provisioning. |
| PROVISIONING | Instance has been allocated and is being provisioned. |
| CONFIGURING | Instance is being configured with software, drivers, networking, and storage. |
| RUNNING | Instance is ready for use. |
| DESTROYING | Instance shutdown in progress, resources are being deallocated. |
| DESTROYED | Instance has been successfully terminated. |
| FAILED | System-level problem (e.g., hardware failure, no capacity). |
### Cluster Status Priority
A cluster’s status is determined by the highest-priority status among its instances. Priority order (highest to lowest):

1. FAILED - Any failed instance causes the cluster to show as failed
2. DESTROYING - Cluster is being torn down
3. PENDING - Instances are waiting to be provisioned
4. PROVISIONING - Instances are being provisioned
5. CONFIGURING - Instances are being configured
6. RUNNING - All instances are running
7. DESTROYED - All instances have been terminated
For example, if a cluster’s instances are all RUNNING except 1 that is CONFIGURING, the cluster status will show as CONFIGURING.
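The priority rule can be sketched as a small shell function. This is illustrative only, not TensorPool’s actual implementation: scan statuses from highest to lowest priority and report the first one that any instance holds.

```shell
# Derive a cluster status from its instance statuses (illustrative sketch).
# Usage: cluster_status STATUS [STATUS...]
cluster_status() {
  for s in FAILED DESTROYING PENDING PROVISIONING CONFIGURING RUNNING DESTROYED; do
    for inst in "$@"; do
      if [ "$inst" = "$s" ]; then
        echo "$s"
        return 0
      fi
    done
  done
}

cluster_status RUNNING RUNNING CONFIGURING   # prints CONFIGURING
cluster_status RUNNING FAILED RUNNING        # prints FAILED
```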
## Next Steps
- Explore the available instance types
- Learn about NFS storage for persistent data
- Read the CLI reference for detailed command options