Conversation

@shubpal07 (Contributor) commented Oct 10, 2025

This PR introduces two major enhancements for Cloud TPU users: it makes the core GKE modules "TPU-aware" for automated job creation and adds a new blueprint for TPU v6e with GCS integration.

  1. Automated TPU Job Configuration:
    Problem: Deploying TPU jobs previously required manually editing manifests to set resource requests, tolerations, and GKE Warden node selectors.
    Solution: The gke-node-pool and gke-job-template modules have been refactored. A new internal tpu-definition module centralizes TPU logic, allowing gke-job-template to automatically generate complete and correct manifests for TPU workloads via the use: directive, eliminating all manual steps (see the blueprint sketch after this list).
  2. New GCS + TPU v6e Blueprint:
    A new gke-tpu-v6-gcs.yaml blueprint has been added.
    This production-ready example demonstrates best practices for TPU workloads, including dedicated service accounts (Workload Identity) and performance-tuned GCS FUSE mounts for training data and checkpoints, all provisioned automatically.
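
For orientation, here is a minimal blueprint sketch of how the use: directive wires the node pool into the job template. The module ids, settings, and machine type below are illustrative assumptions, not the actual contents of gke-tpu-v6-gcs.yaml:

```yaml
# Hypothetical sketch only -- ids and values are assumptions, not the
# actual gke-tpu-v6-gcs.yaml shipped in this PR.
blueprint_name: gke-tpu-v6-sketch

deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/vpc

  - id: gke_cluster
    source: modules/scheduler/gke-cluster
    use: [network]

  - id: tpu_pool
    source: modules/compute/gke-node-pool
    use: [gke_cluster]
    settings:
      machine_type: ct6e-standard-4t  # assumed TPU v6e machine type

  # Referencing the node pool via `use:` is what lets the refactored
  # gke-job-template derive TPU resource requests, tolerations, and
  # GKE Warden node selectors without manual manifest edits.
  # (Workload Identity service accounts and GCS FUSE mounts from the
  # real blueprint are omitted here for brevity.)
  - id: tpu_training_job
    source: modules/compute/gke-job-template
    use: [tpu_pool]
    settings:
      image: python:3.11-slim
      command: [python3, train.py]
```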

NOTE: An output name was changed as part of PR #4607 (refer to the CP). This caused a name mismatch for the node_count variable in the gke-job-template module. To resolve it, the node_count variable must be passed explicitly from the blueprint itself, as in the sketch below.
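
A minimal sketch of that workaround, reusing the hypothetical module ids from the blueprint sketch above; the exact value and placement depend on the blueprint:

```yaml
# Hypothetical sketch: pass node_count explicitly to the job template,
# since the output renamed in PR #4607 no longer wires it through.
- id: tpu_training_job
  source: modules/compute/gke-job-template
  use: [tpu_pool]
  settings:
    node_count: 2  # assumed value; must match the TPU node pool size
```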

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@shubpal07 self-assigned this on Oct 10, 2025
@shubpal07 force-pushed the shubham/tpuv6-storage branch from d5e6b96 to 69a34e4 on October 14, 2025 18:29
@shubpal07 changed the title from "Adding GCS storage support for TPU v6e" to "Add automated TPU support and GCS integration n TPU v6 blueprint" on Oct 14, 2025
@shubpal07 added the release-module-improvements label (added to release notes under the "Module Improvements" heading) on Oct 14, 2025
@shubpal07 force-pushed the shubham/tpuv6-storage branch from 69a34e4 to 8d97681 on October 14, 2025 18:53
@shubpal07 marked this pull request as ready for review on October 14, 2025 18:53
@shubpal07 requested review from a team and samskillman as code owners on October 14, 2025 18:53
@shubpal07 force-pushed the shubham/tpuv6-storage branch from 8d97681 to 7dff3ce on October 14, 2025 19:02
@shubpal07 changed the title from "Add automated TPU support and GCS integration n TPU v6 blueprint" to "Add automated TPU support and GCS integration in TPU v6 blueprint" on Oct 14, 2025
@shubpal07 force-pushed the shubham/tpuv6-storage branch from a666bcc to 2d57698 on October 15, 2025 10:03
@SwarnaBharathiMantena (Contributor) left a comment:

LGTM

  • Making gke-node-pool module, gke-job-template module TPU compliant
  • Adding new tpu v6 blueprint with advanced storage options
@shubpal07 (Contributor, Author) commented:
Checks passed. The failing tests were not related to this PR.

@SwarnaBharathiMantena (Contributor) left a comment:

LGTM

@shubpal07 merged commit 2d5b02d into GoogleCloudPlatform:develop on Oct 17, 2025
10 of 64 checks passed