feat: integrate GKE-managed ML Diagnostics#5731

Merged
AdarshK15 merged 4 commits into GoogleCloudPlatform:develop from AdarshK15:gke-mldiagon on Jun 7, 2026

Conversation

@AdarshK15 AdarshK15 commented Jun 1, 2026

Member

Summary

This PR integrates GKE-managed Machine Learning Diagnostics support into the gke-cluster module. It also refactors the mldiagnostics module (which installs the ML Diagnostics Helm charts) so that it continues to support older GKE versions (< 1.35.0-gke.3065000). Validations are added to ensure that only one installation method is used at a time, preventing duplicate or conflicting installations.

Changes Made

  • Added the boolean variable enable_ml_diagnostics to the gke-cluster module to support GKE's managed_machine_learning_diagnostics_config on GKE versions >= 1.35.0-gke.3065000.
  • Moved workload namespace labelling (managed-mldiagnostics-gke: "true") into the gke-cluster module, applied after namespace creation.
  • Added a preflight metadata validation to ensure Workload Identity is enabled when enable_ml_diagnostics is set to true.
  • Added a validation in the mldiagnostics module to ensure only one ML Diagnostics installation method is used in the blueprint.
  • Updated both the gke-tpu-7x and gke-tpu-v6e blueprints to use GKE-managed ML Diagnostics.
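
As a rough illustration of the namespace labelling item above, the labelled workload namespace might end up looking like the manifest below. This is only a sketch: the namespace name my-workloads is a hypothetical stand-in for whatever $(vars.user_namespace) resolves to, and the PR applies the label from the gke-cluster module rather than from a hand-written manifest.

```yaml
# Sketch only: the workload namespace after the gke-cluster module labels it.
# "my-workloads" is a placeholder name, not taken from the PR.
apiVersion: v1
kind: Namespace
metadata:
  name: my-workloads
  labels:
    # Label key and value as quoted in the change list above.
    managed-mldiagnostics-gke: "true"
```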

Documentation

Usage Example

To enable GKE-managed ML Diagnostics, set enable_managed_ml_diagnostics to true in the gke-cluster module:

  - id: gke-tpu-v6e-cluster
    source: modules/scheduler/gke-cluster
    settings:
      configure_workload_identity_sa: true
      namespace: $(vars.user_namespace)
      enable_managed_ml_diagnostics: true
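
For clusters older than 1.35.0-gke.3065000, the summary says the Helm-based mldiagnostics module remains the installation path. A blueprint fragment for that case might look roughly like the sketch below. The module id mldiagnostics and the use wiring are assumptions for illustration, and the flag is assumed to default to false when omitted; only the two source paths come from this PR. Setting enable_managed_ml_diagnostics to true alongside this module is the kind of combination the new validation is described as rejecting.

```yaml
  # Sketch only: fallback for GKE versions older than 1.35.0-gke.3065000.
  - id: gke-tpu-v6e-cluster
    source: modules/scheduler/gke-cluster
    settings:
      configure_workload_identity_sa: true
      namespace: $(vars.user_namespace)
      # enable_managed_ml_diagnostics is deliberately left unset here
      # (assumed to default to false), so the managed path stays off.

  # Hypothetical module id; the source path is the module directory touched by this PR.
  - id: mldiagnostics
    source: modules/management/mldiagnostics
    use: [gke-tpu-v6e-cluster]  # assumption: connected to the cluster via `use`
```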

@AdarshK15 AdarshK15 added the release-key-new-features label (Added to release notes under the "Key New Features" heading.) Jun 1, 2026
@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates native GKE-managed Machine Learning Diagnostics support into the gke-cluster module. It refactors the existing mldiagnostics module to maintain backward compatibility for older GKE versions while introducing strict validation to prevent conflicting installation methods. The update also migrates namespace labeling to the gke-cluster module and updates the gke-tpu-7x and gke-tpu-v6e blueprints to adopt the managed diagnostics approach, accompanied by refreshed documentation.

Highlights

  • GKE-managed ML Diagnostics Integration: Added the enable_ml_diagnostics boolean variable to the gke-cluster module to support native GKE-managed ML Diagnostics for supported versions.
  • Mutual Exclusion Validation: Implemented a precondition in the mldiagnostics module to prevent concurrent use of managed and manual installation methods, ensuring configuration consistency.
  • Blueprint Updates: Updated gke-tpu-7x and gke-tpu-v6e blueprints to leverage the new managed diagnostics feature and removed the manual mldiagnostics module dependency.
  • Namespace Management: Refactored workload namespace labeling logic, moving it directly into the gke-cluster module to streamline cluster configuration.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request transitions Google Cloud ML Diagnostics from a manual Helm-based installation to a native GKE-managed feature in the gke-cluster module for GKE versions 1.35.0-gke.3065000 and higher. The manual module remains as an optional fallback for older GKE versions, with validation added to prevent both methods from being active simultaneously. Feedback on the changes highlights several key improvement opportunities: preventing silent failures on unsupported GKE versions by using strict validation to fail fast, adding robust validation checks in the gke-cluster module to ensure GKE version and Workload Identity requirements are met, correcting a variable name mismatch in the metadata validation error message, and removing the unused helm provider and its requirement from the gke-cluster module.

Comment thread modules/scheduler/gke-cluster/main.tf
Comment thread modules/scheduler/gke-cluster/main.tf
Comment thread modules/scheduler/gke-cluster/metadata.yaml Outdated
Comment thread modules/scheduler/gke-cluster/providers.tf Outdated
Comment thread modules/scheduler/gke-cluster/versions.tf Outdated
Comment thread modules/management/mldiagnostics/variables.tf Outdated
annuay-google previously approved these changes Jun 1, 2026
@AdarshK15 AdarshK15 marked this pull request as ready for review June 3, 2026 11:13
@AdarshK15 AdarshK15 requested a review from a team as a code owner June 3, 2026 11:13
Comment thread examples/gke-tpu-7x/README.md Outdated
LAVEEN previously approved these changes Jun 3, 2026

@LAVEEN LAVEEN left a comment

Collaborator

LGTM

Comment thread modules/scheduler/gke-cluster/variables.tf Outdated

@SwarnaBharathiMantena SwarnaBharathiMantena left a comment

Contributor

LGTM! Please ensure that at least one of the TPU 7x or TPU v6e tests runs successfully.

@AdarshK15

Member Author

Test failures

  • Test PR-test-gke-tpu-v6e failed due to a known timeout issue at the Run CHS suites task; the ML Diagnostics related tasks completed successfully.
  • Other GKE tests passed.

@AdarshK15 AdarshK15 changed the title from "feat: integrate GKE-managed ML Diagnostics and validate mutual exclusion" to "feat: integrate GKE-managed ML Diagnostics" Jun 7, 2026
@AdarshK15 AdarshK15 merged commit 57ae585 into GoogleCloudPlatform:develop Jun 7, 2026
38 of 88 checks passed
@AdarshK15 AdarshK15 deleted the gke-mldiagon branch June 7, 2026 15:49
mikhailpovolotskiy pushed a commit to mikhailpovolotskiy/cluster-toolkit that referenced this pull request Jun 14, 2026
ksaishree pushed a commit to ksaishree/cluster-toolkit that referenced this pull request Jun 18, 2026

Labels

release-key-new-features Added to release notes under the "Key New Features" heading.

4 participants