feat: integrate GKE-managed ML Diagnostics #5731
Summary of Changes
This pull request integrates native GKE-managed Machine Learning Diagnostics support into the gke-cluster module. It refactors the existing mldiagnostics module to maintain backward compatibility for older GKE versions while introducing strict validation to prevent conflicting installation methods. The update also migrates namespace labeling to the gke-cluster module and updates the gke-tpu-7x and gke-tpu-v6e blueprints to adopt the managed diagnostics approach, accompanied by refreshed documentation.
Code Review
This pull request moves Google Cloud ML Diagnostics from a manual Helm-based installation to a native GKE-managed feature in the gke-cluster module for GKE versions 1.35.0-gke.3065000 and higher. The manual module remains as an optional fallback for older GKE versions, with validation added to prevent both methods from being active at the same time. The review feedback covers four points:
- Fail fast with strict validation on unsupported GKE versions instead of failing silently.
- Add validation checks in the gke-cluster module to ensure the GKE version and Workload Identity requirements are met.
- Correct a variable name mismatch in the metadata validation error message.
- Remove the unused helm provider and its requirement from the gke-cluster module.
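The version gate and the mutual-exclusion check discussed in this review could be sketched in Terraform roughly as follows. This is a minimal illustration, not the code in this PR: the variable names `gke_version`, `managed_enabled`, and `helm_enabled` and the guard resource are hypothetical, and it assumes the version string always has the `MAJOR.MINOR.PATCH-gke.N` shape seen in `1.35.0-gke.3065000`.

```hcl
terraform {
  required_version = ">= 1.4" # terraform_data was added in Terraform 1.4
}

# Hypothetical inputs for illustration; the real module variables may differ.
variable "gke_version" {
  type    = string
  default = "1.35.0-gke.3065000"
}

variable "managed_enabled" {
  type    = bool
  default = true
}

variable "helm_enabled" {
  type    = bool
  default = false
}

locals {
  # Minimum version quoted in the review: 1.35.0-gke.3065000.
  min = [1, 35, 0, 3065000]

  # Parse "MAJOR.MINOR.PATCH-gke.N" into four numbers; regex() raises an
  # error if the string does not have this shape.
  ver = [
    for part in regex("^(\\d+)\\.(\\d+)\\.(\\d+)-gke\\.(\\d+)$", var.gke_version) :
    tonumber(part)
  ]

  # Compare component by component, most significant first.
  version_ok = (
    local.ver[0] != local.min[0] ? local.ver[0] > local.min[0] : (
      local.ver[1] != local.min[1] ? local.ver[1] > local.min[1] : (
        local.ver[2] != local.min[2] ? local.ver[2] > local.min[2] :
        local.ver[3] >= local.min[3]
      )
    )
  )
}

# A no-op resource whose only job is to carry the checks.
resource "terraform_data" "ml_diagnostics_guard" {
  lifecycle {
    precondition {
      condition     = !var.managed_enabled || local.version_ok
      error_message = "Managed ML Diagnostics requires GKE 1.35.0-gke.3065000 or later."
    }

    precondition {
      condition     = !(var.managed_enabled && var.helm_enabled)
      error_message = "Enable either the managed feature or the Helm-based module, not both."
    }
  }
}
```

Because both preconditions are evaluated during `terraform plan`, a misconfiguration surfaces as an error before any cluster change is attempted, which is the fail-fast behavior the review asks for.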
SwarnaBharathiMantena left a comment
LGTM! Please ensure at least one of the TPU 7x or TPU v6e tests runs successfully.
Test failures
Summary
This PR adds GKE-managed Machine Learning Diagnostics support to the `gke-cluster` module. It also refactors the `mldiagnostics` module (which installs the ML Diagnostics Helm charts) to support older GKE versions (< 1.35.0-gke.3065000). Validations ensure that only one installation method is used at a time, preventing duplicate or conflicting installations.
Changes Made
- […] `enable_ml_diagnostics` to the `gke-cluster` module to support GKE's `managed_machine_learning_diagnostics_config` for GKE version >= 1.35.0-gke.3065000.
- […] (`managed-mldiagnostics-gke: "true"`) to the `gke-cluster` module after namespace creation.
- […] `enable_ml_diagnostics` is set to true.
- […] `mldiagnostics` module to ensure only one ML Diagnostics installation method is used in the blueprint.
Documentation
Usage Example
To enable GKE managed ML Diagnostics, set `enable_managed_ml_diagnostics` to true in the `gke-cluster` module: