This repository was archived by the owner on Sep 30, 2024. It is now read-only.

sg/msp: support for super-simple alerts on custom metrics#62885

Merged: jac merged 5 commits into main from jac/core-16 on May 24, 2024

@jac (Member) commented May 23, 2024

Closes CORE-16

Currently, service owners need to either rely on our default alerting or use Terraform to add custom alerts. With this PR, service owners can specify simple custom alerts using MQL or PromQL in the service file, e.g.:
(ignore bad alerts; just for demonstration purposes)

monitoring:
  alerts:
    customAlerts:
      - name: Billable Time MQL
        severityLevel: WARNING
        description: "Check if mean billable time is below 30 seconds"
        condition:
          query: >
            fetch cloud_run_revision
            | metric 'run.googleapis.com/container/billable_instance_time'
            | filter resource.project_id == 'msp-testbed-test-77589aae45d0'
            | group_by 30m,
                [value_billable_instance_time_mean: mean(value.billable_instance_time)]
            | every 30m
            | group_by [resource.revision_name],
                [value_billable_instance_time_mean_min:
                  min(value_billable_instance_time_mean)]
            | condition val() < 30 's'
          type: mql
      - name: CloudSQL Bytes Used over 1000000000
        severityLevel: CRITICAL
        description: a lotta bytes
        condition:
          type: promql
          query: sum(avg_over_time(cloudsql_googleapis_com:database_disk_bytes_used{monitored_resource="cloudsql_database"}[30m])) > 100000000
          durationMinutes: 2
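
To make the shape of the example config concrete, here is a minimal Go sketch of types that could back a `customAlerts` entry, with basic validation. This is an illustration under stated assumptions, not the code from this PR: the type and field names are hypothetical (only `spec.CustomAlertCondition` is named in this conversation), the `yaml` struct tags are plain annotations with no YAML library involved, and the accepted `severityLevel` values are limited to the two that appear in the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ConditionType mirrors the lowercase `type` values used in the example config.
type ConditionType string

const (
	ConditionTypeMQL    ConditionType = "mql"
	ConditionTypePromQL ConditionType = "promql"
)

// CustomAlertCondition is a hypothetical shape for the `condition` block.
type CustomAlertCondition struct {
	Type            ConditionType `yaml:"type"`
	Query           string        `yaml:"query"`
	DurationMinutes uint          `yaml:"durationMinutes,omitempty"` // 0 means unset in this sketch
}

// CustomAlert is a hypothetical shape for one `customAlerts` entry.
type CustomAlert struct {
	Name          string               `yaml:"name"`
	SeverityLevel string               `yaml:"severityLevel"`
	Description   string               `yaml:"description"`
	Condition     CustomAlertCondition `yaml:"condition"`
}

// Validate collects every problem it finds instead of stopping at the first one.
func (a CustomAlert) Validate() error {
	var errs []error
	if strings.TrimSpace(a.Name) == "" {
		errs = append(errs, errors.New("name is required"))
	}
	// Assumption: only the two severities shown in the example are accepted.
	switch a.SeverityLevel {
	case "WARNING", "CRITICAL":
	default:
		errs = append(errs, fmt.Errorf("unsupported severityLevel %q", a.SeverityLevel))
	}
	switch a.Condition.Type {
	case ConditionTypeMQL, ConditionTypePromQL:
	default:
		errs = append(errs, fmt.Errorf("unsupported condition type %q", a.Condition.Type))
	}
	if strings.TrimSpace(a.Condition.Query) == "" {
		errs = append(errs, errors.New("condition.query is required"))
	}
	// errors.Join returns nil when no errors were collected.
	return errors.Join(errs...)
}

func main() {
	alert := CustomAlert{
		Name:          "CloudSQL bytes used",
		SeverityLevel: "CRITICAL",
		Description:   "a lotta bytes",
		Condition: CustomAlertCondition{
			Type:            ConditionTypePromQL,
			Query:           `sum(avg_over_time(cloudsql_googleapis_com:database_disk_bytes_used{monitored_resource="cloudsql_database"}[30m])) > 100000000`,
			DurationMinutes: 2,
		},
	}
	fmt.Println("valid alert:", alert.Validate())

	// An uppercase type is rejected because the values are lowercase.
	alert.Condition.Type = "MQL"
	fmt.Println("uppercase type:", alert.Validate())
}
```

Collecting all validation errors in one pass is a design choice for this sketch: a service owner editing the service file would see every problem at once rather than fixing them one at a time.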

Test plan

Generated alerts using the service file config and viewed them in the MSP alert dashboard.
[Screenshot: 2024-05-23 at 20:29:42]

@jac jac requested a review from a team May 23, 2024 19:52
@cla-bot cla-bot bot added the cla-signed label May 23, 2024
@unknwon (Contributor) left a comment:

🚀

- renamed alert types & lowercased values
- renamed severity level
- renamed duration to durationMinutes; changed to uint

@bobheadxi (Member) left a comment:

Nice 😁 Thank you!

@jac jac merged commit cb71a2d into main May 24, 2024
@jac jac deleted the jac/core-16 branch May 24, 2024 19:47
bobheadxi added a commit that referenced this pull request Jun 3, 2024
Follow-ups for #62885:

- Better docstrings for `mql`, `promql`
- `duration` -> `durationMinutes` to align with other config
- `alertpolicy.ResponseCodeMetric` -> `spec.CustomAlertCondition`: they're effectively the same type

Test plan: CI