Probes Documentation
Everything you need to know about how Probes monitors your sites and APIs, evaluates alert conditions, manages incidents, and notifies your team.
Probes
A probe is a monitored HTTP endpoint. When you add a URL, Probes begins checking it from every active region simultaneously at a regular interval. Each check performs a full HTTP request using Go's net/http/httptrace package, capturing granular timing data for the entire connection lifecycle.
Every region fires independently, so a single probe round produces one result per region. This multi-region approach lets you distinguish between a localized network issue and a global outage.
Timing Breakdown
Each probe result records six timing values, five sequential phases plus the overall total:
- Name Lookup — DNS resolution time. How long it takes to resolve the hostname to an IP address.
- Connect — TCP handshake. The time from SYN to SYN-ACK, reflecting raw network distance.
- TLS Handshake — TLS session negotiation after TCP connects. Affected by certificate chain depth, cipher selection, and OCSP stapling.
- TTFB (Time to First Byte) — The wait between sending the HTTP request and receiving the first byte of the response. This is your server's processing time.
- Response — Time to download the full response body after the first byte arrives.
- Total — The complete round trip from DNS lookup through response body. This is what users actually experience.
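As a rough illustration of how these timings can be collected with net/http/httptrace, here is a minimal sketch. It is not Probes' actual agent code: the Timings struct, measure, and runDemo are hypothetical names, and the demo targets a local httptest server, so the DNS and TLS fields stay at zero in that run.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"time"
)

// Timings mirrors the six values listed above. Field names are illustrative.
type Timings struct {
	NameLookup, Connect, TLSHandshake, TTFB, Response, Total time.Duration
}

// measure performs one GET and fills Timings from httptrace callbacks.
// A production version would guard the shared variables with a mutex,
// because the hooks can run on different goroutines.
func measure(url string) (Timings, error) {
	var tm Timings
	var dnsStart, connStart, tlsStart, wroteReq, firstByte time.Time

	trace := &httptrace.ClientTrace{
		DNSStart:          func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
		DNSDone:           func(httptrace.DNSDoneInfo) { tm.NameLookup = time.Since(dnsStart) },
		ConnectStart:      func(_, _ string) { connStart = time.Now() },
		ConnectDone:       func(_, _ string, _ error) { tm.Connect = time.Since(connStart) },
		TLSHandshakeStart: func() { tlsStart = time.Now() },
		TLSHandshakeDone:  func(tls.ConnectionState, error) { tm.TLSHandshake = time.Since(tlsStart) },
		WroteRequest:      func(httptrace.WroteRequestInfo) { wroteReq = time.Now() },
		GotFirstResponseByte: func() {
			firstByte = time.Now()
			tm.TTFB = firstByte.Sub(wroteReq)
		},
	}

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return tm, err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	// A fresh transport forces a new connection, so the connect phase is measured.
	client := &http.Client{
		Transport: &http.Transport{DisableKeepAlives: true},
		Timeout:   10 * time.Second,
	}
	start := time.Now()
	resp, err := client.Do(req)
	if err != nil {
		return tm, err
	}
	defer resp.Body.Close()
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return tm, err
	}
	end := time.Now()
	tm.Response = end.Sub(firstByte)
	tm.Total = end.Sub(start)
	return tm, nil
}

// runDemo measures a throwaway local server so the example needs no network.
func runDemo() (Timings, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()
	return measure(srv.URL)
}

func main() {
	got, err := runDemo()
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Printf("%+v\n", got)
}
```

Pointing measure at an HTTPS URL with a hostname would also populate the name lookup and TLS handshake fields.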
Check Interval
Probes fire once per minute. Every minute, the scheduler dispatches a request to each region's probe agent, which performs the HTTP check and reports results back to the central server. Each batch of checks shares a round ID so results can be correlated across regions.
Alerts
Alerts define the conditions under which you want to be notified. Each alert is attached to a single probe and specifies a type (what to watch for), a threshold (when to trigger), and a channel (how to notify you). You can have multiple alerts per probe with different types and thresholds.
Downtime Alerts
Downtime alerts fire when your endpoint is unreachable from a majority of probe regions for a sustained period.
Region voting: A single region failing does not immediately trigger an alert. Instead, Probes uses majority voting — at least half of the active regions (rounded up) must report a failure in the same round for it to count as a failing round. This prevents false alarms from transient network issues in one region.
Consecutive rounds: You configure how many consecutive failing rounds are required before the alert triggers. The default is 2 rounds. With a 1-minute check interval, that means at least 2 minutes of sustained majority-region failure before you are notified.
Example: If you have 3 regions active and set consecutive rounds to 2, the alert triggers when at least 2 of 3 regions fail for 2 rounds in a row.
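The voting rule and the consecutive-round requirement can be sketched as follows. This is an illustration only, assuming each round is represented as a slice of per-region failure flags; the function names are hypothetical and not part of Probes.

```go
package main

import "fmt"

// requiredFailures returns half of the active regions, rounded up.
func requiredFailures(active int) int {
	return (active + 1) / 2
}

// roundFailed reports whether enough regions failed in a single round.
// Each element of results is true when that region's check failed.
func roundFailed(results []bool) bool {
	failed := 0
	for _, f := range results {
		if f {
			failed++
		}
	}
	return len(results) > 0 && failed >= requiredFailures(len(results))
}

// downtimeTriggered reports whether the most recent `consecutive` rounds
// were all failing rounds.
func downtimeTriggered(rounds [][]bool, consecutive int) bool {
	if consecutive <= 0 || len(rounds) < consecutive {
		return false
	}
	for _, r := range rounds[len(rounds)-consecutive:] {
		if !roundFailed(r) {
			return false
		}
	}
	return true
}

func main() {
	rounds := [][]bool{
		{false, false, true}, // 1 of 3 regions failing: not a failing round
		{true, true, false},  // 2 of 3 regions failing
		{true, false, true},  // 2 of 3 regions failing
	}
	fmt.Println(downtimeTriggered(rounds, 2))
}
```

Note that with an even number of regions, "half, rounded up" means exactly half is enough: 2 failing regions out of 4 count as a failing round.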
Latency Alerts (P95)
Latency alerts fire when the 95th percentile response time exceeds a threshold over a sliding time window.
Window: You specify the window size in minutes (default: 5). Probes collects all results from all regions within that window and calculates the P95 value.
Threshold: When the calculated P95 exceeds your configured threshold (in milliseconds), the window is considered breaching. The alert does not fire on the first breaching window — it requires 2 consecutive breaching evaluations to open an incident, reducing noise from brief spikes.
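For a concrete sense of the window evaluation, here is a minimal sketch. The documentation does not say which percentile method Probes uses, so the nearest-rank method below is an assumption, as are the function names.

```go
package main

import (
	"fmt"
	"sort"
)

// p95 returns the 95th percentile of samples (milliseconds) using the
// nearest-rank method (an assumption). It returns 0 for an empty window.
func p95(samples []float64) float64 {
	n := len(samples)
	if n == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...) // copy so the caller's slice is untouched
	sort.Float64s(sorted)
	rank := (95*n + 99) / 100 // ceil(0.95 * n) in integer arithmetic
	return sorted[rank-1]
}

// breaching reports whether the window's P95 exceeds thresholdMs.
func breaching(samples []float64, thresholdMs float64) bool {
	return p95(samples) > thresholdMs
}

func main() {
	// Total response times in ms from all regions within one window.
	window := []float64{120, 135, 110, 900, 140, 125, 130, 115, 145, 150}
	fmt.Println(p95(window), breaching(window, 800))
}
```

With only ten samples the nearest-rank P95 is simply the slowest result, which is one reason a wider window or more regions gives a steadier signal.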
Apdex Score Alerts
Apdex (Application Performance Index) distills response times into a single satisfaction score between 0 and 1. Probes calculates Apdex using industry-standard thresholds:
- Satisfied — response time ≤ T (your probe's configured threshold, default 500ms)
- Tolerating — response time between T and 4×T
- Frustrated — response time > 4×T, or the request failed entirely
The score is calculated as:
Apdex = (Satisfied + Tolerating / 2) / Total requests
A score of 1.0 means every request was satisfied. A score of 0.0 means every request was frustrated.
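A small worked example may help. The sketch below assumes the standard Apdex formula, (satisfied + tolerating / 2) / total, with the categories defined above; the Sample type and the handling of an empty window are assumptions made for illustration.

```go
package main

import "fmt"

// Sample is one probe result (hypothetical shape).
type Sample struct {
	Ms     float64 // total response time in milliseconds
	Failed bool    // true when the request failed entirely
}

// apdex computes the score for a window with threshold tMs.
func apdex(samples []Sample, tMs float64) float64 {
	if len(samples) == 0 {
		return 1 // assumption: an empty window counts as fully satisfied
	}
	var satisfied, tolerating int
	for _, s := range samples {
		switch {
		case s.Failed:
			// frustrated: contributes nothing to the score
		case s.Ms <= tMs:
			satisfied++
		case s.Ms <= 4*tMs:
			tolerating++
		}
	}
	return (float64(satisfied) + float64(tolerating)/2) / float64(len(samples))
}

func main() {
	window := []Sample{
		{Ms: 100}, {Ms: 400}, // satisfied (<= 500 ms)
		{Ms: 600}, {Ms: 1900}, // tolerating (<= 2000 ms)
		{Ms: 2500}, {Failed: true}, // frustrated
	}
	// (2 + 2/2) / 6 = 0.5
	fmt.Println(apdex(window, 500))
}
```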
Window and threshold: Like latency alerts, Apdex alerts use a sliding time window (default: 5 minutes). You set a minimum score (e.g., 0.70). When the Apdex score drops below your threshold for 2 consecutive evaluations, the alert fires.
Notification Channels
When an alert fires (or resolves), Probes sends a notification through the channel you configured on the alert.
Email
Email notifications are sent to the address you specify as the alert destination. The email includes the probe name, URL, alert type, current status, a summary of the condition, and a link to the incident in the Probes dashboard.
Webhooks
Webhook notifications send an HTTP POST request with a JSON body to the URL you specify. This lets you integrate Probes with Slack, PagerDuty, Discord, or any system that accepts incoming webhooks.
The JSON body includes the alert type, probe name, status (triggered or resolved), detail text, and timestamp. Example payload:
{
  "alert_type": "DOWNTIME",
  "probe_name": "My API",
  "status": "triggered",
  "detail": "Majority of regions failing for 2 consecutive rounds",
  "started_at": "2026-06-15T12:00:00Z"
}
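On the receiving side, a payload like this could be decoded as sketched below. The Payload type and parsePayload are hypothetical names; the field tags follow the example payload, and real payloads may carry fields not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Payload mirrors the fields of the example webhook body.
type Payload struct {
	AlertType string    `json:"alert_type"`
	ProbeName string    `json:"probe_name"`
	Status    string    `json:"status"` // "triggered" or "resolved"
	Detail    string    `json:"detail"`
	StartedAt time.Time `json:"started_at"` // RFC 3339 timestamp
}

// parsePayload decodes a raw webhook body.
func parsePayload(body []byte) (Payload, error) {
	var p Payload
	err := json.Unmarshal(body, &p)
	return p, err
}

func main() {
	body := []byte(`{
		"alert_type": "DOWNTIME",
		"probe_name": "My API",
		"status": "triggered",
		"detail": "Majority of regions failing for 2 consecutive rounds",
		"started_at": "2026-06-15T12:00:00Z"
	}`)
	p, err := parsePayload(body)
	if err != nil {
		fmt.Println("bad payload:", err)
		return
	}
	fmt.Println(p.ProbeName, p.Status, p.StartedAt.Format(time.RFC3339))
}
```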
Webhook Signing
To verify that a webhook request genuinely came from Probes, you can configure a signing secret on the alert. When a secret is set, every webhook request includes an X-Probes-Signature header containing an HMAC-SHA256 signature of the request body.
The header format is:
X-Probes-Signature: sha256=<hex-encoded HMAC-SHA256>
To verify the signature on your end:
- Read the raw request body
- Compute HMAC-SHA256(your_secret, body)
- Hex-encode the result
- Compare it to the value after sha256= in the header using a constant-time comparison
Always use a constant-time comparison function (e.g., hmac.Equal in Go, crypto.timingSafeEqual in Node.js) to prevent timing attacks.
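The verification steps can be sketched in Go as follows. The sign and verify helpers are hypothetical names; sign exists only so the example can produce a header to check. This version compares the decoded bytes rather than the hex strings, which is equivalent and still constant-time.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

const prefix = "sha256="

// sign builds the header value for a body (what the sender would do).
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return prefix + hex.EncodeToString(mac.Sum(nil))
}

// verify checks an X-Probes-Signature header value against the raw body.
func verify(secret, body []byte, header string) bool {
	if !strings.HasPrefix(header, prefix) {
		return false
	}
	got, err := hex.DecodeString(strings.TrimPrefix(header, prefix))
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil)) // constant-time comparison
}

func main() {
	secret := []byte("example-secret")
	body := []byte(`{"status":"triggered"}`)
	header := sign(secret, body)

	fmt.Println(verify(secret, body, header))                         // genuine request
	fmt.Println(verify(secret, []byte(`{"status":"resolved"}`), header)) // tampered body
}
```

In an HTTP handler, read the body once with io.ReadAll and verify those exact bytes before decoding the JSON, since re-serializing the payload can change the bytes and break the signature.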
Retry Behavior
Webhook delivery uses 3 attempts with exponential backoff:
- Attempt 1 — immediate
- Attempt 2 — after 1 second
- Attempt 3 — after 5 seconds
A request is considered successful if the endpoint returns an HTTP 2xx status code. If all 3 attempts fail, the notification is marked as failed and the error is recorded in the alert event log.
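The retry schedule can be modeled roughly as below. This sketch assumes each delay is measured from the end of the previous attempt, which the schedule above does not spell out; deliver and probesDelays are hypothetical names, and send stands in for the actual HTTP POST.

```go
package main

import (
	"fmt"
	"time"
)

// probesDelays is the wait before each attempt: immediate, 1 s, 5 s.
var probesDelays = []time.Duration{0, time.Second, 5 * time.Second}

// deliver calls send once per entry in delays, sleeping for that delay first.
// send returns the HTTP status code of the webhook response.
// It returns the number of attempts made and an error if none returned 2xx.
func deliver(send func() (int, error), delays []time.Duration) (int, error) {
	var lastErr error
	for i, d := range delays {
		time.Sleep(d)
		status, err := send()
		if err == nil && status >= 200 && status < 300 {
			return i + 1, nil
		}
		if err == nil {
			err = fmt.Errorf("unexpected status %d", status)
		}
		lastErr = err
	}
	return len(delays), fmt.Errorf("all %d attempts failed: %w", len(delays), lastErr)
}

func main() {
	calls := 0
	// Simulated endpoint: fails once with 503, then accepts the webhook.
	send := func() (int, error) {
		calls++
		if calls == 1 {
			return 503, nil
		}
		return 200, nil
	}
	attempts, err := deliver(send, probesDelays)
	fmt.Println(attempts, err)
}
```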
Incidents
An incident is an open record that tracks a period of degraded performance or downtime for a probe. Incidents are created and resolved automatically by the alert engine.
Lifecycle
An incident moves through two states:
- Opened — Created when an alert's threshold condition is met (after debounce). A notification is sent via the alert's configured channel. The incident records which alert triggered it and the cause (downtime, latency, or apdex).
- Resolved — Closed when the condition clears (after debounce). A resolution notification is sent. The incident records the resolution timestamp.
Only one incident per cause can be open for a given probe at a time. If the same condition triggers while an incident is already open, no duplicate is created.
Debounce
To avoid noisy alerting from brief fluctuations, the alert engine requires 2 consecutive evaluations in the same direction before changing incident state:
- Opening: The condition must breach for 2 consecutive evaluation cycles before an incident is created.
- Resolving: The condition must recover for 2 consecutive evaluation cycles before the incident is closed.
If the condition flips back before reaching 2 consecutive cycles, the counter resets and no state change occurs.
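The debounce rule amounts to a small state machine. The following sketch is one way to express it, assuming a single counter that tracks consecutive evaluations disagreeing with the current incident state; the Debouncer type is a hypothetical name.

```go
package main

import "fmt"

// requiredStreak is the number of consecutive evaluations needed to change state.
const requiredStreak = 2

// Debouncer tracks whether an incident is open for one alert.
type Debouncer struct {
	Open   bool // true while an incident is open
	streak int  // consecutive evaluations that disagree with Open
}

// Observe feeds one evaluation result and reports whether the incident
// state changed (opened or resolved) as a result.
func (d *Debouncer) Observe(breaching bool) bool {
	if breaching == d.Open {
		d.streak = 0 // condition flipped back: reset the counter
		return false
	}
	d.streak++
	if d.streak < requiredStreak {
		return false
	}
	d.Open = breaching
	d.streak = 0
	return true
}

func main() {
	var d Debouncer
	// breach, breach (opens), recover, breach (resets), recover, recover (resolves)
	for _, breaching := range []bool{true, true, false, true, false, false} {
		changed := d.Observe(breaching)
		fmt.Printf("breaching=%v open=%v changed=%v\n", breaching, d.Open, changed)
	}
}
```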
Flap Detection and Auto-Disabling
If a probe sits right at a threshold boundary, it can rapidly cycle between triggered and resolved states. Probes detects this flapping pattern and automatically disables the alert to prevent notification fatigue.
How it works: After every incident resolution, Probes counts how many times the alert has triggered in the last 30 minutes. If there have been 5 or more trigger events in that window, the alert is automatically disabled and you receive a notification explaining why.
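The check that runs after each resolution can be sketched like this. The constants come from the description above; shouldAutoDisable and the parsing helpers are hypothetical names used only for this illustration, and whether a trigger exactly 30 minutes old counts is an assumption (here it does not).

```go
package main

import (
	"fmt"
	"time"
)

const (
	flapWindow    = 30 * time.Minute
	flapThreshold = 5
)

// shouldAutoDisable reports whether the alert triggered at least
// flapThreshold times within flapWindow before now.
func shouldAutoDisable(triggers []time.Time, now time.Time) bool {
	cutoff := now.Add(-flapWindow)
	recent := 0
	for _, ts := range triggers {
		if ts.After(cutoff) && !ts.After(now) {
			recent++
		}
	}
	return recent >= flapThreshold
}

// mustParse converts an RFC 3339 string to a time, panicking on bad input.
func mustParse(s string) time.Time {
	ts, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return ts
}

// parseAll converts several RFC 3339 strings at once.
func parseAll(stamps ...string) []time.Time {
	out := make([]time.Time, 0, len(stamps))
	for _, s := range stamps {
		out = append(out, mustParse(s))
	}
	return out
}

func main() {
	now := mustParse("2026-06-15T12:30:00Z")
	triggers := parseAll(
		"2026-06-15T12:05:00Z", "2026-06-15T12:10:00Z", "2026-06-15T12:15:00Z",
		"2026-06-15T12:20:00Z", "2026-06-15T12:25:00Z",
	)
	fmt.Println(shouldAutoDisable(triggers, now))
}
```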
When an alert is auto-disabled, you can re-enable it from the alerts tab on the probe detail page. Consider adjusting the threshold to avoid the boundary condition that caused flapping.
Status Pages
Any probe can optionally have a public status page. When enabled, Probes generates a publicly accessible page that displays the probe's current status, recent uptime percentage, and active incidents.
Status pages are useful for giving your users visibility into your service's health without exposing your internal monitoring dashboard. You can enable a status page from the probe's settings and share the URL with your users.