Overview
Alerts in Osuite let you define conditions on your metrics and get notified the moment something crosses a threshold.
Creating an alert
Navigate to Alerts → New Alert in the Osuite UI.
Step 1: Define PromQL queries and expression
Write one or more PromQL queries, then an expression that compares them against a threshold. In this example, query A selects the latency metric and the expression triggers when it exceeds 10:

```promql
# Query A
http_server_request_duration_seconds{service="payment-service"}

# Expression
A > 10
```
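Threshold alerts on ratios take the same shape. The sketch below shows an error-rate alert; the metric names are illustrative and not taken from Osuite:

```promql
# Query A — hypothetical counter names
sum(rate(http_requests_errors_total{service="payment-service"}[5m]))
  /
sum(rate(http_requests_total{service="payment-service"}[5m]))

# Expression: trigger when the error ratio exceeds 1%
A > 0.01
```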
Step 2: Define severity & evaluation duration
Assign a severity to the alert so on-call engineers know how urgently to respond:
- Critical — Production is down or severely degraded
- Warning — Something is off, but not yet user-impacting
- Info — Informational threshold for awareness
Next, set the evaluation duration. The duration avoids false positives from momentary spikes: Osuite only fires when the condition is continuously true for the full window.
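One way to picture the evaluation-window rule — a minimal sketch of the idea, not Osuite's implementation, with invented names:

```python
from datetime import timedelta

def should_fire(evaluations, window):
    """Return True if the condition has been continuously true
    for at least `window`.

    `evaluations` is a chronological list of (timestamp, condition_met)
    pairs, one per evaluation interval."""
    breach_start = None
    for ts, met in evaluations:
        if met:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= window:
                return True
        else:
            breach_start = None  # any healthy sample resets the window
    return False
```

A 30-second spike never accumulates a full 5-minute window, so it never fires.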
Save the alert.
Step 3: Configure notifications
You can configure notification channels in Settings → General → Notifications.
| Channel | Description |
|---|---|
| Slack | Send a message to the configured Slack channel. |
| Opsgenie | Create an Opsgenie incident. |
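For Slack, the message that reaches the channel is ultimately an ordinary incoming-webhook payload. The sketch below assembles one; the function name and message format are invented for illustration and are not an Osuite API:

```python
def build_slack_payload(alert_name, state, severity, value):
    """Assemble the JSON body for a Slack incoming-webhook message.

    Generic Slack webhook shape ({"text": ...}); the emoji mapping
    and message format here are illustrative."""
    emoji = {
        "Critical": ":red_circle:",
        "Warning": ":warning:",
        "Info": ":information_source:",
    }
    return {
        "text": f"{emoji.get(severity, '')} [{severity}] "
                f"{alert_name} is {state} (value: {value})"
    }

# Delivering it is a single POST to the webhook URL, e.g.:
# requests.post(webhook_url, json=build_slack_payload(...))
```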
Managing alerts
Alert list
All your alerts are visible at Alerts → All Alerts with their current state:
| State | Meaning |
|---|---|
| Normal | Condition is not met — system is healthy |
| Firing | Condition is met, but not yet for the full evaluation window |
| Alerting | Condition has been continuously true for the full window; notifications are sent |
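The three states can be read as a simple function of the current evaluation — a sketch with invented names, not Osuite's internals:

```python
from datetime import timedelta

def alert_state(condition_met, met_for, window):
    """Classify an alert given the current evaluation.

    `met_for` is how long the condition has been continuously true;
    `window` is the configured evaluation duration."""
    if not condition_met:
        return "Normal"
    if met_for < window:
        return "Firing"
    return "Alerting"
```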
Muting alerts
You can mute any alert for a defined window — useful during planned maintenance, deployments, or load tests where you expect elevated error rates. Navigate to the alert and click Pause.
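Conceptually, a mute is just a time-window check applied before notifying — a minimal sketch with invented names (whether evaluation continues while muted is product-specific):

```python
from datetime import datetime

def is_muted(now, windows):
    """True if `now` falls inside any mute window.

    `windows` is a list of (start, end) datetime pairs, e.g. one
    per planned deployment or load test."""
    return any(start <= now < end for start, end in windows)
```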
Best practices
- Start with the error rate, then tune. The most valuable first alert for any service is an error rate threshold. Start at 1% and observe for a few days before tightening.
- Use the pending duration to avoid alert fatigue. A momentary error spike at 2% for 30 seconds is often not worth waking someone up. Requiring 5 consecutive minutes filters the noise without losing real incidents.
- Correlate alert thresholds with your SLOs. If your SLO is a 99.9% success rate, set a warning at a 0.5% error rate and a critical at 1%. This gives you time to respond before breaching the SLO.
- Name alerts clearly. Use names that describe the impact: "High error rate – payment-service" is more useful than `payment_service_errors_alert` when you're being paged at 3am.
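The SLO guidance above can be sanity-checked with a little arithmetic: a 99.9% success SLO leaves a 0.1% error budget, so sustained error rates of 0.5% and 1% burn that budget roughly 5× and 10× faster than budget-neutral. A small sketch (the helper name is invented):

```python
def burn_rate(error_rate, slo):
    """How many times faster than budget-neutral you are burning
    the error budget implied by an availability SLO."""
    budget = 1.0 - slo  # e.g. 99.9% SLO -> 0.1% error budget
    return error_rate / budget

# With a 99.9% SLO:
# burn_rate(0.005, 0.999) ≈ 5   (warning threshold)
# burn_rate(0.01, 0.999)  ≈ 10  (critical threshold)
```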