The Investigation Agent lives in your AI IDE (Cursor, Antigravity, or any other IDE that supports agent skills). When something breaks in production, you describe the problem in plain language and the agent investigates across your entire stack: logs, traces, individual spans, Kubernetes events, and frontend session (RUM) events. It correlates the signals, pinpoints the root cause, and proposes a fix, right where you already work.
This is the “during the incident” half of Osuite: find the root cause and a fix in minutes, not hours.
Why it lives in your IDE
A traditional investigation means alt-tabbing between dashboards, copying trace IDs by hand, and reconstructing a timeline in your head. The Investigation Agent does that correlation for you and surfaces the answer in the same window where you’ll write the fix. You stay in your editor; the agent does the digging.
The agent has the full context of your system through Osuite’s correlation engine, so a single failure deep in a microservice graph can be traced from the frontend click that triggered it all the way to the database query behind it.
Prerequisites
- An Osuite Cloud account and an API key (Settings → API Keys)
- The Osuite CLI installed and the AI skills set up (see Steps 1 and 2 below)
Step 1: Install and authenticate the CLI
The agent uses the osuite CLI under the hood to query your telemetry. Install it and point it at your account:
npm install @osuite/cli -g
Add your credentials to your shell profile:
export OSUITE_API_ENDPOINT=https://api.<region>.osuite.io
export OSUITE_API_KEY=<your api key>
Run osuite whoami to confirm you’re authenticated.
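If osuite whoami fails, a common cause is a variable that is unset or still holds the template placeholder. The following is a minimal preflight sketch, not part of the Osuite CLI: the function name check_osuite_env is hypothetical, and the only facts it relies on are the two variable names and the https:// endpoint form shown above.

```python
import os
from urllib.parse import urlparse


def check_osuite_env(env):
    """Return a list of problems with the two variables from Step 1 (empty if none)."""
    problems = []
    endpoint = env.get("OSUITE_API_ENDPOINT", "")
    api_key = env.get("OSUITE_API_KEY", "")

    if not endpoint:
        problems.append("OSUITE_API_ENDPOINT is not set")
    elif "<" in endpoint or ">" in endpoint:
        # The docs use <region> as a placeholder; it must be replaced with a real value.
        problems.append("OSUITE_API_ENDPOINT still contains a <placeholder>")
    elif urlparse(endpoint).scheme != "https" or not urlparse(endpoint).hostname:
        problems.append("OSUITE_API_ENDPOINT should be an https:// URL")

    if not api_key:
        problems.append("OSUITE_API_KEY is not set")
    elif "<" in api_key or ">" in api_key:
        problems.append("OSUITE_API_KEY still contains a <placeholder>")

    return problems


if __name__ == "__main__":
    # Check the current shell environment and print one warning per problem.
    for problem in check_osuite_env(os.environ):
        print("warning:", problem)
```

This only catches obvious setup mistakes locally; osuite whoami remains the real test of whether the credentials are accepted.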
Step 2: Install the AI skills
osuite init-ai
This installs the Osuite agent skills globally so your AI IDE can invoke them — including the Investigation Agent (osuite-investigate).
Step 3: Investigate
In your AI IDE, trigger the agent with a plain-language description of what’s wrong:
/osuite-investigate the checkout API is throwing 500s since the last deploy
You don’t need to name a service, a trace ID, or a time window — though the more context you give, the faster the agent converges. It will:
- Form a hypothesis from your description and state it before querying anything.
- Pick a starting signal — logs for errors, traces for latency, session events for a frontend report, Kubernetes events for a deploy or infra issue.
- Pivot across sources using traceId as the universal join key: from a log line to its trace, from a trace to the logs emitted during it, or from a session event to the backend trace it triggered. Kubernetes events carry no trace ID, so the agent matches an app error to the Kubernetes event (restart, OOMKill) that overlaps it in time.
- Conclude with evidence: the specific service, the time window, and at least one trace ID or log line backing the finding.
- Propose a fix — if the root cause is in the repo you’re working in, the agent edits the code or config. If it belongs elsewhere, it writes a structured bug report you can hand to the owning team.
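The pivot step in the list above can be pictured as a join on traceId. The sketch below is illustrative only and does not show how the agent is implemented: it uses small in-memory records with hypothetical field names (service, level, message, name, durationMs) rather than the real Osuite schema or API, and groups the logs and spans that share a trace ID, starting from error logs.

```python
from collections import defaultdict

# Hypothetical records; field names and values are assumptions, not the Osuite schema.
logs = [
    {"traceId": "t1", "service": "payments-svc", "level": "error", "message": "upstream timeout"},
    {"traceId": "t2", "service": "cart-svc", "level": "info", "message": "cart loaded"},
    {"traceId": "t1", "service": "stripe-proxy", "level": "error", "message": "connection refused"},
]
spans = [
    {"traceId": "t1", "service": "payments-svc", "name": "POST /checkout", "durationMs": 9800},
    {"traceId": "t1", "service": "stripe-proxy", "name": "charge", "durationMs": 9750},
    {"traceId": "t2", "service": "cart-svc", "name": "GET /cart", "durationMs": 40},
]


def index_by_trace(records):
    """Group records by traceId so one signal can be looked up from another."""
    index = defaultdict(list)
    for record in records:
        index[record["traceId"]].append(record)
    return index


def pivot_from_error_logs(logs, spans):
    """Start from error logs and collect every log and span sharing their traceId."""
    logs_by_trace = index_by_trace(logs)
    spans_by_trace = index_by_trace(spans)
    error_traces = sorted({log["traceId"] for log in logs if log["level"] == "error"})
    return {
        trace_id: {"logs": logs_by_trace[trace_id], "spans": spans_by_trace[trace_id]}
        for trace_id in error_traces
    }


related = pivot_from_error_logs(logs, spans)
# The slowest span in the failing trace is a natural place to look next.
slowest = max(related["t1"]["spans"], key=lambda s: s["durationMs"])
print(slowest["service"], slowest["durationMs"])
```

Indexing each source by traceId once keeps every later pivot a dictionary lookup, which is why a shared key matters more than which signal the investigation starts from.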
The signals it correlates
| Signal | What it tells the agent |
|---|---|
| Logs | Backend events, errors, exceptions, app-level messages |
| Traces | End-to-end request flow, latency breakdown, service-to-service calls |
| Spans | Individual units of work — e.g. a slow database call, filtered by attribute |
| Kubernetes events | Pod restarts, OOMKills, scheduling failures, deploy-time events |
| Session events (RUM) | Frontend signals from the browser, linked back to backend traces |
traceId links backend logs, traces, spans, and (often) session events. Kubernetes events have no trace ID, so the agent correlates them by time window and namespace.
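Because Kubernetes events cannot be joined on a trace ID, the match has to come from time and namespace. The following is a minimal sketch of that idea under stated assumptions: the record fields, the example date, and the five-minute window are all hypothetical choices for illustration, not values the agent is known to use.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical records; field names, dates, and values are assumptions for illustration.
app_error = {
    "namespace": "payments",
    "timestamp": datetime(2024, 1, 1, 14, 3, tzinfo=timezone.utc),
    "message": "connection refused",
}
k8s_events = [
    {"namespace": "payments", "reason": "BackOff",
     "timestamp": datetime(2024, 1, 1, 14, 1, tzinfo=timezone.utc)},
    {"namespace": "payments", "reason": "Scheduled",
     "timestamp": datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)},
    {"namespace": "search", "reason": "OOMKilled",
     "timestamp": datetime(2024, 1, 1, 14, 2, tzinfo=timezone.utc)},
]


def events_near(error, events, window=timedelta(minutes=5)):
    """Return events in the error's namespace within `window` of its timestamp."""
    return [
        event for event in events
        if event["namespace"] == error["namespace"]
        and abs(event["timestamp"] - error["timestamp"]) <= window
    ]


matches = events_near(app_error, k8s_events)
print([event["reason"] for event in matches])
```

A time-window match is weaker evidence than a shared trace ID, since unrelated events can overlap by coincidence, so the window size trades recall against false matches.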
What a finding looks like
The agent doesn’t stop at “it’s probably the database.” A finding is concrete and evidence-backed:
The 500s on POST /checkout are coming from payments-svc between 14:02–14:11 UTC. Trace abc123def shows a 9.8s timeout calling stripe-proxy, which restarted at 14:01 per the Kubernetes event pod/stripe-proxy-7d… BackOff. The log line at 14:03 in stripe-proxy confirms connection refused to upstream. Suggested fix: add a connection retry with backoff in stripe-proxy’s client, and raise the readiness-probe delay so traffic isn’t routed before the pod is ready.
When the fix belongs to another team, the agent writes a bug-report-<timestamp>.md with the symptom, evidence, a reconstructed timeline, the suspected cause, and the full trace saved as JSON — everything the owning team needs to reproduce it.
How it relates to the other agents
- The Instrumentation Agent gets your telemetry flowing in the first place — the Investigation Agent needs that data to work.
- The Visualizer Agent turns findings into dashboards you can keep an eye on.
Next steps
- Instrumentation Agent — wire up your services so the Investigation Agent has signal to work with
- APM & Distributed Tracing — the trace data the agent pivots through
- Log Management — how logs are correlated to traces via trace_id