Investigation Agent | Osuite Docs

The Investigation Agent lives in your AI IDE (Cursor, Antigravity, or any other IDE that supports agent skills). When something breaks in production, you describe the problem in plain language and the agent investigates across your entire stack: logs, traces, individual spans, Kubernetes events, and frontend session (RUM) events. It correlates the signals, pinpoints the root cause, and proposes a fix, right where you already work.

This is the “during the incident” half of Osuite: find the root cause and a fix in minutes, not hours.

Why it lives in your IDE

A traditional investigation means alt-tabbing between dashboards, copying trace IDs by hand, and reconstructing a timeline in your head. The Investigation Agent does that correlation for you and surfaces the answer in the same window where you’ll write the fix. You stay in your editor; the agent does the digging.

The agent has the full context of your system through Osuite’s correlation engine, so a single failure deep in a microservice graph can be traced from the frontend click that triggered it all the way to the database query behind it.

Prerequisites

An Osuite Cloud account and an API key (Settings → API Keys)
The Osuite CLI installed (Step 1 below) and the AI skills set up (Step 2 below)

Step 1: Authenticate the CLI

The agent uses the osuite CLI under the hood to query your telemetry. Install it and point it at your account:

npm install -g @osuite/cli

Add your credentials to your shell profile:

export OSUITE_API_ENDPOINT=https://api.<region>.osuite.io
export OSUITE_API_KEY=<your api key>

Reload your shell profile (or open a new terminal) so the variables take effect, then run osuite whoami to confirm you’re authenticated.
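If osuite whoami fails, a common cause is a variable that is unset or still holds a placeholder. The following is a minimal sketch of such a check in Python. It is not part of the Osuite CLI: the helper name check_osuite_env and its specific rules (for example, expecting an https:// endpoint, based on the form shown above) are assumptions for illustration.

```python
import os


def check_osuite_env(env=None):
    """Return a list of problems with the Osuite CLI environment variables.

    Hypothetical helper, not part of the Osuite CLI. An empty list means
    both variables from this step look plausibly configured.
    """
    env = os.environ if env is None else env
    problems = []

    endpoint = env.get("OSUITE_API_ENDPOINT", "")
    if not endpoint:
        problems.append("OSUITE_API_ENDPOINT is not set")
    elif "<" in endpoint or ">" in endpoint:
        # The <region> placeholder was copied without being replaced.
        problems.append("OSUITE_API_ENDPOINT still contains a placeholder")
    elif not endpoint.startswith("https://"):
        # Assumption: endpoints follow the https:// form shown in this step.
        problems.append("OSUITE_API_ENDPOINT should start with https://")

    api_key = env.get("OSUITE_API_KEY", "")
    if not api_key:
        problems.append("OSUITE_API_KEY is not set")
    elif "<" in api_key or ">" in api_key:
        problems.append("OSUITE_API_KEY still contains a placeholder")

    return problems
```

Calling check_osuite_env() with no arguments inspects the current process environment; passing a dict makes the check easy to try without touching your shell.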

Step 2: Install the AI skills

osuite init-ai

This installs the Osuite agent skills globally so your AI IDE can invoke them — including the Investigation Agent (osuite-investigate).

Step 3: Investigate

In your AI IDE, trigger the agent with a plain-language description of what’s wrong:

/osuite-investigate the checkout API is throwing 500s since the last deploy

You don’t need to name a service, a trace ID, or a time window — though the more context you give, the faster the agent converges. It will:

Form a hypothesis from your description and state it before querying anything.
Pick a starting signal — logs for errors, traces for latency, session events for a frontend report, Kubernetes events for a deploy or infra issue.
Pivot across sources, using traceId as the join key wherever one exists: from a log line to its trace, from a trace to the logs emitted during it, and from a session event to the backend trace it triggered. Kubernetes events carry no trace ID, so the agent matches an app error to the Kubernetes event (restart, OOMKill) that overlaps it by time window and namespace.
Conclude with evidence — the specific service, the time window, and at least one trace ID or log line backing the finding.
Propose a fix — if the root cause is in the repo you’re working in, the agent edits the code or config. If it belongs elsewhere, it writes a structured bug report you can hand to the owning team.
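The pivot step above amounts to a join on traceId. The sketch below shows the idea on in-memory records. It does not call the Osuite API, and the record shapes and every field name other than traceId (service, message, duration_ms, event) are assumptions for illustration.

```python
def with_trace_id(records, trace_id):
    """Return the records (dicts) whose traceId matches; others are skipped."""
    return [r for r in records if r.get("traceId") == trace_id]


def pivot(trace_id, logs, spans, session_events):
    """Collect everything that shares one traceId across the three sources."""
    return {
        "logs": with_trace_id(logs, trace_id),
        "spans": with_trace_id(spans, trace_id),
        "session_events": with_trace_id(session_events, trace_id),
    }


# Hypothetical sample data, loosely modelled on the checkout example.
logs = [
    {"traceId": "abc123def", "service": "payments-svc",
     "message": "timeout calling stripe-proxy"},
    {"traceId": "other", "service": "cart-svc", "message": "ok"},
]
spans = [{"traceId": "abc123def", "service": "stripe-proxy", "duration_ms": 9800}]
session_events = [
    {"traceId": "abc123def", "event": "click"},
    {"event": "page_view"},  # no traceId: cannot be joined this way
]

related = pivot("abc123def", logs, spans, session_events)
```

Records without a traceId simply drop out of the join, which is why a second correlation method is needed for sources that never carry one.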

The signals it correlates

Signal	What it tells the agent
Logs	Backend events, errors, exceptions, app-level messages
Traces	End-to-end request flow, latency breakdown, service-to-service calls
Spans	Individual units of work — e.g. a slow database call, filtered by attribute
Kubernetes events	Pod restarts, OOMKills, scheduling failures, deploy-time events
Session events (RUM)	Frontend signals from the browser, linked back to backend traces

traceId links backend logs, traces, spans, and (often) session events. Kubernetes events have no trace ID, so the agent correlates them by time window and namespace.
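Because Kubernetes events have no trace ID, matching them to an application error is an overlap check on time and namespace instead of a key join. A minimal sketch, assuming each event carries a namespace and a timestamp. The field names, the namespace values, the date, and the two-minute look-back margin are illustrative assumptions, not Osuite defaults.

```python
from datetime import datetime, timedelta, timezone


def overlapping_k8s_events(events, namespace, start, end,
                           margin=timedelta(minutes=2)):
    """Return events in `namespace` that fall inside [start - margin, end].

    The margin widens the window backwards so that an event shortly
    before the first error (such as a restart) is still picked up.
    """
    window_start = start - margin
    return [
        e for e in events
        if e["namespace"] == namespace and window_start <= e["timestamp"] <= end
    ]


def utc(hour, minute):
    # Arbitrary date; only the time of day matters for this example.
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Hypothetical events; "prod" and "staging" are made-up namespaces.
events = [
    {"namespace": "prod", "reason": "BackOff", "timestamp": utc(14, 1)},
    {"namespace": "staging", "reason": "BackOff", "timestamp": utc(14, 5)},
    {"namespace": "prod", "reason": "Scheduled", "timestamp": utc(9, 0)},
]

# Error window 14:02 to 14:11 UTC, as in the checkout example.
matches = overlapping_k8s_events(events, "prod", utc(14, 2), utc(14, 11))
```

Here only the 14:01 BackOff event in the prod namespace matches: the staging event is in the wrong namespace and the 09:00 event is far outside the window.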

What a finding looks like

The agent doesn’t stop at “it’s probably the database.” A finding is concrete and evidence-backed:

The 500s on POST /checkout are coming from payments-svc between 14:02 and 14:11 UTC. Trace abc123def shows a 9.8s timeout calling stripe-proxy, which restarted at 14:01 per the Kubernetes event pod/stripe-proxy-7d… BackOff. The log line at 14:03 in stripe-proxy confirms connection refused to upstream. Suggested fix: add a connection retry with backoff in stripe-proxy’s client, and raise the readiness-probe delay so traffic isn’t routed before the pod is ready.

When the fix belongs to another team, the agent writes a bug-report-<timestamp>.md with the symptom, evidence, a reconstructed timeline, the suspected cause, and the full trace saved as JSON — everything the owning team needs to reproduce it.
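To make the report's contents concrete, here is a rough sketch of how such a file could be assembled. Only the five parts listed above come from this page. The Markdown heading layout, the timestamp format in the file name, and the helper names render_bug_report and report_filename are assumptions, and the agent's real output may look different.

```python
from datetime import datetime, timezone


def render_bug_report(symptom, evidence, timeline, suspected_cause, trace_file):
    """Render the five report parts as Markdown text (assumed layout)."""
    lines = ["# Bug report", "", "## Symptom", symptom, "", "## Evidence"]
    lines += [f"- {item}" for item in evidence]
    lines += ["", "## Timeline"]
    lines += [f"- {when}: {what}" for when, what in timeline]
    lines += ["", "## Suspected cause", suspected_cause, "",
              "## Full trace", f"Saved as JSON in {trace_file}", ""]
    return "\n".join(lines)


def report_filename(now=None):
    """Build bug-report-<timestamp>.md; the timestamp format is an assumption."""
    now = now or datetime.now(timezone.utc)
    return f"bug-report-{now.strftime('%Y%m%dT%H%M%SZ')}.md"


# Hypothetical values taken from the checkout example; the trace file
# name is made up.
report = render_bug_report(
    symptom="500s on POST /checkout",
    evidence=["Trace abc123def shows a 9.8s timeout calling stripe-proxy"],
    timeline=[("14:01", "stripe-proxy restarted"),
              ("14:02", "first 500 from payments-svc")],
    suspected_cause="Traffic was routed to stripe-proxy before it was ready",
    trace_file="trace-abc123def.json",
)
```

Keeping the evidence and timeline as separate lists mirrors the split described above: what was observed, and in what order it happened.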

How it relates to the other agents

The Instrumentation Agent gets your telemetry flowing in the first place — the Investigation Agent needs that data to work.
The Visualizer Agent turns findings into dashboards you can keep an eye on.

Next steps

Instrumentation Agent — wire up your services so the Investigation Agent has signal to work with
APM & Distributed Tracing — the trace data the agent pivots through
Log Management — how logs are correlated to traces via trace_id