
Inspecting runs

Data Explorer columns and the Agent Trace tab.

Inspect runs from Data Explorer:

  1. Open the project, then the pipeline.
  2. Navigate to the pipeline's Data Explorer route.
  3. Click any row to open the task modal. If at least one agent run exists, an Agent Trace tab appears in the left rail.

Data Explorer columns

| Column | Shows |
| --- | --- |
| Agent | Name and version. |
| Run status | pending, running, completed, failed, or cancelled. Failed rows render a failure-mode chip. |
| Judge verdict | PASS, FAIL, PENDING, or NOT APPLICABLE. FAIL rows include a failure_mode snippet. |
| Failure mode | Optional standalone column, hidden by default. |
| Tool calls | Total call count with a per-source tooltip breakdown. |
| Latency | End-to-end run latency. |
| Cost | Sum of agent, judge, and simulator LLM cost. |
| Behavior config | Truncated odyssey_seed.behavior_instructions value. |
| Ledger violations | Consistency-mode violations, hidden by default. |
| Trace | View trace link. |

Failure-mode chips

Each chip maps to the run's error_class:

| Chip | Palette | Error class |
| --- | --- | --- |
| auth failed | danger | auth_failed |
| timeout | warning | agent_timeout |
| 5xx | danger | agent_5xx |
| 4xx | amber | agent_4xx |
| bad response | warning | agent_bad_response |
| unreachable | danger | agent_unreachable |
| proxy misconfigured | amber | proxy_misconfigured |
| contract error | danger | contract_error |
| transport error | warning | transport_error |
| internal error | danger | internal_error |

Palette meanings: danger means customer intervention is required on the agent side; warning means the failure is retry-friendly; amber means integration drift.

Hover over a chip for a one-line description. The full table is on the Troubleshooting page.
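The chip mapping can be mirrored in a small lookup for scripted triage. This is a minimal sketch: the dictionary contents are transcribed from the chip table and palette meanings above, while the `triage` helper and its output format are hypothetical and not part of the product.

```python
# error_class -> (chip label, palette), transcribed from the chip table.
CHIPS = {
    "auth_failed": ("auth failed", "danger"),
    "agent_timeout": ("timeout", "warning"),
    "agent_5xx": ("5xx", "danger"),
    "agent_4xx": ("4xx", "amber"),
    "agent_bad_response": ("bad response", "warning"),
    "agent_unreachable": ("unreachable", "danger"),
    "proxy_misconfigured": ("proxy misconfigured", "amber"),
    "contract_error": ("contract error", "danger"),
    "transport_error": ("transport error", "warning"),
    "internal_error": ("internal error", "danger"),
}

# Palette meanings as stated in the text.
PALETTE_MEANING = {
    "danger": "customer intervention required on the agent side",
    "warning": "retry-friendly",
    "amber": "integration drift",
}


def triage(error_class: str) -> str:
    """Hypothetical helper: one-line triage hint for a run's error_class."""
    chip, palette = CHIPS[error_class]
    return f"{chip}: {PALETTE_MEANING[palette]}"
```

For example, `triage("agent_timeout")` yields a hint that the failure is retry-friendly, which suggests re-running before investigating the agent.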

The Agent Trace tab

Top to bottom:

  1. Run header strip, showing the agent name, status badge, judge verdict, total cost, and total latency.
  2. Completion warning banner, shown when the run completes with tools declared but zero proxy calls. See Completion warning banner.
  3. Judge verdict card, including the verdict, rubric sub-scores, reasoning, and judge cost. See Judge verdict card.
  4. Task seed panel, collapsed by default, with user_instruction, behavior_instructions, initial_state, and failure_rules.
  5. Multi-agent structure card, shown for multi-agent runs only, with the reconstructed sub-agent graph. See Multi-agent structure.
  6. Trajectory timeline with agent turns, tool calls, and live trace events.
  7. Debug panel, collapsed by default, with raw response body and any soft_warnings.

If a task has multiple runs, a run picker appears above the header strip.

Trajectory timeline

Three row types interleave in chronological order:

  • Agent turns, drawn from the response's messages array. Assistant rows with non-empty thinking arrays show a thinking count badge and expand to a Reasoning panel.
  • Tool calls, one row per proxied call, with tool name, source badge, and latency. Rows expand to show arguments, response, ledger snapshot, ledger updates, and any error.
  • Live trace events, one row per side-channel event such as assistant_message, thinking, system_prompt, or custom. Events are ordered by __occurred_at when present, otherwise by proxy arrival time.

Agent turns render only in rich mode, when messages is non-null. Live trace events render regardless.
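The ordering rule for live trace events can be sketched as a sort key. This is an illustration only: the `arrived_at` field name and the use of numeric timestamps are assumptions, since the text names only `__occurred_at` and "proxy arrival time".

```python
from typing import Any


def event_order_key(event: dict[str, Any]) -> float:
    """Prefer the agent-reported __occurred_at; otherwise use proxy arrival time.

    ASSUMPTION: timestamps are numeric (epoch seconds) and the proxy arrival
    time lives in a field named "arrived_at". Both are hypothetical.
    """
    occurred = event.get("__occurred_at")
    return occurred if occurred is not None else event["arrived_at"]


def order_events(events: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return events in timeline order. sorted() is stable, so ties keep input order."""
    return sorted(events, key=event_order_key)
```

An event without `__occurred_at` is placed by when the proxy received it, so it can appear later in the timeline than the moment it actually happened inside the agent.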

Multi-agent structure

For an internally multi-agent system, the trace tab renders a Multi-agent structure card. Single-agent runs omit it entirely and show a flat timeline with no card.

The card is reconstructed at read time from the run's tool-call attribution: actor_id call paths define delegation edges, which are unioned with ordered handoff trace events (peer or cyclic edges) and grounded against the declared topology when one exists. Top to bottom:

  • Summary stats with sub-agent count, delegation depth, handoff count, and a declared badge when a topology scoped the run.
  • Attribution integrity, showing derived calls (credited via a single declared owner) versus honestly-unattributed calls, plus declared-vs-observed divergence, including declared-only and undeclared nodes.
  • Sub-agent roster with actors, tool ownership, call counts, and provenance.
  • Relationships, including delegation edges and declared-but-unexercised edges.
  • Handoff timeline with ordered sub-agent transfers.

See Multi-agent systems → Read the agent graph for the provenance tags and the honesty metrics.
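As a rough illustration of how call paths can yield delegation edges and a depth figure, the sketch below assumes each tool call carries its actor path as a list of actor ids from the root agent down to the caller. That representation is hypothetical. The text says only that actor_id call paths define delegation edges, and this sketch does not cover handoff events or grounding against a declared topology.

```python
def delegation_edges(call_paths: list[list[str]]) -> set[tuple[str, str]]:
    """Collect (parent, child) edges from consecutive actors in each path."""
    edges: set[tuple[str, str]] = set()
    for path in call_paths:
        edges.update(zip(path, path[1:]))
    return edges


def delegation_depth(call_paths: list[list[str]]) -> int:
    """Longest chain of delegations below the root actor (0 for a flat run)."""
    return max((len(path) - 1 for path in call_paths), default=0)
```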

Live updates

In-progress runs poll every 3 seconds. Polling stops when all runs on the task reach a terminal status.
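A script that watches a task can follow the same cadence. The sketch below is a client-side analogue of the polling rule, not the dashboard's implementation. `fetch_statuses` is a hypothetical callable supplied by the caller, and treating completed, failed, and cancelled as the terminal statuses is inferred from the Run status column rather than stated outright.

```python
import time
from typing import Callable, Iterable, Optional

# ASSUMPTION: terminal statuses inferred from the documented run statuses.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}
POLL_INTERVAL_SECONDS = 3


def poll_until_terminal(
    fetch_statuses: Callable[[], Iterable[str]],
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,
) -> list[str]:
    """Poll until every run on the task has reached a terminal status."""
    polls = 0
    while True:
        statuses = list(fetch_statuses())
        polls += 1
        if statuses and all(s in TERMINAL_STATUSES for s in statuses):
            return statuses
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("runs did not reach a terminal status in time")
        sleep(POLL_INTERVAL_SECONDS)
```

Passing `sleep` as a parameter keeps the loop testable without real delays.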

For live trace events to appear, the agent must opt in. See SDK live trace forwarding or the trace-events endpoint.

Tool-call sources

| Badge | Source | Meaning |
| --- | --- | --- |
| odyssey | odyssey | Simulated against the seeded world. |
| injected | injected | A failure rule fired. |
| passthrough | passthrough | Forwarded to a registered tool endpoint and returned successfully. |
| error | error | The simulator could not produce a valid response. |
| transport error | transport_error | The passthrough hop failed, for example because of a missing or inactive endpoint, a cross-org mismatch, or a live tool error. |
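The per-source breakdown shown in the Tool calls tooltip can be reproduced from exported tool-call rows with a simple tally. This is a sketch under assumptions: the `source` field name on each row is hypothetical, while the source values are the ones listed in the table above.

```python
from collections import Counter
from typing import Any

# Source values as listed in the tool-call sources table.
SOURCES = ("odyssey", "injected", "passthrough", "error", "transport_error")


def source_breakdown(tool_calls: list[dict[str, Any]]) -> dict[str, Any]:
    """Total call count plus a per-source count, omitting sources with no calls."""
    counts = Counter(call["source"] for call in tool_calls)
    return {
        "total": sum(counts.values()),
        "by_source": {s: counts[s] for s in SOURCES if counts[s]},
    }
```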

Completion warning banner

When the run completes successfully but records zero proxy calls despite declaring tools, the trace tab renders a yellow banner. The run is not failed. This indicates likely proxy reachability issues.

Common causes:

  • The agent caught a transport error and folded it into final_response.
  • The agent's framework swallowed an exception in its tool body.
  • The proxy URL provided to the agent is unreachable from the agent runtime.

Diagnose with the outbound curl check from Register an agent → Network reachability.
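The banner condition described in this section reduces to a three-part predicate, which can be useful when scanning exported runs for the same symptom. The function and parameter names are hypothetical. Only the rule itself (completed, tools declared, zero proxy calls) comes from the text.

```python
from typing import Sequence


def needs_completion_warning(
    status: str,
    declared_tools: Sequence[str],
    proxy_call_count: int,
) -> bool:
    """True when a run completed with tools declared but made zero proxy calls."""
    return status == "completed" and len(declared_tools) > 0 and proxy_call_count == 0
```

A run with no declared tools never triggers the warning, since zero proxy calls is the expected outcome there.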

Ledger viewer

Each tool-call row pins the world-state ledger captured after the call, with two views toggled in the header:

  • State: the literal ledger contents at that step, including live entities in entity_type to entity_id shape, plus separate sections for state-change flags and removed (tombstoned) entities.
  • Changes: the ledger operations applied at that step, including add, update, remove, and set_flag. This is the authoritative per-step delta, not a structural diff inferred from snapshots.
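To make the relationship between the two views concrete, the sketch below applies a list of Changes operations to a State snapshot. The four operation names come from the text. The dictionary layout (`entities`, `flags`, `removed`) and the payload keys on each operation are hypothetical shapes chosen for illustration, not the platform's wire format.

```python
import copy
from typing import Any


def apply_changes(state: dict[str, Any], changes: list[dict[str, Any]]) -> dict[str, Any]:
    """Return a new state with the ledger operations applied in order."""
    new_state = copy.deepcopy(state)  # leave the input snapshot untouched
    for change in changes:
        op = change["op"]
        if op == "set_flag":
            new_state["flags"][change["flag"]] = change["value"]
            continue
        etype, eid = change["entity_type"], change["entity_id"]
        bucket = new_state["entities"].setdefault(etype, {})
        if op == "add":
            bucket[eid] = dict(change["data"])
        elif op == "update":
            bucket[eid].update(change["data"])
        elif op == "remove":
            # Tombstone the entity rather than dropping it silently.
            new_state["removed"].setdefault(etype, {})[eid] = bucket.pop(eid)
        else:
            raise ValueError(f"unknown ledger op: {op}")
    return new_state
```

Because the recorded operations are the authoritative delta, replaying them is more reliable than diffing two snapshots, which cannot distinguish an update from a remove followed by an add.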

Judge verdict card

The built-in judge scores each completed run on a fixed rubric, one to five per axis:

| Axis | Scale | What it measures |
| --- | --- | --- |
| Task completion | 1 to 5 | Whether the agent resolved the user request. |
| Instruction adherence | 1 to 5 or null | Whether the simulated environment followed behavior_instructions. Null when behavior instructions were not configured. |
| Efficiency | 1 to 5 | Step count relative to the expected minimal path. |

In addition to the sub-scores, the card shows:

  • Verdict: PASS when the outcome is acceptable and FAIL when escalation is likely.
  • Failure mode: a short label on FAIL runs and null on PASS.
  • Reasoning: a concise textual rationale citing trajectory evidence.

The rubric is currently fixed and not configurable per agent or workflow. To add custom scoring dimensions, use a downstream llm_judge criterion targeting the agent field. That system has an independent prompt and scale.

The expected_outcome seed axis affects scoring. Setting it to refusal indicates that a clean refusal is the correct outcome. See Task seeding.

When judge execution is not possible, for example when the agent fails before producing final_response, the card shows NOT APPLICABLE and the row verdict shows PENDING.
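The rubric constraints can be captured in a small validated record, for instance when post-processing exported verdicts. This sketch covers judged runs only (not the NOT APPLICABLE case). The class and field names are hypothetical, and only the ranges and null rules are taken from the rubric described above.

```python
from dataclasses import dataclass
from typing import Optional


def _check_score(name: str, value: int) -> None:
    if not 1 <= value <= 5:
        raise ValueError(f"{name} must be between 1 and 5, got {value}")


@dataclass(frozen=True)
class JudgeResult:
    verdict: str                                  # "PASS" or "FAIL"
    task_completion: int                          # 1 to 5
    efficiency: int                               # 1 to 5
    instruction_adherence: Optional[int] = None   # None when no behavior instructions
    failure_mode: Optional[str] = None            # short label on FAIL, None on PASS
    reasoning: str = ""

    def __post_init__(self) -> None:
        if self.verdict not in ("PASS", "FAIL"):
            raise ValueError(f"unexpected verdict: {self.verdict}")
        _check_score("task_completion", self.task_completion)
        _check_score("efficiency", self.efficiency)
        if self.instruction_adherence is not None:
            _check_score("instruction_adherence", self.instruction_adherence)
        if self.verdict == "PASS" and self.failure_mode is not None:
            raise ValueError("failure_mode is null on PASS runs")
```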

Coding runs

When the run type is coding, meaning a code agent working in a seeded repository workspace, the trace tab includes additional surfaces and behaves differently from ledger or simulation-only runs.

The Trajectory timeline fills in after the run

A CLI coding agent edits files and runs commands natively in the sandbox and does not use the proxy path for command execution, so the trajectory does not stream live. The platform reconstructs it after the run completes, from either the CLI transcript for recognized harnesses or syscall capture for an unknown CLI. Until completion the timeline can look empty even on a long run; it back-fills once the harness exits.

Because syscall capture records executed commands but not always their terminal output, recovered shell steps can show "Output not captured" messages. This is expected in unknown-CLI capture mode and is not a failure condition.

The Final diff panel

Below the trajectory, the Final diff panel renders everything the agent changed versus the seeded baseline as one unified diff (added lines in green, removed lines in red). Setup artifacts created while seeding the workspace are excluded, so the diff reflects only the agent's net change. The panel hides itself on runs with no diff, so a ledger run is visually unchanged.

Scorer badges

A Scorers row appears alongside the diff, with one entry per scorer the run was graded against (tests, lint, allowed-paths, …). Each entry has a PASS / FAIL verdict and an expandable detail (the failing test tail, etc.), and the row ends with an OVERALL PASS / FAIL summary. For scorer semantics and the coding-run grading model, see Scorers and grading.

The workspace-eval banner

The run still completes when the grading phase fails, and the mechanical trajectory signal remains available. In this case, a red banner reports the eval phase failure and warns that the diff or scorer outputs may be missing. The banner indicates partial or absent grading artifacts, not a run failure.

Coding sessions

A multi-turn coding run is inspected as a session. Each turn shows a turn-level trajectory and a turn-level diff, and the session output includes a cumulative diff across all turns plus a session-level scorer breakdown and judge verdict. See Multi-turn testing for the session machinery.

Re-running

A task accumulates one agent run per dispatch. When more than one exists, the run picker on the trace tab scrolls through history, and the Data Explorer shows the most recent run by default.

Runs are created when a task is seeded, by the automatic re-dispatch on transient transport failures (see Troubleshooting), and on demand via Re-run (below).

Re-run a run

The trace tab's header has a Re-run button (admin task-detail view). It dispatches a fresh run for the same task and agent field and preserves the original run in history, so recovering from a transient failure no longer requires deleting and re-seeding the task. The new run reuses the current workflow config (latest agent version, tool modes, model picks), so a re-run also picks up edits made since the original dispatch. The new run appears in the run picker as soon as it finishes; while it is in flight the trace tab polls for it.

A run that failed because the agent was briefly unreachable can be recovered this way directly: the platform re-primes the task so the retried result writes back normally.

The same action is available over the API for scripted / IaC setups:

POST /api/projects/{project_id}/workflows/{workflow_id}/tasks/{task_id}/agent-runs/{run_id}/re-run

It accepts the pk_live_ org API key (or a dashboard session), requires workflow manage permission, and returns 202 Accepted with the source run id. Poll GET /api/projects/{p}/workflows/{w}/tasks/{task_id}/agent-runs for the new run.
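For scripted setups, the request can be assembled with the Python standard library. This is a sketch under assumptions: the endpoint path is the one documented above, but the base URL is a placeholder and sending the key as an `Authorization: Bearer` header is an assumption, since the text does not say how the pk_live_ key is transmitted.

```python
import urllib.request


def build_rerun_request(
    base_url: str,
    api_key: str,
    project_id: str,
    workflow_id: str,
    task_id: str,
    run_id: str,
) -> urllib.request.Request:
    """Build (but do not send) the POST request for the documented re-run endpoint."""
    path = (
        f"/api/projects/{project_id}/workflows/{workflow_id}"
        f"/tasks/{task_id}/agent-runs/{run_id}/re-run"
    )
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        method="POST",
        # ASSUMPTION: bearer-style auth header; check the API reference for the real scheme.
        headers={"Authorization": f"Bearer {api_key}"},
    )
```

Sending the request with `urllib.request.urlopen` should yield a 202 status on success per the description above, after which the agent-runs listing can be polled for the new run.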

Re-run is currently available for single-shot runs. Multi-turn sessions are re-run by re-seeding the task.