Agent Tracing with Pipelines
Grading Continuity
Deterministic input/output regression tests can't capture how an agent reasons, adapts, and recovers. Pipelines grades the behavior behind the outcome:
- Task Completion: Given a goal and a set of tools, did the agent actually get the job done? Determine Accuracy.
- Tool Use: With APIs and live MCP connections on hand, did it reach for the right tool at the right moment, and did it retain the context that mattered? Understand Trajectory Quality.
- Failure Handling: A dependency misfires; an API stalls. When the environment turns unpredictable, does the agent recover gracefully? Measure Resilience.
- Complex System Behavior: Dispatch a fleet of agents together. Do they carry context forward, coordinate, and reach the shared objective? Observe System-level Behavior.
Key capabilities
Register and version agents
- Agent Library: Register external HTTP agents or in-platform code/sandbox agents, declare tool schemas, and configure per-run timeout and concurrency.
- Agent versioning: Track material config/tool updates as new versions so run history stays reproducible.
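One way to picture versioning on material changes is to fingerprint only the fields that affect behavior and bump the version when that fingerprint changes. The sketch below is illustrative, not the platform's API: `AgentRegistry`, `config_fingerprint`, the field names, and the choice of which fields count as material are all assumptions.

```python
import hashlib
import json

# Assumption: these are the fields whose changes count as "material".
MATERIAL_FIELDS = ("endpoint", "tools", "timeout_s", "max_concurrency")


def config_fingerprint(config: dict) -> str:
    """Hash only the material fields, using a canonical JSON encoding."""
    material = {key: config.get(key) for key in MATERIAL_FIELDS}
    canonical = json.dumps(material, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


class AgentRegistry:
    """Hypothetical in-memory registry that versions agents by fingerprint."""

    def __init__(self):
        # agent name -> list of (version number, fingerprint, config snapshot)
        self._versions = {}

    def register(self, name: str, config: dict) -> int:
        history = self._versions.setdefault(name, [])
        fingerprint = config_fingerprint(config)
        if history and history[-1][1] == fingerprint:
            # Nothing material changed, so the latest version is reused.
            return history[-1][0]
        version = len(history) + 1
        history.append((version, fingerprint, dict(config)))
        return version
```

Under these assumptions, editing a non-material field such as a description keeps the version number, while adding a tool schema produces a new one, so earlier runs stay tied to the configuration they actually used.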
Simulate and stress test behavior
- Odyssey execution modes: Run tools in sandbox simulation, passthrough mode to live services, or failure-injected mode to test recovery paths.
- Task seeding: Author scenarios by specifying the task instruction, behavior instructions, initial state, failure rules, expected outcome, and tracked constraints.
- Synthetic generation grounded in real data: Generate synthetic seed tasks in bulk, then ground generation with dataset-backed synthetic profiles so scenarios reflect real operating patterns.
- Model controls: Select simulator and judge models independently, with org-level defaults and BYOK model credentials.
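The task seed fields and the three execution modes listed above can be sketched as plain data structures plus a small tool runner. Everything here is hypothetical: `TaskSeed`, `FailureRule`, `ExecutionMode`, and `ToolRunner` are illustrative names, and the rule format (fail the Nth call to a given tool) is an assumption about what a failure rule might look like.

```python
from dataclasses import dataclass, field
from enum import Enum


class ExecutionMode(Enum):
    SANDBOX = "sandbox"
    PASSTHROUGH = "passthrough"
    FAILURE_INJECTED = "failure_injected"


@dataclass
class FailureRule:
    tool: str     # tool the rule applies to
    on_call: int  # 1-based index of the call to that tool that should fail
    error: str    # error returned to the agent


@dataclass
class TaskSeed:
    instruction: str
    behavior_instructions: str = ""
    initial_state: dict = field(default_factory=dict)
    failure_rules: list = field(default_factory=list)
    expected_outcome: str = ""
    tracked_constraints: list = field(default_factory=list)


class ToolRunner:
    """Routes each tool call according to the selected execution mode."""

    def __init__(self, seed, mode, live_tools=None):
        self.seed = seed
        self.mode = mode
        self.live_tools = live_tools or {}  # tool name -> callable
        self.call_counts = {}

    def call(self, tool, args):
        count = self.call_counts.get(tool, 0) + 1
        self.call_counts[tool] = count
        if self.mode is ExecutionMode.FAILURE_INJECTED:
            for rule in self.seed.failure_rules:
                if rule.tool == tool and rule.on_call == count:
                    return {"provenance": "injected", "error": rule.error}
        if self.mode is ExecutionMode.PASSTHROUGH:
            result = self.live_tools[tool](**args)
            return {"provenance": "passthrough", "result": result}
        # Sandbox calls, and non-failing calls in failure-injected mode,
        # get a simulated response (here simply an echo of the arguments).
        return {"provenance": "simulated", "result": {"tool": tool, "echo": args}}
```

In this sketch a rule such as `FailureRule("refund", 1, "timeout")` fails only the first `refund` call, so a retry succeeds and the run can show whether the agent recovered.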
Inspect runs with evidence
- Trajectory visibility: Inspect tool-call timelines, arguments, responses, and per-call provenance (simulated, injected, passthrough, or error).
- Judge + mechanical metrics: Review semantic pass/fail verdicts alongside structural metrics such as completion, schema compliance, and consistency.
- State inspection: Analyze ledger/state transitions and final run artifacts to understand why a run passed or failed.
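To make the structural side of run inspection concrete, the sketch below summarizes a tool-call timeline into provenance counts and a simple schema-compliance ratio. The record layout (`tool`, `args`, `provenance`) and the function name `summarize_trajectory` are assumptions for illustration, and "schema compliance" is reduced here to checking that required argument names are present.

```python
from collections import Counter


def summarize_trajectory(calls, required_args):
    """Summarize a list of tool-call records.

    calls: list of dicts with "tool", "args", and "provenance" keys.
    required_args: dict mapping a tool name to the set of argument names
    that a call to it must include.
    """
    provenance = Counter(call["provenance"] for call in calls)
    # Only calls to tools with a declared requirement are checked.
    checked = [call for call in calls if call["tool"] in required_args]
    compliant = sum(
        1 for call in checked
        if required_args[call["tool"]] <= set(call["args"])
    )
    return {
        "provenance": dict(provenance),
        "schema_compliance": compliant / len(checked) if checked else 1.0,
    }
```

A summary like this complements, rather than replaces, a judge verdict: it shows how many calls were simulated, injected, or passed through, and how often the agent supplied well-formed arguments.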
Evaluate sessions and automate workflows
- Multi-turn testing: Run model-as-user sessions to evaluate coherence, memory, and cross-turn task completion.
- Datasets and comparisons: Store runs in versioned datasets and compare behavior across agent versions or scenario revisions.
- API, SDK, and CLI: Automate agent registration, task seeding, run dispatch, and result analysis programmatically.
- RBAC: Control access with Org Admin, Project Admin, and Contributor roles.
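Comparing behavior across agent versions, as described in the datasets item above, often comes down to pass rates per version and the list of tasks that regressed. The sketch below assumes exported run records with `task_id`, `agent_version`, and `passed` keys; that layout and both function names are hypothetical and not part of any documented SDK.

```python
from collections import defaultdict


def pass_rate_by_version(runs):
    """Return {agent_version: fraction of runs that passed}."""
    totals = defaultdict(lambda: [0, 0])  # version -> [passed, total]
    for run in runs:
        bucket = totals[run["agent_version"]]
        bucket[0] += 1 if run["passed"] else 0
        bucket[1] += 1
    return {version: passed / total for version, (passed, total) in totals.items()}


def regressions(runs, old, new):
    """Return sorted task ids that passed on version `old` but failed on `new`."""
    outcome = {(run["task_id"], run["agent_version"]): run["passed"] for run in runs}
    task_ids = sorted({run["task_id"] for run in runs})
    return [
        task_id for task_id in task_ids
        if outcome.get((task_id, old)) is True
        and outcome.get((task_id, new)) is False
    ]
```

Tasks that were not run on both versions are ignored by `regressions`, which keeps the comparison limited to like-for-like scenarios.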