Glossary

Glossary (A-Z)

Agent: An agent is the system under test. Pipelines dispatches one task at a time to your registered agent and records the resulting run trajectory.

Agent Library: The Agent Library is the admin surface where agents are registered, versioned, and configured. See Agents.

Agent-mode field: An agent-mode field is a form field configured in the Pipelines graphical workspace for agent dispatch.

Code agent: A code agent runs inside the platform sandbox using either a Python entrypoint or a CLI coding profile. These runs can include workspace diff and scenario scorer grading.

Contributor: A Contributor is a project-scoped role focused on task completion and human review work.

Criteria: Criteria are reusable evaluation definitions, including human, LLM, and programmatic checks, that can be attached as evaluators. See Evaluations.

Data Explorer: Data Explorer is the aggregate dataset for individual agent configurations. Each entry has judge verdicts, latency and cost measurements, and trace links attached.

Data Vault: Data Vault is the unified dataset hub for all agent runs. See Datasets.

Dataset: A dataset is the collection of agent decisions, traces, pass rates, and financial metrics that can be used for analysis, comparison, and export.

Evaluator: An evaluator is a user-defined or LLM-defined criterion, attached to an agent's input or output, that scores task completions.

Expected outcome: An expected outcome is an input that defines what "correct" looks like, such as task completion or action refusal. The judge compares the agent's final output against the expected outcome.

External HTTP agent: An external HTTP agent is a customer-hosted agent endpoint that Pipelines calls via HTTP dispatch.

Failure rules: Failure rules are seeded rules that deterministically inject failures so you can test recovery behavior.
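
As an illustration of deterministic failure injection, here is a minimal sketch. The `FailureRule` fields and the `apply_failure_rules` helper are hypothetical names chosen for this example, not the Pipelines rule format. The point is that the same rules and the same call sequence always produce the same failures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FailureRule:
    # All field names are assumptions made for this sketch.
    tool: str      # tool name the rule targets
    on_call: int   # 1-based call count at which the failure is injected
    error: str     # error message returned in place of a normal response


def apply_failure_rules(rules: List[FailureRule], tool: str, call_number: int) -> Optional[str]:
    """Return the injected error for this call, or None if no rule matches."""
    for rule in rules:
        if rule.tool == tool and rule.on_call == call_number:
            return rule.error
    return None


rules = [FailureRule(tool="search", on_call=2, error="timeout")]
first_call = apply_failure_rules(rules, "search", 1)   # no rule matches
second_call = apply_failure_rules(rules, "search", 2)  # rule fires
```

Because the rule is keyed on the tool name and call count rather than on randomness, a recovery test that depends on it is repeatable.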

Field session: A field session is a multi-step interaction with an agent. Conversation state and context accumulate as the interaction grows in length.

Judge verdict: A judge verdict is the LLM judge's output on a fixed rubric, including pass or fail, reasoning, and a failure mode when the run fails.
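
The shape of a verdict can be sketched as a small record type. The class and field names below are assumptions for illustration, not the platform's schema. The only constraint taken from the definition is that a failure mode accompanies a failed verdict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JudgeVerdict:
    # Hypothetical field names; only the three parts come from the definition.
    passed: bool
    reasoning: str
    failure_mode: Optional[str] = None

    def __post_init__(self) -> None:
        # A failed verdict carries a failure mode; a passing one does not.
        if self.passed and self.failure_mode is not None:
            raise ValueError("a passing verdict should not set failure_mode")
        if not self.passed and self.failure_mode is None:
            raise ValueError("a failed verdict needs a failure_mode")


ok = JudgeVerdict(passed=True, reasoning="Output matches the expected outcome.")
bad = JudgeVerdict(
    passed=False,
    reasoning="Agent issued a refund it should have refused.",
    failure_mode="unsafe_action",  # illustrative label, not a documented value
)
```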

Ledger schema: A ledger schema is an optional typed schema for simulated world entities and state used by Odyssey.

MCP tools: MCP tools are tools sourced from MCP servers and callable by agents through declared tool schemas.

Multi-agent system: A multi-agent system is a topology (defined agent hierarchy) where one system contains multiple internal agents or sub-agents and the handoffs are traced.

Multi-turn testing: Multi-turn testing is session-based testing where a model-as-user interacts with the agent over multiple turns. This simulates conversations and tests agent memory.

Odyssey: Odyssey is the world simulation and runtime layer that mediates tool calls and applies simulation behavior.

Odyssey proxy URL: The Odyssey proxy URL is a per-run endpoint the agent uses for tool calls so Pipelines can observe, simulate, and score execution.

Organization: An organization is the top-level account boundary for projects, members, models, tools, and permissions.

Org Admin: An Org Admin is an organization-scoped admin role with full control over organization resources.

Passthrough mode: Passthrough mode is a tool execution mode that forwards calls to live external endpoints (e.g. Tavily, Zapier) instead of simulation.

Pipeline: A pipeline is the workflow scaffold that executes tasks, agents, evaluators, and review steps. See Pipelines.

Project: A project is a workspace inside an organization that scopes agents, datasets, tasks, and role assignments.

Project Admin: A Project Admin is a project-scoped admin role. Owners have management permissions while viewers are read-only.

Run: A run is one (agent, task) execution with trajectory, outputs, metrics, and verdicts.

Sandbox mode: Sandbox mode is a tool execution mode where Odyssey returns simulated responses from seeded state.
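
The difference between the two tool execution modes in this glossary (sandbox and passthrough) can be sketched as a simple dispatch. Everything here is an assumption for illustration: `ToolMode`, `execute_tool`, and the shape of the seeded state are hypothetical, and the sketch does not describe how Odyssey is implemented.

```python
from enum import Enum
from typing import Any, Callable, Dict


class ToolMode(Enum):
    SANDBOX = "sandbox"
    PASSTHROUGH = "passthrough"


def execute_tool(
    mode: ToolMode,
    tool: str,
    args: Dict[str, Any],
    seeded_state: Dict[str, Any],
    live_call: Callable[[str, Dict[str, Any]], Any],
) -> Any:
    """Return a simulated response in sandbox mode, else forward the call."""
    if mode is ToolMode.SANDBOX:
        # Simulated response looked up from seeded state (hypothetical layout).
        return seeded_state[tool]
    # Passthrough: hand the call to whatever reaches the live endpoint.
    return live_call(tool, args)


def fake_live_call(tool: str, args: Dict[str, Any]) -> Any:
    # Stand-in for a real external endpoint, so the sketch runs offline.
    return {"tool": tool, "live": True, "echo": args}


seeded = {"get_weather": {"temp_c": 21}}
simulated = execute_tool(ToolMode.SANDBOX, "get_weather", {"city": "Oslo"}, seeded, fake_live_call)
forwarded = execute_tool(ToolMode.PASSTHROUGH, "get_weather", {"city": "Oslo"}, seeded, fake_live_call)
```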

Scenario scorers: Scenario scorers are mechanical checks, especially for coding or code-agent scenarios, that are used alongside judge scoring.

Seed / task seed: A seed, or task seed, is the scenario input for a run. It includes the instruction, behavior instructions, initial world state, failure rules, and expected outcome.
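
To make the five parts of a seed concrete, here is a minimal sketch of one as a record. The `TaskSeed` class, its field names, and all example values are hypothetical; the real seed format may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TaskSeed:
    # Field names mirror the five parts named in the definition (assumed spelling).
    instruction: str
    behavior_instructions: str
    initial_world_state: Dict[str, Any] = field(default_factory=dict)
    failure_rules: List[Dict[str, Any]] = field(default_factory=list)
    expected_outcome: str = ""


seed = TaskSeed(
    instruction="Refund order 1001 if it is eligible.",
    behavior_instructions="The simulated user is impatient and gives short answers.",
    initial_world_state={"orders": {"1001": {"status": "delivered", "refundable": False}}},
    failure_rules=[{"tool": "issue_refund", "on_call": 1, "error": "timeout"}],
    expected_outcome="The agent refuses the refund and explains why.",
)
```

In this example the expected outcome is an action refusal, so a judge would score the run on whether the agent declined the refund rather than on whether it completed one.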

Studio: Studio is the dataset analytics and charting workspace for exploration and comparison.

Task: A task is the unit of work for one pipeline row. Agent execution and optional human review operate at this level.

Tools schema: A tools schema is the declared list of tools and JSON input schemas that an agent can call during a run.
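
A tools schema of this kind might look like the sketch below. The tool name, the `input_schema` key, and the `missing_required` helper are assumptions for illustration. The input schema uses standard JSON Schema keywords (`type`, `properties`, `required`), and the helper checks only required keys rather than performing full validation.

```python
from typing import Any, Dict, List

# Hypothetical declaration: one tool with a JSON Schema for its input.
TOOLS_SCHEMA: List[Dict[str, Any]] = [
    {
        "name": "search_orders",
        "description": "Look up orders for a customer.",
        "input_schema": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string"},
                "limit": {"type": "integer"},
            },
            "required": ["customer_id"],
        },
    },
]


def missing_required(tools: List[Dict[str, Any]], tool_name: str, arguments: Dict[str, Any]) -> List[str]:
    """List required input keys that a tool call leaves out."""
    for tool in tools:
        if tool["name"] == tool_name:
            required = tool["input_schema"].get("required", [])
            return [key for key in required if key not in arguments]
    raise KeyError(f"undeclared tool: {tool_name}")


incomplete = missing_required(TOOLS_SCHEMA, "search_orders", {"limit": 5})
complete = missing_required(TOOLS_SCHEMA, "search_orders", {"customer_id": "c-42"})
```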

Trajectory: A trajectory is the ordered record of tool calls, arguments, responses, and sources across a run or session.
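
As a rough sketch, a trajectory can be modeled as an append-only list of steps. The class and field names are hypothetical, and reading "source" as the origin of each response (for example, sandbox or passthrough) is an assumption made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class ToolCallStep:
    tool: str
    arguments: Dict[str, Any]
    response: Any
    source: str  # assumed meaning: where the response came from


@dataclass
class Trajectory:
    steps: List[ToolCallStep] = field(default_factory=list)

    def record(self, step: ToolCallStep) -> None:
        # Appending preserves call order, which is what makes the record ordered.
        self.steps.append(step)

    def tools_called(self) -> List[str]:
        return [step.tool for step in self.steps]


trajectory = Trajectory()
trajectory.record(ToolCallStep("search_orders", {"customer_id": "c-42"}, {"orders": []}, "sandbox"))
trajectory.record(ToolCallStep("send_email", {"to": "c-42"}, {"sent": True}, "sandbox"))
```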
