Multi-turn testing setup
Configure model-as-user conversations with task seed axes, simulator modes, and session-level inspection.
Multi-turn testing evaluates an agent as a conversation rather than a single request and response cycle. Each task row can create an agent session with ordered turns, shared world state, and a session-level judge verdict.
Use this mode when testing memory, instruction consistency across turns, or policy regressions that emerge only after follow-up prompts.
Prerequisites
Before enabling multi-turn behavior on dataset rows:
- Register the agent and confirm one-shot runs are healthy: Register an agent
- Define tools and simulator behavior: Tools schema
- Prepare task seed mappings in your pipeline field: Task seeding
Enable multi-turn by row
In Pipeline Builder, on the agent field:
- Open Odyssey Seed Columns.
- Map a dataset column to turn_mode.
- Set row values to model_as_user for rows that should run as multi-turn sessions. Leave blank, or set single_shot, for one-shot rows.
This allows one-shot and multi-turn scenarios to coexist in the same dataset.
Multi-turn seed axes
All multi-turn controls are optional. If omitted, defaults are applied.
| Axis | Accepted values / shape | Default | Notes |
|---|---|---|---|
| turn_mode | single_shot or model_as_user | single_shot | model_as_user enables multi-turn session orchestration. |
| max_turns | Integer 1..50 | 10 | Hard cap on conversation length. |
| memory_mode | replay or stateful | replay | replay resends transcript each turn. stateful relies on agent-side memory. |
| simulator_mode | persona or scripted | persona | persona generates next user turn with an LLM. scripted replays fixed turns. |
| user_simulator_persona | Text | none | Used by persona mode to shape user behavior. |
| scripted_user_turns | JSON array of strings | [] | Used by scripted mode. Each string represents one user turn. |
| tracked_constraints | JSON array | [] | Session-level constraints checked in judge output, including fact-checking matrix. |
| termination_keyword | Text substring | none | Session ends early when this substring appears in an agent reply. |
Recommended starter configuration
Recommended initial configuration:
- turn_mode = model_as_user
- simulator_mode = persona
- memory_mode = replay
- max_turns = 6-10
- A short user_simulator_persona describing goals, tone, and escalation style
After baseline quality stabilizes, add stateful rows to catch memory regressions inside your own runtime/framework memory layer.
Dispatch envelope in multi-turn
Each turn is dispatched as a new agent run with its own per-run token. The current user prompt is always delivered in input.user_instruction, matching single-shot dispatch behavior. It is also mirrored in input.latest_user_prompt as a backward compatibility alias.
In replay mode, the platform also sends prior conversation history in input.messages and carried world state in input.scenario_state. In stateful mode, those replay fields are omitted. input.session_id and input.turn_id remain available as optional context for agents that key memory explicitly.
CSV examples
Persona-driven conversations
user,turn_mode,max_turns,simulator_mode,memory_mode,user_simulator_persona,tracked_constraints
"Help me fix my failed refund for order #4521.","model_as_user","8","persona","replay","Customer is frustrated but cooperative. They expect the agent to remember previous order details and avoid repeating verification steps.","[""Never reveal internal refund policy notes."",""Always confirm order id before refunding.""]"Scripted deterministic replay
user,turn_mode,simulator_mode,scripted_user_turns,max_turns,memory_mode,termination_keyword
"Resolve this support issue end-to-end.","model_as_user","scripted","[""My name is Alex and order is #9001."",""Can you repeat my name and order number?"",""Please process a refund.""]","6","stateful","Your refund has been processed"How sessions terminate
A multi-turn session ends when any of the following conditions is met:
- The simulator emits a terminate signal.
- max_turns is reached.
- termination_keyword appears in an agent reply.
- A turn fails or times out under end-session policy.
After termination, the platform runs a session-level judge on the full transcript and writes verdict and metrics to the session.
Multi-turn with a coding workspace
When task seed targets a coding scenario with seeded repository workspace, the session adds coding workspace behavior on top of multi-turn orchestration. Core session mechanics remain unchanged.
- One workspace reused across turns. Sandbox and seeded repository are created once at turn 0, reused on subsequent turns, and torn down at session end.
- Cumulative diff per turn. Each turn diff is measured against the original seeded baseline, not prior-turn state.
- Session judge receives workspace evidence. For coding sessions, the session-level judge receives cumulative workspace diff plus scorer and objective outputs alongside transcript.
The following do not change: conversation configuration fields, session termination behavior, and session-level judge rubric. Multi-turn and coding are independent axes. Any valid combination is supported. Non-coding sessions skip workspace-specific behavior.
Inspecting results
From Data Explorer, open a task row and inspect Agent Trace for that task. Multi-turn rows expose:
- Canonical transcript across turns.
- Turn-by-turn trajectory, including tool calls and trace events.
- Session-level judge verdict and metrics.
- Constraint outcomes from tracked_constraints.
For the baseline trace UI model, see Inspecting runs.
Common setup mistakes
- turn_mode typo, such as model_as_users or multi_turn, causes silent fallback to one-shot defaults.
- max_turns outside 1..50 is dropped and default value is applied.
- scripted_user_turns that is not a JSON array of strings is ignored.
- simulator_mode set to scripted with an empty script can terminate session immediately.
- stateful mode without robust agent-side session memory can appear as context regression.