Multi-turn testing setup

Multi-turn testing evaluates an agent as a conversation rather than a single request and response cycle. Each task row can create an agent session with ordered turns, shared world state, and a session-level judge verdict.

Use this mode when testing memory, instruction consistency across turns, or policy regressions that emerge only after follow-up prompts.

Prerequisites

Before enabling multi-turn behavior on dataset rows:

Register the agent and confirm one-shot runs are healthy: Register an agent
Define tools and simulator behavior: Tools schema
Prepare task seed mappings in your pipeline field: Task seeding

Enable multi-turn by row

In Pipeline Builder, on the agent field:

Open Odyssey Seed Columns.
Map a dataset column to turn_mode.
Set row values to model_as_user for rows that should run as multi-turn sessions. Leave blank, or set single_shot, for one-shot rows.

This allows one-shot and multi-turn scenarios to coexist in the same dataset.

Multi-turn seed axes

All multi-turn controls are optional. If omitted, defaults are applied.

Axis	Accepted values / shape	Default	Notes
turn_mode	single_shot or model_as_user	single_shot	model_as_user enables multi-turn session orchestration.
max_turns	Integer 1..50	10	Hard cap on conversation length.
memory_mode	replay or stateful	replay	replay resends transcript each turn. stateful relies on agent-side memory.
simulator_mode	persona or scripted	persona	persona generates next user turn with an LLM. scripted replays fixed turns.
user_simulator_persona	Text	none	Used by persona mode to shape user behavior.
scripted_user_turns	JSON array of strings	[]	Used by scripted mode. Each string represents one user turn.
tracked_constraints	JSON array	[]	Session-level constraints checked in judge output, including fact-checking matrix.
termination_keyword	Text substring	none	Session ends early when this substring appears in an agent reply.

Recommended starter configuration

Recommended initial configuration:

turn_mode = model_as_user
simulator_mode = persona
memory_mode = replay
max_turns = 6-10
A short user_simulator_persona describing goals, tone, and escalation style

After baseline quality stabilizes, add stateful rows to catch memory regressions inside your own runtime/framework memory layer.

Dispatch envelope in multi-turn

Each turn is dispatched as a new agent run with its own per-run token. The current user prompt is always delivered in input.user_instruction, matching single-shot dispatch behavior. It is also mirrored in input.latest_user_prompt as a backward compatibility alias.

In replay mode, the platform also sends prior conversation history in input.messages and carried world state in input.scenario_state. In stateful mode, those replay fields are omitted. input.session_id and input.turn_id remain available as optional context for agents that key memory explicitly.

CSV examples

Persona-driven conversations

user,turn_mode,max_turns,simulator_mode,memory_mode,user_simulator_persona,tracked_constraints
"Help me fix my failed refund for order #4521.","model_as_user","8","persona","replay","Customer is frustrated but cooperative. They expect the agent to remember previous order details and avoid repeating verification steps.","[""Never reveal internal refund policy notes."",""Always confirm order id before refunding.""]"

Scripted deterministic replay

user,turn_mode,simulator_mode,scripted_user_turns,max_turns,memory_mode,termination_keyword
"Resolve this support issue end-to-end.","model_as_user","scripted","[""My name is Alex and order is #9001."",""Can you repeat my name and order number?"",""Please process a refund.""]","6","stateful","Your refund has been processed"

How sessions terminate

A multi-turn session ends when any of the following conditions is met:

The simulator emits a terminate signal.
max_turns is reached.
termination_keyword appears in an agent reply.
A turn fails or times out under end-session policy.

After termination, the platform runs a session-level judge on the full transcript and writes verdict and metrics to the session.

Multi-turn with a coding workspace

When task seed targets a coding scenario with seeded repository workspace, the session adds coding workspace behavior on top of multi-turn orchestration. Core session mechanics remain unchanged.

One workspace reused across turns. Sandbox and seeded repository are created once at turn 0, reused on subsequent turns, and torn down at session end.
Cumulative diff per turn. Each turn diff is measured against the original seeded baseline, not prior-turn state.
Session judge receives workspace evidence. For coding sessions, the session-level judge receives cumulative workspace diff plus scorer and objective outputs alongside transcript.

The following do not change: conversation configuration fields, session termination behavior, and session-level judge rubric. Multi-turn and coding are independent axes. Any valid combination is supported. Non-coding sessions skip workspace-specific behavior.

Inspecting results

From Data Explorer, open a task row and inspect Agent Trace for that task. Multi-turn rows expose:

Canonical transcript across turns.
Turn-by-turn trajectory, including tool calls and trace events.
Session-level judge verdict and metrics.
Constraint outcomes from tracked_constraints.

For the baseline trace UI model, see Inspecting runs.

Common setup mistakes

turn_mode typo, such as model_as_users or multi_turn, causes silent fallback to one-shot defaults.
max_turns outside 1..50 is dropped and default value is applied.
scripted_user_turns that is not a JSON array of strings is ignored.
simulator_mode set to scripted with an empty script can terminate session immediately.
stateful mode without robust agent-side session memory can appear as context regression.

On this page