Run a coding CLI agent
Register a coding CLI agent on a repository workspace, then read its trajectory, diff, and grading.
A coding agent runs a standard coding CLI in a sandbox that already contains a seeded git repository. You provide a task, the CLI edits the repository, and the platform records the full trajectory, the final diff, and grading outputs.
You do not need to instrument the CLI or add custom logging flags. Point the platform to the CLI binary, attach a coding scenario, and run.
A coding agent is one sandbox-agent profile. Repository workspace handling, diffs, and workspace scorers are specific to coding mode. Code sources, sandbox runtime, trajectory capture, and judge behavior are shared with other sandbox agents. If you only need to run a Python agent in a sandbox without a repository, see Register an agent.
The pieces you configure
A coding CLI agent is five layers, not just a CLI name. Each maps to a step on this page:
| Layer | Where you set it | What it decides |
|---|---|---|
| Source (optional) | Code source picker | Extra agent files materialized in the sandbox. Skip it when the CLI binary plus task brief is all you need. See Sandbox agent reference. |
| Entrypoint | Run command | How the platform starts the CLI inside the repository. |
| Runtime | Custom Dockerfile | Which CLI binary exists before the run. |
| Credentials | Environment variable, From credential | Which model-provider key the CLI can use. |
| Observability | Tools, harness config, result metadata | What the platform can render, simulate, grade, and cost. |
Importing source never fills the other four layers. If you point the platform at a repo, the run command must still call those files.
What a run looks like
- The platform boots a sandbox and seeds the workspace from your coding scenario, including clone, optional setup, and a baseline commit.
- It writes the task brief to a file and runs your CLI command.
- The CLI runs inside the repository, executes commands, and edits files.
- After command completion, the platform reconstructs trajectory, captures the final diff against baseline, runs scorers, and requests a judge verdict.
Register a coding agent
In the sidebar, select Agents, then Register agent. This view is available to Org Admins and Project Admin Owners.
Mode and execution
In the Connection step, select the Sandbox Agents mode card.
Under How your agent runs, select Shell command (any CLI). The platform then executes your command in the seeded repository. The Python function option is for Python-driven agents that call platform tools through proxy endpoints and is not a coding harness.
Run command
In Run command, define how the platform launches the CLI. The task brief is available in PIPELINES_TASK_FILE. A standard invocation pattern is:
claude -p "$(cat $PIPELINES_TASK_FILE)"
Preset chips fill in a working command for Claude Code, Codex, Cursor, and Aider. For recognized CLIs, the platform appends headless and approval flags unless you specify your own. The command can also write an optional JSON result to PIPELINES_RESULT_PATH.
The full task input, not just the brief, is available at PIPELINES_TASK_INPUT_FILE. The JSON result at PIPELINES_RESULT_PATH carries final_response and optional metadata; if the command writes no non-empty final_response, the platform injects one from the captured command output. A nonzero exit with an empty diff fails as a command error, while a nonzero exit with a non-empty diff is left for the scorer or judge to grade.
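The exit-code and fallback rules above can be read as two small decisions. The sketch below is an illustration of that reading, not the platform's implementation: `classify_run` and `resolve_final_response` are hypothetical names, and the labels they return are invented for the example. The text does not say what happens on a zero exit with an empty diff, so the sketch assumes every case other than "nonzero exit, empty diff" is left to grading.

```python
from typing import Optional


def classify_run(exit_code: int, diff: str) -> str:
    """Return a hypothetical label for how a finished command is treated."""
    has_diff = bool(diff.strip())
    if exit_code != 0 and not has_diff:
        # Nonzero exit and nothing changed: fails as a command error.
        return "command_error"
    # Assumption: everything else is handed to the scorer or judge.
    return "graded"


def resolve_final_response(result: Optional[dict], captured_output: str) -> str:
    """Prefer a non-empty final_response, else fall back to captured output."""
    final = (result or {}).get("final_response")
    if isinstance(final, str) and final.strip():
        return final
    return captured_output
```

A nonzero exit that still changed files returns `graded` here, which matches the rule that such runs are left for the scorer or judge.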
If the command does not read the task file, the agent does not receive the brief. Invoke the CLI binary directly. Wrapping the full command in bash -c is not recognized and disables rich trajectory and harness customization.
Docker image for the CLI
The base sandbox image does not include coding CLIs. Install your CLI under Sandbox environment (advanced), Base image, Custom Dockerfile. Write only the Dockerfile body. The platform prepends FROM pipelines-workspace-base automatically. Example:
RUN sudo npm install -g @anthropic-ai/claude-code
| CLI | Install line |
|---|---|
| Claude Code | RUN sudo npm install -g @anthropic-ai/claude-code |
| Codex | RUN sudo npm install -g @openai/codex |
| Cursor | RUN curl https://cursor.com/install -fsSL \| bash |
| Aider | RUN uv tool install --python 3.11 aider-chat |
Leave Build image now enabled so the managed build starts after agent creation. Wait for the Custom image chip on the agent detail page to reach Ready before dispatch. See Sandbox agent reference → Sandbox environment for the build rules.
Credential for the CLI
The CLI requires a model-provider key. Add it under Sandbox environment (advanced) as an Environment variable and choose From credential so the key is decrypted only at dispatch and is not stored in plaintext configuration.
| CLI | Variable |
|---|---|
| Claude Code | ANTHROPIC_API_KEY |
| Codex | OPENAI_API_KEY (aliased to CODEX_API_KEY) |
| Cursor | CURSOR_API_KEY |
| Aider | Provider key used by the model selected in --model, typically OPENAI_API_KEY or ANTHROPIC_API_KEY |
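A missing key usually shows up only after the CLI starts, so a small check at the start of a smoke task can save a run. The sketch below is a minimal example under stated assumptions: the variable names come from the table above, while the short CLI labels and the `has_credential` helper are hypothetical. For Aider it only checks that one of the two typical keys is present, which is weaker than checking the key for the model selected in --model.

```python
import os
from typing import Mapping, Optional

# Variable names are taken from the table above; the dictionary keys are
# hypothetical labels chosen for this example.
REQUIRED_KEYS = {
    "claude-code": ("ANTHROPIC_API_KEY",),
    "codex": ("OPENAI_API_KEY",),
    "cursor": ("CURSOR_API_KEY",),
    # Aider needs the key for whichever provider --model selects.
    "aider": ("OPENAI_API_KEY", "ANTHROPIC_API_KEY"),
}


def has_credential(cli: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if at least one expected key for `cli` is set and non-empty."""
    env = os.environ if env is None else env
    return any(env.get(name) for name in REQUIRED_KEYS[cli])
```

Calling `has_credential("codex")` with no second argument reads the process environment, which is where a value mapped From credential would be expected to appear at dispatch.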
Save and publish the agent. Draft agents cannot be dispatched.
Point it at a coding scenario
An in_sandbox agent always requires a workspace, so it must run against a coding scenario that defines the repository seed, optional setup, and scorer configuration. Create the scenario, then either map the agent field seed columns to it or attach it inline at seed time. A task without a workspace seed fails with in_sandbox_requires_workspace.
What the platform sets up automatically
For recognized CLIs, the platform handles runtime wiring automatically:
- Seeds the workspace at /home/user/workspace, runs scenario setup, and creates a baseline commit for stable diffing.
- Appends headless and approval flags for non-interactive execution.
- Injects harness customization, including system prompt additions, MCP servers, Claude subagents, and file overlays, through each CLI's native channel. See Harness customization.
- Wires platform tools to the CLI through the pipelines MCP shim when tools are attached, except for Aider, which does not support MCP.
Supported CLIs at a glance
| CLI | Notes |
|---|---|
| Claude Code | Richest trajectory coverage. Only CLI with token and cost extraction plus Claude subagent support. |
| Codex | Captures exec_command shell steps, apply-patch edits, and reasoning. MCP servers must support streamable HTTP. |
| Cursor | Captures read, search, and edit steps with diffs and reasoning. Requires --force for direct edits. |
| Aider | Rewrites full files, so each change appears as file-write operations. No MCP support. |
Other CLIs can run, but trajectory capture degrades to a coarse command-level view, and harness customization and platform tools are skipped.
What you get after a run
Open the task and inspect the Agent Trace tab. See Inspecting runs for the full UI.
- Trajectory timeline: ordered Shell, Edit, Read/Search, and Assistant steps, with command output, exit status, and file diffs. Trajectory is rendered after run completion.
- Final diff: net repository changes against baseline.
- Scorer badges: pass, fail, or not-applicable per mechanical scorer.
- Judge verdict: pass or fail with rubric breakdown and reasoning.
Scorers and judge evaluation are independent gates. See Scorers and grading.
Trajectory vs platform tool calls
The platform records a coding run through two independent channels:
| Channel | What it shows | Backed by |
|---|---|---|
| Harness trajectory | The CLI transcript: assistant text, shell commands, file edits, and command output. | Harness logs, or coarse fallback capture. |
| Platform tool calls | Calls routed through platform tools, shown in the Tool Calls summary. | agent_tool_calls rows. |
A coding CLI can produce a rich trajectory while Tool Calls stays at zero. That means the harness ran but never called the platform tool proxy; native CLI edits do not count as platform tool calls.
To expose platform workspace tools to a CLI harness:
- Add the workspace tool definitions to the agent's tools schema.
- Enable the platform shim with harness_config.register_platform_shim. See Harness customization.
- Use a harness that supports MCP: Claude Code, Codex, or Cursor. Aider has no MCP support.
- Confirm the agent actually calls the tools during a smoke run.
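A quick local check can catch malformed tool definitions before you import them. The sketch below is a minimal example, assuming each definition carries name, input_schema, and default_execution_mode keys as in the workspace tool schema on this page. `check_tools` is a hypothetical helper, not a platform API, and treating any default_execution_mode other than workspace as a problem is an assumption made for the example.

```python
import json
from typing import List

# The seven workspace tool names listed on this page.
WORKSPACE_TOOLS = {"read_file", "write_file", "edit", "run_bash", "grep", "glob", "ls"}


def check_tools(tools_json: str) -> List[str]:
    """Return a list of human-readable problems found in a tools JSON array."""
    problems = []
    seen = set()
    for tool in json.loads(tools_json):
        name = tool.get("name", "<unnamed>")
        seen.add(name)
        schema = tool.get("input_schema", {})
        if schema.get("type") != "object":
            problems.append(f"{name}: input_schema.type should be 'object'")
        properties = schema.get("properties", {})
        for key in schema.get("required", []):
            # A required key with no matching property cannot be satisfied.
            if key not in properties:
                problems.append(f"{name}: required field '{key}' is not in properties")
        # Assumption: workspace tools should all run in workspace mode.
        if tool.get("default_execution_mode") != "workspace":
            problems.append(f"{name}: default_execution_mode is not 'workspace'")
    for missing in sorted(WORKSPACE_TOOLS - seen):
        problems.append(f"missing workspace tool: {missing}")
    return problems
```

An empty list means the array has the expected shape. It does not prove the agent will call the tools, so the smoke run is still needed.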
Do not set config.workspace_tools: true for in_sandbox shell agents. That flag is for proxy-topology agents and is rejected for in_sandbox. Shell agents receive declared tools through the MCP shim.
The workspace tool set is read_file, write_file, edit, run_bash, grep, glob, and ls. Paste the canonical schema through Import JSON on the Tools step:
[
{
"name": "read_file",
"description": "Read a file from the workspace. Optionally offset/limit lines.",
"input_schema": {
"type": "object",
"properties": {
"path": { "type": "string" },
"offset": { "type": "integer" },
"limit": { "type": "integer" }
},
"required": ["path"]
},
"default_execution_mode": "workspace"
},
{
"name": "write_file",
"description": "Create or overwrite a file with the given content.",
"input_schema": {
"type": "object",
"properties": {
"path": { "type": "string" },
"content": { "type": "string" }
},
"required": ["path", "content"]
},
"default_execution_mode": "workspace"
},
{
"name": "edit",
"description": "Replace the first occurrence of `old` with `new` in a file.",
"input_schema": {
"type": "object",
"properties": {
"path": { "type": "string" },
"old": { "type": "string" },
"new": { "type": "string" }
},
"required": ["path", "old", "new"]
},
"default_execution_mode": "workspace"
},
{
"name": "run_bash",
"description": "Run a shell command in the workspace root and return stdout/stderr/exit code.",
"input_schema": {
"type": "object",
"properties": {
"command": { "type": "string" },
"timeout": { "type": "integer" }
},
"required": ["command"]
},
"default_execution_mode": "workspace"
},
{
"name": "grep",
"description": "Search file contents for a regex pattern.",
"input_schema": {
"type": "object",
"properties": {
"pattern": { "type": "string" },
"path": { "type": "string" }
},
"required": ["pattern"]
},
"default_execution_mode": "workspace"
},
{
"name": "glob",
"description": "List paths matching a glob pattern.",
"input_schema": {
"type": "object",
"properties": {
"pattern": { "type": "string" }
},
"required": ["pattern"]
},
"default_execution_mode": "workspace"
},
{
"name": "ls",
"description": "List directory contents.",
"input_schema": {
"type": "object",
"properties": {
"path": { "type": "string" }
},
"required": []
},
"default_execution_mode": "workspace"
}
]
Cost reporting
Do not read $0.0000 as proof a run was free. The cost column reports the
platform cost the run recorded, including simulator and judge cost where
applicable. A CLI's own model spend is only visible when the platform can extract
usage from the harness transcript, or when your command
writes supported result metadata.
For wrappers that can measure their own usage, write a JSON result to PIPELINES_RESULT_PATH:
{
"final_response": "Implemented the requested change.",
"metadata": {
"model": "provider/model-name",
"total_input_tokens": 123,
"total_output_tokens": 456,
"agent_runtime_ms": 4200,
"agent_cost_usd": 0.0123
}
}
Capture usage from the CLI or SDK you run, compute the price in the wrapper, and
write the metadata before exiting. If the CLI exposes no machine-readable usage,
treat agent-side cost as unavailable rather than as $0.0000.
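A wrapper that already knows its token counts can build this result in a few lines. The sketch below is one way to do it, under stated assumptions: PIPELINES_RESULT_PATH is read as an environment variable holding the output path, the per-million-token prices are placeholders you supply yourself, and `build_result` and `write_result` are hypothetical helper names. The metadata keys mirror the JSON example above.

```python
import json
import os
from typing import Optional


def build_result(final_response: str, model: str, input_tokens: int,
                 output_tokens: int, runtime_ms: int,
                 usd_per_million_input: float,
                 usd_per_million_output: float) -> dict:
    """Assemble the result payload and compute agent cost from token counts."""
    cost = (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000
    return {
        "final_response": final_response,
        "metadata": {
            "model": model,
            "total_input_tokens": input_tokens,
            "total_output_tokens": output_tokens,
            "agent_runtime_ms": runtime_ms,
            "agent_cost_usd": round(cost, 6),
        },
    }


def write_result(result: dict, path: Optional[str] = None) -> str:
    """Write the payload as JSON; defaults to the path in PIPELINES_RESULT_PATH."""
    # Assumption: the platform exposes the result path as an environment variable.
    path = path or os.environ["PIPELINES_RESULT_PATH"]
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(result, fh)
    return path
```

Call `write_result(build_result(...))` as the last step before the wrapper exits, and only when the token counts are real measurements rather than estimates.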
Image-build gotchas
- The build runs as a normal user. Global installs such as npm install -g, system pip install, and apt-get require sudo in RUN lines.
- The base image uses Python 3.13, and some Python CLIs pin dependencies that are incompatible with it. Install those tools in an isolated environment on a compatible interpreter, for example with uv tool install --python 3.11, instead of system pip. The binaries are installed under ~/.local/bin, so reference the full binary path in the run command, for example /home/user/.local/bin/aider.
Preflight checklist
Before running a paid matrix, prove the setup with one small task:
- The Dockerfile installs the CLI your run command invokes.
- The run command calls a recognized CLI binary directly, with no bash -c wrapper.
- Credentials are mapped From credential, not pasted as literal values.
- The task is seeded with a coding scenario (a workspace repository).
- Workspace tools and the platform shim are configured if you expect Tool Calls to be nonzero.
- The scorer or judge validates the actual changed files.
- Cost metadata is either captured or knowingly unavailable for that CLI.
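Two of the checks above, direct invocation and reading the task file, can be approximated with a local lint before you save the agent. The sketch below is a heuristic only and does not reproduce how the platform recognizes CLIs. `lint_run_command` is a hypothetical helper, and the binary names `codex` and `cursor-agent` are assumptions; only `claude` and `aider` appear on this page.

```python
import shlex
from typing import List

# Assumed binary names; adjust to match what your Dockerfile installs.
RECOGNIZED_BINARIES = {"claude", "codex", "cursor-agent", "aider"}
SHELL_WRAPPERS = {"bash", "sh", "zsh"}
TASK_VARS = ("PIPELINES_TASK_FILE", "PIPELINES_TASK_INPUT_FILE")


def lint_run_command(command: str) -> List[str]:
    """Return warnings for a run command; an empty list means no issue found."""
    tokens = shlex.split(command)
    if not tokens:
        return ["run command is empty"]
    warnings = []
    # Compare on the file name so full paths such as ~/.local/bin/aider pass.
    binary = tokens[0].rsplit("/", 1)[-1]
    if binary in SHELL_WRAPPERS and "-c" in tokens[1:]:
        warnings.append("command is wrapped in a shell -c call")
    elif binary not in RECOGNIZED_BINARIES:
        warnings.append(f"'{binary}' is not in the assumed list of recognized CLIs")
    if not any(var in command for var in TASK_VARS):
        warnings.append("command never reads the task file")
    return warnings
```

The standard invocation `claude -p "$(cat $PIPELINES_TASK_FILE)"` produces no warnings. A clean lint is not a substitute for the one-task smoke run.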
If it goes wrong
- FAILED with agent_model_unresolved: no judge model resolved. Select a model in the agent field Models popover, or set an organization default in Settings, Models.
- FAILED with agent_command_failed: the command exited nonzero and produced an empty diff, commonly because an interactive prompt blocked the run. Use a recognized binary so headless flags are appended automatically, or pass the flags explicitly.
- FAILED with in_sandbox_requires_workspace: no coding scenario was attached. Seed the task with a coding scenario.
- Empty trajectory with non-empty diff: either the run command was not recognized (for example, the full command was wrapped in bash -c) and fell back to coarse capture, or the CLI crashed before transcript extraction. Invoke the CLI binary directly.
- Tool Calls is 0 with a non-empty trajectory: the harness ran but never called platform tools. Attach workspace tools and enable the platform shim, or accept trajectory-only observability. See Trajectory vs platform tool calls.
- Cost shows $0.0000 for a real LLM run: agent-side usage was not extracted. Use Claude Code for token extraction, or write result metadata. See Cost reporting.