
AgentBox - SDK to run coding agents in any sandbox

One SDK to run Claude Code, Codex, or OpenCode inside Docker, E2B, Modal, Daytona, or Vercel sandboxes - boots each agent's native server (JSON-RPC, HTTP/SSE) instead of using non-interactive --print mode.


If you've ever tried to "run Claude Code in a sandbox" and ended up shelling out to claude --print in non-interactive mode, you've felt the limit AgentBox is built around. Non-interactive mode strips most of what makes the agent useful: approval flows, tool-use control, streaming events. AgentBox does it differently - it boots each agent's native server inside the sandbox and talks to it over WebSocket or HTTP. Full interactive capabilities, intact.

The other half of the value: agent and sandbox are both pluggable. Swap providers and your application code stays the same.

The minimal example

import { Agent, Sandbox } from "agentbox-sdk";

const sandbox = new Sandbox("local-docker", {
  workingDir: "/workspace",
  image: process.env.IMAGE_ID!,
  env: { ANTHROPIC_API_KEY: process.env.ANTHROPIC_API_KEY! },
});

await sandbox.findOrProvision();

const run = new Agent("claude-code", {
  sandbox,
  cwd: "/workspace",
  approvalMode: "auto",
}).stream({
  model: "sonnet",
  input: "Create a hello world Express server in /workspace/server.ts",
});

for await (const event of run) {
  if (event.type === "text.delta") process.stdout.write(event.delta);
}

await sandbox.delete();

That's the whole shape: construct, provision, run, stream, delete.

Install and image setup

npm install agentbox-sdk

Requires Node >= 20. The agent CLI you want to run (claude, opencode, codex) needs to be installed inside your sandbox image - AgentBox boots the CLI's server, so the binary has to be there.

For each sandbox provider, build a base image from one of the bundled presets:

npx agentbox image build --provider local-docker --preset browser-agent

The build prints the provider's native image reference - a Docker tag, Modal image ID, E2B template, or Daytona snapshot. Set it as IMAGE_ID.

Agents

Three providers, all running their CLI inside the sandbox:

Provider     | CLI      | Model format
------------ | -------- | ------------
claude-code  | claude   | sonnet, opus, haiku
opencode     | opencode | anthropic/claude-sonnet-4-6, openai/gpt-4.1, ...
codex        | codex    | gpt-5.3-codex, gpt-5.4

A reasoning level can be passed alongside model: low | medium | high | xhigh. AgentBox maps it to each provider's native control - Codex's effort on turn/start, Claude Code's --effort flag, OpenCode's reasoningEffort agent variant. xhigh requires a model that supports it (Opus 4.7+, Codex gpt-5.4).
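The mapping described above can be sketched as a small function. The field names here (turnStart.effort, cliFlags, agent.reasoningEffort) are illustrative re-statements of the prose, not AgentBox's actual internals:

```typescript
type ReasoningLevel = "low" | "medium" | "high" | "xhigh";
type Provider = "claude-code" | "opencode" | "codex";

// Sketch: one unified reasoning level fanned out to each provider's
// native control, per the description above. Shapes are assumptions.
function mapReasoning(provider: Provider, level: ReasoningLevel): object {
  switch (provider) {
    case "codex":
      // Codex: effort sent on the turn/start request
      return { turnStart: { effort: level } };
    case "claude-code":
      // Claude Code: passed through as the CLI's --effort flag
      return { cliFlags: ["--effort", level] };
    case "opencode":
      // OpenCode: set via the agent's reasoningEffort variant
      return { agent: { reasoningEffort: level } };
  }
}
```

The point of the pattern is that callers only ever pass one of four levels; the per-provider translation stays inside the SDK.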

Sandboxes

Five providers, same interface:

Provider     | What it is             | Auth
------------ | ---------------------- | ----
local-docker | Local Docker container | Docker daemon
e2b          | Cloud micro-VM         | E2B_API_KEY
modal        | Cloud container        | MODAL_TOKEN_ID + MODAL_TOKEN_SECRET
daytona      | Cloud dev environment  | DAYTONA_API_KEY
vercel       | Ephemeral cloud VM     | VERCEL_TOKEN + team + project

Every sandbox supports findOrProvision, run, runAsync, gitClone, uploadAndRun, openPort, getPreviewLink, snapshot, stop, delete.

The lifecycle quirk worth knowing: new Sandbox(...) only stores configuration. It doesn't create or attach to a real sandbox. You have to call findOrProvision() once before any operation that needs a live sandbox, including agent runs. Calling anything else first throws a clear error rather than silently lazy-creating - which makes the (potentially slow) attach/create step explicit.
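The contract is easy to see in a toy re-implementation: the constructor only stores config, and any operation before findOrProvision() throws rather than lazily creating the sandbox. This is an illustrative sketch, not AgentBox's actual code:

```typescript
// Toy version of the lifecycle contract described above.
class SandboxHandle {
  private live = false;
  constructor(private config: { image: string }) {
    // stores configuration only; nothing is created here
  }

  async findOrProvision(): Promise<void> {
    // attach to an existing sandbox or create one -
    // the potentially slow step, made explicit
    this.live = true;
  }

  async run(cmd: string): Promise<string> {
    if (!this.live) {
      throw new Error("Sandbox not provisioned: call findOrProvision() first");
    }
    return `ran: ${cmd}`;
  }
}
```

Calling run() on a fresh handle fails loudly; after one findOrProvision() it works. Agent runs sit behind the same gate.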

Vercel is the odd one out. Two specifics:

  • It uses runtime snapshots, not pre-built images. Call sandbox.snapshot() to capture state and pass the returned id via provider.snapshotId next run.
  • Ports must be declared at create time via provider.ports. openPort() is a no-op at runtime, so any port the agent or your code will listen on must be listed up front (e.g. opencode uses 4096; codex/claude-code use 43180).
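Put together, a Vercel config honoring both constraints might look like the following. The option names (provider.ports, provider.snapshotId) are assumptions taken from the prose above, not a confirmed API:

```typescript
// Sketch of a Vercel sandbox config: every port declared up front
// (openPort() is a no-op at runtime), state carried between runs via
// a runtime snapshot id returned by a previous sandbox.snapshot().
const vercelConfig = {
  provider: {
    // agent server ports: opencode on 4096, codex/claude-code on 43180
    ports: [4096, 43180],
    // id from the last sandbox.snapshot(); undefined on the first run
    snapshotId: process.env.VERCEL_SNAPSHOT_ID,
  },
  workingDir: "/workspace",
};
```

Forgetting a port here is the Vercel-specific failure mode: the agent boots, but nothing can reach it.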

Skills, sub-agents, MCPs, custom commands - all addressable

The SDK exposes the parts you'd otherwise reach into provider-specific config to set:

  • Skills - attach GitHub repos as agent skills (cloned into the sandbox), or embed inline with a SKILL.md string.
  • Sub-agents - declare named delegates with their own instructions and tool allowlist.
  • MCP servers - both local (spawn a process inside the sandbox) and remote (URL with SSE).
  • Custom commands - register slash commands (or $-prefixed for Codex) the agent can invoke.
  • Multimodal input - mix text, images, and PDFs (provider-dependent: opencode does text/images/files, claude-code does text/images/PDFs, codex does text/images).
  • Custom images - define your own image with a small .mjs config and npx agentbox image build --file ./my-image.mjs.
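As a rough picture of how these surfaces might sit together in one agent configuration, here is an illustrative object. Every option name below (skills, subAgents, mcpServers, commands) is an assumption for the sake of the sketch, not confirmed AgentBox API:

```typescript
// Hypothetical agent configuration touching each surface listed above.
const agentConfig = {
  skills: [
    // GitHub repo cloned into the sandbox as a skill (URL is a placeholder)
    { repo: "https://github.com/acme/review-skill" },
    // or embedded inline as a SKILL.md string
    { inline: "# SKILL.md\nAlways run the test suite before committing." },
  ],
  subAgents: {
    // named delegate with its own instructions and tool allowlist
    tester: { instructions: "Run and fix failing tests only.", tools: ["bash", "read"] },
  },
  mcpServers: {
    // local: a process spawned inside the sandbox
    local: { command: "npx", args: ["-y", "@example/mcp-server"] },
    // remote: a URL reached over SSE
    remote: { url: "https://mcp.example.com/sse" },
  },
  commands: {
    // invoked as /deploy (or $deploy under Codex)
    deploy: "Build the project and report the artifact path.",
  },
};
```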

Hooks

Each provider's native hook format is exposed - Claude Code's PostToolUse/PreToolUse hook config maps directly, OpenCode and Codex have their own equivalents. The SDK doesn't try to invent a unified hook abstraction; it forwards each provider's native shape.
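For Claude Code, the forwarded shape is the same one its settings.json uses. Below is that native hook format as a TypeScript object; how exactly it is handed to the Agent constructor is an assumption, but the inner schema (matcher plus a list of type: "command" hooks) is Claude Code's own:

```typescript
// Claude Code's native PostToolUse hook shape, forwarded as-is.
const claudeHooks = {
  PostToolUse: [
    {
      matcher: "Write|Edit", // fire after file-writing tools
      hooks: [{ type: "command", command: "npx prettier --write ." }],
    },
  ],
};
```

OpenCode and Codex would get their own native shapes in the same spirit; nothing is normalized across providers.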

When to reach for it

  • You're building a product on top of coding agents and need provider portability without rewriting your app code.
  • You want full interactive capabilities (approval flows, streaming, tool-use control) inside a sandbox.
  • You need to mix sandbox providers - dev on local-docker, staging on E2B, production on Modal - without forking application logic.

When not to

  • One-off scripts. The SDK is overkill if you just want to run an agent against a local repo.
  • Workflows where the non-interactive --print mode is genuinely sufficient. AgentBox's value is interactive parity; if you don't need it, you're paying complexity for nothing.

Trade-offs

The "boot the agent's server inside the sandbox" approach is the right call for capability but not for cold start - each new sandbox has to spin up the agent CLI inside it before you can issue the first turn. For high-volume, sub-second workloads, pair AgentBox with a sandbox provider that snapshots fast (CubeSandbox, E2B, or Modal) and reuse where possible.

The provider-portable surface is real but not absolute. Multimodal capability differs by provider; reasoning levels map differently; some hook shapes are provider-specific. Read the matrix before assuming a feature works everywhere.
