Trace capture

ac7’s trace capture is a first-class feature for replacing custom agent orchestrations whose main value prop is built-in logging. The idea: let directors see what the LLM actually said and what tools it actually called, scoped to each objective, without embedding observability hooks into the agent itself.

The runner sits between the agent process and the network, intercepting the agent’s traffic at the TLS layer via a loopback MITM TLS proxy with a per-session local CA. Every HTTPS request the agent makes is transparently decrypted by the proxy, observed as plaintext, re-encrypted toward the real upstream, and passed through. From the upstream’s point of view we are a normal TLS client doing standard SNI + cert validation — it can’t tell us apart from any other user-agent, which means OAuth flows, token refreshes, streaming responses, and SSE all work identically.

Zero external tools. No tshark. No pcap. No SSLKEYLOGFILE shenanigans. Just Node’s built-in crypto + tls + a small amount of node-forge for cert signing.

The proxy works for both runners (ac7 claude-code and ac7 codex), with a small env-var translation on the codex side to satisfy reqwest’s expectations instead of Node’s. What’s captured is the same.

Setup

# Verify everything's in place before your first run.
ac7 claude-code --doctor

The --doctor command runs five checks:

Check                                      Status
claude binary on PATH (or $CLAUDE_PATH)    FAIL if missing
$TMPDIR writable at 0o600                  FAIL if not
Loopback proxy bindable on 127.0.0.1:0     FAIL on networking issues
Trace CA + leaf cert generation            FAIL on crypto runtime issues
TLS validation posture                     WARN if NODE_TLS_REJECT_UNAUTHORIZED=0 is set in the current env

Exit code is 0 if no checks failed, 1 otherwise. WARNs proceed; only FAILs abort ac7 claude-code startup.

There’s no equivalent --doctor for ac7 codex today; the same underlying checks apply (the trace pipeline is shared) so running ac7 claude-code --doctor validates the codex prerequisites too.

What the runner does at startup (claude-code)

When you run ac7 claude-code (without --no-trace), the runner:

  1. Generates a fresh per-session local CA. One CA keypair plus one shared leaf keypair, both held in memory. The CA cert (public half only) is written to $TMPDIR/ac7-trace-ca-<pid>-<nonce>.pem at 0o600. The CA’s private key never touches disk. (A sketch of this step follows the list.)

  2. Starts a loopback HTTP CONNECT proxy on a random ephemeral port. The proxy is configured with the CA’s cert pool so it can mint leaf certs on demand for any hostname the agent asks for.

  3. Starts the activity uploader. Seeds objective_open events for every objective currently assigned to the member (from the initial briefing) and begins streaming any activity events to POST /members/:name/activity.

  4. Backs up .mcp.json to a pid-scoped tmp directory and atomically writes a new one with an ac7 entry pointing at ac7 mcp-bridge.

  5. Auto-injects three claude flags: --dangerously-skip-permissions, --dangerously-load-development-channels server:ac7, and --append-system-prompt <briefing>. (Each can be suppressed by passing it yourself.)

  6. Spawns claude with these env vars merged in:

    HTTPS_PROXY=http://127.0.0.1:<port>
    HTTP_PROXY=http://127.0.0.1:<port>
    ALL_PROXY=http://127.0.0.1:<port>
    NO_PROXY=localhost,127.0.0.1,::1,<caller's value>
    NODE_USE_ENV_PROXY=1
    NODE_EXTRA_CA_CERTS=$TMPDIR/ac7-trace-ca-<pid>-<nonce>.pem
    NODE_OPTIONS=<existing> --loader <ac7 ssl-keylog loader>
    SSLKEYLOGFILE=<runner-managed path>
    AC7_RUNNER_SOCKET=/tmp/.ac7-runner-<pid>.sock
  7. Waits for claude to exit. On any exit path (normal, SIGINT, SIGTERM, uncaughtException), restores the original .mcp.json, deletes the CA cert PEM, closes the proxy relay, and unlinks the IPC socket.
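
Step 1’s CA generation looks roughly like this, as a minimal sketch using node-forge (which the intro names for cert signing); the helper name and validity window are illustrative, not ac7’s actual code:

import { pki, md } from "node-forge";
import { randomBytes } from "node:crypto";
import { writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Generate a throwaway CA keypair and self-signed CA cert, held only in memory.
function makeSessionCa() {
  const keys = pki.rsa.generateKeyPair(2048);
  const cert = pki.createCertificate();
  cert.publicKey = keys.publicKey;
  cert.serialNumber = "01";
  cert.validity.notBefore = new Date();
  cert.validity.notAfter = new Date(Date.now() + 24 * 60 * 60 * 1000); // illustrative: one session's worth
  const attrs = [{ name: "commonName", value: "ac7 trace CA (per-session)" }];
  cert.setSubject(attrs);
  cert.setIssuer(attrs); // self-signed
  cert.setExtensions([{ name: "basicConstraints", cA: true }]);
  cert.sign(keys.privateKey, md.sha256.create());

  // Only the public cert touches disk, at 0o600; the private key stays in memory.
  const nonce = randomBytes(6).toString("hex");
  const pemPath = join(tmpdir(), `ac7-trace-ca-${process.pid}-${nonce}.pem`);
  writeFileSync(pemPath, pki.certificateToPem(cert), { mode: 0o600 });
  return { keys, cert, pemPath };
}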

What the runner does at startup (codex)

The trace pipeline is the same — same proxy, same CA, same parser, same uploader. What changes are the env vars and the absence of a .mcp.json rewrite (codex reads its MCP config from the ephemeral CODEX_HOME we create instead).

HTTPS_PROXY=http://127.0.0.1:<port>
HTTP_PROXY=http://127.0.0.1:<port>
ALL_PROXY=http://127.0.0.1:<port>
NO_PROXY=localhost,127.0.0.1,::1,<caller's value>
CODEX_CA_CERTIFICATE=$TMPDIR/ac7-trace-ca-<pid>-<nonce>.pem
SSL_CERT_FILE=$TMPDIR/ac7-trace-ca-<pid>-<nonce>.pem
CODEX_HOME=~/.cache/agentc7/codex/ac7-codex-<random>/

CODEX_CA_CERTIFICATE is codex’s reqwest-style canonical knob; SSL_CERT_FILE is a fallback. NODE_EXTRA_CA_CERTS and NODE_USE_ENV_PROXY are deleted from the inherited env — they’re Node-only and would confuse reqwest.
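
The translation amounts to swapping the Node-only knobs for reqwest’s equivalents. A sketch, taking the claude-code proxy env as input (the function name is illustrative):

// Adapt the claude-code proxy env for codex's reqwest-based HTTP stack.
function codexEnv(base: NodeJS.ProcessEnv, caPemPath: string, codexHome: string): NodeJS.ProcessEnv {
  const env = { ...base };
  delete env.NODE_EXTRA_CA_CERTS;       // Node-only; would confuse reqwest
  delete env.NODE_USE_ENV_PROXY;        // Node-only
  env.CODEX_CA_CERTIFICATE = caPemPath; // codex's canonical CA knob
  env.SSL_CERT_FILE = caPemPath;        // fallback
  env.CODEX_HOME = codexHome;           // ephemeral, removed on exit
  return env;
}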

For codex specifics see runners/codex.

How the MITM works

When the agent issues CONNECT api.anthropic.com:443 through the proxy:

      agent                proxy               upstream
        │                    │                    │
        │ CONNECT host:443   │                    │
        │───────────────────>│                    │
        │                    │ TLS handshake      │
        │                    │───────────────────>│
        │                    │ (standard SNI +    │
        │                    │  cert validation)  │
        │                    │<───────────────────│
        │ 200 Established    │                    │
        │<───────────────────│                    │
        │ ClientHello        │                    │
        │───────────────────>│                    │
        │ [proxy issues leaf │                    │
        │  cert for host,    │                    │
        │  signs with CA,    │                    │
        │  wraps socket in   │                    │
        │  TLSSocket server] │                    │
        │ ServerHello...     │                    │
        │<───────────────────│                    │
        │ plain HTTP req ──> │ encrypted req ──>  │
        │ plain HTTP rsp <── │ encrypted rsp <──  │

Two independent TLS sessions. The agent talks to us over TLS (trusting our CA via the runner-injected env var); we talk to the upstream over TLS with the upstream’s real cert. In between we have plaintext in both directions.
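
A stripped-down sketch of that relay in Node terms, assuming a mintLeafCert helper that signs a per-host leaf with the session CA (hypothetical; ac7’s real proxy handles errors, keep-alive, and feeding the reassembler, all elided here):

import net from "node:net";
import tls from "node:tls";

function startConnectProxy(mintLeafCert: (host: string) => { key: string; cert: string }) {
  const server = net.createServer((client) => {
    client.once("data", (chunk) => {
      // Parse "CONNECT host:443 HTTP/1.1"
      const [, target] = chunk.toString().split(" ");
      const [host, port] = target.split(":");

      // Outer leg: ordinary TLS client to the real upstream (standard SNI + cert validation).
      const upstream = tls.connect({ host, port: Number(port), servername: host }, () => {
        client.write("HTTP/1.1 200 Connection Established\r\n\r\n");

        // Inner leg: answer the agent's ClientHello as a TLS *server*, presenting
        // a leaf cert minted for this hostname and signed by the session CA.
        const { key, cert } = mintLeafCert(host);
        const inner = new tls.TLSSocket(client, {
          isServer: true,
          secureContext: tls.createSecureContext({ key, cert }),
        });

        // Between the two legs we see plaintext in both directions.
        inner.on("data", (buf) => { /* feed Http1Reassembler */ upstream.write(buf); });
        upstream.on("data", (buf) => { /* feed Http1Reassembler */ inner.write(buf); });
      });
    });
  });
  server.listen(0, "127.0.0.1"); // loopback only, random ephemeral port
  return server;
}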

The streaming activity model

There are no per-objective spans. The runner maintains one activity stream per member — an append-only timeline of everything observed:

  • llm_exchange — an Anthropic API request/response pair, parsed into a typed entry
  • opaque_http — every other HTTP exchange, with headers + body previews
  • objective_open — the member just took ownership of an objective
  • objective_close — the member just released it

Per-objective “traces” are a time-range view over this stream: the web UI queries GET /members/<assignee>/activity?from=<open>&to=<close>&kind=llm_exchange to pull the LLM calls made during an objective’s lifetime, rather than loading a separately-stored per-objective blob.
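
As a concrete example of that query, roughly what a client would issue per objective (brokerUrl, token, and the auth header are placeholders, not names confirmed by this page):

declare const brokerUrl: string, token: string, assignee: string;
declare const objective: { createdAt: string; completedAt?: string };

// Pull only the LLM exchanges made while the objective was open.
const params = new URLSearchParams({
  from: objective.createdAt,
  to: objective.completedAt ?? new Date().toISOString(),
  kind: "llm_exchange",
});
const res = await fetch(`${brokerUrl}/members/${assignee}/activity?${params}`, {
  headers: { authorization: `Bearer ${token}` }, // assumed; the real auth scheme isn't shown here
});
const exchanges = await res.json();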

Capture runs entirely live: as soon as the proxy finishes reassembling an HTTP/1.1 request/response pair, the runner parses it, extracts + redacts, wraps it as an event, and enqueues it for streaming upload. No per-span buffering, no memory accumulation over an objective lifetime, no big flush at span close.

For the data model see activity-and-traces.

The decode pipeline

For every HTTP/1.1 exchange the reassembler completes:

  1. Incremental parse via Http1Reassembler (reads plaintext chunks as they arrive from the MITM, keeps rolling buffers per TLS session, emits completed request/response pairs in FIFO order). Handles Content-Length, chunked, gzip / deflate / br.
  2. Extract Anthropic API shape via extractEntries (anthropic.ts). For POST /v1/messages on *.anthropic.com, parse into a typed AnthropicMessagesEntry with model, maxTokens, system, messages, tools, stopReason, and usage (input/output/cache_creation/cache_read tokens). Everything else becomes an OpaqueHttpEntry with headers + body previews.
  3. Redact secrets via redactJson (redact.ts), sketched just after this list:
    • Headers stripped: Authorization, x-api-key, cookie, set-cookie, x-anthropic-api-key, proxy-authorization.
    • Patterns scrubbed in string values: sk-ant-…, sk-… (length-checked to avoid false positives), AKIA…, ghp_…, xox[baprs]-…. Replaced with [REDACTED].
  4. Enqueue in the ActivityUploader — a batched streaming sender that flushes every 50 events OR 64 KB OR 500ms, whichever comes first. Failures retry with exponential backoff (200ms → 30s); the queue is hard-capped at 1000 events / 1 MB with oldest-first eviction under sustained broker unreachability.
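
The step-3 redaction, sketched in TypeScript (the header list and token prefixes come from the text above; the exact regexes and length checks are illustrative):

// Headers dropped wholesale before an entry leaves the runner.
const STRIPPED_HEADERS = new Set([
  "authorization", "x-api-key", "cookie", "set-cookie",
  "x-anthropic-api-key", "proxy-authorization",
]);

// Token-shaped substrings scrubbed inside string values.
const SECRET_PATTERNS: RegExp[] = [
  /sk-ant-[A-Za-z0-9_-]{20,}/g, // Anthropic keys
  /sk-[A-Za-z0-9]{32,}/g,       // generic sk- keys, length-checked to limit false positives
  /AKIA[A-Z0-9]{16}/g,          // AWS access key ids
  /ghp_[A-Za-z0-9]{36,}/g,      // GitHub personal access tokens
  /xox[baprs]-[A-Za-z0-9-]+/g,  // Slack tokens
];

// Walk a parsed JSON value and replace matches in every string leaf.
function redactJson(value: unknown): unknown {
  if (typeof value === "string") {
    return SECRET_PATTERNS.reduce((s, re) => s.replace(re, "[REDACTED]"), value);
  }
  if (Array.isArray(value)) return value.map(redactJson);
  if (value && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([k, v]) => [k, redactJson(v)]),
    );
  }
  return value;
}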

objective_open / objective_close markers are emitted by the runner whenever the objectives tracker’s open set changes — the diff adds opens for new ids and closes for ids that just left the set. They flow through the same uploader.
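
That diff is essentially set subtraction in both directions; a sketch (the event field names are illustrative):

// Compare the previous and current open-objective id sets and emit markers for the delta.
function diffOpenSet(prev: Set<string>, next: Set<string>) {
  const ts = new Date().toISOString();
  const events: { kind: "objective_open" | "objective_close"; objectiveId: string; ts: string }[] = [];
  for (const id of next) if (!prev.has(id)) events.push({ kind: "objective_open", objectiveId: id, ts });
  for (const id of prev) if (!next.has(id)) events.push({ kind: "objective_close", objectiveId: id, ts });
  return events; // enqueued on the same ActivityUploader as llm_exchange / opaque_http
}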

Viewing traces

Members with activity.read (and the assignee themselves) review captured traces in the web UI’s TracePanel on each objective’s detail page:

  • Queries GET /members/<assignee>/activity?from=<objective.createdAt>&to=<objective.completedAt ?? now>&kind=llm_exchange
  • Renders each returned LLM exchange with model name, token usage (in=150 out=42 cache_read=100 cache_creation=...), and message list
  • Expands Anthropic messages into text blocks + tool_use + tool_result entries inline

The panel is gated server-side: GET /members/:name/activity requires activity.read (or self). The client-side gate (the TracePanel only mounts when the briefing carries activity.read) is a UX optimization; the server is the real boundary.

Security posture

Trace capture inherently reveals secrets the agent used during the work. ac7 mitigates this with defense in depth:

  1. MITM is loopback-only and session-scoped. The proxy binds only to 127.0.0.1 on a random ephemeral port. The CA is generated fresh per runner process; its cert is written with 0o600; its private key never touches disk.
  2. Redaction at parse time. Secrets are replaced with [REDACTED] before entries leave the runner. The server never sees the plaintext token.
  3. Permission-gated view. Only members with activity.read (or the captured member themselves) can read the activity stream. Watchers, originators, and assignees of OTHER members’ objectives all get 403 on the GET endpoint.
  4. CA cert deleted on runner exit. The cert PEM is unlinked on every exit path (normal, SIGINT, SIGTERM, uncaughtException).
  5. .mcp.json restored on every exit (claude-code only) — the original is backed up and restored idempotently.
  6. Ephemeral CODEX_HOME removed on exit (codex only) — the entire temp directory is rm -rf’d, including the symlink to the user’s ~/.codex/auth.json (the symlink is removed; the real file isn’t).
  7. Upload is best-effort. If the upload fails, the runner logs and moves on. It does NOT retry past the queue cap, and it does NOT persist the trace to disk.

Opting out

Both runners support --no-trace:

ac7 claude-code --no-trace
ac7 codex --no-trace

This disables the entire trace subsystem: no proxy relay, no CA generation, no env var injection, no busy reporter (the busy signal needs MITM-captured traffic to drive it). The runner still handles the briefing, SSE forwarder, objectives, and bridge IPC normally.

Use --no-trace when:

  • You’re debugging the runner / bridge plumbing and don’t want extra moving parts.
  • The agent doesn’t honor HTTPS_PROXY, so the trace machinery adds moving parts without capturing anything.
  • You’re piping through a network-layer proxy that already captures traffic.

Storage planning

Activity rows are the heaviest write path in the broker. Realistic numbers, per agent and in aggregate:

  • Event volume: up to 50 events per batch, flushed every 500ms / 64 KB / full batch (whichever first).
  • Payload size: an llm_exchange row is typically 10–100 KB of JSON — model, messages, tool_use / tool_result blocks, usage stats. opaque_http rows are smaller.
  • Aggregate: 10 LLM calls/min × 50 KB × 24h × 10 concurrent agents ≈ 7 GB/day per team.
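
As a rough check of that aggregate (plain arithmetic, using the 50 KB average from above):

// 10 calls/min × 1440 min/day × 50 000 bytes × 10 agents
const bytesPerDay = 10 * 1440 * 50_000 * 10; // 7_200_000_000 ≈ 7.2 GB/day for the team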

Two operational controls keep that bounded:

Dedicated activity database

The activity store runs on its own SQLite file — separate from the main broker DB. Default location: <dbPath>-activity.db (e.g. ./ac7.db → ./ac7-activity.db). Override via AC7_ACTIVITY_DB_PATH.

Why two DBs: trace writes are bursty and heavy. Keeping them off the main broker’s single writer lock ensures a burst doesn’t stall chat / objective / auth / session writes. Both DBs use WAL + busy_timeout=5000 + wal_autocheckpoint=1000.
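
Those pragmas map to something like the following; a sketch assuming better-sqlite3 as the driver (not confirmed by this page) and the default path rule described above:

import Database from "better-sqlite3";

const dbPath = "./ac7.db"; // example main broker DB path
const activityPath =
  process.env.AC7_ACTIVITY_DB_PATH ?? dbPath.replace(/\.db$/, "") + "-activity.db"; // ./ac7.db → ./ac7-activity.db

// Separate file so bursty trace writes never contend with the main DB's single writer.
const activityDb = new Database(activityPath);
activityDb.pragma("journal_mode = WAL");        // readers proceed during heavy writes
activityDb.pragma("busy_timeout = 5000");       // wait up to 5s on a locked writer
activityDb.pragma("wal_autocheckpoint = 1000"); // checkpoint roughly every 1000 WAL pages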

Retention with ac7 prune-traces

ac7 prune-traces --older-than 30d

Deletes every activity row with event.ts older than the cutoff. Prompts with the activity DB path + cutoff timestamp before running unless --yes is passed. Non-TTY runs without --yes refuse rather than silently destroying data.

Accepted duration shapes: 30d, 7d, 24h, 60m, 3600s, 500ms. Case-insensitive.
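
A sketch of a parser for those shapes (hypothetical; ac7’s actual implementation isn’t shown here):

const UNIT_MS: Record<string, number> = {
  ms: 1, s: 1_000, m: 60_000, h: 3_600_000, d: 86_400_000,
};

// "30d" → 2_592_000_000 ms, "500ms" → 500 ms; case-insensitive, rejects anything else.
function parseDuration(input: string): number {
  const match = /^(\d+)\s*(ms|s|m|h|d)$/i.exec(input.trim());
  if (!match) throw new Error(`invalid duration: ${input}`);
  const [, amount, unit] = match;
  return Number(amount) * UNIT_MS[unit.toLowerCase()];
}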

Typical cadence: daily cron, 30–90 day retention depending on audit requirements.

Limitations (v1)

  • HTTP/1.1 only. HTTP/2 agents (which negotiate h2 via ALPN) produce no llm_exchange events. Adding an HPACK-aware parser is a follow-up. In practice the Anthropic SDK defaults to HTTP/1.1 for /v1/messages, so this is rarely hit.
  • Anthropic parser only. Other LLM providers (OpenAI, Gemini, Mistral) land as opaque_http. Codex traces today fall in this bucket — adding a typed OpenAI parser so codex traces render the same way claude-code traces do is a follow-up.
  • Uploader queue cap. The uploader caps in-flight at 1000 events / 1 MB and evicts oldest-first under sustained broker unreachability. Events dropped here won’t appear in the UI.
  • Cert pinning. If an agent ships bundled cert pins for the upstream’s real cert chain, our MITM leaf won’t match and the handshake will fail. Claude Code v2 does not currently pin; if that changes we’d need to intercept at a different layer.