Near the end of my
last post about moving FFmpeg off
my primary Fly.io server, I wrote:
> I really wish I didn’t have to do the heartbeat dance. That would be a nice
> feature to just have built-in, Cloudflare friends!
Well. Twenty-four hours later, the heartbeat dance was gone.
That post explains the full story,
but here is the short version: I was running FFmpeg inline on my primary app
server and it saturated the CPU whenever I published a longer episode. The fix
was to move the job onto Cloudflare: a queue delivers the work to a Worker,
which forwards it to a Cloudflare Container, which does the FFmpeg stitching
and uploads the outputs to R2.
That solved the production problem. But the container design had a rough edge I
wasn’t happy about.
The container didn’t know it was done. When the FFmpeg job finished, the
container couldn’t tell Cloudflare “you can stop me now.” That control lives in
the Cloudflare Worker and Durable Object wrapper, so I had to wire up heartbeat
pings from inside the container process while FFmpeg was running, plus a
“shutdown if idle” endpoint the container called on job completion: it checked
whether any other jobs were active and, if not, told the controller to stop the
container. All of that coordination plumbing existed purely to manage a
lifecycle that shouldn’t need so much ceremony.
Cloudflare Sandboxes have a different model. You call `sandbox.exec()` to run a
command, wait for it to finish, and the sandbox is done. No heartbeats. No
shutdown signals. No idle checks.
The day I shipped the container migration, I handed a Cursor agent the keys to
a spike: migrate the audio pipeline from Cloudflare Containers to Cloudflare
Sandboxes.
That spike became PR #726.
It worked, and it deleted the heartbeat/shutdown plumbing entirely. But when I
looked at the code (yes, I do sometimes still read the code), it still felt like
a container system wearing a sandbox costume. A wolf in sheep’s clothing! 😆
The design in that PR was: a dedicated `call-kent-audio-sandbox` service with
its own Wrangler config, its own deploy workflow, and its own HTTP endpoint at
`/jobs/episode-audio`. The existing queue worker would POST a job to that
endpoint. The sandbox service would start a Node process inside the sandbox,
wait for a port, proxy the request to it, and run the job. The sandbox service
itself owned the callback logic and held the R2 credentials.
The heartbeat was gone, but the overall shape was the same: a long-lived service
sitting between the queue worker and the actual work.
So I closed that first attempt and started over with a new agent, which turned
into PR #729. It took a
different approach. Instead of building a new service around the sandbox, it
made the sandbox an implementation detail of the existing worker.
Here is the full flow:
The queue worker is now the orchestrator. It receives the message, sends a
`started` callback, creates short-lived presigned R2 URLs for both the inputs
and the outputs, runs a single `exec()` call in a fresh sandbox, and then sends
`completed` or `failed`. The sandbox runs one shell script, exits, and is
destroyed in a `finally` block. That’s it.
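The presigned-URL step is what keeps credentials out of the sandbox, and the
post doesn’t show that code. Under R2’s S3-compatible API it amounts to SigV4
query signing. Here is a rough, self-contained sketch of what that could look
like; every name and parameter here is my own, and in a real worker you would
more likely reach for a small signing library than hand-roll this:

```typescript
import { createHash, createHmac } from 'node:crypto'

// Rough sketch of SigV4 query presigning against R2's S3-compatible API.
// All names are hypothetical; the post's actual helper is not shown.
function presignR2GetUrl(opts: {
  accountId: string
  bucket: string
  key: string
  accessKeyId: string
  secretAccessKey: string
  expiresSeconds: number
}): string {
  const host = `${opts.accountId}.r2.cloudflarestorage.com`
  // e.g. 20260101T120000Z
  const amzDate =
    new Date().toISOString().replace(/[-:]/g, '').slice(0, 15) + 'Z'
  const dateStamp = amzDate.slice(0, 8)
  // R2 uses the "auto" region in the credential scope.
  const scope = `${dateStamp}/auto/s3/aws4_request`

  const params = new URLSearchParams({
    'X-Amz-Algorithm': 'AWS4-HMAC-SHA256',
    'X-Amz-Credential': `${opts.accessKeyId}/${scope}`,
    'X-Amz-Date': amzDate,
    'X-Amz-Expires': String(opts.expiresSeconds),
    'X-Amz-SignedHeaders': 'host',
  })
  params.sort()

  const canonicalUri = `/${opts.bucket}/${opts.key}`
  const canonicalRequest = [
    'GET',
    canonicalUri,
    params.toString(),
    `host:${host}\n`, // canonical headers, newline-terminated
    'host', // signed headers list
    'UNSIGNED-PAYLOAD',
  ].join('\n')

  const stringToSign = [
    'AWS4-HMAC-SHA256',
    amzDate,
    scope,
    createHash('sha256').update(canonicalRequest).digest('hex'),
  ].join('\n')

  // Derive the signing key via the HMAC chain over date, region, service.
  const hmac = (key: Buffer | string, data: string) =>
    createHmac('sha256', key).update(data).digest()
  const signingKey = hmac(
    hmac(hmac(hmac('AWS4' + opts.secretAccessKey, dateStamp), 'auto'), 's3'),
    'aws4_request',
  )
  const signature = createHmac('sha256', signingKey)
    .update(stringToSign)
    .digest('hex')

  return `https://${host}${canonicalUri}?${params.toString()}&X-Amz-Signature=${signature}`
}
```

Anyone holding that URL can fetch the object until `X-Amz-Expires` runs out,
which is exactly the property the sandbox needs: access without credentials.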
The key difference from PR #726 is where things live:
| | PR #726 (abandoned) | PR #729 (merged) |
|---|---|---|
| Sandbox invocation | Worker POSTs to sandbox service endpoint | Worker calls sandbox.exec() directly |
| Callback ownership | Sandbox service sends callbacks | Worker sends callbacks |
| R2 credentials | Passed into the sandbox | Kept in the worker; sandbox gets signed URLs only |
| Sandbox lifecycle | Long-lived service process, port-ready check | One-shot exec, destroy() in finally |
| Deploy surface | Separate service + separate workflow | Embedded in worker package |
The sandbox image is correspondingly tiny:
```dockerfile
FROM docker.io/cloudflare/sandbox:0.7.16

RUN apt-get update \
  && apt-get install -y --no-install-recommends ffmpeg \
  && rm -rf /var/lib/apt/lists/*

WORKDIR /opt/call-kent-audio

COPY assets ./assets
COPY sandbox/call-kent-audio-cli.sh /usr/local/bin/call-kent-audio-cli
RUN chmod +x /usr/local/bin/call-kent-audio-cli
```
The Cloudflare Sandbox base image provides the runtime. I add FFmpeg, copy the
bumper audio assets, and copy one shell script. That shell script downloads the
input audio files from presigned URLs, runs the FFmpeg stitching pipeline,
uploads the three output files to presigned upload URLs, and prints JSON to
stdout with the output file sizes. Then it exits. Nothing inside the sandbox
needs credentials, secrets, or any knowledge of the broader system.
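On the worker side, that JSON on stdout has to be recovered after `exec()`
returns. The post doesn’t show how its `getSandboxOutput` helper does this, so
here is a hypothetical sketch, assuming the result object is the last non-empty
line of output; the field names are my own guesses, not the real schema:

```typescript
// Hypothetical output shape: the byte sizes the shell script prints.
interface SandboxOutput {
  episodeBytes: number
  callerSegmentBytes: number
  responseSegmentBytes: number
}

// Recover the script's JSON result from exec() stdout. Assumes the JSON
// object is the last non-empty line, so earlier log lines are ignored.
function getSandboxOutput(stdout: string): SandboxOutput {
  const lines = stdout
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean)
  const last = lines.at(-1)
  if (!last) throw new Error('sandbox produced no output')
  try {
    return JSON.parse(last) as SandboxOutput
  } catch {
    throw new Error(`sandbox output was not JSON: ${last}`)
  }
}
```

Keeping the contract this thin (stdin-free command in, one JSON line out) is
part of what makes the sandbox disposable: there is no protocol to keep alive.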
The worker side is equally readable:
```ts
const completed = await runCallKentAudioSandboxJob({
  binding: env.Sandbox,
  sandboxId: createSandboxId(parsed.draftId),
  request: {
    draftId: parsed.draftId,
    attempt,
    callAudioUrl: signedUrls.callAudioUrl,
    responseAudioUrl: signedUrls.responseAudioUrl,
    episodeUploadUrl: signedUrls.episodeUploadUrl,
    callerSegmentUploadUrl: signedUrls.callerSegmentUploadUrl,
    responseSegmentUploadUrl: signedUrls.responseSegmentUploadUrl,
  },
})
```
And `runCallKentAudioSandboxJob` boils down to:

```ts
const sandbox = getSandbox(binding, sandboxId)
try {
  const result = await sandbox.exec('/usr/local/bin/call-kent-audio-cli', {
    env: createSandboxCommandEnvironment(request),
    timeout: sandboxExecTimeoutMs,
  })
  return getSandboxOutput(result.stdout)
} finally {
  await sandbox.destroy()
}
```
Start it, run it, destroy it. No coordination layer needed.
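The `createSandboxCommandEnvironment` helper referenced above isn’t shown in
the post; presumably it just flattens the request into env vars for the shell
script to read. A hypothetical sketch of that mapping, where the variable names
are my guesses rather than the real ones:

```typescript
// Hypothetical request shape, mirroring the fields the worker passes to
// runCallKentAudioSandboxJob in the snippet above.
interface SandboxJobRequest {
  draftId: string
  attempt: number
  callAudioUrl: string
  responseAudioUrl: string
  episodeUploadUrl: string
  callerSegmentUploadUrl: string
  responseSegmentUploadUrl: string
}

// Flatten the request into the env the CLI script reads. The sandbox sees
// only these signed URLs -- no R2 credentials ever cross this boundary.
function createSandboxCommandEnvironment(
  request: SandboxJobRequest,
): Record<string, string> {
  return {
    DRAFT_ID: request.draftId,
    ATTEMPT: String(request.attempt),
    CALL_AUDIO_URL: request.callAudioUrl,
    RESPONSE_AUDIO_URL: request.responseAudioUrl,
    EPISODE_UPLOAD_URL: request.episodeUploadUrl,
    CALLER_SEGMENT_UPLOAD_URL: request.callerSegmentUploadUrl,
    RESPONSE_SEGMENT_UPLOAD_URL: request.responseSegmentUploadUrl,
  }
}
```

Env vars as the only input channel is a nice forcing function: everything the
job needs has to be a string, and nothing secret should be among them.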
I had added Cloudflare Containers only the day before, so I had exactly one
real production run to judge by. That is not a lot of data.
But the justification here isn’t performance numbers. It’s that the final system
has fewer moving parts and fewer places for things to go wrong. The control
plane I deleted (the heartbeat loop, the idle check, the shutdown signal, the
separate service with its own deploy pipeline) was complexity I was adding on
top of a problem that already had a simpler solution. A sandbox that runs one
job and exits does not need any of that. The right lifecycle for job-shaped work
is a job-shaped sandbox.
The container migration was still worth it. It solved the immediate production
problem, and running the container version for even one day made it obvious
that the heartbeat/shutdown ceremony was the part that didn’t need to exist. I
just didn’t know the sandbox API well enough to see that until I’d tried the
first version.
The container implementation, the sandbox spike (PR #726), the redesigned
sandbox approach (PR #729), the comparison between the two, and the final
validation all happened in under an hour of my own time.
I described the problem to a Cursor agent. It built the first sandbox direction.
I looked at what it built, thought “this is still shaped like a container
service,” described the simpler shape I wanted, and it rebuilt it. I reviewed
the result, merged it, and moved on.
The agent handled the exploration cost. That is the part that usually makes
architectural iteration slow: you have to build the thing before you can have
an informed opinion about whether it is the right thing. When that cost is
close to zero (or the cost of the amount of tokens 😅), you can just try both
and choose the better one (or as I like to say, “choose the one I hate the
least”). The PR history here has an entire abandoned direction that I genuinely
used to inform the final design, and it cost me very little time to produce.
Agents didn’t catch everything. Two things only surfaced once the real system
ran.
**Sandbox ID length.** The original worker generated sandbox IDs like this:

```ts
const sandboxId = `call-kent-audio-${draftId}-${crypto.randomUUID()}`
```
A UUID is 36 characters, so this came out to roughly 89 characters. Cloudflare
Sandbox IDs must be 1-63 characters. The first real production run failed
immediately with `Sandbox ID must be 1-63 characters long`.
The fix was to keep the ID traceable but compact: strip dashes from both the
draft ID and the random suffix, take the first 12 characters of each, and
combine them:
```ts
function createSandboxId(draftId: string) {
  const compactDraftId = draftId.replaceAll('-', '').slice(0, 12)
  const randomSuffix = crypto.randomUUID().replaceAll('-', '').slice(0, 12)
  return `call-kent-${compactDraftId}-${randomSuffix}`
}
```
`call-kent-` is 10 characters, each compact segment is 12, the separator is 1,
giving a total of 35. Well under the limit, still traceable to the draft, still
unique enough.
I could have caught this by running it myself in a staging/preview
environment… or by giving the agent the keys to do that for me.
**The sandbox image wasn’t actually a sandbox image.** This one is a better
story.
During the PR review, one of the automated bots noted that the Dockerfile ran
as root and suggested adding a non-root user. The agent implementing that change
also set up a minimal HTTP server (`busybox httpd`) as the container entrypoint,
probably from some pattern about containers needing a running process. The
problem is that Cloudflare Sandboxes aren’t containers in that sense. The
`@cloudflare/sandbox` SDK expects to talk to the Cloudflare sandbox runtime
that’s baked into the base image. Once the image was based on plain Debian with
a custom CMD, the SDK’s exec session setup got 501 errors because the runtime
wasn’t there.
I didn’t catch this in testing because the local mock path doesn’t go through a
real sandbox image at all.
Here is the cool part: I handed this debugging task to an agent. It connected to
the live production environment using real env vars, enqueued throwaway jobs
with fake draft IDs (so nothing could accidentally publish), and ran through the
actual queue-to-sandbox path in production. Within a few minutes it had isolated
the failure: the queue delivery and callback routing were fine, the worker logic
was fine, and the sandbox exec was failing with 501s. It traced that back to the
image setup, identified the missing base image requirement, and wrote the fix.
Shout-out to the Cloudflare MCP server 🔥.
I got a summary back describing exactly what was wrong and what was changed. I
looked at the diff, the explanation made sense, and I merged it. The next
production probe succeeded and produced the expected MP3 outputs in R2.
That is genuinely cool. Not “AI wrote code” cool, which at this point is table
stakes. I mean “I delegated a real production debugging investigation, the agent
ran it safely without my supervision, and I got back a correct diagnosis and
fix” cool. I did not spend an evening poking at logs. I did not have to
reconstruct the failure path manually. I just reviewed the result and moved on.
The fixed Dockerfile is now six instructions:
```dockerfile
FROM docker.io/cloudflare/sandbox:0.7.16

RUN apt-get update \
  && apt-get install -y --no-install-recommends ffmpeg \
  && rm -rf /var/lib/apt/lists/*

WORKDIR /opt/call-kent-audio

COPY assets ./assets
COPY sandbox/call-kent-audio-cli.sh /usr/local/bin/call-kent-audio-cli
RUN chmod +x /usr/local/bin/call-kent-audio-cli
```
The official base image handles the sandbox runtime. I add FFmpeg and the
assets. Nothing else.
One thing I’m not covering in detail here: all of this sandbox work happened on
the same day I also migrated the repo to npm workspaces and Nx, which moved
everything under services/*. That migration had its own production incident
involving hardcoded content paths and a broken Docker stage.
I wrote about all of that separately in
Migrating to Workspaces and Nx. The
short version is: structural refactors break assumptions you didn’t know you
had, and “the agent was confident it would work” is not the same as “it will
work.”
I only really mention this to say that I never could have gotten so much done at
once before agents. I love building software in 2026!
New infrastructure primitives only help if you let them change the shape of the
thing you’re building. Cloudflare Sandboxes let me delete a lifecycle control
plane that the container approach required but sandboxes simply don’t need. The
win wasn’t “sandboxes are faster” or “sandboxes are cheaper” (I don’t have
enough data to make those claims after two days). The win was that the right
design for a one-shot job is a one-shot execution model, and the sandbox API
makes that straightforward.
The container migration fixed the production problem. The sandbox migration
fixed the architectural shape that was left behind. Both were worth doing, and
together they cost me about an hour of my own time.
If you want to hear more about any of this, give me a call.