I spent the last couple months building ShedboxAI Agent, a conversational AI that turns natural language into data workflows. The kind of thing where you say "analyze my sales data and find the top performing regions" and it figures out how to make that happen.
Along the way I learned a bunch of stuff that I wish someone had told me before I started. This post is that list. No hype, no "AI will change everything" nonsense. Just practical lessons from actually building this thing.
The Big Insight: Let the LLM Write Code, Not Choose Tools
Most AI agent frameworks work like this: you define a bunch of tools (functions), the LLM picks which ones to call, and you execute them. Sounds reasonable. Turns out it's actually pretty wasteful.
Every time you send a message to the LLM, you have to include the full schema of every tool. That's a lot of tokens. And the LLM often picks the wrong tool anyway, especially when tools have similar names or overlapping functionality.
What we ended up doing instead: give the LLM a Python sandbox and let it write code. The code can call our APIs directly. No tool selection step. No schema overhead.
This saved us something like 98% of tokens compared to the traditional approach. And the LLM was actually better at writing code to accomplish tasks than it was at selecting from a menu of tools. Go figure.
The catch is you need a secure sandbox. We used RestrictedPython to limit what the code can do. Can't let an LLM run arbitrary code on your server. But once you have that, the code execution approach is way more flexible.
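To make that concrete, here's a minimal sketch of what a restricted execution step can look like with RestrictedPython. This is not ShedboxAI's actual sandbox, and a production setup needs the full set of guards for attribute and item access; the `run_agent_code` name and the `exposed_api` dict are just illustrative.

```python
# Minimal sketch: compile LLM-generated code with RestrictedPython and run it
# against a small whitelist of callables instead of full builtins.
from RestrictedPython import compile_restricted, safe_builtins

def run_agent_code(source: str, exposed_api: dict) -> dict:
    byte_code = compile_restricted(source, filename="<agent>", mode="exec")
    # Restricted builtins plus only the functions the agent is allowed to call.
    env = {"__builtins__": safe_builtins, **exposed_api}
    local_vars: dict = {}
    exec(byte_code, env, local_vars)
    return local_vars  # whatever names the generated code bound

# Example: the "API" is just the functions you choose to hand over.
# result = run_agent_code("rows = fetch('sales')", {"fetch": my_fetch_fn})
```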
One Big Agent is Worse Than Many Small Ones
My first attempt was a single agent that handled everything. It had a huge context window (like 180K tokens), knew about all our APIs, and was supposed to figure out how to do any task.
It sucked. The agent would get confused, mix up context from earlier in the conversation, and make weird decisions because it was trying to keep track of too much stuff at once.
What actually works: a small orchestrator that delegates to specialist agents. The orchestrator stays lean (around 60K tokens of context) and its only job is to figure out what kind of task this is and hand it off to the right specialist.
Specialists are beautiful because they're disposable. They spin up fresh for each task with no memory of previous conversations. No context pollution. They do their one thing, return the result, and get garbage collected.
We ended up with three specialists:
- An analysis specialist for investigating data
- A YAML builder for generating workflow configs
- A results interpreter for explaining what happened
Each one has its own 40K token budget. The orchestrator can run them in parallel if needed. And because they're stateless, you don't get the weird bugs where the agent remembers something from 50 turns ago that's no longer relevant.
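If that sounds abstract, here's roughly what the delegation pattern looks like. This is a simplified sketch, not the real orchestrator: the class names, the keyword routing, and the async shape are all illustrative.

```python
# Sketch of the orchestrator/specialist split. The point is that specialists
# are created fresh per task and thrown away afterwards.
import asyncio
from dataclasses import dataclass

@dataclass
class TaskResult:
    kind: str
    output: str

class Specialist:
    """Stateless worker: fresh instance per task, no shared history."""
    def __init__(self, role: str, token_budget: int = 40_000):
        self.role = role
        self.token_budget = token_budget

    async def run(self, task: str) -> TaskResult:
        # The LLM call happens here, with only this task's context.
        return TaskResult(kind=self.role, output=f"[{self.role}] handled: {task}")

class Orchestrator:
    """Lean router: classifies the task and spawns the right specialist."""
    ROLES = {"analyze": "analysis", "build": "yaml_builder", "explain": "interpreter"}

    async def handle(self, task: str) -> TaskResult:
        role = next((r for key, r in self.ROLES.items() if key in task.lower()),
                    "analysis")
        specialist = Specialist(role)      # no memory of previous turns
        return await specialist.run(task)  # garbage collected after this

# asyncio.run(Orchestrator().handle("analyze my sales data"))
```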
Context Management is the Whole Game
This was the thing I most underestimated. In a long conversation, your context window fills up. When it's full, something has to give. Either you truncate (and lose important history) or you compress (and lose some detail). Neither is great.
The naive approach is to wait until you're at 95% capacity and then panic-compress. We tried this. It's terrible. Compressing a lot at once causes you to lose important context, and then the agent makes dumb decisions, and users hate you.
What works better: compress early and often. We trigger compression at 85% capacity. Smaller, more frequent compressions preserve more context than one big one. It's counterintuitive but true.
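The trigger itself is almost embarrassingly simple; the whole trick is the threshold and the fact that you only squash the old stuff. A rough sketch, where `summarize` stands in for whatever LLM summarization call you use:

```python
# Compress early and often: trigger at 85% instead of panic-compressing at 95%.
COMPRESS_AT = 0.85
KEEP_RECENT = 5   # the last few messages stay verbatim

def maybe_compress(used_tokens: int, window: int, history: list[str],
                   summarize) -> list[str]:
    if used_tokens / window < COMPRESS_AT:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    # Squash only the oldest slice; recent turns survive untouched.
    return [summarize(old)] + recent
```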
We also split context into tiers:
- Hot context (in-memory, exact): The system prompt, current task, last 5 messages. This never gets compressed.
- Warm context (summarized): Conversation history, past decisions, workspace metadata. Compressed but accessible.
- Cold context (RAG-retrieved): Documentation, data schemas, workflow history. Only pulled in when relevant.
This three-tier approach let us handle 100+ turn conversations without the agent losing track of what it was doing. The key insight is that not all context is equally important. Recent stuff matters more than old stuff. Current task matters more than background knowledge.
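Here's a sketch of how those tiers can hang together. The dataclass and the `build_prompt` method are made up for illustration; only the hot/warm/cold split itself comes from what we actually do.

```python
# Illustrative three-tier context holder: hot stays verbatim, warm is
# summarized, cold is retrieved on demand.
from dataclasses import dataclass, field

@dataclass
class TieredContext:
    # Hot: never compressed.
    system_prompt: str
    current_task: str
    recent_messages: list[str] = field(default_factory=list)  # last 5 turns
    # Warm: summarized history and past decisions.
    summaries: list[str] = field(default_factory=list)

    def build_prompt(self, retrieve) -> str:
        # Cold: pulled in only when relevant, e.g. RAG over docs and schemas.
        cold = retrieve(self.current_task)
        return "\n\n".join([
            self.system_prompt,
            "Task: " + self.current_task,
            "Background:\n" + "\n".join(self.summaries),
            "Relevant docs:\n" + "\n".join(cold),
            "Recent conversation:\n" + "\n".join(self.recent_messages[-5:]),
        ])
```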
Make Errors Visible to the LLM
When the agent generates a YAML workflow and it fails to execute, what do you do? The obvious thing is to tell the user "that didn't work" and ask them to try again. But users hate that.
What we do instead: parse the error logs, extract the actual error messages, and feed them back to the LLM. Let the LLM debug its own code.
This sounds obvious in retrospect but it took me a while to figure out. The LLM is actually really good at debugging when it can see the error. It just needs the information.
We have a retry loop that goes: generate YAML, execute it, if it fails parse the logs, feed the error back to the LLM, let it try again. Up to 3 attempts.
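In code, the loop is nothing fancy. A sketch, with the three callables (YAML generation, execution, log parsing) left as placeholders for your own pieces:

```python
# Sketch of the retry loop: generate, execute, and on failure feed the
# parsed error back into the next generation attempt.
from typing import Callable, Tuple

def generate_with_retries(
    request: str,
    generate_yaml: Callable[[str, str], str],             # (request, error_feedback) -> yaml
    execute_workflow: Callable[[str], Tuple[bool, str]],  # yaml -> (ok, logs)
    extract_errors: Callable[[str], str],                 # logs -> error summary
    max_attempts: int = 3,
) -> str:
    feedback = ""
    for _ in range(max_attempts):
        yaml_config = generate_yaml(request, feedback)
        ok, logs = execute_workflow(yaml_config)
        if ok:
            return yaml_config
        feedback = extract_errors(logs)  # let the LLM see the real error
    raise RuntimeError("still failing after retries:\n" + feedback)
```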
This fixed about 80% of errors that would have otherwise required user intervention. The remaining 20% are usually fundamental misunderstandings of what the user wanted, which you can't fix with retries anyway.
Use Files as APIs
Here's a pattern that seemed dumb to me at first but turned out to be really useful: instead of having specialists call APIs to get data schemas, we just write the schema to a file and have everyone read from that file.
On startup, we run an introspection command that looks at all the data sources and writes out a markdown file with column names, types, sample rows, row counts, everything. Then every specialist just reads that file.
Why is this better than API calls?
- Faster. File reads are instant, API calls have latency.
- Single source of truth. Everyone sees the same data.
- Works offline. Once the file exists, no network needed.
- Easier to debug. You can just look at the file.
The tradeoff is staleness. If the data changes, the file is out of date. But for our use case that's fine. We regenerate the file at the start of each session and data doesn't usually change mid-conversation.
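Here's roughly what the introspection step can look like, assuming your sources arrive as pandas DataFrames (ours are more varied, and the function name and output path here are just illustrative):

```python
# Sketch of startup introspection: dump every source's schema into one
# markdown file that every specialist reads instead of calling APIs.
from pathlib import Path

def write_schema_file(sources: dict, path: str = "workspace/schemas.md") -> Path:
    lines = ["# Data sources", ""]
    for name, df in sources.items():          # e.g. {"sales": some DataFrame}
        lines.append(f"## {name}")
        lines.append(f"- rows: {len(df)}")
        lines.append("- columns:")
        for col, dtype in df.dtypes.items():
            lines.append(f"  - {col} ({dtype})")
        lines.append("")
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines))
    return out  # the single source of truth everyone reads
```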
Contract-First Architecture Saved My Sanity
Every interaction between components is defined by a Pydantic schema. The orchestrator sends a PlanRequest to the planner and gets back an ExecutionPlan. The orchestrator sends SpecialistInput to specialists and gets back SpecialistOutput. Everything is typed.
This sounds like busywork but it's actually essential. When you're building a system with multiple async agents talking to each other, you need to know exactly what each one expects and produces. Otherwise you spend all your time debugging weird type mismatches.
The other benefit: you can develop in parallel. I could work on the orchestrator while someone else worked on specialists, and as long as we both matched the schema it would work when we merged. No "let me see your code so I know what format to send" conversations.
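For flavor, here's what those contracts might look like. The model names are the real ones from above; the fields are guesses for illustration, not ShedboxAI's actual schemas.

```python
# Illustrative contract-first message types using Pydantic.
from pydantic import BaseModel

class PlanRequest(BaseModel):
    user_message: str
    workspace_id: str

class ExecutionPlan(BaseModel):
    steps: list[str]
    specialist: str            # which specialist should handle the task

class SpecialistInput(BaseModel):
    task: str
    context_summary: str = ""
    token_budget: int = 40_000

class SpecialistOutput(BaseModel):
    success: bool
    result: str
    errors: list[str] = []
```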
One Command is Better Than Three
Early versions of the CLI had multiple commands: shedbox-agent init for setup, shedbox-agent chat for conversations, and shedbox-agent reinit to start over. Users kept running the wrong one.
What we have now: just shedbox-agent. The CLI figures out what to do.
- First run? Run the onboarding flow, then start chatting.
- Setup incomplete? Resume from where you left off.
- Already set up? Jump straight to chat.
- Want to start over? Pass --reinit.
One command with smart defaults. Users don't have to remember anything. This sounds small but it made a huge difference in how people felt about using the tool.
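The dispatch logic behind this is tiny. A sketch, where the state file location and the helper functions are placeholders for however you actually track setup progress:

```python
# Sketch of "one command, smart defaults": the CLI decides what to do
# based on saved state plus a single --reinit flag.
import argparse
from pathlib import Path

STATE = Path.home() / ".shedbox-agent" / "state.json"   # placeholder location

def setup_exists() -> bool:
    return STATE.exists()

def setup_complete() -> bool:
    return setup_exists()  # placeholder; a real check reads saved progress

def run_onboarding() -> None:
    print("running onboarding (resumes from saved progress if any)...")

def start_chat() -> None:
    print("starting chat...")

def main() -> None:
    parser = argparse.ArgumentParser(prog="shedbox-agent")
    parser.add_argument("--reinit", action="store_true", help="start over")
    args = parser.parse_args()

    if args.reinit or not setup_exists():
        run_onboarding()        # first run, or an explicit reset
    elif not setup_complete():
        run_onboarding()        # resume incomplete setup
    start_chat()                # then always land in chat

if __name__ == "__main__":
    main()
```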
Debug Mode is Not Optional
We have an environment variable, SHEDBOX_AGENT_DEBUG=true, that makes the orchestrator log everything: which specialist it's spawning, why it made that decision, what the token budget is, what context is being passed around.
This was invaluable for debugging. When something goes wrong in a multi-agent system, it's really hard to figure out why. Having a mode where you can see exactly what decisions were made and in what order saves hours of head-scratching.
The key is making it opt-in. In normal operation you don't want all that noise. But when you're debugging, you want all of it.
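Wiring that up is mostly standard logging keyed off the environment variable. A minimal sketch (the log fields shown are examples, not our exact format):

```python
# Opt-in debug logging: noisy only when SHEDBOX_AGENT_DEBUG=true is set.
import logging
import os

DEBUG = os.environ.get("SHEDBOX_AGENT_DEBUG", "").lower() == "true"
logging.basicConfig(level=logging.DEBUG if DEBUG else logging.WARNING)
log = logging.getLogger("orchestrator")

# In the orchestrator, log the decision trail; these are no-ops unless
# debug mode is on.
log.debug("spawning specialist=%s reason=%s budget=%d",
          "analysis", "user asked a data question", 40_000)
```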
The Things That Didn't Work
Not everything was a success. Some stuff we tried that failed:
Fancy tool selection algorithms
We spent way too long building a "smart" system for selecting which tool to use. Keyword matching, semantic similarity, learned embeddings. All of it was worse than just letting the LLM write code directly. Ended up ripping the whole thing out.
Long-lived specialists
We tried keeping specialists around between tasks so they could remember context. It just made them confused. They'd apply lessons from previous tasks to the current one in ways that didn't make sense. Disposable specialists are better.
Aggressive caching
We cached LLM responses thinking it would save money. But conversations are so variable that cache hit rates were terrible. We spent more time maintaining the cache than we saved in API costs. Removed it.
What I'd Do Differently
If I started over, I'd do a few things differently:
- Start with context management. I treated it as an afterthought and had to retrofit it. Should have been the first thing I designed.
- Skip the tool registry entirely and go straight to code execution. Would have saved weeks.
- Build more integration tests earlier. Unit tests are fine but the bugs that matter are in how components interact.
- Keep the LLM calls dumber. Every time I tried to make prompts clever, they got worse. Simple, direct prompts work better.
Is It Worth Building Your Own?
Honestly, it depends. There are good agent frameworks out there now (LangChain, CrewAI, AutoGen). If your use case fits their patterns, use them. Don't build from scratch for the sake of it.
We built custom because we had specific requirements around the code execution model and context management that didn't fit existing frameworks. And because I wanted to learn how this stuff actually works, not just how to use someone else's abstractions.
If you do build your own, the lessons above should save you some pain. The core insight is that AI agents are really just distributed systems with an LLM in the loop. All the normal distributed systems wisdom applies. Keep components small. Make failures visible. Design for observability. Use contracts between components.
The LLM part is almost the easy part. The hard part is everything around it.
Check out ShedboxAI if you want to see how it turned out. It's open source, so you can dig into the code if any of this sounds interesting.