What Is An Agent Harness?

6 Layers That Turn An LLM Into An Agent

Jun 01, 2026

Open ChatGPT. Ask: “What is the best phone under 30,000 rupees in 2026?”

It searches the web. Reads a few pages. Searches again with a better query. Then it writes an answer.

Now disable web search and ask the same question. Or run it on the bare model through the API with no tools.

The same model gives you a confident answer from training data. Discontinued phones. Wrong prices. Nothing released after its cutoff.

Same model. Same prompt. Two different products.

What changed is the agent harness around the model.

In our previous pieces, we covered evals and traces. Those measure what an agent does. This piece is about what an agent actually is.

If you are building AI agents at work, or preparing for AI PM interviews, this is the model you need. We teach this in our course. (PMs at Microsoft, Coinbase, Indeed & 600+ PMs rated 4.9/ 5). See testimonials and course details — Extra 60% OFF - Use Code NYE26

The Model Is Not The Agent

The model is a function. Tokens in, tokens out. That is the entire contract.

The model can decide that a tool should be called. The model cannot run the tool. The model can decide a task is finished. The model cannot enforce that decision.

Search execution. Memory across sessions. Retries on failure. Loop termination. None of these happens inside the model. They happen in the agent harness.

What ChatGPT Actually Does When You Ask About Phones

The full agentic turn:

Goal received. The system records your question.
State check. Is there enough context to answer? No. Fresh data needed.
Pick a tool. The model emits a tool call: web search with a specific query.
Call the tool. The system runs the search and returns results.
Read the result. Results are fed back into the model as the next turn.
Decide if done. Not yet. The model emits another tool call. Browse a URL.
Loop. Steps three through six repeat until the model stops calling tools.
Stop and answer. The model emits a response with no tool call. The system exits the loop.

Steps three, six, and eight are the model. Pick the tool. Decide to continue. Decide to stop.

Every other step is code. The search has to run. Results have to be parsed and shrunk to fit context. History has to be tracked. The loop has to terminate even when the model wants to keep going.

That code is the agent harness.

Where The Loop Breaks

Five failures from products you have used:

The model picks the right tool but the wrong query. It searches “best phone 2026” instead of “best phone under 30000 INR 2026”.
The tool returns 50 results. The model loses the original constraint by the time it finishes reading them.
The model declares the task done. It is not. You asked for a phone under 30k. It recommended one at 45k.
The model decides it is not done. It is. Eight more searches. 90 seconds of waiting.
A tool errors out. The model has no idea why. It retries. Same error. Retries again.

Each failure maps to a specific agent harness layer.

Failure 1: tool design. Force structured parameters, not free-text queries.
Failure 2: context management. Return five results, not fifty.
Failure 3: verification. Check the constraint before declaring done.
Failure 4: stop condition. Cap turns.
Failure 5: error handling. Classify errors, halt retry loops.

Better prompting will not solve any of these reliably. The fix sits in the harness.

The 6 Layers Of An Agent Harness

1. The Orchestration Loop

The while-loop above. Stop condition. Max turns. Tool-error behaviour. When to summarise older turns. When to spawn sub-agents.

ChatGPT caps how many searches per turn. Perplexity caps sources read. Both are product decisions, not model decisions.

A loop with no max-turn cap will burn your token budget.

2. Tool Definitions

ChatGPT’s web_search is a function. Whoever defined it decided the query format, the number of results, the fields per result, and the snippet length.

The model can pick any tool you give it. The model cannot redesign the tool.

A file-editing tool that returns just the diff scales to large codebases. One that returns the whole file runs out of context after three edits. Tool design decides what the agent can do.

3. Context Management

At some point the context window fills. Then you choose: truncate from the top, summarise older turns, keep a scratchpad, or spawn a sub-agent with a fresh context.

This decides whether your agent feels coherent at turn 30 or has forgotten the original question.

When ChatGPT says “as I mentioned earlier” about something you never discussed, the context layer failed.

4. Memory

Context lives for one task. Memory lives across tasks.

If a user comes back on Friday after talking to your agent on Monday, where is what they said stored? Vector store? Structured profile? Database keyed by user ID?

ChatGPT’s Memory feature is this layer. The base model has no memory between sessions. The harness added it.

5. Guardrails

Before any tool call executes, something has to inspect it.

Is this a write to production? Is this destructive? Did the model just try to email every customer?

When Cursor asks “Apply this change?” before editing your file, that is a guardrail. When ChatGPT Agent pauses before buying something, that is a guardrail.

Without this layer, one hallucinated tool call can do real damage.

6. Observability

Every turn, every tool call, every input, every output, every failure has to be logged. Not just for debugging. For evals.

If your agent fails in production and you cannot replay the trace, you do not have an agent. You have a black box that occasionally works.

What Product Managers Need To Decide

How agentic do you want this to be

A single tool call wrapped in a UI is not an agent. A 50-turn autonomous loop with no human in the path is the other extreme.

ChatGPT in regular mode is mildly agentic. ChatGPT in Agent mode is heavily agentic. Same model. Different harness. Different product.

How will you cap cost?

Token cost per task. Turns per task. Rupees per resolved query. Pick the unit. Track from day one.

What is your stop condition?

Is the model allowed to declare success on its own, or does another system verify the work?

In Cursor, the verifier is the test suite. In a support agent, a customer satisfaction signal. Without a verifier, the agent will tell you it is done when it is not.

Where is the human in the loop

For low-stakes tasks, the agent runs free. For high-stakes ones, it pauses for approval.

ChatGPT Agent pauses before buying. Cursor pauses before applying edits. Where those pauses sit is a product decision, not a model decision.

What do you log

If you log only the final answer, you cannot debug. Log every tool call, every model output, every reasoning step.

If this changed how you think about our Job Ready AI PM Cohort

(12 Weeks, ~50 Sessions, ~100 Hours, ~10+ Products built, ~20 Hours of Interview Prep, 2 Mock Interviews) ~goes deeper. Live cohort. Cohort registrations open. Limited seats. Fill this Form to Show Interest

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass ( Here )

More Resources

Product Management Mock Interview (Detailed)
Crack AI Business Roles (AI Management Consulting, AI Category Management, AI General Manager, Revenue Planning, etc.) - Course Details
Crack AI Program Manager Roles - Course Details

Technomanagers

Discussion about this post

Ready for more?