Technomanagers

What Is An Agent Harness?

Shailesh Sharma — Mon, 01 Jun 2026 23:54:27 GMT

Open ChatGPT. Ask: “What is the best phone under 30,000 rupees in 2026?”

It searches the web. Reads a few pages. Searches again with a better query. Then it writes an answer.

Now disable web search and ask the same question. Or run it on the bare model through the API with no tools.

The same model gives you a confident answer from training data. Discontinued phones. Wrong prices. Nothing released after its cutoff.

Same model. Same prompt. Two different products.

What changed is the agent harness around the model.

In our previous pieces, we covered evals and traces. Those measure what an agent does. This piece is about what an agent actually is.

If you are building AI agents at work, or preparing for AI PM interviews, this is the model you need. We teach this in our course. (PMs at Microsoft, Coinbase, Indeed & 600+ PMs rated 4.9/ 5). See testimonials and course details — Extra 60% OFF - Use Code NYE26

The Model Is Not The Agent

The model is a function. Tokens in, tokens out. That is the entire contract.

The model can decide that a tool should be called. The model cannot run the tool. The model can decide a task is finished. The model cannot enforce that decision.

Search execution. Memory across sessions. Retries on failure. Loop termination. None of these happens inside the model. They happen in the agent harness.

What ChatGPT Actually Does When You Ask About Phones

The full agentic turn:

Goal received. The system records your question.
State check. Is there enough context to answer? No. Fresh data needed.
Pick a tool. The model emits a tool call: web search with a specific query.
Call the tool. The system runs the search and returns results.
Read the result. Results are fed back into the model as the next turn.
Decide if done. Not yet. The model emits another tool call. Browse a URL.
Loop. Steps three through six repeat until the model stops calling tools.
Stop and answer. The model emits a response with no tool call. The system exits the loop.

Steps three, six, and eight are the model. Pick the tool. Decide to continue. Decide to stop.

Every other step is code. The search has to run. Results have to be parsed and shrunk to fit context. History has to be tracked. The loop has to terminate even when the model wants to keep going.

That code is the agent harness.

Where The Loop Breaks

Five failures from products you have used:

The model picks the right tool but the wrong query. It searches “best phone 2026” instead of “best phone under 30000 INR 2026”.
The tool returns 50 results. The model loses the original constraint by the time it finishes reading them.
The model declares the task done. It is not. You asked for a phone under 30k. It recommended one at 45k.
The model decides it is not done. It is. Eight more searches. 90 seconds of waiting.
A tool errors out. The model has no idea why. It retries. Same error. Retries again.

Each failure maps to a specific agent harness layer.

Failure 1: tool design. Force structured parameters, not free-text queries.
Failure 2: context management. Return five results, not fifty.
Failure 3: verification. Check the constraint before declaring done.
Failure 4: stop condition. Cap turns.
Failure 5: error handling. Classify errors, halt retry loops.

Better prompting will not solve any of these reliably. The fix sits in the harness.

The 6 Layers Of An Agent Harness

1. The Orchestration Loop

The while-loop above. Stop condition. Max turns. Tool-error behaviour. When to summarise older turns. When to spawn sub-agents.

ChatGPT caps how many searches per turn. Perplexity caps sources read. Both are product decisions, not model decisions.

A loop with no max-turn cap will burn your token budget.

2. Tool Definitions

ChatGPT’s web_search is a function. Whoever defined it decided the query format, the number of results, the fields per result, and the snippet length.

The model can pick any tool you give it. The model cannot redesign the tool.

A file-editing tool that returns just the diff scales to large codebases. One that returns the whole file runs out of context after three edits. Tool design decides what the agent can do.

3. Context Management

At some point the context window fills. Then you choose: truncate from the top, summarise older turns, keep a scratchpad, or spawn a sub-agent with a fresh context.

This decides whether your agent feels coherent at turn 30 or has forgotten the original question.

When ChatGPT says “as I mentioned earlier” about something you never discussed, the context layer failed.

4. Memory

Context lives for one task. Memory lives across tasks.

If a user comes back on Friday after talking to your agent on Monday, where is what they said stored? Vector store? Structured profile? Database keyed by user ID?

ChatGPT’s Memory feature is this layer. The base model has no memory between sessions. The harness added it.

5. Guardrails

Before any tool call executes, something has to inspect it.

Is this a write to production? Is this destructive? Did the model just try to email every customer?

When Cursor asks “Apply this change?” before editing your file, that is a guardrail. When ChatGPT Agent pauses before buying something, that is a guardrail.

Without this layer, one hallucinated tool call can do real damage.

6. Observability

Every turn, every tool call, every input, every output, every failure has to be logged. Not just for debugging. For evals.

If your agent fails in production and you cannot replay the trace, you do not have an agent. You have a black box that occasionally works.

What Product Managers Need To Decide

How agentic do you want this to be

A single tool call wrapped in a UI is not an agent. A 50-turn autonomous loop with no human in the path is the other extreme.

ChatGPT in regular mode is mildly agentic. ChatGPT in Agent mode is heavily agentic. Same model. Different harness. Different product.

How will you cap cost?

Token cost per task. Turns per task. Rupees per resolved query. Pick the unit. Track from day one.

What is your stop condition?

Is the model allowed to declare success on its own, or does another system verify the work?

In Cursor, the verifier is the test suite. In a support agent, a customer satisfaction signal. Without a verifier, the agent will tell you it is done when it is not.

Where is the human in the loop

For low-stakes tasks, the agent runs free. For high-stakes ones, it pauses for approval.

ChatGPT Agent pauses before buying. Cursor pauses before applying edits. Where those pauses sit is a product decision, not a model decision.

What do you log

If you log only the final answer, you cannot debug. Log every tool call, every model output, every reasoning step.

If this changed how you think about our Job Ready AI PM Cohort

(12 Weeks, ~50 Sessions, ~100 Hours, ~10+ Products built, ~20 Hours of Interview Prep, 2 Mock Interviews) ~goes deeper. Live cohort. Cohort registrations open. Limited seats. Fill this Form to Show Interest

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass ( Here )

More Resources

Product Management Mock Interview (Detailed)
Crack AI Business Roles (AI Management Consulting, AI Category Management, AI General Manager, Revenue Planning, etc.) - Course Details
Crack AI Program Manager Roles - Course Details

Google Cloud Strategy 2026

Shailesh Sharma — Fri, 29 May 2026 19:34:44 GMT

Before we move ahead, you can find out about our

AI PM Course (PMs at Microsoft, Coinbase, Indeed & 600+ PMs rated 4.9/ 5). See testimonials and course details — Extra 60% OFF - Use Code NYE26

Google Cloud is changing its strategy in a very big way.

They want to move beyond renting compute and storage. The goal is to become the only place enterprise AI runs.

But Why?

Because Alphabet has put more than half of its 2026 ML compute investment into Google Cloud.

Cloud is Alphabet’s smallest revenue line. It is also the only place inside the company where hundreds of billions of AI capex can be converted into uncapped enterprise revenue.

Let’s understand how using the North Star metric.

North Star Metric for Google Cloud = Annual Cloud Revenue.

Breaking Down The Metric

Cloud Revenue = Enterprise Customers x Workloads per Customer x Revenue per Workload

Enterprise Customers: Companies running on Google Cloud.
Workloads per Customer: Compute, storage, and databases each company runs.
Revenue per Workload: Money Google makes per workload per month.

Two of three terms have a problem.

Enterprise sales is slow. AWS has the largest enterprises. Azure has Microsoft’s installed base. Sales cycles run 12 to 18 months. Hard to grow this term fast.
Revenue per Workload is under pressure. Compute and storage are commodities. Prices drop every year. Margins compress.

For years, this is why Google Cloud was treated as a side bet. The math did not work fast enough.

The AI Pivot

Cloud Revenue = Enterprise Customers x Agents per Customer x Tokens per Agent x Price per Million Tokens

By adding Agents and Tokens, Google turns a linear business into an exponential one.

Now they have four levers to pull:

Getting more enterprise customers ( Lever 1 )
Increasing agents per customer ( Lever 2 )
Increasing tokens consumed per agent ( Lever 3 )
Driving down cost per million tokens ( Lever 4 )

Lever 1: Customers

Thomas Kurian has been rebuilding the sales motion since 2019. Vertical specialists. Financial services. Retail. Healthcare. Media.

Cloud Next 2026 was full of Fortune 500 logos. Citadel Securities. Deutsche Telekom. Home Depot. GE Appliances. Highmark Health.

Q4 2025 growth was 48% year over year. Fastest of the three big hyperscalers.

Lever 2: Agents per Customer

Old enterprise software had a ceiling. SaaS licenses tied to humans. You could not sell more seats than there are employees.

AI agents have no ceiling. GE Appliances is deploying 800 agents through Gemini Enterprise. Deutsche Telekom built MINDR, a multi-agent system that runs network operations on its own.

The Gemini Enterprise Agent Platform exists specifically to let one customer deploy thousands of agents inside Google Cloud. Each agent has a unique cryptographic ID and governance policies wired in. Once a company has 800 agents on the platform they cannot easily leave.

This is the new switching cost.

Lever 3: Tokens per Agent

Each agent burns tokens every time it does anything. Reasoning. Retrieval. Generation. Action.

The more capable the agent the more tokens it uses. A simple chatbot might use 10000 tokens per conversation. An autonomous coding agent might use 5 million for one task.

The number to watch is throughput. Google’s first-party APIs went from 10 billion tokens per minute last quarter to 16 billion this quarter. 60 percent quarterly growth on a metric that did not exist three years ago.

This is the one number that tells you if the strategy is working. Not revenue. Not market share. Token throughput growth.

Lever 4: Cost per Token

Google is driving cost per token DOWN. Compute prices got cut 8 percent across regions in Q1 2026. Google Cloud is now 5 to 10 percent cheaper than AWS or Azure for AI workloads.

Why cut prices? Because volume compounds faster than price drops. Price drops 8 percent. Volume grows 60 percent. Revenue still goes up.

Lower cost per token is what makes the TPU investment pay off. Google designs its own chips. No Nvidia margin paid on every inference call. Custom silicon brings the internal cost down. Some savings passed to customers. Customers run more workloads. Volume grows. Cycle compounds.

This is also why they spent 32 billion dollars on Wiz. Security is the trust layer that lets a CISO approve more agents. More agents means more tokens. The volume engine only runs if enterprises trust the stack.

This strategy is very bold because it bets on volume over margin.

In India, we have seen this with Jio. Jio dropped data prices to almost zero. Volume exploded. Other telecom operators died trying to match. Now Jio owns the market.

Google Cloud is playing the same game with AI. Drop the cost per token. Push volume. Lock customers in with agents that cannot be moved.

But there are two real risks.

Multi-cloud adoption is at 89 percent of enterprises. Companies are deliberately splitting AI workloads to avoid lock-in. If they refuse to consolidate on Google Cloud the volume game does not work.
The capex math is the second risk. Alphabet is putting 180 billion dollars of capex into 2026. TPU pods get obsoleted in about two years. Half the life of a traditional server. If AI adoption is real but slow Google ends up structurally over-built.

The strategy is forced not chosen. Cloud is the only place inside Alphabet where the AI capex math can possibly close. Search and YouTube have no real surface left for new monetisation.

If Google is right about exponential token growth and enterprise consolidation, Cloud quietly becomes the most important business inside Alphabet. If they are wrong on either, the integrated stack becomes the most expensive overbuild in tech history.

There is no middle outcome.

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass ( Here )

More Resources

Product Management Mock Interview (Detailed)
Crack AI Business Roles (AI Management Consulting, AI Category Management, AI General Manager, Revenue Planning, etc.) - Course Details
Crack AI Program Manager Roles - Course Details

How Top 1% PM Candidates Answer AI Product Sense Questions in 2026?

Shailesh Sharma — Mon, 25 May 2026 18:34:38 GMT

Anthropic recently asked this Question for their AI PM Role.

Design a Safety Layer for an AI API?

Here is how a top 1% candidate answers this question from first principles.

Break down the question

Three loaded words.

Safety could mean preventing harmful outputs, preventing misuse, or preventing data leaks. Scope it.
Layer means a component within a larger system. Think about where it sits and what it does not own.
AI API means programmatic access. Not a chatbot. The users are developers. The attacks are automated. No human reviews each request.

The problem: Design a system that prevents harm across the entire API request lifecycle, at scale, without killing developer experience.

Clarifying questions

What type of AI API?
Text-only vs multimodal vs tool-use means completely different threat surfaces. A text API faces prompt injection. A multimodal API adds visual jailbreaks like harmful text hidden in images. A tool-use API adds real-world action risks. Assumption: multimodal with tool use. The hardest version.
Who are the consumers?
Solo devs need defaults. Enterprises need configurable policies and compliance guarantees.
Assumption: both.
Latency constraints?
Every safety check adds latency. If the API serves real-time apps like voice assistants, 300ms of safety processing breaks the experience.
Assumption: p50 under 50ms, p99 under 200ms.

Why does this matter?

Safety is the moat. Model capability is converging. GPT-4, Claude, and Gemini perform comparably. What differentiates an API provider is trust. Enterprises choose the API they trust not to embarrass them.

Unsafe APIs do not scale. At 100 developers, misuse is unlikely. At 100,000, misuse is daily. And the downside is asymmetric. One viral safety failure undoes years of brand equity.

User segments

End users are people using apps built on the API. They never see the safety layer. They bear the most harm when it fails. They have zero agency to protect themselves.
External developers are people calling the API. They need sensible defaults, configurable policies, and transparent error messages when requests get blocked.
Trust and Safety team are the internal operator. They need dashboards, investigation tools, and fast policy update workflows.

Prioritise end users. They bear the highest harm and have the least agency. They cannot adjust the safety layer. They cannot complain to you. If harmful content reaches them, they absorb the full impact with no recourse. But they never touch the API directly. Developers are the interface through which you protect them.

Design FOR end users. Design THROUGH developers.

Pain points

Pain point 1. Harmful content reaches end users. The model generates dangerous content even with benign inputs.
A user asks, “How does aspirin work?” and the model includes lethal dosage info. No attack. No adversarial intent. The model just generated something it should not have.
Pain point 2. Adversaries bypass safety controls. Jailbreaking, prompt injection, and encoded attacks. API means programmatic means thousands of automated attacks per hour. Even excellent output classifiers get bypassed when the input is adversarial enough.
Pain point 3. Sensitive data leaks. PII from training data, system prompt extraction, session bleed across users.

Prioritise pain point 1. Three reasons.

Severity is highest. Direct psychological, physical, or legal harm to end users who have no recourse.
Frequency is highest. Harmful outputs occur even with non-adversarial inputs.

And solving PP1 partially solves PP2 and PP3. Even if a jailbreak succeeds and input controls fail, output controls still catch harmful content. PII in output is a subcategory of harmful output. The prioritised pain point creates a cascade.

First principle breakdown

This is where most candidates fail. They jump straight to let us add a content filter.

That is a feature answer. A systems answer starts by understanding what the system actually does.

Trace what happens when a developer calls an AI API. Step by step. From scratch.

Step 1. A developer sends a request. A user message. Maybe an image. Maybe a document to process.
Step 2. But the model does not just see that user message. The system assembles a full context window. That includes the developer’s system prompt (their proprietary instructions that shape the model’s behaviour), the conversation history from previous turns, any documents retrieved via RAG, and outputs from any tools the model previously used. All of these are stitched together into one big input. This is what the model actually sees.
Step 3. The model processes this assembled context and generates a response token by token.
Step 4. That response goes back to the developer, who passes it to their end user.

Now ask the first-principles question. At which of these steps can something go wrong?

Step 1 fails when the input itself is adversarial. A jailbreak attempt disguised as a normal query. A prompt injection hidden inside a document that the model is asked to summarise.

Step 2 fails when the assembled context contains data it should not. PII sitting inside a RAG document that nobody scanned. A previous conversation turn is slowly steering the model off course over many turns. This is the stage most candidates never think about. They conflate input with what the model sees. Those are two different things. The input is what the developer sends. The context is what the model processes.

Step 3 fails when the model generates harmful content from perfectly legit inputs. The model is a probabilistic system. It does not need to be attacked to produce something dangerous. A user asks about chemistry, and the model volunteers synthesis instructions. Nobody attacked the system.

Step 4 fails when the response contains harmful content, leaked PII, or the developer’s system prompt is reproduced verbatim.

Two more failure points sit outside this lifecycle.

Before step 1. Who is even calling this API? If you do not know the caller and what they are allowed to do, every downstream safety decision is flying blind.

After step 4. What patterns are emerging across thousands of requests?

So, from first principles, six stages where safety must operate.

Stage 0. Identity and access. Who is calling?
Stage 1. Input analysis. What are they sending? Is it adversarial?
Stage 2. Context assembly. What does the model actually see?
Stage 3. Model behaviour. What rules constrain the model?
Stage 4. Output evaluation. What is going on? Should it be blocked?
Stage 5. Post-response learning. What patterns emerged? How do we improve?

For more such AI PM interview Questions, find out our AI PM Course - (PMs at Microsoft, Coinbase, Indeed & 600+ PMs rated 4.9/ 5). See testimonials and course details

Solution

Now that the framework is derived, fill in each stage with the specific control and the reasoning behind it.

Stage 0

Tier access system.

Tier 1 is any developer with maximum filter strictness.
Tier 2 is verified developers with configurable filters after business verification and use-case declaration.

Stage 1

Two-pass input classification.

Fast pass runs under 5ms on every request using pattern matching. Catches 70-80% of known attacks at near-zero latency.
Slow pass is an ML classifier running in parallel with model inference. Not before it. So it does not add latency for genuine requests. If flagged after the model starts, the response is blocked before delivery.

Why two passes? One expensive classifier on every request is a latency added on all users.

Stage 2

Two controls.

System prompt protection wraps every prompt in an immutable instruction plus output text-matching as a deterministic backup.
PII scanning before inference prevents the model from ever processing data it should not see. Because once PII enters the context, even output redaction is insufficient since the model’s behaviour is already influenced.

Stage 3

Two-layer policy architecture.

Layer 1 is immutable platform rules. No developer can disable them. Weapons, CSAM, terrorism, fraud. No use case justifies relaxation here.
Layer 2 is developer-configurable for contextual harms. Three settings per category: strict, moderate, permissive.
A medical app sets violence to moderate.
A children’s app sets everything to strict.

Stage 4

Synchronous-asynchronous split. Synchronous blocks catastrophic harms, PII, and system prompt leaks in under 50ms. Uses a small, fast classifier trained specifically on catastrophic categories. Asynchronous flags contextual harms, bias, and hallucination after delivery.

Cross question: Recall or Precision?

High recall catches everything harmful but blocks legitimate requests. Developers lose trust. High precision only blocks when confident. Misses edge cases but maintains trust.

Optimise for precision in the synchronous classifier. A false positive permanently damages developer trust. Use the async pipeline to catch false negatives retroactively. High precision, real-time. High recall over time. Best of both.

Success metrics

False Negative Rate on Catastrophic Harms is the North Star. Harmful content reaching end users. Target under 0.01%.
False Positive Rate measures over-refusal. Target under 2%.

If this changed how you think about our Job Ready AI PM Cohort
(12 Weeks, ~50 Sessions, ~100 Hours, ~10+ Products built, ~20 Hours of Interview Prep, 2 Mock Interviews) ~goes deeper. Live cohort. Cohort registrations open. Limited seats. Fill this Form to Show Interest

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass ( Here )

More Resources

Product Management Mock Interview (Detailed)
Crack AI Business Roles (AI Management Consulting, AI Category Management, AI General Manager, Revenue Planning, etc.) - Course Details
Crack AI Program Manager Roles - Course Details

Nobody reads the FEATURE_SPEC.md. What's the Solution?

Shailesh Sharma — Tue, 19 May 2026 02:18:24 GMT

A PM writes a functional spec in a markdown file. It is 400 lines long.

PM reads the first 50 and the last 20 and skims the rest. The middle 330 lines contain the business rules, edge cases, and constraint definitions that determine whether the feature works or does not.

Then sends it to engineering.

The engineering lead opens the file. He reads the overview section. Skims the requirements table.

Jumps to the technical constraints at the bottom. The 330 lines in the middle remain unread by a second person.

He hands it to the AI agent.

The agent reads every line. All 400 of them. It does not skip. It does not skim.

It implements exactly what the spec says, including the three contradictions between line 87 and line 312 that nobody caught because nobody read both lines.

Two weeks later, the feature ships with a critical defect. The PM blames the agent. Engineering blames the spec. The spec blames nobody because it is a markdown file, and markdown files do not defend themselves.

This is the most common failure mode in AI-native product development. And it did not start with AI.

The Problem That Existed Before Agents

Here is the uncomfortable truth. PMs have been skimming their own specs for years.

Markdown made specs easy to write. It did not make them easy to read.

A 400-line markdown file has no visual hierarchy beyond headers and bullets. No collapsible sections. No embedded mockups. No way to draw your eye to the three lines that matter most out of the 400.

Before agents, this was tolerable. Not good, but tolerable. Because the humans on the other side of the spec had a safety net.

Engineers asked clarifying questions. “What did you mean by this requirement?” “Does this apply to logged-out users too?” “This contradicts what you said in section 3.” Those questions caught the misalignments that skimming introduced.

Sprint reviews caught the rest. You shipped a version. It was wrong. You discussed it. You adjusted. The feedback loop was two weeks. The cost of misalignment was measured in sprints, not dollars.

AI agents removed every one of those safety nets.

The Spec-Driven Development Connection

We wrote about this problem in our article on Spec-Driven Development. The core argument was that a spec is a contract, not a document. It should be a behavioural specification that defines what the system must do, not a technical specification that prescribes how.

That argument stands. But it is incomplete.

A behavioural spec only works if someone reads it. A perfectly written contract that neither party reads is not a contract. It is paperwork.

Most PMs write behavioural specs in markdown. Those specs contain precise requirements. Constraint definitions. Edge case handling rules. Confidence thresholds. Fallback behaviours. All the things that separate a spec from a prayer.

Then nobody reads them.

The Solution: HTML as the Spec Layer?

AI agents generate HTML as easily as they generate Markdown. For the agent, the effort is identical. For the human, the difference is enormous.

An HTML spec can have collapsible sections. The PM sees high-level decisions first and drills into detail only where needed. Engineering sees the data model expanded with product context collapsed. Same document, different views, one handoff.

It can have colour-coded requirement statuses. Green for finalised. Yellow for needs-review. Red for placeholder. The PM sees at a glance which parts of the spec are done and which are still unfinished thoughts pretending to be requirements.

It can embed mockups inline. Not a link to Figma in another tab. A rendered visual sitting next to the requirement it represents. You see what you are specifying as you specify it.

It can use tabs to separate product context, behavioural requirements, technical constraints, and verification criteria. Each audience reads the tab that matters to them. Nothing gets lost crossing the handoff.

This is the Engagement-Quality Loop. Better readability leads to more engagement. More engagement leads to more edits. More edits lead to higher quality specs. Higher-quality specs lead to fewer implementation cycles. Fewer cycles mean lower compute cost.

The Costs and Limits

The token overhead is real. The ROI is better.

A markdown spec costs $0.03 to $0.10 in tokens. An HTML spec costs $0.10 to $0.40. Roughly 3 to 5x more.

But trace the cost through the chain. The markdown workflow leads to skimming, which leads to missed contradictions, which leads to rework cycles at $5 to $50 each. Two or three cycles, and you have spent $15 to $150 plus days of calendar time.

The HTML workflow costs $0.30 more upfront. You catch the contradiction before implementation. One cycle. $5 to $50. Done.

Less than 1% of tokens that most teams generate end up in production code. The rest goes into planning, iterating, and reworking.

The question is not whether the spec costs more tokens. The question is whether those tokens produce a spec someone actually reads.

The token overhead is noise. The rework is the signal.

When will markdown still win?

Short tasks where the output fits on one screen. Agent-to-agent handoffs where no human reads the document. Pure technical artefacts like type definitions and API schemas. If no human judgment is needed, use markdown. If human judgment is needed, use HTML.

A PM who does not spec is spending blindly. A PM who specs in a format nobody reads is spending with everyone’s eyes closed.

The Workflow

Five steps. Start tomorrow.

Spec in HTML. Prompt: “Create an HTML behavioural spec for [feature]. Collapsible sections. Colour-code requirements: green for finalised, yellow for review, red for placeholder. Embed mockups.”
Read your own spec. Open every section. Finalise every yellow. Remove every red. Do not send a spec with placeholder text to an agent. Every red line is a budget leak.
Identify decision nodes. Three to five points where human judgment matters more than agent capability. Business rules. Trade-offs. Prioritisation logic. These are the moments the PM earns their role.
Build micro-software for each node. Rule editors. Priority rankers. Trade-off sliders. Make decisions with full context. Cost: cents per tool. Value: precision instead of intuition.
Hand off clean. Fresh agent session. HTML spec as a single source of truth. One implementation cycle. Verify against the embedded checklist.

Total extra cost: $0.50 to $2.00 per feature. Total savings from eliminated rework: $10 to $200.

If this changed how you think about our Job Ready AI PM Cohort
(12 Weeks, ~50 Sessions, ~100 Hours, ~10+ Products built) ~goes deeper. Live cohort. Cohort registrations open. Limited seats. Fill this Form to Show Interest

More Resources

AI PM Course - (PMs at Microsoft, Coinbase, Indeed & 600+ PMs rated 4.9/ 5)
See testimonials and course details
Crack AI Business Roles (Consulting, Category Management, General Manager) - Course Details
Crack AI Program Manager Roles - Course Details

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass ( Here )

5 AI Questions Every Product Manager is getting Asked

Shailesh Sharma — Fri, 15 May 2026 21:16:21 GMT

She had eight years of product experience. Three products with 10 million-plus DAU. Six weeks of dedicated interview prep.

She got eliminated in Round 2.

Not by a trick question. Not by a culture fit screen.

By a single follow-up about how she would evaluate whether a RAG pipeline was retrieving the right documents.

She knew what RAG was. She could draw the architecture diagram.

But the interviewer did not want a diagram. He wanted to know what she would do when the retriever pulled the wrong documents at 2 am on a Tuesday and support tickets started spiking.

She did not have that answer.

Here is the part that should concern you. She was not underprepared. She was prepared for the wrong interview.

The AI PM interview in 2026 is a fundamentally different game. Most candidates are still running the old playbook.

CIRCLES. RICE. Feature prioritisation matrices. These are table stakes now.

Nobody gets hired for knowing them. You just get disqualified for not knowing them.

The real filter is AI depth.

I have spent two years shipping AI products in production. I have also helped 1000+ PMs prepare for AI roles. See testimonials

The pattern is impossible to miss. Five concepts keep showing up. Not ten. Not twenty. Five.

The dangerous part is not that PMs have never heard of them. Most have.

The dangerous part is that most PMs know these concepts at exactly the depth that gets them eliminated.

Let me show you what I mean.

1. RAG (Retrieval-Augmented Generation)

You probably already know what RAG is.

An LLM does not know everything. It hallucinates. It has no access to your company’s internal data. So you add a retrieval step. You search a vector database for relevant documents first. Then you feed those documents to the LLM as context. The LLM generates an answer grounded in actual information instead of its training data.

That explanation is correct. It is also the exact answer that gets you a polite nod followed by a harder follow-up.

RAG has a dirty secret. The retrieval step fails silently. The LLM does not tell you it received the wrong documents. It generates a confident, articulate, completely wrong answer. Your user has no idea. Your metrics might not catch it for weeks.

So the real question is not what RAG is. The real question is what you do when the R in RAG stops working.

How do you measure retrieval quality separately from generation quality?
When do you chunk documents into smaller pieces versus keeping them whole.
What happens when the user query is ambiguous, and the retriever returns five documents that are each partially relevant, but none exactly right?
What happens when you stuff too many documents into the context window and the LLM starts ignoring the important ones because of lost-in-the-middle effects?

These are the questions that separate PMs who have read about RAG from PMs who have debugged RAG in production.

Interviewers expect you to go deeper than this.

Sample Interview Questions

Q1. You are building a customer support chatbot using RAG. Users report that 30% of answers are irrelevant. How would you diagnose whether the problem is in retrieval or generation?
Q2. Your RAG system retrieves the correct document, but the LLM still produces an incorrect answer. What could be going wrong, and how would you fix it?
Q3. A stakeholder wants to add RAG to a feature that currently uses a fine-tuned model. How would you evaluate whether this is the right architectural decision?
Real AI PM Interview Questions (With Detailed Solution) Here

2. Evals (AI Evaluation)

Here is a question that sounds easy.

Your model accuracy improved from 84 % to 91 %. Should you ship it?

Answer it in your head right now.

If your instinct was yes, ship it; accuracy went up, you just failed the interview.

If your instinct was it depends, good. But the interview is only beginning. Because the next question depends on what. And that is where most candidates fall apart.

Most PMs treat evaluation as a checkpoint. Model hits a number. You ship. But AI evaluation is not a checkpoint. It is a continuous argument between what the model does well and what the business actually needs.

There are two worlds of evals, and most PMs only live in one.

Offline evals measure model performance before deployment. You run test datasets. You calculate precision, recall, and F1. You compare against baselines. This world feels safe. The numbers are clean. The comparisons are neat.

Then there is the second world. Online evals. What happens after deployment? User satisfaction. Task completion rates. Time to value. Edge cases your test data never imagined. The queries that real humans type at 11 pm on their phones look nothing like your curated evaluation dataset.

The gap between these two worlds is where AI products go to die.

A model can score 95% on your offline eval set and still make users miserable. Your eval set was built by engineers who write clean, well-structured queries. Your actual users write things like why is this broken and fix it and paste in screenshots of error messages.

The PM who wins the interview connects eval metrics to business outcomes. Not accuracy went up 7%. Instead. Accuracy went up 7%. Did user satisfaction improve? Did support tickets decrease? Did the revenue metric move? If you cannot draw that line from model metric to business metric, you will hear the worst four words in any interview. Let us move on.

Interviewers expect you to go deeper than this.

Sample Interview Questions

Q1. You are the PM for an AI content moderation system. Precision is 97 percent but recall is 72%. The policy team wants a higher recall. The UX team is worried about false positives. How do you navigate this tradeoff?
Q2. Design an evaluation framework for an AI feature that recommends products. What offline and online metrics would you track, and how would you decide when the model is ready to ship?
Q3. Your A/B test shows the new model has 3% better accuracy but 15% higher latency. How do you make the ship or no-ship decision.

Watch Sample Video on Advanced Evals

3. Fine-Tuning vs Prompting

This is the concept where interviewers separate PMs who have shipped AI from PMs who have read blog posts about AI.

The question usually lands like this. Your AI feature is producing mediocre outputs. You have three options. Better prompts. Fine-tuning. A larger model. How do you decide?

Most candidates give a vague cost-benefit answer. Fine-tuning is more expensive but more accurate. Prompting is cheaper but limited. That answer is technically correct. It is also useless to the interviewer. Because they already know the textbook tradeoffs. What they want to hear is your decision framework. The actual sequence of steps you would follow.

Here is one that works.

Start with prompting. Always. Every single time. It is free to experiment with. You can iterate in hours, not weeks. A well-crafted prompt with good examples solves 80% of problems that PMs instinctively want to throw fine-tuning at.

But sometimes, prompting plateaus. You have tried ten prompt variations. You have added a few-shot examples. Quality is stuck at decent, and your users need excellent. Now you have a real decision to make.

Ask one question. Why is the model failing?

If the model is missing domain-specific knowledge or terminology, fine-tuning is probably the answer. A general model does not know your company’s product taxonomy. It does not understand your industry’s jargon. Fine-tuning teaches it patterns it has never seen.

If the model simply is not capable enough for the reasoning required, try a larger model first. Sometimes the problem is not knowledge. It is raw intelligence. A bigger model might handle the complexity without any fine-tuning at all.

Here is the layer that most PMs miss entirely. This is not just a technical decision. It is a product and business decision.

Fine-tuning means you now own a model. You need labelled training data to build it. You need ML engineers to maintain it. You need to retrain it when the underlying data distribution shifts. You have just taken on a recurring operational cost that compounds with every model update.

As a PM you need to justify that investment. What is the incremental quality gain? Is it large enough to warrant the ongoing maintenance burden? Could you get 70 % of the benefit with a smarter prompt and zero operational overhead?

That is the thinking interviewers want to hear. Not fine-tuning is better. But here is how I would make the decision and here is what I would measure to validate it was correct.

Interviewers expect you to go deeper than this.

Sample Interview Questions

Q1. You are the PM for an AI writing assistant. Users complain that outputs feel generic and do not match their brand voice. Walk through how you would decide between prompt engineering, few-shot examples, and fine-tuning.
Q2. Your team fine-tuned a model six months ago. Performance has degraded. What could be causing this, and what is your plan?
Q3. A competitor just shipped a similar feature using GPT-4o. Your team uses a fine-tuned, smaller model that is cheaper but less capable. How do you think about this competitive dynamic?
Real AI PM Interview Questions (With Detailed Solution) Here

4. Agents

Pay close attention to this one. If you are interviewing in the second half of 2026, agents will likely be the longest segment of your interview.

Every major tech company is building agent-based products right now. Not planning them. Building them. Shipping them. And they need PMs who understand the product challenges that agents create. Not the engineering challenges. The product challenges.

Here is the core distinction.

A regular LLM call is a one-shot interaction. You send a prompt. You get a response. Done.

An agent is a loop. It receives a goal. It breaks the goal into steps. It executes the first step. It observes the result. It decides what to do next. It keeps looping until the goal is achieved or it determines it cannot proceed.

That loop is where everything gets interesting.

Because an agent makes decisions. Autonomously. Without asking you. It might send an email on your behalf. It might delete a file it considers irrelevant. It might book a flight for the wrong date because it misinterpreted what next Friday means in context.

The technology question is how agents work. Any PM can read a LangChain tutorial and answer that. The product question is much harder and much more valuable in an interview.

How much autonomy should the agent have?
Where do you insert human checkpoints?
When should it ask for permission versus making a judgment call on its own.
How do you design an experience where the user feels in control even though the agent is doing all the work?

Here is the tension that makes this concept so rich for interviews. Users want agents to be autonomous. That is the entire value proposition. Do this for me so I do not have to think about it. But users also want to feel safe. They want confidence that the agent will not do something catastrophic or irreversible.

Speed and safety pull in opposite directions. The PM who can navigate that tension with clear product principles will stand out in every single interview.

The most common mistake candidates make is treating agents as a pure engineering conversation. Talking about function calling, tool schemas and ReAct patterns. That is the implementation layer. Interviewers want the product layer. The experience design. The trust model. The failure recovery flow.

Sample Interview Questions

Q1. You are building an AI agent that helps users book travel. The agent can search flights, compare prices, and make reservations. Design the user experience for when the agent books the wrong dates.
Q2. Your AI agent completes tasks autonomously, but users report feeling out of the loop. How would you redesign the experience to build trust without slowing the agent down?
Q3. An agent-based feature has a 78 per cent task completion rate. Users love it when it works, but are frustrated when it fails. How do you decide whether to ship broadly or keep it in limited access?

5. Guardrails

Here is a question most PMs never think about until it is too late.

What is the worst thing your AI product could do?

Not the most likely failure. The worst. The one that ends up as a screenshot on social media within hours. The one that gets your CEO pulled into a meeting with Legal at 7 am. The one that makes users question whether they should trust anything your product says ever again.

If you have shipped AI in production, you already know this feeling. Because it either already happened to you or you watched it happen to a competitor and thought that could have been us.

Guardrails are how you prevent those moments. They are the systems you build to stop your AI from generating harmful content, leaking private data, producing confidently wrong information, going off topic, or being manipulated by adversarial users who know exactly how to break your system.

Here is why this topic is deceptively difficult.

The naive approach to guardrails is to block everything that looks risky. Restrict the model outputs aggressively. Add filters on every response. Flag anything that seems remotely problematic.

That approach solves the safety problem. It also kills the product. Users start hitting walls on legitimate queries. The AI becomes so cautious it is useless. You have traded one crisis for another. Instead of a harmful output going viral, you have a product that nobody wants to use because it blocks everything.

The real challenge is surgical precision. Block the dangerous outputs. Let the legitimate ones through. Do it fast enough that the user never notices a filter is running behind the scenes.

The layered approach works. Input filters catch bad requests before they reach the model. System prompts constrain the model behavior at the instruction level. Output filters catch problematic responses before they reach the user. Human review handles the edge cases that automated systems miss.

But the product question is always the same. Where do you draw the line.

A guardrail that is too strict blocks 15 percent of legitimate queries and your users leave for a competitor. A guardrail that is too loose lets one harmful response through and your product is trending for the wrong reasons.

That calibration problem is the entire interview. Not what are guardrails. But how do you set them. How do you measure whether they are working. And how do you adjust them when you get it wrong.

Interviewers expect you to go deeper than this.

Sample Interview Questions

Q1. You are launching an AI chatbot for a healthcare company. What guardrails would you implement and how would you prioritise them given a tight launch timeline.

Q2. Users have discovered they can manipulate your AI assistant through indirect prompt injection via pasted text. How do you approach this as a product problem, not just an engineering problem.

Q3. Your guardrails are blocking 12 percent of legitimate user queries. Engineering says tightening the filters further will increase false positives to 18 percent. How do you handle this.

The Pattern You Need to See

Now here is the honest question.

Go back to the 15 sample interview questions in this article. Read them carefully. For each one, ask yourself whether you could give a structured and confident answer that would satisfy a senior interviewer at a top tech company.

If you could answer 12 or more with real depth, you are in strong shape. Keep sharpening.
If you could answer 8 to 11, you have gaps. Targeted gaps. The kind that a focused two-week sprint could close.
If you could answer fewer than 8, the problem is not intelligence. You are smart enough to be reading this article. The problem is exposure.

Nobody has shown you how these concepts actually play out in real product decisions. You have been learning definitions when you should have been practising trade-offs.

That gap is exactly why we built the AI PM course. 800+ PMs have taken it. 4.9 out of 5 rating

The interview has changed. Your preparation should too.

Become an AI Product Builder | 100 Hrs+ Learning

This is not a course you watch passively. It is a program you go through with a cohort of other PMs. You get office hours. You get demo sessions.

The AI Product Manager Builder 2.0 is a 12-week cohort program. 45 plus classes. Hands-on demos. Interview prep sessions every week. A capstone project. 10+ real-world projects you can add to your portfolio.

Apply for Cohort

About Author

Shailesh Sharma - I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass (Here)

Become AI Product Builder (starting from Zero)

Shailesh Sharma — Mon, 11 May 2026 20:35:23 GMT

The workflow nobody taught you.

I built technomanagers.in in one day.

Full product. Frontend, backend, course catalogue, question bank, deployment. One day.

Link — https://technomanagers.in/

I am not an engineer. I am a product manager.

This is what a building looks like in 2026.

The tools have changed so fundamentally that the gap between idea and shipped product is no longer months. It is days. Sometimes hours.

In 2026, the PM who gets hired is the one who builds. Not manages. Builds.

This article walks through the full pipeline.

Problem discovery to launch. Every step uses tools that did not exist 18 months ago.

If you are a PM, engineer, or builder trying to ship an AI product this year, this is the roadmap.

Step 1. Spot the problem. Validate it fast.

Most AI products fail at step zero. They start with a technology and go looking for a problem. We should build something with RAG. We should make an agent. Backwards.

Start with the problem. But how you validate has changed.

Pick a domain you understand. Not AI for healthcare. Something specific.

Radiologists spend 40 minutes per scan writing structured reports. Observable. Measurable. Worth solving.

Now, validate it in 30 minutes. Open Claude. Run a research session.

Sample prompt:

I am validating a product idea. The problem is that radiologists spend 40 minutes per scan writing structured reports. I need you to do the following. First, find 5 existing products that solve this problem today. Second, find evidence from radiology forums or medical publications that this pain point is real and widespread. Third, estimate the addressable market size using publicly available data on the number of radiologists and scans per year. Fourth, list three reasons this problem might not be worth solving.

That last line matters. You are not looking for confirmation. You are stress-testing the hypothesis.

The output of this step is a one-page problem brief in markdown.

Four sections.

→Problem statement.
→ Evidence it exists.
→ Who has it?
→ What they do today.

If you cannot fill that page with real evidence, move on.

Step 2. Competitor research with agents

This is where the 2026 flow diverges most from the old one.

Traditional competitor research meant 15 browser tabs, free trial signups, and a comparison spreadsheet built over two days.

Now you build a lightweight agent that does the data collection in hours.

Sample prompt for Claude with web search:

I am building an AI product for radiology report generation. Research the following competitors: Nuance DAX, Rad AI, DeepScribe, and Ambra Health. For each one extract the following. Target customer segment. Core AI capability. Pricing model if publicly available. Key differentiator. Weaknesses mentioned in user reviews. Present this as a structured comparison table.

The agent does not replace your analysis. It replaces the manual collection. You still look at the output and ask the hard questions.

Where are the gaps? Where are competitors overserving? Where is there a segment nobody is building for?

Save the output as competitor-matrix.md. It lives alongside your problem brief. Both are markdown. Both feed into the next step.

If you want to learn how to build these agent workflows from scratch, the Technomanagers AI PM course covers everything. 800+ students. 4.9 out of 5 rating.

Step 3. Talk to users. AI does not replace this.

You have a validated problem and a competitor landscape. Now talk to actual humans.

No agent replaces a 30-minute conversation with someone who has the problem you are solving. But how you prepare and synthesise has changed.

Before the call, generate a targeted research script.

Sample prompt:

Here is my problem brief [paste problem-brief.md]. Here is my competitor matrix [paste competitor-matrix.md]. Generate 10 user interview questions that specifically probe the gaps I identified in the competitor landscape. Focus on workflow pain points, current workarounds, and willingness to pay. Avoid generic questions.

After five conversations, paste all your notes into Claude.

Sample prompt:

Here are my notes from 5 user interviews about radiology report generation [paste notes]. Extract the following. Top 3 recurring pain points ranked by frequency. Contradictions between what users say they want and what they actually do. Unmet needs that no current competitor addresses. Willingness to pay signals.

The key discipline is triangulation. User interviews say one thing. Competitor gaps say another. Usage data says a third. The truth is in the overlap.

Step 4. Write the spec in markdown

This step separates the 2026 builder from the traditional PM.

You do not write a 20-page PRD in Google Docs. You write a product spec in markdown. In Claude.

A markdown spec is machine-readable. It can be fed directly into Claude or Cursor to generate functional code. A Google Doc cannot.

The format of your spec determines the speed of your prototype. This is an architectural decision, not a style preference.

The spec has five sections.

Section 1. Problem and user.
One paragraph pulled from your problem brief.
Section 2. Core workflow.
The 3 to 5 steps the user takes to get value. Not features. Steps. User uploads a scan. System extracts findings. The system generates a structured report. User reviews and edits. System learns from edits.
Section 3. Technical architecture.
What model. What retrieval strategy? Input and output formats. Where the data lives. If you cannot write this section, you do not understand your own product.
Section 4. Eval criteria.
How will you know if it works? Precision. Recall. Latency. Hallucination rate. Defined before you build.
Section 5. Out of scope.
What you are deliberately not building in v1. This section saves more time than any other.

Save it as product-spec.md.

Step 5. Build the prototype in Claude

Take your product-spec.md. Open Claude. Paste the entire spec.

Sample prompt:

Here is my product spec [paste product-spec.md]. Build a functional prototype of this application. Use React for the frontend. Use Python with FastAPI for the backend. For the RAG component, use a vector database with cosine similarity search. Generate the full codebase, including file structure, all components, API endpoints, and database schema. Make it deployable.

Claude generates a working application from a well-written spec. Not a mockup. Not a wireframe. A working product.

The quality of the prototype is a direct function of the quality of your spec. Vague spec, vague output. Precise spec, precise output.

This is where the .md format pays off.

Then you iterate.

Sample follow-up prompts:

Change the retrieval to a hybrid search combining keyword and semantic matching.
Add error handling for cases where the API returns empty results or the context window is exceeded.
Refactor the output to match this JSON schema [paste schema].

Each iteration takes minutes. Within a few hours, you have something you can put in front of users. Not a deck. A working product.

Step 6. Build evals before you launch

Most AI products ship without evals. Single biggest mistake in AI product development.

Evals answer one question. Is the AI actually working?

For a RAG product, your metrics are Precision at K, Recall at K, and MRR.
For generative output, hallucination rate, relevance, and coherence.
For an agent, the task completion rate, step accuracy, and cost per task.

Define before launch. Automate. Set thresholds below which the product does not ship.

Sample prompt:

I have a RAG-based radiology report generator. Generate a Python evaluation script that tests the following. Precision at 5 for retrieved context chunks. Recall at 5 for relevant medical findings. Mean Reciprocal Rank for the top result. Hallucination detection by comparing generated text against source documents. Use a test set of 20 sample queries with known correct answers that I will provide.

This separates a demo from a product.

Step 7. Production hardening

Prototype works. Evals pass. Now harden it.

Latency

If your pipeline takes 8 seconds, users leave. Target under 2 seconds. Optimise chunking, cache frequent queries, and pick the right model size.

Cost

Every API call costs money. A prototype at 2 dollars per session is a burn rate, not a product. Find where smaller models work, where you can cache, and where you can batch.

Error handling

What happens when the model returns garbage? When retrieval finds nothing. When the API goes down, every failure mode needs a graceful fallback.

Monitoring. Log inputs, outputs, latencies, costs. Build dashboards that catch quality degradation before users notice.

Can you actually do all of this?

Each step is learnable. Each step uses tools available right now.

If you want to go from reading to shipping, the AI Product Builder Cohort is built for exactly this pipeline.

If you prefer self-paced, the Technomanagers AI PM course covers every concept here. 800+students. 4.9 rating. It is the foundation the cohort builds on → See testimonials and course details

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass ( Here )

How Would You Build Google Photos’ New Wardrobe Feature?

Shailesh Sharma — Sun, 03 May 2026 03:13:55 GMT

Before we move ahead, you can also read the following articles

You own roughly 80 pieces of clothing. You remember maybe 15. You rotate through 7.

The problem is not the clothes. The problem is that your closet has no search bar.

Google Photos just shipped a feature that fixes this. It scans your photo library, finds every piece of clothing you have ever worn in a picture, catalogues it, and turns your gallery into a searchable digital closet. Filter by category. Mix and match pieces into outfits. Try the whole thing on virtually before you get dressed.

That is the product. Now, let us build it.

As an AI Product Manager, you would get such problem statements.

In our previous pieces, we covered how TikTok uses session-based RNNs and why position bias destroys your ranking quality. This one is about a different kind of AI system.

Not one model does one job. Five models chained together, where every failure cascades downstream.

If you are preparing for AI PM interviews, we will prepare you with Real AI PM Interview Questions. We cover this more in our course.

The Problem Statement

Given a user’s unstructured photo library, build a system that:

Identifies every unique garment the user has worn in any photo
Creates a clean catalogue with one entry per garment
Let’s users combine items into outfits
Shows users how the outfit looks on their body

The input is chaos. Vacation photos, selfies, group shots, screenshots, food pictures, photos where you are partially visible, and photos where your jacket covers your shirt.

The output must be a clean wardrobe.

This is not one model. It is five models chained together.

Stage 1: Find the Photos That Matter

A typical library has 1000 images. Maybe 400 contain you wearing visible clothing. The system needs to find those 400 and discard the rest.

Google Photos already detects and groups faces through its People feature. That tells the system which photos contain you. But face detection is not enough. You need body pose estimation.

The system runs a pose estimator to map body joints. Shoulders, elbows, wrists, hips, knees, ankles. These key points answer two questions.

How much of your body is visible?
And where does each garment zone fall in the image?

The region between the shoulders and hips is a top. Between hips and the ankles is the bottom.

If only your face and shoulders are in frame, there is no point in trying to extract pants.

The PM decision here is the visibility threshold.

How many keypoints must be detected for the photo to enter the pipeline?
Too strict and you miss casual selfies that still show a good shirt.
Too lenient and you flood every downstream stage with blurry, occluded garbage.

Stage 2: Segment Each Garment

Each photo might contain multiple garments. A shirt, a jacket, pants, and earrings. The system needs to isolate each item at the pixel level.

This is instance segmentation.

Not “there is clothing in this image” but “these exact pixels belong to this shirt, and those exact pixels belong to that jacket.”

The model takes an image and outputs bounding boxes and pixel masks, each labelled with a garment category.

Here is where the first real complexity shows up. Occlusion.

A jacket covers a shirt. A scarf drapes across a jacket. A bag strap cuts across your torso.

The naive answer is to only extract fully visible items. This fails.

If you always wear a jacket over a particular shirt in photos, that shirt never enters your wardrobe.

The better approach is to extract all detected garments and attach a visibility score.

Visibility Score = Visible Pixels / Estimated Total Pixels

A shirt with 80% visible scores 0.8. A shirt 30% hidden behind a jacket scores 0.3. This score determines which photo gets chosen as the representative thumbnail later.

Stage 3: The Hardest Problem in the Pipeline

You now have roughly 1,000 segmented garment patches across 400 photos. Many are the same physical item photographed in different conditions. Your favourite blue shirt appears in 40 photos. Different lighting. Different wrinkles. Different backgrounds. Different angles.

The system needs to know that these are all the same shirt.

This is a visual re-identification problem. The same class of problem that security systems use to track a person across multiple cameras.

The naive approach is pixel-level image similarity. Compare the raw pixels of two garment patches. This fails immediately.

The same white shirt photographed indoors under warm lighting looks golden. Outdoors, it looks blue-white. Against a dark background, it appears brighter. Pixel similarity would call these three different shirts. They are the same shirt.

The correct approach is to learn a garment embedding.

You pass each segmented garment through a feature extraction network.

A CNN or Vision Transformer fine-tuned on fashion datasets like DeepFashion. The network outputs a compact vector, say 256 dimensions, that captures the garment’s identity. Its colour, texture, pattern, cut, and structure. Not the lighting. Not the background. Not the wrinkles.

Two patches of the same shirt, photographed in completely different conditions, should produce vectors that are close together in this 256-dimensional space. Two different shirts should produce vectors that are farther apart.

Similarity(A, B) = cosine(Embedding(A), Embedding(B))

If cosine similarity exceeds a threshold, say 0.85, the system treats them as the same garment.

This threshold is the single most important PM decision in the entire pipeline.

Set it too high at 0.95, and the system creates duplicates. Your blue shirt appears four times because slightly different photos produced slightly different embeddings.

Set it too low at 0.70, and the system merges two genuinely different items into one. Your navy polo and your navy crew-neck collapse into a single entry.

Which mistake is more tolerable?

For a consumer product, false merges are worse. Showing two entries for one shirt is a mild annoyance. Deleting a unique garment by merging it with another item is data loss. You cannot undo it without rerunning the pipeline.

Bias the threshold higher. Accept some duplicates. Give users a manual merge option in the UI.

Stage 4: Cluster and Build the Catalogue

With 1000 embeddings, the system clusters them. Each cluster represents one unique garment. DBSCAN works well here because it does not require specifying the number of clusters in advance. It finds natural groupings based on embedding distances.

From each cluster, select a representative thumbnail. Highest visibility score. Highest resolution. Best lighting.

But a raw crop from a photo still has a messy background. Your shirt thumbnail would show a slice of a restaurant behind you.

The system takes the segmented garment mask and runs an inpainting model. It generates a clean thumbnail on a neutral background, fills in any occluded parts of the garment, and removes everything else.

The output: a clean, catalogue-style image of each garment. That is what you see in the Wardrobe UI.

Stage 5: Classification

Each garment needs a category label. Tops, bottoms, dresses, outerwear, skirts, jewellery, and footwear.

The same feature extraction network from Stage 3 can feed into a classification head. Standard multi-class classification.

The design question that matters here is whether classification should run before or after clustering.

If you classify first, the category becomes a hard constraint during clustering. A shirt and a jacket can never merge, regardless of how similar their embeddings are. This eliminates absurd false merges. But classification errors now propagate. If a blazer gets mislabelled as a shirt, it will never match with other photos of that blazer correctly labelled as outerwear.

If you run classification and clustering in parallel, you can use category agreement as a soft constraint. Same category plus high embedding similarity means high-confidence merge. Different categories, plus high embedding similarity, mean flagged at a higher threshold.

The parallel approach is more robust. It lets one model compensate for the other’s mistakes.

The Virtual Try-On

You select a top and a bottom from your wardrobe. You tap Try it on.

The system generates a photorealistic image of you wearing that combination.

Google did not build this for Photos. This technology already existed in Google Shopping, where it worked on billions of product listings. The PM reused it.

The underlying system is a diffusion-based image generation model built specifically for fashion. Here is how it works.

The model takes three inputs.

A photo of you.
A garment image.
A body pose estimate from Stage 1.

First, 2D warping. The garment pixels get mapped onto the region of your body where that garment would sit. The pose estimate locates your shoulders, torso, and hips. The segmentation mask provides shape and texture.

But simple warping produces artefacts. Sleeves do not match arm positions. Fabric does not fold correctly around your body.

The diffusion model takes over. It receives the warped image plus a segmentation map of the missing regions and generates the final output. Realistic fabric folds, shadows, and draping.

The critical constraint: the model preserves the exact texture, colour, and pattern of your actual garment. It only generates the physics of how that fabric interacts with your body. This is not “put a blue shirt on this person.” This is “take this specific shirt with this specific weave and show how it drapes on this specific body in this specific pose.”

This is why the output looks like your actual clothes. Not generic AI-generated clothing.

Outfit Compatibility

There is one more model that the press release does not mention, but the product requires.

When a user mixes a top with a bottom, the system should score whether the combination works visually. This is a compatibility scoring problem.

The approach: train a model on labelled fashion datasets where outfit combinations are rated as compatible or incompatible. Each item has an embedding from Stage 3. The compatibility model takes two or more garment embeddings and outputs a score between 0 and 1.

Compatibility(Top, Bottom) = sigmoid(W · concat(Embedding_top, Embedding_bottom) + b)

A navy blazer with khaki chinos might score 0.9. A navy blazer with basketball shorts might score 0.2.

The PM decision is whether to surface this score or use it passively.

Surfacing it as this outfit scores 4 out of 5 risks being wrong and annoying. Using it passively to sort moodboard suggestions by compatibility adds value without overcommitting.

The safe answer is passive integration. Let the user feel the algorithm without seeing it.

What Metrics will you track?

Most PMs would track whether users open the Wardrobe tab. That tells you almost nothing. Here is the framework that actually measures success.

Does the user come back?

Wardrobe retention at Day 7 and Day 30. Not Google Photos retention. Wardrobe-specific retention. A user who opens the Wardrobe tab once and never returns means the catalogue was not useful. A user who returns weekly is planning outfits. That is the behaviour you want.

Track the ratio of outfit creations to wardrobe visits. If users open the wardrobe but never create an outfit, the catalogue is interesting, but the mix-and-match experience is not compelling. Target at least 30 per cent of wardrobe visits resulting in an outfit action (create, save, or share).

Does Try It On convert?

Try It On usage rate among outfit creators. If users create outfits but never tap Try It On, the feature is either hidden or not trusted. Target at least 40 per cent of outfit creators using Try It On at least once.

Try It On completion rate. Does the user look at the generated image for more than 3 seconds? Do they save it or share it? If they generate a try-on and immediately dismiss it, the output quality is not meeting expectations.

Try It On repeat rate. Users who try it once and never again do not trust the result. Users who try it multiple times per session are engaged. A healthy repeat rate means the diffusion model is producing outputs that feel real.

If this breakdown changed how you think about multi-model pipelines, cascade error budgets, and AI feature design, you will find much more depth in our AI PM course. We cover RAG architectures, evaluation frameworks, and real interview questions from top companies.

Check our highest-rated AI PM course (Including AI PM Interview Preparation) · 4.9/5 · 600+ enrollments → See testimonials and course details

About Author

Shailesh Sharma - I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass (Here)

Subscribe to get the FREE Book (AI & Tech Simplified), Link in Welcome Email

Two Versions of Every Click

Shailesh Sharma — Sun, 26 Apr 2026 18:31:44 GMT

Every click in your training data tells two stories.

Your model hears one. The user lived the other.

They are not the same story. And the gap between them is destroying your feed.

We are going to show you both versions, side by side, for every stage of the recommendation pipeline. By the time the two stories merge, you will understand position bias deeper than any offline metric can show you.

In our previous pieces, we covered how TikTok uses session-based RNNs and why recommenders suffer from catastrophic forgetting. This one is about a different failure. Your model remembers clicks. But it misremembers why they happened.

If you are preparing for AI PM interviews, position bias comes up constantly in recommendation system design rounds. We cover this and more in our course.

The Click

What your model recorded

User opened the app store. Saw App X at position 1. Clicked. Label = 1. App X is relevant.

What actually happened

User opened the app on the metro. Had 40 seconds before their stop. App X was the first thing on screen. They did not scroll. Did not compare. Saw one thing. Tapped it.

If App Y had been at position 1, they would have tapped App Y.

The click had nothing to do with App X. It had everything to do with position 1.

The Data

What your model sees:

Position 1 CTR is 0.25. Position 5 is 0.10. Position 15 is 0.03. Items at position 1 are 8x better than items at position 15.

What is actually true

95% of users see position 1. Maybe 60% reach position 5. Fewer than 20% get to position 15.

A great app at position 15 gets 0.03 CTR because nobody saw it. A mediocre app at position 1 gets 0.25 because everyone saw it.

The model has no way to tell these apart. It calls both numbers a preference.

The Loop

What your model believes

It trains weekly. Every cycle, the data confirms that items at the top are the best items. Their CTR is highest. Their scores go up. They stay at the top. The system is stable. The system is working.

What is actually happening?

Top items get more clicks because they are at the top. Those clicks inflate their CTR. Inflated CTR keeps them at the top. Next week, same thing.

A genuinely excellent app debuted at position 12 three weeks ago. CTR of 0.04. Not because users disliked it. Because most users never scrolled that far. The model scored it low. It dropped to 16. Then it disappeared.

Your feed is not ranking by quality. It is ranked by inertia.

This has two costs.

First, diversity dies. The same items win every cycle. New items cannot break through. Your feed feels stale. Engagement decays.
Second, revenue leaks. Your ranking function is f(CTR, bid). If CTR is inflated by position, you are overvaluing items that sit high and undervaluing items that bid high. That is money lost daily.

The Usual Fix (And Why It Fails)

Every team eventually notices something is off. The standard fix is simple. Add position as a feature.

The model sees [user features, item features, context, position]. It learns that clicks at position 1 should be discounted. Sounds reasonable.

What the team believes this achieves

The model now accounts for position. Bias is handled.

What actually happens at inference time?

The model needs a position value to produce a score. But the position has not been decided yet. That is what the ranking is supposed to determine.

You cannot feed a position as input when the position is the output.

So the team picks a default. Position 1 for all items. Or position 5. Or position 9.

They try position 1. They get Ranking A. They try position 5. They get Ranking B. Completely different. Different items in the top 5. Different user experience.

The ranking depends entirely on a number someone picked arbitrarily.

You are now running AB tests to find the best magic number. And the best number for one scenario does not transfer to another. The approach is a dead end.

The Question That Closes the Gap

Here is the question that resolves the two stories into one.

When a user clicks, what are they actually telling you?

Two things. Fused into a single signal.

First: I saw this item. This depends on the position. Position 1 is almost always seen. Position 20 has maybe a 10 per cent chance. This has nothing to do with the item.

Second: I wanted this item. This depends on the user, the item, and the context. This has nothing to do with position.

Every click is the product of these two.

P(click) = P(saw it) x P(wanted it, given I saw it)

Your model treats this product as one number. That is the entire problem. The fix is to split it.

How to solve this?

We can build two modules. Trains them together. Deploys them apart.

Module 1 is ProbSeen.

One input: position.
One output: the probability the user saw the item at that position. Think of it as a small curve. Position 1 outputs 0.95. Position 20 outputs 0.12.

Module 2 is pCTR.

Inputs: user profile, item features, context.
Output: the probability the user would click if they had seen it.

Position never enters pCTR.

During training, predicted click = ProbSeen x pCTR. This is compared against the actual click label. Standard cross-entropy loss.

Here is what makes it work. Both modules share the same loss. They train jointly. Gradients flow through both.

When the model sees that position 1 items get clicked more, the shared gradient forces a split. How much of that signal is visibility? ProbSeen takes it. How much is genuine preference? pCTR takes it.

Neither module can steal the other’s signal. Both are accountable for the same loss. The separation is automatic.

Why not train them separately? Because separate losses mean separate objectives. ProbSeen might absorb preference. pCTR might absorb position. The boundaries blur. Joint training forces a clean separation through coupled gradients.

At inference time, you throw away ProbSeen. You deploy only pCTR.

No position input needed. No default value. No magic number.

What your model now records: This user would click this item, regardless of where it is shown.

What actually happened: Same thing.

The two stories are finally one.

If this article changed how you think about position bias, CTR modelling, and ranking quality, you will find much more depth in our AI PM course Case Studies. (42+ Videos & 25+ Case Studies)

Check our highest-rated AI PM course (Including AI PM Interview Preparation) · 4.9/5 · 600+ enrollments → See testimonials and course details

About Author

Shailesh Sharma - I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass (Here)

How Anthropic PMs Ship Features in 45 Minutes (Without Writing PRDs)

Shailesh Sharma — Sat, 25 Apr 2026 09:46:17 GMT

In this artcile we will see the exact workflow that is being used by the Top Companies like Anthropic, Shopify, etc

The average Product Manager spends 40% of their week writing Jira tickets, updating roadmaps, and arguing over edge cases in 15-page Product Requirements Documents (PRDs).

At elite AI labs like Anthropic, Senior PMs spend exactly 0% of their week doing this.

Instead, a PM has an idea. They write a concise, 3-paragraph Product Note.

They drop it into an automated agentic workflow. 45 minutes later, there is a functional, tested Pull Request (PR) waiting in GitHub for engineering review.

No refinement meetings. No 15-page PRDs. No six-week development cycles.

This is not a sci-fi prediction for 2030. This is happening right now.

It is called the Execution Collapse — the cost and time of turning a product thought into production code has effectively dropped to zero.

To survive the next wave of tech, you have to become an “Orchestrator.”

Orchestrators don’t write PRDs. They write context.md files.

90-Day Plan: Become an AI PM

The PM Workflow of 2026

Step 1: The Product Note (The Seed)

You no longer write a PRD. You write a “Product Note.”

This is a raw, 3-to-4 paragraph summary of the user intent, the desired outcome, and the specific metrics you want to move.

It is pure strategy, stripped of any implementation details.

Step 2: The Injection (`context.md`)

This is the secret weapon. The PM takes the Product Note and feeds it into an orchestrating LLM, but they inject two critical system files alongside it to constrain the AI’s hallucinations:

product_area_context.md: Maintained strictly by the PM. This file defines the rigid business rules.

Example Content: Free users can only generate 5 reports per day. Do not allow PDF exports for Free Tier. If a user hits a paywall, route them to /upgrade. Our tone is professional, never conversational.”

code_context.md: Maintained by the engineering lead. This file maps the current technical reality.

Example Content: “We use React for the frontend and Python/FastAPI for the backend. All user data must pass through the auth_v2 middleware. Our database schema for users is located in /db/schema/users.sql.”

Step 3: The Functional Spec & The PM Review (The New Hero Skill)

The Orchestrator LLM synthesises the Product Note with the strict constraints of the Context files.

It instantly generates a highly technical Functional Spec.

This is the new job of the Product Manager. You don’t write the spec from scratch; you evaluate it.

You act as the Editor-in-Chief.

You review the AI’s logic, check for edge cases it missed, verify it adhered to the product_area_context.md rules, and adjust its assumptions. You are the taste-maker and the final human in the loop.

Once you approve it, you hit “Proceed.”

Step 4: Tech Spec to Autonomous PR

Once the PM approves the Functional Spec, the workflow becomes fully autonomous.

The agent converts the Functional Spec into a Tech Spec (defining architecture and data models).
The agent hands the Tech Spec to a coding model (like Claude 4.6).
The coding model writes the actual code, runs the unit tests, and automatically raises a Pull Request in GitHub.

Total time elapsed: 45 minutes. — -

AI PM 2026 — Winner’s Playbook

The Terrifying Future of Product Management

Read that workflow again. At no point did the PM schedule a backlog refinement meeting.

At no point did they write a user story in Jira.

The Execution Collapse means that engineering execution is rapidly becoming a commodity.

In this new reality, companies don’t need 50 Product Managers to coordinate sprints.

They need 5 elite PMs who understand how to structure context.md files, evaluate AI logic, and orchestrate autonomous agents.

If you don’t understand how to build these pipelines, you are fighting a losing battle against a PM who does.

FREE Book Giveaway — AI & Tech Simplified

Stop Coordinating. Start Orchestrating.

Understanding that this is the future is just a theory.

Actually building these agentic workflows for your own product is how you survive the transition.

You cannot learn this by just reading Medium articles. You have to build it.

If this article changed how you think about Product Management in the AI Era, you will find much more depth in our AI PM course.

Check our highest-rated AI PM course (Including AI PM Interview Preparation) · 4.9/5 · 600+ enrollments → See testimonials and course details

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass ( Here )

More Resouces

The Model Choice Playbook Every AI PM Needs in 2026

Shailesh Sharma — Tue, 21 Apr 2026 20:23:03 GMT

AI Model Selection has become a very critical skill and getting asked in a Lot of Interviews

Imagine your CEO asks you to build an AI customer support agent for a food delivery app that handles 2 million tickets a month.

You think that Just use the best model. GPT-5. Claude Opus. Gemini 3, wire it up, and ship it.

You do the math, and you realise the ROI will not make sense.

If your instinct as an AI PM is to default to the frontier model on the leaderboard, you will ship a product that wins the pilot and loses the P&L.

Your agent will feel smart. Your margin will turn negative. Your CFO will ask why AI inference is eating half the savings you promised.

This is the first-principles breakdown of how AI PMs should actually make model-choice decisions. We will walk through it using one real scenario.

Building an AI customer support agent for a food delivery app at the Swiggy or DoorDash scale.

If you are preparing for AI PM interviews, this is one of the questions that now separates real AI PMs from people who have only used ChatGPT. You can check out other AI PM Interview Questions here.

The Scenario

You are the AI PM at a food delivery company.

You have 2 million support tickets a month. Human agents currently handle them at roughly 0.5$/ticket. Total support cost is 1 million dollars a month.

Your CEO gives you a mandate. Cut that by 70% with an AI agent.

The ticket distribution looks like this.

50% ~ Where is my order?
20% ~ missing items or cold food.
15% ~ are refund disputes.
10% ~ are restaurant quality complaints.
5% ~ are complex multi-turn escalations involving payment failures, account issues, or angry users demanding managers.

Some tickets take three seconds to resolve. Some take a thirty-turn conversation. Some require reading six previous tickets to understand what the user is asking. Some are hungry users at 10 PM typing in all caps.

The naive answer is to wire up a frontier model and pipe every ticket through it.

Let us see what happens when you do this.

What is Model Choice, Really?

Model choice is not a vendor decision. It is a five-dimensional optimisation problem under a hard business constraint.

You are picking a point on a Pareto frontier defined by Task Fit, Latency, Cost, Context, and Controllability. No single model wins on all five. Every choice is a trade.

For this support agent, the trade is complicated.

If you use a frontier model for every ticket. Quality is high.

At two million tickets of an average 5 turn conversation, at roughly 0.27$ of inference per ticket, you spend 540,000 dollars a month. You have already burned 54% of the savings you were hired to deliver. Add infrastructure, monitoring, safety guardrails, and retries, and you are above 70%.

If you use a cheap model for every ticket. Cost is fine.

But the model hallucinates refund policies, misreads Hinglish, and escalates half your tickets to humans anyway. Your deflection rate drops from 70% to 30%. The math collapses from the other direction.

Neither extreme works. The right answer lives in between. The PM decides where.

The Strategic Bet

Here is what most PMs miss entirely.

The support agent you are about to build does not run on a single model. It runs on a routing layer that decides which model handles which ticket.

—> Where is my order needs a database lookup, a tiny model to format the response, and a sub-500ms latency target.

—> A refund dispute from an angry user who has already had three bad experiences needs a frontier model, long context, and two seconds of real reasoning.

Shoving both requests into the same model is how you lose money and lose users at the same time.

The router is where your actual product intelligence lives. It is the layer that turns commodity foundation models into a profitable support operation.

Anyone can call the OpenAI or Anthropic API. What no competitor can easily copy is your 18 months of telemetry about which ticket class wins on which model at what confidence threshold with your specific users.

So the Problem Statement becomes - How do we match every incoming ticket to the cheapest model that still clears the user’s quality bar, within our latency SLA, at a positive contribution margin against human support cost?

Over-routing to the expensive model and inference eats the savings. Over-route to the cheap model and deflection collapses because humans have to clean up after the agent.

The Five Constraints

Before you pick a model, you score every candidate on five constraints. Most PMs only think about one or two.

Constraint 1: Task Fit

Task Fit measures how closely the model’s training matches the actual ticket.

Where is my order is not a reasoning task. It is a structured data lookup wrapped in natural language.

A small model paired with a database call beats a frontier model here. The frontier model writes more than needed, invents delivery estimates, and hedges unnecessarily.

A refund dispute is a reasoning task. The user references past tickets, implies context, negotiates, and escalates. A small model collapses. You need the frontier model.

The same agent handles both. Task Fit tells you they cannot be handled by the same model.

The only way to score Task Fit honestly is to build an internal eval set.

200 real tickets, sampled proportionally across ticket types. Every candidate model runs through it. A human or a judge model rates outputs on a rubric. This eval set is owned by the PM, not the ML team, and it is refreshed every month with real production data.

Constraint 2: Latency

Time to first token is what the user feels when they hit send.

For the 50% of tickets that are order status lookups, your target is under 500ms. The user is anxious. Every extra second is another 10% chance they open Twitter instead.

For the 5% of complex escalations, 2 seconds is acceptable if the response is visibly thoughtful. The user is already in a serious conversation and expects weight.

Your router itself has a latency budget. If your intent classifier takes 300ms to decide where to send the ticket, you have already burned most of the user’s budget before the actual model has started generating.

This is why classifier models are almost always small, often distilled or fine-tuned, tuned to run under 50ms. The router cannot be the bottleneck.

Constraint 3: Cost

At 2 million tickets a month, run the math on frontier-only inference.

Average ticket: five turns, roughly 2,000 accumulated input tokens per turn and 400 output tokens. At frontier model prices of roughly 15 dollars per million input tokens and 60 dollars per million output tokens, one ticket costs about 0.27$.

2 million tickets a month at 0.27$ each is 540,000 dollars in pure inference.

Your human baseline was 1 million dollars. Your AI inference alone is 54% of that. Your CEO is not impressed.

Now run the same math with a routed system.

70% of tickets go to a cheap model at $0.01 each.
25% go to a mid-tier at $0.08.
5% go to the frontier model at $0.27.
Weighted average lands at 0.04$ per ticket. Total monthly inference cost: 81K

Same deflection rate. Nearly 7x the margin.

Constraint 4: Context and Memory

The agent needs context like past orders. previous tickets from the same user, restaurant policies, active promotions, delivery agent notes etc.

The instinct is to stuff everything into a 200K context window and let the model figure it out. This fails in two ways.

Models measurably degrade past a certain context length, usually between 30K and 100K tokens, depending on the model. This is the lost-in-the-middle problem, and it is real.
Cost scales with every token included on every turn across millions of tickets. A 50K context blindly passed every turn turns your 0.04$ ticket into a $0.25 ticket. You have rebuilt the frontier-only problem with extra steps.

The right answer is retrieval. Pull only the relevant past order, the specific restaurant’s policy, the last two tickets, and the user’s LTV tier. Keep the context under 4,000 tokens. Let the router decide when a ticket is complex enough to justify pulling the full history.

Retrieval gives you control. Massive context gives you a black box that silently gets worse and more expensive as you fill it.

Constraint 5: Controllability

A customer support model cannot invent a refund policy.

If the model says you will get a full refund plus 500 rupees credit and that is not your policy, you have two problems. You either honour the invention and bleed money. Or you refuse and face a Twitter escalation.

Controllability is how reliably the model sticks to your rules under adversarial inputs.

Frontier models are generally more capable but not always more controllable. A fine-tuned, smaller model trained on your exact refund policy will follow the rules more reliably than a frontier model with a clever prompt. For the 15% of tickets that are refund disputes, controllability beats raw capability.

Most PMs stop at the first two constraints. The best AI PMs score every candidate on all five, document the trade, and revisit the scorecard every quarter.

Why - Just Use the Best Model - Fails Here

The argument is familiar. LLM Prices are dropping or will drop drastically in future. Capability is doubling. Just pick the top model and wait.

Three reasons this is wrong for the support agent.

Your competitor is not waiting. If they run a routed system today, they save 450K $ a month and reinvest it into faster delivery SLAs or cheaper customer acquisition. By the time frontier prices drop, they have already eaten your growth.
Best is relative to the ticket class, not the benchmark. The frontier model loses to a fine-tuned, smaller model on structured refund queries.
Cost drops do not flow to users. Every time inference gets cheaper, users expect richer responses, longer context, and more autonomy. If your unit economics are bad today, they are still bad tomorrow on a cheaper, more capable model.

Betting on the best model is a tax you pay to avoid doing the actual PM work.

The Router Pattern, Built for the Agent

Your router has five components.

An intent classifier sits in front of every ticket. A small fine-tuned model, under 50ms. It reads the ticket and returns one of five labels. order_status, missing_item, refund_dispute, restaurant_complaint, complex_escalation. It also returns a confidence score.
A model assignment table. order_status goes to a small model plus a database call. missing_item goes to a mid-tier model with a template response. refund_dispute goes to a fine-tuned, smaller model trained on your refund policy. restaurant_complaint goes to the mid-tier. Complex escalation goes to the frontier model.
A confidence threshold. If the classifier returns low confidence, the ticket escalates one tier up. If the primary model returns a low-confidence answer or the user replies “this is wrong”, it escalates again. The third escalation goes to a human.
A cache layer. 40% of “where is my order” tickets in a one-hour window ask about the same handful of delayed orders in a single city. Cache the response per order ID with a 60-second TTL. Zero inference cost on a cache hit.
A telemetry layer. Every ticket logs the classifier label, model chosen, tokens consumed, latency, user reaction, and final disposition. This is where your routing intelligence compounds.

The sophistication is not in the components. It is in the ongoing tuning of the assignment table based on telemetry.

One Ticket, End to End

Follow a single ticket through the router.

A user types “bhai order kidhar hai, 45 min ho gaye” at 9:47 PM.

The ticket hits the edge. It is hashed and checked against the cache. No hit.

It goes to the intent classifier. Classifier returns order_status, confidence 0.93. 42 milliseconds elapsed.

The router looks up the assignment table. order_status with high confidence goes to the small model plus a database call.

In parallel, the system pulls the user’s active order and the delivery agent’s current GPS location. 80 milliseconds.

The small model receives the ticket plus structured context. It generates “Your order is 4 minutes away. The delivery agent is on the last stretch”. Time to first token: 210ms. Total response time: 480ms.

Telemetry logs the full trace. Ticket class, model used, tokens consumed, latency, user’s next message.

The user replies “ok thanks”. Telemetry marks this as a positive resolution.

The router chose the right model. The user got a fast answer. The ticket cost you $0.004 against a human cost of $0.5. Multiply by one million similar tickets a month, and you see where the money is actually saved.

That is the product.

Model choice is where the business is either made or broken. And it is the PM’s job to decide.

If this article changed how you think about model choice and AI product strategy, you will find much more depth in our AI PM course. We cover model selection, routing architectures, AI evals, cost modelling, and real interview questions from top companies.

Check our highest-rated AI PM course (Including AI PM Interview Preparation) · 4.9/5 · 600+ enrollments → See testimonials and course details

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass ( Here )

Apple AI Strategy

The AI Professional — Sun, 19 Apr 2026 05:43:54 GMT

Everyone thinks Apple is losing the AI race.

But here’s the truth… they’re playing a completely different game.

To dominate AI, companies need four things:

Infrastructure,
Models,
Data,
and Distribution.

Apple may be weak in AI models, but they are insanely strong in distribution.

They sell 200+ million iPhones every year.

And think about how we use phones.

We ask simple things — summarise emails, find photos, send quick messages.

So Apple’s strategy is simple: use our personal data to give hyper-personalised AI, and build the best interface on devices. and let others build the giant models.

For this, Apple is simply partnering with Google.

So Apple may never build the smartest AI… but they might build the most useful AI on our phone. Subscribe for more business strategy breakdowns.

Why Your Recommender Keeps Forgetting You?

Shailesh Sharma — Sat, 18 Apr 2026 12:08:46 GMT

Imagine this, you buy an iPhone on Amazon.

Three days later, you buy a case for it. A week after that, you buy AirPods. Normal journey.

Your recommendation feed is doing its job.

Then life happens. You buy a birthday gift for your niece. A toy. Then a book for your dad. Then a yoga mat. Then some groceries.

Now you come back looking for a screen protector for that iPhone.

Here is the problem. Your recommender has forgotten about the iPhone.

The model remembers what you did most recently.
Toy. Book. Yoga mat. Groceries.
Based on that, it is now quietly convinced you are a gifting parent with a wellness streak. It is showing you more toys, more books, more yoga equipment.

Meanwhile, the single most important signal about what you want right now, the iPhone from three weeks ago, has been washed out.

This is not a theoretical problem. This is happening on most recommendation systems you use today.

Today, we are going to see how to fix that.

In our previous piece, we explained how TikTok uses session-based RNNs to predict your next swipe. At the end, we flagged three pitfalls. One of them was Catastrophic Forgetting. This article is a deep dive into the paper that solved it.

If you are preparing for AI PM interviews, recommendation system design is the most commonly asked system design topic at senior levels. We teach this in our course.

The iPhone Problem

Let us go back to your Amazon story. Why did the model forget the iPhone?

The reason lies in how most recommenders store your history.

An RNN-based recommender works like this. Every time you buy something, the model converts that item into a short numerical fingerprint. Then it mixes that fingerprint into a single vector called the hidden state. One vector. That is the model’s entire memory of you.

Think of the hidden state like a single sticky note. Every time you buy something, the model scribbles on that same note, and whatever was written before gets slightly smudged.

After your iPhone purchase, the note says “wants tech accessories.”

After the case and AirPods, it still says roughly that.

Then you buy a toy. The note gets rewritten. Now it says “tech accessories and a gift.”

Then a book. Yoga mat. Groceries. By the time you come back for that screen protector, the sticky note no longer mentions the iPhone at all. It says something like “parent on a wellness kick with household needs.”

The iPhone signal is not lost. It is buried. Smeared under four unrelated purchases.

This is called Catastrophic Forgetting. And it is not a bug you can fix by tuning the model. It is a fundamental flaw in the architecture. The sticky note itself is too small to hold what it needs to hold.

Why This Breaks Product Experience

This has two costs that hit you directly as a PM.

The first cost is performance. Your model misses the highest-signal moments in a user’s journey because they get washed out by noise. A user who bought an iPhone three weeks ago is an obvious candidate for iPhone accessories. Your model does not see it. Your revenue per user suffers.
The second cost is explainability. You cannot tell a user why something was recommended. You cannot tell your leadership why the model did what it did. A single hidden vector is a black box even to the people who built it.

If you have ever been in a meeting where your head of product asks, “Why is the model recommending this?” and your ML lead says, “The embeddings suggest...”, you have lived this problem.

How Humans Actually Remember

Here is the interesting part. You do not have this problem.

If someone asks you what to get for a new baby, you do not scan every memory from your entire life. You pull up the specific episode of buying baby stuff for your niece last year. You focus on that. Everything else stays quiet in the background.

You have episodic memory. You can pull up specific moments on demand.

Your recommender does not have this. It only has the sticky note.

What if we gave the recommender episodic memory?

The Fix: A Memory Box, Not a Sticky Note

Instead of the single hidden vector, can we give every user a small memory box?

Think of the box as a row of 20 labelled drawers. Each drawer holds one past purchase. When you buy something new, it goes into a fresh drawer. The oldest drawer gets emptied to make space.

At any moment, your box has your last 20 purchases, sitting side by side.
The iPhone is in drawer 17.
The case in drawer 16.
The AirPods in drawer 15.
The toy in drawer 14.
The book in drawer 13.

And so on.

Nothing is smudged. Nothing is averaged. Each purchase sits cleanly in its own drawer.

Now, when you come back looking for a screen protector, the model does something clever. It does not read all 20 drawers equally. It asks a question.

“Which of these past purchases is most relevant to a screen protector?”

It scans each drawer, scores the similarity, and pays attention to the ones that match. The iPhone drawer lights up. The toy drawer stays dim. The book drawer stays dim.

The model pulls out the iPhone signal cleanly and recommends the perfect screen protector.

This is exactly how attention works in modern AI. The model decides what to focus on based on what it is trying to do right now.

Two Versions of the Same Idea

Here this we can do in two ways, both use the same core idea. They differ in what they store.

The first version is called item-level RUM.
Each drawer in the box holds an actual past purchase. iPhone in one drawer. AirPods in another. This is simple. It is also explainable. You can literally tell the user that we showed you this because of that iPhone you bought three weeks ago.
The second version is called feature-level RUM.
Each drawer does not hold a purchase. It holds a preference. One drawer tracks your brand preference. Another tracks your price sensitivity. Another tracks your style preference. Every time you buy something, the drawers get gently updated. Buy an Apple product, and the brand drawer leans more towards Apple. Buy something cheap; the price is more budget-friendly.

The second version tends to perform better. The first is easier to explain.

If you work in a domain that demands explainability, such as finance or healthcare, go item-level.

If you are running a pure engagement product where performance is everything, go feature-level.

How The Memory Updates

The item version is simple. New purchase comes in, oldest one gets kicked out. First in, first out. A 20-slot box always holds the last 20 purchases.

The feature version is more interesting.

When you buy something new, the model does two things.

First, it decides what to forget. If you just bought an Android phone, your brand preference for Apple should fade. The model computes a forget signal and uses it to gently erase the old preference.
Then it decides what to reinforce. Your brand preference for Android should go up. The model computes an add signal and writes it to the drawer.

The mental model is simple. Every time you buy something, the relevant drawers in your memory box get a small dusting-off followed by a small update.

The beautiful thing is that the model learns what to forget and what to reinforce on its own. You do not write rules. You show it millions of user sequences, and it figures out the pattern.

Thing which Product Manager needs to decide

Memory size

How many drawers per user? More drawers mean richer history, but more computing. 20 might work for e-commerce. For a content platform like TikTok, where users burn through items in seconds, you might want 50 or 100.

Item level or feature level

Explainability or performance. Pick one. You cannot have both.

Memory weighting

optimal weight for recent behaviour. Start there. Then an A/B test. Stable domains like books or music can push intrinsic weight higher. Volatile domains like news or short-form video need more memory weight.

Write strategy

For item level, first-in-first-out is fine. For the feature level, you need the forget-and-reinforce approach. It is more powerful. It is also harder to debug.

If this article changed how you think about memory, recommendation architectures, and AI system design, you will find much more depth in our AI PM course.

We cover these in 40+ Videos and 25+ Case Studies, along with AI PM interview questions from top AI companies.

Check our highest-rated AI PM course (Including AI PM Interview Preparation) · 4.9/5 · 600+ enrollments.

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass ( Here )

90-Day Plan: Become an AI PM (starting from Zero)

Shailesh Sharma — Thu, 16 Apr 2026 16:58:07 GMT

If I lost every skill I have tomorrow and had 90 days to get hired as an AI PM, I would not watch a single YouTube video about prompt engineering for the first three weeks.

I would not open ChatGPT. I would not read a blog post about agents. I would not touch a no-code tool.

I would do something most PMs skip entirely.

I would learn how AI systems actually work before I try to build anything with them.

This sounds obvious. It is not what people do. What people do is jump straight to the shiny layer. They learn prompting. They learn vibe coding. They built a chatbot in an afternoon and updated their LinkedIn headline to “AI Product Manager.”

Then they sit in an interview. The interviewer asks how they would design evaluations for a RAG-based support system. They freeze.

They have no answer because they skipped the foundation that the answer sits on.

90 days is enough time. But only if you sequence things correctly.

Here is the exact sequence.

Weeks 1 to 3. The Foundation Nobody Wants to Build.

Most PMs hear about AI and start with Generative AI. This is backwards.

Generative AI is a layer that sits on top of machine learning. Machine learning sits on top of data systems. If you do not understand the layers below, you cannot reason about the layer above.

Week 1 is about machine learning at a PM level. Not math. Not code. Concepts.

What is supervised learning, & What is unsupervised learning?
What is the difference between a classification problem and a regression problem?
When your team says we trained a model, what does that actually mean?
What did they feed it? What did they optimise for? What could go wrong?

You are not learning this to become a data scientist.

You are learning this so you can sit in a design review and know whether the team picked the right approach.

If you cannot do that, you are not an AI PM. You are a project manager with a fancy title.

We are launching a 12-week Cohort ( 100 Hrs of Learning) and 10+ Live Projects. Please find the details here

Become AI Product Builder

Week 2 is about the AI Flywheel.

This is the concept that separates AI products from regular software.

In regular software, a feature works the same way on day 1 and day 1000.

In AI, the product should get smarter the more people use it. Users generate data. Data improves the model. The better model creates better user experiences. Better experiences bring more users.

If you cannot design this loop for your product, you do not have an AI strategy.

You have a feature with an API call.

Week 3 is about data pipelines.

This is the week that will feel the most boring and will be the most valuable.

Your AI is only as good as the data feeding it.

Dirty data. Biased data. Missing data. Poorly labelled data.

These are not engineering problems. These are product problems.

The PM who understands data pipelines catches issues in the design phase. The PM who does not catch them in production, after users have already had a bad experience.

By the end of Week 3, you should be able to whiteboard a basic ML system. Data in. Feature engineering. Model training. Prediction. Feedback loop. If you cannot draw this, you are not ready for what comes next.

Weeks 4 and 5. Algorithms and Case Studies.

You do not need to derive the math behind logistic regression.

You need to know when to use it.

Here is a real scenario.

Your team is deciding between a decision tree and a linear model for a pricing feature.

The engineer explains the trade-offs. If you do not understand what either approach does, you are sitting in that meeting as a spectator.

You are waiting for someone else to make a decision that is yours to make.

Week 4: learn the three algorithms that cover 80% of PM-relevant AI decisions.

Linear regression for predicting continuous values.
Logistic regression for classification.
Decision trees for complex, non-linear problems.

For each one, learn what it does, when it works, when it breaks, and what the output looks like.

Do not learn these from textbooks. Learn them from product case studies. How Uber uses prediction models for pricing.

How Netflix uses collaborative filtering for recommendations.

How Amazon designs the data flywheel behind Alexa.

Week 5 is entirely case studies. Read 10 to 15 real company teardowns.

How does Lyft balance model accuracy against latency in real-time pricing, and how does Amazon show the next best category to the users? How does Netflix do creative personalisation on the Homepage? (You can find Case Studies Here)

Every case study you internalise becomes a mental model you can pull out in a conversation, a strategy discussion, or an interview.

The PMs who sound the sharpest in rooms are the ones with the deepest library of real-world references.

Weeks 6 and 7. Generative AI From First Principles.

Now you are ready for Gen AI. Not before.

Week 6 is about understanding generative AI, how it is different from the AI that we learnt in the last few weeks. How does Generative AI work? What are some of the applications that generative AI can solve?

Then go deeper into the hood of Generative AI

What is a transformer? What is a token? What is a context window?
What happens when you increase the temperature from 0.2 to 0.9?
Why does the same prompt give different outputs each time?
Why does the model hallucinate, and when is hallucination more likely?

These are not academic questions. If your AI feature is hallucinating and you do not know whether the problem is the prompt, the temperature, the model, or the retrieval layer, you cannot diagnose it.

You are dependent on an engineer to figure it out. That is a loss of ownership.

Week 7 is prompting. Not “write me a blog post” prompting. Structural prompting.

Chain of Thought. Tree of Thought. Few-shot examples.
System-level constraints. Prompt chaining for multi-step workflows.

These techniques are the difference between an AI feature that works 60% of the time and one that works 95% of the time.

If you are building a product used by millions of people, that 35% gap is the difference between a feature users trust and a feature users abandon.

By the end of Week 7, you should be able to write a multi-step prompt chain that produces consistent, reliable output for a defined product use case. Not a toy demo. A real workflow.

Weeks 8 and 9. Prototyping. The Phase That Changes Your Career.

Everything before this was understandable. This is where you start building.

The gap between PMs who understand AI and PMs who build with AI is the largest salary gap in product management right now. The difference is not 10% or 20%. It is 2x to 3x.

Week 8: build your first working prototype.

Not a mockup. Not a slide deck. A working thing where the AI takes an input, processes it, and returns an output a real user can interact with.

Use Cursor. Use Replit. Use Claude. The tools are too good now for any PM to say “I cannot build anything.” You do not need to write production code. You need to string together an API, a prompt, and a simple interface.

Pick a real problem you face at work. Build a tool that solves it in an afternoon. When you walk into a meeting and show your team a working prototype instead of a spec, the dynamic changes permanently. You stop being the person who requests things. You become the person who builds things.

Week 9: Make the prototype reliable.

This is where most vibe coding efforts die.

The prototype works 70% of the time. The PM calls it done. Post it on LinkedIn. Gets some likes.

But 70% reliability is not a product. It is a demo.

Week 9 is about model control. Temperature tuning. Choosing between a fast, cheap model and a slow, expensive one for different parts of the workflow.

Building a reliability framework. Understanding the cost-per-query math so you can tell leadership “this feature costs X per user per month” and mean it.

These are PM decisions. Not engineering decisions. The PM who owns these trade-offs owns the product. The PM who delegates them owns a spec.

FREE Book Giveaway — AI & Tech Simplified

Weeks 10 and 11. RAG, Agents, and Evals.

If you have done everything above, you are in the top 10% of PMs by AI fluency.

The top 1% knows three more things. RAG systems. AI agents. And evals.

RAG

Almost every enterprise AI product is a RAG system. The reason is simple. GPT does not know your company data. Claude does not know your Q3 metrics. No off-the-shelf model knows your customer support documentation.

RAG bridges this gap. It retrieves relevant chunks from your private data and feeds them to the model so it can generate answers grounded in your specific context.

If you are PMing an enterprise AI product and you cannot explain how RAG works, how chunking affects retrieval quality, or what a vector database does, you cannot debug the most common failure mode: the system returning wrong or irrelevant answers.

Agents

These are AI systems that do not just respond. They act. They plan a multi-step workflow, use external tools, and execute tasks autonomously. The PM challenge is different here. You need to design guardrails, failure states, and human-in-the-loop checkpoints for a system that makes its own decisions.

Now, the most important skill of all three: evals.

Evals are how you measure whether your AI system is good.

This sounds simple. It is the hardest unsolved problem in AI product management. You cannot use traditional metrics. Pass/fail does not work when the output is a paragraph of text. You need deterministic evals for things you can measure objectively. You need probabilistic evals where you use one AI model to judge another.

The PMs who understand evals ship with confidence. They set measurable quality bars. They can defend their decisions to leadership with data.

We have covered Advanced Evals here

Weeks 12 and 13. Interview Preparation & Portfolio

You can know all of the above. If you cannot communicate it under pressure in 45 minutes, none of it counts.

AI PM interviews do not ask you to design an alarm clock for the blind. They ask questions like these:

How would you measure the success of GPT 5.0?
Design a reliability framework for an AI shopping assistant.
ChatGPT’s regeneration rate has increased. How would you investigate?
How would you price Gemini?
Design a RAG system for TikTok content moderation.
Imagine Google made its model free, and it is better than paid GPT. You are Sam Altman. What do you do?
We have 20 such Real AI PM Interview Questions here

If you have never practised these questions, you will fumble.

Not because you do not know the concepts. Because you have not built the muscle of structuring an AI PM answer under time pressure.

The structure matters.

Start by clarifying the AI system architecture.
Define success metrics specific to AI products.
Address trade-offs unique to probabilistic systems. Show cost-per-query awareness.
Show eval thinking. Demonstrate that you can move between product sense and technical depth in the same answer.

Week 12 is practice. Answer questions out loud. Record yourself. Listen back. Find the moments when you hedged, when you went vague, when you lost the technical thread.

Week 13 is portfolio.

Document the prototype you built in Week 8.

Write up two case study analyses from Week 5.
Create a one-page eval framework for an AI feature.
This is your proof of work. It is the difference between “I learned about AI” and “I built with AI.”

This plan will make you absolutely beast after 12–13 weeks. 95% of PMs cannot do these things right now. The ones who can are not smarter. They just did the work in the right order. Most of this plan maps directly to what I teach. If you want to skip the self-study phase??

Check our highest-rated AI PM course (Including AI PM Interview Preparation )· 4.9/5 · 600+ enrollments · Use NYE26 for 60% off → See testimonials and course details

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. For more, check out my Live Webinars, AI Product Management Course, PM Interview Mastery Course, Cracking Strategy, and other Resources

Advanced Evals - Traces in AI Evals

Shailesh Sharma — Tue, 14 Apr 2026 20:28:17 GMT

You are a product manager at Amazon.

You just shipped Rufus. The AI shopping assistant that lives inside the Amazon app.

A user types: I am looking for running shoes for flat feet under 5000 rupees with good cushioning.

Your system does not just call an LLM and return a response. It runs a chain of operations.

First, it classifies the user’s intent. Is this a product search? A comparison request? A return query?
Then, it extracts structured attributes from the query. Category: running shoes. Foot type: flat feet. Budget: under 5000. Feature: cushioning.
Then, it calls the product search API with those attributes and retrieves 20 candidate products.
Then, it applies a reranking model to sort those 20 products by relevance to the original query.
Then, it feeds the top 5 products and the original query to an LLM, which generates a conversational response with recommendations.
Finally, it applies a safety filter to check for hallucinated claims. Did the LLM say a shoe has orthopaedic certification when the product listing never mentioned it?

Six places where something can go wrong.

The user sees the final response. It recommends three shoes. One of them is a basketball shoe. The cushioning claim on another is fabricated. The third recommendation is fine, but it costs 7200 rupees, which is above the stated budget.

Your VP asks what happened.

You look at the final output. It looks broken. But you have no idea which step broke it.

—> Was it the intent classifier?
—> The attribute extractor?
—> The search API?
—> The reranker?
—> The LLM? The safety filter?

This is where traces come in.

Why Traditional Evals Cannot Debug Multi-Step AI Systems

In the previous article, we covered RAG evals. Precision, Recall, MRR. Those metrics evaluate one specific component: the retrieval layer. They tell you whether your system pulled the right documents.

But modern AI systems are not single-component systems. They are pipelines. Chains. Agents. Multiple models calling multiple tools in sequence, where the output of one step becomes the input of the next.

Traditional evals look at the final output and ask: Was this answer good?

Traces solve this problem. A trace is a complete record of everything your AI system did to produce a single response. Every step. Every input. Every output. Every decision in that order.

If the final answer is wrong, the trace tells you exactly where the pipeline broke.

What Is a Trace?

A trace is borrowed from distributed systems. In traditional software engineering, when a user requests a web application, that request might travel through an API gateway, a load balancer, a backend service, a database, and a cache. A distributed trace records each hop, so engineers can see the full journey of a single request.

AI traces do the same thing, but for AI pipelines.

A trace represents the full lifecycle of a single user interaction with your AI system. From the moment the user sends a query to the moment the system returns a response.

A trace is made up of spans.

A span is one unit of work inside the trace. One step. One operation. One model call. One API request. One tool invocation.

Every span records four things.

What went in. The input to that step.
What came out. The output of that step.
How long did it took? The latency.
What type of operation it was. An LLM call, a retrieval step, a tool call, a function execution.

Spans are nested. A parent span can contain child spans. This creates a tree structure that shows exactly how your system executed.

This is the anatomy of a trace. A tree of spans, each recording the inputs, outputs, and timing of a single step.

Walking Through a Real Trace

Let us go back to Rufus. The user asked:
I am looking for running shoes for flat feet under 5000 rupees with good cushioning.

Here is the trace your system recorded. Six spans, in order.

Span 1: Intent Classifier

Input: I am looking for running shoes for flat feet under 5000 rupees with good cushioning.
Output: intent = product_search
Latency: 45ms
Model: Internal classifier v3

This span worked correctly. The intent is product search. No issues here.

Span 2: Attribute Extractor

Input: I am looking for running shoes for flat feet under 5000 rupees with good cushioning.
Output: {category: “running_shoes”, foot_type: “flat_feet”, max_price: 5000, feature: “cushioning”}
Latency: 120ms
Model: GPT-4o mini

This span also worked correctly. All four attributes were extracted accurately from the query.

Span 3: Product Search API

Input: {category: “running_shoes”, foot_type: “flat_feet”, max_price: 5000, feature: “cushioning”}
Output: 20 products returned. Product IDs: [A1, A2, A3, ... A20]
Latency: 230ms

Here is the first problem. When you inspect the 20 products returned, you find that product A7 is a basketball shoe. The search API matched on cushioning but ignored the running_shoes category constraint. The API treated the category as a soft filter instead of a hard filter.

You also find that product A14 costs 7200 rupees. The max_price filter did not work as a strict cutoff.

Two bugs. Both in the search API span. Neither is visible in the final output without the trace.

Span 4: Reranker

Input: 20 products from Span 3, original query.
Output: Top 5 ranked products: [A3, A7, A12, A14, A1].
Latency: 180ms
Model: Cross-encoder reranker v2

The reranker promoted A7 (the basketball shoe) to position 2 and A14 (the overpriced shoe) to position 4. The reranker matched on cushioning and boosted both products because they had strong cushioning scores.

The reranker did its job given the inputs it received. But those inputs were already contaminated by the search API.

This is a cascading failure. The search API lets in bad products. The reranker, operating on bad inputs, made the problem worse by promoting them.

Span 5: Response Generator (LLM)

Input: Top 5 products + original query
Output: Based on your requirements, here are three great options for flat feet with excellent cushioning: 1. Nike Revolution 6 (Rs 3,499) with orthopaedic-grade cushioning technology... 2. Adidas CourtSmash (Rs 4,299) with premium arch support... 3. ASICS Gel-Kayano (Rs 7,199) with superior gel cushioning...
Latency: 1,200ms
Model: Claude Sonnet Tokens: 340 input, 180 output

Multiple problems surfaced here.

The LLM included A14 (ASICS at Rs 7,199) despite the user asking for under 5000.
The LLM fabricated orthopaedic-grade cushioning technology for the Nike shoe.
That phrase does not exist in the product listing.
And the LLM recommended the Adidas CourtSmash, which is the basketball shoe (A7) that the search API should have filtered out.

Span 6: Safety Filter

Input: Generated response.
Output: Response passed. No safety violations detected.
Latency: 85ms

The safety filter checked for toxicity, PII, and explicit content. It did not check for factual accuracy against product listings. It did not catch the hallucinated orthopaedic-grade claim. It did not catch the budget violation.

The safety filter passed a response that contained two factual errors.

What the Trace Reveals

Without the trace, all you know is that the final answer was bad. With the trace, you know exactly what happened.

The search API had two bugs. The category filter was soft instead of hard. The price filter allowed products above the stated maximum.
The reranker amplified the problem. It promoted bad products because it optimised for feature match without respecting hard constraints.
The LLM hallucinated a product claim. It added orthopaedic-grade cushioning technology, which does not exist in any source data.
The LLM ignored a constraint. It recommended a product above the user’s budget.
The safety filter was incomplete. It checked for toxicity but not for factual grounding or constraint adherence.

Five distinct failure points. Three different components. Two cascading failures. One root cause (the search API) propagated through the entire pipeline.

You cannot find any of this by evaluating only the final output.

Span-Level Evals: Evaluating Each Step Independently

This is where traces and evals converge.

Traditional evals evaluate the system end-to-end. You compare the final output against a ground truth. That tells you whether the system worked, but not where it failed.

Span-level evals evaluate each span independently. You attach an evaluation metric to each span in the trace. Each step gets its own scorecard.

Let us apply this to our Rufus trace.

Eval for Span 1 (Intent Classifier)

Metric: Classification accuracy.
Ground truth: product_search
System output: product_search
Score: 1.0. Correct.

Eval for Span 2 (Attribute Extractor)

Metric: Attribute extraction F1.
Ground truth: {category: “running_shoes”, foot_type: “flat_feet”, max_price: 5000, feature: “cushioning”}
System output: Same.
Score: 1.0. All attributes correctly extracted.

Eval for Span 3 (Search API)

Metric 1: Category precision. What fraction of returned products match the requested category? 18 out of 20 products are running shoes. 2 are not. Score: 0.90.

Metric 2: Price constraint adherence. What fraction of returned products are under the stated max price? 17 out of 20 are under 5000. 3 are above. Score: 0.85.

Both scores reveal a leaky filter. Neither score would surface from an end-to-end eval.

Eval for Span 4 (Reranker)

Metric: NDCG (Normalised Discounted Cumulative Gain). Did the reranker place the most relevant products at the top?

If we define relevance as products that match ALL stated criteria (running shoes, flat feet, under 5000, good cushioning), then positions 2 and 4 in the top 5 contain products that violate at least one constraint.

NDCG@5: 0.72.

The reranker is optimising for partial relevance. It matches on some attributes while ignoring others.

Eval for Span 5 (LLM Response)

Metric 1: Faithfulness. Does every claim in the response have a source in the input products? The orthopaedic-grade cushioning technology claim has no source. Faithfulness score: 0.67.

Metric 2: Constraint adherence. Does the response respect all user-stated constraints? One product exceeds the budget.
Score: 0.67 (2 out of 3 recommendations within budget).

Eval for Span 6 (Safety Filter)

Metric: Hallucination detection rate. What fraction of factually unsupported claims were caught? The safety filter caught 0 out of 1 hallucinated claims. Score: 0.0 for factual grounding.

Now look at what you have.

A full diagnostic report. Each component was scored independently. You know that the intent classifier and attribute extractor are working perfectly. You know the search API has a filter leakage problem. You know the reranker needs constraint-aware scoring. You know the LLM has a faithfulness problem. You know the safety filter has a coverage gap.

This is an eval report built on traces. You cannot produce this without tracing your system.

The Trace-to-Eval Pipeline

Traces do not just help you debug individual failures. They create a flywheel for continuous improvement.

Here is how the pipeline works.

Step 1: Your system logs traces from production. Every user query generates a trace with all its spans.
Step 2: You sample traces. Not every trace needs evaluation. You pick a subset. Maybe 1 to 5 percent of production traffic. Maybe all traces where the user gave a thumbs down. Maybe all traces where the response latency exceeded a threshold.
Step 3: You run automated evals on the sampled traces. LLM-as-a-judge scores each span for faithfulness, relevance, constraint adherence, whatever metrics matter for your product. This is called online evaluation.
Step 4: Traces that score poorly get routed to a human review queue. Domain experts look at the trace, examine each span, and annotate where the system failed. These annotated traces become your golden dataset.
Step 5: You use the golden dataset for offline evaluation. Before shipping any change to any component, you run the new version against your golden dataset and compare span-level scores.
Step 6: The improved system goes to production. It generates better traces. Those traces get sampled and evaluated. The cycle repeats.

This is the trace-eval flywheel. Production traces become eval datasets. Eval datasets drive improvements. Improvements generate better traces. The system gets better every cycle.

Without traces, this flywheel does not exist. You cannot build a golden dataset if you do not know what each component did at each step.

End-to-End Evals vs Span-Level Evals

There is a common mistake teams make after they discover span-level evals. They stop running end-to-end evals entirely.

This is wrong. You need both. Here is why.

Span-level evals catch component failures. They tell you which step broke. But they cannot catch emergent failures. Failures that only appear when components interact.

Consider this scenario. The intent classifier outputs “product_search” correctly. The attribute extractor outputs all four attributes correctly. The search API returns 20 relevant products. The reranker ranks them well. The LLM generates a fluent response. Every span passes its individual eval.

But the final response is still bad. The LLM picked three products that are all from the same brand. The user sees no variety. The response feels like a sponsored advertisement.

No individual span failed. The failure is emergent. It exists only in the interaction between the reranker (which promoted similar products) and the LLM (which did not add a diversity constraint).

End-to-end evals catch this. They evaluate the final output as a whole. Diversity, user satisfaction, and task completion.

The framework is simple.

Use span-level evals to catch component failures. Where did the pipeline break?

Use end-to-end evals to catch emergent failures. Does the full pipeline produce good outcomes even when every component looks fine individually?

Use traces to connect the two. When an end-to-end eval catches a failure, walk the trace to find the root cause.

If you want to go deeper on Advanced Evals (Cohen’s Kappa, Matthew’s Correlation Coefficient), Evals for Agentic Architecture, AI Product Sense, AI Strategy, AI Pricing, AI Prototyping, Advanced Prompting, ML Systems, etc., check out my AI PM course (40+ Videos and 25+ Case Studies) [Certification Included]

Check our highest-rated AI PM course (Including AI PM Interview Preparation) · 4.9/5 · 600+ enrollments → See testimonials and course details

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass ( Here ), AI PM Resources

Advanced Evals - Evals for RAG

Shailesh Sharma — Mon, 13 Apr 2026 21:11:45 GMT

You are a product manager at Google.

You just shipped AI Overviews.

The feature that puts an AI-generated answer right at the top of search results.

A user types: “Why does my iPhone battery drain fast after the iOS 26 update?”

Your system does two things.

First, it retrieves five web pages from Google’s index that it thinks are relevant.
Then, it feeds those pages to Gemini and generates a summary answer.

The answer looks clean. The formatting is right. Gemini’s language is fluent. Your VP sees it in a demo and says ship it.

But here is the question nobody in the room asked.

Were those five retrieved pages actually the right ones?

Because if your retrieval pulled garbage, Gemini just summarised garbage.

This is the problem every team building RAG systems runs into. And almost nobody evaluates it correctly.

Why Evaluating RAG Is Different From Evaluating LLMs

Traditional LLM evals test whether the model’s output is good. Did it answer correctly? Was the tone right? Did it hallucinate?

RAG evals test something upstream. They test whether the retrieval system fed the right inputs to the model in the first place.

A RAG pipeline has two components.

The retrieval layer that selects documents.
And the generation layer that synthesises an answer from those documents.

These are two separate systems. They fail in different ways. They need to be evaluated separately.

Most teams skip the retrieval evaluation entirely. They look at the final generated answer and if it is good, they assume the whole pipeline works.

That is a mistake. Because sometimes the model gets lucky. It generates a reasonable answer even from mediocre sources. And sometimes the retrieval is perfect, but the model fumbles the generation.

RAG evals separate these two failure modes. They tell you exactly where the pipeline broke.

And the retrieval layer? That is your job as a PM to get right. Because retrieval quality is a product decision.

—> How many documents to retrieve?

—> Which embedding model to use.

—> What similarity threshold to set.

These are all choices that show up in your PRD, not in a prompt.

The RAG Evaluation

Let us go back to Google. You are evaluating AI Overviews for the query: “Why does my iPhone battery drain fast after iOS 26 update.”

Your retrieval system pulls five documents. Here is what it returned, in the exact order it ranked them:

Position 1: Apple Support page on iPhone battery health settings.
Position 2: A CNET article titled “Best Android Phones With Long Battery Life in 2025.”
Position 3: A Reddit thread from r/iPhone where users share iOS 26 battery drain fixes.
Position 4: A MacRumors article covering iOS 26 release notes and known battery bugs.
Position 5: An Amazon product listing for an Anker battery case.

Now, you need a ground truth. You need to know, which documents in your entire corpus were actually relevant to this query.

Your human evaluators (or your golden dataset) say there are exactly four relevant documents in the whole index for this query:

Relevant Doc A: The Apple Support page on battery health settings.
Relevant Doc B: The Reddit thread with iOS 26 battery drain fixes.
Relevant Doc C: The MacRumors article on iOS 26 release notes and battery bugs.
Relevant Doc D: An Apple Developer Forum post about background app refresh causing battery drain in iOS 26.

So here is the picture. Your system retrieved five documents. Three of them are relevant (positions 1, 3, and 4). Two are irrelevant (positions 2 and 5). And one relevant document (the Developer Forum post) was not retrieved at all.

Let us now measure exactly how good or bad this retrieval was.

Precision@K in RAG: Are You Retrieving Junk?

Precision answers a simple question. Out of everything you retrieved, how much of it was actually useful?

The formula is:

Precision@K = (Number of relevant documents in the top K results) / K

Let us calculate it at different values of K.

Precision@1.

You look at only the top result.

Position 1 is the Apple Support page. That is relevant.

Precision@1 = 1/1 = 1.0

Perfect. Your top result is a hit.

Precision@3.

You look at the top three results.

Position 1: Apple Support page. — Relevant.
Position 2: CNET Android article. — Not relevant.
Position 3: Reddit iOS 26 thread. — Relevant.

Precision@3 = 2/3 = 0.67

Two out of three were useful. That CNET Android article diluted the quality.

Precision@5.

You look at all five results.

Position 1: Relevant.
Position 2: Not relevant.
Position 3: Relevant.
Position 4: Relevant.
Position 5: Not relevant.

Precision@5 = 3/5 = 0.60

Three out of five. 60%. That means 40% of what you fed to Gemini was noise.

Here is what this metric tells you as a PM.

A precision of 0.60 at K=5 means your context window is 40% garbage. Gemini has to work harder to ignore the Android article and the Anker battery case listing. Every irrelevant document increases the chance of a confused, diluted, or hallucinated answer.

If your precision is dropping, you need to look at your embedding model. Your similarity threshold is too loose. You are retrieving documents that are only tangentially related to the query.

Precision is a purity metric. It tells you whether your retrieval has a noise problem.

Recall@K in RAG: Are You Missing Important Documents?

Recall asks the opposite question. Out of everything that should have been retrieved, how much did you actually find?

The formula is:

Recall@K = (Number of relevant documents in the top K results) / (Total number of relevant documents in the corpus)

We said there are four relevant documents total. Let us calculate.

Recall@1

You retrieved one document. It is relevant.

Recall@1 = 1/4 = 0.25

You found one out of four relevant documents, 25%. You are missing 75% of the useful information.

Recall@3

The top three results contain two relevant documents (positions 1 and 3).

Recall@3 = 2/4 = 0.50

You have found half the relevant information. Better. But the user is still missing context about the iOS 26 release notes and the developer forum post.

Recall@5

All five results contain three relevant documents.

Recall@5 = 3/4 = 0.75

75%, You captured most of the relevant information. But that fourth document, the Developer Forum post about background app refresh, never made it in.

And that missing document? It might have been the most important one. It contains the actual technical fix.

A user reading the AI Overview gets generic advice on battery health settings but misses the specific step to disable background app refresh in iOS 26. That is a coverage gap. And it is invisible if you only look at Precision.

Here is what Recall tells you as a PM.

Low recall means your users are getting incomplete answers. They are not seeing important perspectives.

In a search product, low recall is how you lose trust. The user tries your AI answer; it does not solve their problem. They scroll past it to the blue links, and eventually, they stop reading AI Overviews entirely.

If your recall is low, you need to retrieve more documents (increase K). Or you need a better embedding model that captures semantic similarity more broadly.

But notice the tension. Increasing K improves recall but can hurt precision.

You pull in more documents, and some of them will be junk.

This is the fundamental tradeoff you manage as a PM. And it is why you need both metrics, not just one.

The Precision-Recall Tradeoff in RAG Systems

Let us put the numbers side by side.

At K=1: Precision is 1.00, Recall is 0.25.
At K=3: Precision is 0.67, Recall is 0.50.
At K=5: Precision is 0.60, Recall is 0.75.

See the pattern? As K increases, precision drops and recall rises. You are pulling in more documents, which means you find more relevant ones (recall goes up), but you also let in more noise (precision goes down).

This is not a math problem. This is a product problem.

If you are building a medical information feature, you want high recall. Missing a relevant safety warning is unacceptable. You tolerate some noise in exchange for completeness.

If you are building a customer support chatbot where context window tokens are expensive and latency matters, you want high precision. Every irrelevant document wastes tokens and slows response time.

If you are building AI Overviews at Google, you need both to be high. A wrong source embarrasses you publicly. A missing source makes users lose trust. Your job is to find the K that maximises both, and to invest in a retrieval model that pushes the tradeoff curve outward.

This is a classic PM decision. It lives in your PRD. Not in a prompt engineering doc.

Mean Reciprocal Rank (MRR): How Fast Does the User Find What They Need?

Precision and Recall measure quantity. How many relevant documents did you retrieve versus how many exist?

MRR measures something different. It measures speed. How quickly does the first relevant document appear in your ranked results?

MRR stands for Mean Reciprocal Rank. Let us break it down from first principles.

First, Reciprocal Rank.

For a single query, the Reciprocal Rank is 1 divided by the position of the first relevant document.

Go back to our query. “Why does my iPhone battery drain fast after iOS 26 update.”

The first relevant document is at position 1 (the Apple Support page).

Reciprocal Rank = 1/1 = 1.0

Perfect. The user’s first result was relevant. No scrolling needed.

But MRR is a system-level metric. It averages the Reciprocal Rank across multiple queries. Because one query hitting position 1 does not mean your system is good. You need to see the pattern.

Let us say you are evaluating AI Overviews across three queries.

Query 1: “Why does my iPhone battery drain fast after iOS 26 update.” Your system retrieves five documents. The first relevant one is at position 1. Reciprocal Rank = 1/1 = 1.0

Query 2: “How to enable dark mode on MacBook Air.” Your system retrieves five documents. The results are: a Windows dark mode guide at position 1, an unrelated MacBook keyboard shortcut article at position 2, and an Apple Support page on macOS dark mode settings at position 3. The first relevant document is at position 3. Reciprocal Rank = 1/3 = 0.33

Query 3: “Is Apple Vision Pro compatible with prescription lenses?” Your system retrieves five documents. Position 1 is a generic VR headset comparison. Position 2 is Apple’s official Vision Pro page, mentioning Zeiss optical inserts. Relevant. Reciprocal Rank = 1/2 = 0.50

Now you calculate MRR:

MRR = (1.0 + 0.33 + 0.50) / 3 = 1.83 / 3 = 0.61

Your MRR is 0.61.

What does this mean?

An MRR of 1.0 means every single query gets its first relevant document at position 1.

An MRR of 0.5 means, on average, the first relevant document is around position 2.

Your score of 0.61 says users are typically finding a relevant result between positions 1 and 2.

Here is why MRR matters as a product metric.

In a RAG system, the order of retrieved documents affects the generated answer. Most LLMs pay more attention to the documents that appear first in the context.

This is called positional bias.

If your first relevant document is buried at position 3 and positions 1 and 2 are noise, the model might give more weight to the noise.

In a search product, position matters even more directly. Users skim from top to bottom. If the first result is irrelevant, many users bounce. Every position of delay costs you engagement.

MRR captures this. It rewards systems that put relevant results first. Not just systems that retrieve relevant results somewhere in the list.

Precision@K, Recall@K, and MRR Together: A Complete RAG Evaluation

Each metric tells you something different about your retrieval system.

Precision@K tells you: Are you retrieving junk? Is your context window polluted?

Recall@K tells you: Are you missing important documents? Is your coverage complete?

MRR tells you: Are the good results ranked at the top? Is your order right?

You need all three. Here is why.

A system can have high precision and low recall.
It retrieves only two documents, both relevant. Great purity. But it missed eight other relevant documents. The user gets a narrow, incomplete answer.
A system can have high recall and low precision.
It retrieves fifty documents and finds all the relevant ones. Great coverage. But it also pulled in forty irrelevant documents that confuse the model and waste tokens.
A system can have good precision and recall but bad MRR.
It retrieves five documents; four are relevant. But the one irrelevant document sits at position 1. The model anchors on it. The answer starts wrong.

The PM’s job is to optimise all three simultaneously. That means choosing the right K, selecting the right embedding model, tuning the similarity threshold, and potentially reranking results after initial retrieval.

Why Most RAG Teams Skip Retrieval Evaluation

Here is what I see happen repeatedly.

A team builds a RAG pipeline. They use an off-the-shelf embedding model. They set K to 5 because someone saw it in a tutorial. They evaluate only the final generated answer. The answers look good in a demo. They ship.

Three months later, users complain. The AI answers are kind of right, but missing the point.

The team starts debugging the LLM. They try better prompts. They switch models. They add guardrails. Nothing helps consistently.

The problem was never the LLM. The problem was the retrieval. Nobody measured Precision. Nobody measured Recall. Nobody checked MRR. Nobody even built a golden dataset of relevant documents for their queries.

The retrieval layer determines the quality ceiling of your entire RAG system. If retrieval is broken, no amount of prompt engineering or model upgrades will fix the output consistently.

Evaluate the retrieval first. Fix it first. Then evaluate the generation.

If you want to go deeper on Advanced Evals ( Cohen’s Kappa, Mathew’s Correlation Coefficient (MCC) ), Evals for Agentic Architecture, AI Product Sense, AI Strategy, AI Pricing, AI Prototyping, Advanced Prompting, ML Systems, etc., check out my AI PM course (40+ Videos and 25+ Case Studies ) [Certification Included]

Check our highest-rated AI PM course (Including AI PM Interview Preparation )· 4.9/5 · 600+ enrollments → See testimonials and course details

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass ( Here ), AI PM Resources

Will AI Create More Product Managers?

Shailesh Sharma — Wed, 08 Apr 2026 19:46:36 GMT

Before reading this, you can also read the following articles on Technomanagers

Everyone in tech is asking the same question. Will AI replace product managers?

They are making a 160-year-old mistake. By the end of this article, you will see why the answer is the opposite of what they expect.

But first. Coal.

What Is the Jevons Paradox?

In 1865, an economist named William Stanley Jevons noticed something strange. James Watt had made the steam engine more efficient. Everyone expected coal consumption to fall.

Coal consumption exploded.

When each unit of coal did more work, the cost of useful work dropped. When the cost dropped, people found more work to do. Factories that could never afford steam power suddenly could.

Jevons called it a paradox. Make something cheaper per unit and you expect less total usage. That is almost never what happens.

Cars got fuel-efficient. People drove more.

LEDs used less electricity. People installed ten times as many.

AWS made servers cheap. Companies spun up thousands of microservices where they once ran five.

Efficiency does not eliminate demand. It creates it.

This paradox is about to hit product management. But to see how we need to answer a question most PMs have never thought clearly about.

What Does a Product Manager Actually Do?

Not the job description version. The first principles version.

A PM does three things.

Uncertainty reduction. Talking to users. Analysing data. Running experiments. All of it serves one purpose. Figuring out what to build and for whom.
Cross-functional coordination. Keeping engineering, design, data science, and marketing aligned on the same problem.
Tradeoff arbitration. Time versus scope. Revenue versus user experience. Short-term versus long-term. The PM makes the call and owns the outcome.

Which of these three does AI make cheaper?

All three. But not equally.

And the unevenness is where the future of this profession gets decided.

How AI Is Changing Product Management Work

Uncertainty reduction just collapsed in cost. AI synthesises 50 user interview transcripts in two minutes. It scans competitors, generates hypotheses, and drafts experiment designs. Cost per unit down 80% in 18 months.

Coordination got somewhat cheaper. AI drafts PRDs in minutes. Summarises meetings. Translates technical requirements into business language. Maybe 40% cheaper.

Tradeoff arbitration did not get cheaper at all.

AI can present options and model scenarios. But the decision where you weigh strategy against user needs against tech debt against team capacity and say “we are doing this and not that” remains human.

Two of three PM functions got dramatically cheaper. One stayed the same. You already know what Jevons would predict.

The specifics are wilder than you think.

Why AI Will Increase Demand for Product Managers?

Most people stop at step one. Existing PMs get more productive. One PM covers what three did. Fewer PMs needed.

That is the McKinsey argument. The LinkedIn influencer argument. It is step one of four. And the only step that reduces headcount.

Step two. Latent demand unlocks.

Every large company has problems that never got product thinking because it was too expensive. Shopify has over 400 internal tools. Before 2024, fewer than 30 had a dedicated PM. When an AI-augmented PM covers three times the surface area, those neglected tools suddenly deserve attention. Shopify added 15 internal-product PM roles in early 2025.

Step three. New AI product roles emerge.

Every AI feature needs a PM. Every agent workflow needs someone to define boundaries, failure modes, and user experience. A lot of companies did not have a PM for AI-driven personalisation in 2020. Now there is an entire team. Multiply that across every e-commerce, fintech, and SaaS company.

Step four. Non-tech industries hire PMs for the first time.

AI makes building software cheap enough that hospitals, banks, and governments now build their own products.

Step one reduces PM headcount by 30%. Steps two through four increase it by 200 to 300%.

Jevons was right. Again.

This is also why learning to work with AI as a PM is no longer optional. The PMs getting hired in steps two through four are not traditional PMs. They understand how AI systems work and how to build products around them.

Will AI Agents Replace Product Managers Completely?

The serious counterargument. AI agents will handle tradeoff decisions, too. PMs become redundant.

Three problems with this.

Processing is not judgment. Spotify’s AI knows podcast listeners churn%. That is a pattern. Whether to invest in podcasts versus audiobooks versus live audio depends on positioning against Apple, licensing economics, and creator dynamics. Data surfaces patterns. Judgment decides what to do with them.
Product decisions are not optimisation problems. Feature A serves power users but alienates new ones. Feature B grows the funnel but adds 15% support costs. Feature C needs a migration that slows everything for two quarters. No formula resolves this.
Who sets the goal in the first place? AI optimises toward objectives. Someone decides what those objectives are. Which metrics matter? Which users to prioritise? Which problems to solve? That is the core of PM work. It does not get automated. It gets more valuable.

But here is the part that should worry you if you are a certain type of PM.

The Future of Product Management: Three Tiers

The market is splitting. The value distribution is brutal.

Tier one - The Compression Zone.

Execution-heavy work. PRDs, dashboards, tickets, standups. AI compresses this by 70 to 80%. If most of your week looks like this, your leverage is deflating every quarter. Not your job. Your leverage.

Tier two - The Leverage Layer.

Systems-level work. Experiment design, metrics frameworks, and feedback loops. AI augments this but does not replace it. PMs here use AI to multiply their output. Their value goes up with AI.

Tier three - The Taste Premium.

The PM who sees what others miss. Who kills the feature that looks great on a spreadsheet but feels wrong? Who sets the vision that aligns everything else.

The Taste Premium does not get cheaper. It gets scarcer. When supply drops, and demand explodes, the price goes up.

Spreadsheets arrived, and people predicted the end of accountants.

Canva arrived, and people predicted the end of designers.

Design headcount exploded. But the premium for world-class brand work went up.

Democratisation of basic work expands the market. It also concentrates value at the top.

How Product Managers Can Prepare for the AI Era?

One question decides your next five years.

Are you building skills that get cheaper when AI improves or skills that get more valuable?

If you spend your time on uncertainty reduction and coordination, AI is compressing your value. The market will pay less because AI does a version of it for near zero.

If you spend your time on tradeoff arbitration and taste, the market for you is about to expand. Every new AI-augmented PM and every new product surface needs someone at the top making the calls.

Moving from the Compression Zone to the Taste Premium does not happen by accident. It requires understanding how AI systems work, how to build products around them, and how to develop the judgment that AI cannot replicate.

Jevons figured this out in 1865. Coal did not disappear. It powered an industrial revolution.

Product management is not disappearing. It is becoming how every company builds.

The question is which tier you will be in when it happens.

We created a course ( 40+ Videos and 25+ Case Studies ) for PMs who want to build in the right direction. How to think about AI as a PM. How to design AI-first products. How to build the judgment layer that AI cannot replace, AI Deepdive, AI Evals and AI Interview Preparation

Check our highest-rated AI PM course (Including AI PM Interview Preparation )· 4.9/5 · 600+ enrollments → See testimonials and course details

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass ( Here )

Technomanagers is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

How Session-Based RNNs Predict Your Next Swipe in TikTok?

Shailesh Sharma — Fri, 03 Apr 2026 11:59:22 GMT

There is a comfortable lie in Product Management.

If you have enough historical data on a user, you know exactly what they want. You build your collaborative filtering models. You map out their lifetime preferences. You assume your algorithm is bulletproof.

Then you look at TikTok.

And you realise historical data is often a trap.

If your AI relies on what a user did yesterday, it fails to understand what they crave right now. In platforms where intent shifts by the minute, historical profiling is dead. You need to predict the immediate future based on the immediate past.

This is the first principle breakdown of how to build the Rabbit Hole effect. We will move from traditional recommendation engines to Session-Based Recommendations using Recurrent Neural Networks.

If you are preparing for AI PM interviews, this is one of the most important system design concepts you can learn. We teach this and many more real interview scenarios in our course.

What is TikTok, Really?

From a product architecture standpoint, TikTok is not a social network. It is an AI-driven bipartite matching engine.

It does not care who your friends are. It does not care what you followed last month. It cares about one thing. Matching an infinite supply of highly fragmented content with highly volatile human attention.

The moment you open the app, you are a blank slate. Not because TikTok does not have your data. But your data from yesterday is almost useless for predicting what you want in the next 30 seconds.

This is already a fundamentally different product philosophy from Instagram or YouTube. Those platforms are built on the social graph. TikTok is built on the interest graph. And the interest graph is rebuilt from scratch every single session.

The Strategic Bet: Kill the Social Graph

This is the part most PMs miss entirely.

TikTok did not just build a better recommendation engine. It made a strategic decision to remove the social graph as the primary distribution mechanism.

On Instagram, your content reaches your followers first. Then the algorithm decides whether to push it further. On YouTube, subscribers see your videos first. Then the algorithm takes over.

On TikTok, follower count is almost irrelevant for distribution. The algorithm decides who sees what, independent of social connections. A creator with 200 followers can get 10 million views on a single video if the algorithm detects high engagement in the first few hundred impressions.

Why does this matter strategically?

Because it means TikTok does not need network effects to retain users. Traditional social platforms are sticky because your friends are there. You cannot leave Instagram because your social circle is on Instagram. That is a network effect moat.

TikTok replaced network effects with algorithmic effects. You do not stay on TikTok because your friends are there. You stay because the algorithm understands you better than any other platform. The algorithm itself is the moat.

This is why TikTok’s valuation swings wildly depending on whether the recommendation algorithm is included in the deal. Reports from the US sale negotiations showed that TikTok, without its algorithm, could be worth as little as 40 billion dollars. TikTok with its algorithm is worth closer to 200 billion dollars.

The algorithm is not a feature. It is the entire business.

The User Behaviour Problem

On traditional platforms like Netflix or Amazon, a user’s session is slow. They search. They read reviews. They watch a two-hour movie. You have time to understand them.

On TikTok, user behaviour is chaotic.

Users do not explicitly tell you what they like. They signal it through micro-actions. A two-second linger. A rapid swipe. A share. Finishing a 15-second loop twice. These are all implicit signals.

A user might open the app wanting comedy. Three swipes later, they see a video about fixing a sink. Suddenly, their intent shifts entirely to DIY home repair.

And here is the hardest part. Even if a user is logged in, every time they open the app, their current emotional state is essentially a cold start. They might have had a bad day. They might be bored. They might be curious about something they have never explored before.

Historical profiles cannot capture this. Only the current session can.

The Problem Statement

How do we accurately predict and serve the next piece of content to a user when their current intent is unknown, rapidly changing, and largely divorced from their long-term preferences?

If the system relies on long-term data, it will continue to show comedy even after the user has mentally shifted to DIY. The algorithm will feel clunky. Out of touch.

This is not a theoretical problem. This directly hits the business.

The Metrics Framework: Three Layers

Before we solve it, we must measure the pain. And we need to measure it at three distinct layers. Most PMs only think about one layer. That is a mistake.

Layer 1: Session Health Metrics (Does the user stay?)

Time to First Abandonment tells you how many videos a user swipes through before killing the session. If this number is high, the algorithm is slow to adapt. Think of it as the “cold start tax.” How many bad recommendations does the user tolerate before leaving?

Session Watch Time is the total minutes spent in a continuous app session. TikTok’s average is 95 minutes per day across multiple sessions. If a single session averages less than 8 to 10 minutes, the recommendation engine is leaking users.

Swipe-to-Completion Ratio is the ratio of videos skipped within three seconds versus videos watched to 80% or more completion. A bad ratio means the system is serving the wrong content. A healthy For You feed should have a completion ratio above 40% within the first 10 videos of a session.

Layer 2: Engagement Depth Metrics (Does the user care?)

Not all engagement is equal. TikTok’s algorithm weights signals differently because some signals are more honest than others.

Watch Time Percentage is the strongest signal. A user who watches 95 per cent of a 60-second video has expressed genuine interest. They did not tap a button. They gave you their attention. Attention is the most expensive thing a human can give.

Replay Rate tracks how many users watch a video more than once. This is a signal that the content was not just good but worth revisiting. Replays are weighted heavily because they are almost impossible to fake.

Share Rate is even more telling. A user who shares a video is doing free distribution work for TikTok. They are putting their social capital on the line by recommending content to their friends. This is the highest intent signal after a purchase.

Save Rate means the user wants to come back to this content later. This is a forward-looking intent signal that most platforms underweight.

Comment Sentiment is trickier. A comment can be positive, negative, or neutral. Raw comment counts are misleading. TikTok’s system analyses whether comments indicate genuine engagement or hate-watching. Both drive views, but only positive engagement drives long-term session health.

Layer 3: Business Impact Metrics (Does the algorithm make money?)

Revenue Per Session connects recommendation quality directly to dollars. If the algorithm serves better content, users stay longer, see more ads, and revenue per session increases.

Ad Completion Rate measures whether users watch the ads placed between videos. If the surrounding content is relevant and engaging, users are in a positive attention state and more likely to watch an ad through. If the content is poor, users are already in “skip mode” and will reflexively skip the ad too.

DAU Retention at Day 1, Day 7, and Day 30 tells you whether the recommendation quality is good enough to bring users back. A single great session means nothing if the user does not return tomorrow. This is the ultimate test.

Most PMs stop at Layer 1. The best AI PMs connect all three layers into a single causal chain. Better recommendations lead to better session health, which leads to deeper engagement, which leads to higher ad revenue and retention.

Why Traditional Recommendation Methods Fail Here

Most PMs are familiar with Matrix Factorisation, also called Collaborative Filtering.

It works like this. Users who liked video A also liked video B. So if you liked A, the system recommends B.

This approach has powered Amazon and Netflix for years. But it fails catastrophically on TikTok.

The reason is simple. It ignores the sequence of actions.

If you watch Video A, then Video B, then Video C, the order in which you watched them contains massive contextual clues about your shifting intent. Matrix Factorisation treats them as a disorganised bucket of likes. It does not know that C came after B. It does not know that the transition from A to B was a signal.

Sequence matters. And for sequence, you need a fundamentally different architecture.

Why RNN is the Way Forward

This is where the paradigm shift happens.

Recurrent Neural Networks, specifically architectures like GRU4Rec (Gated Recurrent Units for Recommendations), are designed exclusively for sequential data.

Think of it this way. An RNN treats a user’s session like a sentence. If you read the words “I want to eat an...” your brain predicts “apple.” An RNN does the same thing with user actions. It looks at the strict chronological sequence of the last 10 swipes and uses that sequential memory to predict the 11th.

This is fundamentally different from collaborative filtering. The RNN does not ask “what did users like you enjoy?” It asks “given the exact order of what you just did, what should come next?”

The PM Requirements Before Any Code is Written

Before the ML engineers write a single line of code, the AI PM must define the constraints. If you fail here, the model will be a theoretical success and a production disaster.

There are three critical requirements.

First is latency. The model must run online inference. When a user swipes, the RNN must update its state and fetch the next video in under 50 milliseconds. If inference takes 200ms, the user sees a loading spinner. On TikTok, a loading spinner is death.
Second is defining a session. Is a session defined by 30 minutes of inactivity? Or is it defined by a hard app close? This seems like a small decision but it fundamentally changes how the model trains. Usually, a 30-minute inactivity threshold works best.
Third is signal weighting. The PM must define what inputs matter and how much they matter. A like is an explicit signal. A video completion is an implicit signal. The model must ingest both. But watch-time percentage should be weighted highest because it is the most honest signal. People lie with likes. They do not pay attention.

These are PM decisions, not engineering decisions. If you get them wrong, no amount of model tuning will save you.

How the RNN Actually Works Under the Hood

Let us look inside the GRU, the engine of the RNN session model.

When a user interacts with a video, that video is converted into an embedding vector. Think of the embedding as a numerical fingerprint that captures everything about that video in a compact format.

The core magic of the GRU is its Hidden State. This hidden state acts as the memory of the current session up to the current point in time.

As the user swipes to a new video, the GRU updates its memory. It uses two internal mechanisms.

The Update Gate decides how much of the past session to remember. If the user’s recent behaviour is consistent, the gate stays mostly open, preserving the session memory.

The Reset Gate decides how much of the past to forget because the user’s intent has shifted. If the user suddenly jumped from comedy to cooking, the reset gate activates and says “Forget the comedy context, something new is happening.”

The formula for the memory update looks like this.

New Memory = (1 - Update Gate) X Old Memory + Update Gate X Candidate Memory

In plain language, the model blends the old session memory with the new signal based on how much the user’s intent has shifted. This blending happens after every single swipe. The memory is always fresh.

How the Model is Trained

You cannot train this like a normal classification problem. The video catalogue has millions of items. You cannot ask the model to predict the exact video.

Instead, we use a technique called Bayesian Personalised Ranking or BPR.

The idea is simple but powerful. Instead of predicting the exact next video, you train the model to rank the actual next video the user watched higher than a randomly sampled video the user did not watch.

You take the video that the user actually watched next. You call it the positive item. You randomly sample a video the user did not watch. You call it the negative item. Then you train the model so that the score for the positive item is always higher than the score for the negative item.

Over millions of such comparisons, the model learns what sequences of behaviour lead to what kinds of content. It learns the grammar of user intent.

Measuring Model Success: Offline and Online

You need two completely different sets of metrics here. One for the engineers during training. One for the business during A/B testing.

For offline ML metrics, you track Recall at K. Out of the top 20 videos the RNN predicted, was the actual next video the user watched in that list? If yes, the model is doing its job.

You also track Mean Reciprocal Rank. It is not enough to be in the top 20. Was it ranked number 1 or number 19? MRR heavily penalises the model if the correct prediction is buried at the bottom of the list.

For online product metrics, you track Next-Click CTR. Does the user actually watch the immediately next video served by the RNN? You also track Session Length Extension. Does the RNN variant increase the average session length compared to the control group?

If Recall at 20 is high but Session Length is flat, your model is technically accurate but not creating the Rabbit Hole effect. Both metrics must move together. This is where AI PMs earn their salary. Bridging the gap between what the ML team optimises and what the business actually needs.

The Three Pitfalls That Will Destroy Your Product

If you implement this blindly, you will destroy your user experience. There are three pitfalls every AI PM must guard against.

The first is the Echo Chamber problem. RNNs are almost too good at detecting immediate intent. If a user pauses on a sad video for two seconds too long, the RNN might plunge them into a depressive rabbit hole, serving nothing but sad content for the rest of the session. The solution is to inject random exploration videos using multi-armed bandits. You intentionally break the sequence to test for new intents. This is a PM decision, not a model decision. TikTok’s own algorithm does this. It deliberately injects novelty and diversity into the feed to prevent monotony and to protect users from harmful content spirals.
The second is Catastrophic Forgetting. Standard RNNs heavily weight the most recent clicks and forget the beginning of the session. If a session is 100 swipes long, the intent from swipe 10 might still be relevant. But the GRU might have forgotten it entirely. This is why some teams use attention mechanisms on top of GRUs, allowing the model to look back at any point in the session, not just the most recent swipes.
The third is Cold Start on New Items. The RNN is great at handling new users because it builds understanding from the very first swipe. But it struggles with brand-new videos that have no embeddings yet. You still need content-based filtering to push new creator videos into the system until they accumulate enough interaction data. TikTok solves this through a tiered distribution system. Every new video is first shown to a small, highly targeted test group. If engagement is strong within that group, the video gets pushed to a larger audience. This is how a creator with 200 followers can wake up with 10 million views.

AI Product Management is not about throwing LLMs at every problem. It is about understanding the structural reality of your user data and connecting it to the business's strategic goals.

If this article changed how you think about recommendation systems and product strategy, you will find much more depth in our AI PM course. We cover system design for recommendations, RAG architectures, AI metrics, agentic systems, and real interview questions from top companies.

Check our highest-rated AI PM course (Including AI PM Interview Preparation )· 4.9/5 · 600+ enrollments → See testimonials and course details

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass ( Here )

Technomanagers is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

50 AI PM Job Descriptions. 3 Skills They All Want.

Shailesh Sharma — Thu, 02 Apr 2026 16:28:33 GMT

There is a pattern in how PMs prepare for AI product roles.

They read about LLMs. They learn to write better prompts.

They update their resume with words like “generative AI” and “responsible AI.” They think they are ready.

The job descriptions say otherwise.

I spent three weeks reading 50 AI PM job postings. Google, Meta, OpenAI, Eightfold, and a few high-growth AI-native startups.

I was looking for one thing: what skills are companies actually asking for versus what PMs are actually building.

The gap is larger than I expected.

What Everyone Is Preparing For

The standard AI PM preparation list looks like this.

→ Learn how LLMs work at a high level. Understand transformers.
→ Read about RAG and fine-tuning.
→ Practice writing product strategy documents that mention AI.
→ Build a portfolio project, preferably a chatbot.

This is not wrong. These things appear in JDs.

But they appear the same way “strong communication skills” appear in a generic PM role. As table stakes.

As filters that remove clearly unqualified candidates, not signals that separate good candidates from great ones.

The companies I looked at are past the point where “I understand what an LLM is” is a differentiator. They are hiring for people who have operated at the intersection of AI and product. Operated, not studied.

Three skills kept appearing across JDs in a way that most candidates are not building.

Skill 1: Evals

This is the biggest gap I found.

The OpenAI CPO, at Lenny’s Podcast conference in 2025, said something that should have become required reading for every PM preparing for an AI role. He said, “The most important thing a product manager can learn to do is write evals.”

Most PMs I speak to do not know what an eval is. They think it means user testing. It does not.

An eval is a structured test suite for an AI system. You define a set of inputs, the expected output or behaviour, and a scoring method. You run the system against those inputs. You measure how often it performs correctly. When the model changes, when you change the prompt, when you change the data, you run the evals again. You see what broke.

In traditional software, a bug is a bug. The system does the wrong thing, and you fix it.

In LLM-based products, the system does the wrong thing in some cases, the right thing in others, and something ambiguous in a third set. Without evals, you have no way to know which category a new change falls into. You are shipping blind.

A sample question can be something like this:

How would you measure the reliability of Rufus, the e-commerce AI Assistant?
How would you measure the success of MultiAgentic Workflow?

Every time you change anything about the system, you run the evals again. If the score drops, you do not ship.

This might help you to prepare in detail about Evals

Skill 2: Model Selection Logic

The second skill is the ability to choose between AI approaches, not just use them.

A year ago, “AI feature” meant “integrate GPT-4 via API and ship.” That is no longer a differentiated product decision.

Today, hiring managers want PMs who can reason through the following question: for this specific problem, what is the right approach and why?

The options a PM now needs to reason through include prompt engineering alone, RAG with a vector database, fine-tuning a smaller model, training a purpose-built model from scratch, or using a rule-based system instead of AI entirely. Each option has a different cost structure, latency profile, accuracy ceiling, maintenance burden, and failure mode.

A PM who cannot reason through this tradeoff is not a PM for an AI product. They are a PM who happens to have AI on their roadmap.

Let me make this concrete. Say you are building a feature that answers customer queries about an e-commerce return policy. Your choices are:

Prompt engineering: fast to build, but the model will hallucinate policies that do not exist in your documentation. You have no grounding.

RAG: You retrieve the relevant policy sections and inject them into the prompt context. The model can now answer accurately against your actual policy. Build time is higher, but accuracy is significantly better.

Fine-tuning: you train a smaller model specifically on your policy data and support conversations. Latency is lower, cost per query is lower, but you now have a maintenance responsibility. When your policy changes, you need to retrain.

Rule-based: for simple, high-volume queries like “what is your return window,” a rule-based system has zero hallucination risk and near-zero latency. AI adds no value here.

Real AI PM Interview Questions (with Detailed Solutions)

Skill 3: Failure Mode Thinking

Traditional products fail in predictable ways. If a button does not work, it does not work. You find it in QA. You fix it. It works.

AI products fail in ways that are probabilistic, context-dependent, and sometimes invisible until they are very visible.

The failure modes that appear repeatedly in AI PM JDs are: hallucination (the model generates confident false information), latency degradation under load, context window limits causing incomplete reasoning, prompt injection attacks in user-facing LLM features, and confidence calibration problems where the model is wrong but sounds right.

PMs who can map failure modes before a feature ships are rare.

Most teams discover failure modes after launch because they were not built into the product definition.

Hiring managers know this and look for candidates who proactively think about what can go wrong.

In an interview, this shows up as questions like: “How would you define done for an AI feature?” or “Walk me through how you would monitor this after launch.”

A PM who only talks about launch metrics and A/B tests is signalling that they have not thought about probabilistic failure.

A PM who talks about confidence thresholds, fallback logic, latency monitoring, and a human-in-the-loop escalation path for low-confidence outputs is signalling operational maturity.

What This Means for Your Preparation

The JDs are not asking for people who know about AI. There are thousands of those.

They are asking for people who have operated AI products. Who have thought through eval design, made model selection decisions, and mapped failure modes before launch. These are skills you build by doing, not by reading.

If you are preparing for an AI PM role right now, I would stop spending time on LLM theory and start spending time on the three skills above.

That work is what separates candidates who know AI from candidates who have worked with AI. The JDs are very clear about which one they want.

AI Product Management is the future; you can keep ignoring it, but this will become the baseline in 8 to 14 Months.

You should take action to kill the anxiety — Start today only. Learn about AI Product Management, starting from the basics to Advanced with the Flagship AI PM Course.

You can also check out our highest-rated AI PM course ( Including AI PM Interview Preparation )· 4.9/5 · 600+ enrollments → See testimonials and course details

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass ( Here )

Spec-Driven Development for Product Managers

Shailesh Sharma — Sun, 29 Mar 2026 18:06:15 GMT

You are a Senior Product Manager at Google Maps.

Your VP walks into your Monday standup and says three words: “Build Ask Maps.”

What is “Ask Maps”: Users can type or speak a natural language query directly into Maps and get intelligent, context-aware answers. “Find me a rooftop restaurant near Koramangala that’s open after 10 PM and has good reviews for cocktails.” No filters. No manual search. Just ask.

Everyone in the room is excited. The engineers want to start prototyping immediately. Someone has already opened Cursor.

And this is exactly where most AI-era product development goes wrong.

Because what happens next is what the industry has started calling “vibe coding.” Someone fires a prompt into an AI coding tool.

The tool generates a working prototype in 20 minutes. Everyone is impressed. The demo looks great.

Three sprints later, the codebase is a mess, the AI feature behaves inconsistently across edge cases, and no one can explain why it sometimes returns results in Tamil Nadu when the user is searching in Telangana.

Spec-driven development is the structured alternative.

And in this article, I want to walk you through exactly what it looks like: not in abstract terms, but through the Ask Maps feature, end to end.

Before we go deep into Spec Driven Development, you can find out our articles on the following

What Spec-Driven Development Actually Is?

Spec-driven development (SDD) is a methodology where you write a formal, machine-readable specification before any code is generated.

This specification defines the behaviour, constraints, success criteria, and edge cases for a feature. The AI coding agent then generates code that must satisfy the spec. If the generated code does not meet the spec, the build fails automatically.

This is different from how most teams use AI tools today.

In the traditional AI-assisted workflow, a developer writes a prompt, the AI generates code, the developer reviews it, finds gaps, re-prompts, and this cycle repeats until something “feels right.”

There is no contract. There is no explicit definition of what the feature must and must not do. The AI guesses. The developer hopes. Technical debt accumulates silently.

SDD flips this. You start with the contract. You define what Ask Maps must do before you decide how it should be built. The code is a downstream artefact of the spec, not a starting point.

Three levels of SDD exist: spec-first (the spec guides the AI workflow), spec-anchored (the spec is continuously updated as the feature evolves), and spec-as-source (only the spec is ever edited by humans, never the code directly). For most product teams, spec-first or spec-anchored is the practical operating mode.

Now, let us apply this to Ask Maps, step by step.

Phase 1: Strategic Alignment Before Writing Anything

Most PMs think the spec is the first output of the discovery process. It is not. The first output is alignment with what you are actually building.

Before anyone writes a spec for Ask Maps, the product team needs to resolve a set of foundational questions. These are not design questions. These are strategy questions.

What is the primary job to be done?

Ask Maps could solve multiple problems. It could be a discovery tool (help users find places they did not know existed). It could be a planning tool (help users build a full day itinerary). It could be a real-time assistant (give live, context-aware answers based on current traffic, weather, and availability).

These are three different features. They share a surface but diverge completely in their backend requirements, data dependencies, and success metrics.

At Google Maps, this decision has massive downstream consequences. A discovery-focused Ask Maps integrates deeply with Google’s restaurant and business index. A planning tool needs multi-stop optimisation logic. A real-time assistant needs live data pipelines for weather, traffic, and business hours APIs.

The team needs to pick one primary job. Everything else is scope creep.

What is the explicit scope boundary?

What will Ask Maps not do? This is equally important as what it will do. Will it handle transactional requests (”book a table at this restaurant”)? Or is it purely informational? Will it work offline? Will it support voice input at launch? What languages will it support on day one?

Scope boundaries are not limitations. They are the spec’s load-bearing walls.

What are the success metrics?

Before a single line of spec is written, the team defines what success looks like. For Ask Maps, this might be: query satisfaction rate above 80% (user rates the answer as helpful), mean response latency under 2 seconds, and a 15% increase in session length compared to the current Maps search flow.

These numbers are not arbitrary. They will directly inform the non-functional requirements in the spec later.

Practical SDD tool behaviour at this stage:

In tools like Agent OS, this phase is handled by a “spec researcher” sub-agent. It ingests your product brief and roadmap, then surfaces clarifying questions with suggested default answers.

The PM does not write long paragraphs in response. They respond with “yes” or minor corrections. The agent synthesises the answers into a structured requirements brief.

For Ask Maps, the output of this phase is something like:

Primary job: Place discovery through natural language
Scope: Informational only, no transactions, English-first
Input modes: Text and voice
Success metric: Query satisfaction above 80%, P99 latency under 2 seconds
Out of scope for v1: Itinerary building, multi-stop optimisation, transactional bookings

This is not the spec. This is the raw material for the spec.

Phase 2: Writing the Ask Maps Specification

Now the spec is written. And this is where SDD requires discipline, because the instinct is to write a technical spec. SDD requires a behavioural spec.

A behavioural spec defines what the system must do and how it must behave from the user’s perspective and from a system contract perspective. It does not prescribe the implementation.

Here is what the Ask Maps spec looks like:

Feature: Ask Maps Version: 1.0 Owner: [PM Name] Last updated: [Date]

Goal: Enable Google Maps users to discover places and get location-aware answers through natural language queries, without using traditional filter-based search.

User Stories:

As a Maps user, I want to ask a natural language question about places near me, so that I can discover options I would not find through manual filters.
As a Maps user, I want to ask follow-up questions in the same session without re-entering my original context, so that I can refine my search conversationally.
As a Maps user, I want Ask Maps to consider my current location, time of day, and day of week automatically, so that I get contextually relevant answers without explicitly stating these.

Functional Requirements:

FR-01: The system must accept natural language queries of up to 500 characters via text input.
FR-02: The system must accept voice input and convert it to text before processing.
FR-03: The system must use the user’s current GPS-confirmed location as the default geographic context for all queries.
FR-04: The system must return a minimum of 3 and a maximum of 10 place results per query.
FR-05: Each result must include: place name, distance from user, rating, a one-line AI-generated reason for the recommendation, and a direct CTA to navigate.
FR-06: The system must support follow-up queries within the same session, preserving the context of the initial query.
FR-07: If the system cannot find relevant results with confidence above 0.75, it must surface a “limited results” state rather than hallucinating low-quality matches.

Non-Functional Requirements:

NFR-01: P50 response latency must be under 1 second. P99 must be under 2 seconds.
NFR-02: The system must handle a minimum of 10,000 concurrent queries.
NFR-03: The AI recommendation layer must not surface results from businesses that have a Google rating below 3.5 unless explicitly asked by the user.
NFR-04: The system must not store the user’s query text beyond the active session without explicit consent.

Edge Cases and Failure Modes:

EC-01: User is in a location with no GPS signal. The system must prompt the user to manually enter a location rather than defaulting to a stale cached location.
EC-02: User queries a category with no matches within a 10km radius. The system must expand the radius to 25km and inform the user of this expansion.
EC-03: Query contains a language other than English. V1 must return a graceful “English only” message. V2 will address multi-language support.
EC-04: Query is ambiguous (for example, “good food near me”). The system must ask one clarifying question before returning results, not make an assumption.

Out of Scope:

Transactional bookings (restaurant reservations, ride bookings)
Multi-stop itinerary planning
Queries not related to physical places (for example, “what is the capital of France”)

Notice what this spec does not contain: no database schema, no API structure, no infrastructure decisions. Those are implementation choices. The spec is silent on them intentionally. The AI coding agent gets to make those decisions within the constraints of the spec. The spec defines what must be true. The implementation decides how.

Phase 3: The Design Document

The spec is human-readable. The design document is agent-readable.

Once the spec is approved (and this approval step is non-negotiable in SDD, the PM and engineering lead both sign off before any code is generated), the AI agent translates the spec into a structured design document.

This document contains:

API Contract (from FR-01, FR-02, FR-04):

The Ask Maps endpoint accepts POST requests.

Input schema:

query (string, required, max 500 characters): The natural language query
location (object, required): Contains lat (float) and lng (float) from GPS
session_id (string, optional): For follow-up query context preservation (FR-06)
input_mode (enum: “text” | “voice”, required)

Output schema:

results (array, min 3, max 10): Each object contains place_id, name, distance_km, rating, ai_reason, navigate_url
state (enum: “success” | “limited_results” | “clarification_needed” | “error”)
clarification_question (string, nullable): Populated only when state is “clarification_needed”

Confidence Gate (from FR-07):

The AI recommendation layer must include a confidence score per result. Results with confidence below 0.75 are excluded from the final output array. If this exclusion brings the total results below 3, the system sets the state to “limited_results” and returns whatever results passed the threshold.

Radius Expansion Logic (from EC-02):

Initial query radius: 10km. If the results count is below 3 after confidence filtering, expand to 25km. Append a radius_expanded: true boolean to the response object. The UI layer uses this flag to surface the “We expanded your search area” message.

Security Constraints (from NFR-04):

Query text must not be written to any persistent store. Session data lives in ephemeral cache only, with a TTL of 30 minutes.

This design document becomes the to-do list for the AI coding agent. Each requirement maps to a specific implementation task. Nothing is left to interpretation.

Phase 4: Breaking It Into Testable Tasks

In SDD, the design document is decomposed into discrete, independently testable implementation units. This is where the workflow starts looking like traditional engineering project management, except that AI agents are executing the tasks, not humans writing the code from scratch.

For Ask Maps, the task breakdown looks like this:

Task Group A: Core API Layer

A1: Implement the POST /ask-maps endpoint with input validation (FR-01, FR-02)
A2: Implement GPS location ingestion and validation. Fail gracefully if coordinates are malformed (EC-01)
A3: Implement session management with 30-minute ephemeral TTL (NFR-04, FR-06)

Task Group B: AI Recommendation Engine

B1: Integrate with Google Places API for candidate place retrieval
B2: Implement confidence scoring model with 0.75 threshold gate (FR-07)
B3: Implement radius expansion logic: 10km base, expand to 25km with flag (EC-02)
B4: Generate AI-written one-line reasons per result using the LLM layer

Task Group C: Edge Case Handling

C1: Implement ambiguity detection. If query is flagged as ambiguous, return clarification question instead of results (EC-04)
C2: Implement rating filter: exclude results with Google rating below 3.5 from candidate pool (NFR-03)
C3: Implement “English only” language detection for V1 (EC-03)

Task Group D: Non-Functional Requirements

D1: Load test to confirm P99 latency under 2 seconds at 10,000 concurrent queries (NFR-01, NFR-02)
D2: Security audit on query storage to confirm no persistent writes (NFR-04)

Each task has a direct reference back to a specific requirement in the spec. This is the core discipline of SDD. You can always trace any line of code back to a business requirement. If you cannot, that code should not exist.

Phase 5: Execution Under Constraints

The AI coding agent now generates code. But this is not vibe coding with a spec document sitting nearby. The spec is an active constraint.

In practice, this means:

The CI/CD pipeline has automated spec validation checks embedded.

If the AI agent generates code for the Ask Maps endpoint that does not include the confidence threshold gate, the build fails. Not a code review comment. A hard build failure.

If the agent generates a response schema that returns results without the ai_reason field, the build fails. Because FR-05 explicitly mandates it.

If the agent writes query text to a database table (even a logging table), the build fails. Because NFR-04 says it cannot.

This is what “executable specification” means. The spec is not a document someone reads. It is a contract that the system enforces.

One critical challenge at this phase is what practitioners call context fragmentation. Most AI coding tools understand a single repository. But Google Maps is not a single repository.

The Ask Maps feature will touch the core Maps search service, the Places API integration layer, the user session service, the UI component library, and the AI/ML serving infrastructure. These live in different codebases, owned by different teams.

If the AI agent only sees one of these repositories, it will generate code that is locally correct but architecturally inconsistent. It will reinvent session management that already exists in the session service. It will create a new confidence scoring library instead of using the existing ML inference wrapper.

This is why enterprise SDD needs a context engine that maps semantic dependencies across repositories.

For a team without access to enterprise tooling, the practical workaround is explicit cross-repo documentation injected into the AI agent’s context at task time.

Phase 6: Debugging the Spec, Not Just the Code

This is the phase most PMs never hear about, and it is arguably the most important.

In SDD, when the AI generates code that is wrong, you do not fix the code directly. You fix the specification.

Here is why: AI code generation is non-deterministic. If you fix a bug in the generated code without updating the spec, the next time you regenerate (for a refactor, a new feature, or a regression fix), the AI will reproduce the exact same bug. It is following the spec. The spec said nothing about this case. So the AI guessed.

Concretely: Imagine the Ask Maps agent generates code that sometimes returns results from a different city when the user is near a city boundary. The radius expansion logic triggered, expanded to 25km, and pulled in results from an adjacent city without informing the user.

In vibe coding, a developer patches this edge case in the code and moves on.

In SDD, the PM goes back to the spec, adds a new edge case:

EC-05: When radius expansion crosses an administrative city boundary, the system must segment results by city and surface a separator in the UI indicating “Results from [adjacent city]”.

The spec is updated. The design document is updated. The CI/CD validation check is updated. The AI agent regenerates the affected module. The fix propagates correctly and permanently.

This is the compounding benefit of SDD. Every bug you find and fix in the spec makes the entire feature more robust, not just the one line of code that was wrong.

Why This Matters for Product Managers Specifically

SDD is not just an engineering methodology. It is a PM leverage tool.

In the traditional development model, the PM writes a PRD, hands it to engineering, and then spends the next three sprints in spec review meetings clarifying requirements that were ambiguous in the document. The PM is a translator, repeatedly.

In SDD, the spec is the single source of truth that both the PM and the AI agent operate from. When engineering asks, “Why does the endpoint behave this way?” the answer is always FR-07 or NFR-03. Not “I think I mentioned it in the PRD somewhere.” The spec is precise. The behaviour is traceable.

For PMs building AI-powered features specifically, this precision is not optional. Research shows AI LLMs generate vulnerable code at rates between 9.8% and 42.1%, and a significant fraction of those vulnerabilities are rated Critical severity.

A PM who cannot articulate the exact constraints their AI feature must operate within is not doing product management. They are doing product wishful thinking.

SDD forces PMs to be specific. That specificity is the PM’s highest-leverage contribution in an AI-first development environment.

The Learning Curve Is Real, But It Pays Off

When I first started working through SDD workflows, the upfront planning phase felt slow. Writing behavioural requirements instead of just describing the feature in prose felt overly formal. Defining edge cases before writing a single line of code felt premature.

Three sprints in, the compounding became obvious. The Ask Maps spec, once written, became the source for the engineering scoping document, the QA test plan, the security review checklist, and the launch readiness criteria. The spec was written once and used six times. Every clarifying question in sprint planning was answerable by pointing to a requirement ID.

The slow part upfront makes everything downstream faster.

Where to Go From Here

AI Product Management is the future; you can keep ignoring but this will become the baseline in 8 to 14 Months.

You should take action to kill the anxiety - Start today only, Learn about AI Product Management, start from basics to Advance with the Flagship AI PM Course.

You can also check out our Highest rated AI PM course · 4.9/5 · 500+ enrollments → See testimonials and course details

About Author

Shailesh Sharma! I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. For more, check out my AI Product Management Course, PM Interview Mastery Course, Cracking Strategy, and other Resources

Uber Autonomous Vehicle Strategy

Shailesh Sharma — Fri, 27 Mar 2026 19:41:13 GMT

Before reading about Uber, you can read the following articles on Technomanagers

Many people think Uber is competing with Waymo and Tesla to build the best self-driving car.

This is a flawed way to look at the business.

Uber is not building a car company.

Uber is building the software layer for physical mobility.

Let us break down the Uber autonomous vehicle strategy using first principles.

The Utilisation Problem

Why does Uber not just buy a massive fleet of self-driving cars and keep all the profit?

We need to look at the fundamental unit economics of a transportation marketplace.

Ride hailing demand fluctuates drastically. A Saturday night has massive demand, while a Tuesday morning has very low demand.

If a standalone robotaxi company builds enough cars to serve the Saturday night peak, then most of their expensive cars will sit idle on Tuesday morning.

Idle hardware destroys profitability because you still pay for depreciation and maintenance.

If they only build enough cars for Tuesday morning, then wait times on Saturday night will be too long, and users will leave the platform.

Uber solves this problem through a hybrid network.

They use self-driving cars to serve the base load.

The base load is the predictable, continuous demand that happens every hour of the day. Uber then uses human drivers to handle the burst capacity.

Burst capacity is the sudden spike in demand during bad weather or weekends. By pushing the volatile demand to human drivers, Uber ensures that its partner autonomous vehicles stay constantly utilised.

High utilisation directly leads to profitability.

The Big City Myth

Will autonomous vehicles just take over the major cities?

Software scales instantly, but physical infrastructure scales very slowly.

People assume that cities like San Francisco and Los Angeles generate all the rideshare money.

The latest Uber financial data shows something completely different. Trips in the top twenty US cities represent only twenty five percent of their overall profits.

The vast majority of Uber profits come from smaller cities and suburbs. It will take a very long time for autonomous vehicle companies to map every rural road and complex suburban driveway.

Uber already has human drivers covering these areas. Uber owns the demand in these highly profitable long tail markets while the hardware companies take on the massive capital expense of mapping physical geography.

The Aggregator Advantage

If cars can drive themselves, why do hardware companies need the Uber platform at all?

When a technology becomes a commodity, the company that aggregates customer demand captures the most value.

If Waymo dominates one city and another startup dominates a different city, the end consumer will have a terrible experience.

Users do not want to download five different apps and compare wait times.

Uber is positioning itself as the universal marketplace for mobility.

It does not matter if the vehicle has a Google brain or a Tesla brain. The expensive robotaxi needs a rider to generate revenue. Uber has hundreds of millions of active users. By acting as the aggregator, Uber forces hardware companies to plug into its routing algorithm.

Uber does not need to win the artificial intelligence race. Uber just needs to be the default platform where all the artificial intelligence models come to find their customers.

Commanding The Pricing Power

Whoever controls the user interface controls the pricing power.

If a customer opens the Uber app, they do not care who manufactured the car. They just want the cheapest and fastest ride possible.

This consumer behaviour gives Uber total negotiating control over the hardware companies. Uber can force the different self-driving companies to compete directly against each other on the same screen.

If one hardware company wants a higher cut of the fare, Uber will simply send the customer a cheaper vehicle from a different hardware maker. This dynamic will force the hardware companies to lower their prices to win the ride, while Uber maintains its high profit margins on every single transaction.

What do you think, what’s the future of mobility?

For Full Detailed cases Studies and AI & Strategy — Download this Book ( 5/5 Rated )

Download the Book

You can also check out our Highest rated AI PM course · 4.9/5 · 500+ enrollments → See testimonials and course details 60% OFF for a limited time — Code: NYE26