How Would You Build Google Photos’ New Wardrobe Feature?
AI Product Manager Case Study
You own roughly 80 pieces of clothing. You remember maybe 15. You rotate through 7.
The problem is not the clothes. The problem is that your closet has no search bar.
Google Photos just shipped a feature that fixes this. It scans your photo library, finds every piece of clothing you have ever worn in a picture, catalogues it, and turns your gallery into a searchable digital closet. Filter by category. Mix and match pieces into outfits. Try the whole thing on virtually before you get dressed.
That is the product. Now, let us build it.
As an AI Product Manager, this is exactly the kind of problem statement you would be handed.
In our previous pieces, we covered how TikTok uses session-based RNNs and why position bias destroys your ranking quality. This one is about a different kind of AI system.
Not one model doing one job, but five models chained together, where every failure cascades downstream.
If you are preparing for AI PM interviews, this is the kind of real interview question you should be able to work through. We cover more of them in our course.
The Problem Statement
Given a user’s unstructured photo library, build a system that:
Identifies every unique garment the user has worn in any photo
Creates a clean catalogue with one entry per garment
Lets users combine items into outfits
Shows users how the outfit looks on their body
The input is chaos. Vacation photos, selfies, group shots, screenshots, food pictures, photos where you are partially visible, and photos where your jacket covers your shirt.
The output must be a clean wardrobe.
This is not one model. It is five models chained together.
Stage 1: Find the Photos That Matter
A typical library has 1,000 images. Maybe 400 contain you wearing visible clothing. The system needs to find those 400 and discard the rest.
Google Photos already detects and groups faces through its People feature. That tells the system which photos contain you. But face detection is not enough. You need body pose estimation.
The system runs a pose estimator to map body joints. Shoulders, elbows, wrists, hips, knees, ankles. These key points answer two questions.
How much of your body is visible?
And where does each garment zone fall in the image?
The region between the shoulders and the hips is a top. The region between the hips and the ankles is a bottom.
If only your face and shoulders are in frame, there is no point in trying to extract pants.
The PM decision here is the visibility threshold.
How many keypoints must be detected for the photo to enter the pipeline?
Too strict and you miss casual selfies that still show a good shirt.
Too lenient and you flood every downstream stage with blurry, occluded garbage.
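A minimal sketch of that gate, assuming a COCO-style pose estimator that returns an (x, y, confidence) triple per joint. The joint lists, confidence cutoff, and minimum keypoint counts below are illustrative choices for the sketch, not Google's actual thresholds.

```python
# Sketch of the Stage 1 visibility gate. Assumes a pose estimator that returns
# COCO-style keypoints as (x, y, confidence); all thresholds are illustrative.
from typing import Dict, List, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence)

TOP_JOINTS = ["left_shoulder", "right_shoulder", "left_hip", "right_hip"]
BOTTOM_JOINTS = ["left_hip", "right_hip", "left_knee", "right_knee",
                 "left_ankle", "right_ankle"]

def visible(kp: Keypoint, min_conf: float = 0.5) -> bool:
    return kp[2] >= min_conf

def extractable_zones(keypoints: Dict[str, Keypoint]) -> List[str]:
    """Return which garment zones (top / bottom) are worth segmenting."""
    zones = []
    # Top zone: the shoulders-to-hips region must be mostly in frame.
    if sum(visible(keypoints.get(j, (0, 0, 0))) for j in TOP_JOINTS) >= 3:
        zones.append("top")
    # Bottom zone: hips-to-ankles. Stricter, since legs are often cropped out.
    if sum(visible(keypoints.get(j, (0, 0, 0))) for j in BOTTOM_JOINTS) >= 4:
        zones.append("bottom")
    return zones

def passes_visibility_gate(keypoints: Dict[str, Keypoint]) -> bool:
    # Too strict misses casual selfies; too lenient floods downstream stages.
    return len(extractable_zones(keypoints)) > 0
```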
Stage 2: Segment Each Garment
Each photo might contain multiple garments. A shirt, a jacket, pants, and earrings. The system needs to isolate each item at the pixel level.
This is instance segmentation.
Not “there is clothing in this image” but “these exact pixels belong to this shirt, and those exact pixels belong to that jacket.”
The model takes an image and outputs bounding boxes and pixel masks, each labelled with a garment category.
Here is where the first real complexity shows up. Occlusion.
A jacket covers a shirt. A scarf drapes across a jacket. A bag strap cuts across your torso.
The naive answer is to only extract fully visible items. This fails.
If you always wear a jacket over a particular shirt in photos, that shirt never enters your wardrobe.
The better approach is to extract all detected garments and attach a visibility score.
Visibility Score = Visible Pixels / Estimated Total Pixels
A shirt with 80% of its pixels visible scores 0.8. A shirt with only 30% visible behind a jacket scores 0.3. This score determines which photo gets chosen as the representative thumbnail later.
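A sketch of that score, assuming the segmenter gives you the visible-pixel mask and something else (an amodal segmentation head, or the pose-derived garment zone) gives you an estimate of the garment's full extent. The toy example at the bottom is illustrative only.

```python
import numpy as np

def visibility_score(visible_mask: np.ndarray, estimated_full_mask: np.ndarray) -> float:
    """Visible Pixels / Estimated Total Pixels, clipped to [0, 1].

    visible_mask: boolean mask of pixels the segmenter assigned to the garment.
    estimated_full_mask: boolean mask of where the garment would sit if it were
    unoccluded (e.g. from an amodal head or a pose-derived garment zone).
    """
    total = estimated_full_mask.sum()
    if total == 0:
        return 0.0
    return float(np.clip(visible_mask.sum() / total, 0.0, 1.0))

# Toy example: a shirt with 80% of its estimated area visible scores 0.8.
full = np.zeros((100, 100), dtype=bool)
full[20:80, 30:70] = True            # estimated full extent: 2,400 pixels
vis = full.copy()
vis[20:32, :] = False                # a jacket hides the top ~20% of the shirt
print(round(visibility_score(vis, full), 2))   # -> 0.8
```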
Stage 3: The Hardest Problem in the Pipeline
You now have roughly 1,000 segmented garment patches across 400 photos. Many are the same physical item photographed in different conditions. Your favourite blue shirt appears in 40 photos. Different lighting. Different wrinkles. Different backgrounds. Different angles.
The system needs to know that these are all the same shirt.
This is a visual re-identification problem. The same class of problem that security systems use to track a person across multiple cameras.
The naive approach is pixel-level image similarity. Compare the raw pixels of two garment patches. This fails immediately.
The same white shirt photographed indoors under warm lighting looks golden. Outdoors, it looks blue-white. Against a dark background, it appears brighter. Pixel similarity would call these three different shirts. They are the same shirt.
The correct approach is to learn a garment embedding.
You pass each segmented garment through a feature extraction network.
A CNN or Vision Transformer fine-tuned on fashion datasets like DeepFashion. The network outputs a compact vector, say 256 dimensions, that captures the garment’s identity. Its colour, texture, pattern, cut, and structure. Not the lighting. Not the background. Not the wrinkles.
Two patches of the same shirt, photographed in completely different conditions, should produce vectors that are close together in this 256-dimensional space. Two different shirts should produce vectors that are farther apart.
Similarity(A, B) = cosine(Embedding(A), Embedding(B))
If cosine similarity exceeds a threshold, say 0.85, the system treats them as the same garment.
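As a sketch, the matching rule is only a few lines once you have the embeddings. The `embed()` call mentioned in the comments is a stand-in for the fine-tuned backbone; only the cosine rule and the 0.85 threshold come from the text above.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def same_garment(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.85) -> bool:
    """Treat two garment patches as the same physical item if their embeddings
    are close enough in the 256-dimensional space."""
    return cosine(emb_a, emb_b) >= threshold

# embed() would be the fine-tuned CNN / ViT backbone; shown here as a stub.
# blue_shirt_indoor = embed(patch_1)    # hypothetical call
# blue_shirt_outdoor = embed(patch_2)
# same_garment(blue_shirt_indoor, blue_shirt_outdoor)  # should come out True
```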
This threshold is the single most important PM decision in the entire pipeline.
Set it too high at 0.95, and the system creates duplicates. Your blue shirt appears four times because slightly different photos produced slightly different embeddings.
Set it too low at 0.70, and the system merges two genuinely different items into one. Your navy polo and your navy crew-neck collapse into a single entry.
Which mistake is more tolerable?
For a consumer product, false merges are worse. Showing two entries for one shirt is a mild annoyance. Deleting a unique garment by merging it with another item is data loss. You cannot undo it without rerunning the pipeline.
Bias the threshold higher. Accept some duplicates. Give users a manual merge option in the UI.
Stage 4: Cluster and Build the Catalogue
With 1,000 embeddings, the system clusters them. Each cluster represents one unique garment. DBSCAN works well here because it does not require specifying the number of clusters in advance. It finds natural groupings based on embedding distances.
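A sketch of this step using scikit-learn's DBSCAN over the Stage 3 embeddings. The `eps` and `min_samples` values are illustrative; since `eps` is a cosine distance here, 0.15 roughly mirrors the 0.85 similarity threshold from the previous stage.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_garments(embeddings: np.ndarray) -> np.ndarray:
    """Cluster ~1,000 garment embeddings into unique garments.

    embeddings: (N, 256) array, one row per segmented garment patch.
    Returns a label per patch; -1 marks noise (one-off, unmatched detections).
    """
    # eps is a cosine *distance*, so 0.15 ~ a 0.85 similarity threshold.
    clustering = DBSCAN(eps=0.15, min_samples=2, metric="cosine")
    return clustering.fit_predict(embeddings)
```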
From each cluster, select a representative thumbnail. Highest visibility score. Highest resolution. Best lighting.
But a raw crop from a photo still has a messy background. Your shirt thumbnail would show a slice of a restaurant behind you.
The system takes the segmented garment mask and runs an inpainting model. It generates a clean thumbnail on a neutral background, fills in any occluded parts of the garment, and removes everything else.
The output: a clean, catalogue-style image of each garment. That is what you see in the Wardrobe UI.
Stage 5: Classification
Each garment needs a category label. Tops, bottoms, dresses, outerwear, skirts, jewellery, and footwear.
The same feature extraction network from Stage 3 can feed into a classification head. Standard multi-class classification.
The design question that matters here is whether classification should run before or after clustering.
If you classify first, the category becomes a hard constraint during clustering. A shirt and a jacket can never merge, regardless of how similar their embeddings are. This eliminates absurd false merges. But classification errors now propagate. If a blazer gets mislabelled as a shirt, it will never match with other photos of that blazer correctly labelled as outerwear.
If you run classification and clustering in parallel, you can use category agreement as a soft constraint. Same category plus high embedding similarity means a high-confidence merge. Different categories plus high embedding similarity means the merge is flagged and only allowed at a stricter threshold.
The parallel approach is more robust. It lets one model compensate for the other’s mistakes.
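One way the soft constraint could be encoded, as a sketch: category agreement keeps the normal similarity bar, disagreement raises it. The exact threshold values are assumptions.

```python
def should_merge(sim: float, cat_a: str, cat_b: str,
                 base_threshold: float = 0.85,
                 strict_threshold: float = 0.93) -> bool:
    """Soft constraint: same category merges at the base threshold;
    different categories only merge at a stricter threshold (or get
    flagged for manual review instead)."""
    if cat_a == cat_b:
        return sim >= base_threshold
    return sim >= strict_threshold

# should_merge(0.88, "shirt", "shirt")   -> True  (high-confidence merge)
# should_merge(0.88, "shirt", "jacket")  -> False (flagged, not auto-merged)
```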
The Virtual Try-On
You select a top and a bottom from your wardrobe. You tap Try it on.
The system generates a photorealistic image of you wearing that combination.
Google did not build this for Photos. This technology already existed in Google Shopping, where it worked on billions of product listings. The PM reused it.
The underlying system is a diffusion-based image generation model built specifically for fashion. Here is how it works.
The model takes three inputs.
A photo of you.
A garment image.
A body pose estimate from Stage 1.
First, 2D warping. The garment pixels get mapped onto the region of your body where that garment would sit. The pose estimate locates your shoulders, torso, and hips. The segmentation mask provides shape and texture.
But simple warping produces artefacts. Sleeves do not match arm positions. Fabric does not fold correctly around your body.
The diffusion model takes over. It receives the warped image plus a segmentation map of the missing regions and generates the final output. Realistic fabric folds, shadows, and draping.
The critical constraint: the model preserves the exact texture, colour, and pattern of your actual garment. It only generates the physics of how that fabric interacts with your body. This is not “put a blue shirt on this person.” This is “take this specific shirt with this specific weave and show how it drapes on this specific body in this specific pose.”
This is why the output looks like your actual clothes. Not generic AI-generated clothing.
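The generation model itself is a research-grade system, but the data flow is simple enough to sketch. Everything below is hypothetical orchestration: `warp_garment` and `diffusion_model.inpaint` are invented stand-ins, not real Google APIs; only the two-step structure (warp, then diffuse) comes from the description above.

```python
import numpy as np

def warp_garment(garment: np.ndarray, mask: np.ndarray, pose: dict):
    """Hypothetical placeholder for the 2D warping step: map garment pixels
    onto the body region implied by the pose keypoints. A real system would
    use a learned flow or thin-plate-spline warp; this stub just passes the
    garment through and reports nothing as missing."""
    warped = garment.copy()
    missing = np.zeros(mask.shape, dtype=bool)
    return warped, missing

def generate_try_on(person: np.ndarray, garment: np.ndarray,
                    garment_mask: np.ndarray, pose: dict, diffusion_model):
    """Two-step flow from the text: warp, then diffuse.

    `diffusion_model` stands in for the fashion-specific generation model.
    It is assumed to expose an inpainting-style call that preserves the
    garment's exact texture, colour, and pattern and only generates folds,
    shadows, and draping for the flagged regions."""
    warped, missing_regions = warp_garment(garment, garment_mask, pose)
    return diffusion_model.inpaint(person=person,
                                   warped_garment=warped,
                                   regions_to_generate=missing_regions)
```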
Outfit Compatibility
There is one more model that the press release does not mention, but the product requires.
When a user mixes a top with a bottom, the system should score whether the combination works visually. This is a compatibility scoring problem.
The approach: train a model on labelled fashion datasets where outfit combinations are rated as compatible or incompatible. Each item has an embedding from Stage 3. The compatibility model takes two or more garment embeddings and outputs a score between 0 and 1.
Compatibility(Top, Bottom) = sigmoid(W · concat(Embedding_top, Embedding_bottom) + b)
A navy blazer with khaki chinos might score 0.9. A navy blazer with basketball shorts might score 0.2.
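The formula above is just a linear layer over the concatenated embeddings followed by a sigmoid. A minimal PyTorch sketch, assuming the 256-dimensional Stage 3 embeddings as inputs; the training setup is only hinted at in the comments.

```python
import torch
import torch.nn as nn

class CompatibilityHead(nn.Module):
    """Scores whether two garments work together, per the formula above:
    sigmoid(W . concat(emb_top, emb_bottom) + b)."""
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.linear = nn.Linear(2 * emb_dim, 1)

    def forward(self, emb_top: torch.Tensor, emb_bottom: torch.Tensor) -> torch.Tensor:
        x = torch.cat([emb_top, emb_bottom], dim=-1)
        return torch.sigmoid(self.linear(x)).squeeze(-1)

# Trained on labelled outfit pairs (compatible vs. incompatible).
# navy_blazer, khaki_chinos = torch.randn(256), torch.randn(256)
# CompatibilityHead()(navy_blazer, khaki_chinos)  # -> score in [0, 1]
```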
The PM decision is whether to surface this score or use it passively.
Surfacing it as “this outfit scores 4 out of 5” risks being wrong and annoying. Using it passively to sort moodboard suggestions by compatibility adds value without overcommitting.
The safe answer is passive integration. Let the user feel the algorithm without seeing it.
What Metrics Will You Track?
Most PMs would track whether users open the Wardrobe tab. That tells you almost nothing. Here is the framework that actually measures success.
Does the user come back?
Wardrobe retention at Day 7 and Day 30. Not Google Photos retention. Wardrobe-specific retention. A user who opens the Wardrobe tab once and never returns means the catalogue was not useful. A user who returns weekly is planning outfits. That is the behaviour you want.
Track the ratio of outfit creations to wardrobe visits. If users open the wardrobe but never create an outfit, the catalogue is interesting, but the mix-and-match experience is not compelling. Target at least 30 per cent of wardrobe visits resulting in an outfit action (create, save, or share).
Does Try It On convert?
Try It On usage rate among outfit creators. If users create outfits but never tap Try It On, the feature is either hidden or not trusted. Target at least 40 per cent of outfit creators using Try It On at least once.
Try It On completion rate. Does the user look at the generated image for more than 3 seconds? Do they save it or share it? If they generate a try-on and immediately dismiss it, the output quality is not meeting expectations.
Try It On repeat rate. Users who try it once and never again do not trust the result. Users who try it multiple times per session are engaged. A healthy repeat rate means the diffusion model is producing outputs that feel real.
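If these events land in a warehouse, the two headline ratios reduce to a few aggregates. A sketch in pandas, with the event names and columns invented purely for illustration:

```python
import pandas as pd

def wardrobe_funnel_metrics(events: pd.DataFrame) -> dict:
    """events: one row per event with columns [user_id, event_name], where
    event_name is one of: wardrobe_visit, outfit_action, try_on_generate.
    The event taxonomy and column names are assumptions for this sketch."""
    visits = int((events["event_name"] == "wardrobe_visit").sum())
    outfit_actions = int((events["event_name"] == "outfit_action").sum())
    creators = set(events.loc[events["event_name"] == "outfit_action", "user_id"])
    try_on_users = set(events.loc[events["event_name"] == "try_on_generate", "user_id"])
    return {
        # Target: at least 30% of wardrobe visits end in an outfit action.
        "outfit_actions_per_visit": outfit_actions / max(visits, 1),
        # Target: at least 40% of outfit creators use Try It On at least once.
        "try_on_rate_among_creators": len(creators & try_on_users) / max(len(creators), 1),
    }
```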
If this breakdown changed how you think about multi-model pipelines, cascade error budgets, and AI feature design, you will find much more depth in our AI PM course. We cover RAG architectures, evaluation frameworks, and real interview questions from top companies.
Check our highest-rated AI PM course (Including AI PM Interview Preparation) · 4.9/5 · 600+ enrollments → See testimonials and course details
About Author
Shailesh Sharma - I help PMs and business leaders excel in Product, Strategy, and AI using First Principles Thinking. Weekly Live Webinars/MasterClass (Here)
Subscribe to get the FREE Book (AI & Tech Simplified), Link in Welcome Email