
From 0.6% to 2.0% CTR: an AI-thumbnail experiment across 100+ HTML5 games

Generative AI is a force multiplier on creative tasks that are high-volume, style-bounded, and have a fast feedback loop. Game thumbnails are exactly that shape.

By Mohammad Muzeem · 7 min read

Earlier this year we ran a fairly disciplined experiment at Gamezop: replace the human-designed thumbnails for over 100 HTML5 games with Midjourney-generated alternatives, A/B test them on the MSN distribution surface, and measure the click-through impact. The headline result was a CTR lift from 0.6% to 2.0% across the test set. That is roughly a three-fold improvement on the single highest-leverage creative asset in our funnel.

This post is not a hype piece about generative AI. It is a working PM breakdown: why this specific task fit AI well, what the prompt template ended up looking like, how the A/B framework worked, and the categories of games where AI thumbnails made performance worse — because the failure modes are more interesting than the success rate.

Why thumbnails are the right test case

In a games-distribution business, the thumbnail is the conversion event. Users scroll a grid of tiles inside a partner property like MSN, see a still image and a title, and decide in roughly 200 milliseconds whether to tap. There is no preview video, no description paragraph, no review score. The thumbnail does almost all of the persuasion work.

Thumbnails are also high-volume (100+ active titles, each refreshed when a game gets re-themed or its placement context changes) and style-bounded, in the sense that there is a recognizable "game thumbnail" aesthetic that a model can learn from. They also have a very fast feedback loop, since CTR on a partner surface reaches statistical significance in days, not weeks. High volume × style-bounded × fast feedback is exactly the shape where generative AI is most likely to outperform humans on cost-adjusted output.

The prompt template

Most of the early attempts produced thumbnails that were technically beautiful but commercially useless. The model would render a cinematic, dimly lit fantasy scene with a small character in the middle distance — gorgeous to look at, completely wrong for a tiny thumbnail tile on a busy news page. The competing tiles around it on MSN are loud, saturated, character-forward. A subtle thumbnail loses every time.

After about a week of iteration we settled on a template that looked roughly like this: "[game subject], front-facing, centered, exaggerated expression, saturated palette, 3D render, clean background, no text, no UI elements, thumbnail composition, mobile-friendly". The "no text, no UI elements" clause was essential: Midjourney tends to add fake game logos and HUD overlays, which confused users. The "thumbnail composition" cue produced tighter framing. The "mobile-friendly" cue, oddly, biased the model toward thumbnails that still read clearly at very small sizes.
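
To make the template concrete, here is a minimal sketch of how the cue list could be assembled per game. The function name, the constant name, and the example subject are hypothetical and only for illustration; the cues themselves are the ones listed above. It builds the prompt string only and does not call any image-generation API.

```python
# Fixed style cues, copied from the template described in the post.
STYLE_CUES = [
    "front-facing",
    "centered",
    "exaggerated expression",
    "saturated palette",
    "3D render",
    "clean background",
    "no text",
    "no UI elements",
    "thumbnail composition",
    "mobile-friendly",
]


def build_thumbnail_prompt(game_subject: str) -> str:
    """Join the per-game subject with the fixed style cues (hypothetical helper)."""
    subject = game_subject.strip()
    if not subject:
        raise ValueError("game_subject must not be empty")
    return ", ".join([subject, *STYLE_CUES])


# Example subject is invented for illustration.
prompt = build_thumbnail_prompt("a grinning cartoon fox driving a go-kart")
print(prompt)
```

Keeping the cues in one list means a change to the template applies to every game at once, and only the subject varies between titles.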

For each game we generated 8 candidates, hand-picked the top 2 on recognizability and emotional pull, then sent both into the A/B test rotation. The human in the loop matters more than the prompt. The model produces variance; the PM produces selection pressure.

The A/B framework

Each test ran the original thumbnail (control) against the AI thumbnail (variant) with 50/50 traffic allocation, scoped to a single placement context on MSN to control for surrounding inventory. The primary metric was CTR; the secondary was per-session game completion rate, which catches cases where a misleading thumbnail attracts the wrong users and they bounce off the game.

We required ~50,000 impressions per arm before reading the result, which usually took 2–3 days at the volumes we were seeing. Tests ended on either statistical significance at p < 0.05 or a hard 7-day cap, whichever came first. The hard cap matters because it limits exposure to context drift: the longer a test runs, the more likely some unrelated MSN editorial change moves the surrounding context.
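
The stopping rule above can be sketched in a few lines. This is an illustration under assumptions, not the tooling we actually ran: it assumes a pooled two-proportion z-test with a two-sided p-value (the post does not depend on which test was used), and the function names and the click counts in the example are made up.

```python
import math

MIN_IMPRESSIONS_PER_ARM = 50_000  # from the post
ALPHA = 0.05                      # from the post
MAX_DAYS = 7                      # hard cap from the post


def two_proportion_p_value(clicks_a, imps_a, clicks_b, imps_b):
    """Two-sided p-value for a pooled two-proportion z-test (assumed test)."""
    p_a = clicks_a / imps_a
    p_b = clicks_b / imps_b
    pooled = (clicks_a + clicks_b) / (imps_a + imps_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / imps_a + 1 / imps_b))
    if se == 0:
        return 1.0  # no clicks (or all clicks) in both arms: no evidence either way
    z = (p_b - p_a) / se
    # 2 * (1 - Phi(|z|)) expressed through the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))


def stopping_decision(clicks_a, imps_a, clicks_b, imps_b, days_running):
    """Return 'significant', 'capped', or 'running' for one test."""
    enough_data = min(imps_a, imps_b) >= MIN_IMPRESSIONS_PER_ARM
    if enough_data:
        p = two_proportion_p_value(clicks_a, imps_a, clicks_b, imps_b)
        if p < ALPHA:
            return "significant"
    if days_running >= MAX_DAYS:
        return "capped"
    return "running"


# Illustrative counts only: control (a) vs. AI variant (b) after 3 days.
print(stopping_decision(310, 50_000, 1_020, 50_000, days_running=3))
```

Holding the read until both arms reach the impression floor is what keeps a lucky first few hours from ending a test early.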

What the numbers looked like

Across the full test set the AI thumbnails won 71% of head-to-heads. Average CTR on winners moved from 0.62% to 2.04% (the headline result). Among the games where AI lost, the original thumbnail averaged a 0.83% CTR. These were games where the human design had been unusually strong to begin with, often IP-driven titles where users were recognizing a specific character.
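
For anyone checking the arithmetic, a small sketch of the three ways to state that lift. The helper name is hypothetical; the two inputs are the average CTRs on winners quoted above, in percent.

```python
def lift_summary(control_ctr_pct: float, variant_ctr_pct: float) -> dict:
    """Express a CTR change as percentage points, relative lift, and a multiple."""
    if control_ctr_pct <= 0:
        raise ValueError("control CTR must be positive")
    diff = variant_ctr_pct - control_ctr_pct
    return {
        "absolute_pp": diff,                          # percentage points
        "relative_pct": diff / control_ctr_pct * 100,  # relative lift in percent
        "multiple": variant_ctr_pct / control_ctr_pct,
    }


summary = lift_summary(0.62, 2.04)
print({name: round(value, 2) for name, value in summary.items()})
```

That works out to about 1.42 percentage points, a relative lift of roughly 229%, or about 3.29x. The three numbers describe the same change; the multiple is the one the headline uses.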

The secondary metric was the more interesting signal. Per-session completion rate stayed flat or improved in 89% of winners, meaning the new thumbnails were not attracting users who then bounced. In the remaining 11%, completion rates dropped meaningfully — and digging into those cases is the rest of this post.
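
One way to encode that guardrail is sketched below. The post does not define how large a drop counts as "meaningful", so the 5% relative tolerance here is an assumed placeholder, and the class, function, and label names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ArmResult:
    ctr: float              # click-through rate as a fraction, e.g. 0.0204
    completion_rate: float  # per-session game completion rate as a fraction


def classify_outcome(control: ArmResult, variant: ArmResult,
                     max_relative_completion_drop: float = 0.05) -> str:
    """Label a finished test using CTR first, then the completion guardrail.

    max_relative_completion_drop is an assumed tolerance, not a figure
    from the experiment.
    """
    if variant.ctr <= control.ctr:
        return "keep_control"
    if control.completion_rate > 0:
        drop = (control.completion_rate - variant.completion_rate) / control.completion_rate
    else:
        drop = 0.0  # nothing to lose if the control never completes
    if drop > max_relative_completion_drop:
        return "review"  # CTR won, but the new users are bouncing
    return "ship_variant"


# Illustrative values only.
print(classify_outcome(ArmResult(0.0062, 0.40), ArmResult(0.0204, 0.30)))
```

The point of the second check is that a CTR win alone never ships: a variant that wins clicks but loses completions gets a human look first.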

Where AI thumbnails made things worse

  • Puzzle and word games. AI tends to over-promise dynamism. A word game with an explosively colored, character-driven thumbnail attracts users who expected action; they discover a Scrabble-shaped board and leave. We rolled these back and kept human-designed minimalist thumbnails for the category.
  • Licensed IP games. Users searching for a known character do not want Midjourney's interpretation of that character; they want the canonical art. AI thumbnails for IP titles consistently underperformed and we never deployed them to production.
  • Story-driven adventure games. These benefit from atmosphere, mood, and narrative context, which are exactly the dimensions our thumbnail prompt template suppressed in favor of legibility. The template was tuned for "tap-now" conversion, which is wrong for games where the appeal is "sit down for thirty minutes."
  • Anything with a strong existing CTR baseline. If the original thumbnail was already converting above 1.5%, the AI rarely beat it. The lift was concentrated in the long tail of titles where a human designer had never spent meaningful time.
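
Taken together, the four cases above suggest a simple pre-filter before a title enters the AI rotation. This is a sketch of that idea, not a rule we ran in production; the category labels and function name are hypothetical, and only the 1.5% baseline comes from the results above.

```python
# Hypothetical category labels for the groups where AI thumbnails lost.
EXCLUDED_CATEGORIES = {"puzzle", "word", "licensed_ip", "story_adventure"}

# Originals converting above this CTR were rarely beaten by the AI variant.
STRONG_BASELINE_CTR = 0.015


def is_ai_thumbnail_candidate(category: str, baseline_ctr: float) -> bool:
    """Return True if a title is worth testing with an AI thumbnail."""
    if category.strip().lower() in EXCLUDED_CATEGORIES:
        return False
    return baseline_ctr <= STRONG_BASELINE_CTR


# Illustrative catalogue entries: (title, category, baseline CTR).
catalogue = [
    ("Kart Dash", "racing", 0.006),
    ("Word Grid", "word", 0.007),
    ("Hero Run", "arcade", 0.018),
]
candidates = [title for title, cat, ctr in catalogue
              if is_ai_thumbnail_candidate(cat, ctr)]
print(candidates)
```

A filter like this saves test traffic for the long tail, where the lift was concentrated.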

The generalizable bits

I would not generalize this result to "AI beats humans at creative." The honest framing is narrower: AI is excellent at producing variance inside a tightly specified aesthetic on tasks humans deprioritize. The 100+ thumbnails in our long tail were never going to get a full designer pass. They got the AI pass instead, and that pass beat the previous baseline.

For other PMs considering a similar experiment, the checklist that actually predicts success looks like:

  • The task is high-volume and the marginal piece is currently under-resourced.
  • The output sits inside a recognizable style envelope you can describe to a model in 50 words or fewer.
  • You have a clean conversion signal that closes the loop in days.
  • You have a secondary metric that catches false-positive lifts: CTR that comes from misleading creative is worse than no lift at all.
  • You are willing to roll back categories where the AI underperforms, rather than insisting on the universal "AI is better" narrative.

The last point is the one most teams skip. The interesting outcome from this experiment was not the 3x CTR headline. It was the precise list of categories where the AI lost — because each of those is a product insight about what users are actually shopping for when they scroll a games grid. The losses are where the learning lives.

Written by Mohammad Muzeem. Opinions are personal and do not represent any current or past employer. Corrections welcome at muzeem.mm@gmail.com.