On April 21, 2026, OpenAI shipped gpt-image-2 and within twelve hours it took the number one spot on every Image Arena leaderboard with a +242 Elo lead, the largest gap ever recorded on that benchmark. The previous record holder, Nano Banana 2, was suddenly two hundred and forty-two points behind in a system where five points usually means someone nudged ahead. This is not a release. It is a leaderboard reset.
The model name in the API is gpt-image-2. The product surface name is ChatGPT Images 2.0. Same engine. OpenAI also confirmed that DALL-E 2 and DALL-E 3 will both be retired on May 12, 2026. The entire generation chapter that started with DALL-E in 2021 closes in six days.
This article is the data sheet, not the editorial. Specs, accuracy percentages, head-to-head wins and losses, pricing tiers, and the integration arbitrage if you are deciding between gpt-image-2, Midjourney v8.1, Nano Banana Pro, and Flux 2 Pro for production use this week.
What did OpenAI actually ship on April 21, 2026?
The model spec, in raw form. gpt-image-2 is OpenAI's first image model with native reasoning built into the architecture, what OpenAI calls Thinking mode. It generates up to eight images from a single prompt with character and object continuity preserved across the batch. It outputs at a native resolution of 2,000 pixels. Aspect ratios run from 3:1 to 1:3, covering square, portrait, landscape, and ultrawide use cases without external upscaling.
Two operating modes ship at launch. Instant returns a single image fast, with the new quality bar but without the planning step. Thinking lets the model reason through layout, search the web inside the generation loop for reference data, and verify its own output before delivery. Per OpenAI's launch post, that self-verification step matters most for character consistency across multiple frames and for in-image text accuracy.
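The public endpoint is not live yet, so any code here is a sketch rather than documentation. Assuming gpt-image-2 drops into the existing OpenAI Images API the way gpt-image-1 did, the eight-image batch from the spec above would look roughly like this; the model string and the size value are assumptions until OpenAI publishes the endpoint.

```python
# Speculative sketch: the public gpt-image-2 endpoint does not open until early
# May, so this only assumes the model drops into the existing OpenAI Images API
# the way gpt-image-1 did. The model string, the n=8 batch, and the size value
# come from the spec above but are unconfirmed as API parameters.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-2",   # assumed model name once the endpoint ships
    prompt=(
        "A red-haired courier cycling through Tokyo at dusk, "
        "same character across eight different street scenes"
    ),
    n=8,                   # the 8-image coherent batch from a single prompt
    size="2000x2000",      # assumed encoding of the native 2,000-pixel output
)

for i, image in enumerate(result.data):
    print(i, image.url or "(base64 payload)")
```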
The model is also integrated directly into Codex, OpenAI's code-generation environment, so a developer can generate UI mockups inline alongside the code that uses them. That is a workflow shift for front-end teams. The model arrived on the consumer ChatGPT product on April 22, with the official API exposing gpt-image-2 by name in early May 2026, per OpenAI's own roadmap.
The retirement timeline is tight. OpenAI confirmed DALL-E 2 and DALL-E 3 will both go dark on May 12, 2026, twenty-one days after gpt-image-2 launched. Anyone with production code calling DALL-E 3 has under a week to migrate.
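If the endpoint shape holds, the migration itself is a one-string swap. A hedged sketch, assuming gpt-image-2 is served through the same images.generate call DALL-E 3 uses today:

```python
# Migration sketch, assuming gpt-image-2 is served through the same
# images.generate endpoint DALL-E 3 uses today (unconfirmed until early May).
from openai import OpenAI

client = OpenAI()

def generate_asset(prompt: str, use_new_model: bool = True):
    # "dall-e-3" stops working on May 12, 2026; flip the model string once
    # the gpt-image-2 endpoint is live and regression-test your prompts.
    model = "gpt-image-2" if use_new_model else "dall-e-3"
    return client.images.generate(model=model, prompt=prompt, size="1024x1024")

generate_asset("Social card: 'Spring sale, 20% off everything', bold sans-serif")
```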
The Elo score that justifies the word "reset"
The headline number is 1512. That is gpt-image-2's text-to-image Elo on LM Arena, the public benchmark driven by human preference votes. The previous leader, Google's Nano Banana 2, sat at roughly 1270 the day gpt-image-2 launched. The 242-point gap is the largest spread the Arena has ever recorded between number one and number two in this category. For context, the typical gap between the top three models in any category sits between five and twenty Elo points. A 242-point gap is roughly twelve to fifty times the normal spread.
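To put the gap in head-to-head terms: assuming the Arena follows the conventional Elo logistic scale, a 242-point lead translates to the higher-rated model being preferred in roughly four out of five pairwise votes, where a typical top-three gap barely clears a coin flip.

```python
# What a 242-point lead means head-to-head, assuming the Arena follows the
# conventional Elo logistic (base 10, 400-point scale).
def expected_win_rate(elo_gap: float) -> float:
    """Probability the higher-rated model wins a random pairwise preference vote."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400))

print(round(expected_win_rate(242), 3))  # ~0.801 -> preferred in roughly 4 of 5 votes
print(round(expected_win_rate(20), 3))   # ~0.529 -> a typical top-three gap
print(round(expected_win_rate(5), 3))    # ~0.507 -> statistical noise territory
```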
Within twelve hours of launch, gpt-image-2 was number one across every category on Image Arena, not just text-to-image. That covers stylised generation, photorealism, multilingual rendering, and dense layouts. TechCrunch confirmed the leaderboard sweep on April 21, 2026, and Axios published the same data the same day. The pattern that matters here is not just that gpt-image-2 won. It is that it won across categories where it was previously expected to trail Midjourney for aesthetics and Nano Banana Pro for photorealism.
The score is a snapshot. Elo on Arena moves as new models ship and as votes accumulate. Nano Banana Pro and Midjourney v8.1 Alpha both have updates expected through May. By June, the gap will compress. The signal worth keeping is that on launch day, gpt-image-2 achieved a generational lead simultaneously on every Arena scorecard, a profile no image model has had since the original DALL-E 3 release in late 2023.
How accurate is the text rendering, really?
This is the hardest claim to verify because there is no industry-standard benchmark for in-image text accuracy. OpenAI's own blog quotes 99% character accuracy on Latin scripts. Independent reviews published the day after launch put the Latin number between 95% and 99% depending on font weight and prompt complexity. Non-Latin scripts come in lower but still well above any prior model. Here is the matrix, consolidated across OpenAI's launch post, the AI Video Bootcamp benchmark published April 22, and Phygital+'s walk-through published the same day.
On Latin scripts (English, French, Spanish), gpt-image-2 lands at 95% to 99% character accuracy depending on font weight and prompt complexity. DALL-E 3 capped at roughly 71% on the same prompts. On Chinese (Simplified and Traditional), Japanese, and Korean, the new model holds above 90%, where DALL-E 3 sat below 50% and frequently produced gibberish. Hindi and Bengali also clear 90% on gpt-image-2 and were effectively unsupported in production before. Arabic and Hebrew right-to-left rendering is the remaining weak spot: OpenAI's own examples show only partial success, and RTL output should be treated as draft rather than deliverable until further notice.
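Because no standard benchmark exists, those percentages are only as comparable as the measurement behind them. One plausible way a character-accuracy figure gets computed is one minus the normalized edit distance between the intended string and the OCR'd render; the sketch below shows that metric, not the cited reviewers' actual methodology.

```python
# Illustrative only: one plausible way a "character accuracy" number can be
# computed, since no industry-standard benchmark for in-image text exists.
# This is not necessarily how OpenAI or the cited reviewers measured theirs.
def character_accuracy(intended: str, rendered: str) -> float:
    """1 - normalized Levenshtein distance between intended and OCR'd text."""
    m, n = len(intended), len(rendered)
    dp = list(range(n + 1))  # single-row dynamic-programming edit distance
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (intended[i - 1] != rendered[j - 1]))  # substitution
            prev = cur
    return 1.0 - dp[n] / max(m, n, 1)

print(character_accuracy("TONKOTSU RAMEN  ¥980", "TONKOTSU RAMEN  ¥980"))  # 1.0
print(character_accuracy("TONKOTSU RAMEN  ¥980", "TONKOTSU RAMEM  ¥9B0"))  # ~0.9
```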
The practical implication is direct. If your product injects text into images (UI mockups, infographics, social cards, packaging mockups, multilingual ads, restaurant menus, education materials), the bottleneck that defined every prior generation is gone for Latin and CJK scripts. RTL stays the one open gap.
This single change, more than the reasoning architecture, is what kills DALL-E commercially. DALL-E 3 produced gibberish once in-image text ran past about eight characters. gpt-image-2 produces a printable menu with correct prices. That is the gap, in one sentence.
Where gpt-image-2 wins, where it loses
The cleanest April benchmark on this question is AI Video Bootcamp's "Echoes of Tokyo" movie poster test, published April 22 and updated April 30. Same prompt, eight models, no retries, highest quality tier on each model. The capabilities tested in a single image: Latin typography, multilingual typography, dense layout with six discrete text blocks, cinematic photorealism, color palette control, and aspect ratio adherence. Results consolidated.
Where gpt-image-2 takes the win cleanly: dense in-image text (Midjourney v8.1 trails at roughly 71-78% character accuracy, Nano Banana Pro and Flux 2 Pro both behind), multilingual typography across CJK and Indic scripts, layout planning via Thinking mode (a capability the other three models do not have), and the 8-image coherent batch from a single prompt.
Where it loses. On cinematic photorealism, Nano Banana Pro keeps the lead, with Flux 2 Pro second and Midjourney v8.1 close behind. On skin texture and portraiture, Nano Banana Pro is again ahead. On mood and lighting, Midjourney v8.1 has not been displaced. And on raw speed, Midjourney v8.1 Alpha runs roughly 3x faster than v7, putting it ahead of gpt-image-2's Instant mode for fast iteration.
Read these results by use case, not by overall winner. If your deliverable is a poster with text, an infographic, a UI mockup, a multilingual social card, or a mockup that needs to read at small sizes, gpt-image-2 is the only correct choice today. If your deliverable is a product hero shot, a fashion editorial, or anything where skin and material realism is the value, Nano Banana Pro is still ahead. If the work is mood and tone (album covers, film moodboards, conceptual art direction), Midjourney v8.1 holds.
The takeaway for studios and product teams. Expect to run two or three models in parallel, not one. The single-tool assumption from the DALL-E 3 era is over. Pick the model by capability, not by subscription.
The Thinking mode is the actual architectural shift
Strip away the leaderboard noise. The substantive change is that gpt-image-2 is the first image model that plans before drawing. Thinking mode runs three steps the prior generation could not run. First, it reasons through composition (where the text goes, how the negative space breaks up, what the focal hierarchy is), the way a human art director sketches before refining. Second, it pulls reference data from the web during generation, so a prompt that mentions a specific landmark or brand can include an actual visual reference rather than the model's training set memory. Third, it cross-checks its own output before delivery, regenerating elements that fail an internal coherence test.
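To make the loop concrete, here is a conceptual sketch of plan, reference, verify with stand-in stubs. It is not OpenAI's implementation; every function in it is a placeholder, and it only shows the control flow the launch post describes.

```python
# Conceptual sketch only: stand-in stubs for the plan -> reference -> verify
# loop the article describes. This is NOT OpenAI's implementation.
from dataclasses import dataclass

@dataclass
class Plan:
    entities: list[str]
    text_blocks: list[str]

def plan_composition(prompt: str) -> Plan:
    # Step 1 stand-in: decide layout, focal hierarchy, and text placement
    # before any pixels exist.
    return Plan(entities=["Tokyo Tower"], text_blocks=["ECHOES OF TOKYO"])

def fetch_web_references(entities: list[str]) -> dict[str, str]:
    # Step 2 stand-in: pull live visual references instead of relying on
    # training-set memory.
    return {e: f"https://example.com/ref/{e}" for e in entities}

def render(prompt: str, plan: Plan, refs: dict[str, str]) -> dict:
    # Stand-in for the actual generation call.
    return {"prompt": prompt, "rendered_text": list(plan.text_blocks), "refs": refs}

def verify(image: dict, plan: Plan) -> list[str]:
    # Step 3 stand-in: internal coherence check (is every planned text block
    # present and spelled correctly?).
    return [t for t in plan.text_blocks if t not in image["rendered_text"]]

def generate_with_thinking(prompt: str, max_repairs: int = 2) -> dict:
    plan = plan_composition(prompt)
    refs = fetch_web_references(plan.entities)
    image = render(prompt, plan, refs)
    for _ in range(max_repairs):
        if not verify(image, plan):
            break
        image = render(prompt, plan, refs)  # regenerate the failing elements
    return image

print(generate_with_thinking("Movie poster: Echoes of Tokyo, dusk palette"))
```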
Two downstream consequences ship with the model. Phygital+ documented selective area editing on April 22, where the user can mask one zone of an existing generation and reprompt only that zone, without re-rendering the rest of the image. That is the workflow that finally closes the gap with Photoshop's generative fill. Camera angle control, also documented April 22, lets a prompt specify "low angle, three-quarter view, 35mm" the way a director frames a shot, with reasonable obedience.
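On the API side, mask-based inpainting already exists as images.edit for gpt-image-1 and DALL-E 2, so the likely shape of selective area editing once gpt-image-2 reaches the endpoint is roughly the sketch below; the model string and its availability on that endpoint are assumptions.

```python
# Speculative sketch of selective area editing through the API. Mask-based
# inpainting already exists as images.edit for gpt-image-1 and DALL-E 2; the
# gpt-image-2 model string and its availability here are assumptions.
from openai import OpenAI

client = OpenAI()

result = client.images.edit(
    model="gpt-image-2",                 # assumed model name
    image=open("poster_v1.png", "rb"),   # the existing generation
    mask=open("title_mask.png", "rb"),   # transparent pixels mark the zone to reprompt
    prompt=(
        "Rewrite the title block as 'ECHOES OF TOKYO' in a condensed sans-serif, "
        "low angle, three-quarter view, 35mm framing; leave everything else untouched"
    ),
)
print(result.data[0].url or "(base64 payload)")
```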
What does not work yet. OpenAI's own launch post acknowledges three failure modes: coherent physical-world models (gravity, weight, contact), fine repetitive details (chain mail, dense floral patterns, scientific diagrams), and iterative editing past the second revision. Diminishing returns kick in fast on the third or fourth pass, with the model losing track of what was meant to stay constant. The honest read is that the Thinking mode delivers a generational jump on planning and verification, but the underlying physics intuition has not changed.
Worth noting: this architectural shift moves in the same direction Anthropic took with Claude Design earlier in April, which we covered in our piece on Anthropic killing the design handoff. Both labs are betting that reasoning before rendering produces better creative output than scaling raw generation. April 2026 will likely be remembered as the month that bet started paying off.
Which plan should you actually buy?
The plan tiering gates the features that matter, so read it carefully before committing. The Free tier gets Instant mode only: no Thinking, no API. Plus at twenty dollars per month unlocks Thinking mode (web reference, multi-image batching, output verification, layout planning), but no API. Pro at two hundred dollars per month adds priority Thinking on top of everything Plus offers, still without API. Business and Enterprise unlock direct gpt-image-2 API access on top of full Thinking. The direct API endpoint, opening early May 2026, is the fourth path for embedded use cases.
The Free tier gets the headline quality bar with Instant mode but loses everything that justifies the launch hype: web reference, multi-image batching, output verification, layout planning. So "everyone gets the new model" is true on paper and false in practice. The product moat is in Thinking, and Thinking is paywalled at twenty dollars per month minimum.
For studios and product teams, the relevant decision is between Plus at twenty dollars for individual creators, Pro at two hundred dollars for production volume with priority Thinking, and direct API access for embedded use cases. The API is expected to be priced comparably to GPT-4o image calls, with OpenAI confirming early May for the public gpt-image-2 endpoint. We have been tracking this kind of agentic-tool roll-out closely, including in our piece on Claude Managed Agents and brand strategy, which makes a related point about the gap between what a free tier exposes and what production work actually requires.
The clean integration arbitrage for a studio shipping client work this week. Use Plus or Pro for visualisation and exploration. Use the API the moment it ships in May for any production pipeline that calls images programmatically. Run gpt-image-2 in parallel with Nano Banana Pro for photorealism work and Midjourney v8.1 for mood work, on hybrid deliverables. Wait until June for stable benchmarks before consolidating onto any single tool. The leaderboard is not done moving.
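For teams wiring that arbitrage into a pipeline, the routing logic is simple enough to sketch; the buckets below mirror the use-case read above, and the model identifiers are labels for illustration, not real SDK strings.

```python
# Illustrative routing sketch for the run-two-or-three-models-in-parallel setup.
# The deliverable buckets mirror the use-case read above; the model identifiers
# are plain labels, not real SDK strings for Midjourney or Nano Banana Pro.
TEXT_HEAVY = {"poster", "infographic", "ui_mockup", "social_card", "menu"}
PHOTOREAL = {"product_hero", "fashion_editorial", "portrait"}
MOOD = {"album_cover", "moodboard", "concept_art"}

def pick_model(deliverable: str) -> str:
    if deliverable in TEXT_HEAVY:
        return "gpt-image-2"        # dense in-image text, multilingual typography
    if deliverable in PHOTOREAL:
        return "nano-banana-pro"    # skin texture, material realism
    if deliverable in MOOD:
        return "midjourney-v8.1"    # mood and lighting
    return "gpt-image-2"            # default until the June benchmarks settle

print(pick_model("infographic"))    # gpt-image-2
print(pick_model("product_hero"))   # nano-banana-pro
```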
If you want to talk through the integration trade-offs for your own pipeline, we run technical audits as part of our services. Or look at the projects we have built where this kind of capability arbitrage is the whole job.
Sources
- OpenAI: Introducing ChatGPT Images 2.0 (April 21, 2026)
- TechCrunch: ChatGPT's new Images 2.0 model is surprisingly good at generating text (April 21, 2026)
- Axios: Images in ChatGPT are getting a major update (April 21, 2026)
- The New Stack: With the launch of ChatGPT Images 2.0, OpenAI now thinks before it draws (April 21, 2026)
- MacRumors: OpenAI launches ChatGPT Images 2.0 with Thinking capabilities and better text rendering (April 22, 2026)
- AI Video Bootcamp: ChatGPT Images 2.0 review, price, tests, verdict (April 22, 2026, updated April 30, 2026)
- Phygital+: ChatGPT Image 2.0 guide after April 2026 update (April 22, 2026)
- Creative Bloq: ChatGPT Images 2.0 has people declaring the death of graphic design, again (April 24, 2026)