180: New ChatGPT 4o Image Generation
Remarkable enhancements available now; bigger challenges persist
What do we mean when we say we want an image?
This is the inherent crux for anyone building an AI-powered image-generating tool. Does the user mean a photo (and what kind)? A painting (and what kind)? An illustration (what kind)? And if you, the one writing the prompt, haven't studied art and don't know the nomenclature, how can you ask for a specific image successfully? That's the first problem.
If what you want from an image is photorealism, Midjourney has been the platform to beat. I've generated only 3,552 images in Midjourney since I joined two years ago; I know many designers and art directors who've generated tens of thousands. We all struggle to discern the magical sequence of words that will elicit the image we're seeking. But Midjourney has done a lot to solve that first problem by evolving its UX, shifting out of Discord into a web app, and offering personalization.
And it's not alone. Adobe's Firefly made inroads once Adobe ironed out its UX; its path forward is clearly rooted in corporate brand safety and broad integration across the Creative Suite. Now we have dozens of image generation tools from the likes of Ideogram, Leonardo, and Meta. Two weeks ago Google quietly released its Gemini 2.0 Flash Experimental model, offering noteworthy enhancements to its image-generating capabilities.
Of course, photorealism is only one kind of image. And it turns out lots of people also want to generate words and phrases (spelled correctly, please, and in specific fonts if you wouldn't mind) inside their images, regardless of style, which adds another layer of complexity to an already complicated process. Or they'd like to generate an image incorporating their product, represented accurately.
In the realm of Midjourney and its ilk, the solution to these issues has been ever more granular UX controls, along with distinctions between "style" (i.e., the surface of what we see: texture, lighting, etc.) and "composition" (i.e., the architecture of what we're seeing: shape, position, grid, etc.).
Which is where OpenAI enters the party.
Originally, OpenAI's approach to image generation was called DALL-E. It was a standalone platform, distinct from its faster-rising sibling, ChatGPT. But in October 2023, OpenAI integrated DALL-E into ChatGPT, and the concept of "multimodal" generation became normalized: one AI interface could deliver multiple modes of output, whether text, images, or perhaps text inside an image. With ChatGPT, you typed in a prompt and got an image; if you didn't like it, you typed an edit and got another image, often with the change you wanted and many you didn't ask for. But for some tasks, and some users, this conversational interface was easier to work with, if only a little maddening.
And I think this distinction, "use a chatbot (only)" to generate images versus "use a chatbot and an increasingly complex UX," is subtly important. But let's refocus on photorealism quality for a moment.
As I wrote back in December 2023, contrasting the major image generators in the context of a branding pitch, "Photorealism is not Dalle's strong suit. It prefers to work in an illustrative yet realistic space, at least at this point. And you only get one image at a time in Dalle, and if you ask it to regenerate, you lose the original." Here's how Midjourney, DALL-E 3, and Firefly responded to the prompt: "a photorealistic image of a designer working late at night on a branding pitch."
Much has changed in the past 15 months.
Image "quality" (at least in photorealism) across most of the image generation platforms has been getting better and better and better, except inside ChatGPT. While OpenAI's platform has grown through web search, "deep" research, and text-editing capabilities, its image generation remained a novelty, no serious competition for Midjourney.
Until this week.
Here’s the same prompt from above, and the same three platforms today. (And yes, bias for “white men with beards = designer” is absolutely a thing, and it’s boring.)
This improved capability is available right now to anyone using ChatGPT, including the Free tier. Access the function by first ensuring you're using the 4o or 4.5 model (upper left), then clicking the three dots under the chat bar and selecting Create image. This will add two words to the beginning of your prompt; just keep typing what you want visualized.
As OpenAI tells it, their new approach to image generation is quite different from the diffusion-model processes previously used by DALL-E, as well as Midjourney, Firefly, et al. They're now using what's called "an autoregressive approach," which generates images sequentially from left to right and top to bottom. More on this from The Verge.
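For readers who want a feel for the difference: a diffusion model refines an entire noisy image at once over many steps, while an autoregressive model emits the image piece by piece, each piece conditioned on everything produced so far. This toy Python sketch (the function names and the 16-value "token" vocabulary are my own invented stand-ins, not anything from OpenAI's actual model) illustrates only the left-to-right, top-to-bottom conditioning idea:

```python
import random

def autoregressive_image(width, height, next_token, seed=0):
    """Toy sketch of autoregressive generation: tokens are produced
    one at a time, left to right, top to bottom, and the model sees
    the full history of prior tokens at every step."""
    random.seed(seed)
    tokens = []
    for y in range(height):
        for x in range(width):
            # Each new token is conditioned on everything so far.
            tokens.append(next_token(tokens, x, y))
    # Reshape the flat token list into image rows.
    return [tokens[y * width:(y + 1) * width] for y in range(height)]

def toy_model(history, x, y):
    """Hypothetical stand-in for a real model: each token leans on
    its predecessor, drawn from a 16-value vocabulary."""
    prev = history[-1] if history else 0
    return (prev + random.randint(0, 3)) % 16

img = autoregressive_image(4, 3, toy_model)
```

Because every token depends on its predecessors, neighboring values tend to correlate, which is a crude analogue of why this approach can keep spelled-out text coherent across an image.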
But wait, there’s more.
Where I think ChatGPT can provide real value to creative people is its enhanced fidelity for text, logos, and packaging inside a generated image. Here’s an example. And there are many more in Ars Technica’s excellent coverage.
Is it perfect? Of course not! But the degree of photorealistic accuracy now versus a few weeks ago is astounding. The same can be said of this new model's approach to a range of different kinds of images.
As we keep saying, this is the worst version of these tools. Notice how the composition remains essentially the same across the different renders. Notice how the spelling remains consistent. These are the small ways in which the tools are improving.
Finally, let's return to my original point: all of what OpenAI is offering now with its latest image generation occurs in a chat thread, and nowhere else. There's no UX toolbar to open or configure, much less learn. You just type what you want to create, or edit. I think that level of simplicity is a huge advantage, especially in a realm as confusing as "What do we mean when we say we want an image?"
And you can see where this is going in Unilever’s partnership with NVIDIA and recent announcements around digital twins of its TRESemme and Dove products. Or H&M announcing its intentions to use AI-generated models. Imagine prompting “[Specific SKU] in [specific location/context] with [specific lighting/style]” and receiving 100% accurate “photography” in a few seconds.
It's worth your time to give ChatGPT 4o image generation a try.