The Rise of Multimodal Image Generation: A New Era for AI

Over the past two weeks, both Google and OpenAI have unveiled new multimodal image generation capabilities, a significant leap forward in AI technology. *You have probably seen OpenAI’s in action in every other social media post, rendered in the Studio Ghibli style. And, as always, Google’s work is completely inaccessible, even to paid Gemini users.* However, this post is not about how to create a cartoon caricature; it looks at other use cases.

From Text-to-Image to Multimodal Generation

Previously, when a Large Language Model (LLM) generated an image, it was not a direct process. The LLM wrote a text prompt, which was then sent to a separate image generation tool such as DALL-E. That tool produced the image, and the results were often mediocre: distorted text and random elements were common, so the outputs were sometimes amusing but rarely useful.

Multimodal image generation represents a significant advancement. Now the AI directly controls the image creation process. While the specifics vary and some methods remain proprietary, the core concept is to generate images much as LLMs generate text: token by token. Instead of assembling words into sentences, the AI assembles individual pieces into a complete image. This allows for much more precise and impressive results, reflecting the LLM’s “thinking” and enabling clear composition and control.

Examples of Multimodal Image Generation

The before: Image generated by ChatGPT, which used DALL-E

The after: Much better, though some text glitches remain. It almost feels as if OCR is being run on the image. Note: Image is cropped

Prompting Images Like People

In the past, I’ve emphasized the importance of treating AI like a person when prompting, even though it isn’t one. Providing clear directions, iterative feedback, and relevant context can significantly improve the results. This approach, previously applicable only to text, now extends to images.
For instance, when I asked GPT-4o to ‘create an infographic about omni-channel marketing,’ the AI produced a solid starting point. From there, it was like a conversation: I refined the image by asking it to ‘make the graphics hyper-realistic’ and then ‘shift the colors from earth tones to vibrant.’ Even small errors are easily corrected with a simple prompt. This shows how you can work with the AI to achieve the desired result. The capabilities of these models are truly impressive. Consider the following progression of prompts and results:

“put this infographic in the hands of an marketer standing in front of a shop front”
“Change the sign board of the store to Acme Inc.”
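
For readers who want to script this kind of back-and-forth, here is a minimal Python sketch of the refinement loop. It is not how GPT-4o works internally and it does not call any real API: `RefinementSession`, `run_session`, and `fake_backend` are hypothetical names invented for illustration, and the backend is a stand-in you would replace with whatever image service you actually use. The only idea it shows is that each new instruction is applied to the previous image rather than starting from scratch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical backend signature: (instruction, previous_image) -> new_image.
# A real image service would go here; nothing below assumes a specific vendor.
ImageBackend = Callable[[str, Optional[bytes]], bytes]


@dataclass
class RefinementSession:
    """Keeps the latest image and the list of instructions that produced it."""
    backend: ImageBackend
    history: List[str] = field(default_factory=list)
    image: Optional[bytes] = None

    def ask(self, instruction: str) -> bytes:
        # Pass the current image back in so the edit builds on the last result.
        self.image = self.backend(instruction, self.image)
        self.history.append(instruction)
        return self.image


def fake_backend(instruction: str, previous: Optional[bytes]) -> bytes:
    """Stand-in for an image model: appends each instruction to a byte log."""
    prior = previous or b""
    return prior + instruction.encode("utf-8") + b"\n"


def run_session(prompts: List[str], backend: ImageBackend) -> RefinementSession:
    """Apply the prompts in order, one conversational turn per prompt."""
    session = RefinementSession(backend=backend)
    for prompt in prompts:
        session.ask(prompt)
    return session


# The prompts quoted in this post, in the order they were given.
PROMPTS = [
    "create an infographic about omni-channel marketing",
    "make the graphics hyper-realistic",
    "shift the colors from earth tones to vibrant",
    "put this infographic in the hands of an marketer standing in front of a shop front",
    "Change the sign board of the store to Acme Inc.",
]

session = run_session(PROMPTS, fake_backend)
print(len(session.history), "turns applied")
```

Keeping the history alongside the image makes it easy to replay the same sequence later or to see which instruction introduced a glitch.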

While the results are not always perfect, the progress is undeniable.

The Potential of Multimodal Image Generation

Just as we’ve been exploring the applications of text-based AI models, we are now on the cusp of discovering the vast potential of image-based LLMs. The ability to upload and manipulate images directly within the AI opens up a world of possibilities. Here are a few examples, all created using GPT-4o:

Transforming a hand-drawn image into a UI mockup, complete with a professional-looking logo.
Swapping elements between photographs, such as replacing a coffee table with a touch screen.
Creating instant website mockups, ad concepts, and pitch decks for innovative startup ideas.

The applications extend far beyond these examples. Visual recipes, website homepages, video game textures, illustrated poems, photo enhancements, and visual adventure games are just a few of the areas where multimodal image generation is making a difference.