Google unveils its competitor to OpenAI’s text-to-image model

24 May 2022

AI-generated image of two robots sitting at opposite ends of a table, holding glasses of wine. The Eiffel Tower is in the background.

An AI-generated image from Imagen. Image: Google Research

Google Research said its Imagen AI model was preferred in tests over DALL-E 2 in terms of ‘sample quality and image-text alignment’.

Google Research has developed a competitor for OpenAI’s text-to-image system, with its own AI model that can create artworks using a similar method.

Google’s research team said its text-to-image model, Imagen, has an “unprecedented degree of photorealism” and a deep level of language understanding.

Text-to-image AI models are able to understand the relationship between an image and the words used to describe it.

Once a description is added, a system can generate images based on how it interprets the text, combining different concepts, attributes and styles.

For example, if the description is ‘a photo of a dog’, the system can create an image that looks like a photograph of a dog. But if this description is altered to ‘an oil painting of a dog’, the image generated would look more like a painting.

Imagen’s team has shared a number of example images that the AI model has created – ranging from a cute corgi in a house made from sushi, to an alien octopus reading a newspaper.

OpenAI released the first version of its text-to-image model, DALL-E, last year. Last month, it unveiled an improved model called DALL-E 2, which it said “generates more realistic and accurate images with four times greater resolution”.

The AI company explained that the model uses a process called diffusion, “which starts with a pattern of random dots and gradually alters that pattern towards an image when it recognises specific aspects of that image”.
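To make the idea of starting from random dots and gradually altering them more concrete, here is a toy Python sketch. It is not OpenAI’s or Google’s method: a real diffusion model does not know the final image and instead uses a trained neural network, guided by the text description, to decide how to change the pattern at each step. In this sketch the target image is simply given, and the function names `toy_denoise` and `distance` are made up for illustration.

```python
import random


def toy_denoise(target, steps=50, seed=0):
    """Move a random starting pattern toward `target` in small steps.

    Assumption for illustration only: the target is known in advance.
    A real diffusion model has no target and relies on a trained network.
    """
    rng = random.Random(seed)
    # Start from a pattern of random values, one per pixel.
    current = [rng.gauss(0.0, 1.0) for _ in target]
    history = [list(current)]
    for step in range(steps):
        # Fraction of the remaining gap closed at this step.
        # It grows over time and equals 1.0 on the final step.
        rate = 1.0 / (steps - step)
        current = [c + rate * (t - c) for c, t in zip(current, target)]
        history.append(list(current))
    return history


def distance(a, b):
    """Euclidean distance between two equally sized pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


if __name__ == "__main__":
    target = [0.0, 0.5, 1.0, 0.5]  # a tiny four-pixel "image"
    history = toy_denoise(target, steps=10, seed=1)
    for i, pattern in enumerate(history):
        print(i, round(distance(pattern, target), 4))
```

Running it prints the distance to the target after each step, which shrinks until the pattern matches the target.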

In a newly published research paper, the team behind Imagen claims to have made several advances in terms of image generation.

The paper says large frozen language models trained only on text data are “surprisingly very effective text encoders” for text-to-image generation. It also suggests that scaling up a pretrained text encoder improves sample quality more than scaling up the image diffusion model.

Google’s research team created a benchmark tool to assess and compare different text-to-image models, called DrawBench.

Using DrawBench, Google’s team said human raters preferred Imagen over other models such as DALL-E 2 in side-by-side comparisons “both in terms of sample quality and image-text alignment”.
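A side-by-side comparison of this kind can be summarised as the share of rater votes each model receives. The Python sketch below shows that tally under simple assumptions: each vote names one preferred model, the function name `preference_rates` is hypothetical, and the votes are invented placeholders rather than DrawBench results.

```python
from collections import Counter


def preference_rates(votes):
    """Return the fraction of votes each model received."""
    counts = Counter(votes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes to summarise")
    return {model: count / total for model, count in counts.items()}


if __name__ == "__main__":
    # Invented placeholder votes, not data from the Imagen paper.
    votes = ["model_a", "model_b", "model_a", "model_a"]
    print(preference_rates(votes))
```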

Concerns of misuse

Like OpenAI, Google Research said there are several ethical challenges to be considered with text-to-image research.

The team said these models can affect society in “complex ways” and that the risk of misuse raises concerns about releasing open-source code and demos.

“The data requirements of text-to-image models have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets,” the research paper said.

“While this approach has enabled rapid algorithmic advances in recent years, datasets of this nature often reflect social stereotypes, oppressive viewpoints and derogatory, or otherwise harmful, associations to marginalised identity groups.”

The researchers also said that preliminary analysis of Imagen suggests that the model encodes a range of “social and cultural biases” when generating images of activities, events and objects.

“We aim to make progress on several of these open challenges and limitations in future work,” they added.

When OpenAI unveiled DALL-E 2 last month, concerns were raised that this technology could help people spread disinformation online through the use of authentic-looking fake images.
