When Google rolled out Gemini’s image generation feature, it was quickly pulled again. Google CEO Sundar Pichai had to apologize and said the model’s behavior was “completely unacceptable.” What happened? Google had built the model to generate diverse images. Unlike many other image GenAI tools, it would not return only white, senior men when asked for a ‘CEO’. However, it went too far with historical figures, for example depicting the US founding fathers as People of Color.
The uproar in the large language model community was huge. People complained that Gemini was falsifying history. Some even went as far as accusing Google of being ‘super woke’ (‘woke’ used negatively). However, what Gemini essentially did was ensure that diversity was built in. It was an acknowledgment that no foundation LLM has data sources that would enable it to accurately reflect history or society, let alone their full diversity. As a result, GenAI models inject bias via stereotypes, sometimes overtly and sometimes very subtly. This is called algorithmic bias.
That the reaction to Google’s move was so negative is quite telling in many ways.
Users assumed that an LLM reliably produces historically accurate information. The LLM isn’t seen as a tool for creativity but for factual accuracy. What isn’t understood is that an LLM calculates a statistical model based on its training data, without any fact-checking. This is despite widespread reporting that LLMs suffer from bias and hallucinations. So, with Google’s correction, CEOs came in all skin colors, all genders, all nationalities, etc. But it also made the US founding fathers or Nazi soldiers equally diverse, which is obviously not historically correct.
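To make the “statistical model without fact-checking” point concrete, here is a minimal sketch. It assumes the open-source Hugging Face transformers library and the small public GPT-2 checkpoint (not Gemini): the model only ranks which tokens are statistically likely to come next, and nothing in this loop, or in the model itself, verifies facts.

```python
# Minimal sketch, assuming the Hugging Face "transformers" library and the
# public GPT-2 checkpoint. The model merely ranks statistically likely next
# tokens; no step here checks whether the continuation is factually correct.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "A typical CEO is a"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Probability distribution over the next token, derived purely from
# patterns in the training data.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)).strip():>12}  p={float(prob):.3f}")
```

Whatever stereotype dominates the training data dominates this ranking; the output is a statistical echo, not a checked fact.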
But here is an important aspect users are unaware of: general-purpose LLMs do not have the high-quality data necessary to be accurate. Their implementation even distorts whatever accurate data is available. A prime example of this is the “white Jesus”. The internet data on which LLMs are trained largely depicts Jesus as white. Accordingly, an LLM that calculates its result statistically also renders Jesus with light skin. And yet this is historically inaccurate: we know that Jesus was not white.
ChatGPT was released to the public without much limitation or guidance, as a way to find out what LLMs could be used for; its scope was intentionally wide. This misled the public into thinking that an LLM can be used even for sensitive topics, despite its tendency toward hallucinations and bias. In their raw form, LLMs, for example, invent non-existent court cases in legal filings or embed gender stereotypes into performance reviews. The risks related to foundation models and their wide range of potential tasks are recognized in the EU AI Act for this very reason.
While these issues get attention in the press, there have been no equally high-profile cases of an LLM being taken offline as a result, despite all LLMs exhibiting significant issues in this respect.
Another disturbing fact about the uproar: the “super woke” results of Gemini were perceived as a significant problem, a much bigger one than an LLM subversively giving us biased information, such as the white Jesus or CEOs being shown as white men. This clearly shows two things:
As a solution to this lack-of-diversity dilemma, a speaker at a recent event about AI image generation suggested that users “prompt in diversity”. This is an acknowledgment that, even for creative uses, lack of diversity is the default. I find this lack of representation in today’s output troubling, but it will lead to even more amplification: as more and more content is generated with the help of LLMs, and with systems where users do not even write an explicit prompt, we will end up training future LLMs on this even more skewed data.
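To illustrate what “prompting in diversity” could look like in practice, here is a hypothetical sketch. The attribute lists, their values and the prompt wording are illustrative assumptions, not a recommended taxonomy, and the call to any real image-generation API is left out.

```python
# Hypothetical sketch of "prompting in diversity": the user's base prompt is
# enriched with explicitly sampled attributes instead of leaving representation
# to the model's statistical defaults. Attribute lists are illustrative only.
import random

DIVERSITY_ATTRIBUTES = {
    "ethnicity": ["Black", "East Asian", "South Asian", "Latina or Latino", "white", "Middle Eastern"],
    "gender": ["woman", "man", "non-binary person"],
    "age": ["in their 30s", "in their 40s", "in their 50s", "in their 60s"],
    "disability": ["", "who uses a wheelchair", "who wears a hearing aid"],
}

def prompt_in_diversity(base_prompt: str) -> str:
    """Append randomly sampled attributes so diversity is explicit, not left to defaults."""
    parts = [random.choice(options) for options in DIVERSITY_ATTRIBUTES.values()]
    person = " ".join(part for part in parts if part)
    return f"{base_prompt}, depicting: {person}"

# The enriched prompt would then be sent to an image generator.
print(prompt_in_diversity("a portrait of a CEO at their desk"))
# e.g. "a portrait of a CEO at their desk, depicting: South Asian woman in their 50s"
```

The point of the sketch is the burden it reveals: the user (or a wrapper like this) has to do the work that the model’s defaults do not.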
Data is also an issue. Go ahead and try to get any image generator to give you an image of someone aged 40–55. The default is 20–35. It is possible to get toddlers and children as well as elderly people, but the 40–55 age group is missing and close to impossible to prompt for. Similar issues exist for people in wheelchairs or with other disabilities.
Now, Google was rightfully criticized for overshooting the target: the model made it impossible to prompt in more specific instructions. Google acknowledged this in a blog post, where they also made this important additional statement:
“One thing to bear in mind: Gemini is built as a creativity and productivity tool, and it may not always be reliable, especially when it comes to generating images or text about current events, evolving news or hot-button topics. It will make mistakes. As we’ve said from the beginning, hallucinations are a known challenge with all LLMs — there are instances where the AI just gets things wrong. This is something that we’re constantly working on improving.”
But Google’s intention was well-meaning: to correct the inherent bias of LLMs. In this spirit, we at Witty Works welcome Gemini’s attempt, which Adobe Firefly also seems to follow. And if you want historical accuracy (within the limits of the model’s underlying data), then you should be required to prompt this in explicitly. This also gives the model the chance to recognize that you are, in fact, using it not for creativity (the default) but for accuracy. That, in turn, would allow the model to add relevant disclaimers about its limitations, a key step towards an ethical use of AI for humanity. Or, better yet, use an LLM built for the purpose of generating accurate historical images.
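A hypothetical sketch of that last idea: accuracy is something the user has to ask for, and when they do, the system attaches a disclaimer about the model’s limits. The keyword list and the generate_image() stub are assumptions for illustration, not Gemini’s or any vendor’s actual mechanism.

```python
# Hypothetical sketch: creativity is the default mode; an "accuracy" request
# has to be prompted in, and it triggers a disclaimer about the model's limits.
ACCURACY_KEYWORDS = ("historically accurate", "factual", "accurate depiction", "as it really was")

DISCLAIMER = (
    "Note: this image comes from a statistical model of internet data. "
    "It may not be historically accurate and can reflect biases in that data."
)

def generate_image(prompt: str) -> str:
    """Stand-in for a real image-generation call (illustrative stub)."""
    return f"<image for: {prompt}>"

def generate_with_intent(prompt: str) -> dict:
    """Default to creative mode; switch to accuracy mode only when the user asks for it."""
    wants_accuracy = any(keyword in prompt.lower() for keyword in ACCURACY_KEYWORDS)
    result = {
        "mode": "accuracy" if wants_accuracy else "creative",
        "image": generate_image(prompt),
    }
    if wants_accuracy:
        result["disclaimer"] = DISCLAIMER
    return result

print(generate_with_intent("a historically accurate image of the signing of the US Constitution"))
```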
At the same time, the data sources need to become more diverse, at least to the point where we can prompt for representation of any group. That is the absolute baseline.
Algorithmic bias describes systematic and repeatable errors in a computer system that create "unfair" outcomes, such as "privileging" one category over another in ways different from the intended function of the algorithm.
https://en.wikipedia.org/wiki/Algorithmic_bias
Generative artificial intelligence (generative AI, GenAI, or GAI) is artificial intelligence capable of generating text, images or other data using generative models, often in response to prompts. Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.
https://en.wikipedia.org/wiki/Generative_artificial_intelligence
The term is often used interchangeably with LLM (see below).
In the field of artificial intelligence (AI), a hallucination or artificial hallucination (also called confabulation or delusion) is a response generated by an AI that contains false or misleading information presented as fact.
https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)
A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.
https://en.wikipedia.org/wiki/Large_language_model
A foundation model is a machine learning model that is trained on broad data such that it can be applied across a wide range of use cases. Foundation models have transformed artificial intelligence (AI), powering prominent generative AI applications like ChatGPT.
https://en.wikipedia.org/wiki/Foundation_model