Spend enough time with ChatGPT and other artificial intelligence chatbots, and their tendency to produce false information quickly becomes apparent.
Described as hallucination, confabulation, or simply making things up, this issue now poses a problem for businesses, organizations, and high school students who rely on generative AI systems to compose documents and get work done. Some people use these systems for tasks such as psychotherapy or researching and writing legal briefs, where the potential consequences are significant.
According to Daniela Amodei, co-founder and president of Anthropic, the company behind the chatbot Claude 2, virtually all existing models suffer from some degree of hallucination.
Amodei explained that these models are designed primarily to predict the next word in a given context, so there will inevitably be some rate at which they make that prediction inaccurately.
Anthropic, OpenAI (the creator of ChatGPT), and other leading developers of large language models acknowledge the need to enhance the truthfulness of their AI systems.
However, how long that will take, and whether these systems will ever be reliable enough to safely provide medical advice or perform other critical tasks, remains to be seen.
Emily Bender, a linguistics professor and director of the University of Washington’s Computational Linguistics Laboratory, holds the view that the issue may not be entirely fixable due to the fundamental mismatch between the technology and the intended use cases.
Much depends on the reliability of generative AI technology, which the McKinsey Global Institute estimates could add $2.6 trillion to $4.4 trillion to the global economy. That estimate covers applications well beyond chatbots, including tools that generate images, video, music, and computer code, and language is a common component of most of them.
Google is already promoting an AI product for news-writing to media organizations, where accuracy is crucial. The Associated Press is also exploring the use of such technology in collaboration with OpenAI, using part of AP’s text archive to enhance AI systems.
In India, computer scientist Ganesh Bagler has been working with hotel management institutes to develop AI systems, including a ChatGPT precursor, that can invent recipes for South Asian cuisine, such as novel versions of rice-based biryani. Accuracy is vital there: a single “hallucinated” ingredient could be the difference between a delicious meal and an inedible one.
During OpenAI CEO Sam Altman’s visit to India, Bagler raised the issue of hallucinations in ChatGPT, arguing that what might be tolerable in casual conversation becomes a serious problem in an AI-generated recipe, and in other critical, high-stakes domains. When Bagler asked for his take, Altman expressed optimism and a commitment to addressing the problem.
Altman believes the hallucination problem can be brought to a much better place within about a year and a half to two years, to the point where it is no longer a major concern. He also acknowledged the need to balance creativity against perfect accuracy; the model must learn when to prioritize one over the other, depending on the context and the intended use.
However, some experts, like University of Washington linguist Bender, hold a different perspective. She describes a language model as a system designed to calculate the likelihood of various word sequences based on the written data it has been trained on.
Whether the improvements Altman envisions can overcome what skeptics like Bender see as a fundamental limitation comes down to what language models actually do.
Spell checkers rely on language modeling to detect incorrect words, and the technology also underpins automatic translation and transcription services, helping produce text that looks natural and typical in the target language. Many people encounter a version of it in the “autocomplete” feature when composing text messages or emails.
The latest generation of chatbots, including ChatGPT, Claude 2, and Google’s Bard, attempts to take language modeling to a higher level by generating entire passages of text. But according to Bender, they are still essentially repeatedly selecting the most plausible next word in a sequence.
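To make that point concrete, here is a deliberately simplified sketch, in Python, of the “most plausible next word” loop Bender describes. The vocabulary and probabilities are invented toy values, not drawn from any real chatbot; actual systems use neural networks over vast vocabularies, but the selection step is analogous.

```python
# A toy, hand-written table of conditional probabilities: given the last word,
# how likely is each candidate next word? (Invented values for illustration.)
NEXT_WORD_PROBS = {
    "the": {"cat": 0.4, "dog": 0.35, "moon": 0.25},
    "cat": {"sat": 0.6, "ran": 0.3, "is": 0.1},
    "dog": {"barked": 0.7, "sat": 0.3},
    "sat": {"quietly": 0.5, "down": 0.5},
}

def generate(start: str, max_words: int = 5) -> str:
    """Greedily append the single most probable next word at each step."""
    words = [start]
    for _ in range(max_words):
        choices = NEXT_WORD_PROBS.get(words[-1])
        if not choices:
            break  # no known continuation for this word
        # "Most plausible next word" is simply the highest-probability candidate.
        words.append(max(choices, key=choices.get))
    return " ".join(words)

print(generate("the"))  # prints "the cat sat quietly"
```

Nothing in that loop checks whether the resulting sentence is true; it only asks which continuation is most likely, which is precisely the gap hallucinations fall into.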
When these language models are used to generate text, they are designed to “make things up”; that is their primary function. They excel at mimicking forms of writing, such as legal contracts, television scripts, or sonnets, but whether the content they generate is correct is largely a matter of chance. Even with tuning to improve accuracy, failure modes will remain, and they may be ones that human readers find harder to notice.
For marketing firms using Jasper AI to assist with writing pitches, these errors may not pose a significant issue, according to the company’s president, Shane Orlick. However, for applications where accuracy is critical, such as medical advice or legal documents, addressing the limitations of language models remains a significant challenge.
In fact, Orlick sees hallucinations as an added bonus: they sometimes produce ideas or angles that customers would not have thought of themselves. Jasper AI works with partners including OpenAI, Anthropic, Google, and Meta to offer a range of AI language models tailored to customers’ needs; depending on whether a customer is more concerned about accuracy or about data security, different models may be recommended.
Orlick acknowledges that fixing hallucinations won’t be easy, but he expects companies like Google, which has a reputation for holding its search engine to a high standard of factual content, to pour significant resources into solutions. While perfection may be out of reach, he expects the technology to keep improving over time.
Techno-optimists, including Microsoft co-founder Bill Gates, express confidence that AI models can eventually be taught to distinguish fact from fiction. Gates cites promising work in this area, including research from OpenAI. Other researchers, like those at the Swiss Federal Institute of Technology in Zurich, have also been working on methods to detect and remove hallucinated content in models like ChatGPT.
Even Sam Altman, the CEO of OpenAI, doesn’t fully trust the answers ChatGPT produces when he is seeking information. Speaking at Bagler’s university, he joked that he probably trusts the accuracy of ChatGPT’s responses less than anyone. It is a reminder that, for all the progress and optimism, perfect accuracy and reliability in AI language models remain a long way off.