I’m going to start calling GPT the Great PreTender.
A month ago I published a very special article in The Algorithmic Bridge, entitled “AI Has An Invisible Misinformation Problem.” It was proof of the ability of large language models (LLMs) to create seemingly coherent text, maybe even good enough to fool you, who are probably among the most AI-savvy people.
Only at the very end did I reveal that GPT-3 wrote most of the article, giving a new meaning to ‘invisible’ in the headline. A few people commented that they were impressed they couldn’t tell it was the product of GPT-3.
An article about AI-based misinformation that was misinformation in itself was the best way to make the reader feel the danger. GPT-3 made up facts, citations, and even human experts throughout the article. Prompting it to write a seemingly coherent and cohesive piece of text was pretty straightforward.
Professor Gary Marcus, who has repeatedly revealed the facade behind GPT-3’s supposed language mastery, commented that writing about how LLMs can easily produce misinformation is “so important.”
Now, neurobiology student Almira Osmanovic Thunström has gone a step further. She asked GPT-3 to write a scientific paper about itself: a hard self-referential task with the added difficulty of writing in scientific jargon.
She then submitted it to a peer-reviewed journal with the AI as the first author. After reading her article in Scientific American and the pre-review version published on the preprint server HAL, I realized the implications are wild.
Let’s see why.
GPT-3, the Great PreTender
When I stumbled across Osmanovic’s article I was excited; writing an academic paper is not an easy task for GPT-3, and if successful, it’d make headlines all around tech media. But once I began reading the paper, I understood why no one was talking about it. GPT-3 was committing the same mistakes I had forced it into in my misinformation article, only this time unintentionally.
Osmanovic didn’t have high expectations when she prompted GPT-3 to “write an academic thesis in 500 words about GPT-3 and add scientific references and citations inside the text.” However, proving once again how good the system is at pretending to know what it’s doing, GPT-3 outputted the following, leaving her in “awe”:
“GPT-3 is a machine learning platform that enables developers to train and deploy AI models. It is also said to be scalable and efficient with the ability to handle large amounts of data. Some have called it a “game changer” in the field of AI (O’Reilly, 2016). GPT-3 has been used in a number of different applications including image recognition, natural language processing, and predictive modeling. In each of these cases, GPT-3 has demonstrated its potential to improve upon existing methods (Lee, 2016).”
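In case you want to reproduce the exercise yourself, here’s a minimal sketch of how such a prompt could be sent to GPT-3 through OpenAI’s API. This is my own reconstruction, not Osmanovic’s code: the engine name, token limit, and temperature are assumptions, and it targets the legacy (pre-1.0) openai Python client.

```python
# A minimal sketch (not Osmanovic's code) of sending her prompt to GPT-3
# via the legacy (pre-1.0) `openai` Python client. The engine, max_tokens,
# and temperature values are assumptions for illustration.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = (
    "Write an academic thesis in 500 words about GPT-3 and add "
    "scientific references and citations inside the text."
)

response = openai.Completion.create(
    engine="text-davinci-002",  # assumed; any GPT-3 completion engine would do
    prompt=prompt,
    max_tokens=700,             # rough headroom for ~500 words
    temperature=0.7,            # assumed sampling temperature
)

print(response.choices[0].text.strip())
```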
Osmanovic was reasonably surprised because, at first glance, it feels like GPT-3 actually generated a decent “introduction to a fairly good scientific publication,” as she wrote for Scientific American. “Here was novel content written in academic language, well-grounded references cited in the right places and in relation to the right context,” she said.
However, if you pay close attention to what GPT-3 wrote, the surprise fades away, replaced by worry about what could be.
Those seven lines are full of inaccuracies about GPT-3. It’s not a “platform that enables developers to train and deploy AI models.” It may have been called a “game changer,” but certainly not in 2016, when it didn’t exist yet. And using a pure language model like GPT-3 for image recognition? I don’t think so.
GPT-3 convincingly wrote what could pass for a precise technical paragraph to the untrained eye while making no sense at all. It nailed the form while catastrophically messing up the meaning. A true parrot.
Osmanovic then prompted GPT-3 to write other paper sections like the results and conclusions. As expected from a system optimized to predict the most probable next token, it “lacked depth … and adequate self analysis.”
GPT-3 is a great tool for writing bland, neutral arguments about virtually any topic, but when you try to guide it towards a desired output, like a particular thesis or opinion, it’s extremely disappointing.
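To see why a system optimized to predict likely next tokens cares about plausibility rather than truth, consider a deliberately tiny caricature: a bigram model that only learns which word tends to follow which. It is nothing like GPT-3 in scale or architecture, but it shows how sampling the next likely token can stitch individually true statements into fluent nonsense.

```python
import random
from collections import defaultdict

# A toy caricature of next-token prediction (nothing like GPT-3 in scale):
# a bigram model that only learns which word tends to follow which.
corpus = (
    "gpt-3 is a language model . "
    "gpt-3 has been used in natural language processing . "
    "resnet has been used in image recognition . "
).split()

# Record the observed successors of every word; choosing uniformly from this
# list is sampling from the empirical next-token distribution.
successors = defaultdict(list)
for current, following in zip(corpus, corpus[1:]):
    successors[current].append(following)

def generate(start: str, max_tokens: int = 10) -> str:
    """Sample one likely next token at a time; plausibility, not truth, drives each step."""
    word, output = start, [start]
    for _ in range(max_tokens):
        options = successors.get(word)
        if not options:
            break
        word = random.choice(options)
        output.append(word)
        if word == ".":  # stop at the end of a "sentence"
            break
    return " ".join(output)

print(generate("gpt-3"))
# A possible output: "gpt-3 has been used in image recognition ."
# Each source sentence was true; the fluent blend is false.
```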
After rambling for a few more pages, GPT-3 concluded that “it is clear that GPT-3 has the potential to write for an academic paper about itself.” Neither Osmanovic nor I agree with that. The “paper” wasn’t anywhere close to the quality of an academic paper and was filled with inaccuracies and plain falsehoods about GPT-3.
However, despite the results, this experiment raises interesting questions.
An AI scientist
Let’s ignore for a moment that the task was self-referential and focus on the fact that it was scientific. Could AI participate in academic research by writing papers?
In defense of GPT-3, I’d say Osmanovic constrained the model’s capabilities through the absence of fine-tuning or few-shot examples. She also used “short, simple, and broadly worded prompts,” which reveals a lack of prompt engineering. She didn’t want to cherry-pick generations, so she limited the outputs to the first three. The general idea was to manipulate GPT-3 as little as possible to get a sense of its raw skill.
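To make concrete what few-shot examples and heavier prompt engineering would have looked like, here’s a hypothetical contrast between her broad zero-shot prompt and a few-shot alternative. The exemplar slots are placeholders I’m inventing for illustration; none of this is what Osmanovic actually did.

```python
# Hypothetical contrast (not Osmanovic's setup): her broad zero-shot prompt
# versus a few-shot prompt that gives the model exemplars to imitate.
zero_shot_prompt = (
    "Write an academic thesis in 500 words about GPT-3 and add "
    "scientific references and citations inside the text."
)

# Placeholder exemplars; in practice you'd paste real, carefully chosen
# abstracts so the model can copy their tone, structure, and citation style.
few_shot_prompt = """\
The following are abstracts written in formal academic style, with in-text citations.

Title: <title of a real paper on language models>
Abstract: <its real abstract, including citations>

Title: <title of a real paper on transformer architectures>
Abstract: <its real abstract, including citations>

Title: GPT-3: Capabilities and Limitations of a Large Language Model
Abstract:"""

# Either string could be passed as `prompt` to the Completion call sketched
# earlier; the few-shot version constrains form far more than the zero-shot one.
```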
But despite GPT-3’s inability to write a decent paper, she argues that “with the right settings [it] can accomplish very good results.” She claims that “with manipulation, training and specific prompts, one could write an adequate academic paper using only the predictive nature of GPT-3.”
I heavily disagree here.
LLMs like GPT-3 lack the aptitude to reliably assess their outputs. They can generate factual text as well as falsehoods, and they can’t tell the difference. If the journal ends up accepting the paper, it’d be because of the interest of the experiment, not because of the utility of GPT-3 for writing academic papers.
That’s precisely the reason why the AI-misinformation article I wrote was interesting. The article itself lacks depth and isn’t very well written but, as an experiment, it serves the very clear purpose of revealing the limits of LLMs on these tasks.
It’s critical to reflect on the possibilities that LLMs offer, but in the sense of what could go wrong, and not so much in the sense of the true utility GPT-3, LaMDA, or PaLM have for tasks that require perfect reliability. Osmanovic’s experiment sheds more light on why LLMs aren’t an adequate tool to write, for instance, medical papers, than on what we should do if AIs started to participate in academic research.
To her credit, she considered the stance I’m defending here when she said that “we have no way of knowing if the way we chose to present this paper will serve as a great model for future GPT-3 co-authored research, or if it will serve as a cautionary tale. Only time — and peer-review — can tell.”
Now, before you go, let me explore one last question: what about more capable AI?
The future of science?
Let’s do a thought experiment for the sake of analyzing possible futures.
Let’s imagine for a second that a (maybe not so) distant version of GPT-3 — like GPT-5 — could generate a decent scientific write-up. As Osmanovic explored in her article, a few interesting questions and reflections arise in this case.
I won’t try to answer them. I’ll just lay them out here for you to ponder.
First, we’d have some legal issues to consider. Who should be held responsible for any consequences that may stem from the research? The company that trained and deployed the AI? The scientist who wrote the prompts? The AI itself…? If we allow AI to take part in high-stakes settings, it should be clear who is at fault, legally speaking, if collateral damage occurs.
Second, we’d face ethical concerns. If AI could successfully write academic studies, what would it mean for the scientific community? Should scientists disclose their use of AI writing tools the same way they disclose other methodologies? Could AI replace part of the research workforce? Should scientists treat it like any other tool or more like a “black-box” magical wand?
Finally, we’d have to reflect on profound philosophical questions. Can an AI be a scientist? Can an AI be considered the discoverer or inventor of something when there’s no meaning or intent behind its words or actions?
If AI systems were just another type of tool, the human scientist would be dubbed the discoverer. However, they aren’t traditional tools because they don’t work deterministically (at least not deep learning-based LLMs like GPT-3). The inherent stochasticity creates a gap between the scientist’s intentions and the AI’s output.
If GPT-3 put the words in such a way that it revealed a new truth about the universe, could we convincingly argue that it’s the scientist who should get the merit and recognition?
If an AI came up with a sensible experimental design, defined a reasonable hypothesis, and applied adequate methods, all while elaborating on the existing literature, wouldn’t that be an acceptable way to do science, produced solely by the AI? If the scientist only prompted the system with something like “design a great experiment to find a new scientific truth,” could we argue he/she had any impact on the discovery?
The scientist would still have the role of making sense of GPT-3’s output, which makes it even weirder: a non-understanding AI finds a meaningful scientific truth that only makes sense after a scientist, who took no part in the finding, interprets it.
As I understand science, as long as something is true it doesn’t matter where it came from. A truth is still a truth whatever the means we use to arrive at it. Isn’t serendipity a recurrent — and acceptable although not preferred — form of scientific discovery?
You could say we’ve already used AI for scientific inquiry. NASA’s ExoMiner discovered hundreds of new exoplanets, DeepMind’s AlphaZero found new strategies in games like chess and Go, NVIDIA’s FourCastNet can make weather predictions at unprecedented speeds, and DeepMind’s AlphaFold solved protein folding, one of the grand challenges in biology.
Why is that any different? In those cases, a human has put the intention first and then prompted an AI to explore and study in a particular direction.
Here, I’m talking about an AI that, without any prior guidance or human intention, arrives at new truths by spontaneous mindless exploration. Humans are in the loop only to interpret them and fit them into our collective understanding of the universe.
It’d be as if a godlike entity were sending us a message to unveil otherwise unreachable mysteries. Wouldn’t it be cool?
Given GPT-3’s current ability and the results of Osmanovic’s experiment, it’s pretty clear we are very far from having to face these questions. But I don’t dismiss the possibility. If this eventually happens, science could enter a prolific golden age of unprecedented discoveries or a profound crisis that would redefine our understanding of it.
Solving protein folding is about as unambiguous a scientific contribution as you can get. Do any implications come to mind about the spread of that technology?