AI may be very good at some things, such as creating podcasts or coding. But according to a recent study, it struggles to pass a challenging history exam.
A group of academics has developed a new benchmark to evaluate the performance of three top large language models (LLMs) on historical questions: Google’s Gemini, Meta’s Llama, and OpenAI’s GPT-4. The benchmark, Hist-LLM, checks the accuracy of the models’ answers against the Seshat Global History Databank, a vast repository of historical knowledge named after the ancient Egyptian goddess of wisdom.
Researchers from the Complexity Science Hub (CSH), an Austrian research institute, said the findings, which were presented last month at the well-known AI conference NeurIPS, were disappointing. GPT-4 Turbo was the best-performing LLM, but its accuracy was just 46%, not much better than random guessing.
The study’s key finding is that, despite their impressiveness, LLMs still don’t have the breadth of knowledge needed for advanced history. “They’re great for basic facts, but they’re not yet up to the task when it comes to more nuanced, PhD-level historical inquiry,” said Maria del Rio-Chanona, an associate professor of computer science at University College London and one of the paper’s co-authors.
The researchers shared with TechCrunch some historical questions that LLMs answered incorrectly. For instance, GPT-4 Turbo was asked whether scale armor was present in ancient Egypt during a particular period. The LLM said yes, but the technology did not appear in Egypt until 1,500 years later.
If LLMs are so adept at answering extremely complex questions about things like coding, why do they struggle with technical history questions? “It’s probably because LLMs tend to extrapolate from very prominent historical data, finding it difficult to retrieve more obscure historical knowledge,” del Rio-Chanona told TechCrunch.
The researchers also noticed that the OpenAI and Llama models performed worse for certain regions, such as sub-Saharan Africa, which may indicate biases in the training data.
For instance, the researchers asked GPT-4 whether ancient Egypt had a professional standing army during a particular period in history. The LLM incorrectly answered that it did, even though the correct answer is no. This is probably because of the abundance of publicly available material on standing armies in other ancient empires, such as Persia.
“If you are told A and B 100 times, and C only once, and then are asked a question about C, you might just remember A and B and try to extrapolate from that,” del Rio-Chanona said.
According to Peter Turchin, a faculty member at CSH who led the study, the findings show that LLMs are still no substitute for humans in certain domains.
However, the researchers remain optimistic that LLMs can help historians in the future. They are now working to refine the benchmark by adding more data from underrepresented regions and more complex questions.
“Overall, our results underscore the potential for these models to aid in historical research, while also highlighting areas where LLMs need improvement,” the paper reads.
SOURCE: TechCrunch