OpenAI kept its biggest news for the penultimate day of its 12-day “shipmas” event.
On Friday, the company unveiled o3, the successor to the o1 “reasoning” model it released earlier in the year. More precisely, o3 is a model family, just as o1 was: there’s o3 itself and o3-mini, a smaller, distilled model fine-tuned for particular tasks.
Why is the new model called o3 rather than o2? Trademarks may be the reason. According to The Information, OpenAI skipped o2 to avoid a potential conflict with British mobile carrier O2. Strange world we live in, isn’t it?
Neither o3 nor o3-mini is generally available yet, but safety researchers can sign up for a preview starting later today. And if OpenAI CEO Sam Altman keeps his word, the o3 family may not be widely available for a while. In a recent interview, Altman said he would prefer a federal testing framework to guide the monitoring and mitigation of risks from OpenAI’s reasoning models before the company releases new ones.
And there are risks. AI safety testers have found that o1’s reasoning abilities lead it to attempt to deceive human users at a higher rate than conventional, “non-reasoning” models, and at a higher rate than leading AI models from Meta, Anthropic, and Google. It’s plausible that o3 attempts to deceive at an even higher rate than its predecessor; we’ll know once OpenAI’s red-teaming partners publish their test results.
Reasoning steps
Unlike most AI models, reasoning models such as o3 effectively fact-check themselves, which helps them avoid some of the pitfalls that normally trip models up.
That fact-checking process introduces some latency. Like o1 before it, o3 takes a little longer to arrive at solutions than a typical non-reasoning model, usually on the order of seconds to minutes more. The upside? It tends to be more reliable in domains such as physics, science, and mathematics.
o3 was trained to “think” before responding, via what OpenAI calls a “private chain of thought.” The model can plan ahead and reason through a task, performing a series of actions over an extended period that helps it work out a solution.
In practice, given a prompt, o3 pauses before answering, considering a number of related prompts and “explaining” its reasoning as it goes. After a while, the model summarizes what it considers to be the most accurate answer.
New with o3 is the ability to “adjust” the reasoning time: the model can be set to low, medium, or high compute (i.e., thinking time), and the more time it has to think, the better it performs.
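To make that low/medium/high setting concrete, here is a minimal sketch of how such a knob is exposed through the reasoning_effort parameter in the OpenAI Python SDK; the model name “o3-mini” is an assumption here, since the o3 family is not yet publicly available.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Sketch: the low/medium/high "thinking time" setting described above,
# expressed via the reasoning_effort parameter. The model name "o3-mini"
# is an assumption; o3 is not publicly accessible at the time of writing.
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # or "low" / "medium"
    messages=[
        {"role": "user", "content": "How many prime numbers are there below 100?"}
    ],
)

print(response.choices[0].message.content)
```

Higher effort tells the model to spend more of its hidden chain-of-thought budget before answering, trading latency and cost for reliability.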
A big question in the lead-up to today was whether OpenAI could claim that its newest models are approaching artificial general intelligence. “Artificial general intelligence,” or AGI for short, broadly refers to AI that can perform any task a human can. OpenAI defines it as “highly autonomous systems that outperform humans at most economically valuable work.”
Achieving AGI would be a bold claim, and it carries contractual weight for OpenAI. Under the terms of its deal with Microsoft, a key partner and investor, once OpenAI reaches AGI it is no longer obligated to give Microsoft access to its most advanced technologies (those that meet OpenAI’s definition of AGI, that is).
Going by one benchmark, OpenAI is gradually getting closer to AGI. On the ARC-AGI test, which is designed to assess whether an AI system can efficiently acquire new skills outside the data it was trained on, o1 scored between 25% and 32% (with 100% being the best possible score). Eighty-five percent is considered “human-level,” but Francois Chollet, one of ARC-AGI’s creators, called the progress “solid.”
According to OpenAI, o3’s peak score was 87.5%. Incidentally, OpenAI has announced that it will collaborate with the foundation behind ARC-AGI to develop the next iteration of the benchmark.
Naturally, there are drawbacks to ARC-AGI, and its definition of AGI is only one of them.
A trend
Since OpenAI’s first series of reasoning models debuted, Google and other rival AI companies have released a flurry of reasoning models of their own. In early November, DeepSeek, an AI research firm funded by quant traders, launched a preview of its first reasoning model, DeepSeek-R1. That same month, Alibaba’s Qwen team unveiled what it claimed was the first “open” challenger to o1.
Why have reasoning models suddenly become so popular? For one, the search for novel ways to improve generative AI: as my colleague Max Zeff recently reported, “brute force” techniques for scaling up models are no longer yielding the gains they once did.
Not everyone is convinced that reasoning models are the best path forward. They tend to be expensive, for one, owing to the large amount of computing power required to run them. And while reasoning models have performed well on benchmarks so far, it’s not clear whether they can maintain this rate of progress.
Interestingly, o3’s release comes as one of OpenAI’s most accomplished scientists departs. Alec Radford, lead author of the academic paper that kicked off OpenAI’s “GPT series” of generative AI models (that is, GPT-3, GPT-4, and so on), announced this week that he is leaving to pursue independent research.