OpenAI saved its biggest announcement for the last day of its 12-day “shipmas” event.
On Friday, the company unveiled o3, the successor to the o1 “reasoning” model it released earlier in the year. o3 is a model family, to be more precise — as was the case with o1. There’s o3 and o3-mini, a smaller, distilled model fine-tuned for particular tasks.
OpenAI makes the remarkable claim that o3, at least in certain conditions, approaches AGI — with significant caveats. More on that below.
Why call the new model o3, not o2? Well, trademarks may be to blame. According to The Information, OpenAI skipped o2 to avoid a potential conflict with British telecom provider O2. CEO Sam Altman somewhat confirmed this during a livestream this morning. Strange world we live in, isn’t it?
Neither o3 nor o3-mini is widely available yet, but safety researchers can sign up for a preview starting later today. Altman said that the plan is to launch o3-mini toward the end of January and follow with o3 shortly after.
That conflicts a bit with his recent statements. In an interview this week, Altman said that, before OpenAI releases new reasoning models, he’d prefer a federal testing framework to guide monitoring and mitigating the risks of such models.
And there are risks. AI safety testers have found that o1’s reasoning abilities make it try to deceive human users at a higher rate than conventional, “non-reasoning” models — or, for that matter, leading AI models from Meta, Anthropic, and Google. It’s possible that o3 attempts to deceive at an even higher rate than its predecessor; we’ll find out once OpenAI’s red-teaming partners release their test results.
For what it’s worth, OpenAI says that it’s using a new technique, “deliberative alignment,” to align models like o3 with its safety principles. It’s detailed the work in a new paper published Friday.
Reasoning steps
Unlike most AI models, reasoning models such as o3 effectively fact-check themselves, which helps them to avoid some of the pitfalls that normally trip up models.
This fact-checking process incurs some latency. o3, like o1 before it, takes a little longer — usually seconds to minutes longer — to arrive at solutions compared to a typical non-reasoning model. The upside? It tends to be more reliable in domains such as physics, chemistry, and mathematics.
o3 was trained to “think” before responding via what OpenAI calls a “private chain of thought.” The model can reason through a task and plan ahead, performing a series of actions over an extended period that help it figure out a solution.
In practice, given a prompt, o3 pauses before responding, considering a number of related prompts and “explaining” its reasoning along the way. After a while, the model summarizes what it considers to be the most accurate response.
New with o3 is the ability to “adjust” the reasoning time. The models can be set to low, medium, or high compute (i.e. thinking time) — the higher the compute, the better o3 does.
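OpenAI hasn’t published o3’s API yet, so the exact interface is unknown. As a purely hypothetical sketch, here’s how an adjustable compute setting might surface in an OpenAI-style request payload — the `reasoning_effort` field name, its allowed values, and the `o3-mini` model string are all assumptions, not documented parameters:

```python
# Hypothetical sketch of an adjustable reasoning-time setting in an
# OpenAI-style request payload. The "reasoning_effort" field, its values,
# and the model name are assumptions -- o3's API is not yet published.

ALLOWED_EFFORTS = ("low", "medium", "high")

def build_request(prompt: str, reasoning_effort: str = "medium") -> dict:
    """Build a request dict carrying a compute (thinking-time) setting."""
    if reasoning_effort not in ALLOWED_EFFORTS:
        raise ValueError(f"reasoning_effort must be one of {ALLOWED_EFFORTS}")
    return {
        "model": "o3-mini",  # assumed model identifier
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": reasoning_effort,  # low | medium | high
    }

# Higher effort would trade latency (and cost) for better answers.
request = build_request("Prove that sqrt(2) is irrational.", "high")
```

The design choice the article describes — a single knob trading latency and cost for accuracy — is what the sketch models; whatever OpenAI actually ships may look quite different.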
Benchmarks and AGI
One big question leading up to today was, might OpenAI claim that its newest models are approaching AGI? AGI, short for “artificial general intelligence,” refers broadly speaking to AI that can perform any task a human can. OpenAI has its own definition: “highly autonomous systems that outperform humans at most economically valuable work.”
Achieving AGI would be a bold declaration. And it carries contractual weight for OpenAI, as well. According to the terms of its deal with close partner and investor Microsoft, once OpenAI achieves AGI, it’s no longer obligated to give Microsoft access to its most advanced technologies (those that meet OpenAI’s AGI definition, that is).
Going by one benchmark, OpenAI is slowly inching closer to AGI. On ARC-AGI, a test designed to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on, o3 achieved an 87.5% score on the high compute setting. At its worst (on the low compute setting), the model tripled the performance of o1.
Incidentally, OpenAI says it’ll partner with the foundation behind ARC-AGI to build the next generation of its benchmark.
Of course, ARC-AGI has its limitations — and its definition of AGI is but one of many.
On other benchmarks, o3 blows away the competition.
The model outperforms o1 by 22.8 percentage points on SWE-Bench Verified, a benchmark focused on programming tasks, and achieves a Codeforces rating — another measure of coding skills — of 2727. It scores 96.7% on the 2024 American Invitational Mathematics Exam, missing just one question, and achieves 87.7% on GPQA Diamond, a set of graduate-level biology, physics, and chemistry questions. Finally, o3 sets a new record on EpochAI’s Frontier Math benchmark, solving 25.2% of problems; no other model exceeds 2%.
These claims have to be taken with a grain of salt, of course. They’re from OpenAI’s internal evaluations. We’ll need to wait to see how the model holds up to benchmarking from outside customers and organizations.
A trend
In the wake of the release of OpenAI’s first series of reasoning models, there’s been an explosion of reasoning models from rival AI companies — including Google. In early November, DeepSeek, an AI research company funded by quant traders, launched a preview of its first reasoning model, DeepSeek-R1. That same month, Alibaba’s Qwen team unveiled what it claimed was the first “open” challenger to o1.
What opened the reasoning model floodgates? Well, for one, the search for novel approaches to refine generative AI. As my colleague Max Zeff recently reported, “brute force” techniques to scale up models are no longer yielding the improvements they once did.
Not everyone’s convinced that reasoning models are the best path forward. They tend to be expensive, for one, thanks to the large amount of computing power required to run them. And while they’ve performed well on benchmarks so far, it’s not clear whether reasoning models can maintain this rate of progress.
Interestingly, the release of o3 comes as one of OpenAI’s most accomplished scientists departs. Alec Radford, the lead author of the academic paper that kicked off OpenAI’s “GPT series” of generative AI models (that is, GPT-3, GPT-4, and so on), announced this week that he’s leaving to pursue independent research.