For a lot of reasons, like Sam Altman kind of proving himself a slimeball in the OpenAI board drama last year, I really want Anthropic’s Claude to beat ChatGPT. So it’s despite my biases, not because of them, that I regret to report that I find ChatGPT o3 (not the free ChatGPT) markedly more useful than Claude or Gemini, even Claude 4 Opus with extended thinking turned on and the latest/greatest Google Gemini, which I’ve also been trialing this week. With the exception of writing code, at which Claude excels, o3 is just smarter and more helpful. Oh, and for math, nothing beats o4-mini-high. When I work on a math problem with that thing it feels like working with an intellectual peer. In terms of breadth of knowledge it wipes the floor with me, of course. So by intellectual peer I just mean in terms of getting one’s head around a problem, building intuition, and thinking of ways to solve it. It’s beyond uncanny.
To be clear, o3 and kin are not AGI. But consider how far we’ve come. I don’t think there exists any one-shot textual request that can unmask it as dumber than a human. This used to be so, so easy. Pre-ChatGPT, a few years ago, you could ask the model any random common-sense question (“what’s bigger, your mom or a french fry?”) and unless it got lucky, it would fall on its face. By GPT-4 you started to have to work a bit to find such questions, but there were still plenty of them. “How many r’s in ‘strawberry’?” was a classic. I don’t think such questions exist anymore. I mean, maybe Gary Marcus can still do it?
The above Manifold market thinks he may manage it for a few more years. I challenge anyone to do it with o3 today. The gauntlet is hereby thrown down:
So far traders think I’m probably/maybe wrong. Tune in next week to find out if I am.
(If o3 is so smart, could it write this newsletter? No, gross, I mean, kind of. I just hate its voice. As in, I never use its actual words because it just sounds all wrong to me, aesthetically. To demonstrate, I’ll make an exception for the rest of this parenthetical starting now. I favor evidence over hand-waving, concision over filler, and rapid iteration over confident mistakes. Point me at something thorny and let’s hammer it smooth together. [shudder])
Especially with the way o3 now cites its sources, I think we’re about to see people citing it for various fact checks with a perfectly straight face, and being perfectly correct to do so. It once seemed ridiculous to cite Wikipedia for fact checks (“literally anyone can type literally anything on literally any Wikipedia page!”) just like it now seems ridiculous to cite an LLM. But give it a couple years.
In the News
Google’s Veo 3 video generation is blowing the internet’s mind. AI-generated video does voices now. I recommend watching “prompt theory” (just google it) to see what’s possible, with a lot of effort. I haven’t tried replicating things like that myself. But it’s clearly about to be huge for everything from TikTok to Hollywood.
Shoshannah Tekofsky describes the Agent Village experiment, similar in spirit to the fake workplace experiment I talked about a month ago.
Still lack-of-news, but I continue to nervously await Tesla’s robotaxi launch. They’re launching something and my credibility depends on it not counting as members of the public riding in Teslas with no one in the driver’s seat and no one supervising remotely in real time, ready to hit the emergency brake. (Remote human assistance, like Waymo has, is fine.) As another demonstration of o3’s utility, check out the transcript of me bugging my robot intern to check the news for me every few days: chatgpt.com/share/6839f339-4cf0-800d-8152-c230b69ee480
I accidentally convinced my friend Nathan Arthur to get on the Substack train. He has good technical posts on using AI for software development. narthur.substack.com
In other friends-with-AI-newsletters news, I plan to join Christopher Moravec’s webinar in which he’ll demonstrate vibe-coding and talk about the display in our living room that generates art based on the conversations people are having, aka the Whisperframe.
We have a nice short-term prediction from Gary Marcus: AI agents that markedly accelerate AI research will not be here by the end of 2025. By “agents” we mean AIs that act in the world autonomously. But also “the world” can just mean the internet. Not just models that respond to prompts, is the point, wildly useful as that is, especially when the response includes working computer code. I’m a little bit on Gary Marcus’s side on this one. If we’re wrong, it will be a huge update and a huge credibility boost for the authors of AI 2027. We’ll know within 7 months!
Not sure this counts as news but I just landed in California for Less.Online and Manifest. I’m excited to chat with the authors of AI 2027 in particular, at least three of whom will be there. We’ll see if I end up convinced that their AGI timeline isn’t as overaggressive as it seems.
I love Gemini Pro for learning or specialized purposes (2.5 Pro as a Gem is very good for many of my purposes, and its speed is amazing), but o3 definitely feels the most "powerful." The main way I have seen it is in prepping for debate; o3 can search for evidence for a very long time and yields high-quality evidence for my exact claim in a way that no other tool (not even the deep research-es) was able to until now. This alone has probably saved me more time / made me more impressed with o3 than "__ model can do PhD-level physics"-type capabilities.
>I never use its actual words because it just sounds all wrong to me, aesthetically.
Does anyone have any tips on avoiding this insipid 'house style'? Even when I give it drastic custom instructions I find some of the same vibe always comes through.
Some of the ones I've tried are "write in the style of G.K. Chesterton", "reply as a snarky nihilist à la Wednesday Addams" and "respond as three separate personas named Wiles, Wit and Whim which have the following traits [...]". I enjoyed all of these but the mask slips too often.
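For what it's worth, the mask may slip less at the API level, where the persona lives in a dedicated system message rather than competing with the consumer UI's own instructions. Here's a minimal sketch, assuming the OpenAI Python SDK; the model name, persona text, and prompt are all placeholders, not recommendations:

```python
# Minimal sketch: pinning a persona via a system message with the OpenAI
# Python SDK. Model name and persona text are illustrative placeholders;
# the same idea carries over to Anthropic's and Google's APIs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PERSONA = (
    "You are a terse technical editor. Favor concrete nouns and short "
    "sentences. Never open with filler like 'Great question', and never "
    "announce what you are about to say before saying it."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; substitute whatever model you're testing
    messages=[
        {"role": "system", "content": PERSONA},
        {"role": "user", "content": "Explain what a system prompt does."},
    ],
)
print(response.choices[0].message.content)
```

No guarantee this cures the house style entirely, but keeping the persona in the system slot and regenerating with small wording tweaks is cheaper to iterate on than rewriting custom instructions in the UI.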