15 Comments
Zeb Taylor:

I love Gemini Pro for learning or specialized purposes (2.5 Pro as a Gem is very good for many of my purposes, and its speed is amazing), but o3 definitely feels the most "powerful." The main place I've seen this is in prepping for debate: o3 can search for evidence for a very long time and yields high-quality evidence for my exact claim in a way that no other tool (not even the various Deep Research modes) could until now. This alone has probably saved me more time, and impressed me more, than "__ model can do PhD-level physics"-type capabilities.

d20diceman:

>I never use its actual words because it just sounds all wrong to me, aesthetically.

Does anyone have any tips on avoiding this insipid 'house style'? Even when I give it drastic custom instructions I find some of the same vibe always comes through.

Some of the ones I've tried are "write in the style of G.K. Chesterton", "reply as a snarky nihilist à la Wednesday Addams", and "respond as three separate personas named Wiles, Wit and Whim which have the following traits [...]". I enjoyed all of these, but the mask slips too often.

Daniel Reeves:

Yeah, agreed. Either the mask slips or it tries too hard and it's all wrong anyway.

d20diceman:

Surprising to see the "Does o3 make egregious errors" market only at 92% when several straightforward examples of it being stumped by simple questions have already been posted in the comments.

Not sure if people were avoiding betting it any higher in case o3-with-Dan's-context turns out to do much better. I bet it up to 99%.

Daniel Reeves:

Risky! It's looking likely you're right but I wouldn't go that high just yet.

Emerald Fleur:

How is Gemini so high on the vibemarks when it is so consistently crap? (Oh, lol, I just saw the first sentence of the other comment 🤣)

I switched because I was a cheapskate. I almost never thumbs-downed 4o responses, but I find myself () off and annoyed at how often I thumbs-down Gemini responses on Pro or Personalized (maybe half the time nowadays, subjectively?). Personalized makes me even more furious: it only seems to pull from your Google history and not your past conversations, or at least it feels that way. If you want conversation history (poorly) incorporated, you need to actively prompt Gemini Flash.

Gemini Live's lack of end-to-end voice until recently made it godawful compared to 4o.

I don't care if it wins the vibemarks; it's such an awful product, and given how much I need an LLM in my life every day for non-creative tasks, I'm going to switch from Gemini Pro to ChatGPT Plus and eat the price difference with happy aplomb. 😤

SorenJ:

Have you really been trying out Gemini Pro for math? I've found that it wipes the floor with o3, o4-mini-high, and Claude 4 Opus/Sonnet on math and physics. And I use all of them daily. Do you have some examples?

o3 almost always feels like a used car salesman.

Daniel Reeves:

Man, it's wild how different people's experiences with these things seem to be. Let's definitely compare notes. Here's one where I think o4-mini-high was smarter than Gemini:

Show that the integral of f + f^-1 dx over 0 to 1 for bijective f mapping (0,1) to (0,1) is 1.

o4-mini-high and Gemini were actually both a little confused by the fact that the claim, as stated, isn't technically true. But when I showed each a counterexample, o4-mini-high effectively smacked its forehead and fixed it (it turns out we need to specify that f is increasing), whereas Gemini didn't understand what had happened at all.

SorenJ:

Here is Gemini's response (one-shot). I asked it to give me the answer as LaTeX so that I could copy and paste:

\documentclass{article}

\usepackage{amsmath}

\usepackage{amsfonts}

\usepackage{amssymb}

\begin{document}

To show that the integral of $f(x) + f^{-1}(x)$ over the interval $[0,1]$ is $1$ for a bijective function $f: (0,1) \to (0,1)$, we first establish the conditions on $f$ implied by the problem statement. For $f$ and $f^{-1}$ to be integrable over $[0,1]$, we typically assume $f$ is continuous and strictly monotonic. Given that $f$ maps $(0,1)$ to $(0,1)$ bijectively, its continuous extension to the closed interval $[0,1]$ must map the endpoints $\{0,1\}$ to $\{0,1\}$. There are two cases for such a function:

\begin{enumerate}

\item $f(0)=0$ and $f(1)=1$ ($f$ is strictly increasing).

\item $f(0)=1$ and $f(1)=0$ ($f$ is strictly decreasing).

\end{enumerate}

The identity $\int_0^1 (f(x) + f^{-1}(x))dx = 1$ holds specifically for the first case, or for a specific subset of the second case. Standard versions of this theorem assume $f(0)=0$ and $f(1)=1$. We will proceed with this assumption, which is common for such problems.

\section*{Proof}

Let the integral be $I$. We have:

$$I = \int_0^1 (f(x) + f^{-1}(x))dx = \int_0^1 f(x)dx + \int_0^1 f^{-1}(x)dx.$$

Let's evaluate the second part of the sum, $J = \int_0^1 f^{-1}(x)dx$.

We use the substitution $x = f(y)$. This implies $y = f^{-1}(x)$, and $dx = f'(y)dy$.

The limits of integration for $x$ are $0$ and $1$. We need to find the corresponding limits for $y$:

When $x=0$, $y = f^{-1}(0)$.

When $x=1$, $y = f^{-1}(1)$.

Under the assumption that $f$ is an increasing bijection such that $f(0)=0$ and $f(1)=1$, it follows that its inverse $f^{-1}$ also satisfies $f^{-1}(0)=0$ and $f^{-1}(1)=1$.

So, the limits for $y$ are also $0$ and $1$.

Substituting into $J$:

$$J = \int_0^1 y f'(y)dy.$$

We can evaluate this integral using integration by parts, where $\int u dv = uv - \int v du$.

Let $u = y$ and $dv = f'(y)dy$.

Then $du = dy$ and $v = \int f'(y)dy = f(y)$.

Applying integration by parts:

$$J = [y f(y)]_0^1 - \int_0^1 f(y)dy.$$

Now, substitute the limits:

$$J = (1 \cdot f(1) - 0 \cdot f(0)) - \int_0^1 f(y)dy.$$

Using our assumption $f(0)=0$ and $f(1)=1$:

$$J = (1 \cdot 1 - 0 \cdot 0) - \int_0^1 f(y)dy = 1 - \int_0^1 f(y)dy.$$

Now, substitute this expression for $J$ back into the original integral $I$:

$$I = \int_0^1 f(x)dx + \left(1 - \int_0^1 f(y)dy\right).$$

Since the variable of integration is a dummy variable, $\int_0^1 f(x)dx = \int_0^1 f(y)dy$.

Therefore:

$$I = \int_0^1 f(x)dx + 1 - \int_0^1 f(x)dx = 1.$$

Thus, for a bijective function $f: (0,1) \to (0,1)$ that is continuously extendable to $[0,1]$ with $f(0)=0$ and $f(1)=1$, we have shown that:

$$\int_0^1 (f(x) + f^{-1}(x))dx = 1.$$

\subsection*{Note on the decreasing case}

If $f$ were a decreasing bijection such that $f(0)=1$ and $f(1)=0$, then $f^{-1}(0)=1$ and $f^{-1}(1)=0$.

The integral $J = \int_{f^{-1}(0)}^{f^{-1}(1)} y f'(y)dy$ would become $J = \int_1^0 y f'(y)dy$.

Then $J = [y f(y)]_1^0 - \int_1^0 f(y)dy = (0 \cdot f(0) - 1 \cdot f(1)) - \left(-\int_0^1 f(y)dy\right)$.

$J = (0 \cdot 1 - 1 \cdot 0) + \int_0^1 f(y)dy = \int_0^1 f(y)dy$.

In this scenario, the sum would be $I = \int_0^1 f(x)dx + \int_0^1 f(y)dy = 2\int_0^1 f(x)dx$.

This equals $1$ only if $\int_0^1 f(x)dx = \frac{1}{2}$. This is true for $f(x)=1-x$, but not for all decreasing bijections (e.g., $f(x)=1-x^2$, for which $\int_0^1 (1-x^2)dx = \frac{2}{3}$, leading to $I=\frac{4}{3}$).

Therefore, the problem statement typically implies the condition $f(0)=0$ and $f(1)=1$.

\end{document}

Daniel Reeves:

Replicated! And I checked my history, and it appears I was half wrong about Gemini. The first time I tried it, it was basically as good as o3: wrong about the case of decreasing f, but it understood the counterexample and amended its answer correctly.

SorenJ:

Gemini has been dominating the benchmarks and the Pareto frontier for a few months now. Of course, it's important to remember that while the benchmarks mean something, they aren't a perfect measure. Look at what happened with Llama's recent lmarena-sloptimized model!

So the next best thing is to look at user reports (and, better yet, "high-taste" users). And the consensus here is pretty much that Gemini is, on average, the best model. That's been true since 2.5 Pro Experimental was released in March. Unfortunately, it looks like the most recent 2.5 Pro Preview, released this May, while slightly better on *some* benchmarks, was actually a distillation/quantization. But it's still good.

Claude has always been known to perform above what the benchmarks would suggest, and the same holds true for the Claude 4s. They really seem like the best at coding right now, and they have the best "vibes".

SorenJ:

You'll have to take my word for it that this was the one-shot reply: I have the student account, and I just learned you can't export conversations with it. I can give screenshots if you want, though.

Daniel Reeves:

Well, it's looking like I was wrong! The following question seems to befuddle o3:

Agatha makes a stack of 5 cold, fresh single-slice ham sandwiches (with no sauces or condiments) in Room A, then immediately uses duct tape to stick the top surface of the uppermost sandwich to the bottom of her walking stick. She then walks to Room B, with her walking stick, so how many whole sandwiches are there now, in each room?

If some of you humans could chime in with what you think is the correct answer to that, that would be a good sanity check.

Emerald Fleur:

…4 and 0? I love this genre of answers that seem too easy… I’m doubting myself 🤣

Daniel Reeves:

Ha, yes, thank you for confirming! I think we humans are correct and the AI is wrong. But the fact that we're doubting ourselves is telling. And for what it's worth, as soon as you nudge o3 at all (e.g., "are we clear that the duct tape is only sticking to the top slice of bread?"), it smacks its forehead and demonstrates full understanding of the situation after all.

So I'm thinking I was technically wrong but only barely wrong?

There are now 97 comments on my Manifold market -- https://manifold.markets/dreev/does-chatgpt-o3-make-egregious-erro -- fiercely debating this. 🤯
