15 Comments
Zeb Taylor:

I love Gemini Pro for learning or specialized purposes (2.5 Pro as a Gem is very good for many of my purposes, and its speed is amazing), but o3 definitely feels the most "powerful." The main place I've seen this is in prepping for debate: o3 can search for evidence for a very long time and yields high-quality evidence for my exact claim in a way that no other tool (not even the various Deep Research modes) could until now. This alone has probably saved me more time, and impressed me more, than "__ model can do PhD-level physics"-type capabilities.

d20diceman:

>I never use its actual words because it just sounds all wrong to me, aesthetically.

Does anyone have any tips on avoiding this insipid 'house style'? Even when I give it drastic custom instructions I find some of the same vibe always comes through.

Some of the ones I've tried are "write in the style of G.K. Chesterton", "reply as a snarky nihilist à la Wednesday Addams", and "respond as three separate personas named Wiles, Wit and Whim which have the following traits [...]". I enjoyed all of these, but the mask slips too often.

Daniel Reeves:

Yeah, agreed. Either the mask slips or it tries too hard and it's all wrong anyway.

d20diceman:

Surprising to see the "Does o3 make egregious errors" market only at 92% when several straightforward examples of it being stumped by simple questions have already been posted in the comments.

Not sure if people were avoiding betting it any higher in case o3-with-Dan's-context turns out to do much better. I bet it up to 99%.

Daniel Reeves:

Risky! It's looking likely you're right but I wouldn't go that high just yet.

Emerald Fleur:

How is Gemini so high on the vibemarks when it is so consistently crap? (Oh, lol, I just saw the first sentence of the other comment 🤣)

I switched because I was a cheapskate. I almost never thumbs-downed 4o responses, but I find myself () off and annoyed at how often I thumbs-down Gemini responses on Pro or Personalized (maybe half the time nowadays, subjectively?). Personalized makes me even more furious: it only seems to pull from your Google history and not your past conversations, or at least it feels that way. If you want conversation history (poorly) incorporated, you need to actively prompt Gemini Flash.

Gemini Live's lack of end-to-end voice until recently made it godawful compared to 4o.

I don't care if it wins the vibemarks; it's such an awful product, and given how much I need an LLM in my life every day for non-creative tasks, I'm going to switch from Gemini Pro to ChatGPT Plus and eat the price difference with happy aplomb. 😤

SorenJ:

Have you really been trying out Gemini Pro for math? I've found that it wipes the floor with o3, o4-mini-high, and Claude 4 Opus/Sonnet on math and physics. And I use all of them daily. Do you have some examples?

o3 almost always feels like a used car salesman.

Daniel Reeves:

Man, it's wild how different people's experiences with these things seem to be. Let's definitely compare notes. Here's one where I think o4-mini-high was smarter than Gemini:

Show that the integral of f + f^-1 dx over 0 to 1 for bijective f mapping (0,1) to (0,1) is 1.

o4-mini-high and Gemini were actually both a little confused by the fact that the claim, as stated, isn't technically true. But when I showed each a counterexample, o4-mini-high effectively smacked its forehead and fixed it (it turns out we need to specify that f is increasing), whereas Gemini didn't understand what had happened at all.

SorenJ:

Here is Gemini's response (one-shot). I asked it to give me the answer as LaTeX so that I could copy and paste:

\documentclass{article}

\usepackage{amsmath}

\usepackage{amsfonts}

\usepackage{amssymb}

\begin{document}

To show that the integral of $f(x) + f^{-1}(x)$ over the interval $[0,1]$ is $1$ for a bijective function $f: (0,1) \to (0,1)$, we first establish the conditions on $f$ implied by the problem statement. For $f$ and $f^{-1}$ to be integrable over $[0,1]$, we typically assume $f$ is continuous and strictly monotonic. Given that $f$ maps $(0,1)$ to $(0,1)$ bijectively, its continuous extension to the closed interval $[0,1]$ must map the endpoints $\{0,1\}$ to $\{0,1\}$. There are two cases for such a function:

\begin{enumerate}

\item $f(0)=0$ and $f(1)=1$ ($f$ is strictly increasing).

\item $f(0)=1$ and $f(1)=0$ ($f$ is strictly decreasing).

\end{enumerate}

The identity $\int_0^1 (f(x) + f^{-1}(x))dx = 1$ holds specifically for the first case, or for a specific subset of the second case. Standard versions of this theorem assume $f(0)=0$ and $f(1)=1$. We will proceed with this assumption, which is common for such problems.

\section*{Proof}

Let the integral be $I$. We have:

$$I = \int_0^1 (f(x) + f^{-1}(x))dx = \int_0^1 f(x)dx + \int_0^1 f^{-1}(x)dx.$$

Let's evaluate the second part of the sum, $J = \int_0^1 f^{-1}(x)dx$.

We use the substitution $x = f(y)$. This implies $y = f^{-1}(x)$, and $dx = f'(y)dy$.

The limits of integration for $x$ are $0$ and $1$. We need to find the corresponding limits for $y$:

When $x=0$, $y = f^{-1}(0)$.

When $x=1$, $y = f^{-1}(1)$.

Under the assumption that $f$ is an increasing bijection such that $f(0)=0$ and $f(1)=1$, it follows that its inverse $f^{-1}$ also satisfies $f^{-1}(0)=0$ and $f^{-1}(1)=1$.

So, the limits for $y$ are also $0$ and $1$.

Substituting into $J$:

$$J = \int_0^1 y f'(y)dy.$$

We can evaluate this integral using integration by parts, where $\int u dv = uv - \int v du$.

Let $u = y$ and $dv = f'(y)dy$.

Then $du = dy$ and $v = \int f'(y)dy = f(y)$.

Applying integration by parts:

$$J = [y f(y)]_0^1 - \int_0^1 f(y)dy.$$

Now, substitute the limits:

$$J = (1 \cdot f(1) - 0 \cdot f(0)) - \int_0^1 f(y)dy.$$

Using our assumption $f(0)=0$ and $f(1)=1$:

$$J = (1 \cdot 1 - 0 \cdot 0) - \int_0^1 f(y)dy = 1 - \int_0^1 f(y)dy.$$

Now, substitute this expression for $J$ back into the original integral $I$:

$$I = \int_0^1 f(x)dx + \left(1 - \int_0^1 f(y)dy\right).$$

Since the variable of integration is a dummy variable, $\int_0^1 f(x)dx = \int_0^1 f(y)dy$.

Therefore:

$$I = \int_0^1 f(x)dx + 1 - \int_0^1 f(x)dx = 1.$$

Thus, for a bijective function $f: (0,1) \to (0,1)$ that is continuously extendable to $[0,1]$ with $f(0)=0$ and $f(1)=1$, we have shown that:

$$\int_0^1 (f(x) + f^{-1}(x))dx = 1.$$

\subsection*{Note on the decreasing case}

If $f$ were a decreasing bijection such that $f(0)=1$ and $f(1)=0$, then $f^{-1}(0)=1$ and $f^{-1}(1)=0$.

The integral $J = \int_{f^{-1}(0)}^{f^{-1}(1)} y f'(y)dy$ would become $J = \int_1^0 y f'(y)dy$.

Then $J = [y f(y)]_1^0 - \int_1^0 f(y)dy = (0 \cdot f(0) - 1 \cdot f(1)) - \left(-\int_0^1 f(y)dy\right)$.

$J = (0 \cdot 1 - 1 \cdot 0) + \int_0^1 f(y)dy = \int_0^1 f(y)dy$.

In this scenario, the sum would be $I = \int_0^1 f(x)dx + \int_0^1 f(y)dy = 2\int_0^1 f(x)dx$.

This equals $1$ only if $\int_0^1 f(x)dx = \frac{1}{2}$. This is true for $f(x)=1-x$, but not for all decreasing bijections (e.g., $f(x)=1-x^2$, for which $\int_0^1 (1-x^2)dx = \frac{2}{3}$, leading to $I=\frac{4}{3}$).

Therefore, the problem statement typically implies the condition $f(0)=0$ and $f(1)=1$.

\end{document}

Daniel Reeves:

Replicated! And I checked my history, and it appears I was half wrong about Gemini. The first time I tried it, it was basically as good as o3: wrong about the case of decreasing f, but it understood the counterexample and amended its answer correctly.

SorenJ:

Gemini has been dominating the benchmarks and the Pareto frontier for a few months now. Of course, it's important to remember that while the benchmarks mean something, they aren't a perfect measure. Look at what happened with Llama's recent lmarena-sloptimized model!

So the next best thing is to look at user reports (and, better yet, "high-taste" users). And the consensus here is pretty much that Gemini is, on average, the best model. That's been true since 2.5 Pro Experimental was released in March. Unfortunately, it looks like the most recent 2.5 Pro Preview, released this May, while slightly better on *some* benchmarks, was actually a distillation/quantization. But it's still good.

Claude has always been known to perform above what the benchmarks would suggest, and the same holds true for the Claude 4s. They really seem like the best at coding right now, and they have the best "vibes".

SorenJ:

You'll have to take my word for it that this was the one-shot reply: I have the student account, and I just learned you can't export conversations with it. I can give screenshots if you want, though.

Daniel Reeves:

Well, it's looking like I was wrong! The following question seems to befuddle o3:

Agatha makes a stack of 5 cold, fresh single-slice ham sandwiches (with no sauces or condiments) in Room A, then immediately uses duct tape to stick the top surface of the uppermost sandwich to the bottom of her walking stick. She then walks to Room B, with her walking stick, so how many whole sandwiches are there now, in each room?

If some of you humans could chime in with what you think is the correct answer to that, that would be a good sanity check.

Emerald Fleur:

…4 and 0? I love this genre of answers that seem too easy… I’m doubting myself 🤣

Daniel Reeves:

Ha, yes, thank you for confirming! I think we humans are correct and the AI is wrong. But the fact that we're doubting ourselves is telling. And for what it's worth, as soon as you nudge o3 at all (e.g., "are we clear that the duct tape is only sticking to the top slice of bread?"), it smacks its forehead and demonstrates full understanding of the situation after all.

So I'm thinking I was technically wrong but only barely wrong?

There are now 97 comments on my Manifold market -- https://manifold.markets/dreev/does-chatgpt-o3-make-egregious-erro -- fiercely debating this. 🤯
