20 Comments
Emerald Fleur

Random thought for fellow AGI Friday readers: Those cheaply generated AI summaries at the top of Google Search results can't be good for the reputation of AI as a whole, no?

I find that Gemini 2.5 Pro and all the Gemini thinking or chain-of-thought models so far very rarely hallucinate, while Gemini Flash and similar Gemini models nearly always hallucinate, to the point where I never trust Flash with any factual lookup, even though both models have access to Google Search.

I read in a Wired article about the history of Gemini that, while internal factions complained, surveys of end users found that the AI-generated summaries were overwhelmingly preferred.

"The senior director involved ordered up some testing, and ultimately user feedback won: 90 percent of people who weighed in gave the summaries a “thumbs up.”

https://www.wired.com/story/google-openai-gemini-chatgpt-artificial-intelligence/

However, as a person who uses AI daily (never for creative work), I DO NOT TRUST the AI summaries at the top of the search results, even on the most common queries, which must be run millions of times a day. They have been so consistently wrong that I now permanently distrust them and mentally skip over them every time I see them. If I want actual results, I tap the AI Mode button.

I can imagine the internal reasoning going something like this, Socratic-dialogue style:

A) Using the Flash model is cheaper, since we serve a huge number of unique Google searches, and it lets us customize the summary for every single user.

B) Fair, but shouldn't we use a more computationally expensive model?

A) No, because users are clearly already content with our worse model.

B) The one that generates really inaccurate results? We have a model that doesn't hallucinate right here. Why don't we use it?

A) I can show you a cost-benefit analysis right here showing that the improved results barely matter to the end user, for reasons x, y, and z.

Here's me chiming in on the hypothetical:

C) Why don't we generate a computationally expensive summary for the 1 million most common queries, or whatever threshold covers the most users possible?

A) (My guess at what they'd say.) We already do that with the Flash model, but watching the summary generate makes users feel more engaged with the AI than instantly displaying a cached result would. Additionally, using the Flash model lets us generate individualized results that better reflect the flow of information sources presented to each user, and lets us react to real-time events, like the Pope suddenly passing away.

C) Why don't we cache one expensive result per hour for the most common search queries, along the lines of the sketch after this exchange? Users aren't expecting personalization there (although that'd be really cool), since they haven't opted into AI personalization when they use Google Search (yet), and personalizing might prompt a bigger backlash than we already have. We'd treat it like the data we extract from Wikipedia articles.

A) Cost, benefit, analysis.

C) This small, expensive change could do a lot to convert users to Gemini. Who in their right mind would pay us for Gemini if Flash keeps hallucinating on every second result?

A) You're exaggerating.

C) I am frustrated! This is our chance to get AI in front of as many people as possible, and we could convert a lot of skeptics into believers by providing good-quality results. That's what they come to Google for: good-quality results. And frankly, even if the summaries are good enough for now, long-term growth might be hampered if people start mentally skipping over them, even if your cost-benefit analysis and statistics show this is the wisest short-term decision!
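To make option (C) concrete, here's a minimal sketch of the kind of head-query caching I'm imagining. The model calls and the top-query list are made-up placeholders, not anything Google actually does:

```python
import time

# Hypothetical sketch: serve head queries from an hourly cache filled by an
# expensive model; fall back to the cheap model for the long tail.
CACHE_TTL_SECONDS = 3600      # refresh cached summaries once an hour
_cache = {}                   # query -> (timestamp, summary)

def summary_for(query, top_queries, expensive_summary, cheap_summary):
    """top_queries: the N most common queries; the two summary args are
    stand-in callables for a strong model and a cheap model."""
    now = time.time()
    if query in top_queries:
        hit = _cache.get(query)
        if hit and now - hit[0] < CACHE_TTL_SECONDS:
            return hit[1]                      # fresh cached answer
        result = expensive_summary(query)      # pay for quality on head queries
        _cache[query] = (now, result)
        return result
    return cheap_summary(query)                # long tail stays on the cheap model
```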

I have very strong feelings about AI summaries. 😅

Markos

"the thing they’re testing is… just the smartphone app"

They are testing both the FSD part and any remote operations (monitoring the car, even remotely stopping/starting it).

"When you summon a car, it’s just a normal Tesla with a human driver using normal supervised FSD."

That's both a safety feature and a legal one. They want to avoid reporting any issues to the NHTSA while testing. You can argue that it's bending the rules or whatever, but it says nothing about the technical feasibility of them starting an actual service in June.

As for what will happen in June: in the Q1 2025 earnings call they spoke of "10-20 cars", and Musk specifically mentioned "June or July". So it will be a tiny start, and they can still claim "victory".

Daniel Reeves

Thanks, these are all solid points. Where does this leave you on the probability of driverless Teslas this summer?

Markos

It’s summer in Texas, so weather conditions will be favourable and we’re discussing a fleet of 10 cars. You can already see videos of people in California and Texas using FSD and it does fine. Of course they will have remote monitoring to avoid accidents. So I would say 90% chance of Tesla claiming victory and 150% chance of critics arguing this is pointless because it is a single city and just 10 cars. 😆

Daniel Reeves

I don't think remote monitoring can reliably avoid accidents. Network lag would be akin to drunk driving.

But I agree that Tesla is likely to do something they can call a launch and that this will be hotly debated. I'm trying to pin down the question in the Manifold market highlighted in the post:

https://manifold.markets/dreev/will-tesla-count-as-a-waymo-competi?play=true

It's sitting at a 21% chance at the moment.

Markos

Well, here's Vay (https://vay.io/), a car-rental company now operating in Las Vegas, which remotely delivers the car to you. You do your own driving, but you can leave it anywhere you want, and then they remotely drive it back.

And Baidu has been operating Apollo Go, a fully autonomous taxi service (but with LiDAR, of course), for some time now. You can see their remote consoles here https://evcentral.com.au/chinas-google-getting-into-electric-cars/

Daniel Reeves

Ah, thank you! I hadn't realized there was this much tele-operation happening, or that it was legal. I looked into Vay and it's less than it seems:

1. They don't tele-operate with passengers in the car; they just use it to get rental cars to customers, who then drive the cars normally.

2. They limit the speed during tele-operation to 26 mph (42 km/h).

Like I said, I think the network delay is comparable to drunk driving, so presumably they've calculated that up to that speed it's not too much of a risk, for Las Vegas at least.
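For a rough back-of-envelope check (the latency figure here is my own assumption, not a measurement from Vay or anyone else):

```python
MPH_TO_MPS = 0.44704  # metres per second per mile per hour

def blind_distance_m(speed_mph, round_trip_latency_s):
    """How far the car travels before a remote operator's input can take effect."""
    return speed_mph * MPH_TO_MPS * round_trip_latency_s

print(round(blind_distance_m(26, 0.25), 1))  # ~2.9 m at the 26 mph cap, 250 ms lag
print(round(blind_distance_m(70, 0.25), 1))  # ~7.8 m at highway speed, same lag
```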

I'm still thinking about what this means for what Tesla might do in Austin. Thanks again for pointing me to that.

Markos

Here's how they could have Starlink on each car for additional connectivity. Starlink is now available on aeroplanes, so why not on cars?

https://chatgpt.com/s/dr_682c20762f70819190ea399fc1ec8364

Markos

I had ChatGPT search for this. It says it's doable, but there may be issues at higher speeds, as you mentioned. It also depends on the specific area (Austin) and how good the network is there:

https://chatgpt.com/s/dr_682bad91be648191beee7e70e3419303

Markos

By the way, from the link on Manifold, "our Event Response agents are able to remotely move the Waymo AV under strict parameters" is basically tele-operation (on rare occasions, which are not fully reported), which you said would be a "huge scandal" if Tesla did it ;)

Daniel Reeves

The line is not perfectly bright but it's bright enough. They can remotely get a car onto the shoulder at walking speed if it's stuck and safety requires it. The scandal would be remote supervision at driving speed.

Daniel Reeves

And here are my rough thoughts on the Nature article, which I may also want to turn into a future AGI Friday:

The idea of model collapse is a big deal, and may predict that we peter out below AGI. In domains where the AI can generate its own synthetic data, like playing chess or Go, it bootstraps itself to unfathomably superhuman capability. Intuitively it seems that that wouldn't work with text; it's like eating its own tail. But it's not obvious. Maybe by doing chain-of-thought or other tricks, it can generate text that's good enough to train on. If so, you can improve the base model, then use the same tricks to generate even better text. Then retrain the base model on that, rinse and repeat. If that works, that's recursive self-improvement that may have no upper bound, or at least no subhuman one.

The Nature article seems to say that's impossible, but consider the counterpoint -- https://arxiv.org/abs/2406.07515 -- that verifying high-quality data is easier than producing it, so if you have an LLM smart enough to separate the wheat from the chaff, you can avoid model collapse (and, I guess, potentially bootstrap to superintelligence).

(Or consider the counterpoint -- https://openreview.net/forum?id=5B2K4LRgmz -- that model collapse only happens if you replace the original data with synthetic data. If you just keep appending synthetic data then the AI's performance plateaus rather than degrades. Does it plateau below human level? Who can tell!)
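To make the loop concrete, here's a toy sketch. The generate/verify/train callables are placeholders for real components (this is not anyone's published recipe), and the replace_data flag marks the replace-vs-append distinction from that second counterpoint:

```python
def self_improvement_loop(model, real_corpus, generate, verify, train,
                          rounds=3, replace_data=False):
    """Generate synthetic text, keep only what a verifier accepts, retrain, repeat.
    generate(model, n) -> list of texts; verify(model, text) -> bool;
    train(model, corpus) -> new model. All three are assumed stand-ins."""
    corpus = list(real_corpus)
    for _ in range(rounds):
        candidates = generate(model, 1000)                   # e.g. chain-of-thought samples
        kept = [c for c in candidates if verify(model, c)]   # separate wheat from chaff
        # Replacing the original data is the collapse-prone regime; appending
        # tends to plateau instead (per the OpenReview counterpoint above).
        corpus = kept if replace_data else corpus + kept
        model = train(model, corpus)                         # retrain on the updated corpus
    return model
```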

One more article suggesting recursive self-improvement is possible: https://ar5iv.labs.arxiv.org/html/2502.13441

I should also mention how much an LLM (o3, specifically) is helping me with this lit review!

PS: If the point is that an LLM can't bootstrap itself without an external reward signal, like with a well-defined game, then what happens if you turn the real world into a well-defined game? Give an agent internet access and a bank account and tell it to make the balance go up, in whatever ways it can come up with to do that. Maybe instrumental convergence blah blah blah we all die, is what happens.

PPS: Thanks so much for asking about this stuff!

Daniel Reeves

Definitely! In fact, I'm working on turning my reaction to that "fake company staffed by AI agents" article into today's AGI Friday. I may do the same for a future AGI Friday about that Nature article and recursive self-improvement.

My general thought is that articles like these are doing valuable hype-deflation work. This actually gets at the core of what I'm hoping to convey with AGI Friday. The hypesters and the pooh-poohers are both deeply wrong. Depending on how the future plays out, one or the other group will be able to pretend they knew it all along. But it's kind of a coin toss. I mean, not literally, and it depends on the timeframe. If we take 2030 as the cutoff then I personally think the pooh-poohers have the edge. But one of the articles you linked to says "the machines aren't coming for your job anytime soon" and that sure sounds like it means more than a 5-year horizon. So I want to vehemently disagree with that. Your job is very safe this year and *probably* safe this decade. Beyond that, literally (and I mean pretty literally, literally) anything is possible.

Clive F

Whenever people ask what the line in the sand is for testing if AGI is smarter than humans, I think of that line from one of Robert Heinlein's books (one of the Lazarus Long ones, iirc):

“A human being should be able to change a diaper, plan an invasion, butcher a hog, conn a ship, design a building, write a sonnet, balance accounts, build a wall, set a bone, comfort the dying, take orders, give orders, cooperate, act alone, solve equations, analyze a new problem, pitch manure, program a computer, cook a tasty meal, fight efficiently, die gallantly. Specialization is for insects.”

How many can we tick off for the current LLMs? What's going to fall next? (Apparently robotic bricklaying is way trickier than people think, for example; we've been trying and failing since the Industrial Revolution to produce general-purpose bricklaying machines.)

A number are stuck behind Moravec's paradox, I guess (which gives us my favourite test of human-level AGI, from Wozniak: "enter a random house, and make a cup of coffee, finding all the things you need in there").

Daniel Reeves

Amen. But what do you think of making the physical/nonphysical distinction? Maybe it's mostly robotics that's lagging? I notice that all the things on Heinlein's list that don't require a physical body are done or seem close. The definition of AGI I like best these days is based on the idea of an artificial drop-in remote worker.

SorenJ

If you had taken the situation in 2016 and looked at the log-scale progress curve for self-driving cars, would you have predicted, based on that curve, that self-driving cars were 2 years away?

If so, that might suggest we are in a similar situation today with regard to AGI, or at least superhuman software-development AI. Right now the METR log-scale trends suggest we are maybe 2 years away from AI that can fully do the work of a software engineer. But if our situation is more like the 2016 one, then we are probably about 10 years away, given that the last 20% gets harder and harder to capture. (I was just trying to code with Gemini 2.5 today and the experience was... frustrating, to say the least.)

Daniel Reeves

Great question. I do think human-level driving *could've* been here a lot sooner. Waymo has had it for years and is now pretty clearly superhuman. Of course AGI could be similar in these regards as well. It's just that there are plenty of reasons to think AGI will be very different from other technologies.

In any case, intuitively I agree that 10 years seems a lot more reasonable than 2 years. Here's Ege Erdil, who I previously characterized as bullish on AGI timelines, making the case yesterday that we're decades away:

https://epochai.substack.com/p/the-case-for-multi-decade-ai-timelines

SorenJ

For your geometric reasoning benchmark, on problem 0 (original) I initially get a minimum of 1 or 2. I found the phrasing confusing: "What is the minimum and maximum number of line intersections in that drawing?" I assumed this meant, "What is the minimum and maximum number of line intersections FROM THE LINE THAT WE DREW IN STEP 4?"

Anyway, I now see that you meant all possible line intersections. But I still think there is another ambiguity. The square is "fully inside", but given that lines have zero thickness, you can make the edge of the square intersect the circle. Then the exiting of the square and the entering of the circle really only count as one intersection of 3 different lines. So I get a minimum of 3 or 4.

Daniel Reeves

You're totally right about these ambiguities, and LLMs sometimes have the acuity to clarify them before answering. Which you could treat as part of the test, I guess. I only counted the AI as wrong if it didn't have a coherent model of the shapes and lines being asked about. Also I think I managed to get rid of the ambiguities for all the subsequent problems, making it much easier to grade them. Except for the one where the AI was smarter than me.
