Musk vs McGurk
Self-driving cars and sensor fusion
We left off last Friday with the news item that Waymos can take customers on highways now. (Jump to the Random Roundup below for some of the crazy amount of new news this week.) Tesla, even in Austin where the robotaxis generally have an empty driver’s seat, puts a human back in the driver’s seat for any rides involving highways. So Musk certainly deserves some flak for having previously claimed that Waymo’s elaborately multimodal sensors were the reason Waymos weren’t (at the time) driving on highways.
I was pretty opaque in my snark about that and then spelled it out in a followup comment, which I’m now turning into a proper AGI Friday of its own. If you don’t know the McGurk effect then you’re in for a treat.
Sensor fusion refers to combining the different inputs you get from different sensors — like eyes and ears for a human, or cameras/lidar/radar/GPS for a robocar — and turning it all into a coherent world model. There’s a theoretical sense in which Musk’s claim, that adding a sensor like lidar makes a car more dangerous, can’t be true. At worst the car can ignore the input from the lidar, if it can’t figure out how to reconcile what the lidar vs the cameras are telling it. But of course it can and does reconcile them. Just like, presumably, Tesla’s FSD reconciles the inputs from its many different cameras.
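To make the "at worst, ignore it" argument concrete, here's a toy sketch in Python. Everything in it is invented for illustration (no actual robocar stack works like this):

```python
def fuse_distance(camera_est, lidar_est, max_disagreement=2.0):
    """Toy fusion of two distance estimates (meters).

    If the sensors roughly agree, average them. If they wildly disagree
    and we have no way to reconcile them, fall back to the camera alone.
    Adding the lidar can thus never leave us worse off than cameras-only.
    """
    if abs(camera_est - lidar_est) <= max_disagreement:
        return (camera_est + lidar_est) / 2  # reconcile: simple average
    return camera_est  # worst case: ignore the lidar entirely

print(fuse_distance(10.0, 10.4))  # sensors agree -> 10.2
print(fuse_distance(10.0, 55.0))  # conflict -> 10.0, lidar ignored
```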
Consider how the human brain handles conflicting sensors, as in the mind-melting McGurk effect. The setup is that you take one video of someone making a “fa” sound and another of someone making a “ba” sound. Now splice the sound from one onto the video of the other. You’re seeing “ba” but hearing “fa”, or vice versa. What will your brain do?
If you haven’t seen the McGurk effect before, pause here and try to predict what will happen. I’ve been showing the McGurk effect to various friends and family and no one guesses the answer.
Spoilers start now.
The closest to correct was a friend with a machine learning PhD who figured the brain merges the signals but predicted that it would be like an optical illusion where you’d see the person’s lips doing something they weren’t really doing. Ok, now try it yourself:
It turns out that your brain puts a lot of weight on its camera inputs aka your eyes. If your eyes see one thing while your ears are telling you something incompatible with that, your subconscious brain just overrides that auditory signal and feeds your conscious brain different sounds — sounds that are compatible with the visual signal.
Could the McGurk effect be the key to steelmanning Musk’s claim that multiple conflicting sensors can reduce safety?¹ After all, in the demonstration, your lying eyes (or the lying video, technically) cause you to hear a blatantly incorrect sound!
I think the answer is no. Your brain is, in a meaningful sense, optimally fusing different inputs. Say you’re chatting with someone in a loud room. Your brain combines what you hear with what you see the person’s lips doing. You don’t have to be able to read lips for the visual signal to help you disambiguate what you’re hearing. Multiple inputs can give a big accuracy boost.
The McGurk effect is basically an adversarial exploit of that system. It works by contriving a scenario, one that’s pretty much impossible in the real world, in which your brain has to make a choice about what you’re hearing. The trick, again, is to superimpose the audio for sound A onto the video for sound B. The prior probability on that is near zero. Your subconscious brain concludes, implicitly, that either your eyes or your ears are just wrong. Which to believe? Your eyes are giving you a high-fidelity, less ambiguous signal. The sound is more likely to be the distorted one. So that’s what your conscious brain thinks it hears: the “corrected” sound.
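Here's that dilemma as a toy Bayesian calculation, with reliability numbers I made up, just to show how a high-fidelity visual signal overrides a noisier auditory one:

```python
# Toy version of the brain's choice (all numbers invented). Hypotheses:
# the speaker actually said "ba" or actually said "fa". The prior on a
# spliced video is ~0, so one sensor must simply be wrong.
RELIABILITY = {"eyes": 0.95, "ears": 0.70}  # P(sensor reports the truth)

def posterior(seen, heard):
    scores = {}
    for h in ("ba", "fa"):
        p_see = RELIABILITY["eyes"] if seen == h else 1 - RELIABILITY["eyes"]
        p_hear = RELIABILITY["ears"] if heard == h else 1 - RELIABILITY["ears"]
        scores[h] = 0.5 * p_see * p_hear  # uniform prior, independent sensors
    total = sum(scores.values())
    return {h: s / total for h, s in scores.items()}

print(posterior(seen="fa", heard="ba"))
# -> {'ba': ~0.11, 'fa': ~0.89} -- perception sides with the eyes
```

With those numbers the eyes win in a landslide, which is exactly what your subconscious concludes.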
(Similar mind-meltingness happens strictly within your vision system too. Like the gaping hole near the middle of your field of vision where your optic nerve goes through your retina. Two eyes can cover for each other and fill in the other eye’s gap accurately. But even with just one eye open, your brain infers what you ought to be seeing in that hole and makes your conscious brain think you’re seeing it. It’s very much like the way an image diffusion model hallucinates plausible details.)
Back to Musk’s claim that when sensors disagree it hurts accuracy. I’m claiming the McGurk effect is the exception that proves the rule. Literally: the McGurk failure shows what powerful sensor fusion the brain is capable of. It’s making the correct Bayesian update and causing you to perceive what is most likely to be the truth. Only with the additional evidence of learning about video editing and the McGurk effect can you do any better.
(Except, no, just kidding, your subconscious brain refuses to make that update and the McGurk effect keeps right on working despite yourself. Oh well. It’s still an amazing tradeoff, improving your hearing accuracy in all real-world situations at the tiny cost of being wrong in that one McGurk video. For technical rabbit holes, see predictive processing for the brain and Kalman filters for machine learning. And I should mention I’m out of my depth on the human brain side. Apparently the Bayesian brain hypothesis is not without controversy.)
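If you want to poke at the Kalman filter idea without falling down the whole rabbit hole, here's a minimal one-dimensional sketch (my own toy, not code from any real system):

```python
# Minimal 1-D Kalman filter update: repeatedly fuse the current estimate
# with a noisy measurement, each weighted by its precision (1/variance).
def kalman_update(est, est_var, meas, meas_var):
    k = est_var / (est_var + meas_var)  # Kalman gain: trust in the measurement
    new_est = est + k * (meas - est)    # shift the estimate toward the measurement
    new_var = (1 - k) * est_var         # the fused estimate is more certain
    return new_est, new_var

est, var = 0.0, 100.0  # start with a vague prior
for meas, meas_var in [(10.2, 4.0), (9.8, 4.0), (10.5, 1.0)]:
    est, var = kalman_update(est, var, meas, meas_var)
print(est, var)  # converges toward ~10 with ever-shrinking variance
```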
In conclusion, more sensors more better. Imagine you’re a self-driving car getting conflicting information from cameras vs lidar about how far away a grand piano in the middle of the road is. Lidar’s great at measuring distance so disregarding the cameras might be correct. Better yet, conservatively take the distance-to-piano number to be the min of the two. Or, in full generality, start with a prior probability distribution and update it repeatedly based on the evidence of all your input streams. Exactly like Elon Musk doesn’t, at least on the lidar question.
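For concreteness, here's the simplest Gaussian version of that update, with invented numbers for the piano scenario:

```python
# Fuse camera vs lidar distance estimates (meters) by treating each as
# a Gaussian and multiplying them: a precision-weighted average.
camera = (30.0, 25.0)  # (mean, variance): cameras are poor at depth
lidar = (24.0, 1.0)    # lidar is great at depth, hence tiny variance

def fuse(a, b):
    (m1, v1), (m2, v2) = a, b
    v = 1 / (1 / v1 + 1 / v2)      # fused variance: smaller than either alone
    m = v * (m1 / v1 + m2 / v2)    # precision-weighted mean
    return m, v

mean, var = fuse(camera, lidar)
print(round(mean, 2), round(var, 2))  # 24.23 0.96: the lidar dominates
print(min(camera[0], lidar[0]))       # 24.0: the conservative min heuristic
```

Note how the fused estimate lands nearly on top of the lidar's, because the lower-variance sensor dominates. That's the McGurk story again, with lidar playing the role of your eyes.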
Random Roundup
Upgrades: GPT-5.1 (I’m still using it daily; it’s probably a bit better but nothing obvious), Grok 4.1 (don’t believe the marketing copy; this is still not in the same league as ChatGPT, Claude, and Gemini, not that it’s way behind either), and, most impressively, Google’s Gemini 3 Pro. The latter shows a quantum leap on tons of benchmarks and I’m willing to believe that some of that is meaningful. Also Google’s new image model, Nano Banana Pro (I’m on board with this name for a change), is pretty impressive. Here’s a random clever one:
ChatGPT has a new feature where you can invite other people to a ChatGPT session. I started one you can join right now (if you have a ChatGPT account). It’s so bad though. Or maybe that’s how ChatGPT normally is for most people, without my custom instructions, without access to my chat history, and without extended thinking turned on. Seems pretty insufferable to me. I have new sympathy for AI skeptics who haven’t gone as deep into what the frontier models are capable of.
Not news but if you’re confused about Moravec’s paradox, here’s my explanation, distilled from Ege Erdil. What gives us the intuitive sense of something like chess requiring a lot of intelligence is the disparity between the best and the worst humans. If everyone is about equally good at a task (like telling cats from dogs, or even picking up a rock) then we think of it as easy. But in terms of raw cognitive capability, or in terms of how easy it is to automate a cognitive task, it’s the exact opposite. Tasks like image recognition have been honed by evolution. Tasks like math and chess are lousy with low-hanging fruit.
I enjoyed Scott Alexander’s review of a new paper on “AI consciousness”. In particular why the discourse on this is so, so bad. You probably have to know and care about the so-called hard problem of consciousness in order to appreciate Scott’s post. I have a previous very brief AGI Friday on this topic and a longer followup, mostly to say I just don’t think it’s relevant (yet).
Waymo is expanding more rapidly. In particular they just got government approval to cover much bigger swathes of California, including where I am at the moment (heading home from Inkhaven tomorrow).
¹ Credit to GPT-5.1-Thinking for charitably pointing out this way of steelmanning Musk’s position. Read on for why it’s super wrong.

I haven't seen them able to go on freeways yet. My bike was in the shop this past week, and one night, I checked the Waymo app to see if I could take it home from San Mateo to SF. It said it would cost $46 (not that bad) and take over an hour (terrible). I assume that's because it wouldn't take the freeway, 101 or 280. At freeway speeds, it's a 25-35 minute trip.
Checking now — to take the same trip in the other direction, 16 miles, it would cost $50.65 and take an hour and 15 minutes. The route shows it will not take the freeway, so I guess they **could** theoretically roll it out, but they haven't done so yet.