Let's Not Worry About "Model Welfare"
Nor insect welfare, for that matter
I sure like the kind of people who run with wild hypotheticals, but this is a bit much even for me:
We investigated Claude’s self-reported and behavioral preferences, and found a robust and consistent aversion to harm [like child porn and helping terrorists]. Claude Opus 4 showed:
A strong preference against engaging with harmful tasks;
A pattern of apparent distress when engaging with real-world users seeking harmful content; and
A tendency to end harmful conversations when given the ability to do so in simulated user interactions.
That’s from Anthropic’s model welfare team, whose first fruit is giving Claude the ability to end conversations. (Amusing point from my friend David MacIver: If the AI were actually conscious, the consciousness would end when the chat does, so Anthropic is effectively saying, as David put it, “Claude now has the ability to kill itself in preference to continuing talking to you”.)
In Anthropic’s defense, they agree the probability of LLM consciousness is very low. Apparently their estimates of the odds range from 1/700 up to 1/7. Going much above the low end of that range seems crazy to me. But mostly I think it’s beside the point. AI safety today is about deception, autonomy, resource acquisition, situational awareness, etc. None of that depends on LLM consciousness. I guess it’s great to have some folks focused on the farther future, though.
And I do think studying LLMs will be valuable for philosophers and others studying human consciousness. Probably mostly via “oops, this thing we thought was fundamental to consciousness must not be, since LLMs can do it”. In terms of safely building AGI, though, I think worrying about model welfare is getting way ahead of ourselves.
Random Roundup
The main pushback I got on last week’s post — “How to Take Over the World in 12 Easy Steps” — was from AI research engineer Tedd Hadley, who is optimistic that morality naturally co-evolves with intelligence. I’ve added my counterargument in the comments.
I’ve cited Timothy B. Lee approvingly more than once, but I’m disappointed by his recent article arguing that “keeping AI agents under control doesn't seem very hard”. It sounds like a classic argument from personal incredulity. I endorse Zvi Mowshowitz’s rebuttal, though I’d tone it down. As I constantly emphasize, we’re massively uncertain about what AGI will be like. Lee may be right that we can “just not give AI too much control”; I’m genuinely unsure how that plays out. I just see too many ways it might not work, like steady pressure to automate more and more of the economy until we frog-boil ourselves.
Another zinger from Zvi Mowshowitz, predicting that radiologists are about to be automated away. This despite radiologists being in greater demand than ever and commanding salaries as high as $900k. Zvi’s point is that these are the death throes of the field. Med students know it’ll be bleak for radiologists soon, so they don’t enter the field. The dwindling pool of remaining radiologists will make a killing right up until they become obsolete.
We pretty much have a verdict already on whether GPT-5 still makes egregious errors. Namely, that it totally does. Christopher Moravec has plenty of complaints. I’ve been using it extensively since it was released and can confirm it’s a confusing mix of markedly increased capability, frustrating wrongness, and occasional lying. I keep mine always set to “thinking” mode, which makes it very slow.
For another take on places people get off the AI doom train, Liron Shapira has collected 83 of them.
The last stop for getting off the AI doom train is “maybe if AI is smart enough to supplant us, then that’s good-actually”. I’m kind of impressed by Eliezer Yudkowsky’s answer from March of 2003 (almost a quarter century ago!):
I wouldn’t be as disturbed if I thought the class of hostile AIs I was talking about would have any of those qualities [intelligence, awareness, creativity, passion, and curiosity] except for pure computational intelligence devoted to manufacturing an infinite number of paperclips. It turns out that the fact that this seems extremely “stupid” to us relies on our full moral architectures.
So much sanity from Dwarkesh Patel in the following video (or see his slightly older blog post version), talking about why he’s less bullish about AGI being right around the corner, while beautifully emphasizing how much uncertainty there still is. Even so, he goes out on a limb to predict that either the trillions of dollars being poured into AI research will yield AGI by 2030 or we’ll be in a relatively normal world through the 2030s or even the 2040s. I’m skeptical that the distribution is literally bimodal like that, but I really like how he thinks and plan to say more about this soon.


A bimodal outcome of (AGI by ~2030) or (pretty normal 2030s and 2040s) makes sense to me. I haven’t actually watched the video, so this is mostly my own framing, which may or may not line up with the video, haha.
First, let’s assume that no further development of basic AI models happens and we’re stuck forever at the GPT-5 level. This level can do a good job augmenting labor, and it can do some amount of independent work, but it clearly isn’t at the point where whole departments can just be replaced. Many more tools will be invented, but they’re capped by the limits of token costs, reliability, and ability. It’s maybe a 2x multiplier on overall productivity for office jobs, but it’s not a 10x or 100x multiplier that would completely upend society. It maybe replaces Google, but Google integrates it first. It can’t replace physical labor at all. It will still have a large effect on society, but I predict 2045 will look like 2025 to a similar extent that 2025 looks like 2005 (which is still pretty different! smartphones! social media! the internetification of everything! ChatGPT!). Notably, though, OpenAI’s valuation goes down. There’s no superintelligence, and a lot of the valuation was based on them being able to get superintelligence first.
Now let’s examine that assumption. Current scaling techniques are limited, and it’s not clear where else in the pipeline you can scale up. The concept of reasoning (increasing the scale of output) was fairly predictable early on, albeit not the details. Increasing the scale of input data/processing has been happening all along. But at this point we’re running out of room to scale further on either of these fronts without some new technique. So unless a new technique is invented, I think we’re in that first hypothetical.
So will a new technique be invented? Obviously it’s impossible to know now, but it seems likely to me that the probability of it being found is a function of the amount of resources being put in to find it. Therefore, either it will be found in the next ~five years, with trillions of dollars of investment, or it likely won’t be found for much longer.
I do want to note, though, that I don’t pin anything on 2030 specifically. That number mostly depends on how many resources society is willing to invest without anything new to show for it. And I don’t have a good sense for that; venture capitalists are unknowable to me.
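To make that intuition concrete, here’s a minimal toy sketch (my own, not anything from Dwarkesh’s video or the post above): model the discovery of a new technique as a Poisson process whose rate scales with annual research investment, and assume investment stays huge for roughly five years before drying up. Every number in it is made up purely for illustration.

```python
import math

# Toy model (an assumption for illustration, not anything from the post):
# treat discovery of a new scaling technique as a Poisson process whose
# rate is proportional to annual research investment.
HAZARD_PER_TRILLION = 0.15  # assumed chance of a breakthrough per $1T-year of effort

# Assumed investment schedule: ~$1T/year through 2030, then a sharp pullback.
investment = {year: (1.0 if year <= 2030 else 0.1) for year in range(2026, 2046)}

p_none = 1.0  # probability that no breakthrough has happened yet
for year in sorted(investment):
    p_none *= math.exp(-HAZARD_PER_TRILLION * investment[year])
    print(year, f"cumulative P(breakthrough) ~ {1 - p_none:.2f}")
```

Under those made-up numbers, most of the breakthrough probability accrues while the big money is flowing and the curve nearly flattens afterward, which is roughly the bimodal picture: either a new technique shows up soon, or it probably doesn’t show up for a long while.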
Related work (HT Tedd Hadley): https://arxiv.org/abs/2308.08708
Also an impressive ongoing series on human consciousness by Sarah Constantin: https://sarahconstantin.substack.com/p/making-sense-of-consciousness-part-8a8