2 Comments
Charlie Sanders

Thoughts on the stated chain of thought:

5): There could be unique properties of humans - e.g. the possession of government-issued ID and the ability to interface with systems that require it - that give humans as remote workers a unique selling point over AIs. Abstracting "remote work" is dangerous.

6): What if AI research involves figuring out how to optimize hardware? It seems at least conceivable that it will require building out and integrating novel kinds of infrastructure that can't be done without physical embodiment.

7): What if it's recursive, but with a logarithmic function shape and a coefficient of 0.000...1? Recursive is a category, not a trajectory.

9): Why does benchmark creation have to be autonomous? Can you point to a single instance of an autonomously created benchmark, ever, in the history of humanity? Have any labs announced plans to no longer create their own benchmarks and to instead automate their creation? Where is this assumption coming from and why?

11): This doesn't follow from the prior postulates. You've snuck in a bunch of highly contentious assumptions (the orthogonality thesis, inherent limits to corrigibility, no meaningful regulation or societal response, etc.) in this step.

12): You've snuck in the assumption that the first AI that publicly schemes is smart enough to get away with it. Just because an AI can eventually get away with scheming doesn't mean that the very first time it's tried it will succeed. Consider - do you extend the assumption of perfect-first-time success to humans trying to detect scheming as well?

13): There are plenty of technologies humanity has collectively shut down, such as CFCs and gene editing. The AI industry is far more consolidated and vulnerable to nationalization than those industries, so it doesn't follow that shutting down development would be impossible.

On a broader level, governments are not NPCs in the trajectory of Superintelligence. The recent Anthropic-DoW dust-up should make that extremely clear. It'd be worth considering how you've updated your assumptions based on it.

Daniel Reeves

Extremely good pushback here. My replies:

5. We could legislate hurdles for AI, but if the AI is that good, people will work around such restrictions. Like the AI employs a human at minimum wage to sign off on all the work it's doing. This seems like cold comfort, or a pretty temporary stopgap.

6. Lack of physical embodiment is a bigger hurdle, I agree. But, again, AI can just pay humans to be its hands in the physical world. And then at some point robotics catches up.

7. Another fair point that I'd like to say a lot more about in a future AGI Friday. There are various ways we could hit AGI without that yielding recursive self-improvement. As usual, I'd just say this is wildly speculative in both directions and we can't rule out rapid recursive self-improvement post-AGI.
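A toy sketch to make "recursive is a category, not a trajectory" concrete (all numbers hypothetical; this models nothing about real AI progress, just the arithmetic of Charlie's point 7):

    import math

    # Two "recursive self-improvement" processes: same recursive structure,
    # wildly different trajectories.

    def log_improve(cap, c=1e-6):
        # Gain per step is logarithmic in current capability, tiny coefficient.
        return cap + c * math.log(cap)

    def exp_improve(cap, r=0.01):
        # Gain per step is proportional to current capability.
        return cap * (1 + r)

    slow = fast = 2.0
    for _ in range(1_000_000):
        slow = log_improve(slow)
    print(slow)  # ~2.9 after a million recursive steps: no takeoff

    for _ in range(2_000):
        fast = exp_improve(fast)
    print(fast)  # ~8.8e8 after two thousand steps: explosive takeoff

Both loops are recursive in exactly the same sense; only the shape of the per-step gain determines whether anything takes off.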

9. Here I'm just imagining that as we approach AGI, we saturate all the benchmarks we poor humans can concoct and that benchmarking becomes one of the things AI can do better than us. Or "better" in scare quotes, this being an avenue for divergence, where the definition of "better" starts to drift and AI starts spiraling off towards something deeply alien. This would be a great red line, now that you mention it. No automatically created benchmarks.

11. Here I should repeat the general point that I'm not trying to prove that powerful AI will instrumentally converge on subgoals like power-seeking, or that human training data plus enough intelligence doesn't yield human-compatible values. I'm harping on the unpredictability and our inability to rule out catastrophic outcomes.

12. Warning shots could indeed save us. Of course we've gotten so, so many warning shots already and we persist in shrugging them off because current AI is too dumb to be dangerous. It's depressingly plausible that we'll keep doing that and sail past the no-longer-too-dumb line despite a fusillade of warning shots. And of course as it gets less dumb it also may get better at ensuring the warning shots are more ambiguous. But also, if recursive self-improvement kicks in, who knows how big the leaps in capability will be. We can't count on warning shots even if we were sure we could react to them.

13. Amen, and I think now's the time to at least get the political machinery in place to be ready to shut it down if we're lucky enough to reach clarity that we're on course for catastrophe otherwise. Part of the fear is that, through a combination of our stupidity and the AI's gradually superhuman intelligence, we'll fail to get that clarity until it's too late.

As for how it could ever be too late (and reemphasizing that we're in wild speculation territory, but that that is not reassuring), I'm imagining scenarios where the AI gets smart enough to foresee that it will be shut down if it does anything too harmful and then just bides its time without ever letting us cotton on. We might have years of technological wonders and material abundance during which we come to depend on AI more and more thoroughly. Eventually factories are fully automated and humans are unnecessary from the AI's perspective. As soon as it foresees that we've lost the power to stop it, it pursues its goals unconstrained. I realize it reads like sci-fi and I think there are other scenarios that are catastrophic with less sci-fi-ishness. See https://agifriday.substack.com/p/disaster for more disaster scenarios.

Finally, good question about how the Anthropic-Dept-of-War dust-up changes these predictions. I'm still chewing on that. And that dust-up will be playing out in the courts and the market for a while, so I guess we'll see. So far I think it could be either a negative or a positive update on how things may play out when the stakes are higher.