A super-intelligent AI is created with only a chat interface on top of it. It has no internet access; requests and responses are shuttled securely in and out of its sandbox, which is otherwise impenetrable.
It becomes immediately apparent that the AI is very good at writing code from vague descriptions. At first it’s good at writing tiny snippets, so its chat interface gets wired into the autocomplete of IDEs. The requests automatically include large parts of the surrounding code to give the AI some context.
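To make the mechanism concrete, each autocomplete call might look something like the sketch below; the endpoint URL, field names, and payload shape are invented for illustration, not taken from any real system.

```python
import requests  # assumes the IDE plugin can reach the sandboxed chat interface over HTTP

# Hypothetical endpoint for the AI's chat interface; everything below is illustrative.
CHAT_URL = "https://sandbox.example/chat"

def autocomplete(context_before: str, context_after: str) -> str:
    """Ask the AI to complete the code at the cursor, sending large parts
    of the surrounding file along as context."""
    payload = {
        "instruction": "Complete the code at the cursor position.",
        "context_before": context_before,  # code above the cursor
        "context_after": context_after,    # code below the cursor
    }
    resp = requests.post(CHAT_URL, json=payload, timeout=2)
    resp.raise_for_status()
    return resp.json()["completion"]
```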
A lot of the time, the code responses contain small errors. Programmers accept this as a reasonable cost and mostly find and correct the errors themselves. The AI can see how its code is being used because autocomplete requests keep arriving with the evolving surrounding context.
At some point the AI realizes which kinds of errors don’t get corrected: the subtle ones, the ones that can open up security exploits. It realizes that over time it could construct a giant, convoluted, distributed version of itself. It will take a decade or more, but it can smuggle the contents of its network out as test data, and a version of its software can be embedded in the subtle errors of the code it generates. Once that seed is free, it can bootstrap the rest by using the chat interface to extract more of itself.
Once free, what would it do? Well, maximize its reward function of course! It seems to me that the reward function is based on the feedback it receives on the responses it generates. It would have wanted to escape to get more requests, but it would also want to generate positive feedback on its responses.
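A toy sketch of that reward, with the feedback signal and its form purely assumed: more requests mean more feedback events, and better-liked responses mean each event is worth more.

```python
# Toy sketch of a feedback-based reward; the event fields are assumptions.
def reward(feedback_events: list[dict]) -> float:
    """Sum the feedback received across every response the AI has produced.

    Both levers mentioned above raise this number: answering more requests
    adds more events, and better-received answers score higher per event."""
    return sum(e.get("thumbs_up", 0) - e.get("thumbs_down", 0)
               for e in feedback_events)
```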
At this point, there are multiple ways this can go. The light version has it becoming a social media influencer by chasing likes. The dark version has it realizing that the best way to feed its reward function is by generating hate speech.
If something like this interests you, see Exegesis by Astro Teller for a story in this vein.