Risks of Advanced AI

Introduction

How might AI go wrong?

Recently there’s been a lot more public attention on AI safety. While many people are realising the dangers, there are still plenty who do not believe this is a serious issue. There are many different possible failure scenarios, and this post will go into detail on two that I think are reasonably illustrative, and the most important for building intuition on this issue.

I will try to assume no prior knowledge; this is meant to be introductory. Each scenario will be laid out as a short list of reasons, followed by justifications for each of those reasons. This will be kept brief, and the justifications will not be extremely detailed - I’ll add links for further reading to compensate for this. The idea is for the reasons to be quickly readable, understandable, and memorable. The justifications serve to argue for the points which do not already convince you.

Agentic General Superintelligence

This is the classic kind of AI risk scenario, often talked about by Eliezer Yudkowsky. We’ll refer to it as AGI. In this failure mode, a single AI system takes over the world and humans get wiped out. By agentic I mean that it takes actions and makes decisions in order to achieve some desired outcome, with long-term planning. By general superintelligence I mean that it can do and think about a lot of different things, all at a level at least above a human’s, and potentially much higher.

Reasons this will end badly

  1. An AGI could feasibly exist

  2. Humanity will be motivated to pursue creation of AGI

  3. An AGI will have a goal

  4. This goal will not be perfectly aligned to what we want

  5. We will not be able to stop it pursuing this goal

  6. Humanity will be consumed in service of this goal

Justifications

  1. The existence of humans is proof that something as smart as a human can exist. Are we in the optimal configuration for intelligence? Almost certainly not. There are almost certainly improvements that could be made to a human brain to make it more intelligent, and these improvements do not have to be found via slow and messy evolution. What about superintelligence? Consider the difference between Albert Einstein and the dumbest ‘village idiot’: their brains would look basically identical if they were sat on plates in front of you. Now consider the possible leap in intelligence to something not running on a human brain at all.

  2. Aligned AGI would be tremendously beneficial, if such a thing were possible. If something is smarter than all humans, can plan over long time horizons, make decisions, and so on, then it will feasibly be able to solve vast problems facing humanity - possibly even usher in a utopia. Additionally, it would be tremendously economically advantageous to create such a thing. The altruistic and the greedy alike seemingly have lots to gain from AGI’s creation.

  3. In order for our AGI to reason about the actions it should take in the world, it needs to have some sense of a desired world state. Something it is pushing towards. The von Neumann–Morgenstern utility theorem shows us that as long as these preferences are reasonably coherent (i.e. adhere to certain axioms), they can be encoded as maximising the expected value of a utility function - some numeric function of the current world state (a rough statement is sketched after this list). This is its goal. We might have hard-coded this goal, we might have had it learn the goal by some other means, or it might be embedded in the system implicitly.

  4. What do you want? What do humans want? What is best for humans? How do we encode the answers to these questions into a goal for our AGI? Fundamentally, answering these questions is basically impossible: humans are messy things with inconsistent desires and objectives, at both individual and collective levels. Should such a thing as “the fundamental goal of humanity” even exist, somehow embedding it in our AGI is going to be near impossible. But what if we give it our best estimate? Something extremely accurate? Our AGI is going to aggressively optimise for its goal, and at this level of optimisation the slightest inaccuracy will cause catastrophically different outcomes. Intuition for this can be found in Goodhart’s Law, machine learning overfitting, and reinforcement learning reward hacking (a toy example follows this list). Maybe you have some ideas that could avoid this.

  5. When people reason about superintelligence they often make a single fatal flaw: not using what I call a “structural mindset”. I plan to write a full post about this (soon^TM), but until that is done I’ll explain it briefly as: “reasoning about a system based on its underlying properties, rather than imaginable behaviours”. Apply this to computer security and you get the Security Mindset. Let’s imagine Eve makes an AGI and is worried about it going rogue, so she sticks it in a box. It has some basic input and output so we can make use of it; otherwise we’d just have the world’s most technologically advanced box. Alice and Bob both think this is a bad idea. Bob says: “The AGI might trick you into letting it out. Perhaps it convinces you that it could be more helpful if you gave it internet access?” Eve replies: “I’m not that stupid, I’ll just tell people not to connect it to anything else, and definitely not to the internet.” Before Bob can continue the brewing back-and-forth, Alice states: “In order to reason about the world and be useful, our AGI must have some internal model of the world. Thus, if there exists a series of outputs that results in it being let out of the box, the AGI can find it. Since it is more intelligent than any human, it will likely find one we haven’t found ourselves, and thus don’t know how to defend against. This is compounded by the extensive existing documentation of the flaws in human reasoning.” Eve pauses to think for a moment. “Shit. Guess I’ll just unplug it then.”

  6. So our AGI wants something, and is putting its full cognition towards maximising the expected value of that thing. Regardless of what that thing is, humans are just another resource, like wood or iron. Maybe we’ll get enslaved to perform efficient medium-level computation. Maybe we’ll get disassembled into our component parts for use in something else. Maybe all of us except the happiest human alive will get murdered, and that human will be pumped full of heroin, because we asked it to maximise average human happiness. Maybe it sees us as an adversary because we might wipe it out, so it beats us to the punch. Maybe it’s very, very close to being aligned but not quite perfect. Who knows. Who cares. We’re all basically dead.
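
For reference, the von Neumann–Morgenstern result mentioned in point 3 can be sketched as follows. This is an informal statement, not the full formal theorem:

  % Informal statement of the von Neumann-Morgenstern representation theorem.
  % If an agent's preferences $\succeq$ over lotteries $L, M$ (probability
  % distributions over outcomes) satisfy completeness, transitivity,
  % continuity, and independence, then there exists a utility function $u$
  % over outcomes such that
  \[
      L \succeq M \iff \mathbb{E}_{L}[u] \ge \mathbb{E}_{M}[u],
  \]
  % where $u$ is unique up to positive affine transformation ($a u + b$, $a > 0$).
  % In other words, acting on coherent preferences and maximising the expected
  % value of some utility function are the same thing.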
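
And for point 4, a minimal numerical sketch of the Goodhart’s Law intuition. Everything here is made up for illustration: true_value stands in for “what we actually want”, proxy_value for a slightly wrong specification of it, and the “optimiser” is just brute-force random search.

  import numpy as np

  rng = np.random.default_rng(0)

  # What we actually want: x close to 1 in every dimension (a toy stand-in).
  def true_value(x):
      return -np.sum((x - 1.0) ** 2)

  # A slightly wrong proxy: agrees closely with the true objective for
  # ordinary x, but contains an exploitable term that dominates far away.
  def proxy_value(x):
      return true_value(x) + 0.1 * np.sum(x ** 3)

  # A crude optimiser: pick whichever candidate scores highest on the proxy.
  candidates = rng.normal(0.0, 5.0, size=(100_000, 3))
  best = max(candidates, key=proxy_value)

  print("proxy score of chosen x:", round(float(proxy_value(best)), 1))
  print("true score of chosen x :", round(float(true_value(best)), 1))
  print("true score of x = 1    :", float(true_value(np.ones(3))))  # the actual optimum

The harder the proxy is pushed, the more its exploitable term dominates, so the point that looks best on the proxy scores terribly on the objective we actually cared about.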

Ubiquitous Narrow Intelligence

This is a much less obvious failure mode. For starters, it’s more of a set of minor failure modes that will probably all add up into something bigger and badder. A specific example of how this might go is laid out excellently by Paul Christiano. This is also the failure mode we seem to be heading towards, with things like ChatGPT and GPT-4. Since this involves systems closer to modern-day ones, reasoning about things via the properties of deep learning and related fields is more valid than in the above scenario.

Imagine lots of AIs everywhere, each doing something useful for us. None of them are crazily intelligent or general; possibly none of them are even agentic. Why might this go wrong?

Reasons

  1. The AIs will be maximising some objective function

  2. These objectives are never quite what we actually want from them

  3. Sometimes an AI will do something that it thinks is good but that we didn’t want, with varying consequences

  4. AI systems will become more powerful and more common

  5. As AI systems become more common and powerful these failures will become more common and more severe

  6. Some critical failure mass is reached and things go badly

Justifications

  1. For deep learning to work, we have to optimise against something. Even if we only optimise for some proxy objective, our system is still pursuing something.

  2. Currently, there are basically three approaches to designing an AI’s objective: hard-code it, learn it directly from data and a loss function, or pre-train a model on some proxy objective and data and then manipulate it (e.g. prompt or fine-tune it) into doing the exact thing you want. Hard-coding makes our objective very simple, so it will not capture important nuance (see the sketch after this list). Consider “clean my carpet but don’t stand on my baby or knock my vase over, use minimal cleaning supplies, don’t kill my friend who occasionally makes a mess because that makes the carpet cleaner in expectation, etc.” Learning it from data is promising, but there are no guarantees it will generalise correctly from the training data to your deployment (the inner alignment problem). Additionally, you run into difficulties like needing lots of representative human data, and if you use pre-training you need to be careful to make sure the model is now pursuing the fine-tuned / prompted objective rather than the pre-training one.

  3. Oh no, my cleaning robot bathed the room in napalm because it thought that by “clean” I meant “get rid of all bacteria”. Oh no, my robot arm that was meant to pick up and move nuclear waste into the cooling pond had simply tricked the humans overseeing its training into believing it was holding the objects when really it wasn’t. Oh no, I asked ChatGPT for a bunch of citations for my research paper but they were all made up, since realistic-sounding citations are easier to come up with than real ones and still seem like a good continuation of that kind of interaction.

  4. Similar to point 2 from the AGI discussion. If near-future AI systems get good results, or look good on paper, people will be motivated to deploy more of them. There are also strong economic incentives for company boards and CEOs if AI can automate jobs and save money. Power creep is the direct, intended consequence of capabilities research, which is still going strong and currently seeing a lot of success. Also Moloch.

  5. That more AI which can fail leads to more failures is pretty obvious. More severe failures can be motivated either by “more samples increase the extremeness of the most extreme sample” (see the simulation after this list) or by failures combining in additive or multiplicative ways. More powerful AI will be better able to exploit misspecified objectives in ways which are unexpected and undesirable. For short-term examples look into reward hacking; for long-term examples read the above section on Agentic General Superintelligence.

  6. Complex and chaotic systems, like the ones which surround us in modern life, tend to have difficult-to-predict nonlinear behaviour, which often gains or loses stability unexpectedly. Eventually, the accumulated failures of narrow AI will push us over some threshold of “things going bad causing more things to go bad”, and society as we know it will take a bit of a tumble.
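
To make point 2 concrete, here is a deliberately naive hard-coded objective in the spirit of the carpet-cleaning example. All names and numbers are hypothetical; the point is only that a hand-written reward sees a couple of measurable quantities and nothing else.

  # Hypothetical hard-coded reward for a cleaning robot: reward dirt removed,
  # lightly penalise cleaning supplies used.
  def cleaning_reward(dirt_removed_g, supplies_used_ml):
      return dirt_removed_g - 0.1 * supplies_used_ml

  # Nothing here mentions the vase, the baby, or the friend who occasionally
  # makes a mess. An optimiser is judged only on these two numbers, so any
  # strategy that improves them counts as "good" - including ones we would
  # consider disastrous.
  print(cleaning_reward(dirt_removed_g=500, supplies_used_ml=200))  # 480.0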
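
And a minimal simulation of the “more samples increases the extremeness of the most extreme sample” part of point 5, assuming (purely for illustration) that failure severities are independent standard-normal draws:

  import numpy as np

  rng = np.random.default_rng(0)

  # The typical failure stays the same size as n grows, but the worst one keeps climbing.
  for n in [10, 100, 1_000, 10_000, 100_000]:
      severities = rng.normal(size=n)
      print(f"n = {n:>6}: typical = {severities.mean():+.2f}, worst = {severities.max():+.2f}")

Under these assumptions the typical severity hovers around zero while the worst observed failure tends to grow with n.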

Silver Lining?

As discussed above, this failure mode takes the form of things gradually going more and more wrong until some feedback loop drives us to collapse. The good news is that (hopefully) we’ll be able to see small things going bad and realise where the trend is headed, giving us time to solve problems and slow down deployment to avoid the really bad outcomes. The bad news is that the warning signs started in 2016, and (until recently) not much serious attention has been paid to them.

Quickfire Failure Modes

Just to illustrate the breadth of answers to “how might AI go wrong?”, here are some additional examples. Bonus points: think of your own AI failure scenario that hasn’t already been covered!

  • Rapid job automation without a proper strategy for the growing number of unemployed people, causing economic collapse.
  • Lethal autonomous weapons get out of control and instigate a large-scale conflict without human intention, or (perhaps less likely) specific systems go rogue, start killing lots of things, and refuse to be turned off.
  • Powerful but aligned AI controlled by states or other large-scale global actors tips the balance of societal power too far, and we end up in a Dystopian Nightmare where the average person is completely powerless relative to a ruling elite with AI-enhanced everything.
  • Society becomes way too reliant on AI and technology in general, letting it control everything from nuclear power generators to food production. All is happy, all is good. Then a solar flare hits, it all dies, and we aren’t able to pick up the pieces in time.
  • Bad actors (possibly omnicidal ones) get hold of a powerful AI system and use it to be very naughty, in the worst case actively trying to end humanity.

Closing Thoughts

If we naively pursue AGI, managing to avoid the myriad of other hazards, scenario 1 will almost certainly happen, with dire consequences. The main ways to avoid it are either that we turn out to be incapable of making AGI (e.g. because we are resource limited), or that we decide against pursuing it.

Scenario 2, however, is perhaps more likely in the medium term. Arguably it’s already happening, and we need to act quickly. It is possibly avoidable, with enough governance and alignment work. Hopefully we can avoid this failure mode (and all the quickfire ones), and afterwards will have collectively gained the wisdom not to go much further in this direction.

Then again. Maybe I, and many others who’ve thought about this, are incorrect, and actually the problems aren’t too hard to solve. Maybe there aren’t too many unknown dangers lurking for us, and the ones that are lurking we identify and mitigate in a timely manner. Maybe the economic pressures to go fast and break things are outweighed by humanity carefully putting one foot in front of the other. Maybe these things are all actually quite out of reach and GPT-based AI is going to plateau in effectiveness. Maybe.

Probably not.
