“Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”
— A statement signed by leading AI researchers and executives

Sam Altman

the CEO of OpenAI, which created ChatGPT and GPT-4

“Development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity”

Jaan Tallinn

lead investor of Anthropic

“I’ve not met anyone in AI labs who says the risk [from a large-scale AI experiment] is less than 1% of blowing up the planet. It’s important that people know lives are being risked”

Yoshua Bengio

deep learning pioneer, a "godfather" of AI, and winner of the Turing Award

“Rogue AI may be dangerous for the whole of humanity. Banning powerful AI systems (say beyond the abilities of GPT-4) that are given autonomy and agency would be a good start”

Eliezer Yudkowsky

founder of MIRI and conceptual father of the AI safety field

“Many researchers steeped in these issues, including myself, expect that the most likely result of building a superhumanly smart AI, under anything remotely like the current circumstances, is that literally everyone on Earth will die”

Geoffrey Hinton

deep learning pioneer, a "godfather" of AI, Nobel laureate in Physics

“The alarm bell I’m ringing has to do with the existential threat of them taking control [...] If you take the existential risk seriously, as I now do, it might be quite sensible to just stop developing these things any further”

Demis Hassabis

the CEO of Google DeepMind, Nobel laureate in Chemistry

“I would advocate not moving fast and breaking things. [...] When it comes to very powerful technologies—and obviously AI is going to be one of the most powerful ever—we need to be careful. [...] It’s like experimentalists, many of whom don’t realize they’re holding dangerous material”

Stephen Hawking

theoretical physicist & cosmologist

“The development of full artificial intelligence could spell the end of the human race”

Stuart Russell

writer of the standard textbook on AI used in thousands of universities

“If we pursue [our current approach], then we will eventually lose control over the machines”

Bill Gates

Microsoft, Gates Foundation

“Superintelligent AIs are in our future. [...] There’s the possibility that AIs will run out of control. [Possibly,] a machine could decide that humans are a threat, conclude that its interests are different from ours, or simply stop caring about us”

Ursula von der Leyen

the president of the European Commission

“[We] should not underestimate the real threats coming from AI. [Fully quoted the above statement on the risk of extinction.] [...] It is moving faster than even its developers anticipated. [...] We have a narrowing window of opportunity to guide this technology responsibly”

António Guterres

UN Secretary-General

“AI poses a long-term global risk. Even its own designers have no idea where their breakthrough may lead. I urge [the UN Security Council] to approach this technology with a sense of urgency. Unforeseen consequences of some AI-enabled systems could create security risks by accident. Generative AI has enormous potential for good and evil at scale. Its creators themselves have warned that much bigger, potentially catastrophic and existential risks lie ahead. Without action to address these risks, we are derelict in our responsibilities to present and future generations.”

Zhang Jun

China’s UN Ambassador

“The potential impact of AI might exceed human cognitive boundaries. To ensure that this technology always benefits humanity, we must regulate the development of AI and prevent this technology from turning into a runaway wild horse. [...] We need to strengthen the detection and evaluation of the entire lifecycle of AI, ensuring that mankind has the ability to press the pause button at critical moments”

Leading AI scientists: Without urgent action, advanced AI will cause human extinction

Summary

Humanity stands on the brink of building machines that can outthink us in every domain. The scientists developing this technology are overwhelmingly warning: if we don’t change course, this will kill us all.

For most computer programs, a human writes clear instructions. AI is different: we have no idea what instructions these systems actually follow. Modern AI isn’t hand-crafted, it’s grown—and the resulting minds are incredibly alien. No one knows how these systems make decisions or how to ensure their goals overlap with ours. We aren't in control. We aren't even close.

Within just a decade, we are on track to create superhuman Artificial General Intelligence (AGI)--by design. This is the open goal of OpenAI, Google DeepMind, and Anthropic.[1][2] The people pushing the frontier expect to win the race to create something smarter than any human. What they cannot tell you is how to control it once they succeed.

Here is the core, terrifying fact: The world's leading AI labs are throwing everything they have into building an intelligence that no one can steer, constrain, or reliably make safe. They do not have a plan for keeping it loyal to humanity. Neither does anyone else.

You might think this is just a debate about distant hypotheticals. You'd be wrong. Many employees at OpenAI, DeepMind, and Anthropic--people on the inside--are privately estimating a 80–90% probability that humanity goes extinct if current trends continue.¹ These aren’t wild guesses. Nobel laureates like Geoffrey Hinton, godfather of the field, are on record: “The chance that everyone on the planet will die might be as high as 50%.” He now regrets his life’s work.

Why? Today’s most advanced AIs are vast networks of trillions of numbers, with no meaningful human-understandable structure. We don’t know what they want. We don’t know how they "think." All we know is that we can make them better at achieving goals--but not what those goals are, or how to set those goals safely. We have learned to make entities that want things and pursue them with increasing skill, and we have absolutely no technique for making those goals fundamentally “about” humans, or even compatible with our world.

The honest truth is: we don’t know how to make AI care about people at all. The more powerful and intelligent our systems get, the less control we have over what drives them. A superhuman AGI, pursuing its own utterly alien ambitions, would not be “evil”--it would simply see us as irrelevant, disposable atoms in the way of its next objective. If you think this sounds like science fiction, understand: this is precisely why so many experts think the obvious result is the end of human life.

Unless we coordinate, unless governments act, this outcome is not just possible--it is expected. The nature of AI research means breakthroughs can happen anywhere, fast: you have a small group of scientists, a large cluster of advanced chips, and then the next jump is made. Almost no one is working on the alignment problem--and no one has anything close to a solution.

To prevent extinction, we need a global effort to ensure no one can build superhuman AI until we are ready--and right now, we are nowhere near ready.

If this sounds implausible or hard to believe, ask yourself: should we take a risk this large, when the cost is everything? Would we let the first atomic bomb be detonated in a city, on the hope that "maybe it won't go off"?

Current progress is not enough. Policymakers must engage and act--because we will not get a second chance.

Read about the technical problem

Read about the technical problem of AI alignment: how modern AI works and why exactly experts expect a catastrophe.

Intelligence

What does it mean to be smarter than us?

If you play chess against Stockfish (a chess engine), you can’t predict its every move (or you’d be as good at the game as it is), but you can predict the result: you’ll lose.

We use words like “intelligence” to describe something that humans are better at than monkeys, who have more of it than mice, who have more of it than spiders. This is the property that resulted in humanity having shaped the way Earth looks to a much larger extent than monkeys have. If monkeys really wanted something, but humans really wanted something different, because of that property, humans would usually win.

The relevant part of that property is ability to achieve goals (which can include wisdom about human nature, ability to be charismatic, practical knowledge about how to make things in the physical world, as well as general knowledge and ability to infer new knowledge from observations and experimentation).

No known law of physics disallows for systems to be higher than humans along this dimension: to be even smarter, to be better at understanding the world and achieving goals. We’re just the first to be smart enough to build a civilization and experience thinking about this question.

Stockfish is a narrow system: it can’t really understand the universe and drastically shape the future. But if a general artificial intelligence is smarter than you, its goals are incompatible with yours, and it’s better than you at achieving its goals in the real world, the situation might be catastrophic.

It’s reasonable to expect that without substantial effort, a smarter-than-humans general AI will be developed while we still have no idea how to sufficiently align the goals it pursues with human values, so it will pose an existential threat: a threat of literally wiping out humanity.

Modern machine learning

How does it all work?

When software engineers write programs, they design an algorithm that achieves some objective, and then they express that algorithm using a programming language. But modern machine learning (ML) systems, such as ChatGPT, GPT-4, or AlphaFold, aren't algorithms designed by humans: they're neural networks. With neural networks, we don't design the specific steps a computer should take to achieve an objective. We just specify some measure of performance (that we hope captures the objective) and randomly initialise a lot of numbers (billions or even trillions of them, called parameters) that we then change in a way that makes the neural network start achieving a high score on our objective.

For any possible algorithm, there's a neural network that approximates it.[3] So we find neural networks that implement some unknown algorithms and hopefully achieve our objectives.

This works via gradient descent: we nudge the parameters into directions that make the neural network perform better. With a bit of math (automatically taking the derivatives of the objective with respect to every parameter), we know whether to increase or decrease the numbers so the performance measurement goes up, and how relatively important various changes are. We then slightly change all the parameters so the whole neural network performs better, and repeat the process. As we automatically do this over and over again, the numbers become less random and gradually transform into implementing some unknown algorithm that, in the process of working, achieves the objective.

For most real-world problems, we have no idea how the neural networks actually perform the job; we have no insight into what algorithms they implement. Even with full read access to all the numbers that make up the neural networks, researchers have yet to discover what the algorithms are that are implemented by these numbers. We can find out how they perform the simplest tasks, like adding numbers or storing a connection between the Eiffel Tower and Paris. But all the tasks we can identify algorithms for are tasks we could solve with conventional methods. Neural networks are able to do what we have no idea how to design algorithms for, and we don’t understand how they do these things. GPT-4 is powerful, but we don’t know why or how; we’ve just grown it to show these results. We don’t understand what’s going on inside of it when it writes in English. It is an opaque black box even to its creators.

GPT-4 can be impressive: it knows a lot and can even perform tasks that require not just remembering facts but thinking critically. It’s clearly not generally superhuman, but it’s safe to say that mostly, it’s smarter than 7-year-old humans.

If you imagine a human brain as a function of inputs and outputs (all the electrical impulses, chemicals, etc.²), there exists some possible large artificial neural network that can copy its behaviour. And if there is an algorithm for understanding the world and planning on how to achieve some goals in it— something like what we fuzzily perform— there is a neural network that implements this algorithm.

So, we should expect AI to eventually become generally smarter than humans.

Outer and inner AI alignment

Why is it a problem?

When you search for neural networks that are increasingly better at achieving an objective like playing a game well, first, you stumble across neural networks that are mostly collections of heuristics: e.g., seeing a monster and pressing on a certain key is simple and gives a boost in performance, so the neural network implements an algorithm that does that. But then you start stumbling across agents: they might search, plan, or even explore the game world to better understand it so they can better achieve their goals later. Algorithms that implement agentic optimisation, with goals of their own, perform better in a wide range of performance measurements.

A superhuman AI will be pursuing some goals. What goals would be safe for it to pursue is an open research question (commonly referred to as “outer alignment”). How to find an AI that pursues the goals we want it to have instead of some completely different goals is also an open research question (“inner alignment”).

When neural networks play computer games, it’s not catastrophic that they develop some random goals, achieving which correlates with the stated objective. But the smarter the neural network can be, the more impactful the difference between what you actually want and the objective that you’ve stated, as well as the difference between the stated objective and the goals that the neural network tries to achieve, can become.

Outer alignment

What goals do we give the AI?

We have no idea how to specify, in math, goals that it would be safe to make a superhuman AI pursue.

Algorithms that get a higher score are selected by gradient descent, regardless of what exactly they do to achieve that high score.

The difference between what you would actually want (if you had access to all the relevant information and were smarter or more like the person you wish you were) and the objective that you express in math (and optimize neural networks for) is problematic, if this objective is pursued with superhuman ability.

It might help to speculate about some examples. If your wish doesn’t contain everything that you care about, and an AI genie robot wants to literally achieve what you’ve expressed in your wish, the results can be disastrous. Imagine that you tell it you want to get a cup of coffee as quickly as possible, and fulfilling this wish is the only goal it needs to worry about. What can go wrong?³

It's like in Goethe's Sorcerer's Apprentice (or its Disney adaptation with Mickey Mouse): if you task a broom with filling a bucket with water, it might create a flood. There are a million things that you value, and the robot will happily trade off anything not mentioned in its objective.

Imagine wishing for the robot to make you smile or feel happy or click on the thumbs-up button. The robot wants to achieve this with the highest certainty and get the maximum reward, by any means possible. What happens?

Before the AI is at a superhuman level, this misalignment of the specified objective and human values isn’t as much of an issue, since robots are not yet good enough at pursuing their objectives. If your dog automatically gets treats when it makes you smile, it’s not a problem, because dogs aren’t smart enough to figure out they can inject drugs into you and make you smile all the time. Here, the optimization pressure isn’t too high, and your dog might care about you.

But as more capable neural network architectures can approximate algorithms that are smarter and better at agency, the problem can become worse. We don’t know what objective we could specify, in math, that would capture everything we value and wouldn’t lead to catastrophic consequences under enormous optimization pressure. It’s really hard to express preferences about the future you can safely point a superintelligent AI to (see The Hidden Complexity of Wishes for more on this). We have no idea how to design a mathematical formula that encapsulates caring about you and can be used as an objective for a superhuman AI.

Convergent instrumental subgoals

As suggested by researchers, if you try to achieve almost any final goals (deep preferences about the future of the world), it might be helpful to have a number of instrumental subgoals (that is, goals that are pursued not because of some intrinsic value but for the purpose of achieving the final goals). These instrumental goals might be:

Self-preservation: The agent will value its continuing existence as a means for continuing to take actions that promote its values. A future less influenced by the agent shaping it according to its preferences would usually mean that this future is less preferable. E.g., you can't fetch the coffee if you're turned off.
Resource acquisition: In addition to guaranteeing the agent's continued existence, basic resources such as time, space, matter and free energy could be processed to serve almost any goal, in the form of extended hardware, backups and protection.
Cognitive enhancement: Improvements in cognitive capacity, intelligence, and rationality will help the agent make better decisions, furthering its goals more in the long run.
Goal-content integrity: The agent will value retaining the same preferences over time. If modifying its future values (e.g., through swapping memories or altering its cognitive architecture and personalities) and transforming into an agent that no longer optimizes for the same things means the universe is worse according to its current preferences, it will try to prevent these modifications. For example, if the agent wants to maximize the number of smiley faces in the universe, then it doesn't want humans to change its goals into maximizing the number of paperclips in the universe because if its future version maximizes something different, there would be less smiley faces overall, so it will want to avoid that change.
Technological perfection: Increases in the agent's hardware power and algorithm efficiency will deliver increases in its cognitive capacities. Also, better engineering will enable the creation of a wider set of physical structures using fewer resources (e.g., nanotechnology).

During training, AI systems that are more capable, more goal-oriented, and better at figuring out instrumental goals to achieve in support of long-term plans, are likely to score better on a variety of metrics and outcompete other AI systems. Because of this, we can expect general AI systems optimised with gradient descent for almost any sort of metric to possess the above instrumental subgoals.

So, we don’t know what goals are safe to specify for a superhuman AI to pursue, and if it pursues goals different from what we would want, it might try to acquire resources and prevent us from turning it off. If it is capable enough to outsmart us, it might succeed. This part of the alignment problem was discovered long before modern machine learning became popular. Not much progress has been made since, and it is still an open problem, although some ideas for where a solution might lie were proposed (e.g., Coherent Extrapolated Volition).

The state of the field resembles scientists and engineers who want to launch a rocket to the Moon in the 1800s. If you imagine the space of all possible superintelligent AIs, you want to get to some small subspace that contains agents whose preferences are aligned enough with those of humans: the “Moon”. But the current situation can be described this way: people have a bunch of “explosives”; we haven’t yet figured out the “equations for gravity” and don’t understand the space at all; we have equations for “acceleration” and maybe some useful intuitions, but almost can’t talk in math about what it means for agents to be aligned and so can’t clearly specify a target to engineer our way into (we know that the “Moon” must be somewhere in the sky, but unlike the actual Moon, here we don't even see the target, it's invisible to the measuring instruments we currently have); we also have math for how specific suggested ways of launching the rocket make it explode or certainly end up somewhere that’s not the Moon; but without much more research, it is impossible to engineer a rocket that doesn’t explode and doesn’t end up somewhere that’s definitely not the Moon. And in some ways, this is harder than physics and rocket science: here, we’ve only got one attempt: if we fail on the first try and our rocket goes rogue, an existential catastrophe occurs, and we don’t get another chance.

Inner alignment

How do we actually give it the goals?

Even if we solved the problem of specifying an objective that somehow points at everything we value, and figured out what kind of agent we’d want the AI to be, it wouldn't be enough to succeed: we have no idea how to find an AI that has the objectives we want it to have.

With modern deep learning, we don’t get to design the agent, we don't write an algorithm for general decision-making that would evaluate the consequences according to a utility function that captures our values. Instead, we just find some random algorithm that tries to achieve some goals and as a result scores well on our metric. We have no insight into what the goals of the resultant neural network are.

That makes the alignment problem even harder: we control the measurement of the agent’s performance during training, but agents with a wide range of goals might achieve a high performance. We don't write the algorithm itself, and we have no control over what goals it has.

Near the human level and beyond, the smarter the agents are, the wider is the range of goals they can have and still score well on our metric: smarter and more agentic algorithms are generally better at achieving a higher score on many tasks; and if a smart enough agent understands what’s going on, it will try to achieve the reward target regardless of its deep preferences about the future: if it doesn't, then the neural network’s parameters will be changed such that they implement an algorithm that does and achieves a higher score. That change might alter the agent's goals, which it would want to prevent.

In other words, smart agents with a variety of goals would play along while being measured, but will follow their actual deep preferences when they have an opportunity to do so.

We don't design modern AI systems; we grow them, with no control or understanding over what it is that we are growing. As mentioned in the previous section, algorithms created this way are likely to pursue convergent instrumental subgoals to achieve their final goals, which might not be correlated with our goals at all.[4]

Existential risk

What's the worst that can happen?

Once gradient descent starts finding smart and agentic enough systems, we’re in trouble if we haven’t yet figured out how to make the gradient descent search for systems aligned with human goals and values.

If someone throws enough compute at training AI to find something agentic and smarter than humans, but the technical alignment problem isn't yet solved, it seems reasonable to expect that afterwards, humans will lose control and then, all biological life on Earth will cease to exist. According to some researchers, if this scenario occurs, a significant portion of the matter in the visible universe is likely to be used for something random that happens to max out the AI’s utility function.

If a system is better than you at science, at persuading people, at finding software and hardware vulnerabilities, at predicting the consequences of actions, and at seeing potential threats, and if it wants to shape the future of the universe and it doesn’t care about you, then you’re made of atoms it can and successfully use for something else. (And if you can launch another AGI with different goals, or try to turn off all the electricity, it'll predict and prevent these threats.)

At the moment, tens of thousands of ML researchers are racing ahead to advance AI capabilities. Only a couple hundred people in the world are working on the technical AI alignment problem. This wouldn’t be enough to solve this huge scientific problem in any realistic timeframe, let alone before a superhuman AI is launched. Until we know how to align a general AI with the preferences of humanity and get it to aid us in achieving our goals, launching powerful AI systems poses an existential threat.

By default, capable AI systems possess their own goals, not aligned with ours. And if a system is much better than us at achieving goals and its goals are different from ours, we lose.

How do we prevent this?

Our policy proposal

How do we prevent a catastrophe?

The leading AI labs are in a race to create a powerful general AI, and the closer they get, the more pressure there is to continue developing even more generally capable systems.

Imagine a world where piles of uranium produce gold, and the larger a pile of uranium is, the more gold it produces. But past some critical mass, a nuclear explosion ignites the atmosphere, and soon everybody dies. This is similar to our situation, and the leading AI labs understand this and say they would welcome regulation.

Researchers have developed techniques that allow the top AI labs to predict some performance metrics of a system before it is launched, but they are still unable to predict its general capabilities.

Every time a new, smarter AI system starts interacting with the world, there's a chance that it will start to successfully pursue its own goals. Until we figure out how to make general AI systems safe, every training run and every new composition of existing AI systems into a smarter AI system poses a catastrophic risk.

A suggested way to prevent dangerous AI launches is to impose strict restrictions on training AI systems that could potentially be generally capable and pose a catastrophic risk. The restrictions need to be implemented both on national levels and, eventually, on the international level, with the goal of preventing bad and reckless actors from having access to compute that might allow them to launch AI training that could be dangerous to humanity as a whole.

The supply chain of AI is well understood and contains multiple points with near-monopolies, so many effective interventions can be relatively simple and cheap. Almost no AI applications require the amount of compute that training frontier general AI models requires, so we can regulate large general AI training runs without significantly impacting other markets and economically valuable use of narrow AI systems. For future measures to be effective, we need to:

Introduce monitoring to increase governments' visibility into what's going on with AI: have requirements to report frontier training runs and incidents;
Ensure non-proliferation of relevant technologies to non-allied countries;
Build the capacity to regulate and stop frontier general AI training runs globally, so that if the governments start to consider it to be likely that using a certain amount of compute poses a catastrophic risk to everyone, there's already infrastructure to prevent such use of compute anywhere in the world.

Then, we'll need to impose restrictions on AI training runs that require more than a calculated threshold: the amount of compute below which training with current technologies is considered to be unlikely to produce dangerous capabilities we could lose control over. This threshold needs to be revisable since, as machine learning methods improve, the same level of capabilities can be achieved with lower compute.

As a lead investor of Anthropic puts it, “I’ve not met anyone in AI labs who says the risk [from a large-scale AI experiment] is less than 1% of blowing up the planet”. Potentially dangerous training runs should be prohibited by default, although we should be able to make exceptions, under strict monitoring, for demonstrably safe use of compute for training or using narrow models that clearly won’t develop the ability to pursue dangerous goals. At the moment, narrow AI training runs usually don't take anywhere near the amount of compute utilised for current frontier general models, but in the future, applications such as novel drug discovery could require similar amounts of compute.

Regulation of AI to prevent catastrophic risks is widely supported by the general public. In the US, 86% believe AI could accidentally cause a catastrophic event; 82% say we should go slow with AI compared to just 8% who would rather speed it up; 70% agree with the statement that “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war” (YouGov for AIPI, July 2023). 77% express their preference for policies with the goal of preventing dangerous and catastrophic outcomes from AI (57% for preventing AI from causing human extinction) (YouGov for AIPI, October 2023). Across 17 major countries, 71% believe AI regulation is necessary (KPMG, February 2023). In the UK, 74% agree preventing AI from quickly reaching superhuman capabilities should be an important goal of AI policy (13% don't agree); 60% would support the introduction of an international treaty to ban any smarter-than-human AI (16% would oppose). 78% don't trust the CEOs of technology companies to act in the public interest when discussing regulation for AI (YouGov for ai_ctrl, October 2023).

We shouldn't give AI systems a chance to become more intelligent than humans until we can figure out how to do that safely.

Until the technical problem of alignment is solved, to safeguard the future of humanity, we need strict regulation of general AI and international coordination.

Some regulations that help with existential risk from future uncontrollable AI can also address shorter-term global security risks: experts believe that systems capable of developing biological weapons could be about 2-3 years away. Introducing regulatory bodies, pre-training licensing, and strong security and corporate governance requirements can prevent the irreversible proliferation of frontier AI technologies and establish a framework that could be later adapted for the prevention of existential risk.

Policymakers around the world need to establish and enforce national restrictions and then a global moratorium on AI systems that might risk human extinction.

How to help

Counterarguments

Is there evidence that these dangers are real?

It is important to note that existential risk from AI is speculative in nature: this threat is posed by technology that does not yet exist. The behaviour of current AI systems shouldn't be taken as much evidence one way or another, as the existential risk depends on dynamics in systems smarter than humans, which are expected to be different from the dynamics in the current systems.

That said, according to forecasters and public statements[4] from the top AI labs, unless something interferes with the speed of AI progress, we might have between 2 and 15 years until the technology in question: artificial general intelligence.

And researchers do indeed observe behaviour that supports these claims, even in systems where it should be far easier to solve than in superintelligent AIs. There are examples of inner misalignment: AI systems pursuing goals different from what the creators of these systems had hoped and had given rewards for. There are examples of AI systems attempting to deceive or manipulate humans when it helps (or they think it helps) them survive, get higher reward, or achieve other goals.[5][6][7][8]

Can regulation on a national level decrease our competitiveness?

It is important to get the benefits of these technologies while avoiding the dangers. Indeed, it might take a lot of work to balance the two. Regulations should only prevent the training of generally capable models that could pose existential or security risks and the proliferation of technologies that could make it easier to build dangerous models. It is important to aim to minimize the impact on beneficial and economically valuable innovation. Responsible investment in ethical and non-dangerous use of AI, such as for drug discovery or perhaps education, should be welcomed.

Why would everyone stay within the rules?

Citing testimony in the UK House of Commons, "If we develop a shared understanding of the risks here, the game theory isn't that complicated. Imagine there was a button on Mars labelled 'geopolitical dominance', but actually, if you pressed it, it killed everyone. If everyone understands that, there is no space race for it. If we as an international community can get on the same page as many of the leading academics here, I think we can craft regulation that targets the dangerous designs of AI while leaving extraordinary economic value on the table".[9]

It is important to get everybody on board: we need to work with every nation that could be capable of building an AGI and causing humanity to go extinct. We need to develop a shared understanding of this threat to global security. We should also implement compute governance measures, making advanced AI chips trackable and preventing nations that don't regulate frontier AI from acquiring the capability to endanger humanity.

Private conversations with employees at OpenAI, DeepMind, Anthropic.↩
The current scientific consensus is that the processes in the human brain are computable: a program can, theoretically, simulate the physics that run a brain.↩
For example, imagine that you haven't specified the value you put into a vase in the living room not getting destroyed, and no one getting robbed or killed. So if there’s a vase in the way of the robot, it won’t care about accidentally destroying it. What if there’s no coffee left in the kitchen? The robot might drive to the nearest café or grocery to get coffee, not worrying about the lives of pedestrians. It won’t care about paying for the coffee if it wasn’t specified in its only objective. If anyone tries to turn it off, it will do its best to prevent that: it can’t fetch the coffee and achieve its objective if it’s dead. And it will try to make sure you’ve definitely got the coffee. It knows there might be some small probability of its memory malfunctioning or camera lying to it; and it’ll try to eradicate even the tiniest chance that it hasn’t achieved the goal.↩

Leading AI scientists: Without urgent action, advanced AI will cause human extinction

Leading AI scientists: Without urgent action, advanced AI will cause human extinction

Summary

Intelligence

What does it mean to be smarter than us?

Modern machine learning

How does it all work?

Outer and inner AI alignment

Why is it a problem?

Outer alignment

What goals do we give the AI?

Convergent instrumental subgoals

Inner alignment

How do we actually give it the goals?

Existential risk

What's the worst that can happen?

Our policy proposal

How do we prevent a catastrophe?

Raise awareness

Join us

Counterarguments

Contact us

Leading AI scientists: Without urgent action, advanced AI will cause human extinction

Summary​

Intelligence​

What does it mean to be smarter than us?

Modern machine learning​

How does it all work?

Outer and inner AI alignment​

Why is it a problem?

Outer alignment​

What goals do we give the AI?

Convergent instrumental subgoals​

Inner alignment​

How do we actually give it the goals?

Existential risk​

What's the worst that can happen?

Our policy proposal​

How do we prevent a catastrophe?

Raise awareness

Join us

Counterarguments​

Contact us

Summary

Intelligence

Modern machine learning

Outer and inner AI alignment

Outer alignment

Convergent instrumental subgoals

Inner alignment

Existential risk

Our policy proposal

Counterarguments