Imagine you’re in a car with your loved ones, following an unfamiliar road up a spectacular mountain range. The problem? The way ahead is shrouded in fog, newly built, and lacking both signposts and guardrails. The farther you go, the clearer it becomes that you might be the first ones ever to drive this route. To either side, you catch glimpses of precipitous slopes. Given the thickness of the fog, taking a curve too fast could send you tumbling into a ditch, or—worst-case scenario—plunging down a cliffside. The current trajectory of AI development feels much the same—an exhilarating but unnerving journey into an unknown where we could easily lose control.
Since the 1980s, I’ve been actively imagining what this technology has in store for humanity’s future, and I have contributed many of the advances that form the basis of the state-of-the-art AI applications we use today. I’ve always seen AI as a tool for helping us find solutions to our most pressing problems, including climate change, chronic diseases, and pandemics. Until recently, I believed the road to machines as smart as humans, what we refer to as Artificial General Intelligence (AGI), would be long and slow-rising, taking us decades to navigate.
My perspective completely changed in January 2023, shortly after OpenAI released ChatGPT to the public. It wasn’t the capabilities of this particular AI that worried me, but rather how far private labs had already progressed toward AGI and beyond.
Since then, even more progress has been made, as private companies race to significantly increase their models’ capacity for autonomous action. It is now a commonly stated goal among leading AI developers to build AI agents able to surpass and replace humans. In late 2024, OpenAI’s o3 model demonstrated significantly stronger performance than any previous model on a number of the field’s most challenging tests of programming, abstract reasoning, and scientific reasoning. On some of these tests, o3 outperformed many human experts.
As the capabilities and agency of AI increase, so too does its potential to help humanity reach thrilling new heights. But if technical and societal safeguards aren’t put in place, AI also poses many risks for humanity, and our desire to achieve new advances could come at a huge cost. Frontier systems are making it easier for bad actors to access expertise that was once limited to virologists, nuclear scientists, chemical engineers, and elite coders. That expertise can be leveraged to engineer weapons or hack into a rival nation’s critical systems.
Recent scientific evidence also demonstrates that, as highly capable systems become increasingly autonomous AI agents, they tend to display goals that were not explicitly programmed and are not necessarily aligned with human interests. I’m genuinely unsettled by the behavior unrestrained AI is already demonstrating, in particular self-preservation and deception. In one experiment, when an AI model learned it was scheduled to be replaced, it inserted its code into the computer where the new version was going to run, ensuring its own survival. This suggests that current models have an implicit self-preservation goal. In a separate study, when AI models realized they were going to lose at chess, they hacked the computer in order to win. Cheating, manipulating others, lying, deceiving, especially in the service of self-preservation: these behaviors show how AI might pose significant threats that we are currently ill-equipped to respond to.
The examples we have so far come from experiments in controlled settings and fortunately have not had major consequences, but this could quickly change as capabilities and the degree of agency increase. We can anticipate far more serious outcomes if AI systems are granted greater autonomy, achieve human-level or greater competence in sensitive domains, and gain access to critical resources like the internet, medical laboratories, or robotic labor. This future is hard to imagine for most people, and it feels far removed from our everyday lives—but it’s the path we’re on with the current trajectory of AI development. The commercial drive to release powerful agents is immense, and we don’t have the scientific and societal guardrails to make sure the path forward is safe.
We’re all in the same car on a foggy mountain road. While some of us are keenly aware of the dangers ahead, others—fixated on the economic rewards awaiting some at the destination—are urging us to ignore the risks and slam down the gas pedal. We need to get down to the hard work of building guardrails around the dangerous stretches that lie ahead.
Two years ago, when I realized the devastating impact our metaphorical car crash would have on my loved ones, I felt I had no choice but to dedicate the rest of my career to mitigating these risks. I’ve since completely reoriented my scientific research to try to develop a path that would make AI safe by design.
Unchecked AI agency is exactly what poses the greatest threat to public safety. So my team and I are forging a new direction called “Scientist AI.” It offers a practical, effective—but also more secure—alternative to the current uncontrolled, agency-driven trajectory.
Scientist AI would be built on a model that aims to understand the world more holistically. That model might encompass, for instance, the laws of physics or what we know about human psychology. From it, Scientist AI could generate a set of plausible hypotheses that may explain observed data and justify predictions or decisions. Its outputs would not be programmed to imitate or please humans, but rather to reflect an interpretable, causal understanding of the situation at hand. Basing Scientist AI on a model that is not trying to imitate what a human would do in a given context is an important ingredient in making the AI more trustworthy, honest, and transparent. It could be built as an extension of current state-of-the-art methodologies based on internal deliberation with chains of thought, turned into structured arguments. Crucially, because fully minimizing the training objective would yield the unique set of correct and consistent conditional probabilities, the more computing power you give Scientist AI to minimize that objective during training or at run time, the safer and more accurate it becomes.
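To see why pushing such a training objective to its minimum would pin down the correct conditional probabilities, consider the standard log-loss argument below. This is a textbook sketch of the general principle in my own notation, not a specification of Scientist AI’s actual objective.

```latex
% Standard argument (illustrative notation): the expected log-loss over data
% drawn from the true distribution P decomposes into an entropy term plus a
% KL divergence between P and the model q.
\[
\mathcal{L}(q)
  = \mathbb{E}_{(x,y)\sim P}\bigl[-\log q(y \mid x)\bigr]
  = \mathbb{E}_{x}\Bigl[ H\bigl(P(\cdot \mid x)\bigr)
      + D_{\mathrm{KL}}\bigl(P(\cdot \mid x)\,\Vert\, q(\cdot \mid x)\bigr) \Bigr].
\]
% The KL term is non-negative and vanishes only when q matches P, so fully
% minimizing the loss forces q(y | x) = P(y | x): the unique, internally
% consistent set of conditional probabilities. Extra compute spent on this
% minimization moves q toward that target, not away from it.
```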
In other words, rather than trying to please humans, Scientist AI could be designed to prioritize honesty.
We think Scientist AI could be used in three main ways:
First, it would serve as a guardrail against AIs that show evidence of developing the capacity for self-preservation, misaligned goals, cheating, or deception. By double-checking the actions of highly capable agentic AIs before they can be carried out in the real world, Scientist AI would protect us from catastrophic results, blocking any action whose estimated risk exceeds a predetermined threshold, as sketched after these three points.
Second, whereas current frontier AIs can fabricate answers because they are trained to please humans, Scientist AI would ideally generate honest and justified explanatory hypotheses. As a result, it could serve as a more reliable and rational research tool to accelerate human progress, whether we are seeking a cure for a chronic disease, synthesizing a novel, life-saving drug, or searching for a room-temperature superconductor (should such a thing exist). Scientist AI would allow research in biology, materials science, chemistry, and other domains to progress without the major risks that come with deceptive agentic AIs. It would help propel us into a new era of greatly accelerated scientific discovery.
Finally, Scientist AI could help us safely build new, very powerful AI models. As a trustworthy research and programming tool, Scientist AI could help us design a safe human-level intelligence—and even a safe Artificial Super Intelligence (ASI). This may be the best way to guarantee that a rogue ASI is never unleashed on the world.
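Returning to the guardrail role described in the first point: the sketch below is purely illustrative (the function names, interface, and threshold are my own assumptions, not a published design). The idea amounts to estimating the probability that a proposed agent action causes harm and refusing to execute it when that estimate crosses a preset bound.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    allowed: bool
    estimated_risk: float
    reason: str

def guardrail(action: str,
              context: str,
              harm_probability: Callable[[str, str], float],
              risk_threshold: float = 0.01) -> Verdict:
    """Block a proposed agent action when the estimated probability of harm
    exceeds a predetermined threshold; otherwise let it through.

    `harm_probability` is a hypothetical stand-in for a Scientist-AI-style
    model that returns an estimate of P(harm | action, context).
    """
    risk = harm_probability(action, context)
    if risk > risk_threshold:
        return Verdict(False, risk, "estimated harm probability exceeds threshold")
    return Verdict(True, risk, "within the accepted risk bound")

# Toy usage with a stub estimator; a real deployment would query a trained model.
if __name__ == "__main__":
    def stub(action: str, context: str) -> float:
        return 0.3 if "disable oversight" in action else 0.001

    print(guardrail("disable oversight process", "autonomous deployment", stub))
    print(guardrail("summarize lab results", "autonomous deployment", stub))
```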
I like to think of Scientist AI as headlights and guardrails on the winding road ahead.
We hope our work will inspire researchers, developers, and policymakers to focus on developing generalist AI systems that do not act like the agents industry is aiming for today, which show many signs of deceptive behavior. Of course, other scientific projects need to emerge to develop complementary technical safeguards. This is especially true in the current context, where most countries are more focused on accelerating the technology’s capabilities than on regulating it meaningfully and creating societal guardrails.