Latest Articles

Feb 24, 2024

Christiano (ARC) and GA (Conjecture) Discuss Alignment Cruxes

The following are the summary and transcript of a discussion between Paul Christiano (ARC) and Gabriel Alfour, hereafter GA (Conjecture), which took place on December 11, 2022 on Slack. It was held as part of a series of discussions between Conjecture and people from other organizations in the AGI and alignment field. See our retrospective on the Discussions for more information about the project and the format.

Feb 15, 2024

Conjecture: 2 Years

It has been 2 years since a group of hackers and idealists from across the globe gathered into a tiny, oxygen-deprived coworking space in downtown London with one goal in mind: Make the future go well, for everybody. And so, Conjecture was born.

Oct 13, 2023

Multinational AGI Consortium (MAGIC): A Proposal for International Coordination on AI

This paper proposes a Multinational Artificial General Intelligence Consortium (MAGIC) to mitigate existential risks from advanced artificial intelligence (AI). MAGIC would be the only institution in the world permitted to develop advanced AI, enforced through a global moratorium by its signatory members on all other advanced AI development.

Oct 12, 2023

unRLHF - Efficiently undoing LLM safeguards

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Jeffrey Ladish. I'm grateful to Palisade Research for their support throughout this project.

Jul 21, 2023

Priorities for the UK Foundation Models Taskforce

The UK government recently established the Foundation Models Taskforce, focused on AI safety and backed by £100M in funding. Founder, investor and AI expert Ian Hogarth leads the new organization. The establishment of the Taskforce shows the UK’s intention to be a leading player in the greatest governance challenge of our times: keeping humanity in control of a future with increasingly powerful AIs. This is no small feat, and it will require very ambitious policies that anticipate the rapid developments in the AI field rather than just react to them.

Jul 17, 2023

Levels of Pluralism

When do we want all our research eggs in the same paradigm basket? Although most people don't go as far as the extreme paradigmatism of Thomas Kuhn in The Structure of Scientific Revolutions, which allows only one paradigm at a time (for a given science), the preference for fewer rather than more options is still pervasive. The ideal is a convergence to one, even if that's not always feasible. After all, there's only one correct answer, right?

May 22, 2023

Conjecture internal survey: AGI timelines and probability of human extinction from advanced AI

We put together a survey to study Conjecture employees’ opinions on AGI timelines and the probability of human extinction from advanced AI. The questions were based on previous public surveys and prediction markets, to ensure that the results are comparable with opinions outside of Conjecture.

May 16, 2023

Input Swap Graphs: Discovering the role of neural network components at scale

Activation and path patching are techniques for manipulating neural network internals. One approach, used in ROME and which we'll refer to as corrupted patching, involves setting the output of certain parts of the network to a corrupted value. Another method, used in the work on indirect object identification and referred to as input swap patching, involves assigning those parts the values they would produce on a different input sample, while the rest of the model operates normally.
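As a rough illustration of the input-swap primitive (a minimal sketch assuming a Hugging Face GPT-2 model; the patched component and prompts are hypothetical, not the post's actual setup):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Minimal sketch of input swap patching: cache one component's output on an
# alternative input B, then re-run the model on input A with that component's
# output overwritten by the cached value. The chosen block is illustrative.
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")
component = model.transformer.h[5].attn  # hypothetical component to patch

cache = {}

def save_hook(module, inputs, output):
    cache["out"] = output   # store the activation produced on input B

def swap_hook(module, inputs, output):
    return cache["out"]     # replace the activation when running input A

# 1) Run on input B (same token length as A) and cache the component's output.
handle = component.register_forward_hook(save_hook)
model(**tok("Mary handed the keys to John", return_tensors="pt"))
handle.remove()

# 2) Run on input A with the cached activation swapped in.
handle = component.register_forward_hook(swap_hook)
patched_logits = model(**tok("John handed the keys to Mary", return_tensors="pt")).logits
handle.remove()
```

Input swap graphs build on this single-patch primitive by applying it systematically across many inputs and components.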

Feb 25, 2023

Cognitive Emulation: A Naive AI Safety Proposal

This post serves as a signpost for Conjecture’s new primary safety proposal and research direction, which we call Cognitive Emulation (or “CoEm”). The goal of the CoEm agenda is to build predictably boundable systems, not directly aligned AGIs. We believe the former to be a far simpler, and still useful, step towards a full alignment solution.

Feb 21, 2023

Basic facts about language models during training

In this post, we continue the work from our last post on language model internals, but this time we analyze the same phenomena during training. This is important for understanding how language model training works at a macro scale: it sheds light on new behaviours and important phase transitions that may occur during training and deserve further study, and it gives insight into the origin of phenomena we consistently observe in fully trained models.

Feb 18, 2023

AGI in sight: our look at the game board

From our point of view, we are now in the end-game for AGI, and we (humans) are losing. When we share this with other people, they reliably get surprised. That’s why we believe it is worth writing down our beliefs on this.

Feb 17, 2023

Human decision processes are not well factored

A classic example of human bias is when our political values interfere with our ability to accept data or policies from people we perceive as opponents. When people feel that new evidence threatens their values, their first instinct is often to deny it or subject it to extra scrutiny instead of considering it openly. Such reactions are quite common when we are challenged: it takes active effort not to react purely defensively and to consider the critic’s models, even when they are right.

Feb 15, 2023

Don't accelerate problems you're trying to solve

If one believes that unaligned AGI is a significant problem (>10% chance of leading to catastrophe), speeding up public progress towards AGI is obviously bad. Though it is obviously bad, there may be circumstances which require it. However, accelerating AGI should require a much higher bar of evidence and much more extreme circumstances than is commonly assumed.

Feb 12, 2023

Why almost every RL agent does learned optimization

This post discusses the blurred conceptual boundary between RL and RL2 (also known as meta-RL). RL2 is an instance of learned optimization. Far from being a special case, I point out that the conditions under which RL2 emerges are actually the default conditions for RL training. I argue that this is safety-relevant by outlining the evidence for why learned planning algorithms will probably emerge -- and have probably already emerged in a weak sense -- in scaled-up RL2 agents.

Feb 4, 2023

Empathy as a natural consequence of learnt reward models

Empathy, the ability to feel another's pain or to 'put yourself in their shoes', is often considered a fundamental human cognitive ability, one that undergirds our social abilities and moral intuitions. Since so much of humanity's success and dominance as a species comes down to our superior social organization, empathy has played a vital role in our history. Whether we can build artificial empathy into AI systems also has clear relevance to AI alignment.

Jan 24, 2023

Gradient hacking is extremely difficult

There has been a lot of discussion recently about gradient hackers as a potentially important class of mesaoptimizers. A gradient hacker is a malign subnetwork, existing within a larger network, that steers the gradients so that the host network's behaviour is updated away from minimizing the outer loss and towards the gradient hacker's own internal objective.

Jan 4, 2023

Basic Facts about Language Model Internals

As mentioned in our retrospective, alongside long and deep pieces of research we are also experimenting with a high iteration frequency. This post is an example of that strand of our work. The goal here is to highlight interesting and unexplained language model facts. It is the first in a series of posts exploring the basic ‘facts on the ground’ of large language models at increasing levels of complexity.

Dec 13, 2022

[Interim research report] Taking features out of superposition with sparse autoencoders

Recent results from Anthropic suggest that neural networks represent features in superposition. This motivates the search for a method that can identify those features. Here, we construct a toy dataset of neural activations and see if we can recover the known ground truth features using sparse coding. We show that, contrary to some initial expectations, it turns out that an extremely simple method – training a single layer autoencoder to reconstruct neural activations with an L1 penalty on hidden activations – doesn’t just identify features that minimize the loss, but actually recovers the ground truth features that generated the data.
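A minimal sketch of the kind of single-layer sparse autoencoder described above (the sizes, L1 coefficient, and training loop are illustrative placeholders, not the report's actual hyperparameters):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Single-layer autoencoder with an L1 sparsity penalty on its hidden code."""
    def __init__(self, d_activation: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_activation, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_activation)

    def forward(self, x):
        h = torch.relu(self.encoder(x))   # hidden code, ideally one unit per ground-truth feature
        return self.decoder(h), h

# Illustrative training loop on a batch of cached neural activations.
d_act, d_hidden, l1_coeff = 256, 512, 1e-3     # hypothetical sizes and coefficient
sae = SparseAutoencoder(d_act, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

acts = torch.randn(1024, d_act)                # stand-in for real activation data
for _ in range(100):
    recon, h = sae(acts)
    # Reconstruction loss plus L1 penalty encouraging sparse hidden activations.
    loss = ((recon - acts) ** 2).mean() + l1_coeff * h.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```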

Dec 12, 2022

Tradeoffs in complexity, abstraction, and generality

Complexity, abstraction, and generality are powerful: complexity lets us explain a wider range of phenomena; abstraction lets us reason more easily by filtering out unnecessary noise and detail; generality lets us reason over larger domains. And yet all three have costs, creating a tradeoff between the benefits they bring and the costs we incur.

Dec 1, 2022

Re-Examining LayerNorm

There are a lot of non-linearities floating around in neural networks these days, but one that often gets overlooked is LayerNorm. This is understandable, because it's not "supposed" to be doing anything; it was originally introduced to stabilize training. Contemporary attitudes about LayerNorm's computational power range from "it's just normalizing a vector" to "it can do division apparently". And theories of mechanistic interpretability such as features as directions and polytopes are unhelpful, or even harmful, for understanding normalization's impact on a network's representations.
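For reference, the operation under discussion, standard LayerNorm (a minimal sketch of the definition only; the post's analysis goes well beyond this):

```python
import torch

def layer_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    """Standard LayerNorm over the last dimension: centre the vector, rescale it
    to roughly unit standard deviation, then apply a learned affine map.
    The division by an input-dependent standard deviation is what makes this a
    genuine non-linearity rather than a fixed linear transformation."""
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta
```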

Nov 30, 2022

Biases are engines of cognition

Humans are not perfect reasoners. Cognitive biases are not biases; they are engines of cognition, each exploiting regularities of the universe. Combining enough of these “cognitive biases” was enough to create a general intelligence, but the result doesn't even approximate a perfect reasoner.

Nov 28, 2022

The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

If we take the SVD of the weight matrices of the OV circuit and of MLP layers of GPT models, and project them to token embedding space, we notice this results in highly interpretable semantic clusters. This means that the network learns to align the principal directions of each MLP weight matrix or attention head to read from or write to semantically interpretable directions in the residual stream.
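A rough sketch of the projection described above, for a GPT-2 MLP output matrix (the layer index and top-k are illustrative; this is not the post's exact code):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

# Output projection of one MLP layer (illustrative choice of layer 5).
# For GPT-2's Conv1D, the weight has shape (d_mlp, d_model) = (3072, 768).
W = model.transformer.h[5].mlp.c_proj.weight
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

# Project each right singular direction (which lives in the residual stream)
# through the unembedding to see which tokens it most strongly writes towards.
# (The sign of each direction is arbitrary; checking -Vh[i] as well is reasonable.)
W_U = model.lm_head.weight            # (vocab_size, d_model)
for i in range(5):                    # first few singular directions
    logits = W_U @ Vh[i]
    top = torch.topk(logits, k=10).indices
    print(f"Singular direction {i}: {tok.convert_ids_to_tokens(top.tolist())}")
```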

Nov 28, 2022

Searching for Search

This post is a loosely structured collection of thoughts and confusions about search and mesaoptimization and how to look for them in transformers. We've been thinking about this for a while and still feel confused. Hopefully this post makes others more confused so they can help.

Nov 26, 2022

The First Filter

Consistently optimizing for solving alignment (or any other difficult problem) is incredibly hard. The first and most obvious obstacle is that you need to actually care about alignment and feel responsible for solving it. You cannot just ignore it or pass the buck; you need to aim for it.

Nov 24, 2022

What I Learned Running Refine

Refine, the alignment research incubator we are running at Conjecture, finished its first cohort a few weeks ago. So now is a good time to take stock, share what we’ve learned, and discuss its future. Let’s get this out of the way first: we are not planning any new cohort in the foreseeable future. There are multiple reasons for this, which I’ll expand on in this post.

Nov 23, 2022

Conjecture: a retrospective after 8 months of work

This post is a brief retrospective on the last 8 months at Conjecture that summarizes what we have done, our assessment of how useful this has been, and the updates we are making.

Nov 16, 2022

Current themes in mechanistic interpretability research

This post gives an overview of discussions - from the perspective and understanding of the interpretability team at Conjecture - between mechanistic interpretability researchers from various organizations including Conjecture, Anthropic, Redwood Research, OpenAI, and DeepMind as well as some independent researchers. It is not a review of past work, nor a research agenda. We're thankful for comments and contributions from Neel Nanda, Tristan Hume, Chris Olah, Ryan Greenblatt, William Saunders, and other anonymous contributors to this post, which greatly improved its quality. While the post is a summary of discussions with many researchers and received comments and contributions from several, it may nevertheless not accurately represent their views.

Nov 8, 2022

Mysteries of mode collapse

I have received evidence from multiple credible sources that text-davinci-002 was not trained with RLHF. The rest of this post has not been corrected to reflect this update. Not much besides the title (formerly "Mysteries of mode collapse due to RLHF") is affected: just mentally substitute "mystery method" every time "RLHF" is invoked as the training method of text-davinci-002. The observations of its behavior otherwise stand alone.

Sep 23, 2022

Interpreting Neural Networks through the Polytope Lens

Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level. What are the fundamental primitives of neural network representations? What basic objects should we use to describe the operation of neural networks mechanistically? Previous mechanistic descriptions have used individual neurons or their linear combinations to understand the representations a network has learned.

Sep 22, 2022

Methodological Therapy: An Agenda For Tackling Research Bottlenecks

Better epistemology should make you stronger. That is why we at Conjecture's epistemology team are so adamant about improving our models of knowledge-production: this feels like the key to improving alignment research across the board, given the epistemological difficulties of the field.

Sep 2, 2022

Simulators

Self-supervised learning may create AGI or its foundation. What would that look like?

Jul 29, 2022

Conjecture: Internal Infohazard Policy

Much has been written on this forum about infohazards, such as information that accelerates AGI timelines, though very few posts attempt to operationalize that discussion into a policy that can be followed by organizations and individuals. This post makes a stab at implementation. Below we share Conjecture’s internal infohazard policy as well as some considerations that we took into account while drafting it. Our goal with sharing this on this forum is threefold:

Jul 29, 2022

Abstracting The Hardness of Alignment: Unbounded Atomic Optimization

If there's one thing alignment researchers excel at, it's disagreeing with each other. I dislike the term pre-paradigmatic, but even I must admit that it captures one obvious feature of the alignment field: the constant debates about the what and the how and the value of different attempts. Recently, we even had a whole sequence of debates, and since I first wrote this post Nate shared his take on why he can’t see any current work in the field actually tackling the problem. More generally, the culture of disagreement, debate, and criticism is obvious to anyone reading the AF.

Jul 23, 2022

Robustness to Scaling Down: More Important Than I Thought

In Robustness To Scale, Scott Garrabrant presents three kinds of... well, robustness to scale: Robustness to scaling up, meaning that a solution to alignment keeps working as the AI gets better. Robustness to scaling down, meaning that a solution to alignment keeps working if the AI is not optimal or perfect. Robustness to relative scale, meaning that the solution to alignment doesn't rely on symmetrical capabilities between different subsystems.

Jul 20, 2022

How to Diversify Conceptual Alignment: the Model Behind Refine

We need far more conceptual AI alignment research approaches than we have now if we want to increase our chances of solving the alignment problem. However, the conceptual alignment field remains hard to access, and what feedback and mentorship there is focuses on a few existing research directions rather than stimulating new ideas.

Jul 14, 2022

Circumventing interpretability: How to defeat mind-readers

Unaligned AI will have a convergent instrumental incentive to make its thoughts difficult for us to interpret. In this article, I discuss many ways that a capable AI might circumvent scalable interpretability methods and suggest a framework for thinking about these risks.

Jul 12, 2022

Mosaic and Palimpsests: Two Shapes of Research

In Old Masters and Young Geniuses, economist-turned-art-data-analyst David Galenson investigates a striking regularity in the careers of painters: art history and markets favor either their early pieces or the complete opposite, their last ones. From this pattern and additional data, Galenson extracts and defends a separation of creatives into two categories, two extremes of a spectrum: conceptual innovators (remembered as young geniuses) and experimental innovators (remembered as old masters).

Jun 6, 2022

Epistemological Vigilance for Alignment

Nothing hampers Science and Engineering like unchecked assumptions. As a concrete example of a field riddled with hidden premises, let's look at sociology. Sociologists must deal with the feedback of their object of study (people in social situations), their own social background, as well as the myriad folk-sociology notions floating in the memesphere.

Apr 9, 2022

Productive Mistakes, Not Perfect Answers

I wouldn’t bet on any current alignment proposal. Yet I think that the field is making progress and abounds with interesting opportunities to do even more, giving us a shot. Isn’t there a contradiction? No, because research progress so rarely looks like having a clearly correct insight that clarifies everything; instead it often looks like building on apparently unpromising ideas, or studying the structure of the problem.

Apr 8, 2022

We Are Conjecture, A New Alignment Research Startup

Conjecture is a new alignment startup founded by Connor Leahy, Sid Black and Gabriel Alfour, which aims to scale alignment research. We have VC backing from, among others, Nat Friedman, Daniel Gross, Patrick and John Collison, Arthur Breitman, Andrej Karpathy, and Sam Bankman-Fried. Our founders and early staff are mostly EleutherAI alumni and previously independent researchers like Adam Shimi. We are located in London.
