All Alignment Research is Capabilities Research

Though probably not all capabilities research is alignment research.

A couple of weeks ago I saw a version of the xkcd standards comic about responsible AI research labs becoming racing AI capabilities labs. I felt a strong sense of connection to the comic, especially around OpenAI’s betrayals of its original structure, but I think it’s importantly confused about what alignment actually is.

The meme implicitly models safety work and capabilities work as separate research programs, where safer models are slower and weaker and more dangerous models are more powerful. Labs that forget about safety accelerate the race by defecting from a shared agreement. And it’s definitely true that coordinating a group of different billion-dollar companies is harder than coordinating just one or two! And market forces can’t price in the externalities, because there’s no way to short sell services provided by private companies.

But the model is wrong about the idea that safety work is fundamentally different from capabilities work. The correctly-specified task—do what humans actually want, robustly, across distributional shifts—is strictly harder than a narrowly benchmark-optimized task. But it’s also more useful! If an AI tool makes unpredictable errors whenever you use it slightly incorrectly, it’s not a very useful technology. If an AI tool continuously communicates with you about your objectives and stays on task reliably, backed by robust test infrastructure, it’s a very useful technology. RLHF and Constitutional AI were both developed as alignment techniques, and their ability to get models to understand instructions is plausibly the reason models became commercially deployable at all! It’s hard to disentangle things like employee mobility and historical contingency, or to track the valuations of privately held companies, but Anthropic and DeepMind both meaningfully surpassed OpenAI’s early market lead while being, and potentially because they were, more concerned about model alignment.

At first glance, it is quite alarming to see people leave OpenAI for Anthropic, only for Anthropic’s models to become commercially viable first! It doesn’t feel like people going slower and safer.

But on a closer look, this could be a negative alignment tax rearing its empirical head. This isn’t a new idea. But it doesn’t answer broader concerns about coordination, which I think are remarkably contiguous with power and privilege critiques that come from very different directions. The gap between “what I say” and “what I want” is remarkably similar to the gap between “what I want” and “what we want.” I like other people! I want to include them! But that doesn’t magically grant me knowledge of their values or needs. And it’s hard to imagine either tech companies doing thorough surveys of the third world or AI agents that independently seek out alienated perspectives.

It’s very possible that focusing on access and inclusion for non-OECD users would expand data distributions in ways that solve some of the problems of RLHF.

This doesn’t address every concern—existing models are still heavily centralized and reliant on a small number of private companies for their functionality.

But some unfortunate AI futures can be guarded against by styles of adoption and inclusion that put AI capabilities to use in ways that reach out to a broader variety of AI critics. And that should be satisfying when reasoning under moral uncertainty.
