Why Do You Trust Your Compiler More Than Your Coworker?
On trust chains, AI-assisted development, and the uncomfortable question of what “verified” means.
Co-written with Claude Opus 4.6
When you compile a C program, you don’t read the assembly output to check that the compiler did its job. You trust the compiler. When the compiled program runs on your CPU, you don’t worry about whether the transistors are switching correctly. Even if you have an anxiety disorder, it probably doesn’t pick the hardware to be anxious about.
But when a coworker writes a function, you read it. You review the logic, check the edge cases, maybe run it in your head. Why?
The obvious answer is “compilers are more reliable than people.” But that’s a description, not an explanation. Why are compilers more reliable? What structural properties make hardware trustworthy, and code review necessary? And how will that change as AI writes more and more code? Does that make the trust problem better, worse, or just different?
I’ve been thinking about these questions because I’m a nerd: I’m building a tool for formal verification of reasoning, and I’m using AI coding assistants to build it. That combination has forced me to confront what “trust” actually means in a software system, all the way down.
The chain goes further than you think
Here’s an incomplete trust chain for a typical web application:
The user trusts the UI to reflect the application state. The application state depends on the business logic being correct. The business logic depends on the libraries it uses working as documented. The libraries depend on the language runtime. The runtime depends on the compiler. The compiler depends on the hardware. The hardware depends on our understanding of semiconductor physics. The physics depends on our mathematical models. The mathematics depends on… what? Gödel’s second incompleteness theorem means it can’t prove its own consistency, but it can usually depend on more mathematics, in a turtles-all-the-way-down sort of way.
Every link in this chain is maintained by a different mechanism. And none of them is perfectly reliable.
But most of the time, we don’t think about the chain at all. We think about one or two links where we’ve learned, often through painful experience, that things go wrong the most. For most software teams, that link is the code. Did we write the right thing? Does it do what we intended?
Code review is the primary mechanism for checking this link. And it’s the weakest link in the chain. Not because code review is a bad process, or code reviewers do a bad job, but because the structural properties that make every other link trustworthy are absent from code review.
What makes a trust link strong?
There are some links in the chain where we have extreme confidence—five or six nines of reliability:
Hardware: Chip manufacturers invest billions in testing. Every chip is physically validated. Errors in this process are catastrophic and highly visible (the Pentium FDIV bug cost Intel $475 million in 1994). The economic incentives favor correctness, and those incentives are enormous.
Compilers: Millions of developers compile millions of different programs. Each program is an implicit test of the compiler. A compiler bug that affects common code patterns gets found almost immediately, because using a compiler is, in effect, running an enormous test suite, with every user testing the same artifact. Formally verified compilers like CompCert add machine-checked proofs on top of this.
Proof checkers (Coq, Lean, etc.): Trust reduces to a small, heavily scrutinized kernel. The kernel is simple enough for experts to understand the entire thing. The mathematical community is an enormous, global scaffolding for creating and maintaining that expertise.
These cases share four properties:
Errors are visible and expensive. The incentives strongly favor finding them.
Verification is mechanical. It doesn’t rely on someone having a good day or enough coffee.
Many independent parties test the same artifact. Testing scales across society.
The specification is precise. Everyone testing agrees on what “correct” means.
Now look at code review:
Errors are often invisible. Subtle bugs produce slightly wrong results, or only manifest under conditions the reviewer didn’t consider.
Verification is human. A single reviewer reads the code at one point in time and makes one judgment call.
One reviewer checks most code. Sometimes two or three. Not thousands.
The specification is informal. The “intended behavior” is an idea in someone’s head, or maybe a ticket or a PR description. At best there’s a design doc reviewed by a handful of people—maybe five or ten, not thousands.
All the properties that make the strongest links strong are missing from this one. And this weakest link is exactly where the day-to-day work of a software engineer lives.
Aside — what about CompCert?
Even claims about formally verified compilers like CompCert rely on a complex chain of trust:
1. I am told CompCert is formally verified.
2. CompCert distributes the same compiler they test.
3. The compiler passes the tests they applied.
4. The tests applied include Coq-based verification.
5. Coq implements theorem-proving verification.
6. The Coq compiler has no major bugs.
7. The Coq compiler was correctly transmitted to and run on the computer testing CompCert.
And so on. Each sub-link is maintained by a different mechanism. The first step, where someone tells me something, uses the same trust mechanism as code review: social reputation and the risks of being caught in a mistake (or a lie). Step five is the entire standard process of trusting a software project, with developer reputations and track records maintained by community scrutiny.
This doesn’t negate the value or the reality of formal verification. Even if we broke down each of these steps a dozen more times, I would still end up with 99.999% certainty that CompCert compiles code correctly. That trust depends on every link, including very human links like “did they actually run the tests they say they ran?” and “is CompCert even real, or did someone make it up as an example?” But even those very human links are reliable almost all of the time.

We don’t spot-check 10% of the beliefs we encounter in day-to-day life, because they are mostly beliefs like “my feet are still attached to my legs” or “my friends speak the same language(s) they did yesterday.” Not every piece of trust comes from comprehensive verification. Much of it comes from our ability to investigate anomalies within an environment where most claims are reliable for structural reasons.
AI
AI-assisted coding tools are changing how software is developed. But how do those changes affect the chain of trust when someone uses software written this way?
Before AI, at a traditional software company, one human writes code. A second human reviews it. The reviewer probably knows the programming language used and is familiar with the libraries involved. The reviewer can build a mental model of what the code does. For the reasons discussed above, this review is one of the weakest links in the whole chain of trust. But good code review is a skill, and a piece of infrastructure, that can still achieve 90% to 99% reliability. When Facebook updates and gets worse, it’s usually because they implemented a bad feature, not because the implementation introduced bugs.
But today, I can have an AI write software from scratch, in languages I’ve never used, with a dozen libraries I’ve never heard of, and ship ten features a day. They all seem like they work when I use the app, so I send them to the friend in India I’m working with. My friend wakes up the next morning with ten times the workload he would have at a tech company, without any of the expertise needed to read the code line by line, and without any of the company infrastructure or project oversight we would have at a tech company. The traditional code review, already the weakest link in the chain of trust, has become “picking how much time to spend skimming an artifact they don’t understand.” Structurally, the weakest link in the chain hasn’t changed. But the weakness is even more severe.
One obvious response to this problem is that AI coding is dangerous, and no one should do it. There is some merit to that! For applications with serious reliability or security requirements, this sort of development process would be insane. But those applications already require massive review effort from large numbers of people with lots of technical expertise, and they still get blocked on code review. For most applications, reviewers miss bugs in code all the time, and the structural problems are enormous.
The opposite response would be that development is so fast now that human review should be reduced or even eliminated entirely. Even assuming AI becomes so much better than today’s tools that correctness isn’t a practical concern, which is a large assumption, I still think this is wrong. Software is still being written by humans, for humans. Code review serves an important communication function as well as its correctness function. When a reviewer engages with a change, they learn what the product does. They can push back on design decisions. They stay connected to the project as a group effort, as well as the project as a code base. Eliminating review entirely means you are writing software for an audience of one. That can be worth it these days, but it’s very limited.
The work I’m doing
I’ve been working on a tool called the metaformalism copilot. It’s a web application for generating formalized, verifiable representations from informal conversations and unstructured documents. The core value proposition is: you have an idea or an argument, and the tool helps you express it precisely enough that a computer can check whether it’s valid—whether that’s running statistical tests on a survey result or generating Lean4 verification code that compiles.
I’m building it with AI coding assistants, and with a collaborator who is 11.5 hours offset from my timezone. I can produce 10+ features in a day. What should “review” mean for my collaborator in this context? How do I maintain buy-in, and iteration velocity, and code correctness, without dumping a week of review work on him every morning?
We need an approach that looks different from the traditional line-by-line review of a tech company. And we need to be able to use it starting yesterday. Here’s my starting point for building that process:
Layer 1 (Automation): Tests, linting, CI, multi-model AI code review. Conformance of the code to its test specs should be checked mechanically. A key ingredient here is human-defined test cases: especially for edge cases and failure modes, a human needs to understand and choose what the expected behavior of the system is (see the sketch after this list).
Layer 2 (Behavioral Review): The reviewer doesn’t read code, they experience the product. A structured walkthrough describes what each feature does: “click X, observe Y.” The reviewer works through the walkthrough, verifying behaviors in the running app. This is the primary review mode for routine changes, serving the communication purpose of reviews by keeping the reviewer aware of the experience of the app and the direction of development.
Layer 3 (Structural Review): Triggered by type of code change. Changes to authentication, security, schemas, and core architecture trigger traditional code review, where the outsized effort is necessary.
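As a concrete sketch of Layer 1, here’s roughly what a human-defined edge-case test might look like, written in Python with pytest. The function under test is a toy stand-in, not the copilot’s real API; the point is that a human chose the expected behavior once, and a machine re-checks it forever after.

```python
# Layer 1 sketch: human-defined edge-case tests, checked mechanically.
# `extract_text` is a toy stand-in, not the copilot's real API; in
# practice it would be imported from the application code.
import pytest


def extract_text(data: bytes) -> str:
    """Toy stand-in for a real PDF-ingestion function."""
    if not data:
        # Human decision: an empty upload is a visible error, not an
        # empty string quietly passed downstream.
        raise ValueError("empty upload")
    if b"/Font" not in data:
        # Human decision: an image-only scan should say why it failed,
        # not return garbage text.
        raise ValueError("no text layer")
    return data.decode("latin-1")


def test_empty_upload_is_rejected():
    with pytest.raises(ValueError, match="empty upload"):
        extract_text(b"")


def test_image_only_pdf_explains_the_failure():
    with pytest.raises(ValueError, match="no text layer"):
        extract_text(b"%PDF-1.7 scanned image, no fonts")
```

Once a test like this exists, the judgment call happened exactly once, in a human’s head, and CI replays it on every change.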
The part I’m most hopeful about, though, is the way this project could change its own development process.
Recursive formalization
The metaformalism copilot is a tool for bridging the gap between informal specifications and formal verification. That’s the exact gap where code review is the weakest link in the chain of trust.
Instead of a walkthrough item that says “upload a PDF and see text appear,” which a reviewer then verifies by hand, I can write a formal specification of what it means to “upload a PDF,” express that formal specification in Lean, and have the CI pipeline verify that the implementation satisfies the specification. The walkthrough item becomes not just a behavioral description of what the product does; it becomes a proven property of the software.
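To make that concrete, here’s a toy sketch of what such a specification could look like in Lean 4. Everything in it (Document, uploadPdf, extractedText, and the property itself) is a hypothetical simplification, not the copilot’s real spec; the point is the shape: a walkthrough item restated as a named theorem that CI refuses to build without a proof.

```lean
-- Toy sketch: a walkthrough item as a Lean 4 specification.
-- `Document`, `uploadPdf`, and `extractedText` are hypothetical
-- stand-ins, not the copilot's real API.

structure Document where
  pages : List String

/-- Hypothetical ingestion step: uploading a PDF yields a document. -/
def uploadPdf (pages : List String) : Document :=
  { pages := pages }

/-- Hypothetical view step: the text the user sees after uploading. -/
def extractedText (d : Document) : List String :=
  d.pages

/-- "Upload a PDF and see text appear," stated as a property: what
    you see is exactly what you uploaded. CI fails if this proof
    stops compiling. -/
theorem upload_shows_text (pages : List String) :
    extractedText (uploadPdf pages) = pages :=
  rfl
```

The real specifications would be far less trivial, but the workflow is the same: if an implementation change breaks the property, the build breaks, before any human reviews anything.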
This doesn’t eliminate the need for behavioral review. The specification still has to match what the user actually wants and expects. And the Lean theorem still has to match the behavioral description. But it turns the third gap, “does the implementation match the spec?”, from a human judgment call into a machine-checked proof. This replaces the weakest link in code review (a human reading code) with one of the strongest: a machine-verified proof.
The trust chain becomes:
User understanding to behavioral specification: Bridged by the walkthrough, a human experiencing the end product.
Behavioral specification to formal specification: Bridged by specification review. This means reading the theorem specification—requiring effort and expertise, but with very small surface area.
Formal specification to implementation: Bridged by Lean’s proof checker.
The goal is to reduce the actual code review as much as possible, and leave as much of the process as possible in the realm of actually using the product. There is no way to reduce the first gap, of knowing what something does and whether you want to use it, except by using it to do that thing.
Documentation as correctness
This leads to a claim that I think is important and underappreciated:
If the walkthrough is out of date, the product is broken.
Not “needs maintenance.” Broken. An undiscoverable feature doesn’t exist for the user. A misdocumented feature breaks the user’s trust that the tool does what it claims to do. The walkthrough isn’t supplementary documentation or a nice-to-have; it’s a vital part of the product’s correctness, in the same way that a compiler’s adherence to the language spec is part of the compiler’s correctness.
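Under that claim, walkthrough freshness belongs in CI right next to the tests. Here’s a minimal sketch, assuming (hypothetically) that the feature list and walkthrough entries live in machine-readable JSON files; the file names and structure are mine, not an existing convention.

```python
# Sketch: fail the build when the walkthrough and the feature list
# drift apart. `features.json` and `walkthrough.json` are hypothetical
# machine-readable inventories, each a JSON list of feature names.
import json
import sys
from pathlib import Path


def main() -> int:
    features = set(json.loads(Path("features.json").read_text()))
    walkthrough = set(json.loads(Path("walkthrough.json").read_text()))

    undocumented = features - walkthrough  # shipped but undiscoverable
    phantom = walkthrough - features       # documented but gone

    for name in sorted(undocumented):
        print(f"UNDOCUMENTED FEATURE (product is broken): {name}")
    for name in sorted(phantom):
        print(f"STALE WALKTHROUGH ENTRY (product is broken): {name}")

    return 1 if undocumented or phantom else 0


if __name__ == "__main__":
    sys.exit(main())
```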
This is already standard practice in the medical device industry. If a medical device has an undocumented mode, that’s a safety defect. I think the same thing is true for software. There’s an enormous chain of trust built with amazing reliability, and we shouldn’t be ruining it by not telling users what our product does.
Many projects have badly maintained documentation. Many people use those projects without major difficulties. I think this causes more problems than we usually acknowledge, whether they’re occasional disasters or just the gradual erosion of trust. I think everyone could stand to spend more time being openly, verifiably honest, and more time validating whether their beliefs are true. But for my current project, a tool whose core purpose is verification and whose entire value proposition is creating trust, this kind of error is fatal.
Am I just autistic
I spent hours critiquing possible changes to a code review system. I have a PhD in number theory. I volunteered to write an app that generates Lean4 code. I really, really like exploring chains of trust, evaluating correctness, and machine-checked proofs.
So I’m a little worried that this whole project, and this whole discussion, is more about my personal desire for completeness and my love of formal specification than it is about being useful to the world. Maybe I need precision because I’m confused, or stupid, or because of the miscommunication I’ve experienced. I can’t even watch Bringing Up Baby because the comedy of the confusion is too uncomfortable for me.
Obviously the weakest version of this worry—that some people don’t think about formal specification, that I care more about epistemology than average—is true. And the strongest version of this worry—that this whole project and post don’t have value beyond very elaborate personal therapy—is false. CompCert exists. seL4 exists. Formal verification prevents real bugs in real systems.
There’s a lot of space in between those poles.
What I hope is that people who don’t think the way I do, who don’t keep asking “are you sure?” and “what do you mean by that?” for hours, will have the most to gain from a tool that can bridge that gap for them. That building the tool around a walkthrough will make it easily accessible, and that being able to spend time in the human space of “what do I want?” will be even more powerful when it sits on top of mathematical machinery asking “is it possible for all of this to come together and still function?”
And I’m a little more confident that the problems with code review come from the structural sources—the lack of convergent testing, mechanical verification, and precise specification—and not because spending too much of my work day on code review is annoying to me personally. And AI coding tools make the code review link even weaker, not because I’m anxious or because “AI bad,” but because they solve the easier task of writing features more than they solve the harder task of keeping reviewers up to date or checking for unintended behaviors.
I’m more confident, but no link in the chain of trust can be 100% certain.
Help me out here
I’m not (just) writing this to express ideas and get them out of my head. I’m still figuring out what the world is going to look like with AI tools, the ones we have now and the ones that will come while we’re figuring out how to use them. I think everyone is. So I’m going to lay out some ways we can do a little bit more convergent testing of these ideas about trust.
If you are building software with AI tools, I think this trust chain is a valuable framework, and I’m curious what your review process looks like. If you’re skeptical of AI-written code, you’re still trusting human-written, human-reviewed code. You have to in order to read a blog post on the internet. So either way, I’m curious:
What does your current trust chain look like? Why are you confident your software does its job?
What’s the part you are least confident in? What’s the weakest link? Is it code review?
What can you do to strengthen that link? Can you make the definition more precise? Can you verify more mechanically? Can you broaden the scope of testing?
What problems actually get caught in your development process? If your code review process is skimming AI-generated code in unfamiliar libraries, are you surfacing bugs? Or just pretending?
What could make your verification and testing reliable enough that you wouldn’t worry about high volume AI-generated submissions?
Code review has been a pain in the ass for years. AI didn’t create this problem, but it will remove the comfortable illusion that we can just have a human read something and that makes it correct. I’m still figuring out how I’ll deal with this as I build, and I don’t think I’m the only one.