Two paragraphs from the mesa-optimizers post, which I quoted again in the adaptation-executors post:

Consider evolution, optimizing the fitness of animals. For a long time, it did so very mechanically, inserting behaviors like “use this cell to detect light, then grow toward the light” or “if something has a red dot on its back, it might be a female of your species, you should mate with it”. As animals became more complicated, they started to do some of the work themselves. Evolution gave them drives, like hunger and lust, and the animals figured out ways to satisfy those drives in their current situation. Evolution didn’t mechanically instill the behavior of opening my fridge and eating a Swiss cheese slice. It instilled the hunger drive, and I figured out that the best way to satisfy it was to open my fridge and eat cheese.


Mesa-optimizers would have an objective which is closely correlated with the base optimizer’s objective, but it might not be perfectly correlated. The classic example, again, is evolution. Evolution “wants” us to reproduce and pass on our genes. But my sex drive is just that: a sex drive. In the ancestral environment, where there was no porn or contraceptives, sex was a reliable proxy for reproduction; there was no reason for evolution to make me mesa-optimize for anything other than “have sex”. Now, in the modern world, evolution’s proxy seems myopic - sex is a poor proxy for reproduction. I know this, and I am pretty smart, and that doesn’t matter. That is, just because I’m smart enough to know that evolution gave me a sex drive so I would reproduce - and not so I would have protected sex with somebody on the Pill - doesn’t mean I immediately switch to wanting to reproduce instead. Evolution got one chance to set my value function when it created me, and if it screwed up that one chance, it’s screwed. I’m out of its control, doing my own thing.

[But] I feel compelled to admit that I do want to have kids. How awkward is that for this argument? I think not very - I don’t want to, eg, donate to hundreds of sperm banks to ensure that my genes are as heavily represented in the next generation as possible. I just want kids because I like kids and feel some vague moral obligations around them. These might be a different proxy objective evolution gave me, maybe a little more robust, but not fundamentally different from the sex one.

These posts both focus on the difference between two ways that a higher-level optimizer (evolution, gradient descent) can train an intelligence: instincts vs. planning. Probably the distinction is messier in real life, and there are lots of different sub-levels. But both posts share this idea of drives getting implemented at different levels of consequentialism.

How does this relate to willpower?

It sure feels like one could tell a story where “I” “am” “the planning module” of my mind. I come up with kind-of-consequentialist, long-term plans for achieving goals represented at a high level of abstraction. Then I fight against various instincts represented at lower levels of abstraction. The winner depends on a combination of hard-coded rules, and on which of us (the planning module vs. the lower-level instincts) has been better at getting reinforced in the past.

I don’t know how true this story is. “I am the planning module” seems not exactly the same as “I am the global workspace” or “I am a sampling from a probability distribution coherent enough to create working memory out of” (though it doesn’t really contradict those, either). Maybe the “I” of willpower/agency isn’t exactly the same as the “I” of conscious access? After all, the I of conscious access can clearly feel the desire to enact instinctual drives (eg binge on Doritos), even if the I of agency is trying to exert willpower to avoid doing it. But this generally fits my current best guess at how willpower works.

One corollary of this model is that future AIs may suffer weakness of will, the same as humans. Suppose an AI is trained on some task through gradient descent. It first learns the equivalent of “intuitive”/“instinctual” hacks and “reflexes” for doing the task. Later (if the mesa-optimizer literature is right), some of these combine/evolve into a genuine “consequentialist” “agent” or planning module, which is “superimposed upon” the original instincts. But the planning module will start out less effective than the original instincts at most things, and the overall mind design will have to come up with a policy for when to use the instincts vs. the planning module. At the beginning, this policy will be heavily weighted in favor of the instincts. Later, as the planning module gets better, enough training should teach the mind design to favor it more. But lots of things only happen with “enough” training, and real AIs could still end up in situations where their agentic parts defer to their instinctual parts.
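To make the dynamic above concrete, here is a deliberately crude toy sketch (every name and number is hypothetical, not anything from the mesa-optimizer literature): a cheap “instinct” policy that grabs immediate reward, a “planner” that also counts long-run consequences, and a single arbitration weight that reinforcement nudges toward whichever module earned more reward. The point is only that the weight starts instinct-heavy and drifts toward the planner with enough training:

```python
import random

def instinct(options):
    """Hard-coded reflex: grab whatever pays off right now."""
    return max(options, key=lambda o: o["immediate"])

def planner(options):
    """Consequentialist module: count long-run payoff too."""
    return max(options, key=lambda o: o["immediate"] + o["future"])

def train(episodes=5000, lr=0.01, seed=0):
    rng = random.Random(seed)
    w = 0.1  # arbitration weight: starts heavily favoring instincts
    for _ in range(episodes):
        # One recurring choice: a tempting option vs. a prudent one.
        options = [
            {"immediate": 1.0, "future": -2.0},  # binge on Doritos
            {"immediate": 0.2, "future": 1.5},   # eat something sensible
        ]
        used_planner = rng.random() < w
        choice = (planner if used_planner else instinct)(options)
        reward = choice["immediate"] + choice["future"]
        # Reinforce whichever module acted, in proportion to its reward.
        if used_planner:
            w += lr * reward * (1 - w)
        else:
            w -= lr * reward * w
        w = min(max(w, 0.01), 0.99)
    return w

print(train())  # after training, the planner is favored almost always
```

The sketch also illustrates the caveat: with few episodes, or with a payoff structure where the reflex happens to do well, the weight stays instinct-heavy, and the “agentic part” keeps deferring to the “instinctual part”.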

Many stories of AI risk focus on how single-minded AIs are: how they can focus literally every action on the exact right course to achieve some predetermined goal. Such single-minded AIs are theoretically possible, and we’ll probably get them eventually. But before that, we might get AIs that have weakness of will, just like we do.