Here’s something else I got from the first Yudkowsky-Ngo dialogue:

Suppose you go to Lion Country and get mauled by lions. You want the part of your brain that generates plans like “go to Lion Country” to get downgraded in your decision-making algorithms. This is basic reinforcement learning: plan → lower-than-expected hedonic state → do plan less. Plan → higher-than-expected hedonic state → do plan more. Lots of brain modules have this basic architecture; if you have a foot injury and walking normally causes pain, that will downweight some basic areas of the motor cortex and make you start walking funny (potentially without conscious awareness).
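
To make that concrete, here's a quick toy version in Python - my own sketch of the textbook prediction-error update, not anything from the dialogue; the plans, reward numbers, and learning rate are all made up.

```python
# A toy planning module (my own sketch of the textbook prediction-error update;
# every number here is invented for illustration). It keeps a value estimate
# for each plan, nudges that value toward the hedonic outcome it actually got,
# and picks plans in proportion to how good they currently look.
import random

class PlanningModule:
    def __init__(self, plans, learning_rate=0.3):
        self.values = {plan: 0.0 for plan in plans}   # start out indifferent
        self.learning_rate = learning_rate

    def choose_plan(self):
        # Softmax choice: higher-valued plans get picked exponentially more often.
        weights = [2.718 ** v for v in self.values.values()]
        return random.choices(list(self.values), weights=weights)[0]

    def update(self, plan, hedonic_outcome):
        # Worse-than-expected outcomes push the plan's value down ("do plan less"),
        # better-than-expected outcomes push it up ("do plan more").
        prediction_error = hedonic_outcome - self.values[plan]
        self.values[plan] += self.learning_rate * prediction_error

planner = PlanningModule(["go to Lion Country", "stay home"])
for _ in range(20):
    plan = planner.choose_plan()
    outcome = -10.0 if plan == "go to Lion Country" else 0.0   # mauled by lions
    planner.update(plan, outcome)
print(planner.values)
```

After a few simulated maulings, "go to Lion Country" ends up with a much lower value than "stay home" and almost never gets chosen - the plan-level downgrade you want.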

But suppose you see a lion, and your visual cortex processes the sensory signals and decides "Yup, that's a lion". Then you have to freak out and run away, and it ruins your whole day. That's a lower-than-expected hedonic state! If your visual cortex were fundamentally a reinforcement learner, it would learn not to recognize lions (and then the lion would eat you). So the visual cortex (and presumably lots of other sensory regions) doesn't do hedonic reinforcement learning in the same way.
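
Here's the same toy update pointed at a perception module instead of a planning module (again, my own made-up illustration with made-up numbers): if the "lion" / "no lion" report gets the hedonic treatment, the detector learns to stop saying "lion", because correctly spotting one always feels terrible.

```python
# The same prediction-error update as above, but now reinforcing the *report*
# of a perception module rather than the choice of plan. Numbers made up.
import random

report_values = {"lion": 0.0, "no lion": 0.0}
learning_rate = 0.3

for _ in range(50):
    # In this toy world a lion really is present on every single trial.
    weights = [2.718 ** v for v in report_values.values()]
    report = random.choices(list(report_values), weights=weights)[0]
    # Correctly spotting the lion means panic and a ruined day; blissful
    # ignorance feels fine (the hedonic signal doesn't notice you getting eaten).
    outcome = -10.0 if report == "lion" else 0.0
    report_values[report] += learning_rate * (outcome - report_values[report])

print(report_values)   # "lion" ends up suppressed: comfort wins, accuracy loses
```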

So there are two types of brain region: basically behavioral ones (which hedonic reinforcement learning makes better) and basically epistemic ones (which hedonic reinforcement learning would make worse, so they don't do it).

But it’s a fuzzy distinction. Suppose that out of the corner of your eye, you see a big yellowish blob. Is it a lion? To find out, you’d have to turn your head. Turning your head is a good idea and you should do it. But it’s going to involve a pretty decent chance that you see a lion and then your day is ruined. Turning your head is a behavior and not a theory, but it’s a pretty epistemic behavior. Do you do it or not? I think in this situation most people would head-turn. But it looks a lot like a class of problems people actually have trouble with - eg they’re pretty sure they’re behind on their taxes, so they dread opening their budgeting program to check, and then their finances just get worse and worse (Roko Mijic calls this an “ugh field”).

Speculatively, maybe taxes are such a novel situation that the relevant processing gets spread across different brain architecture types: some parts end up on nonreinforceable architecture, others on reinforceable architecture. It can't be 100% reinforceable, or else you could train yourself into thinking your taxes were completely done and no IRS nastygram could ever convince you otherwise. But if it's 5% reinforceable, it could at least teach you the behavior of not checking.
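
Here's a toy version of that split, with made-up numbers (nobody, least of all me, knows how reinforceable anything actually is): the facts about your taxes stay whatever they are, but the checking behavior sits on weakly reinforceable architecture, and since every check delivers bad news, checking gets rarer and rarer.

```python
# A toy ugh field (all numbers invented). The facts about your taxes are
# whatever they are; what's reinforceable here is the checking *behavior*.
# Every check delivers bad news, so the appeal of checking drifts downward
# and checks get rarer and rarer.
import random

check_value = 0.0       # reinforceable: how appealing checking currently feels
learning_rate = 0.05    # only weakly reinforceable (the "5%")
check_days = []

for day in range(200):
    p_check = 1 / (1 + 2.718 ** -check_value)    # logistic: value -> probability
    if random.random() < p_check:
        check_days.append(day)
        bad_news = -5.0                          # you looked; it was grim
        check_value += learning_rate * (bad_news - check_value)
    # Not checking produces no hedonic signal at all, so nothing ever corrects
    # the drift - the finances just quietly get worse off-screen.

last_check = check_days[-1] if check_days else None
print(f"checked {len(check_days)} times in 200 days; last check on day {last_check}")
```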

Motivated reasoning is the tendency for people to believe comfortable lies, like “my wife isn’t cheating on me” or “I’m totally right about politics, the only reason my program failed was that wreckers from the other party sabotaged it”. In this model, it’s got to be what happens when you try to run epistemics on partly-reinforceable architecture. Checking whether your political program worked or not involves a lot of behaviors analogous to head-turning: what sources to check, how much attention to pay to each. It also involves purely epistemic behaviors, like deciding how hard to update on each contrary fact, or whether or not to make excuses.
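
One more toy sketch, to spell the mechanism out (mine, not anything from the dialogue): treat "how hard to update on each contrary fact" as a knob sitting on reinforceable architecture. Updating toward "my program failed" feels bad, so the knob gets trained down, and the motivated reasoner ends up far more confident than a fixed-knob reasoner fed the exact same evidence.

```python
# Two reasoners fed the same stream of contrary facts (all numbers invented).
# For the "motivated" one, the evidence weight is itself a reinforceable knob:
# painful belief updates shrink it, so later facts barely register.

def run(reinforce_weight, n_facts=30, learning_rate=0.5):
    belief = 0.9        # starting P(my political program worked)
    weight = 0.5        # how seriously each contrary fact gets taken
    for _ in range(n_facts):
        old_belief = belief
        belief -= weight * 0.1 * belief      # a contrary fact pushes belief down
        if reinforce_weight:
            # Hedonic reinforcement on the weight itself: losing confidence
            # hurts, and that pain trains the weight toward zero.
            pain = -(old_belief - belief) * 20
            weight = max(weight + learning_rate * pain * weight, 0.0)
    return belief

print(f"fixed-weight reasoner:  {run(False):.2f}")   # evidence sinks in (~0.19)
print(f"reinforceable reasoner: {run(True):.2f}")    # stays well above that
```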

Maybe thinking about politics - like doing your taxes - is such a novel modality that the relevant brain networks get placed kind of randomly on a bunch of different architectures, and some of them are reinforceable and others aren’t. Or maybe evolution deliberately put some of this stuff on reinforceable architecture in order to keep people happy and conformist and politically savvy.

This question - why does the brain so often confuse what is true vs. what I want to be true? - has been bothering me for years. Framed this way, the explanation seems obvious, almost tautological. I get the impression that Eliezer and Roko have both known it for ages, but it was new to me. If there's other research on which parts of the brain are / aren't reinforceable, or how to run your thoughts on one kind of architecture vs. the other, please let me know.