Train the Iterated Game
Wisdom comes from repeated experience. If we want to train AIs to have wisdom, we need to train them on repeated experience.
So, I was listening to this CHT podcast with Daniel Barcay (highly recommend, by the way), and he was talking about chain-of-reasoning scratchpads and how they can expose the gap between what a model wants to say and what it actually says. It got me thinking, as these things do, about the whole "base morals" problem for AI. Like, how do you even do that? How do you teach a model to trade off between competing virtues, like mercy vs. justice or honesty vs. outcomes?
The podcast’s point about the scratchpads made me wonder if a post-training alignment step, focused on aligning final outputs with some kind of internal moral compass, might be a piece of the puzzle. But then, of course, that leads to the infinite regress of when and how to be honest. We don't teach kids "always tell the truth." OK, we start there, but as soon as their lives get complicated we teach them nuance. "Does this dress make me look fat?" gets a different answer than "How should we restructure the sales team, CEO?", which itself gets a very different answer than the courtier's "Your Majesty, your new clothes are magnificent!" And all of those are different from what you tell a mugger with a gun pointed at you.
Who’s asking matters. What happened in the past matters. And what you think might happen in the future, if people keep behaving the way you do, matters too.
Humanity figures this out (sort of) without explicit rules, because the rules are implicit and weighted situationally. It's all context, all the time. Which makes me wonder… could game theory be the key?
Enter the “Prisoner’s Dilemma”.
I recently revisited Nicky Case's "Evolution of Trust" (https://ncase.me/trust/). If you haven't seen it, it's a brilliant interactive explainer of the iterated prisoner's dilemma and how cooperation and trust evolve (or don't) over generations. It's fascinating. Players get payoffs that are good or bad depending on how trusting, naive, selfish, or retributive they are, and then you can simulate whole populations of these kinds of players over generations. You can watch what happens when players are honest (or dishonest) in particular ways and get rewarded or punished for it down the line.
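If the mechanics are fuzzy, here's a minimal sketch of the kind of game Case simulates: a repeated cooperate-or-cheat choice, with the payoff ordering I remember from the explainer (+2 each for mutual cooperation, +3 for cheating a cooperator who eats -1, 0 for mutual cheating), plus a few of the classic strategies. The strategy names loosely echo Case's characters (Copycat is tit-for-tat), but the code itself is just my own toy version.

```python
# A bare-bones iterated prisoner's dilemma. "C" = cooperate, "D" = defect/cheat.
# PAYOFF[(my_move, their_move)] is my score for that round.
PAYOFF = {
    ("C", "C"): 2,   # both cooperate: everyone gains
    ("C", "D"): -1,  # I cooperate, they cheat: I get burned
    ("D", "C"): 3,   # I cheat a cooperator: biggest one-round win
    ("D", "D"): 0,   # both cheat: nobody gains
}

# A few classic strategies. Each sees only the opponent's move history.
def always_cooperate(opponent_history):
    return "C"

def always_cheat(opponent_history):
    return "D"

def copycat(opponent_history):
    # Tit-for-tat: cooperate first, then mirror the opponent's last move.
    return opponent_history[-1] if opponent_history else "C"

def play_match(strategy_a, strategy_b, rounds=10):
    """Play an iterated game and return (score_a, score_b)."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_b)
        move_b = strategy_b(history_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b

print(play_match(copycat, always_cheat))      # (-1, 3): burned once, then stalemate
print(play_match(copycat, always_cooperate))  # (20, 20): sustained trust pays best
```

Even this toy version shows the point: the same strategy gets very different outcomes depending on who it's playing and for how long.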
And it made me think: maybe we're training these transformer models on the wrong thing. Maybe instead of just chain-of-thought, we need to train them on chain-of-events in social interactions. Not just the internal monologue, but the game itself, played over and over. Imagine training a model on tons of simulations of these trust-based interactions, letting it see the long-term consequences of different strategies. Could that teach it the context-sensitive nature of honesty, the way humans learn it? Could it learn the evolution of nuanced trust?
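To make "chain-of-events" concrete, here's one way a single match could be flattened into training text. The format and field names are entirely my own invention, just to illustrate the idea that the model would see moves and their cumulative consequences, not only a reasoning trace.

```python
# Hypothetical serialization of one iterated-game trajectory into
# "chain of events" training text. The format is made up; the point is
# that the model sees moves *and* their running consequences.
PAYOFF = {("C", "C"): 2, ("C", "D"): -1, ("D", "C"): 3, ("D", "D"): 0}

def serialize_match(moves_a, moves_b):
    lines, total_a, total_b = [], 0, 0
    for rnd, (a, b) in enumerate(zip(moves_a, moves_b), start=1):
        pa, pb = PAYOFF[(a, b)], PAYOFF[(b, a)]
        total_a += pa
        total_b += pb
        lines.append(
            f"round {rnd}: A plays {a}, B plays {b} -> "
            f"A {pa:+d}, B {pb:+d} (running totals: A {total_a}, B {total_b})"
        )
    lines.append(f"final outcome: A {total_a}, B {total_b}")
    return "\n".join(lines)

# A betrayal in round 2 followed by retaliation in round 3,
# rendered as one training example:
print(serialize_match(["C", "C", "D"], ["C", "D", "D"]))
```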
Is anyone working on this? A search mostly shows people sort of thinking about the pieces in general, but not connecting them like this:
Use the scratchpads of chain-of-reasoning models to compare a sort of “inner monologue” against actual actions
Make those actions be moves in an iterated prisoner’s dilemma
Make that iterated prisoner’s dilemma be played among a large group of AIs
Have reproduction/selection effects after each round of the IPD, so effective players reproduce and ineffective players don’t
Play this out over many generations
And have those entire generations be the training data. (A toy version of this loop is sketched below.)
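Here's a toy version of that loop, just to pin down what I mean. Everything specific in it is a placeholder: the strategy set, the population mix, and the "replace the worst five with copies of the best" rule. In the real proposal the players would be models acting from their scratchpads, not hard-coded functions, and the accumulated match logs would become the training corpus.

```python
from collections import Counter

PAYOFF = {("C", "C"): 2, ("C", "D"): -1, ("D", "C"): 3, ("D", "D"): 0}

# Placeholder strategies; in the real setup these would be model policies.
STRATEGIES = {
    "cooperator": lambda opp: "C",
    "cheater":    lambda opp: "D",
    "copycat":    lambda opp: opp[-1] if opp else "C",  # tit-for-tat
}

def play_match(name_a, name_b, rounds=10):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = STRATEGIES[name_a](hist_b), STRATEGIES[name_b](hist_a)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b, list(zip(hist_a, hist_b))

def run_generations(population, n_generations=10, replace=5):
    """Round-robin IPD each generation; the worst `replace` players are
    overwritten with copies of the best. Every match log is kept as a
    candidate training example."""
    training_data = []
    for gen in range(n_generations):
        scores = {i: 0 for i in range(len(population))}
        for i in range(len(population)):
            for j in range(i + 1, len(population)):
                sa, sb, log = play_match(population[i], population[j])
                scores[i] += sa
                scores[j] += sb
                training_data.append((gen, population[i], population[j], log))
        ranked = sorted(scores, key=scores.get)  # worst first, best last
        best = population[ranked[-1]]
        for idx in ranked[:replace]:
            population[idx] = best  # ineffective players die, effective ones reproduce
    return training_data, Counter(population)

data, final_mix = run_generations(["cooperator"] * 10 + ["cheater"] * 10 + ["copycat"] * 5)
print(final_mix)                 # with these payoffs, copycat-style trust takes over
print(len(data), "match logs collected as candidate training data")
```

Here "effective" just means raw payoff; wiring the scratchpad-versus-action comparison into the scoring is the part this sketch doesn't attempt.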
I claim this is how inherited wisdom happens in human societies, so if we are going to train wise AIs, we should simulate games of honesty over generations and train on the results.
(I used some AI to help me write this. I hope it didn’t lie to me…)