Discussion about this post

Wabi Sabi

Fascinating and well written! It addresses so many of my questions about language in such a short space. Interesting how it agrees more with Cormac McCarthy in "The Kekulé Problem" than with Chomsky's theory of language: beyond the idea that human and ape communication are apples and oranges, it doesn't go along with much of Chomsky's thesis at all.

The Birds 'n' the Bayes

I'm not sure whether it was right at the time, but it's certainly not right now to say that reinforcement learning is always, or even mainly, achieved via self-play. The reinforcement learning that forms the second main phase of training an LLM chatbot ("post-training"), turning it from a pure next-token predictor into a turn-taking "helpful, honest, harmless assistant" persona, is done with human feedback, with no self-play at all, and I'm pretty sure the chain-of-thought reinforcement learning in the newer reasoning models is similar. I'd be surprised if this were an entirely new development and all reinforcement learning had been done by self-play before it. Either way, it definitely isn't now.

This isn't a central point at all, and I thoroughly enjoyed this excerpt, but I thought the confusion was worth noting.
