Since everyone is talking about RL environments and GRPO now, but no one knows how they work, we thought it would be cool to make an explainer video + code you can run:
This is an example of using GRPO to train Qwen 2.5 to play 2048 (code in thread) 🧵: