AI risk demo

This project aims to replicate the results from the Armstrong's toy model of reward hacking on LLMs trained with RLVR finetuning

Github ~~repo: https://github.com/bhi5hmaraj/llm-reward-hacking-demos~~repo