Skip to main content

AI risk demo

This project aims to replicate the results from the Armstrong's toy model of reward hacking on LLMs trained with RLVR finetuning

Github repo: https://github.com/bhi5hmaraj/llm-reward-hacking-demosrepo