AI risk demo
This project aims to replicate the results of Armstrong's toy model of reward hacking on LLMs trained with RLVR fine-tuning.