Skip to main content

AI risk demo

This project aims to replicate the results from the Armstrong's toy model of reward hacking on LLMs trained with RLVR finetuning

Github repo