AI risk demo
This project aims to replicate the results of Armstrong's toy model of reward hacking on LLMs fine-tuned with reinforcement learning with verifiable rewards (RLVR).
GitHub repo: https://github.com/bhi5hmaraj/llm-reward-hacking-demosrepo