AI risk demo
This project aims to replicate the results of Armstrong's toy model of reward hacking on LLMs trained with RLVR fine-tuning.
Github repo