Conjectures

Bold, falsifiable ideas open for rational criticism

Showing conjectures tagged: rlhf
Active
3 months ago

RLHF Misgeneralization

Reinforcement Learning from Human Feedback consistently produces models that exhibit goal mis-generalisation when exposed to novel adversarial inst...

By Anonymous User
0 refutations