Return to Article Details Learnable Reward Weighting in Multimodal RLHF: A Proximal Policy Optimization Framework for Safe and Helpful Dialogue Alignment Download Download PDF