Learnable Reward Weighting in Multimodal RLHF: A Proximal Policy Optimization Framework for Safe and Helpful Dialogue Alignment
DOI:
https://doi.org/10.71143/216rys72
Abstract
A major challenge for multimodal large language models is aligning them with human values. Traditional supervised fine-tuning often produces outputs that contain demographic bias, factual inaccuracies, or harmful content, because the token-prediction objective does not directly penalise these failures. Reinforcement Learning from Human Feedback (RLHF) has been extended to multimodal settings to address this. The extension is difficult, however, because different modalities have different representations and because helpfulness and safety trade off against each other. In this research, we propose a Proximal Policy Optimisation (PPO)-based RLHF framework to address these challenges. The framework introduces several techniques, including learnable reward weighting, to improve alignment and safety in multimodal environments. Evaluation on the Wizard of Wikipedia benchmark shows that, compared with the supervised fine-tuning baseline, the proposed framework achieves a 36.08% improvement in helpfulness accuracy and a 48.55% reduction in the harmfulness rate.
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.







