Learnable Reward Weighting in Multimodal RLHF: A Proximal Policy Optimization Framework for Safe and Helpful Dialogue Alignment

Authors

  • Jyotsna Shastry, Department of AIML, Indore Institute of Science Technology, Madhya Pradesh, India
  • Shweta Agrawal, Department of AIML, Indore Institute of Science Technology, Madhya Pradesh, India

DOI:

https://doi.org/10.71143/216rys72

Abstract

Aligning multimodal large language models with human values remains a major challenge. Supervised fine-tuning often produces outputs that contain demographic bias, factual inaccuracies, or harmful content, because the token-prediction objective does not directly penalise these failures. Reinforcement Learning from Human Feedback (RLHF) has been extended to multimodal settings to address them, but the extension is difficult: different modalities have different representations, and helpfulness must be traded off against safety. We propose a Proximal Policy Optimisation (PPO)-based RLHF framework that addresses these challenges. The framework introduces several techniques, including learnable reward weighting, to improve alignment and safety in multimodal environments. On the Wizard of Wikipedia benchmark, the proposed framework achieves a 36.08% improvement in helpfulness accuracy and a 48.55% reduction in the harmfulness rate relative to the supervised fine-tuning baseline.
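
The abstract does not spell out how the reward weighting or the PPO update is formulated, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes two scalar reward signals (helpfulness and safety), weights obtained by a softmax over learnable logits, and the standard PPO clipped surrogate term. The names `combined_reward`, `ppo_clipped_term`, `weight_logits`, and `clip_eps` are hypothetical and chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_reward(helpfulness, safety, weight_logits):
    """Blend two reward signals with softmax-normalised weights.

    Assumption: the weights come from two learnable logits, so they stay
    positive and sum to 1. The paper's actual parameterisation may differ.
    """
    w_help, w_safe = softmax(weight_logits)
    return w_help * helpfulness + w_safe * safety


def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a single sample.

    ratio is pi_new(a|s) / pi_old(a|s); the term is the minimum of the
    unclipped and clipped products, which bounds the size of each update.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Equal logits give equal weights, so the blend is the plain average.
    print(combined_reward(1.0, 0.0, [0.0, 0.0]))
    # A ratio above 1 + clip_eps is clipped when the advantage is positive.
    print(ppo_clipped_term(1.5, 1.0))
```

In a full training loop the logits would be updated by gradient descent alongside the policy, which requires an automatic differentiation library; the sketch only shows how the two signals and the clipping interact for a single sample.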

Published

01-06-2026

How to Cite

Jyotsna Shastry, & Shweta Agrawal. (2026). Learnable Reward Weighting in Multimodal RLHF: A Proximal Policy Optimization Framework for Safe and Helpful Dialogue Alignment. International Journal of Research and Review in Applied Science, Humanities, and Technology, 3(2), 191-197. https://doi.org/10.71143/216rys72