Learnable Reward Weighting in Multimodal RLHF: A Proximal Policy Optimization Framework for Safe and Helpful Dialogue Alignment

Jyotsna Shastry; Shweta Agrawal

doi:10.71143/216rys72

Authors

Jyotsna Shastry, Department of AIML, Indore Institute of Science Technology, Madhya Pradesh, India
Shweta Agrawal, Department of AIML, Indore Institute of Science Technology, Madhya Pradesh, India

Abstract

A major challenge for multimodal large language models is aligning them with human values. Traditional supervised fine-tuning often produces outputs that contain demographic bias, factual inaccuracies, or harmful content, because the token-prediction objective does not directly penalise these failures. Reinforcement Learning from Human Feedback (RLHF) has been extended to multimodal settings to address them, but the task remains difficult: different modalities have different representations, and helpfulness trades off against safety. In this research, we propose a Proximal Policy Optimisation (PPO)-based RLHF framework that targets both difficulties. The framework introduces several techniques, including learnable reward weighting, to improve alignment and safety in multimodal environments. Evaluation on the Wizard of Wikipedia benchmark shows that, compared with the supervised fine-tuning baseline, the proposed framework achieves a 36.08% improvement in helpfulness accuracy and a 48.55% reduction in the harmfulness rate.
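The abstract does not say how the reward weights are parameterised or trained, so the following is only a minimal sketch of one way the helpfulness/safety trade-off could be combined with a PPO update. It assumes two scalar reward signals, softmax-normalised weight logits, and the standard PPO clipped surrogate. The names `combined_reward`, `weight_logits`, and `clip_eps` are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_reward(helpfulness, safety, weight_logits):
    """Mix two reward signals with weights derived from learnable logits.

    Assumption: the weights are a softmax over two logits, so they stay
    positive and sum to 1. The paper may use a different parameterisation.
    """
    w_help, w_safe = softmax(weight_logits)
    return w_help * helpfulness + w_safe * safety


def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a single sample.

    ratio is pi_new(a|s) / pi_old(a|s); the objective is maximised.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Equal logits give equal weights, so the reward is the plain average.
    print(combined_reward(helpfulness=1.0, safety=0.0, weight_logits=[0.0, 0.0]))
    # A large policy ratio with positive advantage is clipped at 1 + clip_eps.
    print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
```

In a full training loop the weight logits would be updated alongside the policy, for example by gradient descent on an alignment loss; that step is omitted here because the abstract gives no detail on it.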

Published

01-06-2026

Issue

Vol. 3 No. 2 (2026): IJRASHT: Vol 3, Issue 2, April-June 2026

Section

Articles

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

How to Cite

Jyotsna Shastry, & Shweta Agrawal. (2026). Learnable Reward Weighting in Multimodal RLHF: A Proximal Policy Optimization Framework for Safe and Helpful Dialogue Alignment. International Journal of Research and Review in Applied Science, Humanities, and Technology, 3(2), 191-197. https://doi.org/10.71143/216rys72
