[1] Jyotsna Shastry and Shweta Agrawal. 2026. Learnable Reward Weighting in Multimodal RLHF: A Proximal Policy Optimization Framework for Safe and Helpful Dialogue Alignment. International Journal of Research and Review in Applied Science, Humanities, and Technology. 3, 2 (Jun. 2026), 191–197. DOI: https://doi.org/10.71143/216rys72