JYOTSNA SHASTRY; SHWETA AGRAWAL. Learnable Reward Weighting in Multimodal RLHF: A Proximal Policy Optimization Framework for Safe and Helpful Dialogue Alignment. International Journal of Research and Review in Applied Science, Humanities, and Technology, [S. l.], v. 3, n. 2, p. 191–197, 2026. DOI: 10.71143/216rys72. Available at: https://ijrasht.com/index.php/files/article/view/306. Accessed: 26 Jun. 2026.