1. Jyotsna Shastry, Shweta Agrawal. Learnable Reward Weighting in Multimodal RLHF: A Proximal Policy Optimization Framework for Safe and Helpful Dialogue Alignment. IJRASHT [Internet]. 2026 Jun. 1 [cited 2026 Jun. 26];3(2):191-7. Available from: https://ijrasht.com/index.php/files/article/view/306