[1] Jyotsna Shastry and Shweta Agrawal, “Learnable Reward Weighting in Multimodal RLHF: A Proximal Policy Optimization Framework for Safe and Helpful Dialogue Alignment”, IJRASHT, vol. 3, no. 2, pp. 191–197, Jun. 2026, doi: 10.71143/216rys72.