…Design and build scalable RL infrastructure supporting distributed training and evaluation across complex multi-modal environments.
Develop reward modeling strategies that improve alignment and training stability and mitigate failure modes such as reward hacking.
Create…