OTPO: Refining LLM Alignment through Optimal Transport Policy Optimization
Researchers have introduced Optimal Transport Policy Optimization (OTPO), a method that replaces the uniform weighting used in preference learning with dynamically computed weights. By prioritizing the most informative data points during the alignment phase, OTPO is reported to improve model performance and training stability compared with standard Direct Preference Optimization (DPO).
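To make the contrast between uniform and non-uniform weighting concrete, here is a minimal sketch of a DPO-style preference loss that accepts optional per-example weights. It is an illustration only, not the OTPO algorithm: the summary does not say how OTPO derives its weights from optimal transport, so the `weights` argument is a hypothetical placeholder, and the function names and the `beta` default are likewise assumptions made for this example.

```python
import math


def dpo_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Scaled log-ratio margin for one preference pair (standard DPO form).

    Inputs are sequence log-probabilities under the policy and a frozen
    reference model. `beta` is an assumed default, not a value from the article.
    """
    return beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_preference_loss(margins, weights=None):
    """Weighted average of -log(sigmoid(margin)) over preference pairs.

    With weights=None every pair counts equally, which is the uniform
    weighting of plain DPO. Passing weights lets some pairs count more.
    The weights here are hypothetical; OTPO's actual weighting scheme
    is not described in the summary.
    """
    if weights is None:
        weights = [1.0] * len(margins)
    total = sum(weights)
    return -sum(w * log_sigmoid(m) for w, m in zip(weights, margins)) / total


if __name__ == "__main__":
    margins = [
        dpo_margin(-1.0, -2.0, -1.5, -1.5),  # policy already prefers the chosen answer
        dpo_margin(-2.0, -1.0, -1.5, -1.5),  # policy prefers the rejected answer
    ]
    print("uniform :", weighted_preference_loss(margins))
    print("weighted:", weighted_preference_loss(margins, weights=[1.0, 3.0]))
```

Normalizing by the sum of the weights keeps the loss on the same scale as the uniform case, so changing the weights shifts emphasis between pairs without rescaling the overall objective.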