💡Reinforced self-training (ReST) improves a language model by training it on synthetic data the model generates itself.
🔄The generated data is scored by a reward model and filtered so that only high-reward examples are retained.
⬆️Fine-tuning on this filtered data raises the average reward the language model achieves.
📈The improved language model aligns better with human preferences.
🔁The generate-filter-fine-tune cycle can be repeated for multiple iterations to further enhance the language model.
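The loop above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the "policy" is modeled as a pool of candidate strings, sampling from the pool stands in for generation, the length-based `reward` stands in for a learned reward model, and replacing the pool with the kept samples stands in for fine-tuning.

```python
import random


def reward(sample: str) -> float:
    # Stand-in for a learned reward model: here, longer replies score higher.
    return float(len(sample))


def generate(pool: list[str], n: int, rng: random.Random) -> list[str]:
    # "Grow" step: sample n responses from the current policy,
    # modeled here as drawing from a pool of candidate strings.
    return [rng.choice(pool) for _ in range(n)]


def rest(pool: list[str], iterations: int = 3, n: int = 8,
         threshold: float = 4.0, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    for _ in range(iterations):
        # "Improve" step: keep only samples whose reward clears the threshold…
        kept = [s for s in generate(pool, n, rng) if reward(s) >= threshold]
        # …then "fine-tune" by making the kept samples the new policy.
        if kept:
            pool = kept
    return pool


refined = rest(["ok", "good reply", "a much more detailed reply"])
```

After a few iterations, low-reward responses drop out of the policy, which is the mechanism by which repeated generate-filter-train rounds raise the model's reward.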