Feb 18, 2024
Efficient Exploration for LLMs
A look at active exploration for RLHF-style feedback collection and why choosing better questions can improve LLM training efficiency.
Recently, a new and exciting paper from Google DeepMind and Stanford University was published on efficient exploration in gathering human feedback to improve LLMs. The authors raise important questions, including one that I have also asked myself a few times. Will gathering more and more data help us get better models? And what if we use all of the data? Given that we currently use training processes like RLHF to learn only from humans, how can we hope for superhuman efficiency? We are modeling LLMs to be like humans. So we are creating imperfect humans.
Imperfect human but with broad horizons?
Do you remember the unconfirmed OpenAI leak about the Q* algorithm that could bring us one step closer to AGI? At that time, some people speculated that LLMs might not be perfect, but that they could still be used to generate thousands of ideas, one of which might be groundbreaking. The authors had a similar idea. Imagine a pretrained model that extrapolates from its training data to generate large numbers, perhaps millions or billions, of ideas and concepts. If even one of them is groundbreaking, we can continue building on top of it. This way, with enough human feedback, a model is taught to generate content that a human could not. But how long would the data collection take? Months, years, or decades?
Exploration and Human Feedback
Classic RLHF is a training method in which queries, each consisting of a prompt and generated responses, are sent to human annotators, who rate the responses from best to worst according to their preference. That data is then used to train a reward model, which in turn guides the final alignment of the model to human preference. The authors call this approach passive exploration.
Active exploration
Active exploration is the strategic selection of queries based on past human feedback to improve the training process of LLMs. It contrasts with passive exploration, where queries are chosen without leveraging past feedback. During training, the agent actively selects response pairs to maximize the quality of future feedback, using information gained from past interactions. In simpler words, the idea is to use what the model has learned from earlier feedback to pick the next set of questions that are most likely to give new and useful information.
So how are the answer pairs selected? The authors experimented with two different approaches to exploration.
Boltzmann Exploration
The idea behind Boltzmann exploration is to make "educated guesses" about which questions to ask next by assigning each response a probability based on its estimated reward. The estimate reflects which questions got helpful answers before. Responses with higher estimated rewards are more likely to be chosen, but there is still a chance of selecting less favorable responses, which ensures exploration.
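To make this more concrete, here is a minimal sketch of how such a selection could look. It is not the authors' code: the function name `boltzmann_pair`, the `temperature` parameter, and the softmax form of the probabilities are my assumptions about a typical Boltzmann sampler, and the rewards are made-up numbers.

```python
import numpy as np

def boltzmann_pair(rewards, temperature=1.0, rng=None):
    """Sample two distinct responses, favoring higher estimated rewards.

    `rewards` holds one estimated reward per candidate response
    (hypothetical values; in practice they would come from a reward model).
    """
    rng = np.random.default_rng() if rng is None else rng
    rewards = np.asarray(rewards, dtype=float)

    # Softmax over rewards; a lower temperature means greedier choices.
    logits = rewards / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()

    # Draw two different responses without replacement.
    # Note: a very small temperature can underflow probabilities to zero,
    # in which case sampling two distinct items may fail.
    first, second = rng.choice(len(rewards), size=2, replace=False, p=probs)
    return int(first), int(second), probs

# Example: four candidate responses with made-up reward estimates.
first, second, probs = boltzmann_pair([0.1, 1.5, 0.7, -0.3], temperature=0.5)
print(first, second, probs.round(3))
```

The response with the highest estimated reward gets the highest probability, but every response keeps a non-zero chance of being picked, which is where the exploration comes from.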
Epistemic Neural Networks
ENNs model the uncertainty about the rewards associated with different responses. This means that an ENN doesn't just assign a single, fixed reward to each response based on how well it matches human preferences. Instead, it considers a range of possible rewards, each corresponding to a different "what-if" scenario represented by the epistemic index. This approach recognizes that confidence in the reward might vary: in some cases annotators might be very sure which response is better, while in others they might be uncertain. By modeling uncertainty, ENNs make more informed decisions about which responses are likely to be preferred, which helps the model learn more effectively from human feedback.
- Infomax - takes an ENN reward model as input and generates N responses. For each pair of responses, the ENN predicts how likely it is that one will be preferred, using different scenarios (epistemic indices). Infomax then measures how much these predictions vary across scenarios and picks the pair of responses with the greatest variation (to learn the most from the feedback it gets).
- Double TS - picks queries that help find the best responses. It tries to choose two responses that might be the best by first generating a set of responses, then picking two from this set based on their potential rewards, ensuring that they are different. If it can't find two distinct responses after several tries, it randomly picks the second one.
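Both strategies can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the ENN is replaced by a plain matrix of rewards with one row per epistemic index and one column per response, the preference probability is assumed to be a sigmoid of the reward difference (a Bradley-Terry-style model), "variation" is measured as variance, and the names `infomax_pair`, `double_ts_pair`, and `max_tries` are made up for this example.

```python
import numpy as np

def preference_prob(r_a, r_b):
    # Assumed Bradley-Terry-style model: probability that A is preferred to B.
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

def infomax_pair(rewards):
    """Pick the pair whose preference prediction varies most across indices.

    `rewards` has shape (n_indices, n_responses): a stand-in for an ENN
    evaluated at several epistemic indices.
    """
    rewards = np.asarray(rewards, dtype=float)
    n_responses = rewards.shape[1]
    best_pair, best_var = None, -1.0
    for i in range(n_responses):
        for j in range(i + 1, n_responses):
            # One preference probability per epistemic index ("scenario").
            p = preference_prob(rewards[:, i], rewards[:, j])
            if p.var() > best_var:
                best_pair, best_var = (i, j), p.var()
    return best_pair

def double_ts_pair(rewards, max_tries=10, rng=None):
    """Pick two responses that each look best under a sampled index."""
    rng = np.random.default_rng() if rng is None else rng
    rewards = np.asarray(rewards, dtype=float)
    n_indices, n_responses = rewards.shape

    # First response: the best one under a randomly sampled index.
    first = int(np.argmax(rewards[rng.integers(n_indices)]))

    # Second response: resample the index until a different winner appears.
    for _ in range(max_tries):
        second = int(np.argmax(rewards[rng.integers(n_indices)]))
        if second != first:
            return first, second

    # Fallback: a random second response (kept distinct here by choice).
    others = [k for k in range(n_responses) if k != first]
    return first, int(rng.choice(others))

# Made-up rewards: 3 epistemic indices (rows) x 3 responses (columns).
rewards = [[1.0, 1.0, -2.0],
           [1.0, 1.0,  1.0],
           [1.0, 1.0,  4.0]]
print(infomax_pair(rewards))
print(double_ts_pair(rewards))
```

In the example, the scenarios agree about the first two responses and disagree about the third, so Infomax selects a pair that includes the third response. Double TS instead looks for responses that could plausibly be the best one, so it concentrates feedback on promising candidates rather than on uncertainty alone.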
Results
The best win rate was achieved by Double TS, followed by Infomax, Boltzmann exploration, and passive exploration in last place. The authors showed that Double TS more accurately predicts human preference for the first response over time, after 40,000 queries. Unlike Boltzmann exploration, which lacks guidance from uncertainty estimates and fails to adjust its predictions effectively, Double TS uses uncertainty to make better predictions, showing a strong ability to adapt and learn from feedback. Most important, however, is that asking the right questions based on what the model already knows makes training much faster and might lead to breakthroughs that, in theory, surpass human creativity.