Language models have demonstrated remarkable abilities in generating a wide range of compelling text based on prompts provided by users. However, defining what constitutes "good" text is difficult, because it often depends on personal preferences and the specific context. For instance, in storytelling, creativity is key; in crafting informative content, accuracy and reliability are crucial; and when generating code, ensuring it runs correctly is essential. Hence the "LLM alignment problem," which refers to the challenge of ensuring that large language models (LLMs) act in ways that are consistent with human values, intentions, and preferences.
Designing a loss function that captures the many qualities we value in text, such as creativity, accuracy, or executability, is extremely complex and often impractical. Concepts like these are not differentiable, so they cannot be optimized through back-propagation and cannot be trained for with simple next-token prediction.
Imagine if we could harness human feedback to evaluate the quality of generated text or, even better, use that feedback as a guiding loss function to improve the model's performance. This idea is at the heart of Reinforcement Learning from Human Feedback (RLHF). By applying reinforcement learning techniques, RLHF allows us to fine-tune language models based on direct human feedback, aligning the models more closely with nuanced human values and expectations. This approach has opened up new possibilities for training language models that are not only more responsive but also more aligned with the complexity of human preferences.
Below, we will aim to learn more about RLHF, first via reward-based methods and then via reward-free methods.
Let's go through Reinforcement Learning from Human Feedback (RLHF). It consists of three main stages:
- Supervised fine-tuning
- Reward modeling phase
- RL fine-tuning phase
Supervised fine-tuning
RLHF starts from a pre-trained model that has already been fine-tuned on a high-quality dataset. Its objective is simple: given an input (prompt), it produces an output. The ultimate goal here is to further fine-tune this model so that it produces output in line with human preference. Hence, let's call this the base model for reference. At this point, it is a vanilla base model that is not aware of any human preference.
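As a minimal sketch of what this supervised fine-tuning step looks like (assuming a Hugging Face causal LM; the checkpoint name and the toy prompt/response pair are purely illustrative), it is just next-token cross-entropy on the prompt plus the desired response:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any causal LM works the same way.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "What is the capital of France?"
response = " The capital of France is Paris."

# Concatenate prompt and response; the model learns to continue the prompt.
inputs = tokenizer(prompt + response, return_tensors="pt")

# Standard next-token cross-entropy; labels are shifted internally by the model.
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```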
Reward modeling phase
Reward model innovation: This is where the new innovation begins, namely how reward models are incorporated into RLHF. The idea behind the reward model is that a new LLM, which can be the same as the base model mentioned above, gains the ability to generate a human-preference score. The reason it resembles a large language model is that this model also needs to understand the language semantics before it can rate whether an output is human-preferred or not. Since the reward is a scalar, we add a linear layer on top of the LLM to generate a scalar score in terms of human preference.
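A minimal sketch of that architecture, assuming a GPT-2 backbone purely for illustration (the class and variable names here are mine, not from any particular implementation):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """A language-model backbone with a linear head that outputs a scalar reward."""

    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        # Linear layer on top of the LLM: hidden state -> single scalar score.
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Use the final token's hidden state to summarize the (prompt, response) pair.
        last_hidden = hidden[:, -1, :]
        return self.reward_head(last_hidden).squeeze(-1)  # shape: (batch,)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
reward_model = RewardModel("gpt2")
batch = tokenizer(["What is the capital of France? The capital of France is Paris."],
                  return_tensors="pt")
score = reward_model(**batch)  # one scalar preference score per sequence
```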
Data collection phase: This builds on the supervised fine-tuning stage, where the base model is asked to generate two outputs for a given input. Example: for an input token sequence x, two outputs, y1 and y2, are generated by the base model. These outputs are shown to human raters, and the human preference is recorded for each individual output.
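A rough sketch of how such a preference dataset might be assembled; the helper functions below are hypothetical stand-ins for actual sampling and human annotation tooling:

```python
def generate_two_outputs(prompt: str) -> tuple[str, str]:
    # Stand-in for sampling two completions from the base model.
    return ("The capital of France is Paris.", "The capital of France is Berlin.")

def ask_human_rater(prompt: str, y1: str, y2: str) -> int:
    # Stand-in for the human annotation step: returns the index of the preferred output.
    return 0

prompts = ["What is the capital of France?"]
preference_data = []
for x in prompts:
    y1, y2 = generate_two_outputs(x)
    winner = ask_human_rater(x, y1, y2)
    y_w, y_l = (y1, y2) if winner == 0 else (y2, y1)
    preference_data.append({"prompt": x, "chosen": y_w, "rejected": y_l})
```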
Training phase: Once the sample data is collected from the data collection phase, the reward model is trained with a prompt of the following form: "Given the following input: <x>, the LLM generated the output <y>. Can you rate the performance of the output?". The model outputs a reward r, and from the data collection phase we already know which output the human raters preferred. This can be back-propagated through the loss function and the model can be trained. Below is the objective loss function that the model optimizes through back-propagation:

Loss(Φ) = −E(x, yw, yl)∼Ɗ [ log σ( rΦ(x, yw) − rΦ(x, yl) ) ]
Notation:
- rΦ(x, y): a reward model parameterized by Φ which estimates the reward. Parameterized means we don't know the exact values and they need to be optimized using the above equation. This is the reward LLM itself. Mostly, the LLM parameters are frozen here and only a few parameters are left to change. The most important layer is the linear layer added on top; it does most of the learning needed to score the output.
- Ɗ: a dataset of triplets (x, yw, yl), where x is the input, yw the winning output, and yl the losing output
- σ: the sigmoid function, which maps the difference in rewards to a probability (0–1)
- E(x, yw, yl)∼Ɗ: the expectation over triplets (x, yw, yl) sampled from the dataset Ɗ
Example scenario: Imagine you are training a reward model to evaluate responses. You have pairs of responses to a given prompt, and human feedback tells you which response is better. For example, given x ("What is the capital of France?"), you have yw ("The capital of France is Paris.") as the winner and yl ("The capital of France is Berlin.") as the loser. The reward model should eventually learn to give a higher reward to "The capital of France is Paris." than to "The capital of France is Berlin." when the input is "What is the capital of France?".
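A minimal sketch of that pairwise training step, reusing the illustrative RewardModel and tokenizer from the sketch above (in practice you would batch many triplets and train for many steps):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, tokenizer, prompt, chosen, rejected):
    """Pairwise loss: -log sigmoid( r(x, y_w) - r(x, y_l) )."""
    win = tokenizer(prompt + " " + chosen, return_tensors="pt")
    lose = tokenizer(prompt + " " + rejected, return_tensors="pt")
    r_w = reward_model(**win)    # scalar reward for the winning output
    r_l = reward_model(**lose)   # scalar reward for the losing output
    return -F.logsigmoid(r_w - r_l).mean()

loss = reward_model_loss(
    reward_model, tokenizer,
    "What is the capital of France?",
    "The capital of France is Paris.",
    "The capital of France is Berlin.",
)
loss.backward()  # pushes r(x, y_w) up relative to r(x, y_l)
```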
RL fine-tuning phase
Reinforcement learning idea: Now that the base model and reward model are trained, the question is how to leverage the reward model's score to update the base model's parameters so that they reflect human preference. Since the reward model outputs a scalar score on sampled text, it is not differentiable with respect to the base model's parameters, so we cannot use simple back-propagation to update them. Hence, we need other techniques to update the base model. This is where reinforcement learning comes in: it helps the base model change its parameters through the reward model's score. This is done via PPO (Proximal Policy Optimization). Understanding the core architecture of PPO is not required to grasp this concept, so we won't cover it here, but at a high level the idea is that PPO can use a scalar score to update the base model's parameters. Now let's understand how the base and reward models are combined to make the base model learn human preference.
RL fine-tuning idea: In reinforcement learning, we have actions, a state space, and rewards. The idea is to come up with a policy such that the actions the agent takes in that space maximize the reward. This can get quite complicated, but in a simplified sense, π is the policy, which here is just our base LLM. Πref denotes the base model and ΠӨ denotes a different, optimal LLM that we are trying to obtain. We need to find ΠӨ (the base model's neural network weights will be fine-tuned) that produces human-preferred output. It's just that we don't know ΠӨ, and the goal is to find this optimal model.
RL training and feedback loop phase: An input x is given to two policy models, Πref (the baseline model) and ΠӨ (the optimal model we are trying to obtain). Initially, both models are kept identical. Feeding input x to the two models separately gives two corresponding outputs. The output from the ΠӨ model is also fed to the reward model (input: x, output: y, as discussed above), which is asked to output the reward score rΦ(x, y). Now we have three things: the output from the baseline model, the output from the optimal model, and a reward score for the optimal model's output. There are two things we are optimizing here: one is to maximize the reward, because ultimately we want the model to be as close to human preference as possible, and the other is to minimize the divergence from the baseline model. Maximizing the reward is easy since it is already a scalar quantity, but how do we minimize the divergence between the baseline and optimal models? Here we use the Kullback–Leibler divergence, which estimates the difference between two probability distributions. Let's take a deeper look at the objective, which is maximized with respect to ΠӨ:

E x∼Ɗ, y∼ΠӨ(y | x) [ rΦ(x, y) ] − β · Dkl( ΠӨ(y | x) || Πref(y | x) )
Notation:
- rΦ(x, y): the scalar reward for an input x and output y (from the optimal model). To be explicit, the output from the optimal model is fed into the reward model.
- Dkl (ΠӨ (y | x) || Πref (y | x)): the Kullback–Leibler divergence between the two probability distributions. Each model defines a probability distribution over tokens, and the KL divergence estimates how far these distributions are from each other.
- β : a hyperparameter that determines how important it is to keep the optimal model close to the baseline model.
Example scenario: Imagine you ask "What is the capital of France?". Πref (the baseline model) says "The capital of France is Berlin." and ΠӨ (the optimal model) says "There are 3 capitals, Paris, Versailles, and Lyon, but Paris is considered the official capital." Now rΦ("What is the capital…", "There are 3 capitals…") should give a low score since the answer is less human-preferred, and the Kullback–Leibler divergence Dkl(ΠӨ (y | x) || Πref (y | x)) will be high as well, since the output distributions differ substantially. Hence the loss will be high from both terms. We don't want the model to optimize only for reward; we also want it to stay close to the baseline model, and hence both terms appear in the objective. In the next iteration, suppose ΠӨ (the optimal model) says "The capital of France is Delhi"; in this case the model has learned to stay closer to Πref (the baseline model) and output a format closer to the baseline, but the reward component will still be low. Hopefully, by the third iteration, ΠӨ (the optimal model) learns to output "The capital of France is Paris", earning a higher reward while staying closely aligned with the baseline model.
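A minimal sketch of how the two terms of this objective could be computed for one sampled response, reusing the illustrative reward_model from the earlier sketch (the actual PPO update adds clipping and advantage estimation on top; this only shows the reward-minus-KL quantity being maximized, and all names are mine):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy_model = AutoModelForCausalLM.from_pretrained("gpt2")  # ΠӨ, being fine-tuned
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")     # Πref, kept frozen
beta = 0.1  # weight of the KL penalty

prompt = "What is the capital of France?"
response = " There are 3 capitals, Paris, Versailles, and Lyon."
inputs = tokenizer(prompt + response, return_tensors="pt")

policy_logits = policy_model(**inputs).logits
with torch.no_grad():
    ref_logits = ref_model(**inputs).logits

# Token-level KL divergence between ΠӨ(y | x) and Πref(y | x),
# averaged over the whole sequence for simplicity.
policy_logprobs = F.log_softmax(policy_logits, dim=-1)
ref_logprobs = F.log_softmax(ref_logits, dim=-1)
kl = (policy_logprobs.exp() * (policy_logprobs - ref_logprobs)).sum(-1).mean()

# Scalar reward for (x, y) from the frozen reward model of the earlier sketch.
with torch.no_grad():
    reward = reward_model(**inputs)

# The quantity being maximized: reward minus the weighted KL penalty.
objective = reward - beta * kl
```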
The diagram below helps illustrate the logic. I would also highly recommend going through the RLHF link from Hugging Face.
With RLHF using a reward-based method in mind, let's move on to the reward-free method. According to the paper: "our key insight is to leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies. This change-of-variables approach avoids fitting an explicit, standalone reward model, while still optimizing under existing models of human preferences". Very complicated to understand, but let's try to break it down into simple stages in the next section.
Reward-free method's key idea: In RLHF, a separate new reward model is trained, which is costly and expensive to maintain. Is there any mechanism to avoid training a new reward model and instead use the existing base model to arrive at a new optimal model? This is exactly what the reward-free method does: it avoids training a new reward model and instead changes the equations in such a way that there is no reward model term in the loss function of DPO (Direct Preference Optimization). One way to think about this is that we need to reach the optimal model policy (ΠӨ) from the base model (Πref). It can be reached either by optimizing over the reward function space, which serves as a proxy on the way to the optimal policy, or by directly learning a mapping from rewards to policies and optimizing the policy itself. This is exactly what the authors do by removing the reward function component from the loss function and replacing it directly with the model policy parameters. This is what the authors mean when they say "leverage an analytical mapping from reward functions to optimal policies …. into a loss function over policies". This is the core innovation of the paper.
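As a brief sketch of that change of variables, following the standard DPO derivation (here Z(x) denotes the intractable partition function):

```latex
% Closed-form optimum of the KL-constrained reward maximization problem:
\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
                  \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)

% Rearranging expresses the reward purely through the policy:
r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
```

Plugging this expression for the reward into the Bradley-Terry preference model makes the β log Z(x) term cancel between the winning and losing answers, which is why the loss below can be written purely in terms of the two policies.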
DPO training and feedback loop phase: Using Πref (the baseline model), an input x is given and the model is asked to produce two outputs (y1 and y2). Human raters look at x, y1, and y2 and decide on the winning output yw and the losing output yl. An offline dataset of triplets <x, yw, yl> is collected. With this information, we know which answer is winning (human-preferred) and which is losing (not human-preferred). Now, the same input x is given to the two policies (models), Πref (baseline model) and ΠӨ (the optimal model). Initially, both models are kept the same for training purposes. For both models, we then measure how much probability each assigns to the winning and the losing answers. Let's take a deeper look at the objective loss function; a short code sketch follows the notation below.
L_DPO(ΠӨ; Πref) = −E(x, yw, yl)∼Ɗ [ log σ( β log ( ΠӨ(yw | x) / Πref(yw | x) ) − β log ( ΠӨ(yl | x) / Πref(yl | x) ) ) ]

Notation:
- ΠӨ (yw | x): given the input x, the probability that the model ΠӨ assigns to the winning output yw. This is a scalar value (the likelihood of that specific answer), and it is computed for every combination: Πref (yw | x), Πref (yl | x), ΠӨ (yw | x), and ΠӨ (yl | x). The loss compares the log-ratios ΠӨ (yw | x) / Πref (yw | x) and ΠӨ (yl | x) / Πref (yl | x), so that the optimal model raises the likelihood of the winning answer and lowers the likelihood of the losing answer relative to the baseline.
- β : a hyperparameter that determines how important it is to keep the optimal model close to the baseline model.
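A minimal sketch of that loss for a single triplet (assuming Hugging Face causal LMs; the sequence log-probability helper is my own simplification, since it scores the full prompt-plus-response string instead of masking out the prompt tokens):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy_model = AutoModelForCausalLM.from_pretrained("gpt2")  # ΠӨ, trainable
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")     # Πref, frozen
beta = 0.1

def sequence_logprob(model, text):
    """Sum of log-probabilities the model assigns to the tokens of `text`."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    logits = model(ids).logits[:, :-1, :]          # predictions for the next tokens
    logprobs = F.log_softmax(logits, dim=-1)
    target = ids[:, 1:]
    return logprobs.gather(2, target.unsqueeze(-1)).squeeze(-1).sum(-1)

x = "What is the capital of France?"
y_w = " The capital of France is Paris."   # winning (human-preferred) answer
y_l = " The capital of France is Berlin."  # losing answer

with torch.no_grad():
    ref_w = sequence_logprob(ref_model, x + y_w)
    ref_l = sequence_logprob(ref_model, x + y_l)

# Log-ratios of the optimal policy vs. the reference policy for winner and loser.
ratio_w = sequence_logprob(policy_model, x + y_w) - ref_w
ratio_l = sequence_logprob(policy_model, x + y_l) - ref_l

# DPO loss: -log sigmoid( beta * (ratio_w - ratio_l) )
loss = -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
loss.backward()
```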
- Naturally, the question comes down to which one is better: RLHF via the reward-based method using PPO, or the reward-free method using DPO? There is no single right answer to this question. A recent paper, "Is DPO Superior to PPO for LLM Alignment?" (paper link), compares the two and concludes that PPO is generally better than DPO, and that DPO suffers more heavily from out-of-distribution data. "Out-of-distribution" data means the human preference data is different from the data the baseline model was trained on. This can happen if the base model is trained on one dataset while the preference data is collected on some other dataset.
- Overall, the research is still inconclusive on which one is better, while we have seen companies like OpenAI, Anthropic, and Meta leverage both RLHF via PPO and DPO as tools for LLM alignment.