In this tutorial, we show how to efficiently fine-tune the Llama-2 7B Chat model for Python code generation using advanced techniques such as QLoRA, gradient checkpointing, and supervised fine-tuning with the SFTTrainer. Leveraging the Alpaca-14k dataset, we walk through setting up the environment, configuring LoRA parameters, and applying memory optimization strategies to train a model that excels at generating high-quality Python code. This step-by-step guide is designed for practitioners seeking to harness the power of LLMs with minimal computational overhead.
!pip install -q accelerate
!pip install -q peft
!pip install -q transformers
!pip install -q trl
First, install the required libraries for our project. They include accelerate, peft, transformers, and trl from the Python Package Index. The -q flag (quiet mode) keeps the output minimal.
import os
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
HfArgumentParser,
TrainingArguments,
pipeline,
logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
Import the essential modules for our training setup. They include utilities for dataset loading, the model and tokenizer, training arguments, logging, LoRA configuration, and the SFTTrainer.
# The model to train from the Hugging Face hub
model_name = "NousResearch/llama-2-7b-chat-hf"
# The instruction dataset to use
dataset_name = "user/minipython-Alpaca-14k"
# Fine-tuned model name
new_model = "/kaggle/working/llama-2-7b-codeAlpaca"
We specify the base model from the Hugging Face hub, the instruction dataset, and the new model's name.
# QLoRA parameters
# LoRA attention dimension
lora_r = 64
# Alpha parameter for LoRA scaling
lora_alpha = 16
# Dropout probability for LoRA layers
lora_dropout = 0.1
Define the LoRA parameters for our model. `lora_r` sets the LoRA attention dimension, `lora_alpha` scales the LoRA updates, and `lora_dropout` controls the dropout probability.
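As a quick aside (an illustrative snippet, not part of the original notebook), LoRA scales its low-rank update by the ratio `lora_alpha / lora_r`, so these two values jointly control how strongly the adapter modifies the base weights:
# Illustrative only: the effective scaling LoRA applies to its low-rank update
scaling = lora_alpha / lora_r  # 16 / 64 = 0.25 with the values above
print(f"LoRA scaling factor (alpha / r): {scaling}")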
# TrainingArguments parameters
# Output directory where the model predictions and checkpoints will be stored
output_dir = "/kaggle/working/llama-2-7b-codeAlpaca"
# Number of training epochs
num_train_epochs = 1
# Enable fp16 training (set to True for mixed-precision training)
fp16 = True
# Batch size per GPU for training
per_device_train_batch_size = 8
# Batch size per GPU for evaluation
per_device_eval_batch_size = 8
# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 2
# Enable gradient checkpointing
gradient_checkpointing = True
# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3
# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4
# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001
# Optimizer to use
optim = "adamw_torch"
# Learning rate schedule
lr_scheduler_type = "constant"
# Group sequences into batches with the same length
# Saves memory and speeds up training considerably
group_by_length = True
# Ratio of steps for a linear warmup
warmup_ratio = 0.03
# Save a checkpoint every X update steps
save_steps = 100
# Log every X update steps
logging_steps = 10
These parameters configure the training process. They include output paths, the number of epochs, precision (fp16), batch sizes, gradient accumulation, and checkpointing. Additional settings such as the learning rate, optimizer, and scheduler help fine-tune training behavior. Warmup and logging settings control how training starts and how we monitor progress.
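As a quick illustration (not part of the original notebook), the effective batch size seen by the optimizer is the per-device batch size multiplied by the number of gradient accumulation steps (and by the number of GPUs when training on several devices):
# Illustrative only: effective batch size with gradient accumulation
num_devices = 1  # assumption: a single GPU
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
print(f"Effective batch size: {effective_batch_size}")  # 8 * 2 * 1 = 16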
import torch
print("PyTorch Model:", torch.__version__)
print("CUDA Model:", torch.model.cuda)
Import PyTorch and print both the installed PyTorch version and the corresponding CUDA version.
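The notebook presumably runs `!nvidia-smi` at this point to query the GPU:
!nvidia-smi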
This command shows the GPU information, including the driver version, CUDA version, and current GPU utilization.
# SFT parameters
# Maximum sequence length to use
max_seq_length = None
# Pack multiple short examples in the same input sequence to increase efficiency
packing = False
# Load the entire model on GPU 0
device_map = {"": 0}
Define the SFT parameters, such as the maximum sequence length, whether to pack multiple examples, and mapping the entire model to GPU 0.
# Load dataset
dataset = load_dataset(dataset_name, split="train")
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Load the base model in half precision (fp16)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
# Prepare the model for training
model.gradient_checkpointing_enable()
model.enable_input_require_grads()
Set additional SFT parameters and load our dataset and tokenizer. We configure the padding token for the tokenizer and load the base model in half precision (fp16). Finally, we enable gradient checkpointing and ensure the model requires input gradients for training.
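The introduction mentions QLoRA, while the snippet above loads the base model in plain fp16. If you want to load the base model with 8-bit quantization instead, a minimal sketch using `BitsAndBytesConfig` from transformers (assuming the `bitsandbytes` package is installed) would look like this:
from transformers import BitsAndBytesConfig
# Sketch: load the base model with 8-bit quantization (requires bitsandbytes)
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)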
from peft import get_peft_model
Import the `get_peft_model` function, which applies parameter-efficient fine-tuning (PEFT) to our base model.
# Load LoRA configuration
peft_config = LoraConfig(
lora_alpha=lora_alpha,
lora_dropout=lora_dropout,
r=lora_r,
bias="none",
task_type="CAUSAL_LM",
)
# Apply LoRA to the model
model = get_peft_model(model, peft_config)
# Set training parameters
training_arguments = TrainingArguments(
output_dir=output_dir,
num_train_epochs=num_train_epochs,
per_device_train_batch_size=per_device_train_batch_size,
gradient_accumulation_steps=gradient_accumulation_steps,
optim=optim,
save_steps=save_steps,
logging_steps=logging_steps,
learning_rate=learning_rate,
weight_decay=weight_decay,
fp16=fp16,
max_grad_norm=max_grad_norm,
warmup_ratio=warmup_ratio,
group_by_length=True,
lr_scheduler_type=lr_scheduler_type,
)
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)
Configure and apply LoRA to our model using `LoraConfig` and `get_peft_model`. We then create the `TrainingArguments` for model training, specifying the number of epochs, batch sizes, and optimization settings. Finally, we set up the `SFTTrainer`, passing it the model, dataset, tokenizer, and training arguments.
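To confirm that only a small fraction of the weights will be updated, you can optionally inspect the LoRA-wrapped model with PEFT's built-in helper (an extra check, not part of the original notebook):
# Optional check: report trainable vs. total parameters after applying LoRA
model.print_trainable_parameters()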
# Train the model
trainer.train()
# Save the trained model
trainer.model.save_pretrained(new_model)
Initiate the supervised fine-tuning process (`trainer.train()`) and then save the trained LoRA adapter to the specified directory.
# Run a text generation pipeline with the fine-tuned model
prompt = "How can I write a Python program that calculates the mean, standard deviation, and coefficient of variation of a dataset from a CSV file?"
pipe = pipeline(task="text-generation", model=trainer.model, tokenizer=tokenizer, max_length=400)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])
Create a text generation pipeline using our fine-tuned model and tokenizer. Then, we provide a prompt, generate text with the pipeline, and print the output.
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("HF_TOKEN")
Access Kaggle Secrets to retrieve a stored Hugging Face token (`HF_TOKEN`). This token is used to authenticate with the Hugging Face Hub.
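The snippet above only retrieves the token; presumably it is then used to log in to the Hub, for example via `huggingface_hub` (shown here as a sketch):
from huggingface_hub import login
# Authenticate with the Hugging Face Hub using the retrieved token
login(token=secret_value_0)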
# Empty VRAM
# del model
# del pipe
# del trainer
# del dataset
del tokenizer
import gc
gc.collect()
gc.collect()
torch.cuda.empty_cache()
The snippet above shows how to free up GPU memory by deleting references and clearing caches. We delete the tokenizer, run garbage collection, and empty the CUDA cache to reduce VRAM usage.
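To see how much memory was actually released, you can optionally print PyTorch's allocator statistics after the cleanup (not part of the original notebook):
# Optional: report current GPU memory usage after cleanup
allocated_gb = torch.cuda.memory_allocated() / 1024**3
reserved_gb = torch.cuda.memory_reserved() / 1024**3
print(f"Allocated: {allocated_gb:.2f} GB, Reserved: {reserved_gb:.2f} GB")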
import torch
# Check the number of GPUs available
num_gpus = torch.cuda.device_count()
print(f"Number of GPUs available: {num_gpus}")
# Check whether CUDA device 1 is available
if num_gpus > 1:
    print("cuda:1 is available.")
else:
    print("cuda:1 is not available.")
We import PyTorch and check how many GPUs are detected. Then, we print the count and conditionally report whether the GPU with ID 1 is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Specify the device ID for your desired GPU (e.g., 0 for the first GPU, 1 for the second GPU)
device_id = 1  # Change this based on your available GPUs
device = f"cuda:{device_id}"
# Load the base model on the specified GPU
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",  # Use auto to load on the available device
)
# Load the LoRA weights
lora_model = PeftModel.from_pretrained(base_model, new_model)
# Move the LoRA model to the specified GPU
lora_model.to(device)
# Merge the LoRA weights with the base model weights
model = lora_model.merge_and_unload()
# Ensure the merged model is on the correct device
model.to(device)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
Select a GPU device (device_id 1) and load the base model with the specified precision and memory optimizations. Then, load the LoRA weights and merge them into the base model, ensuring the merged model is moved to the designated GPU. Finally, load the tokenizer and configure it with appropriate padding settings.
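At this point the merged model behaves like a regular transformers checkpoint and can be saved for later use; for example (the output path below is an assumption, not from the original notebook):
# Save the merged model and tokenizer (path is illustrative)
merged_dir = "/kaggle/working/llama-2-7b-codeAlpaca-merged"
model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)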
In conclusion, by following this tutorial you have successfully fine-tuned the Llama-2 7B Chat model to specialize in Python code generation. Integrating QLoRA, gradient checkpointing, and the SFTTrainer demonstrates a practical approach to managing resource constraints while achieving high performance.
Download the Colab Notebook here. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.