Now, let’s jump into the main part of this article: analyzing the (Q, K, V, O) matrices of the Llama-3-8B-Instruct model through their singular values!
The Code
Let’s first import all the necessary packages for this analysis.
import transformers
import torch
import numpy as np
from transformers import AutoConfig, LlamaModel
from safetensors import safe_open
import os
import matplotlib.pyplot as plt
Then, let’s download the model and save it into our local /tmp directory.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
!huggingface-cli download {MODEL_ID} --quiet --local-dir /tmp/{MODEL_ID}
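If you’d rather stay in Python than use a notebook shell command, a roughly equivalent alternative (using huggingface_hub, which transformers already depends on) is:

from huggingface_hub import snapshot_download

# Download the full model repository to the same /tmp target as the CLI command above
snapshot_download(repo_id=MODEL_ID, local_dir=f"/tmp/{MODEL_ID}")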
If you’re GPU-rich, the following code might not be relevant for you. However, if you’re GPU-poor like me, the following code will be really helpful, as it lets us load only specific layers of the Llama-3-8B model.
def load_specific_layers_safetensors(model, model_name, layer_to_load):
    state_dict = {}
    files = [f for f in os.listdir(model_name) if f.endswith('.safetensors')]
    for file in files:
        filepath = os.path.join(model_name, file)
        with safe_open(filepath, framework="pt") as f:
            for key in f.keys():
                if f"layers.{layer_to_load}." in key:
                    # Remap the requested layer's keys so they load into layer index 0
                    new_key = key.replace(f"model.layers.{layer_to_load}.", 'layers.0.')
                    state_dict[new_key] = f.get_tensor(key)
    missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
    if missing_keys:
        print(f"Missing keys: {missing_keys}")
    if unexpected_keys:
        print(f"Unexpected keys: {unexpected_keys}")
The reason we do this is because the free tier of the Google Colab GPU is not enough to load Llama-3-8B even with fp16 precision. Moreover, this analysis requires us to work in fp32 precision because of how np.linalg.svd is built. Next, we can define the main function to get the singular values for a given matrix_type, layer_number, and head_number.
def get_singular_values(model_path, matrix_type, layer_number, head_number):
    """
    Computes the singular values of the specified matrix in the Llama-3 model.

    Parameters:
        model_path (str): Path to the model
        matrix_type (str): Type of matrix ('q', 'k', 'v', 'o')
        layer_number (int): Layer number (0 to 31)
        head_number (int): Head number (0 to 31)

    Returns:
        np.array: Array of singular values
    """
    assert matrix_type in ['q', 'k', 'v', 'o'], "Invalid matrix type"
    assert 0 <= layer_number < 32, "Invalid layer number"
    assert 0 <= head_number < 32, "Invalid head number"

    # Load the model only for that specific layer since we have limited RAM even after using fp16
    config = AutoConfig.from_pretrained(model_path)
    config.num_hidden_layers = 1
    model = LlamaModel(config)
    load_specific_layers_safetensors(model, model_path, layer_number)

    # Access the specified layer
    # Always index 0 since we have loaded only that specific layer
    layer = model.layers[0]

    # Determine the size of each head
    num_heads = layer.self_attn.num_heads
    head_dim = layer.self_attn.head_dim

    # Access the specified matrix
    weight_matrix = getattr(layer.self_attn, f"{matrix_type}_proj").weight.detach().numpy()

    if matrix_type in ['q', 'o']:
        start = head_number * head_dim
        end = (head_number + 1) * head_dim
    else:  # 'k', 'v' matrices
        # Adjust the head_number based on num_key_value_heads
        # This is done since Llama-3-8B uses Grouped Query Attention:
        # with 32 query heads and 8 KV heads, query heads 0-3 share KV head 0, and so on
        num_key_value_groups = num_heads // config.num_key_value_heads
        head_number_kv = head_number // num_key_value_groups
        start = head_number_kv * head_dim
        end = (head_number_kv + 1) * head_dim

    # Extract the weights for the specified head
    if matrix_type in ['q', 'k', 'v']:
        weight_matrix = weight_matrix[start:end, :]
    else:  # 'o' matrix
        weight_matrix = weight_matrix[:, start:end]

    # Compute singular values
    singular_values = np.linalg.svd(weight_matrix, compute_uv=False)

    del model, config
    return list(singular_values)
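With this function in place, we can compute and plot the spectrum of any head. Here is a minimal usage sketch (the matrix type, layer, and head below are arbitrary examples, and matplotlib was imported above for exactly this purpose):

model_path = f"/tmp/{MODEL_ID}"
singular_values = get_singular_values(model_path, matrix_type='q', layer_number=0, head_number=0)

plt.plot(singular_values)
plt.title("Singular values of q_proj (layer 0, head 0)")
plt.xlabel("Index")
plt.ylabel("Singular value")
plt.show()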
It’s worth noting that we can extract the weights for the specified head from the Q, K, and V matrices by doing row-wise slicing, because of how they are implemented by HuggingFace.
The projection weights are stored with shape (d_out, d_in). Source: Image by Author.
As for the O matrix, we can do column-wise slicing to extract the weights for the specified head from the O weight, thanks to linear algebra! Details can be seen in the following figure, and the short numerical check below makes the same point concrete.
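Here is a small self-contained NumPy sketch with toy dimensions (not Llama-3’s actual sizes). For Q, K, and V, a head’s output occupies a contiguous block of rows of the weight, so row slicing isolates it; the O projection acts on the concatenation of all head outputs, so a head’s contribution corresponds to a contiguous block of columns:

import numpy as np

rng = np.random.default_rng(0)
d_model, num_heads = 8, 2
head_dim = d_model // num_heads

W_q = rng.standard_normal((d_model, d_model))  # toy q_proj weight, shape (d_out, d_in)
W_o = rng.standard_normal((d_model, d_model))  # toy o_proj weight
x = rng.standard_normal(d_model)

# Row-wise slicing: head 0's query is exactly the first head_dim rows of W_q applied to x
q = W_q @ x
q_head0 = W_q[:head_dim, :] @ x
assert np.allclose(q[:head_dim], q_head0)

# Column-wise slicing: W_o multiplies the concatenated head outputs, so its
# output is the sum of per-head column blocks applied to each head's output
heads = [rng.standard_normal(head_dim) for _ in range(num_heads)]
out_full = W_o @ np.concatenate(heads)
out_blocks = sum(W_o[:, h * head_dim:(h + 1) * head_dim] @ heads[h] for h in range(num_heads))
assert np.allclose(out_full, out_blocks)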