Why Customize LLMs?
Large Language Models (LLMs) are deep learning models pre-trained via self-supervised learning, requiring vast amounts of training data, training time, and a large number of parameters. LLMs have revolutionized natural language processing, especially in the last two years, demonstrating remarkable capabilities in understanding and generating human-like text. However, these general-purpose models' out-of-the-box performance may not always meet specific business needs or domain requirements. LLMs alone cannot answer questions that rely on proprietary company data or closed-book settings, making them relatively generic in their applications. Training an LLM from scratch is largely infeasible for small-to-medium teams due to the demand for massive amounts of training data and resources. Therefore, a wide range of LLM customization strategies have been developed in recent years to tune the models for various scenarios that require specialized knowledge.
The customization strategies can be broadly split into two types:
- Using a frozen model: These techniques don't require updating model parameters and are typically accomplished through in-context learning or prompt engineering. They are cost-effective since they alter the model's behavior without incurring extensive training costs, and are therefore widely explored in both industry and academia, with new research papers published daily.
- Updating model parameters: This is a relatively resource-intensive approach that requires tuning a pre-trained LLM with custom datasets designed for the intended purpose. It includes popular techniques like fine-tuning and Reinforcement Learning from Human Feedback (RLHF).
These two broad customization paradigms branch out into various specialized techniques, including LoRA fine-tuning, Chain of Thought, Retrieval Augmented Generation, ReAct, and Agent frameworks. Each technique offers distinct advantages and trade-offs regarding computational resources, implementation complexity, and performance improvements.
How to Choose LLMs?
The first step in customizing LLMs is to select the appropriate foundation model as the baseline. Community-based platforms such as Hugging Face offer a wide range of open-source pre-trained models contributed by top companies or communities, such as the Llama series from Meta and Gemma from Google. Hugging Face additionally provides leaderboards, for example the "Open LLM Leaderboard", to compare LLMs based on industry-standard metrics and tasks (e.g. MMLU). Cloud providers (e.g., AWS) and AI companies (e.g., OpenAI and Anthropic) also offer access to proprietary models, which are typically paid services with restricted access. The following factors are essentials to consider when choosing an LLM.
Open-source or proprietary model: Open-source models allow full customization and self-hosting but require technical expertise, while proprietary models offer instant access and often higher-quality responses, but at higher cost.
Task and metrics: Models excel at different tasks, including question answering, summarization, code generation, etc. Compare benchmark metrics and test on domain-specific tasks to determine the appropriate model.
Architecture: In general, decoder-only models (GPT series) perform better at text generation, while encoder-decoder models (T5) handle translation well. More architectures are emerging and showing promising results, for instance the Mixture of Experts (MoE) model DeepSeek.
Number of parameters and size: Larger models (70B-175B parameters) offer better performance but need more computing power. Smaller models (7B-13B) run faster and cheaper but may have reduced capabilities.
After determining the base LLM, let's explore the six most common strategies for LLM customization, ranked in order of resource consumption from least to most intensive:
- Prompt Engineering
- Decoding and Sampling Strategy
- Retrieval Augmented Generation
- Agent
- Fine-Tuning
- Reinforcement Learning from Human Feedback
If you'd prefer a video walkthrough of these concepts, please check out my video on "6 Common LLM Customization Strategies Briefly Explained".
LLM Customization Strategies
1. Prompt Engineering

A prompt is the input text sent to an LLM to elicit an AI-generated response, and it can be composed of instructions, context, input data, and an output indicator.
Instructions: This provides a task description or instruction for how the model should perform.
Context: This is external information to guide the model to respond within a certain scope.
Input data: This is the input for which you want a response.
Output indicator: This specifies the output type or format.
Prompt engineering involves crafting these prompt components strategically to shape and control the model's response. Basic prompt engineering techniques include zero-shot, one-shot, and few-shot prompting. Users can apply these basic techniques directly while interacting with the LLM, making prompt engineering an efficient way to align the model's behavior to a novel objective. API implementation is also an option, and more details are introduced in my previous article "A Simple Pipeline for Integrating LLM Prompt with Knowledge Graph".
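For illustration, below is a minimal sketch (plain Python, with a made-up sentiment-classification task) of a few-shot prompt assembled from the four components above; the wording of each component is purely illustrative.

import textwrap

instruction = "Classify the sentiment of the customer review as positive or negative."
context = "Reviews come from an online electronics store."
examples = textwrap.dedent("""\
    Review: "Battery died after two days." Sentiment: negative
    Review: "Crystal-clear screen, love it!" Sentiment: positive
""")
input_data = 'Review: "Shipping was slow but the product works fine."'
output_indicator = "Sentiment:"

# Combine instruction, context, few-shot examples, input data and output indicator
# into a single prompt string that can be sent to any LLM completion or chat API.
prompt = f"{instruction}\n{context}\n\n{examples}{input_data}\n{output_indicator}"
print(prompt)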
Owing to the efficiency and effectiveness of prompt engineering, more complex approaches have been explored and developed to advance the logical structure of prompts.
Chain of Thought (CoT) asks LLMs to break down complex reasoning tasks into step-by-step thought processes, improving performance on multi-step problems. Each step explicitly exposes its reasoning outcome, which serves as the precursor context for subsequent steps until arriving at the answer.
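As a quick illustration (a minimal sketch with a made-up arithmetic question, not taken from the CoT paper), a zero-shot CoT prompt can be as simple as appending a reasoning cue to the question:

question = (
    "A bakery bakes 7 trays of muffins with 12 muffins per tray. "
    "15 muffins are left unsold. How many muffins were sold?"
)

# The trailing cue nudges the model to expose its intermediate reasoning steps
# before producing the final answer.
cot_prompt = f"Q: {question}\nA: Let's think step by step."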
Tree of Thoughts extends CoT by considering multiple different reasoning branches and self-evaluating choices to decide the next best action. It is more effective for tasks that involve initial decisions, strategies for the future, and exploration of multiple solutions.
Automatic Reasoning and Tool-use (ART) builds upon the CoT process: it deconstructs complex tasks and allows the model to select few-shot examples from a task library, using predefined external tools like search and code generation.
Synergizing Reasoning and Acting (ReAct) combines reasoning trajectories with an action space, where the model searches through the action space and determines the next best action based on environmental observations.
Techniques like CoT and ReAct are often combined with an agentic workflow to strengthen their capabilities. These techniques will be introduced in more detail in the "Agent" section.
Further Reading
2. Decoding and Sampling Strategy

The decoding strategy can be controlled at model inference time through inference parameters (e.g. temperature, top p, top k), determining the randomness and diversity of model responses. Greedy search, beam search, and sampling are three common decoding strategies for autoregressive model generation.
During the autoregressive generation process, the LLM outputs one token at a time based on a probability distribution over candidate tokens conditioned on the previous tokens. By default, greedy search is applied to produce the next token with the highest probability.
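For reference, here is a minimal sketch of default greedy generation with the transformers library, using the same setup as the beam-search snippet below; model_name and prompt are placeholders for any causal LM checkpoint and input string.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)   # placeholder checkpoint name
model = AutoModelForCausalLM.from_pretrained(model_name)
inputs = tokenizer(prompt, return_tensors="pt")         # prompt is any input string

# With no num_beams or do_sample arguments, generate() defaults to greedy search:
# at each step it keeps only the single highest-probability next token.
greedy_output = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))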
In contrast, beam search decoding considers multiple hypotheses of next-best tokens and selects the hypothesis with the highest combined probability across all tokens in the text sequence. The code snippet below uses the transformers library to specify the number of beam paths (e.g. num_beams=5 considers 5 distinct hypotheses) during the model generation process.
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
inputs = tokenizer(prompt, return_tensors="pt")
model = AutoModelForCausalLM.from_pretrained(model_name)
outputs = model.generate(**inputs, num_beams=5)
Sampling is the third approach to control the randomness of model responses, by adjusting these inference parameters:
- Temperature: Lowering the temperature makes the probability distribution sharper by increasing the likelihood of generating high-probability words and decreasing the likelihood of generating low-probability words. When temperature = 0, it becomes equivalent to greedy search (least creative); when temperature = 1, it produces the most creative outputs (a short sketch after this list illustrates the effect).
- Top K sampling: This method filters the K most probable next tokens and redistributes the probability among those tokens. The model then samples from this filtered set of tokens.
- Top P sampling: Instead of sampling from the K most probable tokens, top-p sampling selects from the smallest possible set of tokens whose cumulative probability exceeds the threshold p.
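The snippet below is a standalone sketch (plain PyTorch with made-up logits, not tied to any particular model) of how temperature reshapes the next-token distribution before sampling.

import torch

logits = torch.tensor([2.0, 1.0, 0.1])  # made-up scores for three candidate tokens

for temperature in (0.5, 1.0, 2.0):
    # Dividing logits by the temperature sharpens (<1) or flattens (>1)
    # the softmax distribution from which the next token is sampled.
    probs = torch.softmax(logits / temperature, dim=-1)
    print(temperature, probs.tolist())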
The example code snippet below samples from the top 50 most likely tokens (top_k=50) with a cumulative probability higher than 0.95 (top_p=0.95).
sample_outputs = model.generate(
**model_inputs,
max_new_tokens=40,
do_sample=True,
top_k=50,
top_p=0.95,
num_return_sequences=3,
)
Further Reading
3. RAG

Retrieval Augmented Generation (RAG), initially introduced in the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", has been demonstrated as a promising solution that integrates external knowledge and reduces common LLM "hallucination" issues when handling domain-specific or specialized queries. RAG allows dynamically pulling relevant information from the knowledge domain and generally does not involve extensive training to update LLM parameters, making it a cost-effective strategy to adapt a general-purpose LLM to a specialized domain.
A RAG system can be decomposed into a retrieval stage and a generation stage. The objective of the retrieval process is to find content within the knowledge base that is closely related to the user query, through chunking external knowledge, creating embeddings, indexing, and similarity search.
- Chunking: Documents are divided into smaller segments, with each segment containing a distinct unit of information.
- Create embeddings: An embedding model compresses each information chunk into a vector representation. The user query is also converted into its vector representation through the same vectorization process, so that the user query can be compared in the same dimensional space.
- Indexing: This process stores the text chunks and their vector embeddings as key-value pairs, enabling efficient and scalable search functionality. For large external knowledge bases that exceed memory capacity, vector databases offer efficient long-term storage.
- Similarity search: Similarity scores between the query embedding and the text chunk embeddings are calculated and used to retrieve information highly relevant to the user query.
The generation process of the RAG system then combines the retrieved information with the user query to form the augmented query, which is passed to the LLM to generate the context-rich response.
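To make the two stages concrete, here is a minimal sketch of retrieval and query augmentation without a vector database, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint as the embedding model (both are illustrative choices, not requirements of RAG).

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding library

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed small embedding model

chunks = [
    "LoRA fine-tuning trains low-rank weight update matrices.",
    "RAG retrieves external knowledge to ground LLM answers.",
]
chunk_vecs = embedder.encode(chunks)  # one embedding vector per chunk

query = "How does RAG reduce hallucination?"
q_vec = embedder.encode(query)

# Retrieval stage: cosine similarity between the query and every chunk.
scores = chunk_vecs @ q_vec / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec))
best_chunk = chunks[int(scores.argmax())]

# Generation stage: augment the user query with the retrieved context
# before sending it to the LLM.
augmented_prompt = f"Context:\n{best_chunk}\n\nQuestion: {query}\nAnswer:"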
Code Snippet
The code snippet first specifies the LLM and embedding model, then chunks the external knowledge base documents into a single document, creates the index from the document, defines the query_engine based on the index, and queries the query_engine with the user prompt.
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Document, Settings, VectorStoreIndex

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = "local:BAAI/bge-small-en-v1.5"  # resolved to a local HuggingFace embedding model

# "documents" is assumed to be loaded beforehand (e.g. via SimpleDirectoryReader)
document = Document(text="\n\n".join([doc.text for doc in documents]))
index = VectorStoreIndex.from_documents([document])
query_engine = index.as_query_engine()
response = query_engine.query(
    "Tell me about LLM customization strategies."
)
The example above shows a simple RAG system. Advanced RAG improves on this by introducing pre-retrieval and post-retrieval strategies to reduce pitfalls such as limited synergy between the retrieval and generation processes. For example, the rerank technique reorders the retrieved information using a model capable of understanding bidirectional context, and integration with a knowledge graph enables advanced query routing. More use cases can be found on the LlamaIndex website.
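As one illustration of a post-retrieval strategy, the sketch below attaches a cross-encoder reranker to the query engine built above, assuming LlamaIndex's SentenceTransformerRerank post-processor and an illustrative cross-encoder checkpoint.

from llama_index.core.postprocessor import SentenceTransformerRerank

# Rerank the retrieved chunks with a cross-encoder before generation,
# keeping only the top 3 (checkpoint name is an illustrative choice).
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3
)
query_engine = index.as_query_engine(
    similarity_top_k=10,             # retrieve broadly first
    node_postprocessors=[reranker],  # then reorder and trim the results
)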
Further Reading
4. Agent

LLM agents were a trending topic in 2024 and will likely remain a key focus of the GenAI field in 2025. Compared to RAG, agents excel at creating query routes and planning LLM-based workflows, with the following benefits:
- Maintaining memory and state of previous model-generated responses.
- Leveraging various tools based on specific criteria. This tool-using capability sets agents apart from basic RAG systems by giving the LLM independent control over tool selection.
- Breaking down a complex task into smaller steps and planning a sequence of actions.
- Collaborating with other agents to form an orchestrated system.
Several in-context learning techniques (e.g. CoT, ReAct) can be implemented through an agentic framework, and we will discuss ReAct in more detail. ReAct, which stands for "Synergizing Reasoning and Acting in Language Models", consists of three key components: actions, thoughts, and observations. The framework was introduced by Google Research and Princeton University, building upon Chain of Thought by integrating the reasoning steps with an action space that enables tool use and function calling. Additionally, the ReAct framework emphasizes determining the next best action based on environmental observations.
The example from the original paper demonstrates ReAct's inner working process, where the LLM generates the first thought and acts by calling the function "Search [Apple Remote]", then observes the feedback from its first output. The second thought is then based on the previous observation, leading to a different action, "Search [Front Row]". This process iterates until reaching the goal. The research shows that, by interacting with a simple Wikipedia API, ReAct overcomes the prevalent issues of hallucination and error propagation that are more often observed in chain-of-thought reasoning. Furthermore, through the implementation of decision traces, the ReAct framework also increases the model's interpretability, trustworthiness, and diagnosability.

Code Snippet
This demonstrates a ReAct-based agent implementation using LlamaIndex. First, it defines two functions (multiply and add). Second, these two functions are encapsulated as FunctionTool, forming the agent's action space, to be executed based on its reasoning.
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool

# create basic function tools
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

multiply_tool = FunctionTool.from_defaults(fn=multiply)

def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

add_tool = FunctionTool.from_defaults(fn=add)

# llm refers to the LLM instance defined earlier in the workflow
agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)
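To exercise the agent, a call like the one below (a minimal usage sketch; the exact trace depends on the chosen LLM) lets the ReAct loop pick and chain the two tools.

# verbose=True prints the interleaved Thought / Action / Observation steps.
response = agent.chat("What is 20 multiplied by 3, plus 8?")
print(response)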
The advantages of an agentic workflow are more substantial when combined with self-reflection or self-correction. It is a rapidly growing field, with a variety of agent architectures being explored. For instance, the Reflexion framework facilitates iterative learning by providing a summary of verbal feedback from the environment and storing the feedback in the model's memory, while the CRITIC framework empowers frozen LLMs to self-verify through interaction with external tools such as code interpreters and API calls.
Further Reading
5. Fine-Tuning

Fine-tuning is the process of feeding niche and specialized datasets to the LLM to modify it so that it is more aligned with a certain objective. It differs from prompt engineering and RAG in that it updates the LLM's weights and parameters. Full fine-tuning refers to updating all weights of the pretrained LLM through backpropagation, which requires large memory to store all weights and parameters and may suffer from a significant reduction in capability on other tasks (i.e. catastrophic forgetting). Therefore, PEFT (parameter-efficient fine-tuning) is more widely used to mitigate these caveats while saving the time and cost of model training. There are three categories of PEFT methods:
- Selective: Select a subset of the initial LLM parameters to fine-tune, which can be more computationally intensive compared to other PEFT methods.
- Reparameterization: Adjust model weights by training the weights of low-rank representations. For example, Low-Rank Adaptation (LoRA) falls into this category; it accelerates fine-tuning by representing the weight updates with two smaller matrices (see the sketch after this list).
- Additive: Add extra trainable layers to the model, including techniques like adapters and soft prompts.
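As a minimal sketch of the reparameterization idea, the snippet below wraps a causal LM with a LoRA adapter using Hugging Face's peft library; the rank, target modules, and placeholder model_name are illustrative choices, not prescriptions.

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(model_name)  # placeholder checkpoint

# The targeted weight matrices stay frozen; only small low-rank update
# matrices are trained, and their product is added to the frozen weights.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                 # rank of the low-rank update matrices
    lora_alpha=32,       # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (model-dependent)
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # prints the small fraction of trainable weights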
The fine-tuning process is similar to a deep learning training process, requiring the following inputs:
- training and evaluation datasets
- training arguments defining the hyperparameters, e.g. learning rate, optimizer
- pretrained LLM
- compute metrics and objective functions that the algorithm should be optimized for
Code Snippet
Below is an example of implementing fine-tuning using the transformers Trainer.
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    eval_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
Fine-tuning has a wide range of use cases. For instance, instruction fine-tuning optimizes LLMs for conversations and instruction following by training them on prompt-completion pairs. Another example is domain adaptation, an unsupervised fine-tuning method that helps LLMs specialize in specific knowledge domains.
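For illustration, an instruction fine-tuning record often looks something like the hypothetical prompt-completion pair below; real datasets use varying field names and formats.

# Hypothetical instruction-tuning example (field names are illustrative only).
instruction_example = {
    "prompt": "Summarize the following ticket in one sentence:\n"
              "Customer reports the mobile app crashes when uploading photos.",
    "completion": "The mobile app crashes during photo uploads.",
}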
Further Reading
6. RLHF

Reinforcement Learning from Human Feedback (RLHF) is a reinforcement learning technique that fine-tunes LLMs based on human preferences. RLHF operates by training a reward model on human feedback and using this model as a reward function to optimize a reinforcement learning policy through PPO (Proximal Policy Optimization). The process requires two sets of training data: a preference dataset for training the reward model, and a prompt dataset used in the reinforcement learning loop.
Let’s break it down into steps:
- Gather a preference dataset annotated by human labelers who rate different completions generated by the model based on human preference. An example format of the preference dataset is {input_text, candidate1, candidate2, human_preference}, indicating which candidate response is preferred.
- Train a reward model using the preference dataset. The reward model is essentially a regression model that outputs a scalar indicating the quality of the model-generated response. The objective of the reward model is to maximize the score margin between the winning candidate and the losing candidate (a short sketch of this pairwise objective follows this list).
- Use the reward model in a reinforcement learning loop to fine-tune the LLM. The objective is to update the policy so that the LLM generates responses that maximize the reward produced by the reward model. This process uses the prompt dataset, which is a collection of prompts in the format of {prompt, response, rewards}.
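The snippet below is a standalone PyTorch sketch (made-up reward values, no real model) of the pairwise objective commonly used when training reward models, where the loss pushes the winning candidate's score above the losing one's.

import torch
import torch.nn.functional as F

# Made-up scalar rewards the reward model assigned to each candidate in a batch.
chosen_rewards = torch.tensor([1.2, 0.3, 2.0])    # preferred completions
rejected_rewards = torch.tensor([0.4, 0.5, 1.1])  # dispreferred completions

# Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected), which is
# minimized when the chosen score exceeds the rejected score by a wide margin.
loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
print(loss.item())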
Code Snippet
The open-source library TRL (Transformer Reinforcement Learning) is widely used for implementing RLHF, and it provides template code that shows the basic RLHF setup:
- Initialize the base model and tokenizer from a pretrained checkpoint
- Configure the PPO hyperparameters in PPOConfig, like learning rate, epochs, and batch sizes
- Create the PPO trainer PPOTrainer by combining the model, tokenizer, and training data
- The training loop uses the step() method to iteratively update the model to optimize the rewards calculated from the query and model response
# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler
from transformers import AutoTokenizer

# define the hyperparameters of the PPO algorithm
config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

# initiate the pretrained model and tokenizer
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

# initiate the PPO trainer with the model, tokenizer and training data
ppo_trainer = PPOTrainer(
    config=config,
    model=ppo_model,
    tokenizer=tokenizer,
    dataset=dataset["train"],
    data_collator=collator
)

# ppo_trainer is iteratively updated through the rewards
ppo_trainer.step(query_tensors, response_tensors, rewards)
RLHF is widely applied for aligning model responses with human preference. Common use cases involve reducing response toxicity and model hallucination. However, it does have the downside of requiring a large amount of human-annotated data as well as the computation costs associated with policy optimization. Therefore, alternatives like Reinforcement Learning from AI Feedback (RLAIF) and Direct Preference Optimization (DPO) have been introduced to mitigate these limitations.
Further Reading
Take-Home Message
This article briefly explains six essential LLM customization strategies: prompt engineering, decoding strategy, RAG, Agent, fine-tuning, and RLHF. I hope you find it helpful for understanding the pros and cons of each strategy, as well as how to implement them based on the practical examples.