It’s been a while since I wrote a blog on a slightly technical topic. Today is the day!
It’s 2025, and LLMs have gained the ability to “reason” their way through tricky math problems. It’s not magic; it’s the result of specialized post-training. While pre-trained LLMs are incredibly knowledgeable, they aren’t born problem-solvers. To get them to excel at complex tasks like mathematical reasoning, we need to fine-tune them.
In this guide, we’ll learn how to do just that. We’re going to take a powerful base model, Qwen3-1.7B-Base, and teach it to reason using a clever RL technique called GRPO and the speed-boosting Unsloth library. Let’s get started!
What’s a Base Model?
A base model is the raw, foundational LLM that has been trained on a massive corpus of text data. Its core capability is simply predicting the next word. It’s incredibly knowledgeable but doesn’t inherently know how to follow instructions or engage in a conversation. Think of it as a brilliant but untamed engine of knowledge.
What’s a Chat/Instruct Model?
A chat model (or instruction-tuned model) is a base model that has undergone a second stage of training. This alignment phase, often using techniques like Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), teaches the model to be helpful, harmless, and follow user instructions in a conversational format. This process gives the model a specific “personality” and a strong bias towards a certain style of response.
What is GRPO and How Does It Work?
Group Relative Policy Optimization (GRPO) is an advanced reinforcement learning (RL) technique designed to efficiently enhance a language model’s reasoning capabilities. To understand its benefits, we first need to look at the method it improves upon: Proximal Policy Optimization (PPO).
Traditional RL fine-tuning with PPO is notoriously expensive because it requires loading four large models into GPU memory: a Policy Model (the one being trained), a Reference Model, a Reward Model, and a Value Model. The Value Model, which is also trainable, estimates the potential for long-term rewards but adds significant complexity and memory overhead.
GRPO completely removes the need for the Value Model. This single change significantly cuts down computational requirements, making advanced RL fine-tuning more accessible.
It simply replaces the complex value estimation with a clever, three-step process based on group statistics:
Generate a Group of Outputs: Instead of creating a single response, the policy model is prompted to generate a group of varied responses for a given prompt.
Calculate Rewards: Each of these generated outputs is then scored by a reward function (or a separate Reward Model). For a reasoning task, rewards might be based on correct formatting or mathematical accuracy.
Estimate Advantage from the Group: This is the key step. GRPO calculates the “advantage” for each response—a signal telling the model whether to reinforce or discourage that type of output. It does this by normalizing each response’s reward against the mean and standard deviation of the entire group’s rewards.
The advantage is calculated with this simple formula:
\[ \hat{A}_{i,t} = \frac{r_i - \text{mean}(r)}{\text{std}(r)} \]
Where:
- \(r_i\) is the reward for a specific output.
- \(\text{mean}(r)\) is the average reward of all outputs in the group.
- \(\text{std}(r)\) is the standard deviation of all rewards.
This formula essentially asks, “How good or bad is this specific response compared to the average of all responses the model just generated for this prompt?”. An output with a reward far above the average gets a high positive advantage, strongly reinforcing that reasoning path. An output below the average gets a negative advantage, penalizing it. This group-based comparison provides a robust, on-the-fly baseline without needing a separate, memory-hungry Value LLM.
Imagine the prompt is “What is 7 * 6?”. The model generates a group of 3 responses that are then scored:
| Response | Reason | Reward |
|---|---|---|
| A: <think>7*6 is 42</think><SOLUTION>42</SOLUTION> | Correct format & answer | +4.0 |
| B: <think>7*6 is 41</think><SOLUTION>41</SOLUTION> | Correct format, wrong answer | +1.0 |
| C: The answer is 42. | Wrong format, “correct” answer | -2.0 |
GRPO then calculates the group’s statistics:
- Mean reward: (4.0 + 1.0 - 2.0) / 3 = 1.0
- Advantage for Response A: (4.0 - 1.0) / std_dev = a high positive value (strongly reinforce!)
- Advantage for Response C: (-2.0 - 1.0) / std_dev = a high negative value (strongly penalize!)
This process allows the model to learn to prefer the structure and accuracy of Response A without needing a separate Value Model to make that judgment.
This group-based comparison provides a robust, on-the-fly baseline that effectively guides the model toward better reasoning.
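To make the arithmetic above concrete, here is a minimal Python sketch (not part of the training code we will write later) that computes group-relative advantages for the example rewards:

import numpy as np

def group_relative_advantages(rewards, eps = 1e-6):
    # Normalize each reward against the group's mean and standard deviation
    rewards = np.asarray(rewards, dtype = np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Rewards for Responses A, B, C from the "7 * 6" example above
print(group_relative_advantages([4.0, 1.0, -2.0]))
# -> approximately [ 1.22  0.   -1.22]

Response A ends up with a strong positive advantage and Response C with a strong negative one, which is exactly the signal GRPO feeds back into the policy update.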
Why Use Unsloth?
Unsloth is a powerful library designed to make fine-tuning LLMs faster and more memory-efficient. It achieves this through several optimizations, including:
- Faster training: Unsloth can significantly speed up the training process, in some cases by a factor of 2x or more.
- Reduced memory usage: It allows for fine-tuning larger models on consumer-grade hardware by reducing the memory footprint.
- Ease of use: Unsloth provides a user-friendly API that simplifies the fine-tuning workflow.
Now, let’s dive into the code and the fine-tuning process.
The Post-Training Process
1. Setting Up the Environment
The first step is to install the necessary libraries. The script provides commands for both a standard Python environment and a Google Colab instance.
# For a standard environment
!pip install unsloth vllm

# For Google Colab
!pip install --no-deps unsloth vllm==0.8.5.post1
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
These commands install Unsloth for post-training, vLLM for fast inference, and other essential libraries like PEFT (Parameter-Efficient Fine-Tuning), TRL (Transformer Reinforcement Learning), and Datasets.
2. Loading the Model and Preparing for PEFT
Next, we load the Qwen3-1.7B-Base model using Unsloth’s FastLanguageModel
. We also configure it for PEFT using LoRA (Low-Rank Adaptation).
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
lora_rank = 32

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-1.7B-Base",
    max_seq_length = max_seq_length,
    load_in_4bit = False, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.7, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank*2, # *2 speeds up training
    use_gradient_checkpointing = "unsloth", # Reduces memory usage
    random_state = 3407,
)
- FastLanguageModel.from_pretrained: This function from Unsloth loads the model and tokenizer with optimizations for speed and memory.
- max_seq_length: This defines the maximum number of tokens the model can handle in a single input.
- lora_rank: This is a key parameter for LoRA. It determines the rank of the matrices that are used to approximate the weight updates. A larger rank can lead to a more “intelligent” model but at the cost of slower training and higher memory usage.
- get_peft_model: This function prepares the model for PEFT by adding LoRA adapters to the specified target_modules.
- lora_alpha: This is a scaling factor for the LoRA updates. Think of lora_alpha as controlling how much importance is given to the new, fine-tuned weights versus the original model weights. Setting it to twice the lora_rank is a common heuristic that provides a good starting point for stable training.
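To build intuition for what lora_rank and lora_alpha actually control, here is a toy numpy sketch of how a LoRA update is composed. This illustrates the general LoRA idea only, not Unsloth’s internal implementation, and the dimensions are made up:

import numpy as np

d, r = 64, 32          # toy hidden size and lora_rank
lora_alpha = 2 * r     # the alpha = 2 * rank heuristic mentioned above

W = np.random.randn(d, d)            # frozen pretrained weight (not trained)
A = np.random.randn(r, d) * 0.01     # trainable low-rank factor
B = np.zeros((d, r))                 # trainable low-rank factor, starts at zero

delta_W = (lora_alpha / r) * (B @ A) # scaled low-rank update, initially zero
W_effective = W + delta_W            # what the forward pass effectively uses

Only A and B (roughly 2 * d * r values) are trained instead of the full d * d matrix, which is why LoRA is so memory-friendly.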
3. Crafting a Custom Chat Template for GRPO
Before we can train our model, we need to teach it how to structure its responses. We do this by defining a chat template. This template acts as a blueprint, guiding the model to generate output in a consistent, predictable format that is ideal for our reasoning task.
Most Large Language Models are, at their core, text-completion engines. They don’t inherently understand the back-and-forth nature of a conversation. A chat template is a set of rules that tells the tokenizer how to convert a structured list of messages (from a “user” and “assistant”) into a single, formatted string the model can process.
Instruction-tuned models (like Llama-3-Instruct) already have a built-in template. Since we are using a base model, it has no default conversational format, so we must create our own.
Our goal is to create a template that forces the model to “show its work” and provide a clear final answer. To do this, we’ll build the template in three main steps.
Step 1 : Define the Structure with Special Tokens
First, we’ll define the special tags that will act as dividers in the model’s output. This allows our reward functions to easily parse the response later.
# Define the special tokens that will structure the output
= "<think>"
reasoning_start = "</think>"
reasoning_end = "<SOLUTION>"
solution_start = "</SOLUTION>" solution_end
Our desired output format will look like this:
<think>
...the model's step-by-step reasoning goes here...
</think>
<SOLUTION>
...the model's final answer goes here...
</SOLUTION>
Step 2 : Create the System Prompt and Jinja Template
Next, we create the instructions for the model. This involves two parts: a high-level system prompt telling the model its role, and a Jinja2 template that programmatically assembles the conversation for the model.
# Create the system prompt that instructs the model on the format
system_prompt = \
f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""

# Define the Jinja2 template for the tokenizer
chat_template = \
    "{% if messages[0]['role'] == 'system' %}"\
        "{{ messages[0]['content'] + eos_token }}"\
        "{% set loop_messages = messages[1:] %}"\
    "{% else %}"\
        "{{ '{system_prompt}' + eos_token }}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ message['content'] }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ message['content'] + eos_token }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}{{ '{reasoning_start}' }}"\
    "{% endif %}"
The Jinja template is the core logic. It loops through the conversation history and formats it correctly. The most important line is the last one: {% if add_generation_prompt %}
. This tells the tokenizer to automatically add our <think>
token whenever it’s the model’s turn to speak, kicking off the required reasoning process.
Step 3: Apply the Template to the Tokenizer
Finally, we inject our custom system_prompt into the Jinja template and assign the completed template to our tokenizer. This makes our custom format the official rulebook for all future conversations.
# Inject our custom system prompt and starting token into the template
chat_template = chat_template\
    .replace("'{system_prompt}'",   f"'{system_prompt}'")\
    .replace("'{reasoning_start}'", f"'{reasoning_start}'")

# Finally, apply this template to the tokenizer
tokenizer.chat_template = chat_template
By enforcing this structure, we make the process of rewarding the model for good reasoning both programmatic and reliable. It’s the foundation upon which our entire GRPO training strategy is built.
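If you want to sanity-check the template at this point, you can render a one-turn conversation with the tokenizer we just configured. This is an optional check, not part of the training pipeline:

# Optional: render a sample conversation with our custom template
sample = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "What is 7 * 6?"},
]
print(tokenizer.apply_chat_template(sample, add_generation_prompt = True, tokenize = False))
# Prints the system prompt, the EOS token, the question, and then "<think>",
# i.e. the model is primed to start reasoning immediately.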
4. Pre-Finetuning for Formatting
Before we let the model learn through trial-and-error with GRPO, we first give it a head start with a short phase of Supervised Fine-Tuning (SFT). Why? Reinforcement learning is most effective when the model already has a rough idea of what to do. If a base model has never seen our <think>
format, it will generate random, unstructured text. Rewarding the rare occasions it gets the format right is highly inefficient.
This SFT step acts as behavioral cloning. We show the model a few hundred examples of the exact format we want. This quickly teaches it the basic structure, making the subsequent GRPO phase much more stable and focused on improving the reasoning within the format, rather than just learning the format itself.
We use a small subset of NVIDIA’s Open Math Reasoning dataset for this.
Loading and Formatting the SFT Dataset
First, we load the dataset using Hugging Face’s datasets
library. We’ll filter it to only include problems with numerical answers to keep things simple for this pre-tuning step.
from datasets import load_dataset
import pandas as pd
import numpy as np
dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
dataset = dataset.to_pandas()[
    ["expected_answer", "problem", "generated_solution"]
]

# Keep only samples where the answer is a number
is_number = pd.to_numeric(pd.Series(dataset["expected_answer"]), errors = "coerce").notnull()
dataset = dataset.iloc[np.where(is_number)[0]]
Next, we create a function to reformat each row into the custom chat structure we defined earlier. This function takes the existing reasoning trace and wraps it with the special tokens we defined above (reasoning_start, reasoning_end, solution_start, solution_end).
def format_sft_dataset(x):
    # Reformat the existing solution to match our template
    thoughts = x["generated_solution"].replace("<think>", "").replace("</think>", "").strip()

    # Construct the final response format
    final_prompt = \
        reasoning_start + thoughts + reasoning_end + \
        solution_start + x["expected_answer"] + solution_end

    # Return the message structure
    return [
        {"role" : "system",    "content" : system_prompt},
        {"role" : "user",      "content" : x["problem"]},
        {"role" : "assistant", "content" : final_prompt},
    ]

dataset["Messages"] = dataset.apply(format_sft_dataset, axis = 1)
Finally, we convert our pandas DataFrame back into a Hugging Face Dataset object, which the SFTTrainer expects.
from datasets import Dataset
"text"] = tokenizer.apply_chat_template(
dataset["Messages"].values.tolist(), tokenize = False
dataset[
)= Dataset.from_pandas(dataset) dataset
Now that our dataset is prepared, we can pass it to the SFTTrainer.
import numpy as np
from trl import SFTTrainer, SFTConfig
# The dataset was loaded from "unsloth/OpenMathReasoning-mini", filtered,
# and formatted with format_sft_dataset above.

# Create the SFT Trainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset, # The formatted dataset
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 1, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 2, # Set this for 1 full training run.
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use "wandb" for Weights & Biases
    ),
)

# Start the pre-finetuning
trainer.train()
In effect, each training example pairs a prompt, made up of our system message and the user’s question, with an answer containing the expected solution in our required format.
5. Defining the Reward System
With the model primed on formatting, it’s time to set up the main GRPO training loop. This starts with preparing our primary dataset and defining the reward functions that will guide the learning process.
Loading the GRPO Dataset
For the main RL phase, we’ll use the open-r1/DAPO-Math-17k-Processed
dataset. We’ll map it to a simple structure containing the prompt
and the ground-truth answer
.
from datasets import load_dataset
# Load the main dataset for GRPO
= load_dataset("open-r1/DAPO-Math-17k-Processed", "en", split = "train")
dataset
# Map it to our required format
= dataset.map(lambda x: {
dataset "prompt" : [
"role": "system", "content": system_prompt},
{"role": "user", "content": x["prompt"]},
{
],"answer": x["solution"],
})
Reward Functions
The core of GRPO is scoring the model’s generated outputs. We’ll use four reward functions, described below. The GRPOTrainer sums the scores from all of them to get a final reward for each generation.
match_format_exactly: We give the model a large positive reward if the response perfectly follows our defined structure. We use a regular expression to check that the </think>, <SOLUTION>, and </SOLUTION> tags appear in the correct order.
import re
# We pre-compile regex for efficiency
= r"</SOLUTION>[\s]{0,}" + "(?:" + re.escape(tokenizer.eos_token) + ")?"
solution_end_regex = re.compile(
match_format rf"{reasoning_end}.*?{solution_start}(.+?){solution_end_regex}",
= re.MULTILINE | re.DOTALL
flags
)
def match_format_exactly(completions, **kwargs):
= []
scores for completion in completions:
= 0
score = completion[0]["content"]
response # Match if format is seen exactly!
if match_format.search(response) is not None: score += 3.0
scores.append(score)return scores
match_format_approximately: If the format isn’t perfect, we still give the model partial credit. We check that each required tag (</think>, <SOLUTION>, </SOLUTION>) appears exactly once, adding a small reward for each one found and penalizing the model if a tag is missing or duplicated. This encourages the model to at least try to follow the format.
def match_format_approximately(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Count how many keywords are seen - we penalize if too many!
        score += 0.5 if response.count(reasoning_end)  == 1 else -1.0
        score += 0.5 if response.count(solution_start) == 1 else -1.0
        score += 0.5 if response.count(solution_end)   == 1 else -1.0
        scores.append(score)
    return scores
check_answer: This reward function assesses the correctness of the answer. We first try to extract the text between the <SOLUTION> tags; if nothing can be extracted, we assign a penalty. Otherwise, we compare the extracted text to the ground-truth answer, giving a high reward for an exact match. We also provide a partial reward if the numerical ratio between the guessed answer and the true answer is close (e.g., within 10% or 20%).
def check_answer(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_format.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        score = 0
        if guess is None:
            scores.append(-2.0)
            continue
        # Correct answer gets 5 points!
        if guess == true_answer:
            score += 5.0
        # Match if spaces are seen, but less reward
        elif guess.strip() == true_answer.strip():
            score += 3.5
        else:
            # We also reward it if the answer is close via ratios!
            # Ie if the answer is within some range, reward it!
            try:
                ratio = float(guess) / float(true_answer)
                if   ratio >= 0.9 and ratio <= 1.1: score += 2.0
                elif ratio >= 0.8 and ratio <= 1.2: score += 1.5
                else: score -= 2.5 # Penalize wrong answers
            except:
                score -= 4.5 # Penalize
        scores.append(score)
    return scores
check_numbers: We also provide an additional reward specifically for numerical answers. We extract only the number from the solution, clean it up (e.g., removing commas), convert it to a float, and give a positive reward for an exact numerical match or a penalty otherwise.
match_numbers = re.compile(
    solution_start + r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags = re.MULTILINE | re.DOTALL
)

global PRINTED_TIMES
PRINTED_TIMES = 0
global PRINT_EVERY_STEPS
PRINT_EVERY_STEPS = 5

def check_numbers(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [
        guess.group(1) if (guess := match_numbers.search(r)) is not None else None \
        for r in responses
    ]
    scores = []
    # Print a sample every few calls so we can watch training progress
    global PRINTED_TIMES, PRINT_EVERY_STEPS
    if PRINTED_TIMES % PRINT_EVERY_STEPS == 0:
        print(f"*****\nQ: {question}\nA: {answer[0]}\nR: {responses[0]}\nE: {extracted_responses[0]}\n*****")
    PRINTED_TIMES += 1

    for guess, true_answer in zip(extracted_responses, answer):
        if guess is None:
            scores.append(-2.5)
            continue
        try:
            true_num  = float(true_answer.strip())
            guess_num = float(guess.strip().replace(",", ""))
            scores.append(3.5 if guess_num == true_num else -1.5)
        except:
            scores.append(0)
    return scores
By combining these reward functions, we create a comprehensive scoring system that encourages the model to generate well-structured and accurate responses.
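To see how the four functions combine, here is a small illustrative helper that mimics what the GRPOTrainer does internally: it calls each reward function on the same batch and sums the scores element-wise. This is only for intuition; you do not need it for training, since the trainer handles the summation itself:

# Illustrative only: combine the four reward functions into one total score
def total_reward(prompts, completions, answer):
    parts = [
        match_format_exactly(completions),
        match_format_approximately(completions),
        check_answer(prompts, completions, answer),
        check_numbers(prompts, completions, answer),
    ]
    # Element-wise sum across the four reward functions, one total per completion
    return [sum(scores) for scores in zip(*parts)]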
6. Configuring and Launching the GRPO Trainer
Now, we configure the GRPOTrainer with our model, tokenizer, reward functions, and training arguments.
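One note before the code: the configuration below references max_prompt_length and max_completion_length, which are not defined in the snippets shown in this post. A simple way to set them (an assumption on my part, not necessarily the only choice) is to budget a fixed prompt length and give the rest of the context window to the completion:

# Assumed values, not taken from the original snippets:
max_prompt_length = 512                                     # room for system prompt + question
max_completion_length = max_seq_length - max_prompt_length  # remaining budget for the reasoning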
from trl import GRPOConfig, GRPOTrainer
from vllm import SamplingParams
# Define sampling parameters for vLLM
vllm_sampling_params = SamplingParams(
    min_p = 0.1,
    top_p = 1.0,
    top_k = -1,
    seed = 3407,
    stop = [tokenizer.eos_token],
    include_stop_str_in_output = True,
)

# Configure GRPO training
training_args = GRPOConfig(
    vllm_sampling_params = vllm_sampling_params,
    temperature = 1.0,
    learning_rate = 5e-6,
    weight_decay = 0.01,
    warmup_ratio = 0.1,
    lr_scheduler_type = "linear",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase for smoother training
    num_generations = 4, # Decrease if out of memory
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    max_steps = 100, # Set to a higher number for a full run
    save_steps = 100,
    report_to = "none", # Can use "wandb"
    output_dir = "outputs",
)
- num_generations: This is the ‘G’ in GRPO, the size of the group of responses generated for each prompt. A larger group gives better statistics for the advantage calculation but uses more memory and compute. A value of 4-8 is a good starting point.
- temperature: A higher temperature (like 1.0) encourages the model to generate more diverse and creative responses for the group. This diversity is essential for exploration, as it allows the model to try different reasoning paths. A low temperature would make all the responses in the group too similar, hindering learning.
# Initialize the trainer
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        match_format_exactly,
        match_format_approximately,
        check_answer,
        check_numbers,
    ],
    args = training_args,
    train_dataset = dataset,
)

# Start training!
trainer.train()
The GRPOTrainer uses the configuration to fine-tune the model. The goal during training is to see the reward column in the training logs increase over time.
7. Inference with the Fine-tuned Model
After training, we can test our fine-tuned model. First, we save the trained LoRA adapter, then load it during inference.
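The generation call below also uses a sampling_params object that the earlier snippets never define. Any vLLM SamplingParams instance will do; here is one reasonable, assumed configuration:

from vllm import SamplingParams

# Assumed generation settings, not taken from the original snippets
sampling_params = SamplingParams(
    temperature = 1.0,
    top_p = 0.95,
    max_tokens = 1024,
)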
# Save the LoRA adapter
"grpo_saved_lora")
model.save_lora(
# Prepare messages for inference
= [
messages "role": "system", "content": system_prompt},
{"role": "user", "content": "What is the sqrt of 101?"},
{
]
= tokenizer.apply_chat_template(
text
messages,= True, # Must add for generation
add_generation_prompt = False,
tokenize
)
# Generate text using the fine-tuned LoRA
= model.fast_generate(
output
text,= sampling_params,
sampling_params = model.load_lora("grpo_saved_lora"),
lora_request 0].outputs[0].text
)[
print(output)
The lora_request argument tells the model to use our fine-tuned LoRA adapter for this generation. The output should now follow the reasoning format we defined.
8. Saving and Sharing Your Model
Finally, Unsloth provides convenient methods for saving your fine-tuned model in various formats.
# This combines the base model with the LoRA adapter into a single model.

# Merge to 16-bit
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")

# Merge to 4-bit
model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit")
# Save to 8-bit Q8_0 GGUF
model.save_pretrained_gguf("model", tokenizer)

# Save to q4_k_m GGUF
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
Closing Note
That’s it! Now you know the fundamentals of fancy “post-training”. You’ve seen how easy it is to create specialized models that excel at complex tasks by combining a powerful base model, a clever reinforcement learning technique, and an efficient library.
- GRPO is an effective method for teaching models to reason by rewarding them for correct and well-structured responses.
- Unsloth makes the fine-tuning process more accessible by improving speed and reducing memory usage.
- A well-defined reward system is crucial for the success of GRPO.
- Pre-finetuning with a short SFT phase helps streamline the main RL training phase.