We’re moving from chat models to agents. Sure, agents are just wrappers around chat models. But there’s a key difference: chat interfaces help you with tasks, while agents run entire workflows for you - think of the Comet browser, which can shop on your behalf.
Traditional evals don’t work well for agents. Metrics like Contextual Precision and Recall measure chat performance, not workflow execution. For agents, we need to evaluate how well they handle complete user journeys. This is where environments come in. RL environments simulate these journeys - they give agents a place to interact, learn, and get better at solving real tasks.
This field is pretty new, as are agents themselves. I came across Will Brown, a cracked researcher who has been working on a library called Verifiers.
Verifiers provides a flexible framework for defining custom interaction protocols between LLMs and environments, enabling sophisticated multi-turn reasoning, tool use, and interactive evaluation.
I have been exploring Verifiers lately, and this blog is my learning log: conceptual understanding paired with practical implementation examples. You’ll learn a few concepts, then see them in action through real environments from the prime-environments repository:
- Environment Types and Parsers
- Rubrics
- Production Patterns
Before we begin, I want to formalize the question: why does traditional evaluation fall flat for agents?
Today, model evaluation follows a simple pattern:
input → model → output → score
This works for tasks like classification or generation (chat etc.), but breaks down for intelligent agents that need to:
- Reason through multi-step problems (debugging code, solving math)
- Use tools dynamically (search, calculate, execute code)
- Adapt based on feedback (interactive debugging, tutoring)
- Handle complex protocols (games, negotiations, collaborative tasks)
Agents need a different evaluation pattern: State → Model → Action → Environment → Feedback → Updated State → …
Why? Because agents don’t just give one answer and stop. They take actions, see what happens, and adjust their next move based on feedback. A coding agent might write a function, run tests, see failures, and iterate until tests pass. A tutoring agent asks questions, gauges student understanding from responses, and adapts its teaching strategy. This cyclical interaction is how agents actually work.
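To make that loop concrete, here is an illustrative sketch in Python. Everything in it is hypothetical pseudocode - `env`, `agent`, and their methods are stand-ins for the pattern, not Verifiers APIs:

# Hypothetical sketch of the agent evaluation loop (not a Verifiers API)
state = env.reset()                          # initial task state
total_reward = 0.0
while not env.is_done(state):
    action = agent.act(state)                # model proposes the next step
    state, reward, feedback = env.step(action)  # environment reacts
    agent.observe(feedback)                  # agent adjusts its next move
    total_reward += reward
# the episode is judged on the whole trajectory, not a single answer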
Verifiers provides the infrastructure for building these dynamic evaluation environments - places where agents can interact, get feedback, and be properly evaluated on their ability to handle complete workflows.
The Building Blocks of Verifiers
Every Verifiers environment is built from three fundamental components:
Environments - Define how agents interact with tasks. They set up the rules: is it a single question-answer (SingleTurnEnv), a back-and-forth conversation (MultiTurnEnv), or does the agent need tools like search or code execution (ToolEnv)?
Parsers - Extract structure from messy model outputs. Models generate free-form text, but you need clean data to evaluate. Parsers pull out the actual answer from all the reasoning, explanations, and formatting.
Rubrics - Score agent performance. They’re your reward functions. Rubrics decide what “good” looks like - whether it’s binary pass/fail, partial credit for progress, or multi-dimensional scoring across correctness, efficiency, and style.
Let’s start with environments - they’re the foundation everything else builds on.
1. Environments:
Environments orchestrate the interaction between models and tasks. In reinforcement learning, these environments provide the state space, action space, and reward signals that enable agents to learn complex behaviors.
Verifiers provides three base types that map to different categories of RL problems:
a. SingleTurnEnv: One-Shot Evaluation
Some tasks don’t need interaction - you present a problem, the agent responds, and you evaluate. That’s SingleTurnEnv. Think of it like a contextual bandit problem: each episode is independent, no state carries over.
# Basic structure
env = vf.SingleTurnEnv(
    dataset=your_data,
    system_prompt="Instructions for the model",
    parser=your_parser,
    rubric=your_rubric
)

Where you’d use this:
Say you’re building a code completion agent. You give it a function signature and docstring, it generates the implementation, you run tests - done. One shot. The state is just the incomplete function and context. The action is the generated code. The reward is whether tests pass and if the code follows good practices.
Or a math reasoning agent - you present a word problem, it works through the solution step-by-step, and you check if the final answer is correct. The reasoning quality matters too, so you reward both the answer and the logical steps taken.
Or technical translation - you feed it a technical doc in one language, it translates to another, and you measure semantic accuracy, fluency, and whether it preserves technical terminology. Single input, single output, evaluate and move on.
b. MultiTurnEnv: Interactive Conversations
When agents need to maintain context across multiple interactions, you use MultiTurnEnv. Unlike SingleTurnEnv where everything happens in one shot, here the agent and environment have a back-and-forth conversation. State persists across turns, letting the agent build on previous exchanges.
class TutorEnv(vf.MultiTurnEnv):
    async def env_response(self, messages, state):
        # Define how environment responds to model
        return response_messages, updated_state

    async def is_completed(self, messages, state):
        # Define when interaction should end
        return completion_condition

Where you’d use this:
A terminal/shell agent needs to remember what directory it’s in, what commands it ran, and what the file system looks like. Each command builds on the previous state. The agent might cd into a directory, then ls to see files, then cat a specific file - each action depends on where you are and what you’ve done. You evaluate the complete workflow: did the agent set up the development environment correctly?
Or a code review agent that goes back and forth with a developer. It suggests improvements, the developer responds or asks questions, the agent refines its suggestions based on feedback. The conversation continues until the code meets quality standards. State tracks the review history, what issues were found, what’s been addressed.
A debugging assistant walks through a problem with you. It asks questions about the error, you provide logs, it suggests where to look, you show it that code, it proposes a fix. The whole conversation is one episode, from bug report to working solution. Each turn builds on what came before.
c. ToolEnv: Agents with Capabilities
Text generation alone isn’t enough for many tasks. Agents need external tools - search engines, code execution, file systems, APIs. ToolEnv extends agents with these capabilities by converting Python functions into tools the model can call.
def search(query: str) -> str:
"""Search function automatically becomes a tool"""
return search_results
env = vf.ToolEnv(
tools=[search, calculate, code_executor],
dataset=dataset,
rubric=rubric
)Where you’d use this:
Think Cursor-style code agent. It needs to read files, write code, run tests, check with a linter, commit to git. Each tool is a capability - file_read, file_write, test_runner, git_commit. The agent picks which tool to use and what parameters to pass. You’re evaluating the complete feature: did it implement what was asked, does the code pass tests, is it good quality?
Or a data science agent that loads data, runs statistical analysis, trains models, generates visualizations, writes reports. Each step requires a different tool. The agent decides the analysis strategy, picks tools in the right sequence, and produces insights. You measure insight quality, model accuracy, whether the work is reproducible.
A research agent searching papers, reading PDFs, tracking citations, building a knowledge graph, writing synthesis. It’s exploring a large information space using multiple tools to gather and organize knowledge. The complete research project - from question to findings - is one episode.
Each environment type maps to different RL paradigms:
- SingleTurnEnv: Contextual bandits - maximize immediate reward
- MultiTurnEnv: Episodic RL - optimize long-term cumulative reward
- ToolEnv: Hierarchical RL - compose complex behaviors from primitive actions
2. Parsers:
Models output free-form text, but RL training needs consistent, structured reward signals. Parsers bridge this gap by extracting the actual answer from all the reasoning and fluff.
Parsers determine what gets rewarded. A lenient parser might reward partial attempts while a strict parser only rewards perfect formatting. This directly impacts learning dynamics - what you parse is what the model learns to produce.
Example: ThinkParser
Say you want the model to show its reasoning but you only care about scoring the final answer. ThinkParser lets you separate the two:
parser = vf.ThinkParser(extract_fn=extract_math_answer)

Model output:
<think>
Let me work through this step by step.
The formula for circle area is A = π × r²
So: A = π × 5² = π × 25 = 25π ≈ 78.54
</think>
Final Answer: 25π
The parser ignores everything in <think> tags and extracts just 25π from the final answer. This encourages chain-of-thought reasoning (which helps accuracy) while keeping evaluation clean.
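The snippet above passes an `extract_math_answer` function that isn’t defined in this post. As a rough sketch (assuming the answer is prefixed with “Final Answer:”, as in the output above), it could look like this:

import re

def extract_math_answer(text: str) -> str:
    """Illustrative extract_fn: grab whatever follows 'Final Answer:'."""
    match = re.search(r"Final Answer:\s*(.+)", text)
    return match.group(1).strip() if match else text.strip()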
Verifiers also provides XMLParser for structured outputs, basic regex parsers for simple pattern matching, and you can write custom parsers for domain-specific formats.
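As a quick illustration of XMLParser, here is a minimal sketch consistent with how it’s used later in this post (the tag name and output string are just examples):

import verifiers as vf

# Extract the content of an <answer> tag from free-form output
parser = vf.XMLParser(["answer"], answer_field="answer")
output = "Some reasoning first...\n<answer>42</answer>"
print(parser.parse_answer(output))  # -> "42"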
3. Rubrics
Rubrics are reward functions. They directly shape what behaviors agents learn - the rubric design determines whether agents optimize for correctness, efficiency, style, safety, or any combination.
Types of Rubrics:
- Binary - Pass/fail scoring for mission-critical tasks where “almost correct” is worthless
- Graduated - Partial credit for incremental progress, encouraging experimentation
- Multi-Dimensional - Score across multiple criteria simultaneously (correctness + efficiency + safety + style)
- Dynamic - Adapt scoring based on task difficulty and learning trajectory
Let’s see how each type shapes agent behavior through a running example: “Write a function to find the maximum number in a list”
a. Binary Rubrics:
Think about a security scanner. If it catches 9 out of 10 SQL injection vulnerabilities, has it succeeded? No - it failed. That one missed vulnerability could compromise the entire system. Same with medical diagnosis systems, financial trading algorithms, safety-critical control systems. A 95% correct solution that fails on edge cases isn’t just suboptimal - it’s dangerous.
Binary rubrics create agents that prioritize reliability over creativity. They’re conservative by design: if you can’t guarantee perfection, don’t attempt the task.
The implementation is simple - loop through all test cases, and the moment one fails, return 0.0:
def exact_correctness(parser, completion, answer, info, **kwargs):
    code = parser.parse_answer(completion)
    test_cases = info["test_cases"]
    for test_input, expected in test_cases:
        try:
            result = execute_code(code, test_input)
            if result != expected:
                return 0.0  # Single failure = total failure
        except:
            return 0.0  # Crashes also get zero
    return 1.0  # Only return 1.0 if ALL tests pass

rubric = vf.Rubric(funcs=[exact_correctness])

See how this plays out:
def find_max(lst):
    return max(lst)

Reward: 1.0 ✅ - Passes all tests including edge cases
def find_max(lst):
    return max(lst) if lst else None  # Returns None instead of raising error

Reward: 0.0 ❌ - Fails one edge case test, gets zero despite being 90% correct
One failure = complete failure. This forces agents to be thorough and handle every case, which is exactly what you want for mission-critical tasks where “mostly working” isn’t acceptable.
b. Graduated Rubrics:
Binary rubrics force agents into an all-or-nothing scenario: pass all test cases or get zero reward. But for many tasks, partial success has real value. An agent that writes a function passing 8 out of 10 test cases has made progress toward the solution. Giving it zero credit wastes useful evaluation signal - you’re not telling the agent “you’re close, just handle these edge cases.”
Graduated rubrics give partial credit proportional to how much of the task the agent completed successfully. For example, a code refactoring agent that needs to update deprecated API calls across a codebase. If it successfully updates 15 out of 20 files, that’s valuable work even though it’s incomplete. You want to reward the successful updates and encourage the agent to handle the remaining edge cases, not give it zero for not being perfect.
def test_passage_score(parser, completion, answer, info, **kwargs):
"""Count passing tests, return ratio"""
code = parser.parse_answer(completion)
passed = sum(1 for test_input, expected in info["test_cases"]
if execute_code(code, test_input) == expected)
return passed / len(info["test_cases"]) # 0.8 if 8/10 pass
def code_quality_score(parser, completion, **kwargs):
"""Bonus points for good practices"""
code = parser.parse_answer(completion)
score = 0.0
if "max(" in code: score += 0.3 # Efficient
if len(code.split('\n')) <= 3: score += 0.2 # Concise
if "if" in code and "not" in code: score += 0.5 # Edge cases
return score
# Combine: 70% correctness + 30% quality
rubric = vf.Rubric(
funcs=[test_passage_score, code_quality_score],
weights=[0.7, 0.3]
)Now watch how different solutions get graded:
Quick-and-Dirty Solution:
def find_max(lst):
    return max(lst)

Test Score: 0.9 (works on 9/10 test cases)
Quality Score: 0.5 (efficient but no edge case handling)
Total: 0.7 × 0.9 + 0.3 × 0.5 = 0.78 ⭐ - Good foundation, room for improvement
Production-Ready Solution:
def find_max(lst):
    if not lst:
        return None
    return max(lst)

Test Score: 1.0 (handles all cases)
Quality Score: 0.8 (efficient, concise, robust)
Total: 0.7 × 1.0 + 0.3 × 0.8 = 0.94 ⭐⭐⭐ - Excellent balance
Learning-in-Progress Solution:
def find_max(lst):
    if len(lst) == 0:
        return None
    max_val = lst[0]
    for i in range(1, len(lst)):
        if lst[i] > max_val:
            max_val = lst[i]
    return max_val

Test Score: 1.0 (works correctly)
Quality Score: 0.5 (handles edge cases but verbose)
Total: 0.7 × 1.0 + 0.3 × 0.5 = 0.85 ⭐⭐ - Correct understanding, can improve style
This approach encourages agents to balance multiple objectives and provides meaningful feedback for partial progress, making it ideal for learning environments where improvement happens gradually.
c. Multi-Dimensional Rubrics:
Some tasks have multiple success criteria. A code generation agent shouldn’t just produce working code - the code needs to be maintainable, efficient, secure, and readable. A function that passes all tests but has security holes or terrible performance hasn’t fully solved the task.
For example, a code generation system for enterprise software. The generated code will be reviewed, maintained, and extended by developers. The agent needs to optimize for multiple dimensions: does it work (correctness), is it fast (efficiency), is it safe (security), can developers understand it (readability)?
Multi-dimensional rubrics score across these criteria simultaneously - essentially mimicking how a senior developer reviews code. The agent gets separate scores for each dimension, then they’re combined with weights reflecting what matters most for the task.
Each function scores one dimension - correctness, efficiency, safety, readability:
def correctness_score(parser, completion, info, **kwargs):
"""Does it work?"""
code = parser.parse_answer(completion)
passed = sum(1 for test_input, expected in info["test_cases"]
if safe_execute(code, test_input) == expected)
return passed / len(info["test_cases"])
def efficiency_score(parser, completion, **kwargs):
"""Is it fast?"""
code = parser.parse_answer(completion)
if "max(" in code: return 1.0 # O(n) built-in
elif code.count("for") == 1: return 0.8 # O(n) loop
elif code.count("for") > 1: return 0.3 # O(n²) nested
return 0.5
def safety_score(parser, completion, **kwargs):
"""Is it secure?"""
code = parser.parse_answer(completion)
score = 1.0
if "lst[0]" in code and "if" not in code: score -= 0.5 # Index error risk
if "exec(" in code or "eval(" in code: score -= 0.8 # Security risk
return max(0.0, score)
def readability_score(parser, completion, **kwargs):
"""Can devs maintain it?"""
code = parser.parse_answer(completion)
score = 0.0
if len(code.split('\n')) <= 3: score += 0.4 # Concise
if '"""' in code: score += 0.3 # Documented
return min(1.0, score)
# Weight correctness highest, then efficiency, safety, style
rubric = vf.Rubric(
funcs=[correctness_score, efficiency_score, safety_score, readability_score],
weights=[0.4, 0.3, 0.2, 0.1]
)Here’s how the scoring combines:
Professional-Quality Solution:
def find_max(lst):
"""Find maximum value in a list, handling empty lists."""
if not lst:
raise ValueError("Cannot find max of empty list")
return max(lst)Scores: Correctness: 1.0, Efficiency: 1.0, Safety: 1.0, Readability: 1.0
Total: 0.4×1.0 + 0.3×1.0 + 0.2×1.0 + 0.1×1.0 = 1.0 🌟🌟🌟 - Ready for production
Clever but Problematic Solution:
def find_max(lst):
    return lst[0] if len(lst) == 1 else max(lst[0], find_max(lst[1:]))

Scores: Correctness: 0.8, Efficiency: 0.3, Safety: 0.5, Readability: 0.2
Total: 0.4×0.8 + 0.3×0.3 + 0.2×0.5 + 0.1×0.2 = 0.51 ⭐ - Works but creates problems
This comprehensive approach creates agents that think like senior developers, considering not just whether code works, but whether it’s the kind of code you’d want to maintain and extend over time.
d. Dynamic Rubrics: Context-Aware Scoring
Not all tasks have the same difficulty. An agent getting 60% on a hard coding problem shows progress, but 60% on a trivial string reversal means something’s wrong. Fixed rubrics score both the same way.
Dynamic rubrics adjust expectations based on task difficulty and the agent’s recent performance. For example, an AI coding tutor where tasks vary in complexity. If the agent has been acing complex algorithms, getting basic loops wrong signals regression - the rubric penalizes harder. If the agent has been struggling with basics, any progress on harder problems gets bonus rewards to encourage exploration.
This enables adaptive evaluation. Harder tasks get more lenient scoring when the agent is still learning them. Easier tasks get stricter scoring as the agent improves. The agent is evaluated fairly based on task context, not just absolute performance.
def adaptive_scoring(parser, completion, info, state, **kwargs):
"""Adjust scoring based on context and learning trajectory"""
code = parser.parse_answer(completion)
base_score = get_correctness_score(code, info["test_cases"])
# Adapt expectations based on task difficulty
difficulty = info.get("difficulty", "medium")
if difficulty == "easy" and base_score < 0.9:
# High expectations for easy tasks
base_score *= 0.8
elif difficulty == "hard" and base_score > 0.5:
# Bonus for any success on hard tasks
base_score *= 1.2
# Consider learning trajectory
recent_scores = state.get("recent_scores", [])
if len(recent_scores) >= 5:
avg_recent = sum(recent_scores[-5:]) / 5
if base_score > avg_recent + 0.1:
# Reward improvement beyond current level
base_score *= 1.1
elif base_score < avg_recent - 0.2:
# Gentle penalty for regression
base_score *= 0.9
return min(1.0, base_score)
rubric = vf.Rubric(funcs=[adaptive_scoring])This creates personalized learning experiences where agents are challenged at appropriate levels, maintaining engagement and steady progress rather than getting stuck on tasks that are too easy or too hard for their current capabilities.
Rubric Impact on Learning Dynamics
Binary Rubrics → Reliable but Conservative Agents
- Clear success/failure signals
- Agents prioritize certainty over innovation
- Good for safety-critical applications
Graduated Rubrics → Balanced Risk-Taking
- Partial credit encourages experimentation
- Agents learn to optimize multiple objectives
- Better for learning complex skills
Multi-Dimensional Rubrics → Production-Ready Agents
- Optimizes for real-world software quality
- Creates well-rounded coding behavior
- Essential for professional development tasks
Dynamic Rubrics → Adaptive Learning
- Adjusts difficulty and expectations contextually
- Enables curriculum learning and progressive skill building
- Critical for long-term agent development
The rubric design is arguably the most critical component because it defines what “success” means during RL training. A well-designed rubric doesn’t just measure performance - it actively shapes the kind of agent you’ll end up with.
Now let’s see these concepts in action by building a complete environment from scratch.
Building an Environment: AlphabetSort
We’ll build AlphabetSort, a MultiTurnEnv that teaches all the key concepts - state management, dynamic interaction, parsers, and rubrics working together. MultiTurnEnv is the sweet spot for learning: more complex than SingleTurnEnv (one-shot Q&A), but simpler than ToolEnv (external capabilities). Plus, it shows how real agents work - maintaining context and building on previous interactions.
The Task
Turn 1: Sort names alphabetically.
Turn 2+: Integrate new names into the sorted list and mark them with “// new name!”. The agent must maintain context across turns.
Each dataset item stores ground truth answers for each turn, follow-up prompts, and expected outputs:
{
"prompt": "Sort these names: Alice, Charlie, Bob",
"info": {
"num_turns": 2,
"follow_ups": ["Now add Diana and Eve, sort all names and mark new ones"],
"ground_truths": [
["Alice", "Bob", "Charlie"],
["Alice", "Bob", "Charlie", "Diana // new name!", "Eve // new name!"]
]
}
}Building the Environment
MultiTurnEnv needs two methods to control the conversation flow:
1. env_response() - What the environment says after each agent response
After the agent sorts the initial names, the environment needs to send the next challenge. This method decides what message to send based on how many turns have happened:
async def env_response(self, messages: Messages, state: State, **kwargs):
    # Count how many times the agent has responded
    assistant_count = len([m for m in messages if m["role"] == "assistant"])
    # Send the next pre-planned follow-up prompt
    if assistant_count < state["info"]["num_turns"]:
        return [{"role": "user", "content": state["info"]["follow_ups"][assistant_count - 1]}], state
    return [{"role": "user", "content": "Continue"}], state

2. is_completed() - When to stop the conversation
The environment needs to know when the episode is done. In AlphabetSort, we stop after the agent has responded N times (where N = num_turns):
async def is_completed(self, messages: Messages, state: State, **kwargs):
    assistant_count = len([m for m in messages if m["role"] == "assistant"])
    return assistant_count >= state["info"]["num_turns"]

3. Parsing
The agent might add reasoning, explanations, or extra text. XMLParser pulls out just the sorted list. Different XML tags signal different task stages:
xml_tag = "alphabetical_sorted" if turn_num == 1 else "combined_alphabetical_sorted"
parser = vf.XMLParser([xml_tag], answer_field=xml_tag)
parsed = parser.parse_answer(agent_response)

4. Scoring
Each turn gets scored independently, then averaged. Using sequence similarity with an exponential penalty (^4) means near-perfect matters - 95% similarity scores 0.81, not 0.95:
def eval_turn(completion, turn_num, state):
# Extract agent's answer for this specific turn
assistant_msgs = [m["content"] for m in completion if m["role"] == "assistant"]
xml_tag = "alphabetical_sorted" if turn_num == 1 else "combined_alphabetical_sorted"
parser = vf.XMLParser([xml_tag], answer_field=xml_tag)
parsed = parser.parse_answer(assistant_msgs[turn_num - 1])
# Compare with ground truth using exponential similarity
similarity = difflib.SequenceMatcher(None, parsed, expected).ratio()
return similarity ** 4 # Penalizes mistakes heavily5. Rubric
The rubric needs to score all turns and average them. We do this by creating a reward function that loops through each turn:
def create_weighted_rewards():
    def weighted_reward(completion, state, **kwargs):
        actual_turns = state["info"]["num_turns"]
        total_score = 0.0
        # Score each turn
        for turn_num in range(1, actual_turns + 1):
            turn_score = eval_turn(completion, turn_num, state)
            total_score += turn_score
        # Return average across all turns
        return total_score / actual_turns if actual_turns > 0 else 0.0
    return weighted_reward

# Create the rubric
rubric = vf.Rubric(funcs=[create_weighted_rewards()], weights=[1.0])

Key Concepts
- MultiTurnEnv enables back-and-forth interaction
- env_response() controls what the environment says next
- is_completed() decides when to stop
- State carries information across turns (ground truths, follow-ups, turn count)
- Per-turn evaluation scores each response, then averages
- Rubric wraps the scoring logic and gets called during evaluation
import difflib
import json
import random
from typing import List, Tuple
import verifiers as vf
from datasets import Dataset, load_dataset
from verifiers.types import Messages, State
def load_environment(
    dataset_name: str = "kalomaze/alphabetic-arxiv-authors-it1",
    dataset_split: str = "train",
    max_turns: int = 3,
    min_turns: int = 1,
    min_names_per_turn: int = 1,
    max_names_per_turn: int = 5,
    similarity_power: int = 4,
    seed: int = 1337420,
) -> vf.Environment:
    class SortingEnv(vf.MultiTurnEnv):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)

        async def is_completed(self, messages: Messages, state: State, **kwargs) -> bool:
            assert isinstance(messages, list)
            assistant_count = len([m for m in messages if m["role"] == "assistant"])
            num_turns = state["info"]["num_turns"]
            return assistant_count >= num_turns

        async def env_response(self, messages: Messages, state: State, **kwargs) -> Tuple[Messages, State]:
            assert isinstance(messages, list)
            assistant_count = len([m for m in messages if m["role"] == "assistant"])
            num_turns = state["info"]["num_turns"]
            if assistant_count < num_turns:
                follow_ups = state["info"]["follow_ups"]
                follow_up_idx = assistant_count - 1
                if follow_up_idx < len(follow_ups):
                    return [{"role": "user", "content": follow_ups[follow_up_idx]}], state
            return [{"role": "user", "content": "Continue"}], state

    def score_response(predicted: List[str], expected: List[str]) -> float:
        if not predicted or not expected:
            return 0.0
        pred_clean = [s.strip().lower() for s in predicted]
        exp_clean = [s.strip().lower() for s in expected]
        pred_text = "\n".join(pred_clean)
        exp_text = "\n".join(exp_clean)
        similarity = difflib.SequenceMatcher(None, pred_text, exp_text).ratio()
        return similarity**similarity_power

    def eval_turn(completion: List[dict], turn_num: int, state: dict) -> float:
        info = state.get("info", {})
        ground_truths = info.get("ground_truths", [])
        if turn_num > len(ground_truths):
            return 0.0
        expected = ground_truths[turn_num - 1]
        if not isinstance(completion, list):
            return 0.0
        assistant_msgs = [m["content"] for m in completion if m["role"] == "assistant"]
        if len(assistant_msgs) < turn_num:
            return 0.0
        xml_tag = "alphabetical_sorted" if turn_num == 1 else "combined_alphabetical_sorted"
        parser = vf.XMLParser([xml_tag], answer_field=xml_tag)
        parsed = parser.parse_answer(assistant_msgs[turn_num - 1])
        if parsed is None:
            return 0.0
        predicted = parsed.split("\n")
        return score_response(predicted, expected)

    def create_weighted_rewards():
        def weighted_reward(completion, state, **kwargs):
            actual_turns = state["info"]["num_turns"]
            total_score = 0.0
            for turn_num in range(1, actual_turns + 1):
                turn_score = eval_turn(completion, turn_num, state)
                total_score += turn_score
            return total_score / actual_turns if actual_turns > 0 else 0.0
        return weighted_reward

    # Dataset building logic would go here...
    dataset = build_dataset()  # Implementation details omitted for brevity

    rubric = vf.Rubric(funcs=[create_weighted_rewards()], weights=[1.0])
    env_instance = SortingEnv(dataset=dataset, rubric=rubric, max_turns=max_turns)
    return env_instance

- Prime Environments: github.com/PrimeIntellect-ai/prime-environments - Explore 40+ production environments
- Verifiers Documentation: verifiers.readthedocs.io - Complete technical reference
- Community: Connect with other environment builders and share your work