Evals are dead. Long live evals!

A deep dive into RL environments and Verifiers
AI
LLM
Reinforcement Learning
Eval
Agents
Author

Anshuman Mishra

Published

October 5, 2025

We’re moving from chat models to agents. Sure, agents are just wrappers around chat models. But there’s a key difference: chat interfaces help you with tasks, while agents run entire workflows for you - think of the Comet browser, which can shop on your behalf.

Traditional evals don’t work well for agents. Metrics like Contextual Precision and Recall measure chat performance, not workflow execution. For agents, we need to evaluate how well they handle complete user journeys. This is where environments come in. RL environments simulate these journeys - they give agents a place to interact, learn, and get better at solving real tasks.

This field is pretty new, as are agents. I found Will Brown, a cracked researcher who has been building a library called Verifiers.

Verifiers provides a flexible framework for defining custom interaction protocols between LLMs and environments, enabling sophisticated multi-turn reasoning, tool use, and interactive evaluation.

I have been exploring Verifiers lately, and this blog is my learning log: conceptual understanding paired with practical implementation examples. You’ll learn a few concepts, then see them in action through real environments from the prime-environments repository.

What You’ll Learn
  • Environment Types and Parsers
  • Rubrics
  • Production Patterns

Before we begin, I want to formalize the question: why do traditional evaluations fall flat for agents?

As of now model evaluation follows a simple pattern:

input → model → output → score

This works for tasks like classification or generation (chat etc.), but breaks down for intelligent agents that need to:

  1. Reason through multi-step problems (debugging code, solving math)
  2. Use tools dynamically (search, calculate, execute code)
  3. Adapt based on feedback (interactive debugging, tutoring)
  4. Handle complex protocols (games, negotiations, collaborative tasks)

Agents need a different evaluation pattern: State → Model → Action → Environment → Feedback → Updated State → …

Why? Because agents don’t just give one answer and stop. They take actions, see what happens, and adjust their next move based on feedback. A coding agent might write a function, run tests, see failures, and iterate until tests pass. A tutoring agent asks questions, gauges student understanding from responses, and adapts its teaching strategy. This cyclical interaction is how agents actually work.
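Concretely, the loop looks something like this. The environment and agent below are toy stand-ins I made up to show the cycle - this is not the Verifiers API:

# Toy stand-ins for an environment and an agent - illustrative only, not Verifiers' API.
class ToyEnv:
    def reset(self):
        return {"task": "make the failing test pass", "turn": 0}

    def step(self, action):
        reward = 1.0 if "fixed" in action else 0.0    # toy feedback signal
        return {"task": "make the failing test pass", "turn": 1}, reward, reward == 1.0

class ToyAgent:
    def act(self, state):
        return "I reran the tests and fixed the bug"  # toy action

env, agent = ToyEnv(), ToyAgent()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = agent.act(state)                  # model proposes the next move
    state, reward, done = env.step(action)     # environment reacts and scores it
    total_reward += reward                     # feedback accumulates over the episode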

Verifiers provides the infrastructure for building these dynamic evaluation environments - places where agents can interact, get feedback, and be properly evaluated on their ability to handle complete workflows.

The Building Blocks of Verifiers

Every Verifiers environment is built from three fundamental components:

  • Environments - Define how agents interact with tasks. They set up the rules: is it a single question-answer (SingleTurnEnv), a back-and-forth conversation (MultiTurnEnv), or does the agent need tools like search or code execution (ToolEnv)?

  • Parsers - Extract structure from messy model outputs. Models generate free-form text, but you need clean data to evaluate. Parsers pull out the actual answer from all the reasoning, explanations, and formatting.

  • Rubrics - Score agent performance. They’re your reward functions. Rubrics decide what “good” looks like - whether it’s binary pass/fail, partial credit for progress, or multi-dimensional scoring across correctness, efficiency, and style.

Let’s start with environments - they’re the foundation everything else builds on.

1. Environments:

Environments orchestrate the interaction between models and tasks. In reinforcement learning, these environments provide the state space, action space, and reward signals that enable agents to learn complex behaviors.

Verifiers provides three base types that map to different categories of RL problems:

a. SingleTurnEnv: One-Shot Evaluation

Some tasks don’t need interaction - you present a problem, the agent responds, and you evaluate. That’s SingleTurnEnv. Think of it like a contextual bandit problem: each episode is independent, no state carries over.

# Basic structure
env = vf.SingleTurnEnv(
    dataset=your_data,
    system_prompt="Instructions for the model",
    parser=your_parser,
    rubric=your_rubric
)

Where you’d use this:

Say you’re building a code completion agent. You give it a function signature and docstring, it generates the implementation, you run tests - done. One shot. The state is just the incomplete function and context. The action is the generated code. The reward is whether tests pass and if the code follows good practices.

Or a math reasoning agent - you present a word problem, it works through the solution step-by-step, and you check if the final answer is correct. The reasoning quality matters too, so you reward both the answer and the logical steps taken.

Or technical translation - you feed it a technical doc in one language, it translates to another, and you measure semantic accuracy, fluency, and whether it preserves technical terminology. Single input, single output, evaluate and move on.
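Here’s a minimal sketch of a SingleTurnEnv for the math case. The toy dataset, its column names, and the exact-match reward function are my own assumptions for illustration - real environments use richer datasets and parsers:

import verifiers as vf
from datasets import Dataset

# Toy one-shot math dataset. Column names are an assumption; adjust to your data.
dataset = Dataset.from_list([
    {"question": "What is 17 * 23?", "answer": "391"},
    {"question": "A circle has radius 5. What is its area, in terms of pi?", "answer": "25*pi"},
])

def exact_match(parser, completion, answer, **kwargs):
    """Reward 1.0 only when the parsed final answer matches the ground truth."""
    parsed = parser.parse_answer(completion) or ""
    return 1.0 if parsed.strip() == answer else 0.0

env = vf.SingleTurnEnv(
    dataset=dataset,
    system_prompt="Solve the problem and reply with only the final answer.",
    parser=vf.Parser(),                     # base parser: score the raw completion
    rubric=vf.Rubric(funcs=[exact_match]),
)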

b. MultiTurnEnv: Interactive Conversations

When agents need to maintain context across multiple interactions, you use MultiTurnEnv. Unlike SingleTurnEnv where everything happens in one shot, here the agent and environment have a back-and-forth conversation. State persists across turns, letting the agent build on previous exchanges.

class TutorEnv(vf.MultiTurnEnv):
    async def env_response(self, messages, state):
        # Define how environment responds to model
        return response_messages, updated_state

    async def is_completed(self, messages, state):
        # Define when interaction should end
        return completion_condition

Where you’d use this:

A terminal/shell agent needs to remember what directory it’s in, what commands it ran, and what the file system looks like. Each command builds on the previous state. The agent might cd into a directory, then ls to see files, then cat a specific file - each action depends on where you are and what you’ve done. You evaluate the complete workflow: did the agent set up the development environment correctly?

Or a code review agent that goes back and forth with a developer. It suggests improvements, the developer responds or asks questions, the agent refines its suggestions based on feedback. The conversation continues until the code meets quality standards. State tracks the review history, what issues were found, what’s been addressed.

A debugging assistant walks through a problem with you. It asks questions about the error, you provide logs, it suggests where to look, you show it that code, it proposes a fix. The whole conversation is one episode, from bug report to working solution. Each turn builds on what came before.
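To make env_response() and is_completed() concrete before the full AlphabetSort build later, here’s a toy guessing-game environment I sketched (not from Verifiers or prime-environments). The agent guesses a number, the environment replies with a hint, and state carries the target and last hint across turns:

import verifiers as vf

class GuessNumberEnv(vf.MultiTurnEnv):
    """Toy environment: the agent guesses an integer and the environment replies
    'higher', 'lower', or 'correct'. Purely illustrative."""

    async def env_response(self, messages, state, **kwargs):
        last = messages[-1]["content"].strip()        # the agent's latest guess
        target = state["info"]["target"]              # assumed to be stored per example
        try:
            guess = int(last)
        except ValueError:
            return [{"role": "user", "content": "Reply with a single integer."}], state
        hint = "correct" if guess == target else ("higher" if guess < target else "lower")
        state["last_hint"] = hint                     # state persists across turns
        return [{"role": "user", "content": hint}], state

    async def is_completed(self, messages, state, **kwargs):
        # End the episode once the agent has guessed correctly.
        return state.get("last_hint") == "correct"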

c. ToolEnv: Agents with Capabilities

Text generation alone isn’t enough for many tasks. Agents need external tools - search engines, code execution, file systems, APIs. ToolEnv extends agents with these capabilities by converting Python functions into tools the model can call.

def search(query: str) -> str:
    """Search function automatically becomes a tool"""
    return search_results

env = vf.ToolEnv(
    tools=[search, calculate, code_executor],
    dataset=dataset,
    rubric=rubric
)

Where you’d use this:

Think Cursor-style code agent. It needs to read files, write code, run tests, check with a linter, commit to git. Each tool is a capability - file_read, file_write, test_runner, git_commit. The agent picks which tool to use and what parameters to pass. You’re evaluating the complete feature: did it implement what was asked, does the code pass tests, is it good quality?

Or a data science agent that loads data, runs statistical analysis, trains models, generates visualizations, writes reports. Each step requires a different tool. The agent decides the analysis strategy, picks tools in the right sequence, and produces insights. You measure insight quality, model accuracy, whether the work is reproducible.

A research agent searching papers, reading PDFs, tracking citations, building a knowledge graph, writing synthesis. It’s exploring a large information space using multiple tools to gather and organize knowledge. The complete research project - from question to findings - is one episode.
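Tools are plain Python functions with type hints and a docstring, and Verifiers converts them into tools the model can call. Here’s a small sketch with a single, deliberately restricted calculator tool; the dataset and rubric are assumed to exist, as in the earlier snippets:

import verifiers as vf

def calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression and return the result as a string."""
    # Restrict the character set so we never evaluate arbitrary code. A real
    # deployment should still sandbox tool execution.
    if not set(expression) <= set("0123456789+-*/(). "):
        return "error: unsupported characters"
    try:
        return str(eval(expression))
    except Exception as exc:
        return f"error: {exc}"

env = vf.ToolEnv(
    tools=[calculate],     # signature + docstring become the tool schema
    dataset=dataset,       # assumed, as in the earlier snippets
    rubric=rubric,         # assumed, as in the earlier snippets
)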

RL Environment Design Principles

Each environment type maps to different RL paradigms:

  • SingleTurnEnv: Contextual bandits - maximize immediate reward
  • MultiTurnEnv: Episodic RL - optimize long-term cumulative reward
  • ToolEnv: Hierarchical RL - compose complex behaviors from primitive actions

2. Parsers:

Models output free-form text, but RL training needs consistent, structured reward signals. Parsers bridge this gap by extracting the actual answer from all the reasoning and fluff.

Parsers determine what gets rewarded. A lenient parser might reward partial attempts while a strict parser only rewards perfect formatting. This directly impacts learning dynamics - what you parse is what the model learns to produce.

Example: ThinkParser

Say you want the model to show its reasoning but you only care about scoring the final answer. ThinkParser lets you separate the two:

parser = vf.ThinkParser(extract_fn=extract_math_answer)

Model output:

<think>
Let me work through this step by step.
The formula for circle area is A = π × r²
So: A = π × 5² = π × 25 = 25π ≈ 78.54
</think>

Final Answer: 25π

The parser ignores everything in <think> tags and extracts just 25π from the final answer. This encourages chain-of-thought reasoning (which helps accuracy) while keeping evaluation clean.
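Note that extract_math_answer isn’t shipped by Verifiers - it’s whatever function you pass as extract_fn. A plausible sketch for the output format above:

import re

def extract_math_answer(text: str) -> str:
    """Hypothetical extract_fn: grab the value after 'Final Answer:', or fall back
    to the last non-empty line."""
    match = re.search(r"Final Answer:\s*(.+)", text)
    if match:
        return match.group(1).strip()
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""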

Verifiers also provides XMLParser for structured outputs and basic regex parsers for simple pattern matching, and you can write custom parsers for domain-specific formats.

3. Rubrics

Rubrics are reward functions. They directly shape what behaviors agents learn - the rubric design determines whether agents optimize for correctness, efficiency, style, safety, or any combination.

Types of Rubrics:

  • Binary - Pass/fail scoring for mission-critical tasks where “almost correct” is worthless
  • Graduated - Partial credit for incremental progress, encouraging experimentation
  • Multi-Dimensional - Score across multiple criteria simultaneously (correctness + efficiency + safety + style)
  • Dynamic - Adapt scoring based on task difficulty and learning trajectory

Let’s see how each type shapes agent behavior through a running example: “Write a function to find the maximum number in a list”

a. Binary Rubrics:

Think about a security scanner. If it catches 9 out of 10 SQL injection vulnerabilities, has it succeeded? No - it failed. That one missed vulnerability could compromise the entire system. Same with medical diagnosis systems, financial trading algorithms, safety-critical control systems. A 95% correct solution that fails on edge cases isn’t just suboptimal - it’s dangerous.

Binary rubrics create agents that prioritize reliability over creativity. They’re conservative by design: if you can’t guarantee perfection, don’t attempt the task.

The implementation is simple - loop through all test cases, and the moment one fails, return 0.0:

def exact_correctness(parser, completion, answer, info, **kwargs):
    code = parser.parse_answer(completion)
    test_cases = info["test_cases"]

    for test_input, expected in test_cases:
        try:
            result = execute_code(code, test_input)
            if result != expected:
                return 0.0  # Single failure = total failure
        except:
            return 0.0  # Crashes also get zero

    return 1.0  # Only return 1.0 if ALL tests pass

rubric = vf.Rubric(funcs=[exact_correctness])
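Here, execute_code is a helper you’d supply yourself, not part of Verifiers. A rough sketch for this find_max task (real setups should run untrusted code in a sandbox - a subprocess, container, or code-execution service):

def execute_code(code: str, test_input):
    """Hypothetical helper: define the candidate find_max from its source and call it.
    Any exception propagates up to the rubric's except branch."""
    namespace = {}
    exec(code, namespace)                     # defines find_max in the namespace
    return namespace["find_max"](test_input)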

See how this plays out:

def find_max(lst):
    return max(lst)

Reward: 1.0 ✅ - Passes all tests including edge cases

def find_max(lst):
    return max(lst) if lst else None  # Returns None instead of raising error

Reward: 0.0 ❌ - Fails one edge case test, gets zero despite being 90% correct

One failure = complete failure. This forces agents to be thorough and handle every case, which is exactly what you want for mission-critical tasks where “mostly working” isn’t acceptable.

b. Graduated Rubrics:

Binary rubrics force agents into an all-or-nothing scenario: pass all test cases or get zero reward. But for many tasks, partial success has real value. An agent that writes a function passing 8 out of 10 test cases has made progress toward the solution. Giving it zero credit wastes useful evaluation signal - you’re not telling the agent “you’re close, just handle these edge cases.”

Graduated rubrics give partial credit proportional to how much of the task the agent completed successfully. Take, for example, a code refactoring agent that needs to update deprecated API calls across a codebase. If it successfully updates 15 out of 20 files, that’s valuable work even though it’s incomplete. You want to reward the successful updates and encourage the agent to handle the remaining edge cases, not give it zero for not being perfect.

def test_passage_score(parser, completion, answer, info, **kwargs):
    """Count passing tests, return ratio"""
    code = parser.parse_answer(completion)
    passed = sum(1 for test_input, expected in info["test_cases"]
                 if execute_code(code, test_input) == expected)
    return passed / len(info["test_cases"])  # 0.8 if 8/10 pass

def code_quality_score(parser, completion, **kwargs):
    """Bonus points for good practices"""
    code = parser.parse_answer(completion)
    score = 0.0
    if "max(" in code: score += 0.3  # Efficient
    if len(code.split('\n')) <= 3: score += 0.2  # Concise
    if "if" in code and "not" in code: score += 0.5  # Edge cases
    return score

# Combine: 70% correctness + 30% quality
rubric = vf.Rubric(
    funcs=[test_passage_score, code_quality_score],
    weights=[0.7, 0.3]
)

Now watch how different solutions get graded:

Quick-and-Dirty Solution:

def find_max(lst):
    return max(lst)

Test Score: 0.9 (works on 9/10 test cases)
Quality Score: 0.5 (efficient but no edge case handling)
Total: 0.7 × 0.9 + 0.3 × 0.5 = 0.78 ⭐ - Good foundation, room for improvement

Production-Ready Solution:

def find_max(lst):
    if not lst:
        return None
    return max(lst)

Test Score: 1.0 (handles all cases)
Quality Score: 0.8 (efficient, concise, robust)
Total: 0.7 × 1.0 + 0.3 × 0.8 = 0.94 ⭐⭐⭐ - Excellent balance

Learning-in-Progress Solution:

def find_max(lst):
    if not lst:
        return None
    max_val = lst[0]
    for i in range(1, len(lst)):
        if lst[i] > max_val:
            max_val = lst[i]
    return max_val

Test Score: 1.0 (works correctly)
Quality Score: 0.5 (handles edge cases but verbose)
Total: 0.7 × 1.0 + 0.3 × 0.5 = 0.85 ⭐⭐ - Correct understanding, can improve style

This approach encourages agents to balance multiple objectives and provides meaningful feedback for partial progress, making it ideal for learning environments where improvement happens gradually.

c. Multi-Dimensional Rubrics:

Some tasks have multiple success criteria. A code generation agent shouldn’t just produce working code - the code needs to be maintainable, efficient, secure, and readable. A function that passes all tests but has security holes or terrible performance hasn’t fully solved the task.

Consider, for example, a code generation system for enterprise software. The generated code will be reviewed, maintained, and extended by developers. The agent needs to optimize for multiple dimensions: does it work (correctness), is it fast (efficiency), is it safe (security), can developers understand it (readability)?

Multi-dimensional rubrics score across these criteria simultaneously - essentially mimicking how a senior developer reviews code. The agent gets separate scores for each dimension, then they’re combined with weights reflecting what matters most for the task.

Each function scores one dimension - correctness, efficiency, safety, readability:

def correctness_score(parser, completion, info, **kwargs):
    """Does it work?"""
    code = parser.parse_answer(completion)
    passed = sum(1 for test_input, expected in info["test_cases"]
                if safe_execute(code, test_input) == expected)
    return passed / len(info["test_cases"])

def efficiency_score(parser, completion, **kwargs):
    """Is it fast?"""
    code = parser.parse_answer(completion)
    if "max(" in code: return 1.0  # O(n) built-in
    elif code.count("for") == 1: return 0.8  # O(n) loop
    elif code.count("for") > 1: return 0.3  # O(n²) nested
    return 0.5

def safety_score(parser, completion, **kwargs):
    """Is it secure?"""
    code = parser.parse_answer(completion)
    score = 1.0
    if "lst[0]" in code and "if" not in code: score -= 0.5  # Index error risk
    if "exec(" in code or "eval(" in code: score -= 0.8  # Security risk
    return max(0.0, score)

def readability_score(parser, completion, **kwargs):
    """Can devs maintain it?"""
    code = parser.parse_answer(completion)
    score = 0.0
    if len(code.split('\n')) <= 3: score += 0.4  # Concise
    if '"""' in code: score += 0.3  # Documented
    return min(1.0, score)

# Weight correctness highest, then efficiency, safety, style
rubric = vf.Rubric(
    funcs=[correctness_score, efficiency_score, safety_score, readability_score],
    weights=[0.4, 0.3, 0.2, 0.1]
)

Here’s how the scoring combines:

Professional-Quality Solution:

def find_max(lst):
    """Find maximum value in a list, handling empty lists."""
    if not lst:
        raise ValueError("Cannot find max of empty list")
    return max(lst)

Scores: Correctness: 1.0, Efficiency: 1.0, Safety: 1.0, Readability: 1.0

Total: 0.4×1.0 + 0.3×1.0 + 0.2×1.0 + 0.1×1.0 = 1.0 🌟🌟🌟 - Ready for production

Clever but Problematic Solution:

def find_max(lst):
    return lst[0] if len(lst) == 1 else max(lst[0], find_max(lst[1:]))

Scores: Correctness: 0.8, Efficiency: 0.3, Safety: 0.5, Readability: 0.2

Total: 0.4×0.8 + 0.3×0.3 + 0.2×0.5 + 0.1×0.2 = 0.53 ⭐ - Works but creates problems

This comprehensive approach creates agents that think like senior developers, considering not just whether code works, but whether it’s the kind of code you’d want to maintain and extend over time.

d. Dynamic Rubrics: Context-Aware Scoring

Not all tasks have the same difficulty. An agent getting 60% on a hard coding problem shows progress, but 60% on a trivial string reversal means something’s wrong. Fixed rubrics score both the same way.

Dynamic rubrics adjust expectations based on task difficulty and the agent’s recent performance. Consider, for example, an AI coding tutor where tasks vary in complexity. If the agent has been acing complex algorithms, getting basic loops wrong signals regression - the rubric penalizes harder. If the agent has been struggling with basics, any progress on harder problems gets bonus rewards to encourage exploration.

This enables adaptive evaluation. Harder tasks get more lenient scoring when the agent is still learning them. Easier tasks get stricter scoring as the agent improves. The agent is evaluated fairly based on task context, not just absolute performance.

def adaptive_scoring(parser, completion, info, state, **kwargs):
    """Adjust scoring based on context and learning trajectory"""
    code = parser.parse_answer(completion)
    base_score = get_correctness_score(code, info["test_cases"])

    # Adapt expectations based on task difficulty
    difficulty = info.get("difficulty", "medium")
    if difficulty == "easy" and base_score < 0.9:
        # High expectations for easy tasks
        base_score *= 0.8
    elif difficulty == "hard" and base_score > 0.5:
        # Bonus for any success on hard tasks
        base_score *= 1.2

    # Consider learning trajectory
    recent_scores = state.get("recent_scores", [])
    if len(recent_scores) >= 5:
        avg_recent = sum(recent_scores[-5:]) / 5
        if base_score > avg_recent + 0.1:
            # Reward improvement beyond current level
            base_score *= 1.1
        elif base_score < avg_recent - 0.2:
            # Gentle penalty for regression
            base_score *= 0.9

    return min(1.0, base_score)

rubric = vf.Rubric(funcs=[adaptive_scoring])

This creates personalized learning experiences where agents are challenged at appropriate levels, maintaining engagement and steady progress rather than getting stuck on tasks that are too easy or too hard for their current capabilities.

Rubric Impact on Learning Dynamics

Rubric Design Patterns

Binary Rubrics → Reliable but Conservative Agents

  • Clear success/failure signals
  • Agents prioritize certainty over innovation
  • Good for safety-critical applications

Graduated Rubrics → Balanced Risk-Taking

  • Partial credit encourages experimentation
  • Agents learn to optimize multiple objectives
  • Better for learning complex skills

Multi-Dimensional Rubrics → Production-Ready Agents

  • Optimizes for real-world software quality
  • Creates well-rounded coding behavior
  • Essential for professional development tasks

Dynamic Rubrics → Adaptive Learning

  • Adjusts difficulty and expectations contextually
  • Enables curriculum learning and progressive skill building
  • Critical for long-term agent development

The rubric design is arguably the most critical component because it defines what “success” means during RL training. A well-designed rubric doesn’t just measure performance - it actively shapes the kind of agent you’ll end up with.

Now let’s see these concepts in action by building a complete environment from scratch.

Building an Environment: AlphabetSort

We’ll build AlphabetSort, a MultiTurnEnv that teaches all the key concepts - state management, dynamic interaction, parsers, and rubrics working together. MultiTurnEnv is the sweet spot for learning: more complex than SingleTurnEnv (one-shot Q&A), but simpler than ToolEnv (external capabilities). Plus, it shows how real agents work - maintaining context and building on previous interactions.

The Task

Turn 1: Sort names alphabetically.

Turn 2+: Integrate new names into the sorted list and mark them with “// new name!”. The agent must maintain context across turns.

Each dataset item stores ground truth answers for each turn, follow-up prompts, and expected outputs:

{
    "prompt": "Sort these names: Alice, Charlie, Bob",
    "info": {
        "num_turns": 2,
        "follow_ups": ["Now add Diana and Eve, sort all names and mark new ones"],
        "ground_truths": [
            ["Alice", "Bob", "Charlie"],
            ["Alice", "Bob", "Charlie", "Diana // new name!", "Eve // new name!"]
        ]
    }
}

Building the Environment

MultiTurnEnv needs two methods to control the conversation flow; parsing, scoring, and the rubric then build on top of them:

1. env_response() - What the environment says after each agent response

After the agent sorts the initial names, the environment needs to send the next challenge. This method decides what message to send based on how many turns have happened:

async def env_response(self, messages: Messages, state: State, **kwargs):
    # Count how many times the agent has responded
    assistant_count = len([m for m in messages if m["role"] == "assistant"])

    # Send the next pre-planned follow-up prompt
    if assistant_count < state["info"]["num_turns"]:
        return [{"role": "user", "content": state["info"]["follow_ups"][assistant_count - 1]}], state

    return [{"role": "user", "content": "Continue"}], state

2. is_completed() - When to stop the conversation

The environment needs to know when the episode is done. In AlphabetSort, we stop after the agent has responded N times (where N = num_turns):

async def is_completed(self, messages: Messages, state: State, **kwargs):
    assistant_count = len([m for m in messages if m["role"] == "assistant"])
    return assistant_count >= state["info"]["num_turns"]

3. Parsing

The agent might add reasoning, explanations, or extra text. XMLParser pulls out just the sorted list. Different XML tags signal different task stages:

xml_tag = "alphabetical_sorted" if turn_num == 1 else "combined_alphabetical_sorted"
parser = vf.XMLParser([xml_tag], answer_field=xml_tag)
parsed = parser.parse_answer(agent_response)

4. Scoring

Each turn gets scored independently, then averaged. Using sequence similarity with an exponential penalty (^4) means near-perfect matters - 95% similarity scores 0.81, not 0.95:

def eval_turn(completion, turn_num, state):
    # Extract agent's answer for this specific turn
    assistant_msgs = [m["content"] for m in completion if m["role"] == "assistant"]
    xml_tag = "alphabetical_sorted" if turn_num == 1 else "combined_alphabetical_sorted"

    parser = vf.XMLParser([xml_tag], answer_field=xml_tag)
    parsed = parser.parse_answer(assistant_msgs[turn_num - 1])

    # Compare with the ground truth for this turn using exponential similarity
    expected = "\n".join(state["info"]["ground_truths"][turn_num - 1])
    similarity = difflib.SequenceMatcher(None, parsed, expected).ratio()
    return similarity ** 4  # Penalizes mistakes heavily

5. Rubric

The rubric needs to score all turns and average them. We do this by creating a reward function that loops through each turn:

def create_weighted_rewards():
    def weighted_reward(completion, state, **kwargs):
        actual_turns = state["info"]["num_turns"]
        total_score = 0.0

        # Score each turn
        for turn_num in range(1, actual_turns + 1):
            turn_score = eval_turn(completion, turn_num, state)
            total_score += turn_score

        # Return average across all turns
        return total_score / actual_turns if actual_turns > 0 else 0.0

    return weighted_reward

# Create the rubric
rubric = vf.Rubric(funcs=[create_weighted_rewards()], weights=[1.0])

Key Concepts

  • MultiTurnEnv enables back-and-forth interaction
  • env_response() controls what the environment says next
  • is_completed() decides when to stop
  • State carries information across turns (ground truths, follow-ups, turn count)
  • Per-turn evaluation scores each response, then averages
  • Rubric wraps the scoring logic and gets called during evaluation

Here’s the complete implementation, with all of these pieces assembled into load_environment():

import difflib
import json
import random
from typing import List, Tuple

import verifiers as vf
from datasets import Dataset, load_dataset
from verifiers.types import Messages, State


def load_environment(
    dataset_name: str = "kalomaze/alphabetic-arxiv-authors-it1",
    dataset_split: str = "train",
    max_turns: int = 3,
    min_turns: int = 1,
    min_names_per_turn: int = 1,
    max_names_per_turn: int = 5,
    similarity_power: int = 4,
    seed: int = 1337420,
) -> vf.Environment:

    class SortingEnv(vf.MultiTurnEnv):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)

        async def is_completed(self, messages: Messages, state: State, **kwargs) -> bool:
            assert isinstance(messages, list)
            assistant_count = len([m for m in messages if m["role"] == "assistant"])
            num_turns = state["info"]["num_turns"]
            return assistant_count >= num_turns

        async def env_response(self, messages: Messages, state: State, **kwargs) -> Tuple[Messages, State]:
            assert isinstance(messages, list)
            assistant_count = len([m for m in messages if m["role"] == "assistant"])
            num_turns = state["info"]["num_turns"]

            if assistant_count < num_turns:
                follow_ups = state["info"]["follow_ups"]
                follow_up_idx = assistant_count - 1

                if follow_up_idx < len(follow_ups):
                    return [{"role": "user", "content": follow_ups[follow_up_idx]}], state

            return [{"role": "user", "content": "Continue"}], state

    def score_response(predicted: List[str], expected: List[str]) -> float:
        if not predicted or not expected:
            return 0.0

        pred_clean = [s.strip().lower() for s in predicted]
        exp_clean = [s.strip().lower() for s in expected]

        pred_text = "\n".join(pred_clean)
        exp_text = "\n".join(exp_clean)
        similarity = difflib.SequenceMatcher(None, pred_text, exp_text).ratio()

        return similarity**similarity_power

    def eval_turn(completion: List[dict], turn_num: int, state: dict) -> float:
        info = state.get("info", {})
        ground_truths = info.get("ground_truths", [])

        if turn_num > len(ground_truths):
            return 0.0

        expected = ground_truths[turn_num - 1]

        if not isinstance(completion, list):
            return 0.0

        assistant_msgs = [m["content"] for m in completion if m["role"] == "assistant"]
        if len(assistant_msgs) < turn_num:
            return 0.0

        xml_tag = "alphabetical_sorted" if turn_num == 1 else "combined_alphabetical_sorted"

        parser = vf.XMLParser([xml_tag], answer_field=xml_tag)
        parsed = parser.parse_answer(assistant_msgs[turn_num - 1])
        if parsed is None:
            return 0.0
        predicted = parsed.split("\n")

        return score_response(predicted, expected)

    def create_weighted_rewards():
        def weighted_reward(completion, state, **kwargs):
            actual_turns = state["info"]["num_turns"]
            total_score = 0.0

            for turn_num in range(1, actual_turns + 1):
                turn_score = eval_turn(completion, turn_num, state)
                total_score += turn_score

            return total_score / actual_turns if actual_turns > 0 else 0.0

        return weighted_reward

    # Dataset building logic would go here...
    dataset = build_dataset()  # Implementation details omitted for brevity
    rubric = vf.Rubric(funcs=[create_weighted_rewards()], weights=[1.0])
    env_instance = SortingEnv(dataset=dataset, rubric=rubric, max_turns=max_turns)

    return env_instance
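To smoke-test the environment, Verifiers environments can be evaluated against any OpenAI-compatible endpoint. The sketch below reflects my reading of the evaluate API; exact argument names and the shape of the results may differ across Verifiers versions:

from openai import OpenAI

# Build the environment defined above, then run a small evaluation pass.
env = load_environment(max_turns=2)

client = OpenAI()  # any OpenAI-compatible endpoint (OpenAI, vLLM, etc.)
results = env.evaluate(client=client, model="gpt-4.1-mini", num_examples=10)
print(results)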