Understand your student and give them the proper tools

Foreign Language 101

If you ever took a foreign language class, your instructor probably tested you in situations meant to simulate the many ways people interact socially: in a restaurant, making small talk, in a formal meeting, on a romantic date. In each scenario, they gave you context and tried to validate how well you would handle the situation. But you weren’t actually in these situations; you just role-played. The purpose was clear: by practicing in a safe environment and getting feedback, you could improve and build confidence before facing these situations in the real world. From this, the instructor could estimate how well you would do in real life. If the role-play did not go well, you would receive feedback and suggestions on how to improve. And maybe the problem wasn’t you at all, but the teaching method; perhaps a different approach would have been more effective.

This is a great parallel for how Arcade Evals works. Just as language students need a way to practice and gain confidence before real conversations, we needed a way to test and increase confidence in the quality of tool definitions before deploying them to production. With Arcade Evals, we can build a role-play scenario with the LLM as the student, add a simulated conversation context, give it knowledge of tool definitions, and send a user message with the expected outcome we want to validate. Critically, this tests tool selection without requiring real tool executions. No API calls, no side effects, no infrastructure needed. The model is our constant and the tools are what we’re testing. Rubrics score whether the tool definitions were clear enough for the model to select the right tool and populate arguments correctly.

In this analogy, we are the teachers providing learners with tools that, if used properly, help them excel in real-world scenarios. If our student does not perform well, we also consider that the issue might be in the tools we provided. The whole process is pedagogic: we try to improve tools so they are intuitive to use given the learner’s cognitive skills. We can build amazing tools that do incredible things, but if they only work for the most advanced models and we’re deploying to a mix of capabilities, they might as well not exist.

Quick vocabulary

Before going further with Arcade Evals, let’s get comfortable with some key terms:

  • Tool definition / schema: the JSON schema + descriptions the model sees for a tool (names, parameters, formats, enums).
  • Eval case: one user request + optional context + the expected tool call(s).
  • Critics: graders for tool arguments (exact match vs similarity).
  • Rubric: pass/warn/fail thresholds across the critics.
  • Tracks: parallel toolsets (e.g., “vague” vs “descriptive”) evaluated on the same cases.
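To make critics and rubrics concrete, here is a toy sketch of how weighted argument grading can roll up into a pass/warn/fail verdict. This is not Arcade’s internal implementation; the function names, weights, and thresholds below are illustrative, though the thresholds mirror the 0.8/0.9 defaults used in the example later in this post:

```python
"""Toy sketch (not Arcade's implementation) of critics + rubric scoring."""


def binary_score(expected: str, actual: str) -> float:
    # A BinaryCritic-style check: exact match or nothing.
    return 1.0 if expected == actual else 0.0


def grade(expected: dict, actual: dict, weights: dict,
          fail_threshold: float = 0.8, warn_threshold: float = 0.9) -> str:
    # Weighted average of per-argument scores, then rubric thresholds.
    total = sum(weights.values())
    score = sum(
        weights[field] * binary_score(value, actual.get(field, ""))
        for field, value in expected.items()
    ) / total
    if score < fail_threshold:
        return "fail"
    return "warn" if score < warn_threshold else "pass"


expected = {"attendee": "alice@acme.com", "location": "Zoom"}
weights = {"attendee": 3.0, "location": 1.0}  # attendee matters more

print(grade(expected, {"attendee": "alice@acme.com", "location": "Zoom"}, weights))   # pass
print(grade(expected, {"attendee": "alice@acme.com", "location": "Teams"}, weights))  # 0.75 -> fail
```

The point of the weights: getting a critical argument wrong (the attendee’s email) should sink the score, while a low-weight mismatch may only downgrade a pass to a warn.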

Sharpening our pedagogic skills with Arcade Evals

Tool quality directly affects agent reliability. A well-designed tool has clear descriptions, intuitive parameter names, and unambiguous schemas that allow the LLM to select it at the right moment and populate arguments correctly. When tools are poorly designed, agents fail silently or unpredictably in production: choosing wrong tools, passing malformed arguments, or missing the right action entirely. This makes systematic evaluation essential. We need to measure whether tools are understandable to the model before users encounter failures.

The core idea is simple: create a set of realistic situations, ask the model to solve them with the tools you provide, and score whether the correct tool and arguments were used. Arcade Evals packages this into suites of cases, where each case has a user message, optional context, expected tool calls, and critics that grade the arguments. Rubrics then turn those scores into a clear pass, warn, or fail result so you can see if a tool is ready for real usage.

Take a look at the code snippet below. It represents a fully runnable eval case:

"""Minimal Arcade Evals example – a complete, runnable evaluation suite."""

from arcade_evals import (
    BinaryCritic,  # Exact match validator
    SimilarityCritic,  # Fuzzy text matching (cosine similarity)
    EvalRubric,  # Pass/warn/fail thresholds
    EvalSuite,  # Container for test cases
    ExpectedMCPToolCall,  # What we expect the LLM to call
    FuzzyWeight,  # Enum for qualitative weight assignment
    MCPToolDefinition,  # Tool schema definition
    tool_eval,  # Decorator to register eval functions
)

# ---------------------------------------------------------------------------
# 1. Define your MCP tools (the interface the LLM sees)
# ---------------------------------------------------------------------------
TOOLS: list[MCPToolDefinition] = [
    {
        "name": "schedule_meeting",
        "description": "Book a calendar meeting with one attendee.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Meeting title"},
                "attendee": {"type": "string", "description": "Email of attendee"},
                "datetime": {
                    "type": "string",
                    "description": "ISO 8601 datetime (e.g., 2026-02-04T14:00)",
                },
                "location": {
                    "type": "string",
                    "description": "Meeting platform",
                    "enum": ["Zoom", "Google Meet", "Teams"],
                },
            },
            "required": ["title", "attendee", "datetime", "location"],
        },
    },
]

# ---------------------------------------------------------------------------
# 2. Create an evaluation suite with test cases
# ---------------------------------------------------------------------------
@tool_eval()  # Registers this function with the eval runner
async def eval_scheduling_tools() -> EvalSuite:
    suite = EvalSuite(
        name="Scheduling Tools Evaluation",
        system_message="You are a scheduling assistant. Use the provided tools.",
        rubric=EvalRubric(fail_threshold=0.8, warn_threshold=0.9),
        tools=TOOLS,
    )

    suite.add_case(
        name="Book team sync",
        user_message="Schedule a team sync with alice@acme.com tomorrow at 3pm on Zoom",
        expected_tool_calls=[
            ExpectedMCPToolCall(
                name="schedule_meeting",
                arguments={
                    "title": "Team sync",
                    "attendee": "alice@acme.com",
                    "datetime": "2026-01-23T15:00",
                    "location": "Zoom",
                },
            )
        ],
        critics=[
            SimilarityCritic(
                critic_field="title",
                weight=FuzzyWeight.HIGH,
                similarity_threshold=0.8,
            ),
            BinaryCritic(
                critic_field="attendee",
                weight=FuzzyWeight.CRITICAL,
            ),
            BinaryCritic(
                critic_field="datetime",
                weight=FuzzyWeight.CRITICAL,
            ),
            BinaryCritic(
                critic_field="location",
                weight=FuzzyWeight.MEDIUM,
            ),
        ],
    )

    return suite

This gives you the same feedback loop as any other craft: capture how the model behaves, set expectations, and keep iterating on tool descriptions and schemas until the model consistently does the right thing.

What insights Arcade Evals can provide

Arcade Evals was designed to help MCP tool and agent builders get objective results that simulate how LLMs act when prompted to solve an action request. Tool definitions can be loaded from Arcade Python tool source code, Arcade MCP Gateways, remote or local MCP servers, or plain dictionaries. Some examples of how this helps:

  • Evaluate how a toolset behaves across different models and providers.
  • Validate whether tool intent is communicated clearly; tune tool names, parameter names, and descriptions when it is not.
  • Iteratively test tool definitions with dicts without committing to production code.
  • Compare different MCP servers or versions using the same context and expectations.
  • Validate third-party MCP server quality using realistic simulated user interactions.
  • Verify Arcade MCP Gateway consistency as you add toolkits from different providers.
  • Use evals as regression checks when you change schemas or add tools.
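To illustrate the “tracks” idea from the list above, here are two variants of the same tool schema, written as plain dicts like the example earlier. Each would be passed as tools= to its own suite and run against identical cases; the variable names TOOLS_VAGUE and TOOLS_DESCRIPTIVE are ours, not part of the Arcade Evals API:

```python
# Two variants of one tool, suitable for side-by-side "track" evaluation.
# Names here are illustrative, not part of the Arcade Evals API.

TOOLS_VAGUE = [
    {
        "name": "meet",
        "description": "Meeting tool.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "t": {"type": "string"},
                "who": {"type": "string"},
                "when": {"type": "string"},
            },
            "required": ["t", "who", "when"],
        },
    },
]

TOOLS_DESCRIPTIVE = [
    {
        "name": "schedule_meeting",
        "description": "Book a calendar meeting with one attendee.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Meeting title"},
                "attendee": {"type": "string", "description": "Email of attendee"},
                "datetime": {
                    "type": "string",
                    "description": "ISO 8601 datetime (e.g., 2026-02-04T14:00)",
                },
            },
            "required": ["title", "attendee", "datetime"],
        },
    },
]
```

Running the same cases against both tracks puts a number on how much clarity matters: the score gap between the vague and descriptive variants tells you whether renaming parameters and tightening descriptions actually moved the needle.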

What Arcade Evals does not evaluate (and why that’s okay)

Arcade Evals is intentionally scoped to tool selection + argument quality, without executing the tool. That means it does not validate:

  • Whether the downstream API call succeeds
  • Whether the tool output is correct or useful
  • Multi-step planning and retries (unless you choose to evaluate those patterns explicitly)
  • Latency, rate limits, auth, or side effects

This is a feature, not a bug: it lets you iterate on tool schemas early, cheaply, and safely.

Get started with Arcade Evals

Ready to test your MCP tools before they hit production? Arcade Evals is built into the Arcade CLI.

pip install 'arcade-mcp[evals]'
arcade evals

For a practical checklist of what makes tool definitions work (with before/after examples), read Building High-Quality MCP Tools with Arcade Evals.

For the full benchmark data behind this approach, get the technical report.