In our first post in this series, Everything Is a Test, we introduced Arcade Evals: a framework for testing whether LLMs can select the right MCP tool and fill arguments correctly, without executing any tools. This post focuses on the practical side: what separates a good tool definition from a bad one, and how to measure the difference.

Tool definitions are not function signatures

Building high-quality MCP tools requires considerable thought, not only about how the tools work internally, but also about how they are presented to LLMs so the model can understand when and how to use them. Well-described tool definitions can dramatically improve your agent's performance, and that performance translates directly into the experience of the people using it.

As inexperienced tool builders, we might naively treat MCP tools as if they were regular code functions. We don't give much thought to how the tools are presented to the models, so we pick the exact names from the OpenAPI spec (they seem descriptive enough) and skip the descriptions. Who reads docs, right? To simplify our lives, we also accept enums as plain strings. Then we do the required wiring so the tools can call our REST API provider and call it a day.

The struggle begins when testing in the real world. The tool might not perform horribly: depending on input formatting requirements, the most competent (and expensive) models can often infer what to do with a poorly defined toolkit, but cheaper models will struggle more. The main issue arises with parameters that require specific formatting, like dates and enums. Without proper guidance, even competent models can miss the expected input format, especially since we are using plain strings for enums and dates, which makes the space of possible inputs enormous.

Also, depending on your toolset, the lack of a good description can lead the model into an ambiguity trap: it picks a tool it believes does exactly what you want. Depending on what that tool actually does, the result could be disastrous. Beyond these annoying issues, bad tools also drive up token consumption through repeated retries.
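To see why free-form strings hurt, consider how many plausible spellings a model might produce for the same date when the schema says nothing more than "type": "string". This sketch (the backend and its format are hypothetical) shows that only one spelling out of several reasonable ones would actually parse:

```python
from datetime import datetime

# A few of the many ways a model might express "Feb 4, 2026 at 2pm"
# when the schema only says {"type": "string"}:
candidates = [
    "2026-02-04T14:00",    # ISO 8601 (what the backend expects)
    "02/04/2026 2:00 PM",  # US style
    "Feb 4 2026, 14:00",   # natural-language-ish
    "tomorrow at 2pm",     # relative reference
]

def backend_accepts(value: str) -> bool:
    """A hypothetical backend that only parses strict ISO 8601."""
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M")
        return True
    except ValueError:
        return False

results = [backend_accepts(c) for c in candidates]
print(results)  # [True, False, False, False]
```

Every spelling the model might plausibly emit, except the first, becomes a failed call and a retry. A one-line format hint in the parameter description collapses this space to a single option.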

The restaurant menu analogy

We realized our first error was treating MCP tool definitions like function signatures. They are not. They are more like menu items in a restaurant: you give an LLM the options, and based on the context, it chooses the most attractive one and parameterizes it. Then the waiter sends the order to the kitchen, and the dish comes back to you.

If the menu only has generic names with no further elaboration, who knows what you will get when you try something new. If you don't like what arrives, you can order again, but the same problem can repeat indefinitely until you give up. In the worst case, you might get a dish containing ingredients you are allergic to, with serious consequences for your health.

What good looks like

The first logical change is updating the tool names so the intent is clearer even when only the tool name is available. This helps both LLMs and humans who curate their own toolsets. We also add a description that captures the tool’s purpose and critical constraints, without being verbose.

The next change applies to each parameter. We should describe required formats, defaults, and accepted values (including enum options). This is extremely important and will greatly increase the success rate of tool calls.

# BEFORE: Vague tool definition
{
    "name": "schedule",  # Generic name
    # No description!
    "inputSchema": {
        "properties": {
            "when": {"type": "string"},      # What format?
            "type": {"type": "string"},      # What are the options?
            "location": {"type": "string"},  # Zoom? Address? URL?
        }
    }
}

# AFTER: Descriptive tool definition
{
    "name": "create_meeting",
    "description": "Schedule a video meeting with ISO formats and one attendee.",
    "inputSchema": {
        "properties": {
            "datetime": {
                "type": "string",
                "description": "ISO 8601 datetime (YYYY-MM-DDTHH:MM)"
            },
            "meeting_type": {
                "type": "string",
                "description": "Meeting type",
                "enum": ["demo", "review", "planning", "1:1"]
            },
            "platform": {
                "type": "string",
                "description": "Video platform",
                "enum": ["Zoom", "Google Meet", "Teams"]
            }
        }
    }
}
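One way to keep a descriptive schema like the one above in sync with the code is to derive it from type hints, turning Literal annotations into enums automatically. This is a sketch of the idea using only the standard library, not the Arcade API; the helper and function names are illustrative:

```python
import inspect
import typing
from typing import Literal

def create_meeting(
    datetime_iso: str,  # ISO 8601 datetime (YYYY-MM-DDTHH:MM)
    meeting_type: Literal["demo", "review", "planning", "1:1"],
    platform: Literal["Zoom", "Google Meet", "Teams"],
):
    """Schedule a video meeting with ISO formats and one attendee."""

def schema_from_signature(fn) -> dict:
    """Build a JSON-Schema-style inputSchema from type hints,
    converting Literal[...] annotations into enums."""
    props = {}
    for name, param in inspect.signature(fn).parameters.items():
        ann = param.annotation
        if typing.get_origin(ann) is Literal:
            props[name] = {"type": "string", "enum": list(typing.get_args(ann))}
        else:
            props[name] = {"type": "string"}
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn),
        "inputSchema": {"properties": props},
    }

schema = schema_from_signature(create_meeting)
print(schema["inputSchema"]["properties"]["platform"]["enum"])
# ['Zoom', 'Google Meet', 'Teams']
```

Deriving the schema from the signature means the closed sets live in exactly one place, so the "AFTER" definition can't silently drift from the code that implements it.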

Measuring the difference

We ran both versions through Arcade Evals with the same test cases across multiple models. The results were consistent: the vague toolkit failed the 90% threshold on meeting scheduling cases across every model tested. The descriptive toolkit passed consistently.
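For readers new to pass thresholds, the mechanics are simple: a run passes when the fraction of successful cases clears the bar. A minimal sketch (the 90% figure is the one used above; the function is illustrative, not the Arcade Evals API):

```python
def passes(results: list[bool], threshold: float = 0.9) -> bool:
    """A run passes when at least `threshold` of its cases succeed."""
    return sum(results) / len(results) >= threshold

# 8 of 10 cases correct -> 80%, below the 90% bar
print(passes([True] * 8 + [False] * 2))  # False
```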

The primary cause of failure? Missing formatting guidance (ISO datetime and duration) and enum/list mapping. When you don’t tell the model that “when” should be “2026-02-04T14:00”, it guesses. Sometimes it guesses wrong.

Knowing this, we built an evals framework focused on tool-selection scoring: it tests whether an LLM can select the right tool and fill the arguments correctly, using only tool definitions and prompts. No tool execution required.

A quick checklist for high-quality MCP tool schemas

Things you should keep in mind when building MCP tools:

  • Tool names communicate intent: prefer verbs + objects (create_meeting, send_email) over generic verbs (schedule, notify).
  • Descriptions state constraints: one sentence that explains what the tool does and what it does not do.
  • Parameters are model-friendly: names match user language; descriptions include formats and examples.
  • Closed sets are enums: if the value must be one of N options, list them.
  • Formats are explicit: ISO datetime/duration, IDs, URL parsing expectations, ordering rules.
  • Avoid “hidden requirements”: if a tool needs identity/selection/context, either make it an input or provide that context.
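Several of these checks are mechanical enough to automate. Here is a small linter sketch that flags the most common violations from the checklist; the rules and messages are illustrative heuristics, not part of any shipped tool:

```python
def lint_tool(tool: dict) -> list[str]:
    """Flag common schema issues from the checklist above (heuristics only)."""
    issues = []
    if "_" not in tool.get("name", ""):
        issues.append("name: prefer verb_object style, e.g. create_meeting")
    if not tool.get("description"):
        issues.append("description: missing")
    props = tool.get("inputSchema", {}).get("properties", {})
    for pname, prop in props.items():
        if not prop.get("description"):
            issues.append(f"{pname}: no description (format/examples unknown)")
        if prop.get("type") == "string" and "enum" not in prop:
            issues.append(f"{pname}: free-form string; add an enum if it's a closed set")
    return issues

vague = {
    "name": "schedule",
    "inputSchema": {"properties": {"when": {"type": "string"}}},
}
print(lint_tool(vague))  # four issues flagged
```

Running a check like this in CI catches the "BEFORE" style of definition long before a model ever sees it.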

Get started with Arcade Evals

Ready to test your MCP tools before they hit production? Arcade Evals is built into the Arcade CLI.

pip install 'arcade-mcp[evals]'
arcade evals .

For the full docs, see the Arcade Evals guide.

Want to go deeper? We ran these evals across 14 models with 3 runs each, including a complete evaluation of the Figma MCP Server. Read the report or run it yourself.