When we moved our CRM to Attio, we built our own MCP toolkit to automate our go-to-market workflows. As part of this, we ran a benchmark comparing the Arcade Attio toolkit against Composio’s to understand the impact of tool-quality differences. Across 8 CRM queries, Arcade consumed 7,426 tokens total while Composio consumed 747,083, a difference of more than 100x.

At that scale, toolkit choice stops being an implementation detail and starts showing up in your infrastructure costs and agent reliability. Here’s the full breakdown of the benchmark.

Benchmark

We seeded an Attio CRM sandbox with 50 companies (Fortune 50), 100 contacts (real C-suite executives), and 50 deals across 6 pipeline stages. We then ran 8 identical CRM queries through two MCP toolkits, Arcade and Composio, and recorded the raw token output of each response. Both tests used Claude Code as the client with Claude Sonnet 4.6 as the model.

All raw response data, the sandbox seed script, and the exact eval prompts are open source at github.com/ArcadeAI/attio-mcp-benchmark. You can reproduce this yourself.

Results

Total tokens across all 8 queries:

Toolkit     Total Tokens    Avg per Query
Arcade      7,426           928
Composio    747,083         93,385

Composio returned 100.6x as many response tokens as Arcade for the same 8 queries against the same workspace.

Per-query breakdown:

#    Query                                           Arcade    Composio    Ratio
01   List 25 companies (name only)                   902       144,363     160x
02   Deals in Nurture stage (name + stage)           974       48,792      50x
03   Deals over $50K (name + value)                  1,072     66,752      62x
04   Companies with “Tech” in name                   354       48,103      136x
05   Technology companies (name + categories)        1,030     165,958     161x
06   Deals before March 2026 (name + date + value)   1,600     111,829     70x
07   Large Technology companies (compound filter)    1,329     159,032     120x
08   Highest-value deal (sort desc, limit 1)         165       2,254       14x

Token counts via tiktoken cl100k_base. Both toolkits tested live against the same Attio workspace with the same seeded data.
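
As a reference point, here is a minimal sketch of that measurement; the file path is illustrative, and scripts/count_tokens.py in the repo is the authoritative version:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Read one saved raw response and count its tokens. The path is
# illustrative; the benchmark saved one file per query per toolkit.
with open("responses/arcade/q01.json") as f:
    raw = f.read()

print(len(enc.encode(raw)))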

Why the gap exists

Three structural differences explain the 100x delta:

1. Field selection vs full record dump

Arcade requires agents to specify which fields they need (name, value, stage). The response contains only those fields.

Composio returns every field on every record: all custom attributes, all built-in attributes, with no way to select specific fields.

In this workspace, a single company record expands to ~5,800 tokens in Composio’s response format. In Arcade’s, the same record with name selected is ~30 tokens.
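
As a rough illustration, here are the two response shapes as Python dicts; both are approximations for illustration, not verbatim payloads from either toolkit:

# Arcade-style response: only the fields the agent asked for.
arcade_record = {"name": "Apple Inc."}

# Composio-style response: every built-in and custom attribute on the
# record, heavily abbreviated here. In this workspace a real record
# expands to ~5,800 tokens.
composio_record = {
    "id": {"workspace_id": "...", "record_id": "..."},
    "values": {
        "name": [{"value": "Apple Inc.", "active_from": "...", "attribute_type": "text"}],
        "domains": [{"domain": "apple.com", "active_from": "..."}],
        # ...dozens more attributes, each fully expanded with metadata...
    },
}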

2. Temporal metadata on every field

Composio wraps every attribute value with full Attio API metadata: active_from, active_until, attribute_type, created_by actor objects, and type annotations. This is the raw Attio v2 API response passed through unmodified.

Arcade flattens these to key-value pairs. A company name is "Apple Inc.", not a nested object with timestamps and actor references.
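
An abbreviated paraphrase of what that wrapping looks like for a single value, next to the flattened form (field names follow the Attio v2 shape described above, but the payload itself is illustrative):

# What Composio passes through for a single attribute value:
wrapped_name = [{
    "active_from": "2025-01-15T10:32:00.000000000Z",
    "active_until": None,
    "attribute_type": "text",
    "created_by_actor": {"type": "api-token", "id": "..."},
    "value": "Apple Inc.",
}]

# What Arcade returns for the same attribute:
flat_name = "Apple Inc."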

3. Error recovery burns more context

Query 07 (compound filter: Technology companies with 1K+ employees) required 4 tool calls in Composio vs 1 in Arcade:

  • First attempt failed: $in operator not supported on select fields
  • Second attempt failed: option values 501-1000 don’t exist in this workspace
  • Schema discovery call: fetched full company attribute schema (~40 attributes) just to learn the actual option titles (5K-10K, 10K-50K, etc.)
  • Third attempt succeeded with the correct option values

Arcade resolved it in a single call. The total context consumed during Composio’s Q07 resolution was significantly larger than what hit disk.
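
A paraphrased sketch of that sequence; the filter shapes are approximations for illustration, not exact Attio API or tool-call syntax:

# Attempt 1 (failed): the $in operator isn't supported on select fields.
attempt_1 = {"categories": "Technology",
             "employee_range": {"$in": ["501-1000", "1000+"]}}

# Attempt 2 (failed): guessed option values like "501-1000" don't exist
# in this workspace.
attempt_2 = {"categories": "Technology", "employee_range": "501-1000"}

# Call 3: schema discovery fetched the full company attribute schema
# (~40 attributes) just to learn the real option titles.

# Attempt 3 (succeeded) with the actual option titles:
attempt_3 = {"categories": "Technology",
             "employee_range": ["5K-10K", "10K-50K"]}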

What this means for agents

At Arcade’s token rate, an agent can run all 8 benchmark queries and consume 3.7% of a 200K context window.

At Composio’s rate, the same 8 queries consume 373% of a 200K window; the responses cannot fit in a single context. The initial batch response alone was 467K tokens.
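
The arithmetic behind those percentages:

WINDOW = 200_000  # tokens

print(7_426 / WINDOW)    # Arcade:   ~0.037 -> 3.7% of the window
print(747_083 / WINDOW)  # Composio: ~3.74  -> 373%; overflows the window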

For multi-step agent workflows where the agent needs to query, reason, and act across several CRM operations, context headroom determines whether the agent completes the task or loses track of what it was doing. Research on long-context LLM performance (Lost in the Middle, Liu et al. 2023) shows accuracy degrades as input grows, particularly when relevant information is buried in the middle of long inputs.

The cost at scale

Using the average per-query token counts from this benchmark (928 for Arcade, 93,385 for Composio) at Claude Sonnet 4.6 input pricing ($3/M tokens):

Scale                                        Queries/Month    Arcade    Composio    Monthly Savings    Annual Savings
Small team (10 agents, 50 queries/day)       15,000           $42       $4,202      $4,161             $49,928
Mid-market (25 agents, 100 queries/day)      75,000           $209      $21,012     $20,803            $249,633
Enterprise (100 agents, 200 queries/day)     600,000          $1,670    $168,093    $166,423           $1,997,071

These are input-token costs only: the marginal difference between the two toolkits for tool response data. Total agent costs (system prompts, reasoning, output tokens) add equally to both sides.
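
The table can be reproduced from the benchmark averages with a few lines of Python:

PRICE_PER_M = 3.0  # $/M input tokens, Claude Sonnet 4.6
AVG_TOKENS = {"Arcade": 928, "Composio": 93_385}

def monthly_cost(queries_per_month: int, avg_tokens: int) -> float:
    # tokens/month * $/token
    return queries_per_month * avg_tokens * PRICE_PER_M / 1_000_000

for tier, queries in [("Small team", 15_000), ("Mid-market", 75_000),
                      ("Enterprise", 600_000)]:
    arcade = monthly_cost(queries, AVG_TOKENS["Arcade"])
    composio = monthly_cost(queries, AVG_TOKENS["Composio"])
    print(f"{tier}: ${arcade:,.0f} vs ${composio:,.0f} "
          f"-> saves ${composio - arcade:,.0f}/month")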

At the mid-market tier, the toolkit choice alone could be worth a quarter-million dollars a year. At enterprise scale, it could be $2M. And that’s at Sonnet pricing; for agents running on Opus 4.6 ($15/M), multiply these numbers by five.

Methodology

  • Sandbox: 50 companies, 100 people, 50 deals, 26 custom attributes (15 on companies, 11 on deals). Seeded via scripts/seed_workspace.py.
  • Client: Claude Code with Claude Sonnet 4.6
  • Queries: 8 eval prompts covering list, filter by status, numeric filter, text search, select filter, date comparison, compound filter, and sort+limit
  • Measurement: tiktoken cl100k_base on complete raw JSON responses saved to disk

Reproduce it yourself:

git clone https://github.com/ArcadeAI/attio-mcp-benchmark
cd attio-mcp-benchmark

# Seed a sandbox (requires Attio API key)
ATTIO_API_KEY=your_key python3 scripts/seed_workspace.py

# Run evals (see evals/*.md for the 8 prompts)
# Connect your toolkit's MCP server and run each prompt

# Count tokens
pip3 install tiktoken
python3 scripts/count_tokens.py