When we moved our CRM to Attio, we built our own MCP toolkit to automate our go-to-market workflows. As part of this, we decided to run a benchmark comparing the Arcade Attio toolkit against Composio’s to better understand the impact of tool quality differences. Across 8 CRM queries, Arcade consumed 7,426 tokens total while Composio consumed 747,083, a difference of more than 100x.
At that scale, toolkit choice stops being an implementation detail and starts showing up in your infrastructure costs and agent reliability. Here’s the full breakdown of the benchmark.
## Benchmark
We seeded an Attio CRM sandbox with 50 companies (Fortune 50), 100 contacts (real C-suite executives), and 50 deals across 6 pipeline stages. We then ran 8 identical CRM queries through two MCP toolkits, Arcade and Composio, and recorded the raw token output of each response. Both tests used Claude Code as the client with Claude Sonnet 4.6 as the model.
All raw response data, the sandbox seed script, and the exact eval prompts are open source at github.com/ArcadeAI/attio-mcp-benchmark. You can reproduce this yourself.
## Results
Total tokens across all 8 queries:
| Toolkit | Total Tokens | Avg per Query |
|---|---|---|
| Arcade | 7,426 | 928 |
| Composio | 747,083 | 93,385 |
Composio returned 100.6x the response tokens of Arcade for the same 8 queries against the same workspace.
Per-query breakdown:
| # | Query | Arcade | Composio | Ratio |
|---|---|---|---|---|
| 01 | List 25 companies (name only) | 902 | 144,363 | 160x |
| 02 | Deals in Nurture stage (name + stage) | 974 | 48,792 | 50x |
| 03 | Deals over $50K (name + value) | 1,072 | 66,752 | 62x |
| 04 | Companies with “Tech” in name | 354 | 48,103 | 136x |
| 05 | Technology companies (name + categories) | 1,030 | 165,958 | 161x |
| 06 | Deals before March 2026 (name + date + value) | 1,600 | 111,829 | 70x |
| 07 | Large Technology companies (compound filter) | 1,329 | 159,032 | 120x |
| 08 | Highest-value deal (sort desc, limit 1) | 165 | 2,254 | 14x |
Token counts via tiktoken cl100k_base. Both toolkits tested live against the same Attio workspace with the same seeded data.
## Why the gap exists
Three structural differences explain the 100x delta:
### 1. Field selection vs full record dump
Arcade requires agents to specify which fields they need (name, value, stage). The response contains only those fields.
Composio returns every field on every record: all custom attributes, all built-in attributes, with no way to select specific fields.
In this workspace, a single company record expands to ~5,800 tokens in Composio’s response format. In Arcade’s, the same record with name selected is ~30 tokens.
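The size difference is easy to see side by side. These two shapes are illustrative (not either toolkit’s exact schema): one returns only the requested field, the other passes every attribute through in the API’s metadata wrapper:

```python
import json

# Agent asked for "name" only; the response carries just that field.
selected = {"name": "Apple Inc."}

# Full record pass-through: every attribute arrives wrapped in
# API metadata, multiplied across dozens of attributes per record.
full_record = {
    "values": {
        "name": [{
            "value": "Apple Inc.",
            "active_from": "2025-01-01T00:00:00Z",
            "active_until": None,
            "attribute_type": "text",
            "created_by_actor": {"type": "api-token", "id": "..."},
        }],
        # ...dozens more attributes in the same wrapper...
    }
}

print(len(json.dumps(selected)), "vs", len(json.dumps(full_record)), "chars")
```

Multiply the wrapper overhead by ~40 attributes per record and 50 records per page, and a name-only query balloons into six figures of tokens.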
### 2. Temporal metadata on every field
Composio wraps every attribute value with full Attio API metadata: active_from, active_until, attribute_type, created_by actor objects, and type annotations. This is the raw Attio v2 API response passed through unmodified.
Arcade flattens these to key-value pairs. A company name is "Apple Inc.", not a nested object with timestamps and actor references.
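The flattening step is conceptually a few lines. This sketch mirrors the Attio v2 `values` structure; the function name is ours, not Arcade’s actual implementation:

```python
def flatten_record(record: dict) -> dict:
    """Collapse metadata-wrapped attribute values into plain key-value pairs."""
    flat = {}
    for attr, entries in record.get("values", {}).items():
        if entries:
            # Take the value itself; drop active_from/active_until,
            # attribute_type, and actor metadata.
            flat[attr] = entries[0].get("value")
    return flat
```

The temporal metadata exists so the API can represent attribute history; an agent answering “list 25 companies” never needs it.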
### 3. Error recovery burns more context
Query 07 (compound filter: Technology companies with 1K+ employees) required 4 tool calls in Composio vs 1 in Arcade:
- First attempt failed: `$in` operator not supported on select fields
- Second attempt failed: option values like `501-1000` don’t exist in this workspace
- Schema discovery call: fetched the full company attribute schema (~40 attributes) just to learn the actual option titles (`5K-10K`, `10K-50K`, etc.)
- Third attempt succeeded with the correct option values
Arcade resolved it in a single call. The total context consumed during Composio’s Q07 resolution was significantly larger than what hit disk.
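A hypothetical reconstruction of that call sequence makes the failure mode concrete. The filter shapes and option titles below are illustrative, not Attio’s exact filter grammar:

```python
# Option titles the workspace actually defines, learned only after
# a full schema-discovery call (~40 attributes fetched to read one list).
workspace_options = ["1-1K", "1K-5K", "5K-10K", "10K-50K"]

attempt_1 = {"employee_range": {"$in": ["501-1000"]}}  # rejected: $in unsupported on selects
attempt_2 = {"employee_range": "501-1000"}             # rejected: option not in workspace
attempt_3 = {"employee_range": "10K-50K"}              # accepted

def option_exists(filter_value, options) -> bool:
    """A select filter only succeeds if its value is a real option title."""
    return isinstance(filter_value, str) and filter_value in options
```

Every failed attempt leaves its request and error response in context, so the agent pays for all four calls, plus the schema dump, before the query succeeds.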
## What this means for agents
At Arcade’s token rate, an agent can run all 8 benchmark queries and consume 3.7% of a 200K context window.
At Composio’s rate, the same 8 queries consume 373% of a 200K window; the responses cannot fit in a single context. The initial batch response alone was 467K tokens.
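The context-window arithmetic behind those two percentages, using the benchmark totals:

```python
CONTEXT_WINDOW = 200_000  # tokens

arcade_total = 7_426
composio_total = 747_083

arcade_pct = arcade_total / CONTEXT_WINDOW * 100    # ~3.7% of the window
composio_pct = composio_total / CONTEXT_WINDOW * 100  # ~373%, i.e. overflows 3.7 windows
```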
For multi-step agent workflows where the agent needs to query, reason, and act across several CRM operations, context headroom determines whether the agent completes the task or loses track of what it was doing. Research on long-context LLM performance (Lost in the Middle, Liu et al. 2023) shows accuracy degrades as input grows, particularly when relevant information is buried in the middle of long inputs.
## The cost at scale
Using the average per-query token counts from this benchmark (928 for Arcade, 93,385 for Composio) at Claude Sonnet 4.6 input pricing ($3/M tokens):
| Scale | Queries/Month | Arcade | Composio | Monthly Savings | Annual Savings |
|---|---|---|---|---|---|
| Small team (10 agents, 50 queries/day) | 15,000 | $42 | $4,202 | $4,161 | $49,928 |
| Mid-market (25 agents, 100 queries/day) | 75,000 | $209 | $21,012 | $20,803 | $249,633 |
| Enterprise (100 agents, 200 queries/day) | 600,000 | $1,670 | $168,093 | $166,423 | $1,997,071 |
These are input token costs only: the marginal difference between the two toolkits for tool response data. Total agent costs (system prompts, reasoning, output tokens) add to both sides equally.
At the mid-market tier, the toolkit choice alone could be a quarter-million dollars a year. At enterprise scale, it could be $2M. And that’s at Sonnet pricing. For agents running on Opus 4.6 ($15/M input), multiply these numbers by five.
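The cost table reduces to one formula, shown here so the numbers are easy to re-derive or adapt to other pricing:

```python
PRICE_PER_M = 3.0      # Claude Sonnet input pricing, $/M tokens
ARCADE_AVG = 928       # avg tokens per query, from the benchmark
COMPOSIO_AVG = 93_385

def monthly_cost(queries_per_month: int, avg_tokens: int) -> float:
    """Input-token cost of tool responses at a given query volume."""
    return queries_per_month * avg_tokens / 1_000_000 * PRICE_PER_M

for label, q in [("Small team", 15_000), ("Mid-market", 75_000), ("Enterprise", 600_000)]:
    a = monthly_cost(q, ARCADE_AVG)
    c = monthly_cost(q, COMPOSIO_AVG)
    print(f"{label}: ${a:,.0f} vs ${c:,.0f} (saves ${c - a:,.0f}/mo)")
```

Swap `PRICE_PER_M` for your model’s input rate to re-run the table at Opus or any other pricing.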
## Methodology
- Sandbox: 50 companies, 100 people, 50 deals, 26 custom attributes (15 on companies, 11 on deals). Seeded via `scripts/seed_workspace.py`
- Client: Claude Code with Claude Sonnet 4.6
- Queries: 8 eval prompts covering list, filter by status, numeric filter, text search, select filter, date comparison, compound filter, and sort+limit
- Measurement: tiktoken cl100k_base on complete raw JSON responses saved to disk
Reproduce it yourself:

```shell
git clone https://github.com/ArcadeAI/attio-mcp-benchmark
cd attio-mcp-benchmark

# Seed a sandbox (requires Attio API key)
ATTIO_API_KEY=your_key python3 scripts/seed_workspace.py

# Run evals (see evals/*.md for the 8 prompts)
# Connect your toolkit's MCP server and run each prompt

# Count tokens
pip3 install tiktoken
python3 scripts/count_tokens.py
```