When we moved our CRM to Attio, we built our own MCP toolkit to automate our go-to-market workflows. As part of this, we ran a benchmark comparing the Arcade Attio toolkit against Composio’s to understand the impact of tool-quality differences. Across 8 CRM queries, Arcade consumed 7,426 tokens total while Composio consumed 747,083, a difference of more than 100x.

At that scale, toolkit choice stops being an implementation detail and starts showing up in your infrastructure costs and agent reliability. Here’s the full breakdown of the benchmark.

Benchmark

We seeded an Attio CRM sandbox with 50 companies (Fortune 50), 100 contacts (real C-suite executives), and 50 deals across 6 pipeline stages. We then ran 8 identical CRM queries through two MCP toolkits, Arcade and Composio, and recorded the raw token output of each response. Both tests used Claude Code as the client with Claude Sonnet 4.6 as the model.

All raw response data, the sandbox seed script, and the exact eval prompts are open source at github.com/ArcadeAI/attio-mcp-benchmark. You can reproduce this yourself.

Results

Total tokens across all 8 queries:

Toolkit     Total Tokens    Avg per Query
Arcade      7,426           928
Composio    747,083         93,385

Composio returned 100.6x as many response tokens as Arcade for the same 8 queries against the same workspace.

Per-query breakdown:

#    Query                                           Arcade    Composio    Ratio
01   List 25 companies (name only)                   902       144,363     160x
02   Deals in Nurture stage (name + stage)           974       48,792      50x
03   Deals over $50K (name + value)                  1,072     66,752      62x
04   Companies with “Tech” in name                   354       48,103      136x
05   Technology companies (name + categories)        1,030     165,958     161x
06   Deals before March 2026 (name + date + value)   1,600     111,829     70x
07   Large Technology companies (compound filter)    1,329     159,032     120x
08   Highest-value deal (sort desc, limit 1)         165       2,254       14x

Token counts via tiktoken cl100k_base. Both toolkits tested live against the same Attio workspace with the same seeded data.
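
As a reference point, here is a minimal sketch of that measurement; the file path is illustrative, and scripts/count_tokens.py in the repo is the authoritative version:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Read one saved raw response and count its tokens. The path is
# illustrative; the benchmark saved one file per query per toolkit.
with open("responses/arcade/q01.json") as f:
    raw = f.read()

print(len(enc.encode(raw)))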

Why the gap exists

Three structural differences explain the 100x delta:

1. Field selection vs full record dump

Arcade requires agents to specify which fields they need (name, value, stage). The response contains only those fields.

Composio returns every field on every record: all custom attributes, all built-in attributes, with no way to select specific fields.

In this workspace, a single company record expands to ~5,800 tokens in Composio’s response format. In Arcade’s, the same record with name selected is ~30 tokens.
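
As a rough illustration, here are the two response shapes as Python dicts; both are approximations for illustration, not verbatim payloads from either toolkit:

# Arcade-style response: only the fields the agent asked for.
arcade_record = {"name": "Apple Inc."}

# Composio-style response: every built-in and custom attribute on the
# record, heavily abbreviated here. In this workspace a real record
# expands to ~5,800 tokens.
composio_record = {
    "id": {"workspace_id": "...", "record_id": "..."},
    "values": {
        "name": [{"value": "Apple Inc.", "active_from": "...", "attribute_type": "text"}],
        "domains": [{"domain": "apple.com", "active_from": "..."}],
        # ...dozens more attributes, each fully expanded with metadata...
    },
}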

2. Temporal metadata on every field

Composio wraps every attribute value with full Attio API metadata: active_from, active_until, attribute_type, created_by actor objects, and type annotations. This is the raw Attio v2 API response passed through unmodified.

Arcade flattens these to key-value pairs. A company name is "Apple Inc.", not a nested object with timestamps and actor references.
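
An abbreviated paraphrase of what that wrapping looks like for a single value, next to the flattened form (field names follow the Attio v2 shape described above, but the payload itself is illustrative):

# What Composio passes through for a single attribute value:
wrapped_name = [{
    "active_from": "2025-01-15T10:32:00.000000000Z",
    "active_until": None,
    "attribute_type": "text",
    "created_by_actor": {"type": "api-token", "id": "..."},
    "value": "Apple Inc.",
}]

# What Arcade returns for the same attribute:
flat_name = "Apple Inc."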

3. Error recovery burns more context

Query 07 (compound filter: Technology companies with 1K+ employees) required 4 tool calls in Composio vs 1 in Arcade:

  • First attempt failed: $in operator not supported on select fields
  • Second attempt failed: option values 501-1000 don’t exist in this workspace
  • Schema discovery call: fetched full company attribute schema (~40 attributes) just to learn the actual option titles (5K-10K, 10K-50K, etc.)
  • Third attempt succeeded with the correct option values

Arcade resolved it in a single call. The total context consumed during Composio’s Q07 resolution was significantly larger than what hit disk.
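
A paraphrased sketch of that sequence; the filter shapes are approximations for illustration, not exact Attio API or tool-call syntax:

# Attempt 1 (failed): the $in operator isn't supported on select fields.
attempt_1 = {"categories": "Technology",
             "employee_range": {"$in": ["501-1000", "1000+"]}}

# Attempt 2 (failed): guessed option values like "501-1000" don't exist
# in this workspace.
attempt_2 = {"categories": "Technology", "employee_range": "501-1000"}

# Call 3: schema discovery fetched the full company attribute schema
# (~40 attributes) just to learn the real option titles.

# Attempt 3 (succeeded) with the actual option titles:
attempt_3 = {"categories": "Technology",
             "employee_range": ["5K-10K", "10K-50K"]}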

What this means for agents

At Arcade’s token rate, an agent can run all 8 benchmark queries and consume 3.7% of a 200K context window.

At Composio’s rate, the same 8 queries consume 373% of a 200K window; the responses cannot fit in a single context. The initial batch response alone was 467K tokens.
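
The arithmetic behind those percentages:

WINDOW = 200_000  # tokens

print(7_426 / WINDOW)    # Arcade:   ~0.037 -> 3.7% of the window
print(747_083 / WINDOW)  # Composio: ~3.74  -> 373%; overflows the window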

For multi-step agent workflows where the agent needs to query, reason, and act across several CRM operations, context headroom determines whether the agent completes the task or loses track of what it was doing. Research on long-context LLM performance (Lost in the Middle, Liu et al. 2023) shows accuracy degrades as input grows, particularly when relevant information is buried in the middle of long inputs.

The cost at scale

Using the average per-query token counts from this benchmark (928 for Arcade, 93,385 for Composio) at Claude Sonnet 4.6 input pricing ($3/M tokens):

Scale                                        Queries/Month    Arcade    Composio    Monthly Savings    Annual Savings
Small team (10 agents, 50 queries/day)       15,000           $42       $4,202      $4,161             $49,928
Mid-market (25 agents, 100 queries/day)      75,000           $209      $21,012     $20,803            $249,633
Enterprise (100 agents, 200 queries/day)     600,000          $1,670    $168,093    $166,423           $1,997,071

These are input-token costs only: the marginal difference between the two toolkits for tool response data. Total agent costs (system prompts, reasoning, output tokens) add equally to both sides.
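
The table can be reproduced from the benchmark averages with a few lines of Python:

PRICE_PER_M = 3.0  # $/M input tokens, Claude Sonnet 4.6
AVG_TOKENS = {"Arcade": 928, "Composio": 93_385}

def monthly_cost(queries_per_month: int, avg_tokens: int) -> float:
    # tokens/month * $/token
    return queries_per_month * avg_tokens * PRICE_PER_M / 1_000_000

for tier, queries in [("Small team", 15_000), ("Mid-market", 75_000),
                      ("Enterprise", 600_000)]:
    arcade = monthly_cost(queries, AVG_TOKENS["Arcade"])
    composio = monthly_cost(queries, AVG_TOKENS["Composio"])
    print(f"{tier}: ${arcade:,.0f} vs ${composio:,.0f} "
          f"-> saves ${composio - arcade:,.0f}/month")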

At the mid-market tier, the toolkit choice alone could be worth a quarter-million dollars a year. At enterprise scale, it could be $2M. And that’s at Sonnet pricing; for agents running on Opus 4.6 ($15/M), multiply these numbers by five.

Methodology

  • Sandbox: 50 companies, 100 people, 50 deals, 26 custom attributes (15 on companies, 11 on deals). Seeded via scripts/seed_workspace.py.
  • Client: Claude Code with Claude Sonnet 4.6
  • Queries: 8 eval prompts covering list, filter by status, numeric filter, text search, select filter, date comparison, compound filter, and sort+limit
  • Measurement: tiktoken cl100k_base on complete raw JSON responses saved to disk

Reproduce it yourself:

git clone https://github.com/ArcadeAI/attio-mcp-benchmark
cd attio-mcp-benchmark

# Seed a sandbox (requires Attio API key)
ATTIO_API_KEY=your_key python3 scripts/seed_workspace.py

# Run evals (see evals/*.md for the 8 prompts)
# Connect your toolkit's MCP server and run each prompt

# Count tokens
pip3 install tiktoken
python3 scripts/count_tokens.py