# MCP vs CLI MCP Token Benchmark (TypeScript)

A reproducible TypeScript benchmark comparing MCP-native agents vs `mcp-cli`, capturing token usage, tool calls, retries, and latency across shared MCP tasks.

The framework compares:

- **MCP-native Agent**: uses `openai/openai-agents-js` to connect to MCP servers directly (see the sketch below).
- **CLI-MCP Agent**: uses `philschmid/mcp-cli` as a thin transport (`mcp-cli call <server> <tool> <json>`).
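For orientation, here is a minimal sketch of the MCP-native side, assuming the `MCPServerStdio` transport and the `Agent`/`run` API from `@openai/agents`; the repo's actual wiring lives in its source, and the paths and prompts here are illustrative:

```ts
import { Agent, run, MCPServerStdio } from '@openai/agents';

// Minimal sketch, not the repo's actual code: one stdio MCP server,
// attached directly to an agent. './sandbox' is an illustrative path.
const fsServer = new MCPServerStdio({
  name: 'filesystem',
  fullCommand: 'npx -y @modelcontextprotocol/server-filesystem ./sandbox',
});

await fsServer.connect();
try {
  const agent = new Agent({
    name: 'mcp-native-bench',
    model: 'gpt-5-mini',
    instructions: 'Use the filesystem tools to complete the task.',
    mcpServers: [fsServer],
  });
  const result = await run(agent, 'Read sample.txt and summarize it.');
  console.log(result.finalOutput);
} finally {
  await fsServer.close();
}
```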
Outputs:

- `results/raw-results.json` (all runs + metrics)
- `results/summary.json` (per-task averages + MCP vs CLI deltas)
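The exact schemas live in the repo; as a rough guide, each raw run record carries the metrics named above. A hypothetical TypeScript shape (field names are illustrative, not the actual schema):

```ts
// Illustrative only: one run of one task by one agent variant.
interface RunRecord {
  task: string;             // e.g. 'filesystem.read.sample'
  variant: 'mcp' | 'cli';   // which agent produced the run
  model: string;
  inputTokens: number;
  outputTokens: number;
  toolCalls: number;
  retries: number;
  latencyMs: number;
}
```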
## Requirements
- Node.js 18+
- `mcp-cli` on PATH
- MCP servers (filesystem, GitHub, search) available
- OpenAI API key
## Setup
```bash
npm install
```
Set environment variables (or copy `.env.sample` to `.env` and fill in keys):
```
OPENAI_API_KEY=...
GITHUB_PERSONAL_ACCESS_TOKEN=...
BRAVE_API_KEY=...
```
If `GITHUB_PERSONAL_ACCESS_TOKEN` or `BRAVE_API_KEY` is missing, the matching MCP server and its tasks are skipped, so you can still run filesystem-only benchmarks (sketched below).
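The skip amounts to conditional server registration; a sketch of the idea (illustrative, not the repo's code):

```ts
// Only register servers whose credentials are present; tasks that target
// a missing server are dropped from the run set.
const enabledServers = ['filesystem'];
if (process.env.GITHUB_PERSONAL_ACCESS_TOKEN) enabledServers.push('github');
if (process.env.BRAVE_API_KEY) enabledServers.push('brave-search');
```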
## Configure MCP Servers
Two configs are used:

- Agent (direct MCP) uses the defaults in `src/config.ts`.
- CLI (`mcp-cli`) uses `mcp_servers.json`.
Adjust commands if your MCP servers are installed differently.
### Default servers (stdio)
- `@modelcontextprotocol/server-filesystem`
- `github-mcp-server`
- `@modelcontextprotocol/server-brave-search`
If you run a GitHub MCP server via Docker or HTTP, update `src/config.ts` and `mcp_servers.json` accordingly.
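For reference, a plausible `mcp_servers.json` for the three default stdio servers. This assumes `mcp-cli` reads the common `mcpServers` layout and that the servers inherit `GITHUB_PERSONAL_ACCESS_TOKEN` and `BRAVE_API_KEY` from the environment; the `github-mcp-server stdio` invocation and the `./sandbox` path are placeholders to adapt to your install:

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "./sandbox"]
    },
    "github": {
      "command": "github-mcp-server",
      "args": ["stdio"]
    },
    "brave-search": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-brave-search"]
    }
  }
}
```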
## Run
```bash
npm run dev
```
Optional arguments:
```bash
npm run dev -- --runs 3 --model gpt-5-mini --tasks filesystem.read.sample,github.search.code
```
Multi-model and AI SDK examples:
```bash
# Run multiple OpenAI models in one sweep
npm run dev -- --models gpt-5-mini,gpt-4.1-mini

# Use AI SDK (e.g. Anthropic / Google) models
MODEL_PROVIDER=aisdk AI_SDK_PROVIDER=anthropic AI_SDK_MODELS=claude-3-5-sonnet-20241022 npm run dev
MODEL_PROVIDER=aisdk AI_SDK_PROVIDER=google AI_SDK_MODELS=gemini-1.5-pro-latest npm run dev
```
## Notes
- The CLI agent calls exactly (see the wrapper sketch below):

  ```bash
  mcp-cli call -c mcp_servers.json <server> <tool> <json>
  ```

- Output is expected to be raw JSON. `NO_COLOR=1` is set to avoid ANSI noise.
- Metrics come from `openai-agents-js` tracing/usage; no manual token estimation.
- Retries are inferred from tool-call error spans followed by a subsequent tool call of the same name (see the heuristic sketch below).
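To make the transport note concrete, here is a sketch of how the CLI-side agent could expose that exact command as a single function tool, assuming the `tool()` helper from `@openai/agents` with zod parameters; the tool name `mcp_call` is illustrative, not the repo's:

```ts
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';
import { tool } from '@openai/agents';
import { z } from 'zod';

const execFileAsync = promisify(execFile);

// Illustrative pass-through tool: every MCP call is routed through mcp-cli.
const mcpCall = tool({
  name: 'mcp_call',
  description: 'Call an MCP tool via mcp-cli and return its raw JSON output.',
  parameters: z.object({
    server: z.string(),
    tool: z.string(),
    args: z.string().describe('JSON-encoded tool arguments'),
  }),
  async execute({ server, tool: toolName, args }) {
    const { stdout } = await execFileAsync(
      'mcp-cli',
      ['call', '-c', 'mcp_servers.json', server, toolName, args],
      { env: { ...process.env, NO_COLOR: '1' } }, // suppress ANSI noise
    );
    return stdout; // expected to be raw JSON
  },
});
```

And a sketch of the retry heuristic from the last bullet, over a simplified span shape (the real spans come from `openai-agents-js` tracing; this is not the repo's code):

```ts
// Simplified stand-in for a tool-call span pulled from the trace.
interface ToolSpan {
  name: string;   // tool name
  error: boolean; // whether the call errored
}

// A failed call followed later by another call to the same tool counts as
// one retry; each error span is consumed by at most one follow-up call.
function countRetries(spans: ToolSpan[]): number {
  let retries = 0;
  const pendingErrors = new Set<string>();
  for (const span of spans) {
    if (pendingErrors.has(span.name)) {
      retries += 1;
      pendingErrors.delete(span.name);
    }
    if (span.error) pendingErrors.add(span.name);
  }
  return retries;
}
```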