Research MCP Server
Intelligent multi-source research orchestration for AI assistants
Overview
Research MCP Server is a Model Context Protocol (MCP) server that provides consensus-driven, multi-source research capabilities to AI assistants like Claude, ChatGPT, and other MCP-compatible clients. It uses 2-5 LLMs to vote on research strategy, then dynamically orchestrates research across web search, academic papers, library documentation, and AI reasoning, delivering comprehensive, validated insights with built-in fact-checking.
Why Research MCP?
- Consensus Planning: 2-5 LLMs vote on research strategy + independent planning for each sub-question
- Production-Ready Reports: Enforced numeric specificity, no placeholder code, explicit success criteria
- Phased Synthesis: Token-efficient approach with key findings extraction (~40% fewer tokens)
- Code Validation: Post-synthesis validation against Context7 docs catches hallucinated code
- Inline Citations: Every claim sourced ([perplexity:url], [context7:lib], [arxiv:id])
- Multi-Model Validation: Critical challenge + consensus validation by multiple LLMs
- Actionability Checklist: Synthesis evaluated for specificity, completeness, and contradiction-free output
- Context-Efficient Reports: Sectioned architecture with on-demand reading (prevents AI context bloat)
- Dynamic Execution: Custom research plans with parallel processing
- Multi-Source Synthesis: Combines Perplexity, arXiv, Context7, and direct LLM reasoning
Features
Core Capabilities
- Adaptive Research Planning: Root consensus + independent sub-question planning
- Multi-Source Search:
  - Web search via Perplexity API
  - Academic papers via arXiv with AI-generated summaries
  - Library documentation via Context7 (with shared + specific doc fetching)
  - Deep reasoning via direct LLM analysis
- Parallel Processing: Main query + sub-questions execute simultaneously
- Phased Synthesis: Main synthesis → key findings extraction → sub-Q synthesis (token-efficient)
- Code Validation: Post-synthesis validation against Context7 documentation
- Validation Pipeline: Critical challenge + multi-model consensus + sufficiency voting
🚀 Installation
Prerequisites
- Node.js 18+
- API keys for:
- Perplexity API
- Google AI (Gemini)
- OpenAI API
- Context7 (for library documentation)
Quick Start
# Clone the repository
git clone https://github.com/yourusername/research-mcp.git
cd research-mcp
# Install dependencies
npm install
# or
bun install
# Build TypeScript
npm run build
Integration with MCP Clients
Claude Desktop / Cursor
Add to your MCP configuration file:
Claude Desktop: ~/Library/Application Support/Claude/claude_desktop_config.json
Cursor: ~/.cursor/mcp.json
{
  "mcpServers": {
    "research": {
      "command": "node",
      "args": ["/path/to/research-mcp/dist/index.js"],
      "env": {
        "PERPLEXITY_API_KEY": "your-key",
        "GEMINI_API_KEY": "your-key",
        "OPENAI_API_KEY": "your-key",
        "ARXIV_STORAGE_PATH": "/path/to/storage/",
        "CONTEXT7_API_KEY": "your-key"
      }
    }
  }
}
Restart your client after adding the configuration.
💡 Usage
Basic Example (Async Pattern)
Ask your AI assistant to use the research tool:
I need to research transformer architectures.
Can you use the start_research tool to give me a comprehensive overview?
The AI will call:
{
  "query": "How do transformer architectures work?",
  "depth_level": 2
}
Then poll with check_research_status using the returned job_id.
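A minimal sketch of this async pattern from the client side, assuming a generic callTool(name, args) helper; the 5-second interval and the status/job_id field names on the response are illustrative assumptions, only the tool names come from this README.

```typescript
// Hypothetical client-side polling loop. callTool(name, args) stands in for
// whatever your MCP client exposes; the response fields shown are assumed.
async function researchWithPolling(
  callTool: (name: string, args: object) => Promise<any>
) {
  const started = await callTool("start_research", {
    query: "How do transformer architectures work?",
    depth_level: 2,
  });

  // Poll until the job is no longer running (field names are illustrative).
  let result = await callTool("check_research_status", { job_id: started.job_id });
  while (result.status === "running") {
    await new Promise((resolve) => setTimeout(resolve, 5000)); // wait 5s between polls
    result = await callTool("check_research_status", { job_id: started.job_id });
  }
  return result;
}
```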
Reading Research Report Citations
When personas cite research using format [R-135216:5-19], you can verify the content:
Use the read_report tool to verify what the persona cited:
{
  "citation": "[R-135216:5-19]"
}
This returns lines 5-19 from report R-135216.
Advanced Example with Rich Context
I'm building an AI memory companion that extracts entities and deduplicates memories.
I need to create 600+ test examples for evaluation, but my current template-based
approach creates unrealistic data. Research the best approaches for creating
high-quality evaluation datasets.
Context:
- Solo developer with 20 hour budget
- Already reviewed papers on Excel formula repair and DAHL biomedical benchmark
- Found that synthetic data is 40% simpler than real data
- Random template filling doesn't work
Specific questions:
1. What makes evaluation data representative?
2. How to generate hard negatives?
Tech stack: Python, Neo4j, LangSmith
This triggers a sophisticated research session with:
{
  "query": "How to create high-quality evaluation datasets for LLM testing?",
  "project_description": "AI memory companion with semantic extraction/dedup",
  "current_state": "85 test examples, need 600+",
  "problem_statement": "Template-based generation creates unrealistic data",
  "constraints": ["Solo developer", "20 hours"],
  "domain": "LLM evaluation datasets",
  "depth_level": 4,
  "papers_read": ["Excel formula repair", "DAHL biomedical benchmark"],
  "key_findings": ["Synthetic data 40% simpler than real data"],
  "rejected_approaches": ["Random template filling"],
  "sub_questions": [
    "What makes evaluation data representative?",
    "How to generate hard negatives?"
  ],
  "tech_stack": ["Python", "Neo4j", "LangSmith"],
  "output_format": "actionable_steps",
  "include_code_examples": true
}
Saving Research Reports
You can save research outputs as local markdown files by setting report: true:
{
  "query": "How do transformer architectures work?",
  "depth_level": 3,
  "report": true,
  "report_path": "/Users/name/Documents/research/" // optional
}
Default behavior:
- Reports saved to ~/research-reports/
- Filename format: research-YYYY-MM-DD-sanitized-query.md
- Example: research-2025-12-10-how-do-transformer-architectures-work.md
- File path included in response
Custom directory: Use report_path parameter to specify a different location.
Available Tools
The MCP server exposes five tools:
- start_research - Async research orchestrator with rich parameters (returns job_id immediately)
- check_research_status - Poll async job status and retrieve results when complete
- read_report - Read specific lines from research reports using citation format (e.g., [R-135216:5-19])
- read_paper - Passthrough to arXiv MCP for reading full papers
- download_paper - Passthrough to arXiv MCP for downloading PDFs
Parameters Reference
| Parameter | Type | Description |
|-----------|------|-------------|
| query | string | Required. Your research question |
| depth_level | 1-5 | Research depth (auto-detected if omitted) |
| project_description | string | What you're building |
| current_state | string | Where you are now |
| problem_statement | string | The specific problem to solve |
| constraints | string[] | Time/budget/technical limits |
| domain | string | Research domain/area |
| papers_read | string[] | Papers already reviewed (prevents redundancy) |
| key_findings | string[] | What you already know |
| rejected_approaches | string[] | Approaches already ruled out |
| sub_questions | string[] | Specific questions to answer in parallel |
| tech_stack | string[] | Technologies in use (triggers Context7 docs) |
| output_format | enum | summary, detailed, or actionable_steps |
| include_code_examples | boolean | Whether to fetch code examples |
| date_range | string | Preferred date range (e.g., "2024-2025") |
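For readers who prefer types, here is a rough TypeScript sketch of the parameter shape implied by the table above (plus report/report_path from the report-saving section). Field names follow the docs; which fields are optional and the exact enum values are assumptions.

```typescript
// Assumed shape of the start_research parameters, derived from the table above.
interface StartResearchParams {
  query: string;                          // required research question
  depth_level?: 1 | 2 | 3 | 4 | 5;        // auto-detected if omitted
  project_description?: string;           // what you're building
  current_state?: string;                 // where you are now
  problem_statement?: string;             // the specific problem to solve
  constraints?: string[];                 // time/budget/technical limits
  domain?: string;                        // research domain/area
  papers_read?: string[];                 // papers already reviewed
  key_findings?: string[];                // what you already know
  rejected_approaches?: string[];         // approaches already ruled out
  sub_questions?: string[];               // questions answered in parallel
  tech_stack?: string[];                  // triggers Context7 docs
  output_format?: "summary" | "detailed" | "actionable_steps";
  include_code_examples?: boolean;        // whether to fetch code examples
  date_range?: string;                    // e.g. "2024-2025"
  report?: boolean;                       // save a local markdown report
  report_path?: string;                   // custom report directory
}
```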
🔍 How It Works
Intelligent Research Architecture (v2)
The research system uses a sophisticated phased approach designed for token efficiency and code accuracy:
Phase 1: Root Planning
- Consensus voting by 2-5 LLMs determines research complexity (1-5)
- Root planner creates strategy for main query only
- Identifies shared documentation needs (base API/syntax docs)
- Each sub-question gets independent planning (lightweight, fast)
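A minimal sketch of what a consensus complexity vote like this could look like; the voter list, prompt, and median rule are assumptions, not the server's actual implementation.

```typescript
// Sketch: ask several LLMs for a complexity score (1-5) and take the median.
// askModel is a placeholder for whichever completion call each provider uses.
type AskModel = (model: string, prompt: string) => Promise<string>;

async function voteOnComplexity(query: string, askModel: AskModel): Promise<number> {
  const voters = ["gemini-2.5-flash", "gpt-5-mini", "claude-3.5-haiku"]; // 2-5 models
  const prompt = `Rate the research complexity of this query from 1 to 5. Reply with a single digit.\n\n${query}`;

  const votes = await Promise.all(
    voters.map(async (model) => {
      const reply = await askModel(model, prompt);
      const vote = parseInt(reply.trim(), 10);
      return Number.isNaN(vote) ? 3 : Math.min(5, Math.max(1, vote)); // clamp, default to 3
    })
  );

  votes.sort((a, b) => a - b);
  return votes[Math.floor(votes.length / 2)]; // median vote decides depth
}
```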
Phase 2: Parallel Data Gathering
- Shared Context7 docs fetched once for all queries (e.g., "React basics")
- Main query executed with full tool access
- Sub-questions planned independently via fast LLM calls
- Each sub-Q chooses its own tools (context7, perplexity, arxiv)
- Can request specific Context7 topics beyond shared docs
- All gathering happens in parallel for speed
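A rough sketch of the parallel gathering pattern described in Phase 2, assuming hypothetical fetchContext7Docs, runMainQuery, and runSubQuestion helpers.

```typescript
// Sketch: fetch shared docs once, then run the main query and every
// sub-question concurrently. The helper functions are placeholders.
async function gatherPhase(
  query: string,
  subQuestions: string[],
  sharedLibraries: string[],
  helpers: {
    fetchContext7Docs: (libs: string[]) => Promise<string>;
    runMainQuery: (query: string, sharedDocs: string) => Promise<string>;
    runSubQuestion: (question: string, sharedDocs: string) => Promise<string>;
  }
) {
  // Shared Context7 docs are fetched exactly once and reused everywhere.
  const sharedDocs = await helpers.fetchContext7Docs(sharedLibraries);

  // Main query and all sub-questions execute in parallel.
  const [mainFindings, subFindings] = await Promise.all([
    helpers.runMainQuery(query, sharedDocs),
    Promise.all(subQuestions.map((q) => helpers.runSubQuestion(q, sharedDocs))),
  ]);

  return { sharedDocs, mainFindings, subFindings };
}
```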
Phase 3: Phased Synthesis (Token-Efficient)
- Main query synthesis - comprehensive answer to primary question
- Key findings extraction - ~500 token summary of main conclusions
- Sub-question synthesis - parallel, with key findings injected for coherence
- Prevents contradictions between main and sub-answers
- Each sub-Q synthesis uses only relevant data (not all research)
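A sketch of the three synthesis steps above; the prompts and the handling of the ~500-token budget are assumptions.

```typescript
// Sketch: main synthesis -> key findings extraction -> parallel sub-question
// synthesis with the key findings injected for coherence.
async function phasedSynthesis(
  mainFindings: string,
  subFindings: { question: string; data: string }[],
  llm: (prompt: string) => Promise<string>
) {
  // 1. Comprehensive answer to the primary question.
  const mainAnswer = await llm(`Synthesize a comprehensive answer from:\n${mainFindings}`);

  // 2. Compress the main conclusions (~500 tokens) so sub-answers stay consistent.
  const keyFindings = await llm(`Summarize the key findings below in about 500 tokens:\n${mainAnswer}`);

  // 3. Each sub-question sees only its own data plus the key findings summary,
  //    not the full main-query data dump.
  const subAnswers = await Promise.all(
    subFindings.map(({ question, data }) =>
      llm(`Key findings so far:\n${keyFindings}\n\nAnswer "${question}" using only:\n${data}`)
    )
  );

  return { mainAnswer, keyFindings, subAnswers };
}
```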
Phase 4: Code Validation Pass
- Extracts all code blocks from synthesized report
- Validates against authoritative Context7 documentation
- Fixes hallucinated APIs, outdated syntax, incorrect method names
- Context7 becomes source of truth for code accuracy
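A minimal sketch of a validation pass like Phase 4: pull fenced code blocks out of the report and ask an LLM to correct them against the cached Context7 docs. The regex and prompt are assumptions.

```typescript
// Sketch: extract fenced code blocks from the report and check each one
// against the cached Context7 documentation.
async function validateCodeBlocks(
  report: string,
  context7Docs: string,
  llm: (prompt: string) => Promise<string>
): Promise<string> {
  const fencedBlock = /`{3}[\w-]*\n([\s\S]*?)`{3}/g; // matches fenced code blocks
  let validated = report;

  for (const match of report.matchAll(fencedBlock)) {
    const original = match[1];
    const corrected = await llm(
      `Using ONLY this documentation as the source of truth:\n${context7Docs}\n\n` +
        `Fix hallucinated APIs, outdated syntax, or wrong method names in:\n${original}\n` +
        `Return only the corrected code.`
    );
    validated = validated.replace(original, corrected);
  }
  return validated;
}
```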
Phase 5: Multi-Model Validation
- Critical Challenge: LLM attacks synthesis to find gaps
- Consensus (depth ≥4): 3 LLMs validate findings
- Sufficiency Vote: Synthesis vs. critique
- Re-synthesis if significant gaps found
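A sketch of a sufficiency vote along the lines of Phase 5; the yes/no prompt and the simple majority rule are assumptions.

```typescript
// Sketch: several models vote on whether the synthesis answers the query;
// a majority of "insufficient" votes would trigger re-synthesis.
async function sufficiencyVote(
  query: string,
  synthesis: string,
  critique: string,
  askModel: (model: string, prompt: string) => Promise<string>
): Promise<{ sufficient: boolean; votes: string[] }> {
  const judges = ["gemini-2.5-flash", "gpt-5-mini", "claude-3.5-haiku"];
  const prompt =
    `Query: ${query}\n\nSynthesis:\n${synthesis}\n\nCritique:\n${critique}\n\n` +
    `Does the synthesis sufficiently answer the query despite the critique? ` +
    `Reply "sufficient" or "insufficient" with one sentence of feedback.`;

  const votes = await Promise.all(judges.map((m) => askModel(m, prompt)));
  const insufficientCount = votes.filter((v) => v.toLowerCase().includes("insufficient")).length;

  return { sufficient: insufficientCount <= judges.length / 2, votes };
}
```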
Why This Architecture?
Token Efficiency:
- Phased synthesis uses ~40% fewer tokens vs. monolithic approach
- Sub-questions don't see full main query data dump
- Key findings summary prevents redundant context
Code Accuracy:
- Context7 validation catches hallucinated code before delivery
- Inline citations trace every claim to source
- Docs fetched once and cached for validation pass
Research Quality:
- Independent sub-Q planning prevents bias from root plan
- Each sub-Q gets optimal tool selection
- Key findings injection ensures coherent, non-contradictory answers
- Synthesis LLMs run at temperature=0.2 for consistent, specific outputs
- Production engineer persona prompt for deployable solutions
- Explicit numeric specificity mandates (no "high", "fast", "good")
- Few-shot examples enforce production-ready code (no TODO/FIXME)
- Checklist-based validation audits actionability before delivery
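As a rough illustration of the checklist-based audit mentioned above, one possible shape for such a checklist; the individual checks are assumptions, not the server's exact criteria.

```typescript
// Sketch: an actionability checklist the validation step could enforce.
interface ActionabilityChecklist {
  hasNumericSpecifics: boolean;  // no vague "high"/"fast"/"good" claims
  hasSuccessCriteria: boolean;   // explicit, measurable success criteria
  codeIsComplete: boolean;       // no TODO/FIXME or placeholder code
  claimsAreCited: boolean;       // every claim carries an inline citation
  noContradictions: boolean;     // main and sub-answers agree
}

function passesAudit(checklist: ActionabilityChecklist): boolean {
  // Every check must pass before the report is delivered.
  return Object.values(checklist).every(Boolean);
}
```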
Research Flow Diagram
graph TD
A[Query + Context] --> B[Root Planning: 2-5 LLM Consensus]
B --> C[Main Query Strategy]
B --> D[Identify Shared Docs]
D --> E[Fetch Base Context7 Docs]
A --> F[Plan Each Sub-Q Independently]
C --> G[Execute Main Query]
E --> G
F --> H[Execute Sub-Qs in Parallel]
E --> H
G --> I[Phase 1: Main Synthesis]
I --> J[Extract Key Findings ~500 tokens]
J --> K[Phase 2: Sub-Q Syntheses Parallel]
H --> K
K --> L[Code Validation vs Context7]
L --> M[Critical Challenge]
M --> N{Depth ≥4?}
N -->|Yes| O[Multi-LLM Consensus]
N -->|No| P[Sufficiency Vote]
O --> P
P --> Q{Sufficient?}
Q -->|Yes| R[Report with Inline Citations]
Q -->|No| S[Re-synthesize]
S --> I
Inline Citations
Reports now include inline source citations for traceability:
- [perplexity:url] - Web search finding
- [context7:library-name] - Library documentation/code
- [arxiv:paper-id] - Academic paper
- [deep_analysis] - LLM reasoning
Example:
LangSmith provides dataset management [context7:langsmith] which supports
version control [perplexity:langsmith-docs] as validated in recent research
[arxiv:2024.12345].
Context-Efficient Report Structure
Reports use sectioned architecture for AI consumption:
- Executive Summary - Overview + section index with IDs and line ranges
- On-demand Section Reading - AI can load specific sections only
- Quick Reference - Citation examples (R-ID:section, R-ID:section:20-50)
Example usage:
read_report(citation="R-182602:q1") # Read sub-question 1
read_report(citation="R-182602:q1:20-50") # Lines 20-50 of sub-Q 1
read_report(citation="R-182602", full=true) # Full report (last resort)
This prevents context bloat - AI assistants load only what they need.
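A small sketch of parsing this citation format (report ID, optional section, optional line range); the parsed field names are assumptions.

```typescript
// Sketch: parse citations like "R-182602", "R-182602:q1", "R-182602:q1:20-50",
// or the line-only form "[R-135216:5-19]". Result field names are assumed.
interface ParsedCitation {
  reportId: string;
  section?: string;
  startLine?: number;
  endLine?: number;
}

function parseCitation(citation: string): ParsedCitation {
  const cleaned = citation.replace(/^\[|\]$/g, ""); // strip optional brackets
  const [reportId, ...rest] = cleaned.split(":");

  const parsed: ParsedCitation = { reportId };
  for (const part of rest) {
    const range = part.match(/^(\d+)-(\d+)$/);
    if (range) {
      parsed.startLine = Number(range[1]);
      parsed.endLine = Number(range[2]);
    } else {
      parsed.section = part; // e.g. "q1" or "overview"
    }
  }
  return parsed;
}
```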
Common Issues
Perplexity API Errors
- 401 Unauthorized: Check that PERPLEXITY_API_KEY is set correctly
- 429 Rate Limited: You've exceeded API quota. Check the Perplexity dashboard
- Connection timeout: Verify network connectivity
Context7 or arXiv Connection Issues
These are spawned as subprocesses. Check:
# Verify Context7 MCP is accessible (if installed separately)
# Verify arXiv MCP server is installed:
uv tool run arxiv-mcp-server --help
MCP Client Not Detecting Server
- Verify the path in your MCP config is correct (absolute path)
- Restart your MCP client (Claude Desktop, Cursor, etc.)
- Check client logs for connection errors:
  - Claude Desktop: ~/Library/Logs/Claude/
  - Cursor: Developer Tools → Console
Environment Variables Not Loading
If you see "Not connected" errors despite having API keys in your MCP config, try these solutions:
Option 1: Use built JavaScript file (Recommended)
{
  "mcpServers": {
    "research": {
      "command": "node",
      "args": ["/absolute/path/to/research-mcp/dist/index.js"],
      "env": {
        "PERPLEXITY_API_KEY": "your-key",
        "GEMINI_API_KEY": "your-key",
        "OPENAI_API_KEY": "your-key"
      }
    }
  }
}
Option 2: Fix path with spaces
If your path contains spaces (e.g., /Users/name/Desktop/Personal and learning/...):
{
  "command": "npx",
  "args": [
    "tsx",
    "/Users/name/Desktop/Personal and learning/quick-mcp/research/src/index.ts"
  ]
}
Note: Paths with spaces are properly handled in JSON arrays. The issue is usually using source files instead of built files.
Example Output Structure
# Research Results: [Your Query]
## Complexity Assessment
**Level**: 4/5
**Reasoning**: Complex research requiring academic papers and library documentation
## Research Action Plan
**Estimated Time**: ~45s
**Steps Executed**:
1. **perplexity**: Search for recent approaches and best practices _(parallel)_
2. **deep_analysis**: Analyze web findings for technical insights
3. **context7**: Fetch React and TypeScript documentation _(parallel, shared + specific)_
4. **arxiv**: Search academic papers on evaluation datasets
5. **consensus**: Validate findings across multiple models
## Synthesis with Inline Citations
### Overview
LangSmith provides comprehensive dataset management [context7:langsmith] which enables
evaluation workflow automation [perplexity:langsmith-docs]. Recent research shows that
synthetic data generation requires careful attention to distribution matching [arxiv:2024.12345].
```typescript
// Code validated against Context7
import { Dataset } from "langsmith";
const dataset = new Dataset("my-eval-set");
```

### Sub-Question 1: What makes evaluation data representative?
Representative data must match real-world distributions [deep_analysis] and include edge cases from production logs [context7:langsmith]. Studies indicate that 600+ examples provide sufficient statistical power for small effect detection [arxiv:2024.67890].
## Code Validation Summary
- ✅ 3 code blocks validated
- ✅ 1 syntax correction applied (outdated API method)
## Multi-Model Consensus
[3 LLMs validated findings—shows agreement/disagreement]
## Critical Challenge
[Critical validation—alternative perspectives and gaps identified]
## Quality Validation
**Vote Result**: 2 sufficient, 1 insufficient
**Status**: ✅ Response is sufficient
Model Feedback:
- ✅ gemini-2.5-flash: Response comprehensively addresses the query with actionable steps
- ✅ gpt-5-mini-2025-08-07: Good coverage of edge cases and validation methods
- ❌ claude-3.5-haiku: Could benefit from more code examples
Report ID: R-182602
Usage Examples:
read_report(citation="R-182602:overview") # Read overview section
read_report(citation="R-182602:q1") # Read sub-question 1
read_report(citation="R-182602:q1:20-50") # Lines 20-50 of sub-Q 1
read_report(citation="R-182602", full=true) # Full report (last resort)
## Acknowledgments
This MCP server is built on top of other MCP servers and tools:
- [Perplexity AI](https://www.perplexity.ai/) - Web search capabilities
- [arXiv](https://arxiv.org/) - Academic paper repository
- [Context7](https://context7.com/) - Library documentation search
- All contributors and users of this project