Quickly convert technical docs into an MCP docs server
AgentDesk MCP Documentation System
🚀 Modern toolkit for creating Model Context Protocol (MCP) documentation servers with intelligent content detection, advanced search optimization, and beautiful CLI tools.
This repository provides a complete system for building MCP documentation servers that can intelligently crawl, index, and search documentation websites with both keyword and semantic search capabilities.
📦 Packages
Core Packages
- @agentdesk/mcp-docs - Core documentation indexing and search functionality
- create-mcp-docs - CLI tool for generating MCP documentation servers
🚀 Quick Start
Create a New MCP Documentation Server
npx create-mcp-docs my-docs-server
This interactive CLI will:
- ✨ Guide you through project setup (name and description)
- 🌐 Collect documentation URLs to crawl
- ⚙️ Let you choose between FlexSearch (keyword) or Vectra (semantic) search
- 📁 Generate a complete MCP server project
- ✅ Provide ready-to-use TypeScript code
Generated Project Structure
packages/my-docs-server/
├── package.json # Dependencies and scripts
├── src/
│ ├── server.ts # MCP server implementation
│ └── build-index.ts # Documentation indexer
├── .env # Environment configuration
├── README.md # Usage instructions
└── ...
Start Your Server
cd packages/my-docs-server
pnpm install
pnpm build:index # Build documentation search index
pnpm start # Start MCP server
⚡ Search Provider Comparison
Choose the right search provider for your needs:
🔍 FlexSearch (Keyword Search)
Best for: Smaller documentation sets, fast setup, exact term matching
Pros:
- Lightning-fast search performance
- No API keys required
- Smaller index size
- Great for technical documentation with specific terms
Cons:
- Limited semantic understanding
- May miss conceptually related content
🧠 Vectra (Semantic Search)
Best for: Large documentation sets, conceptual queries, content discovery
Pros:
- Understands meaning and context
- Finds conceptually related content
- Better for natural language queries
- Advanced "Late Chunking" for context preservation
Cons:
- Requires OpenAI API key
- Larger index size
- Slightly slower initial indexing
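In practice, the choice comes down to the `provider` block your generated indexer passes to `createIndex`. The Vectra shape below mirrors the Advanced Configuration example later in this README; the FlexSearch shape is an assumption about its minimal form (no embeddings, no API key):

```typescript
// Keyword search: no API key and no embedding settings (shape assumed).
const flexsearchProvider = {
  type: "flexsearch" as const,
};

// Semantic search: OpenAI embeddings plus an optional chunking strategy
// (shape taken from the createIndex example under Advanced Configuration).
const vectraProvider = {
  type: "vectra" as const,
  embeddings: {
    provider: "openai" as const,
    model: "text-embedding-ada-002",
    apiKey: process.env.OPENAI_API_KEY,
  },
  chunking: {
    strategy: "late-chunking" as const,
    useCase: "documentation" as const,
  },
};
```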
🏗️ System Architecture
Complete System Overview
graph TB
subgraph "CLI Layer"
CLI["create-mcp-docs CLI"]
CLI --> Setup["Project Setup"]
CLI --> URLs["URL Collection"]
CLI --> Provider["Provider Selection"]
CLI --> Gen["Project Generation"]
end
subgraph "Generated MCP Server"
Server["MCP Server"]
Index["Index Builder"]
Config[".env Configuration"]
Server --> Tool["search_docs tool"]
end
subgraph "Core Package (@agentdesk/mcp-docs)"
CreateIndex["createIndex()"]
KB["KnowledgeBase"]
Heuristics["Content Detection"]
Pipeline["Document Pipeline"]
Optimizer["Search Optimizer"]
end
subgraph "Search Providers"
FlexSearch["FlexSearch<br/>(Keyword)"]
Vectra["Vectra<br/>(Semantic)"]
end
subgraph "Document Processing"
Crawler["Playwright Crawler"]
Parser["Content Parser"]
Chunker["Chunking Service"]
ReadabilityJS["Mozilla Readability"]
end
subgraph "AI Integration"
AI["AI Model"]
MCP["MCP Protocol"]
OpenAI["OpenAI Embeddings"]
end
%% CLI Flow
Gen --> Server
Gen --> Index
Gen --> Config
%% Core Integration
Index --> CreateIndex
Tool --> KB
CreateIndex --> Heuristics
CreateIndex --> Pipeline
%% Processing Pipeline
Pipeline --> Crawler
Pipeline --> Parser
Pipeline --> Chunker
Parser --> ReadabilityJS
%% Provider Selection
CreateIndex --> FlexSearch
CreateIndex --> Vectra
Vectra --> OpenAI
KB --> FlexSearch
KB --> Vectra
KB --> Optimizer
%% AI Integration
AI --> MCP
MCP --> Server
Tool --> AI
%% Styling
classDef cli fill:#e1f5fe
classDef core fill:#f3e5f5
classDef provider fill:#e8f5e8
classDef processing fill:#fff3e0
classDef ai fill:#fce4ec
class CLI,Setup,URLs,Provider,Gen cli
class CreateIndex,KB,Heuristics,Pipeline,Optimizer core
class FlexSearch,Vectra provider
class Crawler,Parser,Chunker,ReadabilityJS processing
class AI,MCP,OpenAI ai
User Workflow
sequenceDiagram
participant User
participant CLI as create-mcp-docs CLI
participant Generator as Project Generator
participant MCP as Generated MCP Server
participant Indexer as Documentation Indexer
participant Provider as Search Provider
participant AI as AI Model
User->>CLI: npx create-mcp-docs
CLI->>User: Collect project details & URLs
CLI->>Generator: Generate project files
Generator->>MCP: Create MCP server & indexer
User->>Indexer: pnpm build:index
Indexer->>Provider: Extract & index documents
Provider->>Indexer: Search index ready
User->>MCP: pnpm start
AI->>MCP: Search documentation
MCP->>Provider: Execute search query
Provider->>MCP: Optimized results
MCP->>AI: Contextual documentation
✨ Key Features
🧠 Intelligent Content Detection
- Automatically detects optimal CSS selectors using heuristics
- Integrates Mozilla Readability for content extraction
- Provides confidence scoring and fallback options
- Validates selectors against real page content
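As a rough illustration of what selector heuristics of this kind do (the candidate list and scoring below are hypothetical, not the package's actual algorithm), the idea is to score common content selectors by how much of the page's text they capture and keep the best match above a confidence threshold:

```typescript
// Hypothetical sketch of selector heuristics: score common content
// selectors by the share of page text they capture, then pick the best.
const CANDIDATES = ["article", "main", "article.prose", ".docs-content", "#content"];

interface SelectorGuess {
  selector: string;
  confidence: number; // 0..1, share of page text captured by the selector
}

function guessContentSelector(doc: Document): SelectorGuess | undefined {
  const totalText = doc.body?.textContent?.length ?? 0;
  if (totalText === 0) return undefined;

  const scored = CANDIDATES.map((selector) => {
    const captured = doc.querySelector(selector)?.textContent?.length ?? 0;
    return { selector, confidence: captured / totalText };
  }).sort((a, b) => b.confidence - a.confidence);

  // Fall back to <body> if no candidate captures a meaningful share of the text.
  return scored[0].confidence > 0.3 ? scored[0] : { selector: "body", confidence: 1 };
}
```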
🎨 Beautiful CLI Experience
Interactive React-based CLI with:
- Project Setup: Name and description input
- URL Collection: Add multiple documentation sources
- Provider Selection: Choose between FlexSearch and Vectra
- Live Generation: Real-time project creation feedback
- Success Guide: Clear next steps after creation
🚀 Document-Centric Search Optimization
Advanced search optimization that goes beyond simple keyword matching:
- Full Document Strategy: Returns entire documents when multiple chunks are highly relevant
- Expanded Chunk Strategy: Intelligently expands related content sections
- Token Budget Management: Optimizes results to fit within AI model context limits
- Coherence Preservation: Maintains document structure and context flow
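A rough sketch of how such a strategy decision could look (illustrative only; the threshold values mirror the example configuration under Advanced Features, and the types are invented for the sketch):

```typescript
// Illustrative sketch of the document-centric strategy choice, not the
// package's real optimizer. Chunks are grouped per source document first.
interface Chunk { documentUrl: string; text: string; tokens: number; score: number; }

function chooseStrategy(chunks: Chunk[], tokenBudget: number) {
  const fullDocumentThreshold = 3;   // 3+ relevant chunks -> return the whole document
  const expandedChunkMultiplier = 2; // otherwise expand single chunks with neighbors
  const targetUtilization = 0.9;     // aim to fill ~90% of the token budget

  const byDoc = new Map<string, Chunk[]>();
  for (const c of chunks) {
    byDoc.set(c.documentUrl, [...(byDoc.get(c.documentUrl) ?? []), c]);
  }

  const budget = tokenBudget * targetUtilization;
  for (const [url, docChunks] of byDoc) {
    if (docChunks.length >= fullDocumentThreshold) {
      return { strategy: "full-document" as const, url, budget };
    }
  }
  return { strategy: "expanded-chunks" as const, multiplier: expandedChunkMultiplier, budget };
}
```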
⚡ High Performance Indexing
- Intelligent Crawling: Playwright-powered browser automation
- Content Cleaning: Mozilla Readability integration for clean extraction
- Flexible Chunking: Traditional, semantic, and Late Chunking strategies
- Concurrent Processing: Configurable concurrency with rate limiting
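As a simplified picture of the crawl step only (the real pipeline also handles link discovery via the configured selectors, rate limiting, and retries), a concurrency-limited Playwright fetch looks roughly like this:

```typescript
import { chromium } from "playwright";

// Simplified sketch: fetch a batch of URLs with a fixed number of workers.
// The actual pipeline adds link discovery, rate limiting, and retries.
async function crawl(urls: string[], concurrency = 4): Promise<Map<string, string>> {
  const browser = await chromium.launch();
  const queue = [...urls];
  const html = new Map<string, string>();

  const worker = async () => {
    const page = await browser.newPage();
    for (let url = queue.shift(); url; url = queue.shift()) {
      await page.goto(url, { waitUntil: "networkidle" });
      html.set(url, await page.content());
    }
    await page.close();
  };

  await Promise.all(Array.from({ length: concurrency }, worker));
  await browser.close();
  return html;
}
```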
🔧 Production-Ready Servers
- Follows established MCP server patterns
- Built with TypeScript for full type safety
- Comprehensive error handling and logging
- Environment-based configuration
- Ready for deployment with zero additional setup
🎯 Use Cases
Documentation Teams
# Create a server for your product docs
npx create-mcp-docs product-docs
# URLs: https://docs.yourproduct.com
# Choose FlexSearch for fast, precise searches
Large Knowledge Bases
# Create a semantic search server for comprehensive docs
npx create-mcp-docs comprehensive-docs
# URLs: Multiple documentation sources
# Choose Vectra for conceptual understanding
API Documentation
# Create a server for API reference
npx create-mcp-docs api-docs
# URLs: https://api.yourservice.com/docs
# FlexSearch excels at exact API method/parameter searches
🔬 Advanced Features
Late Chunking Strategy
For Vectra users, our "Late Chunking" implementation preserves contextual information across chunk boundaries:
- Contextual Embeddings: Documents are embedded with their full context before being split into chunks
- Semantic Boundaries: Intelligent splitting that respects document structure
- Context Preservation: Related information stays connected across chunks
- Optimized for Documentation: Tuned specifically for technical documentation patterns
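Conceptually, Late Chunking embeds the whole document first and then derives each chunk's embedding from the token vectors that fall inside its boundaries, rather than embedding each chunk in isolation. The sketch below is conceptual only and is not the package's implementation; `embedTokens` is a hypothetical long-context embedding helper:

```typescript
// Hypothetical helper, declared only so the sketch type-checks:
// a long-context embedding model returning one vector per token.
declare function embedTokens(text: string): Promise<number[][]>;

interface ChunkBoundary { startToken: number; endToken: number; text: string; }

// Conceptual sketch of Late Chunking: the whole document is embedded once,
// then each chunk's embedding is mean-pooled from its own token vectors.
async function lateChunk(documentText: string, boundaries: ChunkBoundary[]) {
  const tokenVectors = await embedTokens(documentText); // full-document context

  return boundaries.map(({ startToken, endToken, text }) => {
    const span = tokenVectors.slice(startToken, endToken);
    const dim = span[0]?.length ?? 0;
    const embedding = Array.from({ length: dim }, (_, i) =>
      span.reduce((sum, v) => sum + v[i], 0) / span.length
    );
    return { text, embedding };
  });
}
```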
Learn more in the @agentdesk/mcp-docs documentation.
Document-Centric Optimization
Our search optimizer analyzes raw search results and intelligently decides the best strategy:
// Example optimization strategies
{
fullDocumentThreshold: 3, // 3+ chunks = return full document
expandedChunkMultiplier: 2, // Expand single chunks by 2x
targetUtilization: 0.9, // Use 90% of token budget
}
Detailed algorithm explanations are available in the core package documentation.
🔧 Advanced Configuration
Manual Index Creation
import { createIndex } from "@agentdesk/mcp-docs";
await createIndex({
pages: [
{
url: "https://docs.example.com",
mode: "crawl",
selectors: {
links: 'a[href^="/docs"]',
content: "article.prose",
},
},
],
// Choose your provider
provider: {
type: "vectra",
embeddings: {
provider: "openai",
model: "text-embedding-ada-002",
apiKey: process.env.OPENAI_API_KEY,
},
chunking: {
strategy: "late-chunking",
useCase: "documentation",
},
},
outputFile: "docs-vectra-index",
});
Knowledge Base Search
import { KnowledgeBase, getModuleDir } from "@agentdesk/mcp-docs";
const docs = new KnowledgeBase({
path: getModuleDir(import.meta.url), // Directory containing index
apiKey: process.env.OPENAI_API_KEY, // For Vectra indices
});
const results = await docs.search({
query: "How do I authenticate users?",
tokenLimit: 10000,
});
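Wiring this into an MCP server is what the generated src/server.ts does. As a rough sketch of that shape using the official @modelcontextprotocol/sdk (the generated template may differ, and the handling of the search result shape is an assumption):

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { KnowledgeBase, getModuleDir } from "@agentdesk/mcp-docs";
import { z } from "zod";

// Load the index built by `pnpm build:index` from this package's directory.
const docs = new KnowledgeBase({
  path: getModuleDir(import.meta.url),
  apiKey: process.env.OPENAI_API_KEY, // only needed for Vectra indices
});

const server = new McpServer({ name: "my-docs-server", version: "0.1.0" });

// Expose a single search_docs tool to MCP clients.
server.tool(
  "search_docs",
  "Search the indexed documentation",
  { query: z.string().describe("Natural-language or keyword query") },
  async ({ query }) => {
    const results = await docs.search({ query, tokenLimit: 10000 });
    // The result shape is assumed here; serialize whatever comes back.
    const text = typeof results === "string" ? results : JSON.stringify(results, null, 2);
    return { content: [{ type: "text" as const, text }] };
  }
);

await server.connect(new StdioServerTransport());
```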
📚 Documentation
Package Documentation
- @agentdesk/mcp-docs - Detailed API reference and algorithms
- create-mcp-docs - CLI tool implementation details
🛠️ Development
Setup
git clone https://github.com/agentdesk/create-mcp-docs
cd create-mcp-docs
pnpm install
pnpm build
Package Development
# Core package
cd packages/mcp-docs
pnpm dev
# CLI package
cd packages/create-mcp-docs
pnpm build
pnpm link --global
create-mcp-docs test-project
Testing
# Run all tests
pnpm test
# Package-specific tests
cd packages/mcp-docs && pnpm test
cd packages/create-mcp-docs && pnpm test
🏷️ Requirements
- Node.js >= 16.0.0
- pnpm >= 8.0.0 (recommended)
- OpenAI API Key (for Vectra semantic search only)
🤝 Contributing
We welcome contributions! Please see:
- Issues - Bug reports and feature requests
- Pull Requests - Code contributions
- Documentation - Improvements and examples
Development Guidelines
- Use TypeScript for all new code
- Follow existing code style and patterns
- Add comprehensive tests for new features
- Update documentation for API changes
📝 License
MIT - See LICENSE file for details.
🔗 Related Projects
- Model Context Protocol - The standard this implements
- AgentKit - AI agent development framework
- AgentDesk - AI agent platform
Built with ❤️ by the AgentDesk team