# Local Speech-to-Text MCP Server
A high-performance Model Context Protocol (MCP) server providing local speech-to-text transcription using whisper.cpp, optimized for Apple Silicon.
## Features
- **100% Local Processing**: No cloud APIs, complete privacy
- **Apple Silicon Optimized**: 15x+ real-time transcription speed
- **Speaker Diarization**: Identify and separate multiple speakers
- **Universal Audio Support**: Automatic conversion from MP3, M4A, FLAC, and more
- **Multiple Output Formats**: txt, json, vtt, srt, csv
- **Low Memory Footprint**: <2 GB memory usage
- **TypeScript**: Full type safety and modern development
## Quick Start
### Prerequisites
- Node.js 18+
- whisper.cpp (`brew install whisper-cpp`)
- For audio format conversion: ffmpeg (`brew install ffmpeg`), which automatically handles MP3, M4A, FLAC, OGG, etc.
- For speaker diarization: Python 3.8+ and a HuggingFace token (free)
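To confirm the tools above are on your `PATH`, a quick sanity check (an illustrative snippet, not part of the repo; note that the whisper.cpp binary name varies by version, `whisper-cli` in recent Homebrew builds):

```shell
# check() reports whether a command is available on PATH (illustrative helper)
check() { command -v "$1" >/dev/null 2>&1 && echo "ok: $1" || echo "missing: $1"; }

check node
check ffmpeg
check whisper-cli   # whisper.cpp's binary name varies by version; adjust if needed
```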
### Supported Audio Formats
- **Native whisper.cpp formats**: WAV, FLAC
- **Auto-converted formats**: MP3, M4A, AAC, OGG, WMA, and more
- **Automatic conversion**: powered by ffmpeg, with 16 kHz/mono output optimized for whisper.cpp
- **Format detection**: input formats are detected and converted automatically when needed
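The conversion the server performs can also be done by hand. A minimal sketch, assuming a hypothetical input file `interview.mp3`; the `-ar 16000 -ac 1` flags match the 16 kHz/mono note above, though the server's exact ffmpeg invocation may differ:

```shell
in="interview.mp3"   # hypothetical input file
out="${in%.*}.wav"   # same basename, .wav extension

if command -v ffmpeg >/dev/null 2>&1 && [ -f "$in" ]; then
  # -ar 16000: 16 kHz sample rate; -ac 1: mono; pcm_s16le: 16-bit WAV
  ffmpeg -i "$in" -ar 16000 -ac 1 -c:a pcm_s16le "$out"
fi
echo "$out"
```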
### Installation
```shell
git clone https://github.com/your-username/local-stt-mcp.git
cd local-stt-mcp/mcp-server
npm install
npm run build

# Download whisper models
npm run setup:models

# For speaker diarization, set your HuggingFace token
export HF_TOKEN="your_token_here"  # Get a free token from huggingface.co
```
> **Speaker diarization note**: requires a HuggingFace account and acceptance of the `pyannote/speaker-diarization-3.1` model license.
### MCP Client Configuration
Add to your MCP client configuration:
```json
{
  "mcpServers": {
    "whisper-mcp": {
      "command": "node",
      "args": ["path/to/local-stt-mcp/mcp-server/dist/index.js"]
    }
  }
}
```
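If your MCP client supports per-server environment variables (Claude Desktop's configuration does, via an `env` field), the HuggingFace token for diarization can be passed there instead of exporting it globally. A sketch, with the token value as a placeholder:

```json
{
  "mcpServers": {
    "whisper-mcp": {
      "command": "node",
      "args": ["path/to/local-stt-mcp/mcp-server/dist/index.js"],
      "env": { "HF_TOKEN": "your_token_here" }
    }
  }
}
```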
## Available Tools

| Tool | Description |
|------|-------------|
| `transcribe` | Basic audio transcription with automatic format conversion |
| `transcribe_long` | Long audio file processing with chunking and format conversion |
| `transcribe_with_speakers` | Speaker diarization and transcription with format support |
| `list_models` | Show available whisper models |
| `health_check` | System diagnostics |
| `version` | Server version information |
## Performance
**Apple Silicon benchmarks:**

- Processing speed: 15.8x real-time (vs WhisperX 5.5x)
- Memory usage: <2 GB (vs WhisperX ~4 GB)
- GPU acceleration: ✅ Apple Neural Engine
- Setup: medium complexity, but superior performance
See `/benchmarks/` for detailed performance comparisons.
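As a worked example of what "15.8x real-time" means in practice, a 60-minute recording should finish in roughly 60 / 15.8 ≈ 3.8 minutes of wall-clock time:

```shell
# Estimated transcription time for a 60-minute recording at 15.8x real-time
audio_min=60
est=$(awk -v a="$audio_min" 'BEGIN { printf "%.1f", a / 15.8 }')
echo "$est minutes"
```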
## Project Structure
```
mcp-server/
├── src/          # TypeScript source code
│   ├── tools/    # MCP tool implementations
│   ├── whisper/  # whisper.cpp integration
│   ├── utils/    # Speaker diarization & utilities
│   └── types/    # Type definitions
├── dist/         # Compiled JavaScript
└── python/       # Python dependencies
```
## Development
```shell
# Build
npm run build

# Development mode (watch)
npm run dev

# Linting & formatting
npm run lint
npm run format

# Type checking
npm run type-check
```
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request
## License
MIT License - see LICENSE file for details.
## Acknowledgments
- whisper.cpp for optimized inference
- OpenAI Whisper for the original models
- Model Context Protocol for the framework
- Pyannote.audio for speaker diarization