# Local Speech-to-Text MCP Server
A high-performance Model Context Protocol (MCP) server providing local speech-to-text transcription using whisper.cpp, optimized for Apple Silicon.
## Features
- **100% Local Processing**: No cloud APIs, complete privacy
- **Apple Silicon Optimized**: 15x+ real-time transcription speed
- **Speaker Diarization**: Identify and separate multiple speakers
- **Universal Audio Support**: Automatic conversion from MP3, M4A, FLAC, and more
- **Multiple Output Formats**: txt, json, vtt, srt, csv
- **Low Memory Footprint**: <2 GB memory usage
- **TypeScript**: Full type safety and modern development
## Quick Start
### Prerequisites
- Node.js 18+
- whisper.cpp (`brew install whisper-cpp`)
- For audio format conversion: ffmpeg (`brew install ffmpeg`), which automatically handles MP3, M4A, FLAC, OGG, etc.
- For speaker diarization: Python 3.8+ and a HuggingFace token (free)
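To confirm the tools above are on your `PATH`, a quick sanity check (an illustrative snippet, not part of the repo; note that the whisper.cpp binary name varies by version, `whisper-cli` in recent Homebrew builds):

```shell
# check() reports whether a command is available on PATH (illustrative helper)
check() { command -v "$1" >/dev/null 2>&1 && echo "ok: $1" || echo "missing: $1"; }

check node
check ffmpeg
check whisper-cli   # whisper.cpp's binary name varies by version; adjust if needed
```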
### Supported Audio Formats
- **Native whisper.cpp formats**: WAV, FLAC
- **Auto-converted formats**: MP3, M4A, AAC, OGG, WMA, and more
- **Automatic conversion**: powered by ffmpeg, with 16 kHz/mono output optimized for whisper.cpp
- **Format detection**: input formats are detected and converted automatically when needed
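The conversion the server performs can also be done by hand. A minimal sketch, assuming a hypothetical input file `interview.mp3`; the `-ar 16000 -ac 1` flags match the 16 kHz/mono note above, though the server's exact ffmpeg invocation may differ:

```shell
in="interview.mp3"   # hypothetical input file
out="${in%.*}.wav"   # same basename, .wav extension

if command -v ffmpeg >/dev/null 2>&1 && [ -f "$in" ]; then
  # -ar 16000: 16 kHz sample rate; -ac 1: mono; pcm_s16le: 16-bit WAV
  ffmpeg -i "$in" -ar 16000 -ac 1 -c:a pcm_s16le "$out"
fi
echo "$out"
```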
### Installation
```shell
git clone https://github.com/your-username/local-stt-mcp.git
cd local-stt-mcp/mcp-server
npm install
npm run build

# Download whisper models
npm run setup:models

# For speaker diarization, set your HuggingFace token
export HF_TOKEN="your_token_here"  # Get a free token from huggingface.co
```
> **Speaker diarization note**: requires a HuggingFace account and acceptance of the `pyannote/speaker-diarization-3.1` model license.
### MCP Client Configuration
Add to your MCP client configuration:
```json
{
  "mcpServers": {
    "whisper-mcp": {
      "command": "node",
      "args": ["path/to/local-stt-mcp/mcp-server/dist/index.js"]
    }
  }
}
```
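If your MCP client supports per-server environment variables (Claude Desktop's configuration does, via an `env` field), the HuggingFace token for diarization can be passed there instead of exporting it globally. A sketch, with the token value as a placeholder:

```json
{
  "mcpServers": {
    "whisper-mcp": {
      "command": "node",
      "args": ["path/to/local-stt-mcp/mcp-server/dist/index.js"],
      "env": { "HF_TOKEN": "your_token_here" }
    }
  }
}
```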
## Available Tools

| Tool | Description |
|------|-------------|
| `transcribe` | Basic audio transcription with automatic format conversion |
| `transcribe_long` | Long audio file processing with chunking and format conversion |
| `transcribe_with_speakers` | Speaker diarization and transcription with format support |
| `list_models` | Show available whisper models |
| `health_check` | System diagnostics |
| `version` | Server version information |
## Performance
**Apple Silicon benchmarks:**

- Processing speed: 15.8x real-time (vs WhisperX 5.5x)
- Memory usage: <2 GB (vs WhisperX ~4 GB)
- GPU acceleration: ✅ Apple Neural Engine
- Setup: medium complexity, but superior performance
See `/benchmarks/` for detailed performance comparisons.
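As a worked example of what "15.8x real-time" means in practice, a 60-minute recording should finish in roughly 60 / 15.8 ≈ 3.8 minutes of wall-clock time:

```shell
# Estimated transcription time for a 60-minute recording at 15.8x real-time
audio_min=60
est=$(awk -v a="$audio_min" 'BEGIN { printf "%.1f", a / 15.8 }')
echo "$est minutes"
```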
## Project Structure
```
mcp-server/
├── src/          # TypeScript source code
│   ├── tools/    # MCP tool implementations
│   ├── whisper/  # whisper.cpp integration
│   ├── utils/    # Speaker diarization & utilities
│   └── types/    # Type definitions
├── dist/         # Compiled JavaScript
└── python/       # Python dependencies
```
## Development
```shell
# Build
npm run build

# Development mode (watch)
npm run dev

# Linting & formatting
npm run lint
npm run format

# Type checking
npm run type-check
```
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request
## License
MIT License - see LICENSE file for details.
## Acknowledgments
- whisper.cpp for optimized inference
- OpenAI Whisper for the original models
- Model Context Protocol for the framework
- Pyannote.audio for speaker diarization