# Local Speech-to-Text MCP Server

A high-performance Model Context Protocol (MCP) server providing local speech-to-text transcription using whisper.cpp, optimized for Apple Silicon.
## 🎯 Features
- 🏠 **100% Local Processing**: No cloud APIs, complete privacy
- 🚀 **Apple Silicon Optimized**: 15x+ real-time transcription speed
- 🎤 **Speaker Diarization**: Identify and separate multiple speakers
- 🎵 **Universal Audio Support**: Automatic conversion from MP3, M4A, FLAC, and more
- 📝 **Multiple Output Formats**: txt, json, vtt, srt, csv
- 💾 **Low Memory Footprint**: <2GB memory usage
- 🔧 **TypeScript**: Full type safety and modern development
## 🚀 Quick Start

### Prerequisites
- Node.js 18+
- whisper.cpp (`brew install whisper-cpp`)
- For audio format conversion: ffmpeg (`brew install ffmpeg`), which automatically handles MP3, M4A, FLAC, OGG, etc.
- For speaker diarization: Python 3.8+ and a HuggingFace token (free)
### Supported Audio Formats

- **Native whisper.cpp formats**: WAV, FLAC
- **Auto-converted formats**: MP3, M4A, AAC, OGG, WMA, and more
- **Automatic conversion**: Powered by ffmpeg with 16kHz/mono optimization for whisper.cpp
- **Format detection**: Automatic format detection and conversion when needed
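If you prefer to pre-convert files yourself, the automatic conversion is roughly equivalent to running ffmpeg manually. The command below is an illustrative sketch (file names are placeholders, and the server's exact flags may differ):

```bash
# Resample to 16 kHz (-ar 16000), downmix to mono (-ac 1), and encode as
# 16-bit PCM WAV, which whisper.cpp accepts natively.
# input.mp3 and output.wav are placeholder file names.
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```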
### Installation

```bash
git clone https://github.com/your-username/local-stt-mcp.git
cd local-stt-mcp/mcp-server
npm install
npm run build

# Download whisper models
npm run setup:models

# For speaker diarization, set your HuggingFace token
export HF_TOKEN="your_token_here"  # Get a free token from huggingface.co
```
**Speaker Diarization Note**: Requires a HuggingFace account and acceptance of the pyannote/speaker-diarization-3.1 license.
### MCP Client Configuration

Add to your MCP client configuration:

```json
{
  "mcpServers": {
    "whisper-mcp": {
      "command": "node",
      "args": ["path/to/local-stt-mcp/mcp-server/dist/index.js"]
    }
  }
}
```
## 🛠️ Available Tools

| Tool | Description |
|------|-------------|
| `transcribe` | Basic audio transcription with automatic format conversion |
| `transcribe_long` | Long audio file processing with chunking and format conversion |
| `transcribe_with_speakers` | Speaker diarization and transcription with format support |
| `list_models` | Show available whisper models |
| `health_check` | System diagnostics |
| `version` | Server version information |
## 📊 Performance

**Apple Silicon benchmarks:**

- **Processing speed**: 15.8x real-time (vs. 5.5x for WhisperX)
- **Memory usage**: <2GB (vs. ~4GB for WhisperX)
- **GPU acceleration**: ✅ Apple Neural Engine
- **Setup**: Medium complexity, but superior performance
See `/benchmarks/` for detailed performance comparisons.
## 🏗️ Project Structure

```
mcp-server/
├── src/          # TypeScript source code
│   ├── tools/    # MCP tool implementations
│   ├── whisper/  # whisper.cpp integration
│   ├── utils/    # Speaker diarization & utilities
│   └── types/    # Type definitions
├── dist/         # Compiled JavaScript
└── python/       # Python dependencies
```
## 🔧 Development

```bash
# Build
npm run build

# Development mode (watch)
npm run dev

# Linting & formatting
npm run lint
npm run format

# Type checking
npm run type-check
```
## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request
## 📄 License
MIT License - see LICENSE file for details.
## 🙏 Acknowledgments
- whisper.cpp for optimized inference
- OpenAI Whisper for the original models
- Model Context Protocol for the framework
- Pyannote.audio for speaker diarization