Created 1/27/2026
Qwen3-TTS MCP Server
A Model Context Protocol (MCP) server that exposes Qwen3-TTS voice synthesis capabilities with Voice Design.
Features
- Advanced Voice Synthesis: Generate realistic audio using the Qwen3-TTS 1.7B model
- Voice Design: Customize voice with natural language descriptions
- Multilingual: Supports 9 languages (Chinese, English, Japanese, Korean, French, German, Spanish, Portuguese, Russian) plus automatic language detection (Auto)
- MCP Integration: Access via Model Context Protocol for integration with LLMs
Installation
pip install -e .
Usage
As MCP Server
Start the MCP server:
python main.py
The server communicates over stdin/stdout (the MCP stdio transport) and exposes the following tools:
Available Tools
- generate_tts: Generate audio from text

  Parameters:
  - text: Text to convert to speech (a single string, or a list for batch processing)
  - language: Language (Auto, Chinese, English, etc.)
  - voice_description: Description of the desired voice characteristics
  - max_tokens: Maximum generation tokens (default: 2048)

  Example (single):
  { "text": "Hello, how are you?", "language": "English", "voice_description": "deep male voice, friendly tone" }

  Example (batch):
  { "text": ["Hello, how are you?", "こんにちは、元気ですか?"], "language": ["English", "Japanese"], "voice_description": ["deep male voice, friendly tone", "cheerful female voice"] }
- generate_tts_voice_clone: Generate audio by cloning a voice from reference audio

  Parameters:
  - target_text: Text to convert to speech with the cloned voice
  - language: Language (Auto, Chinese, English, etc.)
  - ref_audio_base64: Reference audio encoded in base64 (WAV, 3-10 seconds recommended)
  - ref_text: Transcript of the reference audio (required if use_xvector_only is False)
  - use_xvector_only: If True, use only the speaker embedding (faster but less accurate)
  - model_size: Model size ("0.6B" or "1.7B", default: "1.7B")
  - max_tokens: Maximum generation tokens (default: 2048)

  Example:
  { "target_text": "Hello, how are you today?", "language": "English", "ref_audio_base64": "UklGRi4...", "ref_text": "Okay. Yeah. I resent you. I love you.", "use_xvector_only": false, "model_size": "1.7B" }
- list_languages: List all supported languages
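The ref_audio_base64 parameter of generate_tts_voice_clone expects the raw bytes of the WAV file encoded as base64. A minimal sketch of building the tool arguments from a local reference clip (the helper name and file path are illustrative, not part of this project):

```python
# Sketch: assemble generate_tts_voice_clone arguments from a local WAV file.
# The argument keys match the tool description above; wav_path is a
# placeholder for your own 3-10 second reference recording.
import base64

def voice_clone_args(wav_path: str, target_text: str, ref_text: str) -> dict:
    # Read the raw WAV bytes and encode them as base64 text for JSON transport.
    with open(wav_path, "rb") as f:
        ref_audio_base64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "target_text": target_text,
        "language": "English",
        "ref_audio_base64": ref_audio_base64,
        "ref_text": ref_text,            # transcript of the reference clip
        "use_xvector_only": False,       # ref_text is required in this mode
        "model_size": "1.7B",
    }
```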
MCP Architecture
The server implements the MCP specification with:
- Tools: Available tools for clients to call
- Resources: (Future) Access to files and data
- Prompts: (Future) Reusable prompts for TTS
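In practice, a client drives the tools above by writing JSON-RPC 2.0 messages to the server's stdin, one JSON object per line (the MCP stdio transport). A sketch of the tools/call request for generate_tts; in a real session this is preceded by the MCP initialize handshake:

```python
# Sketch: the JSON-RPC 2.0 request an MCP client sends over the server's
# stdin to invoke the generate_tts tool. The argument values are the same
# illustrative ones used in the example above.
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "generate_tts",
        "arguments": {
            "text": "Hello, how are you?",
            "language": "English",
            "voice_description": "deep male voice, friendly tone",
        },
    },
}

# The stdio transport carries newline-delimited JSON messages.
line = json.dumps(request)
```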
Requirements
- Python >= 3.12
- CUDA (recommended for better performance)
- GPU with at least 4 GB of VRAM
- Optional: flash-attn for improved performance with flash attention
Project Structure
.
├── main.py # MCP server implementation
├── pyproject.toml # Project configuration
├── requirements.txt # Python dependencies
└── README.md # This file
Performance Optimizations
This implementation uses optimizations from the official Qwen3-TTS repository:
- Flash Attention 2 for faster inference and lower memory usage
- bfloat16 precision for efficient computation
- Batch inference support for processing multiple requests
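These first two options correspond to the standard Hugging Face transformers from_pretrained keyword arguments torch_dtype and attn_implementation. A small sketch of selecting loading options, assuming that convention and falling back to PyTorch's built-in SDPA when flash-attn is not installed (the helper name is illustrative):

```python
# Sketch: pick from_pretrained keyword arguments based on what is installed.
# Real code should also verify CUDA is available before requesting
# Flash Attention 2, which is CUDA-only.
from importlib.util import find_spec

def model_load_kwargs() -> dict:
    """Prefer Flash Attention 2 + bfloat16; fall back to SDPA."""
    kwargs = {"torch_dtype": "bfloat16"}  # transformers accepts the string form
    if find_spec("flash_attn") is not None:
        kwargs["attn_implementation"] = "flash_attention_2"
    else:
        kwargs["attn_implementation"] = "sdpa"  # PyTorch scaled-dot-product attention
    return kwargs
```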
Next Steps
- [ ] Add support for MCP resources
- [ ] Implement generated audio caching
- [ ] Add structured logging
- [ ] Create example client
- [ ] Deployment documentation
Quick Setup
Run the package with uvx (it is fetched and installed on first use):
uvx mcp-qwen3-tts
Cursor configuration (mcp.json)
{
"mcpServers": {
"gabrielalmir-mcp-qwen3-tts": {
"command": "uvx",
"args": [
"mcp-qwen3-tts"
]
}
}
}