MCP-based orchestration system with tri-modal hybrid search, a RAG pipeline, and MySQL, ChromaDB, and Neo4j graph database integration.
MCP Orchestration System
AI Agent System with Multi-Modal RAG and MCP Servers
Company: Sepia Innovations Japan
Developer: Shreyash Shankarrao Jadhav
1. What this project is
This repository contains a complete AI knowledge assistant built with a Model Context Protocol (MCP) architecture.
It lets you:
- Chat with your own data (notes, PDFs, CSVs, Google Drive files).
- Combine Database search + PDF RAG + Web search + Google Drive in a single agent.
- Use a Smart Router that automatically decides which MCP server to call for each question.
- Build a Neo4j knowledge graph of entities and relationships from your data.
It is designed to be:
- Beginner friendly to run locally.
- Detailed enough for reviewers to understand the full architecture and pipelines.
If you are new, start with:
- Section 2 – Quick start
- Section 3 – Project map (what lives where)
If you are reviewing the implementation, focus on:
- Section 4 – MCP servers and clients
- Section 5 – Data & search pipelines
- Section 6 – Knowledge graph & Neo4j
2. Quick start (beginner friendly)
2.1. Prerequisites
- Python: 3.9+
- MySQL: 8.x (or compatible)
- Neo4j (optional but recommended for graph features)
- Node / npm: not required
- Browsers: Chrome/Edge for UI and Neo4j browser
Cloud/API keys:
- OpenAI API key (for GPT-4o-mini & entity extraction)
- Google API key + CSE ID (for web search MCP)
- Google OAuth credentials (for Google Drive MCP)
2.2. Clone the repo
git clone <your-github-repo-url>
cd "MCP Orch System"
On Windows, paths in this README assume something like:
s:\Shreyash\Sepia ML intern\MCP Orch System
2.3. Create virtual environments and install deps
You can use one global venv for everything or separate ones per service. For simplicity, one venv:
cd "s:\Shreyash\Sepia ML intern\MCP Orch System"
python -m venv .venv
.venv\Scripts\activate
pip install -r mcp_database/requirements.txt
pip install -r mcp_rag_pdf/requirements.txt
pip install -r mcp_google_drive/requirements.txt
pip install -r mcp_google_search/requirements.txt
pip install -r ai_agent_system/requirements.txt
python -m spacy download en_core_web_sm
2.4. Environment variables
Create these .env files (values are examples; change as needed).
mcp_database/.env
MYSQL_HOST=localhost
MYSQL_USER=root
MYSQL_PASSWORD=your_mysql_password
MYSQL_DATABASE=mcp_database
CHROMA_DIR=./chroma_db
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_neo4j_password
OPENAI_API_KEY=your_openai_key
SERVER_PORT=8003
mcp_rag_pdf/.env
MYSQL_HOST=localhost
MYSQL_USER=root
MYSQL_PASSWORD=your_mysql_password
MYSQL_DATABASE=mcp_rag_pdf
CHROMA_DIR=./chroma_db_pdf
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_neo4j_password
OPENAI_API_KEY=your_openai_key
UPLOAD_DIR=./uploads
SERVER_PORT=8004
mcp_google_search/.env
GOOGLE_API_KEY=your_google_api_key
GOOGLE_CSE_ID=your_google_custom_search_engine_id
SERVER_PORT=8001
mcp_google_drive/.env
SERVER_PORT=8002
GOOGLE_DRIVE_TOKEN_PATH=token.json
GOOGLE_DRIVE_CREDENTIALS_PATH=credentials.json
Place your Google OAuth credentials.json in mcp_google_drive/ (this file is typically gitignored).
ai_agent_system/.env
OPENAI_API_KEY=your_openai_key
USE_SMART_ROUTER=true
SMART_ROUTER_OPENAI=true
2.5. Initialize databases (basic)
In MySQL:
CREATE DATABASE mcp_database;
CREATE DATABASE mcp_rag_pdf;
Some helper scripts exist (see mcp_database/reset_database.py) but the system will create many tables on first use.
2.6. Start all services
Use multiple terminals (or PowerShell windows).
Terminal 1 – Database MCP
cd "s:\Shreyash\Sepia ML intern\MCP Orch System\mcp_database"
python server.py
Terminal 2 – RAG PDF MCP
cd "s:\Shreyash\Sepia ML intern\MCP Orch System\mcp_rag_pdf"
python server.py
Terminal 3 – Google Drive MCP
cd "s:\Shreyash\Sepia ML intern\MCP Orch System\mcp_google_drive"
python server.py
Terminal 4 – Google Search MCP
cd "s:\Shreyash\Sepia ML intern\MCP Orch System\mcp_google_search"
python server.py
Terminal 5 – Main AI Agent app (Streamlit UI)
cd "s:\Shreyash\Sepia ML intern\MCP Orch System\ai_agent_system"
streamlit run app.py
Then open: http://localhost:8501
2.7. First test (simple)
Once the app is running:
- Go to Create Note tab.
- Create a small note (title + content).
- Go to Chat tab.
- Ask: “Find my note about <your topic>”.
If you see a relevant answer and your note in results, your Database MCP + agent are working.
For deeper testing, see TESTING_GUIDE.md.
3. Project map (for orientation)
High-level structure:
MCP Orch System/
├── ai_agent_system/ # Streamlit UI + central agent
├── mcp_database/ # Database MCP server (MySQL, ChromaDB, Neo4j, CSV)
├── mcp_rag_pdf/ # RAG PDF MCP server
├── mcp_google_drive/ # Google Drive MCP server
├── mcp_google_search/ # Google Search MCP server
├── mcp_clients/ # Python clients for all MCP servers
├── TESTING_GUIDE.md # Detailed end-to-end testing steps
├── SMART_ROUTER_GUIDE.md # Smart routing design and usage
└── (other helper scripts/docs)
3.1. ai_agent_system/ – main app
- app.py – Streamlit app, tabs, charts, and UI around the agent.
- agent.py – Core agent orchestration; integrates all MCP clients, search strategies, and the smart router.
- router.py – High-level query routing logic and statistics.
- config_manager.py – Configuration utilities.
3.2. mcp_database/ – Database MCP
server.py – FastAPI server that exposes:
- Document CRUD
- Hybrid search (BM25 + vector + graph + rerank)
- CSV upload and natural-language query (NL → SQL)
- Neo4j graph sync
- MCP endpoints: /tools, /tools/call, /resources, /resources/read, /prompts, /prompts/get
Supporting modules:
- csv_data_manager.py – Uploads DataFrames/CSVs into MySQL tables.
- csv_query_engine.py – Natural-language queries over CSV-backed tables.
- csv_graph_mapper.py – Maps CSV tables into Neo4j nodes and relationships.
- graph_handler.py – Core Neo4j operations with label sanitization.
- entity_extractor.py – NER + LLM-based entity and relationship extraction.
- bm25_handler.py, chroma_handler.py, reranker.py – Search components.
- reset_database.py, reset_chromadb.py, test_database.py, test_neo4j.py, how_to_check.md – Utilities and docs.
3.3. mcp_rag_pdf/ – RAG PDF MCP
- server.py – FastAPI server for PDF upload, indexing, RAG search, and MCP endpoints.
- models.py – Pydantic models for requests/responses.
- Other files (e.g. PDF processing, chunking, hybrid RAG) implement the full PDF pipeline.
- HYBRID_RAG_INTEGRATION.md – Documentation of the PDF RAG integration.
- test_rag.py – RAG PDF testing script.
3.4. Google MCP servers
- mcp_google_drive/server.py – Google Drive MCP:
  - Basic REST endpoints for file operations.
  - MCP endpoints with Tools, Resources, and Prompts for Drive files.
- mcp_google_search/server.py – Google Search MCP:
  - Wraps the Google Custom Search API.
  - MCP-compatible search tool plus resources/prompts.
3.5. mcp_clients/
Python client wrappers used by the agent:
- database_mcp_client.py
- rag_pdf_mcp_client.py
- drive_mcp_client.py
- search_mcp_client.py
Each client mirrors its server’s REST/MCP endpoints and hides HTTP details from the main agent.
4. MCP servers and protocol support
All four MCP servers follow the same pattern:
- Traditional REST endpoints (for direct HTTP use and debugging).
- MCP endpoints to be used by clients/agents:
- GET /tools
- POST /tools/call
- GET /resources
- GET /resources/read
- GET /prompts
- GET /prompts/get
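To make the shared pattern concrete, here is a minimal, framework-agnostic sketch of the tool-registry idea behind /tools and /tools/call. The tool name, payload shape, and handler are illustrative assumptions, not the servers' actual API:

```python
# Minimal sketch of the registry pattern behind /tools and /tools/call.
# Tool names and payload shapes here are illustrative, not the real API.
from typing import Any, Callable, Dict

TOOLS: Dict[str, Dict[str, Any]] = {}

def tool(name: str, description: str) -> Callable:
    """Register a function as an MCP tool."""
    def decorator(fn: Callable) -> Callable:
        TOOLS[name] = {"description": description, "handler": fn}
        return fn
    return decorator

@tool("search_documents", "Keyword search over stored documents")
def search_documents(query: str, limit: int = 5) -> dict:
    # A real server would query BM25/Chroma/Neo4j here.
    return {"query": query, "results": [], "limit": limit}

def list_tools() -> list:
    """Roughly what GET /tools would return."""
    return [{"name": n, "description": t["description"]} for n, t in TOOLS.items()]

def call_tool(name: str, arguments: dict) -> dict:
    """Roughly what POST /tools/call would do with a ToolCallRequest body."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return {"result": TOOLS[name]["handler"](**arguments)}
```

In the actual servers, FastAPI routes wrap these two functions and validate the request bodies with Pydantic models.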
4.1. Database MCP (port 8003)
Main responsibilities:
- Document CRUD (/documents)
- Document search (/documents/search)
- CSV upload and NL query:
  - /csv/upload – multipart CSV upload → MySQL table + graph sync
  - /csv/query – accepts a CSVQueryRequest (query, table_name, limit)
- Hybrid search pipeline:
  - BM25 keyword search
  - Chroma vector search
  - Neo4j graph scores
  - Cross-encoder reranking
- Graph integration:
  - Entity extraction from text
  - Node + relationship creation in Neo4j
  - CSV-to-graph mapping via csv_graph_mapper.sync_table_to_graph
MCP tools include (names may be slightly simplified here):
- create_document, update_document, delete_document
- search_documents
- upload_csv, query_csv
- execute_sql (safe wrapper)
The Database MCP client (database_mcp_client.py) exposes:
- upload_csv(file_path, table_name)
- query_csv(query, table_name, limit)
- list_tables(), get_table_info(table_name), delete_table(table_name)
- execute_sql(...)
4.2. RAG PDF MCP (port 8004)
Responsibilities:
- Upload and store PDFs
- Extract and chunk PDF text
- Index chunks in Chroma + BM25
- Provide RAG-style answers with:
- Relevant chunks
- Scores
- Optional summarization
- (Optionally) integrate with Neo4j for entity/relationship graphing.
MCP side:
- /tools advertises PDF-level tools (upload, search, summarize, etc.).
- /tools/call executes those tools given a ToolCallRequest.
- /resources and /resources/read allow fetching PDF-related resources.
- /prompts and /prompts/get provide RAG and summarization prompt templates.
The RAG PDF client (rag_pdf_mcp_client.py) wraps these endpoints and returns Python dicts, including a fixed get_pdf() that unwraps the response correctly.
4.3. Google Drive MCP (port 8002)
Responsibilities:
- List, upload, download, and delete files in Google Drive.
- Abstract OAuth token flow (first run).
MCP tools (high level):
- list_files
- get_file_info
- delete_file
- (Optionally) download_file / upload_file, modeled as tools and resources.
Resources map to Drive files (URIs typically encode file IDs). Prompts may hold standard “search my Drive” patterns.
4.4. Google Search MCP (port 8001)
Responsibilities:
- Wrap the Google Custom Search JSON API.
- Provide web search results as an MCP resource.
MCP tools:
- search (the main tool – returns web search results)
Resources and prompts:
- Each search can be wrapped as a resource URI.
- Prompts offer standard ways to phrase web queries or combine web and local data.
The search_mcp_client.py mirrors these tools for consumption by the main agent.
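As an illustration of what such a client wrapper can look like, here is a sketch in the spirit of search_mcp_client.py. The class name, method names, and payload fields are assumptions for illustration only:

```python
# Hypothetical sketch of a client wrapping the search MCP's /tools/call
# endpoint; class, method, and payload names are assumptions.
import json
import urllib.request

class SearchMCPClient:
    def __init__(self, base_url: str = "http://localhost:8001"):
        self.base_url = base_url.rstrip("/")

    def _tool_call_payload(self, query: str, num_results: int = 5) -> dict:
        # Shape of a hypothetical ToolCallRequest for the 'search' tool.
        return {"name": "search",
                "arguments": {"query": query, "num_results": num_results}}

    def search(self, query: str, num_results: int = 5) -> dict:
        """POST the tool call and return the parsed JSON response."""
        req = urllib.request.Request(
            f"{self.base_url}/tools/call",
            data=json.dumps(self._tool_call_payload(query, num_results)).encode(),
            headers={"Content-Type": "application/json"},
        )
        # Requires the Google Search MCP server to be running on base_url.
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
```

The real clients use the same idea: each public method builds a request for one server endpoint so the agent never touches HTTP directly.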
5. Data and search pipelines (end‑to‑end)
This section is for reviewers who want to understand “what happens when the user asks a question?”
5.1. User journey: high-level
1. The user types a query in the Streamlit UI (ai_agent_system/app.py).
2. The request goes to the agent (agent.py).
3. The Smart Router analyzes the query:
   - Keyword patterns (cheap, local)
   - Confidence scores
   - Optional OpenAI classification (for ambiguous queries)
4. The router decides which MCP(s) to call:
   - Database MCP (notes, documents, CSV)
   - RAG PDF MCP (PDF content)
   - Google Drive MCP (Drive files)
   - Google Search MCP (web search)
5. The agent calls the corresponding MCP client(s) in mcp_clients/.
6. The clients call the FastAPI MCP server endpoints.
- Results are merged, reranked, and formatted into a response for the user.
For more detail on routing, see SMART_ROUTER_GUIDE.md.
5.2. Hybrid search pipeline (Database MCP)
When the Database MCP performs a hybrid search:
- Input: query + optional filters.
- Steps:
  1. Keyword search via BM25.
  2. Semantic search via Chroma embeddings.
  3. Graph search via Neo4j (if enabled):
     - Resolve entities mentioned in the query.
     - Find related documents via graph relationships.
  4. Combine the BM25, semantic, and graph scores.
  5. Apply the cross-encoder reranker (reranker.py) to produce the final ranking.
- Output: a list of documents with:
  - similarity_score
  - bm25_score, semantic_score, graph_score (when available)
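The score-combination step can be sketched as a simple weighted fusion. The weights below are illustrative assumptions; the server's actual normalization and weighting may differ:

```python
# Illustrative weighted fusion of BM25, semantic, and graph scores.
# The weights are assumptions; the server's real values may differ.
def fuse_scores(bm25: float, semantic: float, graph: float,
                weights=(0.4, 0.4, 0.2)) -> float:
    wb, ws, wg = weights
    return wb * bm25 + ws * semantic + wg * graph

docs = [
    {"id": 1, "bm25_score": 0.9, "semantic_score": 0.2, "graph_score": 0.0},
    {"id": 2, "bm25_score": 0.3, "semantic_score": 0.8, "graph_score": 0.5},
]
for d in docs:
    d["similarity_score"] = fuse_scores(
        d["bm25_score"], d["semantic_score"], d["graph_score"])

# Rank documents by the fused score before cross-encoder reranking.
ranked = sorted(docs, key=lambda d: d["similarity_score"], reverse=True)
```

The cross-encoder reranker then rescores only the top candidates from this fused list, since reranking every document would be too slow.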
Testing examples for this pipeline can be found in TESTING_GUIDE.md and mcp_database/test_database.py.
5.3. PDF RAG pipeline (RAG PDF MCP)
1. Upload the PDF → stored in the DB or filesystem.
2. Extract text pages with a PDF library.
3. Chunk the text into overlapping segments (chunk engine).
4. Store embeddings in Chroma plus tokens for BM25.
5. At query time:
   - Retrieve candidates via BM25 + vector search.
   - Optionally use graph/metadata signals.
   - Rerank and return the top chunks with scores and text snippets.
   - Optionally call an LLM to synthesize an answer from those chunks.
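The chunking step above can be sketched as a sliding character window. The chunk size and overlap values are illustrative; the server's actual chunk engine may split on tokens or sentences instead:

```python
# Illustrative sliding-window chunker. The server's real chunk engine may
# use token- or sentence-based splitting with different sizes.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list:
    """Split text into overlapping character windows."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # advance by the non-overlapping part
    return chunks

# 1200 characters with a 400-character stride yields 3 overlapping chunks.
chunks = chunk_text("a" * 1200, chunk_size=500, overlap=100)
```

Overlap matters because an answer that straddles a chunk boundary would otherwise be split across two retrieval units and rank poorly in both.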
See mcp_rag_pdf/HYBRID_RAG_INTEGRATION.md for deep details.
5.4. CSV → SQL → Answer (Database MCP)
When a CSV is uploaded via the app:
1. The app writes your CSV DataFrame to a temp .csv file.
2. DatabaseMCPClient.upload_csv() sends it to /csv/upload.
3. The server uses CSVDataManager to:
   - Validate the CSV.
   - Create or replace a MySQL table.
4. After upload, the server calls csv_graph_mapper.sync_table_to_graph(...):
   - Inspects columns (employees, managers, projects, departments).
   - Creates Employee, Project, and Department nodes.
   - Creates relationships:
     - WORKS_ON (Employee → Project)
     - REPORTS_TO (Employee → Manager)
     - IN_DEPARTMENT (Employee → Department)
5. When you ask a natural-language question about that CSV:
   - The app calls db_client.query_csv(...).
   - /csv/query uses CSVQueryEngine to:
     - Convert NL → SQL.
     - Execute it on the MySQL table.
     - Return rows plus an optional SQL preview.
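The NL → SQL → rows loop can be sketched end to end. This uses sqlite3 in place of MySQL and stubs out the LLM translation step, so it only illustrates the shape of the flow, not the actual CSVQueryEngine logic:

```python
# Conceptual sketch of the /csv/query flow, using sqlite3 in place of
# MySQL. The NL->SQL step is a stub; the real engine uses an LLM.
import sqlite3

def nl_to_sql_stub(question: str, table: str) -> str:
    """Stand-in for the LLM step: map one known question shape to SQL."""
    if "count" in question.lower() or "how many" in question.lower():
        return f"SELECT COUNT(*) AS n FROM {table}"
    return f"SELECT * FROM {table} LIMIT 10"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Aiko", "ML"), ("Ravi", "Data")])

# "NL question" -> SQL -> rows, mirroring query/rows/SQL-preview fields.
sql = nl_to_sql_stub("How many employees are there?", "employees")
rows = conn.execute(sql).fetchall()
```

The server additionally guards the generated SQL (table allow-listing, limits) before executing it, which the stub above does not attempt.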
6. Knowledge graph and Neo4j
The graph features are optional but powerful. They live mainly in mcp_database/graph_handler.py and mcp_database/csv_graph_mapper.py.
6.1. Entity extraction
- Uses spaCy for base entities (Person, Org, Location, etc.).
- Uses OpenAI GPT-4o-mini to:
  - Refine entity types.
  - Extract relationships between entities (e.g. WORKS_ON, REPORTS_TO).
Entities and relationships are then persisted to Neo4j via GraphHandler.
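Conceptually, the extractor's output needs to be normalized into node and edge records before the graph upserts. The field names below are assumptions, not the actual schema used by entity_extractor.py:

```python
# Illustrative shape of extraction output before it is persisted via
# GraphHandler. Field names are assumptions, not the actual schema.
def to_graph_payload(entities: list, relations: list) -> dict:
    """Normalize extraction output into node/edge dicts for Neo4j upserts."""
    nodes = [{"label": e["type"], "name": e["text"]} for e in entities]
    edges = [{"type": r["relation"], "from": r["source"], "to": r["target"]}
             for r in relations]
    return {"nodes": nodes, "edges": edges}

payload = to_graph_payload(
    entities=[{"text": "Aiko", "type": "Person"},
              {"text": "Search Revamp", "type": "Project"}],
    relations=[{"source": "Aiko", "relation": "WORKS_ON",
                "target": "Search Revamp"}],
)
```

Keeping a flat nodes/edges payload like this makes the persistence step a simple loop of MERGE statements, independent of how the entities were extracted.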
6.2. GraphHandler safety
graph_handler.py:
- Sanitizes all labels and relationship types (_sanitize_label()).
- Escapes property keys correctly.
- Uses backticked labels in Cypher to avoid syntax errors.
This ensures that unusual names in CSV/document data do not break Cypher queries.
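A sanitizer in that spirit might look like the following. This is a sketch under assumptions; the actual rules in graph_handler.py may differ:

```python
# Illustrative label sanitizer in the spirit of _sanitize_label();
# the actual rules in graph_handler.py may differ.
import re

def sanitize_label(raw: str) -> str:
    """Keep only characters that are safe inside a backticked Cypher label."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", raw.strip())
    if not cleaned or cleaned[0].isdigit():
        cleaned = "N_" + cleaned  # Cypher labels should not start with a digit
    return cleaned

def merge_node_cypher(label: str) -> str:
    # Backticked label plus a parameterized property avoids Cypher injection.
    return f"MERGE (n:`{sanitize_label(label)}` {{name: $name}})"
```

Note that node properties still go through query parameters (`$name`); only identifiers like labels and relationship types, which cannot be parameterized in Cypher, need sanitizing.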
6.3. CSV graph mapping
csv_graph_mapper.py:
- _classify_columns(columns):
  - Heuristically identifies:
    - Employee-related columns
    - Manager columns
    - Project columns
    - Department/team columns
- sync_table_to_graph(mysql_conn, graph_handler, table_name, row_limit):
  - Reads distinct values from MySQL.
  - Creates person/project/department nodes.
  - Iterates through rows to create:
    - WORKS_ON edges
    - REPORTS_TO edges
    - IN_DEPARTMENT edges
Example Cypher you can run in Neo4j browser:
// Who works on which project?
MATCH (e:Employee)-[:WORKS_ON]->(p:Project)
RETURN e.name, p.name
LIMIT 25;
// Reporting relationships
MATCH (e:Employee)-[:REPORTS_TO]->(m:Employee)
RETURN e.name AS employee, m.name AS manager
LIMIT 25;
For a testing walkthrough, see mcp_database/test_neo4j.py and the Neo4j section in TESTING_GUIDE.md.
7. Smart Router (cost‑optimized agent brain)
The Smart Router is documented in detail in SMART_ROUTER_GUIDE.md. In short:
- Roughly 90% of queries are routed with keyword rules (free, local).
- The remaining ~10% fall back to OpenAI classification when:
  - Confidence is low.
  - The query is ambiguous or complex.
- This keeps routing accuracy high while keeping per-user LLM cost low.
Configuration:
- Controlled via environment variables in ai_agent_system/.env:
  - USE_SMART_ROUTER=true
  - SMART_ROUTER_OPENAI=true
- Confidence thresholds and other settings are adjustable in code (router.py / agent.py).
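The keyword-first idea can be sketched as follows. The keyword table and threshold are made-up illustrations; the real router.py patterns and fallback logic are more elaborate:

```python
# Simplified sketch of keyword-first routing with an LLM fallback.
# The keyword table and threshold are illustrative, not router.py's values.
KEYWORD_ROUTES = {
    "database":   ("note", "document", "csv", "table"),
    "rag_pdf":    ("pdf", "page", "chapter"),
    "drive":      ("drive", "folder"),
    "web_search": ("latest", "news", "today"),
}

def route(query: str, threshold: float = 0.25):
    """Pick an MCP server by keyword overlap; defer to the LLM if unsure."""
    words = query.lower().split()
    scores = {server: sum(w in kws for w in words) / max(len(words), 1)
              for server, kws in KEYWORD_ROUTES.items()}
    server, confidence = max(scores.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return server, confidence      # cheap, local decision
    return "llm_fallback", confidence  # would call OpenAI classification
```

The cost saving comes from the threshold: only queries whose keyword signal is weak ever reach the paid OpenAI classification step.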
8. Testing and verification
For a step‑by‑step testing guide (user level + API level + advanced), see:
- TESTING_GUIDE.md
- mcp_database/test_database.py
- mcp_database/test_neo4j.py
- mcp_rag_pdf/test_rag.py
Typical quick checks:
- http://localhost:8003/health – Database MCP healthy.
- http://localhost:8004/health – RAG PDF MCP healthy.
- http://localhost:8001/health – Google Search MCP healthy (if implemented).
- http://localhost:8002/health – Google Drive MCP healthy.
- http://localhost:7474 – Neo4j browser.
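Those checks can be scripted with the standard library alone. This is a convenience sketch (not part of the repo); it assumes the default ports listed above:

```python
# Convenience script (not part of the repo) to ping each service's /health
# endpoint; assumes the default ports documented above.
import urllib.error
import urllib.request

SERVICES = {
    "Database MCP":      "http://localhost:8003/health",
    "RAG PDF MCP":       "http://localhost:8004/health",
    "Google Search MCP": "http://localhost:8001/health",
    "Google Drive MCP":  "http://localhost:8002/health",
}

def check(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    for name, url in SERVICES.items():
        print(f"{name:18s} {'UP' if check(url) else 'DOWN'} ({url})")
```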
9. Running only parts of the system
- Only notes + hybrid search:
  - Start the Database MCP and the Streamlit app only.
- Only PDF RAG:
  - Start the RAG PDF MCP and call it via mcp_rag_pdf/test_rag.py.
- Only web search or Drive:
  - Start the respective MCP server and hit /docs or /tools.
This is useful for demos where you do not need the full stack.
10. Contributing / extending
To add a new MCP server:
1. Create a new directory mcp_your_service/.
2. Implement a FastAPI server.py with:
   - Domain REST endpoints.
   - MCP endpoints: /tools, /tools/call, /resources, /resources/read, /prompts, /prompts/get.
3. Add a matching client in mcp_clients/your_service_mcp_client.py.
4. Integrate it with ai_agent_system/agent.py and router.py.

To extend the graph schema:
- Edit mcp_database/csv_graph_mapper.py and graph_handler.py to introduce new node/edge types and classification rules.
- Edit
Pull requests and code review notes can focus on:
- MCP protocol consistency between servers.
- Error handling and logging.
- Neo4j schema evolution.
- Search quality (BM25 + vector + reranker).
11. License and contact
- License: Proprietary – internal project at Sepia ML.
- Author: Shreyash Shankarrao Jadhav
For questions or review feedback, please contact the author or Sepia ML.