# MCPC Data Pipeline Guide
This repository hosts the tooling we use to crawl Model Context Protocol (MCP) sources, normalize the collected repositories, and build curated issue datasets for annotation. The project is intentionally modular: every stage writes JSON artifacts that the next script consumes, so you can pause/resume or swap inputs at any time.
## Requirements
- Python 3.9 or newer (3.11+ recommended for better `asyncio` performance)
- Google Chrome + a matching ChromeDriver when running crawlers that rely on Selenium (Smithery, MCP Market, Cursor, etc.)
- Dependencies from `requirements.txt`:

```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
## Environment variables
Most scripts talk to the GitHub API. Set a personal access token once before running long jobs:
```bash
export GITHUB_TOKEN=ghp_your_token_here   # Windows PowerShell: $env:GITHUB_TOKEN="..."
```
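Inside the scripts, picking up the token usually amounts to a small helper like the sketch below. The function name `github_headers` is hypothetical, not the actual API of these scripts; only the environment variable and header names come from the GitHub REST conventions.

```python
import os

def github_headers() -> dict:
    """Build GitHub API request headers, attaching a token when one is set.

    Unauthenticated requests are capped at 60/hour; a personal access
    token raises that to 5,000/hour, which long crawl jobs effectively need.
    """
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Because the token is read from the environment at call time, you can export it once per shell session and every stage of the pipeline picks it up.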
If you need site-specific API keys (for example, Smithery), set them in the same way before invoking the related crawler.
## Repository layout
```
engine/       # Crawler implementations (servers + clients)
scripts/      # Data-processing utilities
mcp_servers/  # Generated server metadata
分组数据/      # Issue sampling & grouping outputs (Chinese for "grouped data")
config/       # YAML configs for crawl targets
```
## End-to-end workflow
### 1. Crawl raw sources
Use the unified crawler engine to hit one or more data sources defined in your config file.
```bash
python -m engine.core.crawler_engine \
    --config config/sites_config.yaml \
    --sources smithery pulse awesome glama \
    --type servers
```
Common options:
- `--sources`: space-delimited subset of sources. Omit to crawl every known source.
- `--type`: `servers`, `clients`, or `all`.
- `--download-sources`: immediately download each repository after its metadata is collected.
- `--download-only`: skip crawling and only download sources for the JSONs already on disk.
- `--categories-only` / `--categories-source`: rebuild category tags without touching metadata.
Each source writes `mcp_servers/<source>/<source>.json` (or `mcp_clients/...` for client crawlers).
### 2. Aggregate into a master list
Combine all source directories into a deduplicated list before filtering:
```bash
python scripts/aggregate_servers.py
```
By default it scans every subdirectory under `mcp_servers/`, normalizes GitHub URLs, removes duplicates, and writes `mcp_servers/servers_aggregate.json` (plus per-source counts).
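The normalize-then-dedupe step can be approximated as below. This is a sketch, not the script's actual implementation: the function names and the assumption that each entry carries a `url` field are illustrative.

```python
from urllib.parse import urlparse

def normalize_github_url(url: str) -> str:
    """Reduce a GitHub URL to a canonical lowercase "owner/repo" key."""
    path = urlparse(url.strip()).path.strip("/")
    parts = path.split("/")[:2]              # drop /tree/..., /issues/..., etc.
    key = "/".join(parts).lower()
    return key[:-4] if key.endswith(".git") else key

def dedupe(servers: list) -> list:
    """Keep the first entry seen for each canonical repository key."""
    seen, unique = set(), []
    for server in servers:
        key = normalize_github_url(server["url"])
        if key not in seen:
            seen.add(key)
            unique.append(server)
    return unique
```

Canonicalizing before comparing is what lets `https://github.com/Foo/Bar.git` and `https://github.com/foo/bar/tree/main` collapse into one entry.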
### 3. Filter low-signal repositories
`scripts/filter_servers.py` calls the GitHub REST API for each repository and enforces simple health rules (minimum activity, non-template, has a README, etc.).
```bash
python scripts/filter_servers.py \
    --input mcp_servers/servers_aggregate.json \
    --output mcp_servers/servers_filtered.json \
    --resume
```
Key flags:

- `--limit N`: stop after `N` retained repositories (helpful for smoke tests).
- `--resume`: continue where a previous run left off using `<output>.progress`.
Reasons for exclusion (e.g., `abandoned_project`, `placeholder_repo`) are summarized inside the output JSON.
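A health-rule check of this kind typically boils down to a predicate over the repo metadata returned by `GET /repos/{owner}/{repo}`. The sketch below is an assumption about the shape of such a check: the `template_repo` reason string, the `has_readme` field, and the 365-day threshold are all hypothetical, while `abandoned_project` and `placeholder_repo` mirror the reasons named above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_ACTIVITY_DAYS = 365  # hypothetical threshold, not the script's actual value

def exclusion_reason(repo: dict) -> Optional[str]:
    """Return a reason string when a repo fails a health rule, else None.

    `repo` mirrors a GitHub REST /repos/{owner}/{repo} response.
    """
    if repo.get("is_template"):
        return "template_repo"
    if repo.get("archived"):
        return "abandoned_project"
    pushed = datetime.fromisoformat(repo["pushed_at"].replace("Z", "+00:00"))
    if datetime.now(timezone.utc) - pushed > timedelta(days=MIN_ACTIVITY_DAYS):
        return "abandoned_project"
    if not repo.get("has_readme", True):   # README presence checked separately
        return "placeholder_repo"
    return None
```

Returning a reason string instead of a bare boolean is what makes the per-reason summary in the output JSON cheap to build.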
### 4. Collect meaningful issues
`scripts/collect_meaningful_issues.py` walks the filtered list, paginates through every GitHub issue, and stores those that meet both rules:
- At least two comments, and
- Multiple distinct participants (issue author plus commenters).
```bash
python scripts/collect_meaningful_issues.py \
    --resume \
    --input mcp_servers/servers_filtered.json \
    --output mcp_servers/servers_issues.json \
    --issues-per-repo 100 \
    --comments-per-issue 40
```
Tips:

- Use a high `--issues-per-repo` (up to 100) to reduce pagination overhead. The script automatically keeps requesting additional pages until a page returns fewer results.
- `--comments-per-issue` controls the top-N comments fetched per issue. Increase it if you need larger participant sets.
- The script writes both the full dataset and a `.progress` checkpoint so you can restart safely after rate-limit or network failures.
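The "request pages until one comes back short" loop described above is a standard pattern against GitHub's paged endpoints. A generic sketch (the `fetch_page` callable stands in for one API call such as `GET /repos/{owner}/{repo}/issues?page=N&per_page=M`; the names are assumptions):

```python
from typing import Callable, Iterator

def paginate(fetch_page: Callable[[int, int], list],
             per_page: int = 100) -> Iterator[dict]:
    """Yield items page by page until a short (or empty) page signals the end."""
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        yield from batch
        if len(batch) < per_page:   # a short page means there is no next page
            return
        page += 1
```

Stopping on a short page rather than an empty one saves one request per repository, which adds up when the filtered list contains thousands of repos.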
### 5. Sample and group issues
After `servers_issues.json` is ready, run `scripts/group_issues.py` to (a) carve out a sampled subset and (b) optionally partition it into labeled groups for annotation.
```bash
# Example: sample 663 issues, then create six 100-item groups + one 63-item group
python scripts/group_issues.py \
    --input mcp_servers/servers_issues.json \
    --output-dir 分组数据 \
    --sample-count 663 \
    --group-sizes 100,100,100,100,100,100,63 \
    --seed 42 \
    --sample-output 分组数据/sampled_servers_issues.json \
    --annotated-output 分组数据/annotated_servers_issues.json
```
What you get:
- `sampled_servers_issues.json`: standalone dataset containing only the sampled servers/issues (with `sampled: true`).
- `annotated_servers_issues.json`: the original input enriched with `sampled`/`group_assignment` flags so you can keep track of what has been used.
- `group_<n>.json`: one file per group containing flattened issue metadata for quick handoff.
Supplying `--seed` guarantees reproducible sampling/grouping. If you only need the sample without groups, omit `--group-sizes` and `--groups`/`--size`.
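Seeded reproducibility typically comes from a local `random.Random(seed)` instance rather than the global RNG. A sketch of the sample-then-slice logic under that assumption (function name and signature are illustrative, not the script's API):

```python
import random

def sample_and_group(issues: list, sample_count: int,
                     group_sizes: list, seed: int):
    """Draw a reproducible sample, then slice it into fixed-size groups."""
    assert sum(group_sizes) == sample_count, "group sizes must cover the sample"
    rng = random.Random(seed)               # local RNG: same seed, same sample
    sampled = rng.sample(issues, sample_count)
    groups, start = [], 0
    for size in group_sizes:
        groups.append(sampled[start:start + size])
        start += size
    return sampled, groups
```

With the example invocation above, `--sample-count 663` and `--group-sizes 100,...,63` satisfy exactly this sum check: 6 × 100 + 63 = 663.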
## Resuming long jobs safely
- `filter_servers.py` and `collect_meaningful_issues.py` both emit `<output>.progress`. Deleting the progress file forces a clean restart.
- `collect_meaningful_issues.py --resume` also reloads the main output JSON so partial results are preserved across crashes.
- `group_issues.py` does not mutate the source file except to add annotations when you point `--annotated-output` at the same path. Use a different path if you want to keep the original pristine.
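The `.progress` checkpoint pattern can be sketched as a pair of helpers like these. This is an assumption about the mechanism, not the scripts' actual format: here the checkpoint is simply a JSON list of already-processed repository keys.

```python
import json
from pathlib import Path

def load_progress(output_path: str) -> set:
    """Read the set of already-processed repo keys, if a checkpoint exists."""
    progress_file = Path(f"{output_path}.progress")
    if progress_file.exists():
        return set(json.loads(progress_file.read_text()))
    return set()

def save_progress(output_path: str, done: set) -> None:
    """Persist the processed set; called after each repository completes."""
    Path(f"{output_path}.progress").write_text(json.dumps(sorted(done)))
```

On `--resume`, a run loads this set and skips any repository whose key is already in it, which is why deleting the file forces a clean restart.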
## Troubleshooting & best practices
| Symptom | Likely cause | Fix |
| --- | --- | --- |
| 403 with `X-RateLimit-Remaining: 0` | GitHub rate limit | Wait for the script to sleep automatically, or use a higher-tier PAT. |
| Selenium timeouts | Headless Chrome missing or mismatched driver | Install Chrome + ChromeDriver and ensure both are on your `PATH`. |
| Some repos missing issues | They may be archived/private or unreachable via the API | Run the collector with `--resume` after fixing credentials; repos that previously failed will be retried. |
| Need to test without overwriting production files | N/A | Use `--output` (and related flags) to point at a temporary location. |
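The "sleep automatically" behavior in the first row usually keys off GitHub's rate-limit response headers. A sketch of the computation, assuming the standard `X-RateLimit-Remaining` / `X-RateLimit-Reset` headers (the 5-second pad is my safety margin, not part of the API contract):

```python
def seconds_until_reset(headers: dict, now: float) -> float:
    """How long to sleep when GitHub answers 403 with the limit exhausted.

    `headers` is the response-header mapping; `now` is a Unix timestamp,
    comparable to the epoch-seconds value in X-RateLimit-Reset.
    """
    if headers.get("X-RateLimit-Remaining") != "0":
        return 0.0                     # not rate-limited; no need to wait
    reset = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset - now) + 5.0
```

A collector would call this after any 403 and `time.sleep()` for the returned duration before retrying the same request.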
## License & contributions
Internal project – update this section if you plan to publish the toolkit. Pull requests that add new MCP sources or improve filtering heuristics are welcome.