ssr-concept-mcp

Concept-test your product with synthetic consumers, straight from Claude Code.

Drop one description.md into your repo, ask Claude to "run an SSR concept test", and get a purchase-intent report: Claude roleplays a panel of demographically-grounded consumers reacting to your pitch, and this MCP server turns their free-text reactions into Likert purchase-intent distributions using Semantic Similarity Rating (SSR).

| Rank | Concept                          | Mean PI | Distribution 1-5            |
|------|----------------------------------|---------|-----------------------------|
| 1    | BrewSense Smart Coffee Scale     | 3.00    | 13%  22%  29%  26%  11%     |
| 2    | SurplusCrate (built-in honeypot) | 2.14    | 31%  35%  23%  10%   1%     |

Sanity: honeypot PASS - demographics toggle PASS - embedder calibration PASS

MIT licensed. Runs fully offline after first model download.

What is this?

An implementation of the concept-testing pipeline from the paper "LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings" (Maier, Aslak, Fiaschi, Rismal, Fletcher, Luhmann, Dow, Pappas, Wiecki — 2025, arXiv:2510.08338), packaged as an MCP server for Claude Code.

The core problem the paper solves: if you ask an LLM to rate purchase intent as a number, you get unrealistic, center-clustered distributions. SSR fixes this by eliciting free text ("Honestly, $79 feels steep for a scale, but the no-subscription part tempts me...") and mapping it to a probability distribution over the 1–5 Likert scale via embedding similarity to anchor statements. Against 9,300 real human survey responses, the paper's method achieved ~90% of human test–retest reliability with realistic response distributions.

The SSR math itself comes from the authors' reference implementation, pymc-labs/semantic-similarity-rating — this project does not reimplement it.
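
For intuition only, the mapping can be sketched in a few lines of plain Python. This is not the reference implementation's API: the function names, the toy 2-D vectors standing in for real embeddings, and the uniform fallback are assumptions made for illustration. The recipe follows the paper's description: cosine similarity to each anchor, subtract the minimum, then normalize into a PMF.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def ssr_pmf(response_vec, anchor_vecs, epsilon=0.0):
    """Map one embedded response to a PMF over Likert points 1..5.

    Hypothetical helper: anchor_vecs[i] is the embedding of the anchor
    statement for rating i + 1.
    """
    sims = [cosine(response_vec, anchor) for anchor in anchor_vecs]
    lowest = min(sims)
    # Min-subtraction: the least similar anchor gets zero mass (plus epsilon).
    shifted = [s - lowest for s in sims]
    shifted[sims.index(lowest)] += epsilon
    total = sum(shifted)
    if total == 0:
        # Assumption: fall back to uniform when every anchor is equally similar.
        return [1 / len(sims)] * len(sims)
    return [s / total for s in shifted]

def mean_rating(pmf):
    """Expected Likert rating under the PMF."""
    return sum(rating * p for rating, p in enumerate(pmf, start=1))

# Toy stand-ins for embeddings: five anchors fanned out from 0 to 90 degrees,
# and a response that points closest to the rating-4 anchor.
anchors = [(math.cos(math.radians(t)), math.sin(math.radians(t)))
           for t in (0, 22.5, 45, 67.5, 90)]
response = (math.cos(math.radians(70)), math.sin(math.radians(70)))

pmf = ssr_pmf(response, anchors)
print([round(p, 3) for p in pmf], round(mean_rating(pmf), 2))
```

The point of the sketch is the shape of the output: a full distribution per response rather than a single integer, which is what lets the aggregate look like a human Likert histogram.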

Division of labor — this is the part that makes it an MCP server rather than a script:

Your Claude Code session is the generator. It receives one impersonation prompt per persona ("You are a 38-year-old ops manager at an SMB...") and roleplays each consumer's brief reaction in-session. No extra API keys, no server-side LLM calls.
This server does everything else: samples personas from a weighted population, embeds responses, runs the SSR mapping, runs sanity checks, and writes the report.

One rule is enforced by construction: the synthetic consumer only ever sees the stimulus (your description.md + optional images) — never your code, your docs, or a running instance. A real buyer sees a landing page, not a codebase.

Quick start

Requirements: Python 3.11+, Claude Code.

# 1. Install the server (one-time; pulls PyTorch, allow a few minutes)
uv tool install git+https://github.com/0xksure/ssr-concept-mcp.git
#    or: pipx install git+https://github.com/0xksure/ssr-concept-mcp.git

# 2. Register it with Claude Code, from inside your product's repo
cd your-project
claude mcp add ssr -- ssr-mcp

# 3. Write the one required file: your concept, as a prospect would see it
cat > description.md <<'EOF'
# Your Product Name

**One-line tagline**

A paragraph or two of prospect-facing description, written the way your
landing page would say it. What it does, for whom, why it's different.

## Features
- First feature
- Second feature

## Pricing
$29/month
EOF

Then open Claude Code in that project and ask:

Run an SSR concept test on this project with 20 personas.

Claude walks the tool flow (init → sample personas → roleplay every consumer → score → sanity-check → report) and hands you the path to runs/<timestamp>/summary.md. The first scoring call downloads the local embedding model (~440 MB, cached forever after); everything runs offline from then on. Add runs/ to your .gitignore.

That's genuinely all a first run needs: one markdown file. Everything else (anchors, personas, run parameters, honeypots) falls back to built-in defaults — and the report tells you exactly which defaults were used. How to make that one file good is the next section.

Writing your `description.md`

The synthetic consumers react to this file and nothing else, so the quality of the whole run is bounded by the quality of this copy. Write it the way a prospect would meet the product — a landing page or a concept card — not the way a developer would describe the repo.

The structure is parsed tolerantly; only the title and at least one description paragraph are mandatory:

# Product Name        ← name (required)
**Tagline**           ← headline: first bold-only line
Paragraphs of copy.   ← description (required)
## Features           ← bulleted list
## Pricing            ← pricing line
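
To make "parsed tolerantly" concrete, here is a minimal sketch of a parser for that layout. It is not the server's actual parser: `parse_description` and the returned keys are hypothetical, and it merges all pre-section paragraphs into one description string for simplicity.

```python
import re

def parse_description(md_text):
    """Split a description.md into name, tagline, description, and sections."""
    name = None
    tagline = None
    paragraphs = []
    sections = {}
    current_section = None

    for raw_line in md_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("## "):
            # Any "## Heading" opens a section (Features, Pricing, Guarantee, ...).
            current_section = line[3:].strip()
            sections[current_section] = []
        elif line.startswith("# ") and name is None:
            name = line[2:].strip()
        elif current_section is not None:
            # Inside a section: keep bullet text and plain lines alike.
            item = line[2:] if line.startswith("- ") else line
            sections[current_section].append(item.strip())
        elif tagline is None and re.fullmatch(r"\*\*.+\*\*", line):
            # First bold-only line before any section is the tagline.
            tagline = line[2:-2].strip()
        else:
            paragraphs.append(line)

    if not name or not paragraphs:
        raise ValueError("description.md needs a '# Title' and at least one paragraph")
    return {
        "name": name,
        "tagline": tagline,
        "description": " ".join(paragraphs),
        "sections": sections,
    }

example = """# Your Product Name

**One-line tagline**

A paragraph of prospect-facing description.

## Features
- First feature
- Second feature

## Pricing
$29/month
"""

parsed = parse_description(example)
print(parsed["name"], "|", parsed["tagline"], "|", parsed["sections"]["Pricing"])
```

Only the two mandatory pieces raise an error when missing; everything else degrades to `None` or an empty dict, which is one plausible reading of "tolerant".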

Guidelines:

Benefit-first, buyer-facing. Say what changes for the customer, not how it's built. Internal jargon and tech-stack talk don't belong in a stimulus — no real buyer ever saw your architecture diagram.
Be concrete about who it's for and what it costs. Price is a major driver of purchase intent; include real pricing if it exists. Vague pitches get vague, center-clustered reactions.
Test the real product, not fantasy copy. Overselling inflates intent and the only person fooled is you. (Testing punchier positioning of real capabilities is legitimate — that's what variants in concepts/ are for.)
Keep it one-slide sized. The paper's stimulus was a marketing slide; a screenful of focused copy beats three pages.
Extra ## sections (e.g. ## Guarantee) are kept and shown to consumers. Use description.json instead when you need to attach concept images.

A complete example — this ships in examples/demo_project/, so you can run a concept test against it out of the box:

# BrewSense Smart Coffee Scale

**Dial in café-quality pour-over at home, every single morning**

BrewSense is a precision coffee scale with a built-in brew coach. Set your
recipe once and the LED ring guides your pour in real time — when to pour,
how fast, and when to stop — so every cup comes out the way your best cup did.
The companion app logs every brew, tracks your beans, and suggests one small
adjustment after each cup to get you closer to your ideal taste.

## Features
- Real-time pour guidance via LED ring and gentle haptics
- 0.1 g precision, water-resistant top plate
- Brew journal with bean tracking and taste notes
- Works fully offline; app is optional
- Rechargeable battery, 3 months per charge

## Pricing
$79 one-time. No subscription.

Note what makes it work as a stimulus: a named buyer situation (pour-over at home), a concrete mechanism the buyer can picture (LED ring guiding the pour), and an unambiguous price with the objection-killer attached (no subscription).

What the MCP does, and what you get out

Each run produces runs/<run_id>/summary.md (human) and summary.json (machine-readable, diffable across runs), containing:

Ranked concept table — mean purchase intent (PI, 1–5) and the full Likert distribution per concept. Your main concept is always tested alongside at least one deliberately terrible honeypot concept.
Segment breakdown — mean PI per persona segment, so you see who the concept lands with (e.g. support leads at 3.29 vs IT gatekeepers at 3.02).
Rationales by score band — the consumers' verbatim reactions grouped into high/mid/low intent. This qualitative layer is half the value: it reads like a focus-group transcript with the objections spelled out.
Sanity checks (pass/warn/fail) — see below; these are the run's validity signal.
Config provenance — which configs came from your project vs. built-in defaults, including a loud warning when the persona population was defaulted.
Caveats — the honesty block, in every report, always.

How to interpret the results

The numbers are relative, not absolute. There is no human-labeled calibration data in your run, so a mean PI of 3.2 does not mean "64% would buy". The signal is the ranking and spread between concepts evaluated under the same setup — variant A vs variant B, this week's pitch vs last week's.
Trust the run only as far as the sanity checks.

| Check | Question | If it fails |
|---|---|---|
| honeypot | Did an obviously bad concept rank last? | Discard the run: the ranking signal is broken |
| demographics_toggle | Do persona-stripped generations score the same as persona-conditioned ones? | warn: the roleplay is ignoring personas (the paper's ablation: ranking correlation collapsed ~92%→~50% without demographics) |
| embedder_calibration | Do hard-coded extreme yes/no answers map to the scale's ends? | Switch embedder or lower ssr_temperature below 1.0 |

Mind the noise floor. At n=20 personas, PI differences under ~0.2 are ties; at n=50, under ~0.1–0.15. The paper-style n is 200. Raise n before reading meaning into small deltas.
The method is bounded by domain knowledge. It works where the generating model has seen abundant consumer discussion of the category (consumer goods, SaaS, services). For exotic niches, expect weaker signal — and garbage stimulus produces garbage intent regardless.
Generic personas answer a question about nobody. Until you write a config/personas.json for your actual target market, the report carries a warning for a reason: demographics drive the ranking signal.
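
The noise-floor figures above are roughly what a back-of-envelope standard error gives. The sketch below is a simplification, not the server's method: it assumes every response is an independent draw from the BrewSense distribution in the table at the top of this README, with the default two samples per persona. Real responses from the same persona are correlated, so treat the numbers as a rough lower bound on the noise.

```python
import math

def standard_error(distribution, n_responses):
    """Standard error of the mean rating for n independent Likert draws."""
    total = sum(distribution)
    probs = [d / total for d in distribution]  # normalize (percentages may sum to 101)
    mean = sum(r * p for r, p in enumerate(probs, start=1))
    variance = sum(p * (r - mean) ** 2 for r, p in enumerate(probs, start=1))
    return math.sqrt(variance / n_responses)

brewsense = [13, 22, 29, 26, 11]  # percentages for ratings 1..5 from the example table
samples_per_persona = 2           # default from config/run.json

for n_personas in (20, 50, 200):
    se = standard_error(brewsense, n_personas * samples_per_persona)
    print(f"n={n_personas:>3}: standard error of mean PI ~ {se:.2f}")
```

Under these assumptions the standard error comes out near 0.19 at n=20, 0.12 at n=50, and 0.06 at n=200, which is in line with treating sub-0.2 gaps at n=20 as ties.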

How to iterate on your product with it

The loop that works:

Baseline. Run your current description.md against the default honeypot. Check sanity passes. Save the run.
Mine the objections. Read the low and mid band rationales — they cluster fast (price, trust, "who is this for", missing integration, compliance fear). These are positioning bugs you can fix in copy, or real product gaps for the roadmap.
A/B your fixes as variants. Put alternates in concepts/ — concepts/cheaper_pricing.md, concepts/security_first_pitch.md — and rerun. Every file there is scored and ranked in the same table. Change one thing per variant, keep the same persona seed and n, and the ranking is directly readable.
Sharpen the panel. As your ideal customer profile (ICP) gets clearer, refine the segments and weights in config/personas.json. Watch the segment table: a concept that lifts overall but craters with your highest-weight segment is not a win.
Track over time. summary.json is stable and diffable; keep runs/ as a longitudinal record of how intent moved as the pitch evolved. The honeypot stays in every run as your canary that the whole pipeline is still discriminating.

Configuration

Everything resolves per-config: project file if present, built-in default otherwise, with provenance reported in every run.

| Config | Project file | Default |
|---|---|---|
| Stimulus | description.md / description.json | required, no default |
| Anchors | config/anchors.json | 6 sets × 5 purchase-intent statements |
| Personas | config/personas.json | generic professional population + loud warning |
| Run config | config/run.json | paper defaults |
| Honeypots | concepts/honeypot_*.md | built-in generic weak concept |

Choosing an embedder

Default is local (BAAI/bge-base-en-v1.5 via sentence-transformers): zero keys, zero cost, fully offline after first download. For parity with the paper's embedding side (text-embedding-3-small), switch to OpenAI in config/run.json:

{ "embedding": { "embedder": "openai" } }

Requires OPENAI_API_KEY in the server's environment. Nothing else changes — embeddings are cached by content hash either way (~/.cache/ssr-mcp/), so re-scoring is free. If embedder_calibration fails on a different local model, switch to OpenAI or sharpen the PMF by lowering ssr_temperature below 1.0 (the SSR package scales pmf^(1/T), so smaller T = sharper).
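
The effect of the temperature is easy to see on a toy PMF. A minimal sketch of the pmf^(1/T) scaling with renormalization; `apply_temperature` is a hypothetical name for illustration, not a function from the SSR package.

```python
def apply_temperature(pmf, temperature):
    """Raise each probability to 1/T and renormalize; T < 1 sharpens the PMF."""
    powered = [p ** (1.0 / temperature) for p in pmf]
    total = sum(powered)
    return [p / total for p in powered]

flat = [0.1, 0.2, 0.4, 0.2, 0.1]
for t in (1.0, 0.5):
    print(t, [round(p, 3) for p in apply_temperature(flat, t)])
```

At T=0.5 every probability is squared before renormalizing, so the middle bin grows from 0.40 to about 0.62 while the tails shrink. That is the lever to reach for when extreme answers fail to land at the ends of the scale.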

`config/run.json`

All keys optional; deep-merged over the defaults shown here:

{
  "generation": { "backend": "mcp_client", "samples_per_persona": 2 },
  "embedding":  { "embedder": "local",
                  "local_model": "BAAI/bge-base-en-v1.5",
                  "openai_model": "text-embedding-3-small" },
  "ssr":        { "reference_set_id": "mean", "ssr_temperature": 1.0, "epsilon": 0.0 },
  "run":        { "n_personas": 200, "include_honeypots": true,
                  "demographics_toggle_check": true }
}
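
"Deep-merged" means a project file only needs the keys it changes; sibling keys keep their defaults. A minimal sketch of that behavior, assuming a plain recursive dict merge (`deep_merge` is a hypothetical helper, and the server's own merge may treat edge cases differently):

```python
import copy

DEFAULTS = {
    "generation": {"backend": "mcp_client", "samples_per_persona": 2},
    "embedding": {"embedder": "local",
                  "local_model": "BAAI/bge-base-en-v1.5",
                  "openai_model": "text-embedding-3-small"},
    "ssr": {"reference_set_id": "mean", "ssr_temperature": 1.0, "epsilon": 0.0},
    "run": {"n_personas": 200, "include_honeypots": True,
            "demographics_toggle_check": True},
}

def deep_merge(defaults, overrides):
    """Return defaults with overrides applied recursively; inputs are not mutated."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged

# A project config/run.json that only touches two values.
project_run_json = {"ssr": {"ssr_temperature": 0.8}, "run": {"n_personas": 50}}

config = deep_merge(DEFAULTS, project_run_json)
print(config["ssr"])
print(config["run"])
```

Here `reference_set_id` and `epsilon` survive untouched even though the project file supplied its own `ssr` object.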

`config/personas.json`

Your target market as weighted segments of a joint demographic distribution. The sampler allocates N personas proportional to weights and fixes concrete attributes (age within range, gender from split). Any extra attribute keys (like context below) flow verbatim into the roleplay prompt — use them to carry tooling, pain points, and buying power:

{
  "population": "US home-coffee enthusiasts",
  "segments": [
    {
      "id": "enthusiast",
      "label": "Pour-over enthusiasts",
      "weight": 0.6,
      "attributes": {
        "age_range": [25, 45],
        "gender_split": { "female": 0.5, "male": 0.5 },
        "role": "drinks pour-over daily, owns a gooseneck kettle",
        "income": "$60k-$120k household",
        "region": "urban US",
        "context": "already spends on specialty beans; gear budget exists but is scrutinized"
      }
    },
    { "id": "casual", "weight": 0.4,
      "attributes": { "age_range": [30, 60], "role": "casual drip-coffee drinker" } }
  ]
}
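
How a weighted population turns into N concrete personas can be sketched as follows. This is an illustration under stated assumptions, not the server's sampler: `allocate` uses largest-remainder rounding and `sample_persona` draws age uniformly from the range, both of which are guesses at reasonable behavior.

```python
import math
import random

SEGMENTS = [
    {"id": "enthusiast", "weight": 0.6,
     "attributes": {"age_range": [25, 45],
                    "gender_split": {"female": 0.5, "male": 0.5},
                    "role": "drinks pour-over daily, owns a gooseneck kettle",
                    "context": "gear budget exists but is scrutinized"}},
    {"id": "casual", "weight": 0.4,
     "attributes": {"age_range": [30, 60], "role": "casual drip-coffee drinker"}},
]

def allocate(segments, n_personas):
    """Split n_personas across segments proportionally to weight (largest remainder)."""
    total_weight = sum(s["weight"] for s in segments)
    quotas = [s["weight"] / total_weight * n_personas for s in segments]
    counts = [math.floor(q) for q in quotas]
    leftover = n_personas - sum(counts)
    # Hand the remaining seats to the segments with the largest fractional part.
    order = sorted(range(len(segments)),
                   key=lambda i: quotas[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return {s["id"]: c for s, c in zip(segments, counts)}

def sample_persona(segment, rng):
    """Fix concrete attributes; unknown keys pass through verbatim."""
    attrs = dict(segment["attributes"])
    persona = {"segment": segment["id"]}
    age_range = attrs.pop("age_range", None)
    if age_range:
        persona["age"] = rng.randint(age_range[0], age_range[1])
    split = attrs.pop("gender_split", None)
    if split:
        persona["gender"] = rng.choices(list(split), weights=list(split.values()))[0]
    persona.update(attrs)  # role, context, income, ... flow into the prompt as-is
    return persona

rng = random.Random(7)  # a fixed seed keeps the panel identical across variants
print(allocate(SEGMENTS, 20))
print(sample_persona(SEGMENTS[0], rng))
```

Seeding the generator is what makes "keep the same persona seed and n" possible when comparing variants.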

`config/anchors.json`

Six differently-worded reference sets, five statements each (int_response 1–5 + sentence), averaged to reduce wording sensitivity. The built-in sets are generic purchase-intent statements and fine for almost everyone; override only if your scale isn't purchase intent (e.g. likelihood-to-recommend).
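
Averaging over reference sets can be pictured as an element-wise mean of per-set PMFs. A minimal sketch, assuming each anchor set has already produced its own five-point PMF for the same response; the numbers are made up for illustration and only three sets are shown for brevity.

```python
def average_pmfs(pmfs):
    """Element-wise mean of several PMFs over the same Likert scale."""
    n_sets = len(pmfs)
    return [sum(column) / n_sets for column in zip(*pmfs)]

# Hypothetical PMFs for one response, scored against three differently-worded sets.
per_set_pmfs = [
    [0.05, 0.15, 0.30, 0.35, 0.15],
    [0.00, 0.20, 0.35, 0.30, 0.15],
    [0.10, 0.10, 0.25, 0.40, 0.15],
]

averaged = average_pmfs(per_set_pmfs)
print([round(p, 3) for p in averaged])
```

A quirk of one set's wording moves one input PMF, but only a fraction of that shift reaches the average.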

Honeypots and variants

concepts/honeypot_*.md — your own deliberately weak concept(s); otherwise a built-in one is used. Keep one in every run.
concepts/<anything-else>.md — additional variants, scored and ranked alongside the main concept. Same format as description.md.
Use description.json instead of description.md when you need to attach images: { "name": ..., "description": ..., "images": ["concept-card.png"] }.

The tool flow (under the hood)

Six MCP tools; Claude Code orchestrates them and generates the responses:

ssr_init             resolve configs, parse stimulus, list concepts, new run
ssr_sample_personas  N personas from the weighted population (pure logic)
ssr_get_prompts      per-persona impersonation prompts + stimulus block
        ↓            (Claude roleplays every consumer's free-text answer,
                      plus persona-stripped answers for the toggle check)
ssr_score            embeddings → SSR PMFs → mean PI, distribution, segments
ssr_sanity           honeypot / demographics-toggle / embedder calibration
ssr_report           writes runs/<run_id>/summary.md + summary.json

Run state persists on disk under runs/<run_id>/ after every call, so the flow survives server restarts and every intermediate artifact (personas, raw responses, per-response PMFs) is inspectable. Any MCP client can drive the same flow — scripts/mcp_call.py shows how — but the prompts are written for an agent that can roleplay, which is why Claude Code is the documented host.
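
The scoring and honeypot steps reduce to simple aggregation once each response has a PMF. A minimal sketch, assuming scored rows carry hypothetical `concept`, `segment`, and `pmf` keys (the on-disk artifact format is not specified here) and reading "ranked last" as every honeypot scoring below every real concept:

```python
from collections import defaultdict

def mean_pi(pmf):
    """Expected purchase intent (1..5) under a PMF."""
    return sum(rating * p for rating, p in enumerate(pmf, start=1))

def aggregate(scored):
    """Build the ranked concept table and per-segment mean PI."""
    by_concept = defaultdict(list)
    by_segment = defaultdict(list)
    for row in scored:
        by_concept[row["concept"]].append(row["pmf"])
        by_segment[(row["concept"], row["segment"])].append(mean_pi(row["pmf"]))

    table = []
    for concept, pmfs in by_concept.items():
        # The concept's distribution is the average of its per-response PMFs.
        distribution = [sum(col) / len(pmfs) for col in zip(*pmfs)]
        table.append({"concept": concept,
                      "mean_pi": mean_pi(distribution),
                      "distribution": distribution})
    table.sort(key=lambda r: r["mean_pi"], reverse=True)

    segments = {key: sum(vals) / len(vals) for key, vals in by_segment.items()}
    return table, segments

def honeypots_rank_last(table, honeypot_names):
    """True when every honeypot scores below every real concept."""
    bad = [r["mean_pi"] for r in table if r["concept"] in honeypot_names]
    real = [r["mean_pi"] for r in table if r["concept"] not in honeypot_names]
    return bool(bad) and bool(real) and max(bad) < min(real)

scored = [
    {"concept": "main", "segment": "enthusiast", "pmf": [0.1, 0.2, 0.3, 0.3, 0.1]},
    {"concept": "main", "segment": "casual", "pmf": [0.0, 0.2, 0.3, 0.4, 0.1]},
    {"concept": "honeypot", "segment": "enthusiast", "pmf": [0.4, 0.3, 0.2, 0.1, 0.0]},
    {"concept": "honeypot", "segment": "casual", "pmf": [0.3, 0.4, 0.2, 0.1, 0.0]},
]

table, segments = aggregate(scored)
for row in table:
    print(row["concept"], round(row["mean_pi"], 2))
print("honeypot check:", "PASS" if honeypots_rank_last(table, {"honeypot"}) else "FAIL")
```

Because per-response PMFs are persisted under runs/<run_id>/, this kind of re-aggregation can be repeated offline without re-running the roleplay.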

Method notes & honest limitations

Generation is Claude roleplaying in your session; the paper used GPT-4o and Gemini-2.0-flash. The method should transfer, but this configuration is off-paper — and so are your results until validated against real buyers.
The local default embedder differs from the paper's. The SSR levers (min-subtraction, temperature) absorb embedder differences, but with no human data they can only be sanity-checked, not tuned to truth.
Short idiomatic answers without explicit purchase language ("Pass — tells you everything about this offer") occasionally embed ambiguously and land in the wrong band individually. The aggregate distributions are robust to this; the per-response view is where you'll notice it.
This is a screening instrument. It compresses "would my market plausibly care" from weeks to minutes, and it is exactly as honest as its stimulus, personas, and sanity checks. It does not replace talking to customers.

Development

git clone https://github.com/0xksure/ssr-concept-mcp.git && cd ssr-concept-mcp
python3 -m venv .venv && .venv/bin/pip install -e ".[dev]"
.venv/bin/python -m pytest             # unit tests (fake embedder, no network)
.venv/bin/python scripts/smoke_core.py # end-to-end on examples/demo_project, real embedder
.venv/bin/python scripts/mcp_smoke.py  # all six tools over stdio MCP

Layout: src/ssr_mcp/core/ is the framework-agnostic library (stimulus parser, config resolution, persona sampler, embedders, SSR scoring, sanity harness, report, run state); src/ssr_mcp/server.py is the thin MCP layer; src/ssr_mcp/defaults/ holds the built-in configs.

Citation

If you use this, cite the people who invented the method:

@article{maier2025ssr,
  title   = {LLMs Reproduce Human Purchase Intent via Semantic Similarity
             Elicitation of Likert Ratings},
  author  = {Maier, Benjamin F. and Aslak, Ulf and Fiaschi, Luca and
             Rismal, Nina and Fletcher, Kemble and Luhmann, Christian C. and
             Dow, Robbie and Pappas, Kli and Wiecki, Thomas V.},
  journal = {arXiv preprint arXiv:2510.08338},
  year    = {2025}
}

SSR reference implementation: pymc-labs/semantic-similarity-rating (MIT).
