2026-05-17 · KY · ~5 min read

How to Turn Your Website Into an MCP Server (Step-by-Step)

Two paths reach the same outcome — a website you can query as an MCP server from Claude, ChatGPT, Cursor, or any other MCP client. Path A: self-host the crawler, vector index, and MCP server yourself. Path B: use a managed service. This post walks through both honestly, including the architectural pieces you can't skip on the self-host route, so you can pick with eyes open.

What "turning a website into an MCP server" actually requires

An MCP server that lets an AI answer questions about a website needs four pieces, regardless of who builds it:

A crawler that fetches every URL on the site (respecting robots.txt), handles JavaScript-rendered pages, dedups duplicates, and re-crawls on a schedule.
An extractor + chunker that converts each HTML page into clean Markdown, splits it into retrieval-sized chunks (typically 500–1500 chars, with overlap), and discards boilerplate.
A hybrid search index — both vector (semantic similarity via embeddings) and keyword (BM25 / FTS5) — so the AI can find relevant chunks across the whole site for any query.
An MCP server endpoint that speaks the Model Context Protocol over streamable HTTP, accepts search / fetch tool calls from the client, runs them against the index, and returns chunks with source URLs.

The MCP layer itself is the easy part — it's a small wrapper. The hard parts are crawling reliably, keeping the index fresh, and the search quality.

Path A — Self-host the whole stack

Honest assessment: a working MVP is 2–4 days for an experienced developer. A production-grade version that handles JS rendering, bot challenges, content updates, and search-quality edge cases is closer to 2–4 weeks of work, plus ongoing maintenance.

What you'll wire together

Crawler: Crawlee, Scrapy, or your own httpx + BeautifulSoup loop. Plan for robots.txt parsing, rate-limit politeness, sitemap.xml ingestion, and dedup. Add Puppeteer or Playwright for sites that render content client-side (Shopify, Notion-style SPAs).
Content extractor: Readability or Trafilatura to strip nav/footer/ad noise, plus Turndown (or similar) to convert clean HTML → Markdown.
Chunker: Recursive-character splitter (LangChain-style) or sentence-aware splitting; pick chunk size based on your embedder's context window.
Embeddings: OpenAI text-embedding-3-small ($0.02/M tokens), Cohere embed-v3, or local BGE / nomic-embed. Plan for ~$1–$10 per 1,000-page site, recurring on each re-crawl.
Vector store: pgvector on Postgres, Qdrant, Weaviate, Pinecone, or Cloudflare Vectorize. For BM25 you'll want a sibling SQLite FTS5 / Elasticsearch.
MCP server: the @modelcontextprotocol/sdk (TypeScript) or the Python equivalent. Expose search(query, k) and fetch(url) tools. Mount on a public URL with TLS.
Hosting: Cloudflare Workers, Vercel, Fly.io, Railway, or self-managed VPS. Pick something with a generous WebSockets / SSE story (MCP streamable HTTP needs it).

Real-world pitfalls you'll hit

SPA detection. Static fetch returns an empty shell on JS-rendered pages. You need a heuristic — empty Markdown after extraction, or a framework-root div regex — to fall back to Puppeteer for that URL.
Canonical URL handling. Sites publish the same content under /page, /page?ref=campaign, /page/, and www. vs apex. Without canonical handling, you index the same content 5× and dilute search results.
Refresh / dedup. Re-crawling needs to detect unchanged pages (ETag, Last-Modified, content hash) so you don't burn embedding cost on unchanged pages.
Bot challenges. Cloudflare's bot protection, hCaptcha, and others block plain HTTP fetch. Headless browsers help but cost more.
Quality drift. Without per-site evaluation, you won't notice when search quality degrades after an embedder version bump or a site redesign.

Cost shape

Per 1,000-page site, expect roughly: $1–$5 for the initial embedding pass, $0.50–$2 for weekly refreshes, plus your vector store and compute costs. The cost scales with total chunks (page count × ~10 chunks/page on average), not query volume.

Path B — Use a managed service (5 minutes)

If you'd rather not be the one debugging SPA detection at 2 AM, the same end state is available as a managed service. WebToMCP does all four pieces above on Cloudflare's edge: submit a URL, get an MCP endpoint URL, paste it into your AI client.

Sign in at webtomcp.net with Google (free tier, no credit card).
Add your website URL — your docs, store, wiki, or any public site.
Wait 1–10 minutes for crawling + indexing. JS rendering is metered per plan.
Copy the MCP endpoint URL from the dashboard and paste it into Claude Desktop, ChatGPT, Cursor, Gemini CLI, or Codex.

Free tier: 1 KB, 1 website, ~500 questions/month. Hobby ($9/mo): 5 KBs, 10 websites, 5,000 questions, 100 JS-rendered pages. See pricing →

Side-by-side

Aspect	Self-host	WebToMCP (managed)
Time to first endpoint	2–4 days MVP, 2–4 weeks production	5 minutes
Cost per 1k-page site	$1–$5 embed + infra	$0 (free) or $9/mo
JS rendering, canonical, refresh, dedup	You build all of it	Built in
Customization	Full — your stack, your call	Limited to dashboard settings
Operations	Your on-call rotation	Ours

When to pick each

Pick self-host when: you have unusual indexing rules (private auth-gated content, custom chunking by document type, embedded retrieval inside your existing data pipeline), you want to own the stack end-to-end, or you already have RAG infra in production for other use cases and adding MCP is incremental.

Pick managed when: you want an MCP endpoint live today, your site is a typical content-heavy public site (docs, store, wiki, blog), you don't have an existing RAG pipeline to plug into, or your team's time is better spent on product than on crawler maintenance.

A pragmatic hybrid

Many teams ship the managed version first (5 minutes to evaluate whether MCP-querying your site moves your AI experience), then decide if the workflow is valuable enough to bring in-house. If WebToMCP's defaults work for you, there's no need to migrate. If you need something WebToMCP doesn't expose, the self-host architecture above is the same shape — you can swap us out when you outgrow us.

What a website MCP endpoint is and how to get one in 5 minutes — the managed path in detail.
Connect your website to Claude
Connect your website to ChatGPT
What "web to MCP" means — the brand explainer

Questions? developer@webtomcp.net. Or sign in with Google to index your own site (free).