How to Turn Your Website Into an MCP Server (Step-by-Step)
Two paths reach the same outcome — a website you can query as an MCP server from Claude, ChatGPT, Cursor, or any other MCP client. Path A: self-host the crawler, vector index, and MCP server yourself. Path B: use a managed service. This post walks through both honestly, including the architectural pieces you can't skip on the self-host route, so you can pick with eyes open.
What "turning a website into an MCP server" actually requires
An MCP server that lets an AI answer questions about a website needs four pieces, regardless of who builds it:
-
A crawler that fetches every URL on the site (respecting
robots.txt), handles JavaScript-rendered pages, dedups duplicates, and re-crawls on a schedule. - An extractor + chunker that converts each HTML page into clean Markdown, splits it into retrieval-sized chunks (typically 500–1500 chars, with overlap), and discards boilerplate.
- A hybrid search index — both vector (semantic similarity via embeddings) and keyword (BM25 / FTS5) — so the AI can find relevant chunks across the whole site for any query.
-
An MCP server endpoint that speaks the Model Context Protocol over streamable HTTP, accepts
search/fetchtool calls from the client, runs them against the index, and returns chunks with source URLs.
The MCP layer itself is the easy part — it's a small wrapper. The hard parts are crawling reliably, keeping the index fresh, and the search quality.
Path A — Self-host the whole stack
Honest assessment: a working MVP is 2–4 days for an experienced developer. A production-grade version that handles JS rendering, bot challenges, content updates, and search-quality edge cases is closer to 2–4 weeks of work, plus ongoing maintenance.
What you'll wire together
- Crawler: Crawlee, Scrapy, or your own httpx + BeautifulSoup loop. Plan for robots.txt parsing, rate-limit politeness, sitemap.xml ingestion, and dedup. Add Puppeteer or Playwright for sites that render content client-side (Shopify, Notion-style SPAs).
- Content extractor: Readability or Trafilatura to strip nav/footer/ad noise, plus Turndown (or similar) to convert clean HTML → Markdown.
- Chunker: Recursive-character splitter (LangChain-style) or sentence-aware splitting; pick chunk size based on your embedder's context window.
-
Embeddings: OpenAI
text-embedding-3-small($0.02/M tokens), Cohereembed-v3, or localBGE/nomic-embed. Plan for ~$1–$10 per 1,000-page site, recurring on each re-crawl. - Vector store: pgvector on Postgres, Qdrant, Weaviate, Pinecone, or Cloudflare Vectorize. For BM25 you'll want a sibling SQLite FTS5 / Elasticsearch.
-
MCP server: the @modelcontextprotocol/sdk (TypeScript) or the Python equivalent. Expose
search(query, k)andfetch(url)tools. Mount on a public URL with TLS. - Hosting: Cloudflare Workers, Vercel, Fly.io, Railway, or self-managed VPS. Pick something with a generous WebSockets / SSE story (MCP streamable HTTP needs it).
Real-world pitfalls you'll hit
- SPA detection. Static fetch returns an empty shell on JS-rendered pages. You need a heuristic — empty Markdown after extraction, or a framework-root div regex — to fall back to Puppeteer for that URL.
- Canonical URL handling. Sites publish the same content under
/page,/page?ref=campaign,/page/, andwww.vs apex. Without canonical handling, you index the same content 5× and dilute search results. - Refresh / dedup. Re-crawling needs to detect unchanged pages (ETag, Last-Modified, content hash) so you don't burn embedding cost on unchanged pages.
- Bot challenges. Cloudflare's bot protection, hCaptcha, and others block plain HTTP fetch. Headless browsers help but cost more.
- Quality drift. Without per-site evaluation, you won't notice when search quality degrades after an embedder version bump or a site redesign.
Cost shape
Per 1,000-page site, expect roughly: $1–$5 for the initial embedding pass, $0.50–$2 for weekly refreshes, plus your vector store and compute costs. The cost scales with total chunks (page count × ~10 chunks/page on average), not query volume.
Path B — Use a managed service (5 minutes)
If you'd rather not be the one debugging SPA detection at 2 AM, the same end state is available as a managed service. WebToMCP does all four pieces above on Cloudflare's edge: submit a URL, get an MCP endpoint URL, paste it into your AI client.
- Sign in at webtomcp.net with Google (free tier, no credit card).
- Add your website URL — your docs, store, wiki, or any public site.
- Wait 1–10 minutes for crawling + indexing. JS rendering is metered per plan.
- Copy the MCP endpoint URL from the dashboard and paste it into Claude Desktop, ChatGPT, Cursor, Gemini CLI, or Codex.
Free tier: 1 KB, 1 website, ~500 questions/month. Hobby ($9/mo): 5 KBs, 10 websites, 5,000 questions, 100 JS-rendered pages. See pricing →
Side-by-side
| Aspect | Self-host | WebToMCP (managed) |
|---|---|---|
| Time to first endpoint | 2–4 days MVP, 2–4 weeks production | 5 minutes |
| Cost per 1k-page site | $1–$5 embed + infra | $0 (free) or $9/mo |
| JS rendering, canonical, refresh, dedup | You build all of it | Built in |
| Customization | Full — your stack, your call | Limited to dashboard settings |
| Operations | Your on-call rotation | Ours |
When to pick each
Pick self-host when: you have unusual indexing rules (private auth-gated content, custom chunking by document type, embedded retrieval inside your existing data pipeline), you want to own the stack end-to-end, or you already have RAG infra in production for other use cases and adding MCP is incremental.
Pick managed when: you want an MCP endpoint live today, your site is a typical content-heavy public site (docs, store, wiki, blog), you don't have an existing RAG pipeline to plug into, or your team's time is better spent on product than on crawler maintenance.
A pragmatic hybrid
Many teams ship the managed version first (5 minutes to evaluate whether MCP-querying your site moves your AI experience), then decide if the workflow is valuable enough to bring in-house. If WebToMCP's defaults work for you, there's no need to migrate. If you need something WebToMCP doesn't expose, the self-host architecture above is the same shape — you can swap us out when you outgrow us.
Related reads
- What a website MCP endpoint is and how to get one in 5 minutes — the managed path in detail.
- Connect your website to Claude
- Connect your website to ChatGPT
- What "web to MCP" means — the brand explainer
Questions? developer@webtomcp.net. Or sign in with Google to index your own site (free).