Clean, LLM-ready web content
from any URL.
webextract is an API that turns messy web pages into clean markdown — navigation, ads, cookie banners, and boilerplate stripped — plus structured metadata and resolved links. One call. Built for RAG pipelines and AI agents.
One request in, clean content out
Request
POST /extract
Content-Type: application/json
{
  "url": "https://example.com/article",
  "formats": ["markdown"],
  "includeMetadata": true
}
Response
{
  "title": "The Real Headline",
  "byline": "Jane Doe",
  "wordCount": 1009,
  "markdown": "Clean article text as\nmarkdown, ready for your model…",
  "metadata": {
    "siteName": "Example News",
    "lang": "en",
    "canonical": "https://example.com/article"
  }
}
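The request and response above can be sketched in Python roughly as follows. This is a minimal illustration, not an official client: the host `api.webextract.example` and the bearer-token `Authorization` header are assumptions, since the page states neither the API host nor the auth scheme. The function only builds the request, so it can be inspected before anything is sent.

```python
import json
import urllib.request

# Hypothetical host: the page does not say where the API lives.
BASE_URL = "https://api.webextract.example"


def build_extract_request(page_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST /extract request from the sample, without sending it."""
    payload = {
        "url": page_url,
        "formats": ["markdown"],
        "includeMetadata": True,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; use whatever your API key instructions say.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def summarize(response_body: str) -> str:
    """Return a one-line summary of a response shaped like the sample."""
    doc = json.loads(response_body)
    lang = doc["metadata"]["lang"]
    return f'{doc["title"]} ({doc["wordCount"]} words, lang={lang})'
```

To send the request, pass the result to `urllib.request.urlopen(...)` and feed the decoded body to `summarize`.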
What you get
Readability-grade markdown
Main-content extraction via Mozilla Readability, converted to clean markdown your model can actually use.
Structured metadata
Title, byline, description, OpenGraph image, language, canonical URL, and favicon — parsed for you.
Batch mode
Extract up to 20 URLs in a single call. One bad URL never fails the whole batch.
Selector targeting
Pass a CSS selector to narrow extraction to exactly the region you care about.
SSRF-hardened
Private, loopback, and cloud-metadata addresses are refused — re-checked on every redirect hop.
Predictable & fast
Timeouts, size caps, and clean JSON errors. Priced for high-volume RAG and agent workloads.
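To make the SSRF-hardened item concrete, here is a rough sketch of what checking every redirect hop can look like. It is not webextract's implementation, only an illustration using Python's standard `ipaddress` module. The function names and the injected `resolve` callback (a stand-in for a DNS lookup) are hypothetical.

```python
import ipaddress
from typing import Callable, Iterable, List, Optional
from urllib.parse import urlsplit


def is_blocked_address(ip_text: str) -> bool:
    """True for private, loopback, link-local (cloud metadata), and similar ranges."""
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # covers 169.254.169.254, a common metadata address
        or ip.is_reserved
        or ip.is_unspecified
    )


def first_blocked_hop(
    hops: Iterable[str],
    resolve: Callable[[str], List[str]],
) -> Optional[str]:
    """Check each URL in a redirect chain; return the first one that must be refused.

    `resolve` maps a hostname to its IP addresses (hypothetical hook, injected
    so the check can be exercised without real DNS).
    """
    for hop in hops:
        host = urlsplit(hop).hostname
        if host is None:
            return hop  # no usable host: refuse rather than guess
        if any(is_blocked_address(ip) for ip in resolve(host)):
            return hop
    return None
```

The point of re-checking each hop is that a public URL can redirect to an internal one, so validating only the first URL is not enough.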
Pricing
Requests beyond your plan's quota are billed per request as overage. Cancel anytime.
FAQ
- What does webextract sell?
- A metered HTTP API. You send a URL; we return clean markdown and structured metadata for that page. Billing is based on a monthly request quota.
- Who is it for?
- Developers building retrieval-augmented generation (RAG) systems, AI agents, research tools, and content pipelines that need clean text from arbitrary web pages.
- How do I get access?
- Email vere@kaylie.ai for a direct API key, or subscribe via our API marketplace listing.
- Do you store the pages I extract?
- No. Pages are fetched, processed, and returned in the response. We do not retain page content.