Web content extraction tool

Contextractor extracts clean, readable content from any web page, powered by Trafilatura

What is Contextractor?

Contextractor takes any web page and strips away navigation, ads, and other boilerplate, leaving just the clean, readable text you need.

It uses Trafilatura, the highest-rated open-source content extraction library (F1 score of 0.958), which makes it well suited to building LLM training datasets, RAG pipelines, and research applications.
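For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below shows the arithmetic only. The function name and the counts are made up for illustration and are not taken from any benchmark, and how a given benchmark counts correct and incorrect extractions is not covered here.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1 score (harmonic mean of precision and recall)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted  # share of extracted items that were correct
    recall = true_positives / actual        # share of correct items that were extracted
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 90 correct extractions, 10 spurious, 10 missed.
print(round(f1_score(90, 10, 10), 3))
```

A score close to 1 means the extractor both keeps most of the real content and lets little boilerplate through.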

Use it via the CLI, Docker, or the Apify actor. Try the Playground to configure settings and preview the resulting commands.

What is Trafilatura?

Trafilatura is a Python library that extracts the main content of a web page (article text, headings, and metadata) while stripping navigation, ads, sidebars, and footers. It uses a heuristic pipeline with fallback algorithms and consistently scores highest in independent extraction benchmarks. Contextractor uses Trafilatura as its extraction engine and adds a web interface and an API on top of it.
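To make the idea of a heuristic with a fallback concrete, here is a deliberately simplified toy written with the Python standard library only. It is not Trafilatura's algorithm or API. The tag lists, the class, and the `extract_text` function are assumptions chosen for illustration: the primary rule keeps text found inside `<article>` or `<main>`, and the fallback keeps any text outside elements treated as boilerplate.

```python
from html.parser import HTMLParser

# Assumed tag lists for this toy example; real extractors use far richer rules.
BOILERPLATE_TAGS = {"nav", "aside", "footer", "script", "style"}
MAIN_TAGS = {"article", "main"}


class _TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a boilerplate element
        self.main_depth = 0   # > 0 while inside <article> or <main>
        self.main_text = []   # text found inside a main-content element
        self.all_text = []    # all non-boilerplate text, used as the fallback

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1
        elif tag in MAIN_TAGS:
            self.main_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in MAIN_TAGS and self.main_depth > 0:
            self.main_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text or self.skip_depth > 0:
            return
        self.all_text.append(text)
        if self.main_depth > 0:
            self.main_text.append(text)


def extract_text(html: str) -> str:
    """Toy extractor: prefer <article>/<main> text, else all non-boilerplate text."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    chosen = parser.main_text or parser.all_text  # primary rule, then fallback
    return "\n".join(chosen)


sample = """
<html><body>
  <nav><a href="/">Home</a> <a href="/about">About</a></nav>
  <article><h1>Title</h1><p>First paragraph.</p></article>
  <footer>Copyright notice</footer>
</body></html>
"""
print(extract_text(sample))
```

The fallback matters for pages that lack semantic markup: when no `<article>` or `<main>` element is present, the toy still returns whatever text sits outside the assumed boilerplate tags instead of returning nothing.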

Did you know? Apify offers a free tier that gives you $5 to use each month.

Apify also has a generous Creator plan that costs just $1/month (billed as $6 semi-annually), though it only lets you run your own actors. It includes a one-time $500 platform credit for your first 6 months, with up to 32 GB of RAM and 32 concurrent actor runs.
