Web content extraction playground

Contextractor extracts clean, readable content from any web page. Configure settings, preview results, then run via CLI, Docker, or Apify.

What is Contextractor?

Contextractor extracts clean, readable content from any web page, stripping away navigation, ads, and boilerplate to leave just the text you need.

It uses Trafilatura, the open-source content extraction library that scores highest in independent benchmarks (F1 score 0.958). That makes it well suited to building LLM training datasets, RAG pipelines, and research applications.

Install via PyPI or NPM, run with Docker, or scale on Apify. Use the Playground to configure settings, preview results, and generate commands.

What is Trafilatura?

Trafilatura is a Python library that extracts the main content from web pages (article text, headings, and metadata) while stripping navigation, ads, sidebars, and footers. It uses a heuristic pipeline with fallback algorithms and consistently scores highest in independent extraction benchmarks. Contextractor uses Trafilatura as its extraction engine and adds a web interface and API on top of it.
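To give a feel for what the extraction engine does, here is a minimal sketch of calling Trafilatura directly from Python. It is not Contextractor's own code: the helper name `extract_main_text` and its `extractor` parameter are hypothetical, chosen for illustration. The keyword arguments passed to `trafilatura.extract` exist in recent Trafilatura releases; `output_format="markdown"` in particular assumes a reasonably new version.

```python
from typing import Callable, Optional


def extract_main_text(
    html: str,
    extractor: Optional[Callable[..., Optional[str]]] = None,
) -> str:
    """Return the main content of an HTML document, or "" if none is found.

    `extractor` is a hypothetical hook so the helper can be tried with a
    stand-in function; by default it uses trafilatura.extract.
    """
    if extractor is None:
        # Imported lazily; requires `pip install trafilatura`.
        from trafilatura import extract as extractor

    text = extractor(
        html,
        output_format="markdown",  # assumes a recent release; omit for plain text
        include_comments=False,    # drop reader comment sections
        include_tables=True,       # keep tables that belong to the article
    )
    # Trafilatura returns None when it cannot identify main content.
    return text or ""
```

Calling `extract_main_text(html)` on a downloaded page should return the article body without navigation or footer text. Very short documents may yield an empty string, since the extractor can reject content it considers too thin.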

Did you know? Apify offers a free tier that includes $5 of usage every month.

Apify also has a generous Creator plan that costs just $1/month (billed $6 semi-annually) and includes a one-time $500 platform credit for your first 6 months, with up to 32 GB RAM and 32 concurrent Actor runs. The one restriction is that you can run only your own Actors.
