Contextractor extracts clean, readable content from any web page. Configure settings, preview results, then run via CLI, Docker, or Apify.
Contextractor playground
Preview extraction results, adjust Trafilatura settings, and generate ready-to-run commands. Install via PyPI or NPM, run with Docker, or scale on Apify.
Contextractor extracts clean, readable content from any web page — stripping away navigation, ads, and boilerplate to leave just the text you need.
It uses Trafilatura, the highest-rated open-source content extraction library (reported F1 score of 0.958). This makes it well suited to building LLM training datasets, RAG pipelines, and research applications.
Use the Playground to configure settings, preview results, and generate commands.
Trafilatura is a Python library that extracts the main content from web pages (article text, headings, and metadata) while stripping navigation, ads, sidebars, and footers. It uses a heuristic pipeline with fallback algorithms and consistently scores highest in independent extraction benchmarks. Contextractor uses Trafilatura as its extraction engine and adds a web interface and API on top of it.
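To make the "strip boilerplate, then fall back" idea concrete, here is a deliberately tiny sketch that uses only the Python standard library. It is not Trafilatura's algorithm or API: the tag list, the `min_length` threshold, and the names `extract_main_text` and `SAMPLE` are assumptions made up for this illustration. A real extractor relies on far richer heuristics.

```python
from html.parser import HTMLParser

# Illustrative assumption: these tags stand in for "boilerplate".
BOILERPLATE_TAGS = {"nav", "aside", "footer", "header"}
# Script and style content is never readable text, so it is always skipped.
ALWAYS_SKIP_TAGS = {"script", "style"}


class _TextCollector(HTMLParser):
    """Collects visible text, ignoring anything nested inside skipped tags."""

    def __init__(self, skip_tags):
        super().__init__()
        self.skip_tags = skip_tags
        self.depth = 0      # how deep we are inside skipped tags
        self.chunks = []    # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.depth == 0:
            self.chunks.append(text)


def _collect(html, skip_tags):
    parser = _TextCollector(skip_tags)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def extract_main_text(html, min_length=20):
    """Strict pass first; fall back to a lenient pass if too little is left."""
    strict = _collect(html, ALWAYS_SKIP_TAGS | BOILERPLATE_TAGS)
    if len(strict) >= min_length:
        return strict
    # Fallback: keep boilerplate regions rather than return almost nothing.
    return _collect(html, ALWAYS_SKIP_TAGS)


SAMPLE = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<article><h1>Title</h1><p>This is the article body that we want to keep.</p></article>
<footer>Copyright notice</footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

On the sample page, the strict pass keeps the heading and the paragraph and drops the navigation links and footer. A page that consists only of a `<nav>` block would trigger the fallback pass instead of returning an empty result.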
Apify also offers a generous Creator plan, limited to running your own actors, that costs $1/month (billed as $6 semi-annually). It includes a one-time $500 platform credit for your first 6 months, up to 32 GB of RAM, and 32 concurrent actor runs.