Contextractor extracts clean, readable content from any web page — stripping away navigation, ads, and boilerplate to leave just the text you need.
It uses Trafilatura, the highest-rated open-source content extraction library (F1 score 0.958). Ideal for building LLM training datasets, RAG pipelines, and research applications.
Trafilatura is a Python library that extracts the main content from web pages (article text, headings, and metadata) while stripping navigation, ads, sidebars, and footers. It uses a heuristic pipeline with fallback algorithms and consistently scores highest in independent extraction benchmarks. Contextractor uses Trafilatura as its extraction engine and adds a web interface and API on top of it.
Apify also has a generous Creator plan (though you can run only your own actors) that costs just $1/month (billed $6 semi-annually) and includes a one-time $500 platform credit for your first 6 months, with up to 32 GB RAM and 32 concurrent actor runs.