POST /crawl

Recursive Web
Crawling API

Point Spider at any URL. It recursively discovers every page on the domain, streams results as they're found, and returns clean content in your preferred format — all from a single API call.

100K+ pages/sec
Configurable depth
50K req/min
5 formats

Recursive Expansion

depth 0: 1 page (seed URL)
depth 1: 8 pages (first links)
depth 2: 47 pages (two hops)
depth 3: 200+ pages (full discovery)

How It Works

STEP 1

Submit a seed URL

Send one or more starting URLs. Spider loads each page and identifies every link on it.

STEP 2

Recursive discovery

Links within the domain are followed until your depth or page limits are reached. Duplicates are automatically skipped.
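Conceptually, this discovery loop behaves like a breadth-first traversal with a visited set. A minimal sketch, not Spider's actual implementation, assuming a get_links(url) helper that returns the in-domain links found on a page:

```python
from collections import deque

def discover(seed, get_links, max_depth=3, limit=500):
    """Breadth-first link discovery with dedup, depth and page limits."""
    seen = {seed}
    queue = deque([(seed, 0)])
    pages = []
    while queue and len(pages) < limit:
        url, depth = queue.popleft()
        pages.append(url)
        if depth >= max_depth:
            continue  # collect this page, but don't expand its links
        for link in get_links(url):
            if link not in seen:  # duplicates are skipped
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

With a toy link graph, `discover("a", ...)` visits each page exactly once, in hop order, which is the same guarantee the crawl API provides for a domain.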

STEP 3

Stream structured output

Each discovered page is returned in your chosen format — markdown, HTML, text, or bytes — with optional metadata, links, and headers.
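If you consume the raw HTTP stream yourself, results arrive incrementally rather than as one response. A sketch of parsing newline-delimited JSON records as they arrive; the field names follow the examples below, but the exact wire format is an assumption, so check the API reference:

```python
import json

def parse_stream(lines):
    """Yield one page dict per newline-delimited JSON record."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate keep-alive blank lines between records
            yield json.loads(line)
```

Because each record is self-contained, you can start processing the first page before the crawl finishes.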

Without Spider

  • Build and maintain crawler infrastructure
  • Handle dedup, rate limits, and politeness
  • Parse HTML and extract content manually
  • Manage browsers, proxies, JS rendering

With Spider

  • One POST request to crawl an entire site
  • Auto dedup, robots.txt, smart rate control
  • Clean markdown or text for AI pipelines
  • Built-in JS rendering, proxy rotation, anti-bot

Key Capabilities

Crawl Control

Depth & Page Limits

Control how deep the crawler goes with depth and cap total pages with limit. Set both to zero for unlimited.
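For example, a request body capping a crawl at three hops and 100 pages, using the parameter names from the table further down:

```python
# Cap the crawl: stop expanding links after 3 hops, collect at most 100 pages.
params = {
    "url": "https://example.com",
    "depth": 3,
    "limit": 100,
}

# Setting both to zero lifts the caps entirely.
unlimited = {"url": "https://example.com", "depth": 0, "limit": 0}
```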

Smart Request Modes

Choose HTTP-only for speed, Chrome for JS-heavy sites, or Smart mode that picks automatically.
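The request parameter selects the mode; the three values below match the parameter table in this page:

```python
# Fast path: plain HTTP fetch, no browser involved.
http_only = {"url": "https://example.com", "request": "http"}

# JS-heavy sites: render each page in headless Chrome.
chrome = {"url": "https://app.example.com", "request": "chrome"}

# Default: Spider decides per page whether rendering is needed.
smart = {"url": "https://example.com", "request": "smart"}
```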

Subdomain & TLD Expansion

Extend crawling beyond the seed domain. Include subdomains like docs.example.com or follow links to related TLDs.
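A sketch of a request that widens the crawl scope; the parameter names "subdomains" and "tld" are assumptions here, so verify them against the API reference:

```python
# Parameter names below are illustrative assumptions, not confirmed API names.
params = {
    "url": "https://example.com",
    "subdomains": True,  # also follow docs.example.com, blog.example.com, ...
    "tld": True,         # also follow example.org, example.io, ...
}
```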

Output & Extraction

Multiple Output Formats

Markdown, raw HTML, plain text, or bytes. Markdown strips nav, ads, and boilerplate for LLM-ready content.

Content Chunking

Segment output by words, lines, characters, or sentences. Fit content into embedding model context windows.
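Word-based segmentation is easy to sketch locally; this mirrors what the server-side chunking option produces, though the exact boundary rules Spider applies are an assumption:

```python
def chunk_by_words(text, size):
    """Split text into chunks of at most `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
```

Pick `size` so each chunk stays comfortably inside your embedding model's context window.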

CSS & XPath Selectors

Target specific elements on every page with css_extraction_map. Extract only the data you need.
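A sketch of a css_extraction_map payload; the parameter name comes from this page, but the shape of its value (path keys mapping to named selector lists) is an assumption to check against the API reference:

```python
# The value shape under "css_extraction_map" is illustrative, not confirmed.
params = {
    "url": "https://example.com",
    "return_format": "markdown",
    "css_extraction_map": {
        "/products": [
            {"name": "price", "selectors": [".price"]},
            {"name": "title", "selectors": ["h1.product-title"]},
        ]
    },
}
```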

Data & Controls

Metadata & Headers

Collect page titles, descriptions, keywords, HTTP headers, and cookies alongside content.

External Domain Linking

Treat additional domains as part of the same crawl with external_domains. Supports exact matches and regex.
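For instance, a payload that pulls two extra domains into scope, one exact host and one regex; the list-of-strings shape is an assumption:

```python
params = {
    "url": "https://example.com",
    # Treat these as in-scope: an exact host plus a regex for mirror domains.
    "external_domains": ["docs.partner.com", r"^cdn\.example\.(net|org)$"],
}
```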

Budget Controls

Set credit budgets per crawl or per page to cap spending. The crawler stops when the limit is reached.
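A sketch of a budgeted crawl; the parameter name "budget" and its path-pattern shape are assumptions here, so confirm the exact credit controls in the API reference:

```python
# "budget" and its shape are illustrative assumptions, not confirmed API names.
params = {
    "url": "https://example.com",
    "budget": {"*": 200, "/blog": 50},  # cap pages overall and per path prefix
}
```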

Code Examples

from spider import Spider

# The client reads SPIDER_API_KEY from the environment by default.
client = Spider()

# Crawl up to 500 pages, return markdown
pages = client.crawl(
    "https://example.com",
    params={
        "return_format": "markdown",
        "limit": 500,
        "depth": 10,
        "metadata": True,
    }
)

for page in pages:
    print(page["url"], len(page["content"]))
curl -X POST https://api.spider.cloud/crawl \
  -H "Authorization: Bearer $SPIDER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "limit": 100,
    "return_format": "markdown",
    "metadata": true,
    "return_page_links": true
  }'
import Spider from "@spider-cloud/spider-client";

const client = new Spider();

const pages = await client.crawl("https://example.com", {
  return_format: "markdown",
  limit: 500,
  depth: 10,
  metadata: true,
});

pages.forEach(page => console.log(page.url));

Common Parameters

Parameter Type Description
url string The starting URL to crawl. Comma-separate for multiple seed URLs.
limit integer Maximum number of pages to collect. Defaults to 0 (unlimited).
depth integer How many link-hops from the seed URL. Default 25.
return_format string Output format: markdown, html, text, or bytes.
request string Rendering mode: http, chrome, or smart (default).
metadata boolean Include page title, description, and keywords in the response.

See the full API reference for all available parameters including proxy configuration, caching, and network filtering.

Popular Use Cases

ML
AI Training Datasets — Crawl documentation sites, blogs, and knowledge bases to build high-quality training corpora. Markdown output feeds directly into LLM fine-tuning pipelines.
RAG
RAG Knowledge Bases — Keep retrieval-augmented generation systems current by periodically crawling source websites. Use chunking to produce embedding-ready segments.
CMS
Content Migration — Migrate an entire website to a new CMS by crawling all pages and extracting clean content with metadata intact.
BIZ
Competitive Analysis — Index competitor websites to understand their content strategy, product catalog, or pricing structure across hundreds of pages.


Ready to crawl the web?

Start collecting web content at scale in minutes. No infrastructure to manage.