Recursive Web Crawling API
Point Spider at any URL. It recursively discovers every page on the domain, streams results as they're found, and returns clean content in your preferred format — all from a single API call.
Recursive Expansion
How It Works
Submit a seed URL
Send one or more starting URLs. Spider loads each page and identifies every link on it.
Recursive discovery
Links within the domain are followed until your depth or page limits are reached. Duplicates are automatically skipped.
Stream structured output
Each discovered page is returned in your chosen format — markdown, HTML, text, or bytes — with optional metadata, links, and headers.
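The three steps above reduce to a single request payload. A minimal sketch of what that body looks like, using the parameter names shown in the examples later on this page (illustrative, not exhaustive):

```python
import json

# Seed URL plus crawl bounds and output format -- the whole
# "submit, discover, stream" flow is configured in one payload.
payload = {
    "url": "https://example.com",  # seed URL(s); comma-separate for multiple
    "limit": 100,                  # stop after 100 pages
    "depth": 5,                    # follow links at most 5 hops from the seed
    "return_format": "markdown",   # stream back LLM-ready markdown
}

# POST this body to the crawl endpoint (https://api.spider.cloud/crawl).
print(json.dumps(payload, indent=2))
```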
Without Spider
- ✕ Build and maintain crawler infrastructure
- ✕ Handle dedup, rate limits, and politeness
- ✕ Parse HTML and extract content manually
- ✕ Manage browsers, proxies, JS rendering
With Spider
- ✓ One POST request to crawl an entire site
- ✓ Auto dedup, robots.txt, smart rate control
- ✓ Clean markdown or text for AI pipelines
- ✓ Built-in JS rendering, proxy rotation, anti-bot
Key Capabilities
Crawl Control
Depth & Page Limits
Control how deep the crawler goes with depth and cap total pages with limit. Set both to zero for unlimited.
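How the two bounds interact can be sketched locally. This hypothetical helper (not part of the API) shows the rule described above: a page is collected only while both caps are unspent, and zero disables a cap.

```python
# Hypothetical helper illustrating depth/limit semantics:
# 0 means "no cap" for either control.
def within_budget(pages_seen: int, hop: int, limit: int, depth: int) -> bool:
    if limit and pages_seen >= limit:
        return False  # page cap exhausted
    if depth and hop > depth:
        return False  # link is too many hops from the seed
    return True

print(within_budget(pages_seen=499, hop=3, limit=500, depth=10))   # True
print(within_budget(pages_seen=500, hop=3, limit=500, depth=10))   # False
print(within_budget(pages_seen=9999, hop=42, limit=0, depth=0))    # True (unlimited)
```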
Smart Request Modes
Choose HTTP-only for speed, Chrome for JS-heavy sites, or Smart mode that picks automatically.
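One payload per rendering mode, using the request values listed in the parameter table on this page:

```python
# Three equivalent crawls, differing only in rendering strategy.
fast = {"url": "https://example.com", "request": "http"}        # no JS, fastest
heavy = {"url": "https://app.example.com", "request": "chrome"} # full browser
auto = {"url": "https://example.com", "request": "smart"}       # decided per page
```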
Subdomain & TLD Expansion
Extend crawling beyond the seed domain. Include subdomains like docs.example.com or follow links to related TLDs.
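A sketch of what enabling this expansion might look like in a request body. The subdomains and tld flag names are assumptions based on the feature description; they do not appear in the parameter table below, so check the full API reference for the exact spelling.

```python
# Assumed flag names -- illustrative only.
payload = {
    "url": "https://example.com",
    "subdomains": True,  # also crawl docs.example.com, blog.example.com, ...
    "tld": True,         # follow links to sibling TLDs like example.org
}
print(payload)
```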
Output & Extraction
Multiple Output Formats
Markdown, raw HTML, plain text, or bytes. Markdown strips nav, ads, and boilerplate for LLM-ready content.
Content Chunking
Segment output by words, lines, characters, or sentences. Fit content into embedding model context windows.
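What word-based segmentation produces can be sketched locally. This is a stand-in for the service's behavior, not its actual implementation:

```python
# Local sketch of word-based chunking: split text into fixed-size
# word windows, as you might before feeding an embedding model.
def chunk_by_words(text: str, size: int) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

doc = "Spider returns clean content ready for embedding pipelines"
print(chunk_by_words(doc, 4))
```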
CSS & XPath Selectors
Target specific elements on every page with css_extraction_map. Extract only the data you need.
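A sketch of a request using css_extraction_map, which is named above. The exact mapping shape (paths to named selector lists) is an assumption for illustration; consult the API reference for the real schema.

```python
# Assumed shape: map a path to named CSS selectors -- illustrative only.
payload = {
    "url": "https://shop.example.com",
    "css_extraction_map": {
        "/products": [
            {"name": "title", "selectors": ["h1.product-name"]},
            {"name": "price", "selectors": ["span.price"]},
        ],
    },
}
print(payload["css_extraction_map"]["/products"][0]["name"])
```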
Data & Controls
Metadata & Headers
Collect page titles, descriptions, keywords, HTTP headers, and cookies alongside content.
External Domain Linking
Treat additional domains as part of the same crawl with external_domains. Supports exact matches and regex.
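A sketch of a payload using external_domains, which is named above. Mixing an exact hostname with a pattern follows the "exact matches and regex" description; the pattern syntax shown is an assumption.

```python
# external_domains: these hosts are crawled as if they were in-scope.
payload = {
    "url": "https://example.com",
    "external_domains": [
        "docs.partner-site.com",  # exact match
        "cdn\\.example\\..*",     # regex pattern (assumed syntax)
    ],
}
print(len(payload["external_domains"]))
```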
Budget Controls
Set credit budgets per crawl or per page to cap spending. The crawler stops when the limit is reached.
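The stopping rule can be sketched locally. This hypothetical helper (not the service's accounting code) shows a per-crawl cap: stop before any page whose cost would push total spend over the budget.

```python
# Local sketch of a per-crawl credit cap -- illustrative only.
def crawl_until_budget(page_costs: list[int], budget: int) -> int:
    spent = pages = 0
    for cost in page_costs:
        if spent + cost > budget:
            break  # next page would exceed the cap; stop here
        spent += cost
        pages += 1
    return pages

print(crawl_until_budget([2, 2, 2, 2], budget=5))  # 2 pages fit under a 5-credit cap
```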
Code Examples
Python

```python
from spider import Spider

client = Spider()

# Crawl up to 500 pages, return markdown
pages = client.crawl(
    "https://example.com",
    params={
        "return_format": "markdown",
        "limit": 500,
        "depth": 10,
        "metadata": True,
    },
)

for page in pages:
    print(page["url"], len(page["content"]))
```

cURL

```shell
curl -X POST https://api.spider.cloud/crawl \
  -H "Authorization: Bearer $SPIDER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "limit": 100,
    "return_format": "markdown",
    "metadata": true,
    "return_page_links": true
  }'
```

JavaScript

```javascript
import Spider from "@spider-cloud/spider-client";

const client = new Spider();

const pages = await client.crawl("https://example.com", {
  return_format: "markdown",
  limit: 500,
  depth: 10,
  metadata: true,
});

pages.forEach((page) => console.log(page.url));
```

Common Parameters
| Parameter | Type | Description |
|---|---|---|
| url | string | The starting URL to crawl. Comma-separate for multiple seed URLs. |
| limit | integer | Maximum number of pages to collect. Defaults to 0 (unlimited). |
| depth | integer | Maximum number of link hops from the seed URL. Default 25. |
| return_format | string | Output format: markdown, html, text, or bytes. |
| request | string | Rendering mode: http, chrome, or smart (default). |
| metadata | boolean | Include page title, description, and keywords in the response. |
See the full API reference for all available parameters including proxy configuration, caching, and network filtering.
> spider.crawl("https://...")
Ready to crawl the web?
Start collecting web content at scale in minutes. No infrastructure to manage.