Documentation Index
Fetch the complete documentation index at: https://docs.scrapegraphai.com/llms.txt
Use this file to discover all available pages before exploring further.
These docs cover scrapegraph-py ≥ 2.1.0 and require Python ≥ 3.12. Earlier 1.x releases expose the deprecated v1 API and point to a different backend — none of the snippets on this page work there. The 2.0.x series used typed request wrappers (`ScrapeRequest`, `ExtractRequest`, …); 2.1.0 removed those wrappers in favour of direct positional/keyword arguments, so upgrade if you are pinned to 2.0.x.
Installation
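A hedged install command; the package name scrapegraph-py is taken from the version note above, and the version pin style is an assumption:

```bash
pip install "scrapegraph-py>=2.1.0"
```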
What’s New in v2
- Complete rewrite built on Pydantic v2 + httpx.
- Client rename: `Client` → `ScrapeGraphAI`, `AsyncClient` → `AsyncScrapeGraphAI`.
- Direct arguments (v2.1.0): every method accepts positional/keyword args — no more `ScrapeRequest`/`ExtractRequest`/… wrappers (see the migration sketch after this list).
- `ApiResult[T]` wrapper: no exceptions on API errors — every call returns `status: "success" | "error"`, `data`, `error`, and `elapsed_ms`.
- Nested resources: `sgai.crawl.*`, `sgai.monitor.*`, `sgai.history.*`.
- camelCase on the wire, snake_case in Python: automatic via Pydantic's `alias_generator`.
- Removed: `markdownify()`, `agenticscraper()`, `sitemap()`, `feedback()` — use `scrape()` with the appropriate format entry instead.
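A hedged migration sketch; the module name `scrapegraph_py` is inferred from the package name, and the exact 2.0.x wrapper signature is an assumption:

```python
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

# 2.0.x (removed): typed request wrappers
# sgai.scrape(ScrapeRequest(url="https://example.com"))

# 2.1.0+: direct positional/keyword arguments
result = sgai.scrape("https://example.com")
```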
Quick Start
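A minimal sketch, assuming the client reads SGAI_API_KEY from the environment (per the table below) and that `result.data` carries the markdown payload:

```python
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()  # picks up SGAI_API_KEY from the environment

# formats defaults to [MarkdownFormatConfig()], so this returns markdown
result = sgai.scrape("https://example.com")

if result.status == "success":
    print(result.data)
else:
    print(result.error)
```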
ApiResult
Every method returns `ApiResult[T]` — no try/except needed for API errors:
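A sketch of branching on the wrapper's documented fields (`status`, `data`, `error`, `elapsed_ms`); the extract call itself is illustrative:

```python
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()
result = sgai.extract("Summarise this page", url="https://example.com")

if result.status == "error":
    print(result.error)            # structured error, no exception raised
else:
    print(result.data)             # the T in ApiResult[T]
    print(f"took {result.elapsed_ms} ms")
```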
Environment Variables
| Variable | Description | Default |
|---|---|---|
| SGAI_API_KEY | Your ScrapeGraphAI API key | — |
| SGAI_API_URL | Override API base URL | https://v2-api.scrapegraphai.com/api |
| SGAI_TIMEOUT | Request timeout in seconds | 120 |
| SGAI_DEBUG | Enable debug logging (set to "1") | off |
Services
Scrape
Fetch a page in one or more formats (markdown, html, screenshot, json, links, images, summary, branding).
scrape() parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | str | Yes | URL to scrape (positional) |
| formats | list[FormatConfig] | No | Defaults to [MarkdownFormatConfig()] |
| content_type | str | No | Override detected content type (e.g. "application/pdf", "text/html") |
| fetch_config | FetchConfig | No | Fetch configuration (mode, stealth, timeout, cookies, country, …) |
Format entries
| Class | Fields |
|---|---|
| MarkdownFormatConfig | mode: "normal" \| "reader" \| "prune" |
| HtmlFormatConfig | mode: same as above |
| ScreenshotFormatConfig | full_page, width (320–3840), height (200–2160), quality |
| JsonFormatConfig | prompt (1–10k chars), schema (JSON Schema dict — pass a Pydantic model’s model_json_schema() to reuse a BaseModel), mode |
| LinksFormatConfig | — |
| ImagesFormatConfig | — |
| SummaryFormatConfig | — |
| BrandingFormatConfig | — |
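A sketch of requesting multiple formats in one call, assuming the format-config classes are importable from the package root:

```python
from scrapegraph_py import (
    ScrapeGraphAI,
    MarkdownFormatConfig,
    ScreenshotFormatConfig,
)

sgai = ScrapeGraphAI()
result = sgai.scrape(
    "https://example.com",
    formats=[
        MarkdownFormatConfig(mode="reader"),
        ScreenshotFormatConfig(full_page=True, width=1280, height=800),
    ],
)
```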
Duplicate `type` entries in `formats` are rejected by a Pydantic validator.
Extract
Run structured extraction against a URL, HTML, or markdown using AI.
Using a Pydantic model as the schema
`schema=` is a JSON Schema dict. Any Pydantic BaseModel produces one via `model_json_schema()`, so you can define the desired shape once and reuse it to validate the response client-side. The same pattern works for `JsonFormatConfig(schema=...)` in `scrape()` and for `search(schema=...)`.
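A sketch of that pattern; the URL is hypothetical, and validating `result.data` directly assumes the extracted object is returned as-is:

```python
from pydantic import BaseModel
from scrapegraph_py import ScrapeGraphAI

class Product(BaseModel):
    name: str
    price: float

sgai = ScrapeGraphAI()
result = sgai.extract(
    "Extract the product name and price",
    url="https://example.com/product",         # hypothetical URL
    schema=Product.model_json_schema(),        # JSON Schema dict from the model
)

if result.status == "success":
    product = Product.model_validate(result.data)  # client-side validation
```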
extract() parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| prompt | str | Yes | 1–10,000 chars (positional) |
| url | str | Yes* | Page URL |
| html | str | Yes* | Raw HTML (alternative to url) |
| markdown | str | Yes* | Raw markdown (alternative to url) |
| schema | dict | No | JSON Schema for the structured output. Pass a Pydantic model’s model_json_schema() to reuse a BaseModel. |
| mode | str | No | "normal" (default), "reader", "prune" |
| content_type | str | No | Override detected content type |
| fetch_config | FetchConfig | No | Fetch configuration |
*At least one of `url`, `html`, or `markdown` is required.
Search
Run a web search and optionally extract structured data from the results.
search() parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| query | str | Yes | 1–500 chars (positional) |
| num_results | int | No | 1–20, default 3 |
| format | str | No | "markdown" (default) or "html" |
| mode | str | No | HTML processing: "prune" (default), "normal", "reader" |
| prompt | str | No | Required when schema is set |
| schema | dict | No | JSON Schema for structured output. Pass a Pydantic model’s model_json_schema() to reuse a BaseModel. |
| location_geo_code | str | No | Two-letter country code (e.g. "us", "it") |
| time_range | str | No | "past_hour", "past_24_hours", "past_week", "past_month", "past_year" |
| fetch_config | FetchConfig | No | Fetch configuration |
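A sketch combining the documented parameters; the query and values are illustrative:

```python
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()
result = sgai.search(
    "latest stable Python release",   # illustrative query
    num_results=5,
    time_range="past_week",
    location_geo_code="us",
)

if result.status == "success":
    print(result.data)
```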
Crawl
Crawl a site and its linked pages asynchronously. Access via the `sgai.crawl` resource.
crawl.start() parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | str | Yes | Starting URL (positional) |
| formats | list[FormatConfig] | No | Defaults to [MarkdownFormatConfig()] |
| max_depth | int | No | ≥ 0, default 2 |
| max_pages | int | No | 1–1000, default 50 |
| max_links_per_page | int | No | ≥ 1, default 10 |
| allow_external | bool | No | Default False |
| include_patterns | list[str] | No | URL glob patterns to include |
| exclude_patterns | list[str] | No | URL glob patterns to exclude |
| content_types | list[str] | No | Allowed response content types |
| fetch_config | FetchConfig | No | Fetch configuration |
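A sketch of starting a crawl with the documented parameters; how the returned job is polled is an assumption and not shown:

```python
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()
job = sgai.crawl.start(
    "https://example.com",
    max_depth=1,
    max_pages=10,
    exclude_patterns=["*/login/*"],
)
# Assumption: job is an ApiResult whose data carries an id that is later
# polled via another sgai.crawl.* method not documented on this page.
print(job.status)
```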
Monitor
Scheduled extraction jobs. Access via the `sgai.monitor` resource.
monitor.activity() — poll tick history
Paginate through the per-run ticks a monitor has produced (what changed on each scheduled run).
monitor.activity() accepts limit (1–100, default 20) and optional cursor for pagination. Each MonitorTickEntry exposes id, created_at, status, changed, elapsed_ms, and a diffs model with per-format deltas.
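A pagination sketch: `limit` and `cursor` are documented above, but the container field names (`entries`, `next_cursor`) and whether a monitor id must also be passed are assumptions:

```python
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()
cursor = None
while True:
    page = sgai.monitor.activity(limit=50, cursor=cursor)
    if page.status == "error":
        break
    for tick in page.data.entries:      # container field name is an assumption
        print(tick.id, tick.status, tick.changed, tick.elapsed_ms)
    cursor = page.data.next_cursor      # cursor field name is an assumption
    if cursor is None:
        break
```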
monitor.create() parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | str | Yes | URL to monitor (positional) |
| interval | str | Yes | Cron expression, 1–100 chars (positional) |
| name | str | No | ≤ 200 chars |
| formats | list[FormatConfig] | No | Defaults to [MarkdownFormatConfig()] |
| webhook_url | str | No | Webhook invoked on change detection |
| fetch_config | FetchConfig | No | Fetch configuration |
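A sketch using the documented positional `url` and `interval`; the watched page and webhook endpoint are hypothetical:

```python
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()
result = sgai.monitor.create(
    "https://example.com/pricing",                 # hypothetical page to watch
    "0 9 * * *",                                   # cron: daily at 09:00
    name="pricing-watch",
    webhook_url="https://hooks.example.com/sgai",  # hypothetical endpoint
)
```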
History
Fetch recent request history. Access via the `sgai.history` resource.
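A sketch, assuming a `list()` method with a `limit` parameter; neither name is confirmed by this page:

```python
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()
history = sgai.history.list(limit=10)  # assumed method name and parameter
if history.status == "success":
    for entry in history.data:
        print(entry)
```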
Credits / Health
Configuration Objects
FetchConfig
Controls how pages are fetched. See the proxy configuration guide for details on modes and geotargeting.
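A sketch, assuming `FetchConfig` is importable from the package root and using the field names summarised in the scrape() table (mode, stealth, timeout, cookies, country); field types and values are assumptions:

```python
from scrapegraph_py import FetchConfig, ScrapeGraphAI  # import path assumed

sgai = ScrapeGraphAI()
fetch = FetchConfig(stealth=True, country="it", timeout=60)
result = sgai.scrape("https://example.com", fetch_config=fetch)
```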
Async Support
Every sync method has an async equivalent on `AsyncScrapeGraphAI`:
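A sketch, assuming the async client supports `async with` (plausible for an httpx-based client, but not confirmed here):

```python
import asyncio
from scrapegraph_py import AsyncScrapeGraphAI

async def main() -> None:
    async with AsyncScrapeGraphAI() as sgai:
        result = await sgai.scrape("https://example.com")
        print(result.status)

asyncio.run(main())
```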
Support
- GitHub: report issues and contribute to the SDK
- Email Support: get help from our development team
