
Documentation Index

Fetch the complete documentation index at: https://docs.scrapegraphai.com/llms.txt

Use this file to discover all available pages before exploring further.

These docs cover scrapegraph-py ≥ 2.1.0 and require Python ≥ 3.12. Earlier 1.x releases expose the deprecated v1 API and point to a different backend — none of the snippets on this page work there. The 2.0.x series used typed request wrappers (ScrapeRequest, ExtractRequest, …); 2.1.0 removed those wrappers in favour of direct positional/keyword arguments, so upgrade if you are pinned to 2.0.x.
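
For instance, a call that was wrapped in a typed request object on 2.0.x becomes a direct call on 2.1.0 (sketch only; ScrapeRequest and the old Client name no longer exist):

# 2.0.x (removed): typed request wrappers
# result = client.scrape(ScrapeRequest(url="https://example.com"))

# 2.1.0+: direct positional/keyword arguments
result = sgai.scrape("https://example.com")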

Installation

pip install "scrapegraph-py>=2.1.0"
# or
uv add "scrapegraph-py>=2.1.0"

What’s New in v2

  • Complete rewrite built on Pydantic v2 + httpx.
  • Client rename: Client → ScrapeGraphAI, AsyncClient → AsyncScrapeGraphAI.
  • Direct arguments (v2.1.0): every method accepts positional/keyword args — no more ScrapeRequest/ExtractRequest/… wrappers.
  • ApiResult[T] wrapper: no exceptions on API errors — every call returns status: "success" | "error", data, error, and elapsed_ms.
  • Nested resources: sgai.crawl.*, sgai.monitor.*, sgai.history.*.
  • camelCase on the wire, snake_case in Python: automatic via Pydantic’s alias_generator (see the sketch below).
  • Removed: markdownify(), agenticscraper(), sitemap(), feedback() — use scrape() with the appropriate format entry instead.
v2 is a breaking release. See the Migration Guide if you’re upgrading from v1.
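
The camelCase/snake_case conversion uses Pydantic v2 alias generators. A minimal illustration of the pattern (the model below is hypothetical, not an SDK class):

from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel

class CamelModel(BaseModel):
    model_config = ConfigDict(alias_generator=to_camel, populate_by_name=True)

    full_page: bool = False  # sent as "fullPage" on the wire

print(CamelModel(full_page=True).model_dump(by_alias=True))  # {'fullPage': True}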

Quick Start

from scrapegraph_py import ScrapeGraphAI

# reads SGAI_API_KEY from env, or pass it explicitly:
# sgai = ScrapeGraphAI(api_key="sgai-...")
sgai = ScrapeGraphAI()

result = sgai.scrape("https://example.com")

if result.status == "success":
    print(result.data.results["markdown"]["data"])
else:
    print(result.error)

ApiResult

Every method returns ApiResult[T] — no try/except needed for API errors:
from typing import Generic, Literal, TypeVar
from pydantic import BaseModel

T = TypeVar("T")

class ApiResult(BaseModel, Generic[T]):
    status: Literal["success", "error"]
    data: T | None
    error: str | None = None
    elapsed_ms: int
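
Because errors come back as values rather than exceptions, you can convert them at a call site when you prefer to fail fast. The unwrap helper below is a convenience sketch, not part of the SDK:

def unwrap(result):
    """Return result.data, raising if the API reported an error."""
    if result.status == "error":
        raise RuntimeError(result.error)
    return result.data

data = unwrap(sgai.scrape("https://example.com"))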

Environment Variables

Variable      Description                         Default
SGAI_API_KEY  Your ScrapeGraphAI API key
SGAI_API_URL  Override API base URL               https://v2-api.scrapegraphai.com/api
SGAI_TIMEOUT  Request timeout in seconds          120
SGAI_DEBUG    Enable debug logging (set to "1")   off
The client supports context managers for automatic session cleanup:
with ScrapeGraphAI() as sgai:
    result = sgai.scrape("https://example.com")

Services

Scrape

Fetch a page in one or more formats (markdown, html, screenshot, json, links, images, summary, branding).
from scrapegraph_py import (
    ScrapeGraphAI, FetchConfig,
    MarkdownFormatConfig, ScreenshotFormatConfig, JsonFormatConfig,
)

sgai = ScrapeGraphAI()

res = sgai.scrape(
    "https://example.com",
    formats=[
        MarkdownFormatConfig(mode="reader"),
        ScreenshotFormatConfig(full_page=True, width=1440, height=900),
        JsonFormatConfig(prompt="Extract product info"),
    ],
    content_type="text/html",  # optional, auto-detected
    fetch_config=FetchConfig(
        mode="js",
        stealth=True,
        timeout=30000,
        wait=2000,
        scrolls=3,
    ),
)

if res.status == "success":
    markdown = res.data.results["markdown"]["data"]

scrape() parameters

Parameter     Type                Required  Description
url           str                 Yes       URL to scrape (positional)
formats       list[FormatConfig]  No        Defaults to [MarkdownFormatConfig()]
content_type  str                 No        Override detected content type (e.g. "application/pdf", "text/html")
fetch_config  FetchConfig         No        Fetch configuration (mode, stealth, timeout, cookies, country, …)

Format entries

Class                   Fields
MarkdownFormatConfig    mode: "normal" | "reader" | "prune"
HtmlFormatConfig        mode: same as above
ScreenshotFormatConfig  full_page, width (320–3840), height (200–2160), quality
JsonFormatConfig        prompt (1–10k chars), schema (JSON Schema dict — pass a Pydantic model’s model_json_schema() to reuse a BaseModel), mode
LinksFormatConfig       (none listed)
ImagesFormatConfig      (none listed)
SummaryFormatConfig     (none listed)
BrandingFormatConfig    (none listed)
Duplicate type entries in formats are rejected by a Pydantic validator.
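
A quick illustration, assuming the client-side validator surfaces as a standard pydantic.ValidationError before any request is sent:

import pydantic

try:
    sgai.scrape(
        "https://example.com",
        formats=[MarkdownFormatConfig(), MarkdownFormatConfig()],  # same type twice
    )
except pydantic.ValidationError as exc:
    print(exc)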

Extract

Run structured extraction against a URL, HTML, or markdown using AI.
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

res = sgai.extract(
    "Extract product names and prices",
    url="https://example.com",
    schema={
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name":  {"type": "string"},
                        "price": {"type": "string"},
                    },
                },
            },
        },
    },
)

if res.status == "success":
    print(res.data.json_data)
    print(f"Tokens: {res.data.usage.prompt_tokens} / {res.data.usage.completion_tokens}")
Using a Pydantic model as the schema
schema= is a JSON Schema dict. Any Pydantic BaseModel produces one via model_json_schema(), so you can define the desired shape once and reuse it to validate the response client-side.
from pydantic import BaseModel, Field
from scrapegraph_py import ScrapeGraphAI

class Product(BaseModel):
    name: str
    price: str | None = None

class Products(BaseModel):
    products: list[Product] = Field(default_factory=list)

sgai = ScrapeGraphAI()

res = sgai.extract(
    "Extract product names and prices",
    url="https://example.com",
    schema=Products.model_json_schema(),
)

if res.status == "success":
    parsed = Products.model_validate(res.data.json_data)
    for p in parsed.products:
        print(p.name, p.price)
The same pattern works for JsonFormatConfig(schema=...) in scrape() and for search(schema=...).

extract() parameters

Parameter     Type         Required  Description
prompt        str          Yes       1–10,000 chars (positional)
url           str          Yes*      Page URL
html          str          Yes*      Raw HTML (alternative to url)
markdown      str          Yes*      Raw markdown (alternative to url)
schema        dict         No        JSON Schema for the structured output. Pass a Pydantic model’s model_json_schema() to reuse a BaseModel.
mode          str          No        "normal" (default), "reader", "prune"
content_type  str          No        Override detected content type
fetch_config  FetchConfig  No        Fetch configuration
*At least one of url, html, or markdown is required.
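
Since url, html, and markdown are alternatives, you can also extract from content you already have in hand:

res = sgai.extract(
    "Extract the page title",
    html="<html><head><title>Example Store</title></head><body></body></html>",
)

if res.status == "success":
    print(res.data.json_data)
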
Search

Run a web search and optionally extract structured data from the results.
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

res = sgai.search(
    "best programming languages 2024",
    num_results=5,
    prompt="Summarize the top languages and reasons",
    time_range="past_week",
    location_geo_code="us",
)

if res.status == "success":
    for hit in res.data.results:
        print(hit.title, hit.url)
    print(res.data.json_data)  # when prompt/schema are set

search() parameters

Parameter          Type         Required  Description
query              str          Yes       1–500 chars (positional)
num_results        int          No        1–20, default 3
format             str          No        "markdown" (default) or "html"
mode               str          No        HTML processing: "prune" (default), "normal", "reader"
prompt             str          No        Required when schema is set
schema             dict         No        JSON Schema for structured output. Pass a Pydantic model’s model_json_schema() to reuse a BaseModel.
location_geo_code  str          No        Two-letter country code (e.g. "us", "it")
time_range         str          No        "past_hour", "past_24_hours", "past_week", "past_month", "past_year"
fetch_config       FetchConfig  No        Fetch configuration

Crawl

Crawl a site and its linked pages asynchronously. Access via the sgai.crawl resource.
from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig

sgai = ScrapeGraphAI()

# Start
start = sgai.crawl.start(
    "https://example.com",
    formats=[MarkdownFormatConfig()],
    max_depth=2,
    max_pages=50,
    max_links_per_page=10,
    include_patterns=["/blog/*"],
    exclude_patterns=["/admin/*"],
)

crawl_id = start.data.id

# Poll
status = sgai.crawl.get(crawl_id)
print(f"{status.data.finished}/{status.data.total} - {status.data.status}")

# Control
sgai.crawl.stop(crawl_id)
sgai.crawl.resume(crawl_id)
sgai.crawl.delete(crawl_id)
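
crawl.get() returns the current status immediately, so waiting for completion means polling. A minimal loop, assuming the job eventually reports a terminal status string such as "completed" or "failed" (the exact values are not listed here):

import time

while True:
    status = sgai.crawl.get(crawl_id)
    if status.status == "error":
        break
    if status.data.status in ("completed", "failed", "stopped"):
        break
    time.sleep(5)  # back off between polls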

crawl.start() parameters

Parameter           Type                Required  Description
url                 str                 Yes       Starting URL (positional)
formats             list[FormatConfig]  No        Defaults to [MarkdownFormatConfig()]
max_depth           int                 No        ≥ 0, default 2
max_pages           int                 No        1–1000, default 50
max_links_per_page  int                 No        ≥ 1, default 10
allow_external      bool                No        Default False
include_patterns    list[str]           No        URL glob patterns to include
exclude_patterns    list[str]           No        URL glob patterns to exclude
content_types       list[str]           No        Allowed response content types
fetch_config        FetchConfig         No        Fetch configuration

Monitor

Scheduled extraction jobs. Access via the sgai.monitor resource.
from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig

sgai = ScrapeGraphAI()

mon = sgai.monitor.create(
    "https://example.com",
    "0 * * * *",                 # cron expression (positional)
    name="Price Monitor",
    formats=[MarkdownFormatConfig()],
    webhook_url="https://example.com/webhook",
)

cron_id = mon.data.cron_id

sgai.monitor.list()
sgai.monitor.get(cron_id)
sgai.monitor.update(cron_id, interval="0 */6 * * *")
sgai.monitor.pause(cron_id)
sgai.monitor.resume(cron_id)
sgai.monitor.delete(cron_id)

monitor.activity() — poll tick history

Paginate through the per-run ticks a monitor has produced (what changed on each scheduled run).
act = sgai.monitor.activity(cron_id, limit=20)

if act.status == "success":
    for tick in act.data.ticks:
        change = "CHANGED" if tick.changed else "no change"
        print(f"[{tick.created_at}] {tick.status} - {change} ({tick.elapsed_ms}ms)")

    if act.data.next_cursor:
        more = sgai.monitor.activity(cron_id, limit=20, cursor=act.data.next_cursor)
monitor.activity() accepts limit (1–100, default 20) and optional cursor for pagination. Each MonitorTickEntry exposes id, created_at, status, changed, elapsed_ms, and a diffs model with per-format deltas.
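
To drain the full tick history, follow next_cursor until it runs out (assuming it is empty or None on the last page):

ticks = []
cursor = None
while True:
    page = sgai.monitor.activity(cron_id, limit=100, cursor=cursor)
    if page.status != "success":
        break
    ticks.extend(page.data.ticks)
    cursor = page.data.next_cursor
    if not cursor:
        break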

monitor.create() parameters

Parameter     Type                Required  Description
url           str                 Yes       URL to monitor (positional)
interval      str                 Yes       Cron expression, 1–100 chars (positional)
name          str                 No        ≤ 200 chars
formats       list[FormatConfig]  No        Defaults to [MarkdownFormatConfig()]
webhook_url   str                 No        Webhook invoked on change detection
fetch_config  FetchConfig         No        Fetch configuration

History

Fetch recent request history. Access via the sgai.history resource.
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

page = sgai.history.list(service="scrape", page=1, limit=20)
for entry in page.data.data:  # ApiResult.data wraps a paginated envelope; its .data holds the entries
    print(entry.id, entry.service, entry.status, entry.elapsed_ms)

one = sgai.history.get("request-id")

Credits / Health

credits = sgai.credits()
# ApiResult[CreditsResponse] with .remaining, .used, .plan, .jobs.crawl, .jobs.monitor

health = sgai.health()
# ApiResult[HealthResponse] with .status, .uptime, .services
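
Both follow the same ApiResult pattern as every other call:

if credits.status == "success":
    print(f"{credits.data.remaining} credits left on the {credits.data.plan} plan")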

Configuration Objects

FetchConfig

Controls how pages are fetched. See the proxy configuration guide for details on modes and geotargeting.
from scrapegraph_py import FetchConfig

config = FetchConfig(
    mode="js",            # "auto" (default), "fast", "js"
    stealth=True,         # Residential proxies / anti-bot headers (+5 credits)
    timeout=30000,        # 1,000–60,000 ms
    wait=2000,            # 0–30,000 ms
    scrolls=3,            # 0–100
    country="us",         # ISO 3166-1 alpha-2
    headers={"X-Custom": "header"},
    cookies={"session": "abc"},
    mock=False,           # Or a MockConfig object for testing
)
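
The resulting object is passed to any method that accepts fetch_config:

res = sgai.scrape("https://example.com", fetch_config=config)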

Async Support

Every sync method has an async equivalent on AsyncScrapeGraphAI:
import asyncio
from scrapegraph_py import AsyncScrapeGraphAI

async def main():
    async with AsyncScrapeGraphAI() as sgai:
        res = await sgai.scrape("https://example.com")
        if res.status == "success":
            print(res.data.results["markdown"]["data"])

        start = await sgai.crawl.start("https://example.com", max_pages=25)
        status = await sgai.crawl.get(start.data.id)
        print(status.data.status)

        credits = await sgai.credits()
        print(credits.data.remaining)

asyncio.run(main())
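
Because every call is a plain awaitable, concurrent requests compose with asyncio.gather. A sketch, assuming the client can be shared across tasks within one event loop:

async def scrape_many(urls: list[str]):
    async with AsyncScrapeGraphAI() as sgai:
        return await asyncio.gather(*(sgai.scrape(u) for u in urls))

results = asyncio.run(scrape_many(["https://example.com", "https://example.org"]))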

Support

GitHub

Report issues and contribute to the SDK

Email Support

Get help from our development team