
Documentation Index

Fetch the complete documentation index at: https://docs.scrapegraphai.com/llms.txt

Use this file to discover all available pages before exploring further.

These docs cover scrapegraph-py ≥ 2.1.0 and require Python ≥ 3.12. Earlier 1.x releases expose the deprecated v1 API and point to a different backend — none of the snippets on this page work there. The 2.0.x series used typed request wrappers (ScrapeRequest, ExtractRequest, …); 2.1.0 removed those wrappers in favour of direct positional/keyword arguments, so upgrade if you are pinned to 2.0.x.
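
For instance, a call that was wrapped in a typed request object on 2.0.x becomes a direct call on 2.1.0 (sketch only; ScrapeRequest and the old Client name no longer exist):

# 2.0.x (removed): typed request wrappers
# result = client.scrape(ScrapeRequest(url="https://example.com"))

# 2.1.0+: direct positional/keyword arguments
result = sgai.scrape("https://example.com")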

Installation

pip install "scrapegraph-py>=2.1.0"
# or
uv add "scrapegraph-py>=2.1.0"

What’s New in v2

  • Complete rewrite built on Pydantic v2 + httpx.
  • Client rename: Client → ScrapeGraphAI, AsyncClient → AsyncScrapeGraphAI.
  • Direct arguments (v2.1.0): every method accepts positional/keyword args — no more ScrapeRequest/ExtractRequest/… wrappers.
  • ApiResult[T] wrapper: no exceptions on API errors — every call returns status: "success" | "error", data, error, and elapsed_ms.
  • Nested resources: sgai.crawl.*, sgai.monitor.*, sgai.history.*.
  • camelCase on the wire, snake_case in Python: automatic via Pydantic’s alias_generator (see the sketch below).
  • Removed: markdownify(), agenticscraper(), sitemap(), feedback() — use scrape() with the appropriate format entry instead.
v2 is a breaking release. See the Migration Guide if you’re upgrading from v1.
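
The camelCase/snake_case conversion uses Pydantic v2 alias generators. A minimal illustration of the pattern (the model below is hypothetical, not an SDK class):

from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel

class CamelModel(BaseModel):
    model_config = ConfigDict(alias_generator=to_camel, populate_by_name=True)

    full_page: bool = False  # sent as "fullPage" on the wire

print(CamelModel(full_page=True).model_dump(by_alias=True))  # {'fullPage': True}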

Quick Start

from scrapegraph_py import ScrapeGraphAI

# reads SGAI_API_KEY from env, or pass it explicitly:
# sgai = ScrapeGraphAI(api_key="sgai-...")
sgai = ScrapeGraphAI()

result = sgai.scrape("https://example.com")

if result.status == "success":
    print(result.data.results["markdown"]["data"])
else:
    print(result.error)

ApiResult

Every method returns ApiResult[T] — no try/except needed for API errors:
from typing import Generic, Literal, TypeVar
from pydantic import BaseModel

T = TypeVar("T")

class ApiResult(BaseModel, Generic[T]):
    status: Literal["success", "error"]
    data: T | None
    error: str | None = None
    elapsed_ms: int
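
Because errors come back as values rather than exceptions, you can convert them at a call site when you prefer to fail fast. The unwrap helper below is a convenience sketch, not part of the SDK:

def unwrap(result):
    """Return result.data, raising if the API reported an error."""
    if result.status == "error":
        raise RuntimeError(result.error)
    return result.data

data = unwrap(sgai.scrape("https://example.com"))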

Environment Variables

Variable      Description                         Default
SGAI_API_KEY  Your ScrapeGraphAI API key
SGAI_API_URL  Override API base URL               https://v2-api.scrapegraphai.com/api
SGAI_TIMEOUT  Request timeout in seconds          120
SGAI_DEBUG    Enable debug logging (set to "1")   off
The client supports context managers for automatic session cleanup:
with ScrapeGraphAI() as sgai:
    result = sgai.scrape("https://example.com")

Services

Scrape

Fetch a page in one or more formats (markdown, html, screenshot, json, links, images, summary, branding).
from scrapegraph_py import (
    ScrapeGraphAI, FetchConfig,
    MarkdownFormatConfig, ScreenshotFormatConfig, JsonFormatConfig,
)

sgai = ScrapeGraphAI()

res = sgai.scrape(
    "https://example.com",
    formats=[
        MarkdownFormatConfig(mode="reader"),
        ScreenshotFormatConfig(full_page=True, width=1440, height=900),
        JsonFormatConfig(prompt="Extract product info"),
    ],
    content_type="text/html",  # optional, auto-detected
    fetch_config=FetchConfig(
        mode="js",
        stealth=True,
        timeout=30000,
        wait=2000,
        scrolls=3,
    ),
)

if res.status == "success":
    markdown = res.data.results["markdown"]["data"]

scrape() parameters

Parameter     Type                Required  Description
url           str                 Yes       URL to scrape (positional)
formats       list[FormatConfig]  No        Defaults to [MarkdownFormatConfig()]
content_type  str                 No        Override detected content type (e.g. "application/pdf", "text/html")
fetch_config  FetchConfig         No        Fetch configuration (mode, stealth, timeout, cookies, country, …)

Format entries

Class                   Fields
MarkdownFormatConfig    mode: "normal" | "reader" | "prune"
HtmlFormatConfig        mode: same as above
ScreenshotFormatConfig  full_page, width (320–3840), height (200–2160), quality
JsonFormatConfig        prompt (1–10k chars), schema (JSON Schema dict — pass a Pydantic model’s model_json_schema() to reuse a BaseModel), mode
LinksFormatConfig       (none listed)
ImagesFormatConfig      (none listed)
SummaryFormatConfig     (none listed)
BrandingFormatConfig    (none listed)
Duplicate type entries in formats are rejected by a Pydantic validator.
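
A quick illustration, assuming the client-side validator surfaces as a standard pydantic.ValidationError before any request is sent:

import pydantic

try:
    sgai.scrape(
        "https://example.com",
        formats=[MarkdownFormatConfig(), MarkdownFormatConfig()],  # same type twice
    )
except pydantic.ValidationError as exc:
    print(exc)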

Extract

Run structured extraction against a URL, HTML, or markdown using AI.
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

res = sgai.extract(
    "Extract product names and prices",
    url="https://example.com",
    schema={
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name":  {"type": "string"},
                        "price": {"type": "string"},
                    },
                },
            },
        },
    },
)

if res.status == "success":
    print(res.data.json_data)
    print(f"Tokens: {res.data.usage.prompt_tokens} / {res.data.usage.completion_tokens}")
Using a Pydantic model as the schema
schema= is a JSON Schema dict. Any Pydantic BaseModel produces one via model_json_schema(), so you can define the desired shape once and reuse it to validate the response client-side.
from pydantic import BaseModel, Field
from scrapegraph_py import ScrapeGraphAI

class Product(BaseModel):
    name: str
    price: str | None = None

class Products(BaseModel):
    products: list[Product] = Field(default_factory=list)

sgai = ScrapeGraphAI()

res = sgai.extract(
    "Extract product names and prices",
    url="https://example.com",
    schema=Products.model_json_schema(),
)

if res.status == "success":
    parsed = Products.model_validate(res.data.json_data)
    for p in parsed.products:
        print(p.name, p.price)
The same pattern works for JsonFormatConfig(schema=...) in scrape() and for search(schema=...).

extract() parameters

Parameter     Type         Required  Description
prompt        str          Yes       1–10,000 chars (positional)
url           str          Yes*      Page URL
html          str          Yes*      Raw HTML (alternative to url)
markdown      str          Yes*      Raw markdown (alternative to url)
schema        dict         No        JSON Schema for the structured output. Pass a Pydantic model’s model_json_schema() to reuse a BaseModel.
mode          str          No        "normal" (default), "reader", "prune"
content_type  str          No        Override detected content type
fetch_config  FetchConfig  No        Fetch configuration
*At least one of url, html, or markdown is required.
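
Since url, html, and markdown are alternatives, you can also extract from content you already have in hand:

res = sgai.extract(
    "Extract the page title",
    html="<html><head><title>Example Store</title></head><body></body></html>",
)

if res.status == "success":
    print(res.data.json_data)
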
Search

Run a web search and optionally extract structured data from the results.
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

res = sgai.search(
    "best programming languages 2024",
    num_results=5,
    prompt="Summarize the top languages and reasons",
    time_range="past_week",
    location_geo_code="us",
)

if res.status == "success":
    for hit in res.data.results:
        print(hit.title, hit.url)
    print(res.data.json_data)  # when prompt/schema are set

search() parameters

Parameter          Type         Required  Description
query              str          Yes       1–500 chars (positional)
num_results        int          No        1–20, default 3
format             str          No        "markdown" (default) or "html"
mode               str          No        HTML processing: "prune" (default), "normal", "reader"
prompt             str          No        Required when schema is set
schema             dict         No        JSON Schema for structured output. Pass a Pydantic model’s model_json_schema() to reuse a BaseModel.
location_geo_code  str          No        Two-letter country code (e.g. "us", "it")
time_range         str          No        "past_hour", "past_24_hours", "past_week", "past_month", "past_year"
fetch_config       FetchConfig  No        Fetch configuration

Crawl

Crawl a site and its linked pages asynchronously. Access via the sgai.crawl resource.
from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig

sgai = ScrapeGraphAI()

# Start
start = sgai.crawl.start(
    "https://example.com",
    formats=[MarkdownFormatConfig()],
    max_depth=2,
    max_pages=50,
    max_links_per_page=10,
    include_patterns=["/blog/*"],
    exclude_patterns=["/admin/*"],
)

crawl_id = start.data.id

# Poll
status = sgai.crawl.get(crawl_id)
print(f"{status.data.finished}/{status.data.total} - {status.data.status}")

# Control
sgai.crawl.stop(crawl_id)
sgai.crawl.resume(crawl_id)
sgai.crawl.delete(crawl_id)
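
crawl.get() returns the current status immediately, so waiting for completion means polling. A minimal loop, assuming the job eventually reports a terminal status string such as "completed" or "failed" (the exact values are not listed here):

import time

while True:
    status = sgai.crawl.get(crawl_id)
    if status.status == "error":
        break
    if status.data.status in ("completed", "failed", "stopped"):
        break
    time.sleep(5)  # back off between polls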

crawl.start() parameters

Parameter           Type                Required  Description
url                 str                 Yes       Starting URL (positional)
formats             list[FormatConfig]  No        Defaults to [MarkdownFormatConfig()]
max_depth           int                 No        ≥ 0, default 2
max_pages           int                 No        1–1000, default 50
max_links_per_page  int                 No        ≥ 1, default 10
allow_external      bool                No        Default False
include_patterns    list[str]           No        URL glob patterns to include
exclude_patterns    list[str]           No        URL glob patterns to exclude
content_types       list[str]           No        Allowed response content types
fetch_config        FetchConfig         No        Fetch configuration

Monitor

Scheduled extraction jobs. Access via the sgai.monitor resource.
from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig

sgai = ScrapeGraphAI()

mon = sgai.monitor.create(
    "https://example.com",
    "0 * * * *",                 # cron expression (positional)
    name="Price Monitor",
    formats=[MarkdownFormatConfig()],
    webhook_url="https://example.com/webhook",
)

cron_id = mon.data.cron_id

sgai.monitor.list()
sgai.monitor.get(cron_id)
sgai.monitor.update(cron_id, interval="0 */6 * * *")
sgai.monitor.pause(cron_id)
sgai.monitor.resume(cron_id)
sgai.monitor.delete(cron_id)

monitor.activity() — poll tick history

Paginate through the per-run ticks a monitor has produced (what changed on each scheduled run).
act = sgai.monitor.activity(cron_id, limit=20)

if act.status == "success":
    for tick in act.data.ticks:
        change = "CHANGED" if tick.changed else "no change"
        print(f"[{tick.created_at}] {tick.status} - {change} ({tick.elapsed_ms}ms)")

    if act.data.next_cursor:
        more = sgai.monitor.activity(cron_id, limit=20, cursor=act.data.next_cursor)
monitor.activity() accepts limit (1–100, default 20) and optional cursor for pagination. Each MonitorTickEntry exposes id, created_at, status, changed, elapsed_ms, and a diffs model with per-format deltas.
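
To drain the full tick history, follow next_cursor until it runs out (assuming it is empty or None on the last page):

ticks = []
cursor = None
while True:
    page = sgai.monitor.activity(cron_id, limit=100, cursor=cursor)
    if page.status != "success":
        break
    ticks.extend(page.data.ticks)
    cursor = page.data.next_cursor
    if not cursor:
        break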

monitor.create() parameters

Parameter     Type                Required  Description
url           str                 Yes       URL to monitor (positional)
interval      str                 Yes       Cron expression, 1–100 chars (positional)
name          str                 No        ≤ 200 chars
formats       list[FormatConfig]  No        Defaults to [MarkdownFormatConfig()]
webhook_url   str                 No        Webhook invoked on change detection
fetch_config  FetchConfig         No        Fetch configuration

History

Fetch recent request history. Access via the sgai.history resource.
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

page = sgai.history.list(service="scrape", page=1, limit=20)
for entry in page.data.data:  # ApiResult.data wraps a paginated envelope; its .data holds the entries
    print(entry.id, entry.service, entry.status, entry.elapsed_ms)

one = sgai.history.get("request-id")

Credits / Health

credits = sgai.credits()
# ApiResult[CreditsResponse] with .remaining, .used, .plan, .jobs.crawl, .jobs.monitor

health = sgai.health()
# ApiResult[HealthResponse] with .status, .uptime, .services
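
Both follow the same ApiResult pattern as every other call:

if credits.status == "success":
    print(f"{credits.data.remaining} credits left on the {credits.data.plan} plan")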

Configuration Objects

FetchConfig

Controls how pages are fetched. See the proxy configuration guide for details on modes and geotargeting.
from scrapegraph_py import FetchConfig

config = FetchConfig(
    mode="js",            # "auto" (default), "fast", "js"
    stealth=True,         # Residential proxies / anti-bot headers (+5 credits)
    timeout=30000,        # 1,000–60,000 ms
    wait=2000,            # 0–30,000 ms
    scrolls=3,            # 0–100
    country="us",         # ISO 3166-1 alpha-2
    headers={"X-Custom": "header"},
    cookies={"session": "abc"},
    mock=False,           # Or a MockConfig object for testing
)
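
The resulting object is passed to any method that accepts fetch_config:

res = sgai.scrape("https://example.com", fetch_config=config)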

Async Support

Every sync method has an async equivalent on AsyncScrapeGraphAI:
import asyncio
from scrapegraph_py import AsyncScrapeGraphAI

async def main():
    async with AsyncScrapeGraphAI() as sgai:
        res = await sgai.scrape("https://example.com")
        if res.status == "success":
            print(res.data.results["markdown"]["data"])

        start = await sgai.crawl.start("https://example.com", max_pages=25)
        status = await sgai.crawl.get(start.data.id)
        print(status.data.status)

        credits = await sgai.credits()
        print(credits.data.remaining)

asyncio.run(main())
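
Because every call is a plain awaitable, concurrent requests compose with asyncio.gather. A sketch, assuming the client can be shared across tasks within one event loop:

async def scrape_many(urls: list[str]):
    async with AsyncScrapeGraphAI() as sgai:
        return await asyncio.gather(*(sgai.scrape(u) for u in urls))

results = asyncio.run(scrape_many(["https://example.com", "https://example.org"]))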

Support

GitHub

Report issues and contribute to the SDK

Email Support

Get help from our development team