API Reference
The Spider API is based on REST. It is predictable, returns JSON-encoded responses, uses standard HTTP response codes, and authenticates every request. The API supports bulk updates: you can operate on multiple objects per request for the core endpoints.
Authentication
Include your API key in the authorization header.
Authorization: Bearer sk-xxxx...
Response formats
Set the content-type header to shape the response.
Prefix any path with v1 to pin the API version. Requests on this page consume live credits.
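Putting the pieces above together, here is a minimal sketch of an authenticated, versioned request. The `sk-xxxx` key is a placeholder, and the `/crawl` path and `limit` parameter are taken from the endpoints and tables documented below:

```python
import json
import urllib.request

API_KEY = "sk-xxxx"  # placeholder; substitute your real key

def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated JSON request against the v1-pinned base URL."""
    return urllib.request.Request(
        f"https://api.spider.cloud/v1{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# The request is built but not sent here; pass it to urllib.request.urlopen
# (or your HTTP client of choice) to execute it and consume live credits.
req = build_request("/crawl", {"url": "https://example.com", "limit": 1})
```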
Just getting started? Quickstart guide →
Not a developer? Use Spider's no-code options to get started without writing code.
Base URL: https://api.spider.cloud
Common Parameters
These parameters are shared across Crawl, Scrape, Unblocker, Search, Links, Screenshot, and Fetch. Click any parameter to jump to its full description in the Crawl section.
Advanced (35)
| Name | Type | Default | Description |
|---|---|---|---|
| blacklist | array | — | Blacklist a set of paths that you do not want to crawl. You can use regex patterns to help with the list. |
| block_ads | boolean | true | Block advertisements when running the request as `chrome` or `smart`. |
| block_analytics | boolean | true | Block analytics scripts when running the request as `chrome` or `smart`. |
| block_stylesheets | boolean | true | Block stylesheets when running the request as `chrome` or `smart`. |
| budget | object | — | Object that maps paths to a counter limiting the number of pages crawled per path. You can use `*` as a wildcard to set a global cap, for example `{"*": 1}` crawls only the root page. |
| chunking_alg | object | — | Use a chunking algorithm to segment your content output. Pass an object specifying the algorithm (for example, by words, lines, characters, or sentences) and the segment size. |
| concurrency_limit | number | — | Set the concurrency limit to help balance requests for slower websites. The default is unlimited. |
| crawl_timeout | object | — | The maximum duration to allow for the entire crawl. The timeout duration is an object of the shape `{"secs": number, "nanos": number}`. |
| data_connectors | object | — | Stream crawl results directly to cloud storage and data services. Configure one or more connectors to automatically receive page data as it is crawled. Supports S3, Google Cloud Storage, Google Sheets, Azure Blob Storage, and Supabase. |
| depth | number | 25 | The crawl limit for maximum depth. If set to `0`, no depth limit is applied. |
| disable_intercept | boolean | false | Disable request interception when running the request as `chrome`. |
| event_tracker | object | — | Track the request events, responses, and automation output when using browser rendering. Pass an object with boolean keys such as `requests` and `responses` selecting what to capture. |
| exclude_selector | string | — | A CSS query selector to use for ignoring content from the markup of the response. |
| execution_scripts | object | — | Run custom JavaScript on certain paths. Requires the request to run as `chrome`. |
| external_domains | array | — | A list of external domains to treat as one domain. You can use regex paths to include the domains. Set one of the array values to `*` to include all external domains. |
| full_resources | boolean | — | Crawl and download all the resources for a website. |
| max_credits_allowed | number | — | Set the maximum number of credits to use per run. If the limit is exhausted before the initial response completes, the request returns a "blocked by client" error. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny). |
| max_credits_per_page | number | — | Set the maximum number of credits to use per page. Credits are measured in decimal units, where 10,000 credits equal one dollar (100 credits per penny). |
| metadata | boolean | false | Collect metadata about the content found, such as page title, description, and keywords. This can help improve AI interoperability. |
| preserve_host | boolean | false | Preserve the default Host header for the client. This may help bypass pages that require a Host header, and cases where the TLS configuration cannot be determined. |
| redirect_policy | string | Loose | The network redirect policy to use when performing HTTP requests. Possible values are `Loose` and `Strict`. |
| request | string | smart | The request type to perform. Use `http` for a plain HTTP request, `chrome` for headless browser rendering, or `smart` to decide between the two automatically. |
| request_timeout | number | 60 | The timeout, in seconds, to use for each request. |
| root_selector | string | — | The root CSS query selector to use when extracting content from the markup for the response. |
| run_in_background | boolean | false | Run the request in the background. Useful when storing data and triggering crawls that appear in the dashboard. |
| session | boolean | true | Persist the session for the client that you use on a website. This allows the HTTP headers and cookies to be set like a real browser session. |
| sitemap | boolean | false | Include the sitemap results to crawl. |
| sitemap_only | boolean | false | Only include the sitemap results to crawl. |
| sitemap_path | string | sitemap.xml | The sitemap URL to use when `sitemap` or `sitemap_only` is enabled. |
| subdomains | boolean | false | Allow subdomains to be included. |
| tld | boolean | false | Allow other top-level domains (TLDs) of the site to be included. |
| user_agent | string | — | Add a custom HTTP user agent to the request. By default this is set to a random agent. |
| wait_for | object | — | Wait conditions to apply before returning the page content when using browser rendering, such as waiting for a CSS selector, network idle, or a fixed delay. Timeout durations are objects of the shape `{"secs": number, "nanos": number}`. |
| webhooks | object | — | Use webhooks to get notified on events like credit depleted, new pages, metadata, and website status. |
| whitelist | array | — | Whitelist a set of paths that you want to crawl, ignoring all other routes that do not match the patterns. You can use regex patterns to help with the list. |
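As a hedged illustration, several of the parameters above combine into a single crawl payload. The parameter names come from the table; the target URL and values are arbitrary examples:

```python
# An illustrative crawl configuration using the advanced parameters above.
payload = {
    "url": "https://example.com",
    "depth": 3,                            # stop following links after 3 levels
    "blacklist": ["/login", "/admin/.*"],  # regex paths to skip
    "budget": {"*": 50},                   # cap the whole crawl at 50 pages
    "subdomains": True,                    # include subdomains
    "metadata": True,                      # collect titles, descriptions, keywords
}
```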
Core (6)
| Name | Type | Default | Description |
|---|---|---|---|
| disable_hints | boolean | — | Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters. Enable this if you want fully manual control over filtering behavior, are debugging request load order/coverage, or need deterministic behavior across runs. |
| limit | number | 0 | The maximum number of pages allowed to crawl per website. Remove the value or set it to `0` to crawl without a limit. |
| lite_mode | boolean | — | Lite mode reduces data transfer costs by 50%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections. |
| max_size | number | — | The max content size in bytes per page response. Content exceeding this limit will be truncated with a smart head/tail strategy that preserves the beginning and end of the content. |
| network_blacklist | string[] | — | Blocks matching network requests from being fetched/loaded. Use this to reduce bandwidth and noise by preventing known-unneeded third-party resources from ever being requested. Each entry is a string match pattern (commonly a hostname, domain, or URL substring). If both whitelist and blacklist are set, whitelist takes precedence. |
| network_whitelist | string[] | — | Allows only matching network requests to be fetched/loaded. Use this for a strict "allowlist-first" approach: keep the crawl lightweight while still permitting the essential scripts/styles needed for rendering and JS execution. Each entry is a string match pattern (commonly a hostname, domain, or URL substring). When set, requests not matching any whitelist entry are blocked by default. |
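The precedence rule described above (whitelist wins when both filters are set) can be sketched with a toy substring matcher. This is an illustration of the documented behavior, not the service's actual implementation:

```python
def is_request_allowed(url: str, whitelist=None, blacklist=None) -> bool:
    """Toy model of network filtering: whitelist, when set, takes precedence."""
    if whitelist:
        # Allowlist-first: anything not matching a whitelist entry is blocked.
        return any(pattern in url for pattern in whitelist)
    if blacklist:
        # Otherwise block anything matching a blacklist entry.
        return not any(pattern in url for pattern in blacklist)
    return True  # no filters configured

# Whitelist takes precedence even when a blacklist is also set.
allowed = is_request_allowed(
    "https://cdn.example.com/app.js",
    whitelist=["example.com"],
    blacklist=["example.com"],
)
```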
Output (16)
| Name | Type | Default | Description |
|---|---|---|---|
| clean_html | boolean | — | Clean the HTML of unwanted attributes. |
| css_extraction_map | object | — | Use CSS or XPath selectors to scrape contents from the web page. Set the paths and the extraction object map to perform extractions per path or page. |
| encoding | string | — | The type of encoding to use, such as `UTF-8` or `SHIFT_JIS`. |
| filter_images | boolean | — | Filter image elements from the markup. |
| filter_output_images | boolean | — | Filter the images from the output. |
| filter_output_main_only | boolean | true | Filter the nav, aside, and footer from the output. |
| filter_output_svg | boolean | — | Filter the svg tags from the output. |
| filter_svg | boolean | — | Filter SVG elements from the markup. |
| link_rewrite | json | — | Optional URL rewrite rule applied to every discovered link before it's crawled. This lets you normalize or redirect URLs (for example, rewriting paths or mapping one host pattern to another). The value must be a JSON object containing a regex pattern and its replacement. Invalid or unsafe regex patterns (overly long, unbalanced parentheses, advanced lookbehind constructs, etc.) are rejected by the server and ignored. |
| readability | boolean | false | Use readability to pre-process the content for reading. This may drastically improve the content for LLM usage. |
| return_cookies | boolean | false | Return the HTTP response cookies with the results. |
| return_embeddings | boolean | false | Include OpenAI embeddings for the page content with the results. |
| return_format | string \| array | raw | The format to return the data in. Possible values include `markdown`, `raw`, `text`, and `bytes`; pass an array to receive multiple formats. |
| return_headers | boolean | false | Return the HTTP response headers with the results. |
| return_json_data | boolean | false | Return the JSON data found in scripts used for SSR. |
| return_page_links | boolean | false | Return the links found on each page. |
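A hedged sketch of a payload combining the output parameters above. The `css_extraction_map` shape here is illustrative, so check the Crawl section for the exact structure:

```python
payload = {
    "url": "https://example.com/blog",
    "return_format": "markdown",      # convert page content to markdown
    "readability": True,              # pre-process content for reading
    "return_page_links": True,        # include links found on each page
    "filter_output_main_only": True,  # drop nav, aside, and footer
    # Illustrative per-path extraction map; the exact shape may differ.
    "css_extraction_map": {
        "/blog": [{"name": "headings", "selectors": ["h1", "h2"]}],
    },
}
```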
Config (7)
| Name | Type | Default | Description |
|---|---|---|---|
| cookies | string | — | Add HTTP cookies to use for requests. |
| fingerprint | boolean | true | Use advanced fingerprint detection for Chrome. |
| headers | object | — | Forward HTTP headers to use for all requests. The object is expected to be a map of key value pairs. |
| proxy | string | — | Select the proxy pool for this request: `residential`, `mobile`, or `isp`. Leave blank to disable proxy routing. Using this parameter overrides all other proxy settings. |
| proxy_enabled | boolean | false | Enable premium high-performance proxies to prevent detection and increase speed. You can also use Proxy-Mode to route requests through Spider's proxy front-end instead. |
| remote_proxy | string | — | Use a remote external proxy connection. You also save 50% on data transfer costs when you bring your own proxy. |
| stealth | boolean | true | Use stealth mode for headless Chrome requests to help prevent being blocked. |
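The configuration options above might combine as follows. This is a sketch; the cookie string and header values are placeholders:

```python
payload = {
    "url": "https://example.com",
    "proxy_enabled": True,
    "proxy": "residential",            # or "mobile" / "isp"
    "headers": {"Accept-Language": "en-US,en;q=0.9"},
    "cookies": "session_id=abc123",    # placeholder cookie string
    "stealth": True,
    "fingerprint": True,
}
```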
Performance (5)
| Name | Type | Default | Description |
|---|---|---|---|
| cache | boolean \| object | true | Use HTTP caching for the crawl to speed up repeated runs. Defaults to `true`. Accepts either a boolean or an options object of the shape `{ maxAge?: number; allowStale?: boolean; period?: string; skipBrowser?: boolean }`. |
| delay | number | 0 | Add a crawl delay of up to 60 seconds; enabling a delay disables concurrency. The delay is specified in milliseconds. |
| respect_robots | boolean | true | Respect the robots.txt file for crawling. |
| service_worker_enabled | boolean | true | Allow the website to use Service Workers as needed. |
| skip_config_checks | boolean | true | Skip checking the database for website configuration. This increases performance for requests that use `limit=1`. |
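A sketch of the performance knobs above; the cache options object follows the shape shown in the table:

```python
payload = {
    "url": "https://example.com",
    # Options-object form of `cache`, per the shape in the table above.
    "cache": {"maxAge": 3600, "allowStale": False},
    "delay": 1500,           # milliseconds between requests; disables concurrency
    "respect_robots": True,  # honor robots.txt
}
```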
Automation (4)
| Name | Type | Default | Description |
|---|---|---|---|
| automation_scripts | object | — | Run custom automated web tasks on certain paths. Requires the request to run as `chrome`. Automation actions such as evaluating scripts, clicking, filling inputs, waiting, and scrolling are available. |
| evaluate_on_new_document | string | — | Set a custom script to evaluate on new document creation. |
| scroll | number | — | Infinite scroll the page as new content loads, up to a duration in milliseconds. You may still need to use the `wait_for` parameter to capture late-loading content. |
| viewport | object | — | Configure the viewport for chrome. |
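A hedged sketch of the browser-automation parameters above. The viewport object shape is an assumption, so verify it in the Crawl section:

```python
payload = {
    "url": "https://example.com/feed",
    "request": "chrome",                         # automation needs browser rendering
    "scroll": 10_000,                            # infinite-scroll for up to 10 seconds
    "viewport": {"width": 1280, "height": 800},  # assumed shape: width/height in px
}
```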
Geolocation (2)
| Name | Type | Default | Description |
|---|---|---|---|
| country_code | string | — | Set an ISO 3166-1 alpha-2 country code for proxy connections. View the locations list for available countries. |
| locale | string | — | The locale to use for requests, for example `en-US`. |
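Combining the two geolocation parameters, a request pinned to Germany might look like this sketch:

```python
payload = {
    "url": "https://example.com",
    "country_code": "de",  # ISO 3166-1 alpha-2 code for the proxy location
    "locale": "de-DE",     # language/region to present with the request
}
```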
Per-endpoint notes
Scrape & Unblocker are single-page endpoints; they exclude limit, depth, and delay.
Screenshot returns image data; it excludes request, return_format, and readability.
Every endpoint below includes these parameters in its own parameter tabs with full descriptions. This section is a quick-reference index.