API Reference

The Spider API is based on REST. Our API is predictable, returns JSON-encoded responses, and uses standard HTTP response codes and bearer-token authentication.

Set your API secret key in the Authorization header using the format Bearer $TOKEN. You can set the Content-Type header to application/json, application/xml, text/csv, or application/jsonl to shape the response format.

The Spider API supports bulk updates. You can work on multiple objects per request for the core API endpoints.

You can add v1 before any path to lock in that version of the API. Note that executing a request on this page with the Run button consumes live credits and returns a genuine result.
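
As a minimal sketch of the conventions above, the following request pins the v1 path, sets the bearer token and a JSON content type, and sends an array of objects so several websites are handled in one call (the body fields mirror the endpoint examples below):

import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

# Bulk request: one object per target website.
json_data = [
    {"url": "https://spider.cloud", "limit": 2, "return_format": "markdown"},
    {"url": "https://example.com", "limit": 2, "return_format": "markdown"},
]

# Prepending v1 locks the request to that API version.
response = requests.post('https://api.spider.cloud/v1/crawl',
  headers=headers, json=json_data)

print(response.json())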

Download the OpenAPI Specification.

LLM-Ready API Docs: llms.txt

Just getting started?

Check out our development quickstart guide.

Not a developer?

Use Spider's no-code options or applications to get started with Spider and do more with your account, no code required.

Base URL
https://api.spider.cloud

Crawl

Start crawling website(s) to collect resources. You can pass an array of objects for the request body.

POST https://api.spider.cloud/crawl

Body

application/json
  • url string required

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.


    To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.

  • limit number

    The maximum number of pages to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.


    It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.

  • disable_hints boolean

    Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating network_blacklist/network_whitelist recommendations based on observed request-pattern outcomes). Hints are enabled by default for all smart request modes.

    Enable this if you want fully manual control over filtering behavior, are debugging request load order/coverage, or need deterministic behavior across runs.

    Tip

    If you’re tuning filters, keep hints enabled and pair with event_tracker to see the complete URL list; once stable, you can flip disable_hints on to lock behavior.
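
A short sketch of a crawl body with hints disabled; disable_hints is passed as a top-level body field alongside the other parameters, as in the request below:

# Crawl body with hints disabled for deterministic, fully manual filtering.
json_data = {
    "url": "https://spider.cloud",
    "limit": 5,
    "return_format": "markdown",
    "disable_hints": True,  # omit or set to False to keep hints enabled
}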

Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/crawl', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "duration_elapsed_ms": 122,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.00001,
      "file_cost": 0.00002,
      "bytes_transferred_cost": 0.00002,
      "total_cost": 0.00004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]

Scrape

Start scraping a single page on website(s) to collect resources. You can pass an array of objects for the request body.

POST https://api.spider.cloud/scrape

Body

application/json
  • url string required

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.


    To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.

  • disable_hints boolean

    Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating network_blacklist/network_whitelist recommendations based on observed request-pattern outcomes). Hints are enabled by default for all smart request modes.

    Enable this if you want fully manual control over filtering behavior, are debugging request load order/coverage, or need deterministic behavior across runs.

    Tip

    If you’re tuning filters, keep hints enabled and pair with event_tracker to see the complete URL list; once stable, you can flip disable_hints on to lock behavior.

  • lite_mode boolean

    Lite mode reduces data transfer costs by 50%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.
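
A short sketch of a scrape body with lite mode turned on; lite_mode is passed as a top-level body field alongside the other parameters, as in the request below:

# Scrape body trading speed, accuracy, geo-targeting, and reliability for a
# 50% reduction in data transfer costs.
json_data = {
    "url": "https://spider.cloud",
    "return_format": "markdown",
    "lite_mode": True,  # omit or set to False for the default behavior
}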

Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/scrape', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "duration_elapsed_ms": 122,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.00001,
      "file_cost": 0.00002,
      "bytes_transferred_cost": 0.00002,
      "total_cost": 0.00004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]

Unblocker

Start unblocking challenging website(s) to collect data. You can pass an array of objects for the request body. Costs an additional 10-40 credits per success.

POST https://api.spider.cloud/unblocker

Body

application/json
  • url string required

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.


    To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.

  • disable_hints boolean

    Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating network_blacklist/network_whitelist recommendations based on observed request-pattern outcomes). Hints are enabled by default for all smart request modes.

    Enable this if you want fully manual control over filtering behavior, are debugging request load order/coverage, or need deterministic behavior across runs.

    Tip

    If you’re tuning filters, keep hints enabled and pair with event_tracker to see the complete URL list; once stable, you can flip disable_hints on to lock behavior.

  • lite_mode boolean

    Lite mode reduces data transfer costs by 50%, with trade-offs in speed, accuracy, geo-targeting, and reliability. It’s best suited for non-urgent data collection or when targeting websites with minimal anti-bot protections.

Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/unblocker', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "url": "https://spider.cloud",
    "status": 200,
    "cookies": {
        "a": "something",
        "b": "something2"
    },
    "headers": {
        "x-id": 123,
        "x-cookie": 123
    },
    "status": 200,
    "costs": {
        "ai_cost": 0.001,
        "ai_cost_formatted": "0.0010",
        "bytes_transferred_cost": 3.1649999999999997e-9,
        "bytes_transferred_cost_formatted": "0.0000000031649999999999997240",
        "compute_cost": 0.0,
        "compute_cost_formatted": "0",
        "file_cost": 0.000029291250000000002,
        "file_cost_formatted": "0.0000292912499999999997868372",
        "total_cost": 0.0010292944150000001,
        "total_cost_formatted": "0.0010292944149999999997865612",
        "transform_cost": 0.0,
        "transform_cost_formatted": "0"
    },
    "content": "<html>...</html>",
    "error": null
  },
  // more content...
]

Search

Perform a Google search to gather a list of websites for crawling and resource collection, including fallback options if the query yields no results. You can pass an array of objects for the request body.

POST https://api.spider.cloud/search

Body

application/json
  • limit number

    The maximum number of pages to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.


    It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.

Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"search":"sports news today","search_limit":3,"limit":5,"return_format":"markdown"}

response = requests.post('https://api.spider.cloud/search', 
  headers=headers, json=json_data)

print(response.json())
Response
{
  "content": [
      {
          "description": "Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.",
          "title": "ESPN - Serving Sports Fans. Anytime. Anywhere.",
          "url": "https://www.espn.com/"
      },
      {
          "description": "Sports Illustrated, SI.com provides sports news, expert analysis, highlights, stats and scores for the NFL, NBA, MLB, NHL, college football, soccer,&nbsp;...",
          "title": "Sports Illustrated",
          "url": "https://www.si.com/"
      },
      {
          "description": "CBS Sports features live scoring, news, stats, and player info for NFL football, MLB baseball, NBA basketball, NHL hockey, college basketball and football.",
          "title": "CBS Sports - News, Live Scores, Schedules, Fantasy ...",
          "url": "https://www.cbssports.com/"
      },
      {
          "description": "Sport is a form of physical activity or game. Often competitive and organized, sports use, maintain, or improve physical ability and skills.",
          "title": "Sport",
          "url": "https://en.wikipedia.org/wiki/Sport"
      },
      {
          "description": "Watch FOX Sports and view live scores, odds, team news, player news, streams, videos, stats, standings &amp; schedules covering NFL, MLB, NASCAR, WWE, NBA, NHL,&nbsp;...",
          "title": "FOX Sports News, Scores, Schedules, Odds, Shows, Streams ...",
          "url": "https://www.foxsports.com/"
      },
      {
          "description": "Founded in 1974 by tennis legend, Billie Jean King, the Women's Sports Foundation is dedicated to creating leaders by providing girls access to sports.",
          "title": "Women's Sports Foundation: Home",
          "url": "https://www.womenssportsfoundation.org/"
      },
      {
          "description": "List of sports · Running. Marathon · Sprint · Mascot race · Airsoft · Laser tag · Paintball · Bobsleigh · Jack jumping · Luge · Shovel racing · Card stacking&nbsp;...",
          "title": "List of sports",
          "url": "https://en.wikipedia.org/wiki/List_of_sports"
      },
      {
          "description": "Stay up-to-date with the latest sports news and scores from NBC Sports.",
          "title": "NBC Sports - news, scores, stats, rumors, videos, and more",
          "url": "https://www.nbcsports.com/"
      },
      {
          "description": "r/sports: Sports News and Highlights from the NFL, NBA, NHL, MLB, MLS, and leagues around the world.",
          "title": "r/sports",
          "url": "https://www.reddit.com/r/sports/"
      },
      {
          "description": "The A-Z of sports covered by the BBC Sport team. Find all the latest live sports coverage, breaking news, results, scores, fixtures, tables,&nbsp;...",
          "title": "AZ Sport",
          "url": "https://www.bbc.com/sport/all-sports"
      }
  ]
}
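
A hedged sketch of the workflow this endpoint enables: take the URLs returned by the search and submit them to /crawl as a single batched request (field names follow the examples in this document):

import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

# 1. Gather candidate websites from a search query.
search_body = {"search": "sports news today", "search_limit": 3}
results = requests.post('https://api.spider.cloud/search',
  headers=headers, json=search_body).json()

# 2. Crawl each returned site; one object per URL keeps the call batched.
crawl_body = [
    {"url": item["url"], "limit": 5, "return_format": "markdown"}
    for item in results.get("content", [])
]

if crawl_body:
    pages = requests.post('https://api.spider.cloud/crawl',
      headers=headers, json=crawl_body).json()
    print(len(pages), "pages collected")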

Links

Start crawling website(s) to collect the links found. You can pass an array of objects for the request body. This endpoint can save on latency if you only need to index the content URLs.

POST https://api.spider.cloud/links

Body

application/json
Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"limit":5,"return_format":"markdown","url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/links', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "url": "https://spider.cloud",
    "status": 200,
    "duration_elasped_ms": 112
    "error": null
  },
  // more content...
]

Screenshot

Take screenshots of a website in base64 or binary encoding. You can pass an array of objects for the request body.

POST https://api.spider.cloud/screenshot

Body

application/json
  • url string required

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.


    To reduce latency, enhance performance, and save on rate limits, batch multiple URLs into a single call. For large websites with high page limits, it's best to run requests individually.

  • limit number

    The maximum number of pages to crawl per website. Remove the value or set it to 0 to crawl all pages. Defaults to 0.


    It is better to set a limit upfront on websites where you do not know the size. Re-crawling can effectively use cache to keep costs low as new pages are found.

  • disable_hints boolean

    Disables service-provided hints that automatically optimize request types, geo-region selection, and network filters (for example, updating network_blacklist/network_whitelist recommendations based on observed request-pattern outcomes). Hints are enabled by default for all smart request modes.

    Enable this if you want fully manual control over filtering behavior, are debugging request load order/coverage, or need deterministic behavior across runs.

    Tip

    If you’re tuning filters, keep hints enabled and pair with event_tracker to see the complete URL list; once stable, you can flip disable_hints on to lock behavior.

Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"limit":5,"url":"https://spider.cloud"}

response = requests.post('https://api.spider.cloud/screenshot', 
  headers=headers, json=json_data)

print(response.json())
Response
[
  {
    "content": "<resource>...",
    "error": null,
    "status": 200,
    "duration_elapsed_ms": 122,
    "costs": {
      "ai_cost": 0,
      "compute_cost": 0.00001,
      "file_cost": 0.00002,
      "bytes_transferred_cost": 0.00002,
      "total_cost": 0.00004,
      "transform_cost": 0.0001
    },
    "url": "https://spider.cloud"
  },
  // more content...
]
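
A hedged sketch for handling the result, assuming the default response carries each screenshot as a base64 string in the content field (with binary output you would write the bytes directly instead):

import base64
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"limit": 1, "url": "https://spider.cloud"}

response = requests.post('https://api.spider.cloud/screenshot',
  headers=headers, json=json_data)

# Assumption: "content" holds a base64-encoded image; the .png extension is
# illustrative and depends on the image format the service returns.
for i, item in enumerate(response.json()):
    if item.get("content"):
        with open(f"screenshot-{i}.png", "wb") as f:
            f.write(base64.b64decode(item["content"]))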

Transform HTML

Transform HTML into Markdown or plain text quickly. Each HTML transformation starts at 0.1 credits, while PDF transformations can cost up to 10 credits per page. You can submit up to 10 MB of data per request. The Transform API is also integrated into the /crawl endpoint via the return_format parameter.

POST https://api.spider.cloud/transform

Body

application/json
  • data object required

    A list of HTML data to transform. Each object takes the keys html and url. The url key is optional and only used when readability is enabled.

Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

json_data = {"return_format":"markdown","data":[{"html":"<html><body>\n<h1>Example Website</h1>\n<p>This is some example markup to use to test the transform function.</p>\n<p><a href=\"https://spider.cloud/guides\">Guides</a></p>\n</body></html>","url":"https://example.com"}]}

response = requests.post('https://api.spider.cloud/transform', 
  headers=headers, json=json_data)

print(response.json())
Response
{
    "content": [
      "# Example Website
This is some example markup to use to test the transform function.
[Guides](https://spider.cloud/guides)"
    ],
    "cost": {
        "ai_cost": 0,
        "compute_cost": 0,
        "file_cost": 0,
        "bytes_transferred_cost": 0,
        "total_cost": 0,
        "transform_cost": 0.0001
    },
    "error": null,
    "status": 200
  }
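
Because data is a list, several documents can be transformed in a single request. A small sketch, assuming the response returns one content entry per input, in order:

import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/json',
}

# Batch two HTML snippets into one transform call; the url key is optional
# and only matters when readability is enabled.
json_data = {
    "return_format": "markdown",
    "data": [
        {"html": "<html><body><h1>First page</h1></body></html>",
         "url": "https://example.com/one"},
        {"html": "<html><body><h1>Second page</h1></body></html>",
         "url": "https://example.com/two"},
    ],
}

response = requests.post('https://api.spider.cloud/transform',
  headers=headers, json=json_data)

for markdown in response.json().get("content", []):
    print(markdown)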

Proxy-Mode

Spider also offers a proxy front-end to the service. The Spider proxy handles requests just like any standard request, with the option to use high-performance and residential proxies at up to 10 GB/s. Take a look at all of our proxy locations to see if we support the country you need.

**HTTP address**: proxy.spider.cloud:80
**HTTPS address**: proxy.spider.cloud:443
**Username**: YOUR-API-KEY
**Password**: PARAMETERS

Residential

  • Speed: Up to 1GB/s
  • Purpose: Real-User IPs, Global Reach, High Anonymity
  • Cost: $1/GB - $4/GB

ISP

  • Speed: Up to 10GB/s
  • Purpose: Stable Datacenter IPs, Highest Performance
  • Cost: $1/GB

Mobile

  • Speed: Up to 100MB/s
  • Purpose: Real Mobile Devices, Avoid Detection
  • Cost: $2/GB

Use the country_code parameter to set the proxy geolocation and the proxy parameter to select the proxy type.

| Proxy Type  | Price    | Multiplier | Description                      |
|-------------|----------|------------|----------------------------------|
| residential | $2.00/GB | ×2–×4      | Entry-level residential pool     |
| mobile      | $2.00/GB | ×2         | 4G/5G mobile proxies for stealth |
| isp         | $1.00/GB | ×1         | ISP-grade residential routing    |
Example proxy request
import requests, os


# Proxy configuration
proxies = {
    'http': f"http://{os.getenv('SPIDER_API_KEY')}:proxy=residential@proxy.spider.cloud:8888",
    'https': f"https://{os.getenv('SPIDER_API_KEY')}:proxy=residential@proxy.spider.cloud:8889"
}

# Function to make a request through the proxy
def get_via_proxy(url):
    try:
        response = requests.get(url, proxies=proxies)
        response.raise_for_status()
        print('Response HTTP Status Code: ', response.status_code)
        print('Response HTTP Response Body: ', response.content)
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return None

# Example usage
if __name__ == "__main__":
     get_via_proxy("https://www.example.com")
     get_via_proxy("https://www.example.com/community")

Queries

Query the data that you collect during crawling and scraping. Add dynamic filters for extracting exactly what is needed.

Logs

Get the last 24 hours of logs.

GET https://api.spider.cloud/data/crawl_logs

Params

  • url string

    Filter records for a single URL.

  • limit string

    The maximum number of records to return.

  • domain string

    Filter records for a single domain.

  • page number

    The page of results to return.

Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/crawl_logs?limit=5&url=https%3A%2F%2Fspider.cloud', 
  headers=headers)

print(response.json())
Response
{
  "data": {
    "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh",
    "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg",
    "domain": "spider.cloud",
    "url": "https://spider.cloud",
    "links": 1,
    "credits_used": 3,
    "mode": 2,
    "crawl_duration": 340,
    "message": null,
    "request_user_agent": "Spider",
    "level": "UI",
    "status_code": 0,
    "created_at": "2024-04-21T01:21:32.886863+00:00",
    "updated_at": "2024-04-21T01:21:32.886863+00:00"
  },
  "error": null
}
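
A short sketch using the remaining documented filters, domain and page, passed as query parameters:

import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
}

# Filter the last 24 hours of logs by domain and request a specific page of
# records. The starting page index is an assumption; adjust as needed.
params = {"domain": "spider.cloud", "limit": 10, "page": 1}

response = requests.get('https://api.spider.cloud/data/crawl_logs',
  headers=headers, params=params)

print(response.json())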

Credits

Get the remaining credits available.

GET https://api.spider.cloud/data/credits
Request
import requests

headers = {
    'Authorization': 'Bearer $SPIDER_API_KEY',
    'Content-Type': 'application/jsonl',
}

response = requests.get('https://api.spider.cloud/data/credits', 
  headers=headers)

print(response.json())
Response
{
  "data": {
    "id": "8d662167-5a5f-41aa-9cb8-0cbb7d536891",
    "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg",
    "credits": 53334,
    "created_at": "2024-04-21T01:21:32.886863+00:00",
    "updated_at": "2024-04-21T01:21:32.886863+00:00"
  }
}