Welcome to ArchiveBox!

Just getting started?

Check out the Quickstart guide.

Need help with something?

Open an issue on GitHub or chat on Zulip.

Want to join the community?

See our Community Wiki page.

ArchiveBox

“The open-source self-hosted internet archive.”

Website | GitHub | Source | Bug Tracker

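# Create a folder for the archive data and enter it
mkdir my-archive && cd my-archive/
# Install ArchiveBox with pip
pip install archivebox

# Initialize a new collection in the current folder
archivebox init
# Archive a first URL
archivebox add https://example.com
# Print information about the collection
archivebox info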
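
The steps above can also be wrapped in a small script. The following is a minimal sketch, not part of ArchiveBox: the `run` and `quickstart` helpers and the `DRY_RUN` variable are hypothetical names introduced here for illustration, and the sketch assumes `archivebox` is already installed and on your `PATH`. It uses only the `init`, `add`, and `info` subcommands shown above.

```shell
#!/bin/sh
# Hypothetical wrapper around the quickstart commands (an illustration,
# not an official ArchiveBox script).
# Set DRY_RUN=1 to print each command instead of running it.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# quickstart DIR URL
# Creates DIR, then initializes a collection there and archives URL.
quickstart() {
    dir="$1"
    url="$2"
    if [ -z "$dir" ] || [ -z "$url" ]; then
        echo "usage: quickstart DIR URL" >&2
        return 2
    fi
    run mkdir -p "$dir" || return 1
    # The subshell keeps the caller's working directory unchanged.
    (
        run cd "$dir" || exit 1
        run archivebox init || exit 1
        run archivebox add "$url" || exit 1
        run archivebox info
    )
}
```

With the functions loaded, running `DRY_RUN=1 quickstart my-archive https://example.com` only lists the commands that would be executed, which is a cheap way to check the sequence before running it for real.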

Documentation

  • Contents
  • Overview
    • ArchiveBox Documentation
    • Key Features
    • 🤝 Professional Integration
    • Quickstart
    • Overview
    • Background & Motivation
    • Documentation
    • ArchiveBox Development
  • Getting Started
    • Quickstart
    • Install
    • Docker
    • Configuration
    • Security Overview
    • Usage
  • Guides
    • Setting Up Storage
    • Setting Up Authentication
    • Setting Up Search
    • Publishing Your Archive
    • Scheduled Archiving
    • Chrome / Chromium Setup
    • Setting Up a Chromium User Profile
    • Upgrading Versions
    • Upgrading or Merging Archives
    • Merging Collections
    • Troubleshooting
  • Architecture
    • ArchiveBox Architecture Diagrams
  • API Reference
    • Filesystem
    • SQL API
    • REST API
    • Python API
  • Meta
    • Roadmap
    • Changelog
    • Supporting Development
    • Web Archiving Community
© Copyright 2026 ArchiveBox.