mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-06-22 11:31:05 -04:00


Nick Sweeting 113fae8cd1 docs: rewrite Configuration.md (remove plugin tree, add Database/crawl limits, redirect to abx-plugins)

This rewrite (now reapplied on top of the wiki subtree) covers the full
session's work on Configuration.md:

- Add crawl/snapshot limits (CRAWL_MAX_URLS/SIZE/TIMEOUT,
  CRAWL_MAX_CONCURRENT_SNAPSHOTS, SNAPSHOT_MAX_SIZE), DELETE_AFTER,
  PERMISSIONS, PLUGINS/ENABLED_PLUGINS/ACTIVE_PERSONA.
- Add new Database Settings section (SQLITE_* tuning + DATABASE_NAME).
- Add SERVER_SECURITY_MODE deep-dive (4 modes, host-layout table).
- Add Storage path overrides (DATA_DIR, ARCHIVE_DIR, USERS_DIR,
  PERSONAS_DIR, CRAWL_DIR, SNAP_DIR, ALLOW_NO_UNIX_SOCKETS).
- Remove ALLOWED_HOSTS + CSRF_TRUSTED_ORIGINS as user-settable; both
  auto-derived from BASE_URL + SERVER_SECURITY_MODE. Backward-compat
  anchors preserved on BASE_URL with the 0.7.3 -> 0.9 legacy upgrade note.
- Remove the entire Plugin Settings tree (~200 options, 41 subsections);
  replace with prominent redirect to https://archivebox.github.io/abx-plugins/
  and a "shared core options that plugins fall back to" table.
- Add 231 backward-compat <a id="..."></a> anchors so old URLs to plugin
  sections / removed options / multi-option headers all still resolve
  (e.g. #wget_args -> Plugin Configuration section, #public_snapshots ->
  PERMISSIONS, #ssl_enabled -> Plugin Configuration, #admin_username ->
  ADMIN_USERNAME/PASSWORD heading, #dir_output_permissions ->
  OUTPUT_PERMISSIONS, #url_blacklist -> URL_DENYLIST).
- Fix wrong default: PUBLIC_ADD_VIEW is False, not True.
- Drop the 7 TRAFILATURA_OUTPUT_* per-format flags (replaced by single
  TRAFILATURA_OUTPUT_FORMATS in plugin); SSL_ENABLED/SSL_TIMEOUT (wrong
  plugin namespace) — anchors redirected to Plugin Configuration.
- Reframe COOKIES_FILE as low-level escape hatch; personas are the
  preferred auth path.
- Link every named plugin to its specific anchor on the abx-plugins page
  (e.g. WGET_TIMEOUT -> #wget, SONIC_HOST -> #search_backend_sonic).
- Strip implementation-detail mentions (Pydantic, etc.).
- Slim Shell Options to only user-settable (DEBUG, USE_COLOR,
  SHOW_PROGRESS); drop IS_TTY/IN_DOCKER/IN_QEMU.
- Restructure: General -> Server (+LDAP) -> Storage -> Database (new) ->
  Search -> Shell -> Plugin Configuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-31 02:24:36 -07:00


Configuration

ArchiveBox is configured with the archivebox config command, by editing the ArchiveBox.conf file in the data folder, or through environment variables. All three methods also work when running under Docker.

Equivalent ways to set a configuration option:

archivebox config --set TIMEOUT=120
# OR
echo "TIMEOUT=120" >> ArchiveBox.conf
# OR
env TIMEOUT=120 archivebox add ~/Downloads/bookmarks_export.html

Environment variables take precedence over the config file, which is useful if you only want to use a certain option temporarily during a single run. For more examples see Usage: Configuration...
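The precedence rule above can be sketched in a few lines. This is only an illustration: resolve_option and the dictionaries passed to it are hypothetical names, not part of ArchiveBox, and the real loader may consult more sources than these three.

```python
import os

def resolve_option(name, file_values, defaults, environ=None):
    """Return the effective value: environment > config file > built-in default."""
    environ = os.environ if environ is None else environ
    if name in environ:
        return environ[name]
    if name in file_values:
        return file_values[name]
    return defaults[name]

defaults = {"TIMEOUT": "60"}
file_values = {"TIMEOUT": "120"}  # as if read from ArchiveBox.conf

print(resolve_option("TIMEOUT", file_values, defaults, environ={}))                  # 120
print(resolve_option("TIMEOUT", file_values, defaults, environ={"TIMEOUT": "300"}))  # 300
```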

Available Configuration Options:

General Settings: Archiving process, output format, crawl limits, and retention.
Server Settings: Web UI, authentication, subdomain routing, and reverse proxy options.
Storage Settings: File layout, permissions, and temp/lib directories.
Database Settings: SQLite tuning and lock-retry behavior.
Search Settings: Full-text search backend selection.
Shell Options: Format & behavior of CLI output.
Plugin Configuration: Per-plugin options (now documented separately).

In case this document is ever out of date, check the source code for config definitions: archivebox/config/common.py ➡️

General Settings

General options around the archiving process, output format, retention, and concurrency limits.

`ONLY_NEW`

Possible Values: [True]/False

Toggle whether or not to attempt rechecking old links when adding new ones, or leave old incomplete links alone and only archive the new links.

By default, ArchiveBox will only archive new links on each import. If you want it to go back through all links in the index and download any missing files on every run, set this to False.

Note: Regardless of how this is set, ArchiveBox never re-downloads sites that have already succeeded. When this is False, it only attempts to fill in missing extractor outputs for previously added pages; it does not re-archive pages that were already archived successfully.

`TIMEOUT`

Possible Values: [60]/120/...

Maximum allowed runtime per-extractor, per-Snapshot in seconds. If you have a slow network connection or are seeing frequent timeout errors, you can raise this value.

This is a plugin-shared setting — each individual extractor can override it with its own <EXTRACTOR>_TIMEOUT (e.g. WGET_TIMEOUT, CHROME_TIMEOUT, YTDLP_TIMEOUT). See the per-plugin docs for the full list.
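A minimal sketch of the fallback described above, assuming the effective config is available as a flat dictionary. The helper name effective_setting is hypothetical; the same lookup shape would apply to the other plugin-shared options that use an <EXTRACTOR>_ prefix.

```python
def effective_setting(config, extractor, key, default=None):
    """Prefer <EXTRACTOR>_<KEY>, then the shared <KEY>, then a default."""
    specific = f"{extractor.upper()}_{key}"
    if specific in config:
        return config[specific]
    return config.get(key, default)

config = {"TIMEOUT": 60, "WGET_TIMEOUT": 120}

print(effective_setting(config, "wget", "TIMEOUT"))    # 120 (per-extractor override)
print(effective_setting(config, "chrome", "TIMEOUT"))  # 60 (shared fallback)
```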

Note

TIMEOUT only caps a single extractor invocation. To bound the total wall-clock runtime of an entire crawl, use CRAWL_TIMEOUT instead.

Warning

Do not set this to anything less than 5 seconds: Chrome will hang indefinitely and many sites will fail completely. The recommended range is 30 to 3000 seconds.

Related options: CRAWL_TIMEOUT, CRAWL_MAX_URLS, SNAPSHOT_MAX_SIZE

`RESOLUTION`

Possible Values: [1440,2000]/1024,768/...

Default screenshot/PDF viewport resolution in width,height pixels. Used as the fallback for SCREENSHOT_RESOLUTION, PDF_RESOLUTION, and CHROME_RESOLUTION.

This is a plugin-shared setting — individual extractors override it via <EXTRACTOR>_RESOLUTION (e.g. SCREENSHOT_RESOLUTION, PDF_RESOLUTION, CHROME_RESOLUTION). See the per-plugin docs for plugin-specific overrides.

`CHECK_SSL_VALIDITY`

Possible Values: [True]/False

Whether to enforce HTTPS certificate validity and HSTS chain of trust when archiving sites. Set this to False if you want to archive pages even if they have expired or invalid certificates.

This is a plugin-shared setting — every HTTP-fetching extractor (wget, yt-dlp, gallery-dl, chrome, etc.) honors it, and individual extractors can override with <EXTRACTOR>_CHECK_SSL_VALIDITY. See the per-plugin docs.

Warning

When False, ArchiveBox cannot guarantee that the captured content matches the real site — a man-in-the-middle could substitute responses. Only disable for trusted networks or for archiving legacy/internal sites with expired certs.

`USER_AGENT`

Possible Values: [Mozilla/5.0 ... ArchiveBox/{VERSION} ...]/"Mozilla/5.0 ..."/...

The default User-Agent string sent during archiving. The built-in default identifies ArchiveBox and links back to the GitHub repo so site operators can identify and contact archivers if needed.

This is a plugin-shared setting — each extractor (wget, chrome, yt-dlp, singlefile, …) can override it with its own <EXTRACTOR>_USER_AGENT, otherwise it falls back to this value. See the per-plugin docs for per-extractor specifics.

Note

Some sites block requests that look like bots or that don't match a real browser. If you're getting 403s or empty responses, try setting this to a current Chrome/Firefox UA string.

`COOKIES_FILE`

Possible Values: [None]//path/to/cookies.txt/...

Tip

Prefer personas over COOKIES_FILE for authentication. A persona bundles a cookies.txt, a Chrome user-data-dir, a user-agent, and any other per-identity state into one named profile that's swappable per-crawl and automatically scoped across every extractor. COOKIES_FILE (and the per-extractor <EXTRACTOR>_COOKIES_FILE overrides) is a low-level escape hatch for when you specifically need to point at a hand-rolled cookies file outside the persona system — most users should ignore it and configure auth through archivebox persona create instead.

Path to a Netscape-format cookies.txt file passed to wget, curl, yt-dlp, and other non-Chrome extractors for authentication. Required when archiving sites behind a login (paywalls, social media feeds, members-only forums, etc.) if you're not using a persona.

This is a plugin-shared setting — each extractor can override it with <EXTRACTOR>_COOKIES_FILE (e.g. WGET_COOKIES_FILE, YTDLP_COOKIES_FILE, GALLERYDL_COOKIES_FILE). Chrome-based extractors instead read auth state from the persona's CHROME_USER_DATA_DIR. See the per-plugin docs for per-extractor variants.

You can generate a cookies.txt using a browser extension, or with wget --save-cookies + --user=... --password=....

The recommended path is to create a persona and let it manage cookies + Chrome profile state for you:

archivebox persona create --import=chrome personal
archivebox add --persona=personal https://members.example.com/feed

Warning

Use separate burner credentials dedicated to archiving. Don't reuse the cookies from your everyday Facebook/Instagram/YouTube/etc. accounts: server responses often contain your name, email, other PII, and session tokens, which then get preserved in your snapshots forever!

`DEFAULT_PERSONA`

Possible Values: [Default]/personal/work/...

The persona profile used when no explicit persona is selected for a crawl. Personas bundle a Chrome user-data-dir, a cookies.txt, auth state, a user-agent, and any other per-identity config into a single named profile, letting you swap between archiving contexts (logged-out vs. signed-into-work-account vs. signed-into-personal-account) without manually juggling files.

ArchiveBox auto-creates the named persona on disk if it doesn't already exist. See the Personas wiki page for the full directory layout.

Related options: ACTIVE_PERSONA, COOKIES_FILE

`ACTIVE_PERSONA`

Possible Values: auto-set, read-only at runtime

The name of the persona actually being used for the current crawl/snapshot. Where DEFAULT_PERSONA is the user-configured fallback, ACTIVE_PERSONA is derived: ArchiveBox sets it automatically based on the resolved persona for each Snapshot (explicit selection on the Crawl > persona on the URL > DEFAULT_PERSONA).

You generally read this rather than write it. Plugins and templates can inspect ACTIVE_PERSONA to render persona-specific UI or pick persona-scoped paths. Setting it manually in ArchiveBox.conf has no effect — it will be overwritten on every run by the persona resolver.
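The resolution order (Crawl selection, then the persona on the URL, then DEFAULT_PERSONA) amounts to picking the first value that is set. The function and argument names below are hypothetical and only illustrate that ordering; the actual resolver is internal to ArchiveBox.

```python
def resolve_active_persona(crawl_persona=None, url_persona=None, default_persona="Default"):
    """Return the first persona that is set, in most-specific-first order."""
    for candidate in (crawl_persona, url_persona, default_persona):
        if candidate:
            return candidate
    raise ValueError("no persona configured")

print(resolve_active_persona())                                            # Default
print(resolve_active_persona(url_persona="work"))                          # work
print(resolve_active_persona(crawl_persona="personal", url_persona="work"))  # personal
```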

Related options: DEFAULT_PERSONA

`URL_DENYLIST`

Regex pattern matched against every URL discovered during a crawl. Any matching URL is excluded from archiving — useful for blocking tracking pixels, ad networks, CDN-hosted CSS/fonts, or arbitrary file extensions you don't want to capture.

The default skips common static assets (CSS, fonts, Google Fonts CDN) so they aren't re-fetched as separate Snapshots during recursive crawls — the parent page's singlefile/dom output already inlines them.

Note: This option is also recognized under its legacy alias URL_BLACKLIST.

Related options: URL_ALLOWLIST

`URL_ALLOWLIST`

Possible Values: [None]/^http(s)?:\/\/(.+)?example\.com\/?.*$/...

Regex pattern matched against every URL discovered during a crawl. When set, any URL that does not match is excluded. Useful for recursive crawling scoped to a single domain or path prefix (e.g. only follow links within docs.example.com/v2/).

When both are set, URL_DENYLIST takes precedence over URL_ALLOWLIST.
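As a rough sketch of how the two patterns combine, with the denylist checked first so that it wins. The function name is hypothetical, the denylist pattern is made up for the example, and the use of re.search (rather than a full match) is an assumption about matching semantics.

```python
import re

def is_url_allowed(url, allowlist=None, denylist=None):
    """Denylist wins; an unset allowlist admits everything not denied."""
    if denylist and re.search(denylist, url):
        return False
    if allowlist and not re.search(allowlist, url):
        return False
    return True

ALLOW = r"^http(s)?:\/\/(.+)?example\.com\/?.*$"  # example value from this page
DENY = r"\.(css|woff2?)$"                          # hypothetical denylist pattern

print(is_url_allowed("https://docs.example.com/v2/", ALLOW, DENY))      # True
print(is_url_allowed("https://docs.example.com/site.css", ALLOW, DENY)) # False
print(is_url_allowed("https://other.org/", ALLOW, DENY))                # False
```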

Note: This option is also recognized under its legacy alias URL_WHITELIST.

Related options: URL_DENYLIST

`TAG_SEPARATOR_PATTERN`

Possible Values: [[,]]/[,;]/[,;\s]/...

Regex character class used to split tag strings (e.g. news,politics; longform) into individual tags when importing URLs. The default splits on commas only; widen it if you paste in tags separated by semicolons, spaces, or other delimiters.
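To see the effect of widening the pattern, here is a small sketch using Python's re.split. The helper name is hypothetical, and trimming whitespace and dropping empty tags are assumptions about post-processing rather than documented behavior.

```python
import re

def split_tags(raw, pattern=r"[,]"):
    """Split a tag string on the given character class, dropping empty pieces."""
    return [tag.strip() for tag in re.split(pattern, raw) if tag.strip()]

print(split_tags("news,politics; longform"))              # ['news', 'politics; longform']
print(split_tags("news,politics; longform", r"[,;]"))     # ['news', 'politics', 'longform']
print(split_tags("news,politics; longform", r"[,;\s]"))   # ['news', 'politics', 'longform']
```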

`CRAWL_MAX_URLS`

Possible Values: [0]/50/500/...

Maximum number of unique URLs (Snapshots) a single crawl is allowed to produce. 0 means unlimited. Counts both seed URLs you submitted and URLs discovered by recursive crawlers (parse_dom_outlinks, parse_html_urls, etc.).

Once the cap is reached, recursive crawlers stop emitting new Snapshots and the crawl is marked with stop_reason = "crawl_max_urls". Raising the cap later and re-queuing the crawl will resume discovery — the limit state is persisted in <crawl_dir>/.abx-dl/limits.json and re-evaluated each tick.

Note

Use this as a safety net for recursive crawls (--depth=N) that could otherwise blow up to thousands of pages on link-heavy sites.

`CRAWL_MAX_SIZE`

Possible Values: [0]/50MB/5GB/104857600/...

Maximum cumulative output size (in bytes) a single crawl is allowed to produce across all of its Snapshots. 0 means unlimited.

Accepts a raw byte count (104857600) or a unit-suffixed string (100MB, 5GB, 1TiB). Sizes are accumulated by the extractor service as each ArchiveResult writes its outputs to disk; once the cap is exceeded, in-flight Snapshots finish but no new ones are admitted and the crawl stops with stop_reason = "crawl_max_size".
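A sketch of unit-suffix parsing, for illustration only. The function name and unit table are hypothetical, and one assumption matters in particular: this page does not say whether MB means 10^6 or 2^20 bytes, so the sketch treats MB/GB/TB as decimal and MiB/GiB/TiB as binary. Check the source before relying on exact byte counts.

```python
import re

# Assumption: SI suffixes are decimal, IEC suffixes are binary.
_UNITS = {
    "": 1, "B": 1,
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40,
}

def parse_size(value):
    """Convert '100MB', '1TiB', or a raw byte count into an integer number of bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([A-Za-z]*)", str(value).strip())
    if not match or match.group(2).upper() not in _UNITS:
        raise ValueError(f"unrecognised size: {value!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2).upper()])

print(parse_size("104857600"))  # 104857600
print(parse_size("5GB"))        # 5000000000
print(parse_size("1TiB"))       # 1099511627776
```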

Note

Bounds the disk footprint of a crawl, not the wire transfer — a 2MB HTML page can produce 50MB of screenshots, PDFs, SingleFile bundles, and media downloads, and this cap applies to the on-disk total.

Related options: SNAPSHOT_MAX_SIZE, CRAWL_MAX_URLS, CRAWL_TIMEOUT

`CRAWL_TIMEOUT`

Possible Values: [0]/300/3600/...

Maximum total wall-clock runtime for a single crawl in seconds. 0 means unlimited.

Distinct from TIMEOUT: TIMEOUT caps one extractor invocation on one Snapshot; CRAWL_TIMEOUT caps the entire crawl — all Snapshots, all extractors, all retries, all recursive discovery passes — together. Once exceeded the crawl is marked stop_reason = "crawl_timeout" and queued Snapshots are skipped.

Note

Useful as a hard ceiling for unattended/scheduled crawls (e.g. "spend at most 1 hour archiving Hacker News tonight"). Pair with CRAWL_MAX_URLS and CRAWL_MAX_SIZE for belt-and-suspenders bounds.

Related options: TIMEOUT, CRAWL_MAX_URLS, CRAWL_MAX_SIZE

`CRAWL_MAX_CONCURRENT_SNAPSHOTS`

Possible Values: [4]/1/8/16/...

How many Snapshots within a single crawl ArchiveBox will archive in parallel. The runner schedules up to this many extractor pipelines at once, then waits for one to finish before starting the next.

Raising this speeds up large crawls on beefy hardware, but each concurrent Snapshot launches its own Chrome instance (when Chrome-based extractors are enabled) — RAM and CPU pressure scale roughly linearly. On a typical laptop, 2-4 is sane; on a dedicated server with 32GB+ RAM, 8-16 can be reasonable.

Note

This is per-crawl concurrency. If you run multiple crawls simultaneously, each one independently gets up to CRAWL_MAX_CONCURRENT_SNAPSHOTS parallel Snapshots.
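The scheduling behavior (at most N pipelines at once, the next one starting when a slot frees up) is the classic bounded-concurrency pattern. The sketch below shows it with an asyncio semaphore; run_crawl and the sleep standing in for the extractor pipeline are hypothetical, and ArchiveBox's actual runner may be built differently.

```python
import asyncio

async def run_crawl(urls, max_concurrent=4):
    """Process every URL, never running more than max_concurrent at once.

    Returns the peak number of simultaneously running snapshots."""
    slots = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def archive(url):
        nonlocal running, peak
        async with slots:              # wait here until a slot is free
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for the extractor pipeline
            running -= 1

    await asyncio.gather(*(archive(url) for url in urls))
    return peak

urls = [f"https://example.com/{i}" for i in range(10)]
print(asyncio.run(run_crawl(urls, max_concurrent=4)))  # 4
```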

Related options: CRAWL_MAX_URLS, TIMEOUT

`SNAPSHOT_MAX_SIZE`

Possible Values: [0]/10MB/500MB/...

Maximum cumulative output size (in bytes) per individual Snapshot. 0 means unlimited. Same unit-suffix parsing as CRAWL_MAX_SIZE (10MB, 2GB, raw bytes, etc.).

Where CRAWL_MAX_SIZE is a crawl-wide budget, SNAPSHOT_MAX_SIZE puts a ceiling on any one page's output. Once a Snapshot's outputs exceed the cap, remaining extractors for that Snapshot are skipped and the Snapshot is tagged with stop_reason = "snapshot_max_size" — but the rest of the crawl continues normally.

Note

Particularly useful when crawling sites with occasional huge pages (e.g. a forum where most threads are small but a few are 500MB media galleries) — it caps the outliers without throttling the whole crawl.

Related options: CRAWL_MAX_SIZE, CRAWL_MAX_URLS

`DELETE_AFTER`

Possible Values: [0]/24h/7d/4w/6mo/1y/...

Retention policy: automatically delete Crawls, Snapshots, ArchiveResults, and Process rows (and their on-disk outputs) after this duration has elapsed. 0, "", or None disables auto-deletion (the default; ArchiveBox never deletes anything unless you ask).

Accepted units: h/hr/hour, d/day, w/week, mo/month, y/yr/year. The minimum non-zero duration is 1h. Examples:

archivebox config --set DELETE_AFTER=24h     # daily rolling buffer
archivebox config --set DELETE_AFTER=30d     # 30-day retention
archivebox config --set DELETE_AFTER=6mo     # 6 months

DELETE_AFTER can be set globally, per-persona, per-crawl, or per-snapshot — the most-specific value wins. When a Snapshot is created, its delete_at timestamp is computed from the effective DELETE_AFTER and persisted; the retention sweeper then deletes rows whose delete_at is in the past.
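To make the delete_at computation concrete, here is a sketch that parses the accepted units and adds the result to a creation time. The function names are hypothetical, and the lengths used for a month (30 days) and a year (365 days) are assumptions, since this page does not define them.

```python
import re
from datetime import datetime, timedelta, timezone

# Hours per unit. Month = 30 days and year = 365 days are assumptions.
_HOURS = {
    "h": 1, "hr": 1, "hour": 1,
    "d": 24, "day": 24,
    "w": 168, "week": 168,
    "mo": 720, "month": 720,
    "y": 8760, "yr": 8760, "year": 8760,
}

def parse_delete_after(value):
    """Return a timedelta, or None when auto-deletion is disabled."""
    if value in (None, "", 0, "0"):
        return None
    match = re.fullmatch(r"(\d+)\s*([a-z]+)", str(value).strip().lower())
    if not match or match.group(2) not in _HOURS:
        raise ValueError(f"unrecognised duration: {value!r}")
    duration = timedelta(hours=int(match.group(1)) * _HOURS[match.group(2)])
    if duration < timedelta(hours=1):
        raise ValueError("minimum non-zero duration is 1h")
    return duration

def compute_delete_at(created_at, delete_after):
    """Timestamp after which a retention sweep would treat the row as expired."""
    duration = parse_delete_after(delete_after)
    return None if duration is None else created_at + duration

created = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(compute_delete_at(created, "30d"))  # 2026-01-31 00:00:00+00:00
```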

Warning

Deletion is destructive and irreversible. Files in the snapshot's output directory are removed from disk. Use with care on important archives — and never set this on the global config if you have legacy snapshots you don't want garbage-collected.

Related options: PERMISSIONS

`PERMISSIONS`

Possible Values: [public]/unlisted/private

Default visibility for newly created Snapshots. Inherited by every Snapshot in a Crawl unless explicitly overridden at the Crawl or Snapshot level.

public — Snapshot appears in the public index and its content is directly accessible without login.
unlisted — Snapshot content is accessible via direct link, but it is not listed in the public index. Equivalent to a "secret URL."
private — Snapshot is hidden from the public index and its content requires admin login.

This option supersedes the removed PUBLIC_SNAPSHOTS boolean and is also driven by the still-current PUBLIC_INDEX flag — both are interpreted as a coarse mapping onto PERMISSIONS for backwards compatibility (PUBLIC_SNAPSHOTS=False ⇒ private, PUBLIC_INDEX=False ⇒ unlisted, either set to True ⇒ public). Setting PERMISSIONS directly wins over either legacy flag.

Note

PERMISSIONS controls per-Snapshot visibility. Server-wide auth (whether the whole UI requires login, whether the add-view is open) is still controlled by PUBLIC_INDEX and PUBLIC_ADD_VIEW under Server Settings.

Related options: PUBLIC_INDEX, PUBLIC_ADD_VIEW, DELETE_AFTER

`PLUGINS`

Possible Values: [""]/wget,favicon,screenshot/chrome,singlefile,dom/...

Comma-separated whitelist of plugins to load and run for this archiving run. When empty (the default), ArchiveBox uses the installed/enabled plugin set, i.e. every plugin whose <PLUGIN>_ENABLED config evaluates true.

When set, only the listed plugins (plus any plugins they declare as required_plugins in their config.json — e.g. picking singlefile automatically pulls in chrome) participate in the run. Equivalent to the CLI flag:

archivebox add --plugins=wget,favicon,screenshot https://example.com

Useful for one-off runs ("just grab a screenshot and skip everything else") or for reproducible per-crawl pipelines stored on the Crawl row.
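The transitive expansion of required_plugins can be pictured as a simple graph walk. In this sketch the function name and the REQUIRED mapping are hypothetical stand-ins for whatever each plugin declares in its config.json; only the singlefile to chrome dependency comes from the text above.

```python
def expand_plugins(selected, required_plugins):
    """Return the selected plugins plus everything they transitively require."""
    resolved, queue = [], list(selected)
    while queue:
        name = queue.pop(0)
        if name in resolved:
            continue  # already included, also guards against dependency cycles
        resolved.append(name)
        queue.extend(required_plugins.get(name, []))
    return resolved

REQUIRED = {"singlefile": ["chrome"]}  # hypothetical dependency table

print(expand_plugins("singlefile,favicon".split(","), REQUIRED))
# ['singlefile', 'favicon', 'chrome']
```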

Related options: ENABLED_PLUGINS

`ENABLED_PLUGINS`

Possible Values: [""]/wget,chrome,singlefile/...

Comma-separated override of the enabled plugin set, used primarily by the admin UI and REST API to express "these are the plugins I want enabled for this Crawl/Snapshot/Persona" without having to flip every individual <PLUGIN>_ENABLED flag.

The distinction vs. PLUGINS:

PLUGINS is the run-time selector (what to actually execute on this add invocation, with transitive dependency expansion).
ENABLED_PLUGINS is the persisted enabled set (what the UI/API thinks should be on for this scope, used to compute per-plugin <PLUGIN>_ENABLED defaults).

When both are set, PLUGINS wins for the actual run; ENABLED_PLUGINS remains as the stored default for future runs at the same scope.

Related options: PLUGINS

Server Settings

Options for the web UI, authentication, subdomain routing, and reverse proxy configuration.

`ADMIN_USERNAME` / `ADMIN_PASSWORD`

Possible Values: [None]/"admin"/...

Only used on first run / initial setup in Docker. ArchiveBox will create an admin superuser with the specified username and password when both options are present in the environment at startup. After the user exists, changing these values has no effect — use archivebox manage changepassword <username> or the Django admin UI instead.

Warning

Setting ADMIN_PASSWORD via environment variable bakes the secret into your shell history, Docker inspect output, and process listings. For long-lived deployments, set it once during provisioning, create the user, then unset the variable.

More info:

https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-up-Authentication

Related options: LDAP_ENABLED, REVERSE_PROXY_USER_HEADER

`PUBLIC_INDEX` / `PUBLIC_ADD_VIEW`

Possible Values: [True]/False (for PUBLIC_INDEX), [False]/True (for PUBLIC_ADD_VIEW)

Server-wide toggles for whether login is required to use each public area of ArchiveBox.

archivebox config --set PUBLIC_INDEX=True        # allow viewing the snapshot index without login
archivebox config --set PUBLIC_ADD_VIEW=False    # require login to submit new URLs via the web UI

PUBLIC_INDEX (default True) — when on, anonymous visitors can browse the snapshot list page. Individual snapshot visibility is still gated by each Snapshot's own PERMISSIONS field.
PUBLIC_ADD_VIEW (default False) — when on, anonymous visitors can submit new URLs to be archived via the /add form. Leave this off on any internet-exposed instance unless you actively want a public submission endpoint.

Note

PUBLIC_SNAPSHOTS has been removed as a global toggle. Snapshot visibility is now decided per-Snapshot via the PERMISSIONS field (public / unlisted / private) under General Settings. The old anchors are preserved on PERMISSIONS so existing links keep working.

Related options: PERMISSIONS, SERVER_SECURITY_MODE, ADMIN_USERNAME

`SECRET_KEY`

Possible Values: auto-generated 50-character random string

Django's secret key, used for cryptographic signing of sessions, CSRF tokens, password reset links, and other signed payloads. Auto-generated on first server start and persisted to ArchiveBox.conf so it survives restarts. If the config file isn't writable (read-only mount, mid-init race), an in-memory random key is used and all users are logged out on the next boot.

Warning

Treat this value like a password. Anyone with the SECRET_KEY can forge sessions and CSRF tokens for your instance. Don't commit ArchiveBox.conf to public repos, and rotate it (forcing all users to log in again) if you suspect it's been exposed.

`BIND_ADDR`

Possible Values: [127.0.0.1:8000]/0.0.0.0:8000/[::]:8000/0.0.0.0:80/...

The host:port socket the ArchiveBox web server actually listens on. This is the local bind socket, not the public URL — for the public URL clients see, set BASE_URL.

127.0.0.1:8000 (default) — listen only on the loopback interface. Safest when you're running a reverse proxy on the same host and don't want the server reachable directly from the network.
0.0.0.0:8000 — listen on all IPv4 interfaces. Required when running in Docker without --network=host, or when you want the server reachable from other machines on your LAN without a reverse proxy.
[::]:8000 — listen on all IPv6 interfaces (most modern OSes will accept v4-mapped connections too).
unix:/path/to/archivebox.sock — bind to a Unix socket instead of a TCP port (useful for nginx/Caddy on the same host).

IPv6 literal addresses must be bracketed: [::1]:8000, not ::1:8000.
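The accepted shapes of BIND_ADDR can be illustrated with a small parser. This is a sketch under assumptions: parse_bind_addr and its return tuple are hypothetical, and ArchiveBox's own validation may accept or reject other forms.

```python
def parse_bind_addr(value):
    """Split a BIND_ADDR-style string into (kind, host_or_path, port)."""
    if value.startswith("unix:"):
        return ("unix", value[len("unix:"):], None)
    host, sep, port = value.rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"expected host:port, got {value!r}")
    if host.startswith("["):
        if not host.endswith("]"):
            raise ValueError(f"unterminated IPv6 bracket in {value!r}")
        return ("tcp6", host[1:-1], int(port))
    if ":" in host:
        raise ValueError("IPv6 literals must be bracketed, e.g. [::1]:8000")
    return ("tcp", host, int(port))

print(parse_bind_addr("127.0.0.1:8000"))  # ('tcp', '127.0.0.1', 8000)
print(parse_bind_addr("[::]:8000"))       # ('tcp6', '::', 8000)
```

Passing the unbracketed form ::1:8000 raises ValueError in this sketch, which mirrors the bracketing rule stated above.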

Note

Inside Docker, binding to 127.0.0.1 means the server is unreachable from outside the container — use 0.0.0.0:8000 and let Docker handle the port-forwarding, or publish the port with -p 127.0.0.1:8000:8000 on the host side instead.

Related options: BASE_URL, SERVER_SECURITY_MODE

`BASE_URL`

Possible Values: [""]/https://archive.example.com/http://archivebox.localhost:8000/...

The canonical public URL of your ArchiveBox instance. Used to build absolute links in templates, redirects (/admin/login/?next=...), admin notification emails, OG/meta tags, and — in subdomain security mode — to derive the admin., web., api., public., and per-snapshot snap-<id>. subdomains.

When BASE_URL is set explicitly, ArchiveBox treats it as the source of truth and ignores the incoming Host header for URL building. In safe-subdomains-fullreplay mode this is required for redirects to work — without an explicit base, the middleware can't safely emit admin.<host> redirects (they'd compound onto whatever subdomain the request already arrived on).

When BASE_URL is empty, the value is resolved at request time from the incoming request's Host header (with any leading admin. / web. / api. / public. / snap-*. label stripped to recover the canonical base). Loopback and wildcard-address hostnames (localhost, 127.0.0.1, 0.0.0.0, ::) are rewritten to archivebox.localhost so subdomain routing works without /etc/hosts edits. If there's no live request, BIND_ADDR is used as a last resort.
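A rough sketch of the Host-header fallback described above: strip one routing label, then rewrite local hostnames. The helper names are hypothetical and the port handling is an assumption; the real middleware likely covers more edge cases.

```python
import re

_ROUTING_LABEL = re.compile(r"^(?:admin|web|api|public|snap-[^.]+)\.")
_LOCAL_HOSTS = {"localhost", "127.0.0.1", "0.0.0.0", "::"}

def split_host_port(value):
    """Separate an optional port, handling bracketed IPv6 literals."""
    if value.startswith("["):
        host, _, rest = value[1:].partition("]")
        return host, rest.lstrip(":")
    if value.count(":") == 1:
        host, port = value.split(":")
        return host, port
    return value, ""

def canonical_host(host_header):
    """Recover the base host from a Host header value."""
    host, port = split_host_port(host_header.lower())
    host = _ROUTING_LABEL.sub("", host, count=1)
    if host in _LOCAL_HOSTS:
        host = "archivebox.localhost"
    return f"{host}:{port}" if port else host

print(canonical_host("admin.archive.example.com"))  # archive.example.com
print(canonical_host("localhost:8000"))             # archivebox.localhost:8000
```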

The scheme is taken from the explicit BASE_URL if set, otherwise from the request (so put a reverse proxy in front for HTTPS and trust X-Forwarded-Proto).

ArchiveBox automatically derives the underlying Django ALLOWED_HOSTS and CSRF_TRUSTED_ORIGINS settings from BASE_URL + SERVER_SECURITY_MODE, so you do not set those directly — the system widens them as needed to admit the admin/web/api/public subdomains.

Note

In safe-subdomains-fullreplay mode, pin BASE_URL explicitly. Without it, the misconfig banner will surface in the rendered page and host-based redirects (/admin → admin.<host>) are suppressed to avoid the admin.admin.admin.<host> compounding bug.

Note

Legacy upgrade path (0.7.3 → 0.9): older deployments that set CSRF_TRUSTED_ORIGINS=https://archive.example.com for their reverse-proxy login but never set BASE_URL still work — when exactly one CSRF origin is present and BASE_URL is empty, ArchiveBox uses that origin as the implicit base URL. New installs should set BASE_URL directly; CSRF_TRUSTED_ORIGINS is no longer a user-settable knob.

Related options: SERVER_SECURITY_MODE, BIND_ADDR

`SERVER_SECURITY_MODE`

Possible Values: [safe-subdomains-fullreplay]/safe-onedomain-nojsreplay/unsafe-onedomain-noadmin/danger-onedomain-fullreplay

The top-level security posture of the server. Controls how archived content is served, whether the admin/API control plane is reachable, and which host(s) the UI is split across. This is the most important security knob — pick the most restrictive mode that still works for your use case.

ArchiveBox splits its surfaces across four logical hosts: admin.* (Django admin + session cookies, the entire control plane), web.* (logged-in browsing UI), api.* (REST/JSON endpoints), and public.* (unauthenticated browsing of PERMISSIONS=public snapshots). In subdomain mode each gets its own host derived from BASE_URL; session/CSRF cookies are scoped to admin.* only, so a compromised replay page on snap-<id>.* can't read admin auth.
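For a concrete picture of the host split, the sketch below derives the per-surface origins from a BASE_URL value. The function name and the returned dictionary shape are hypothetical; only the admin/web/api/public/snap-<id> labels come from this page.

```python
from urllib.parse import urlsplit

def subdomain_origins(base_url, snapshot_id=None):
    """Build the per-surface origins that subdomain mode would route to."""
    parts = urlsplit(base_url)
    origins = {
        label: f"{parts.scheme}://{label}.{parts.netloc}"
        for label in ("admin", "web", "api", "public")
    }
    if snapshot_id is not None:
        origins["snap"] = f"{parts.scheme}://snap-{snapshot_id}.{parts.netloc}"
    return origins

origins = subdomain_origins("https://archive.example.com", snapshot_id="abc123")
print(origins["admin"])  # https://admin.archive.example.com
print(origins["snap"])   # https://snap-abc123.archive.example.com
```

Because each surface is a distinct origin, browser same-origin rules keep scripts on a snap-<id> host away from cookies scoped to the admin host.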

| Mode | Host layout | JS replay | Control plane | Use when |
| --- | --- | --- | --- | --- |
| `safe-subdomains-fullreplay` (default, recommended) | admin/web/api/public/snap-* on separate subdomains | Full JS replay enabled | Enabled on `admin.*` only | You have wildcard DNS (`*.archive.example.com`) and a TLS cert that covers it. Archived JS runs sandboxed away from the admin origin. |
| `safe-onedomain-nojsreplay` | Everything on one host | JS in replays is neutered (served as `text/plain` or stripped) | Enabled | You can't get wildcard DNS. Trades replay fidelity for same-origin safety; archived pages won't execute scripts. |
| `unsafe-onedomain-noadmin` | Everything on one host | Full JS replay enabled | Disabled: `/admin`, `/accounts`, `/api`, `/add`, `/web` return 403; only GET/HEAD/OPTIONS allowed | Read-only public archive on a single host. Operate the instance via CLI only; the web admin is unreachable. |
| `danger-onedomain-fullreplay` | Everything on one host | Full JS replay enabled | Enabled | Local dev / trusted-network only. Archived JS runs on the same origin as the admin UI, so a malicious archived page can call admin endpoints with your session. Do not expose this mode to the internet. |

Warning

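Switching to any mode whose name starts with unsafe- or danger- is logged at startup and surfaces a banner in the UI. Never expose danger-onedomain-fullreplay on a public hostname: archived JavaScript runs on the same origin as your admin session. unsafe-onedomain-noadmin disables the control plane, but archived JavaScript still shares a single origin with the rest of the site.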

Note

Subdomain mode requires both wildcard DNS (*.archive.example.com) and (if using TLS) a wildcard certificate. Without those, fall back to safe-onedomain-nojsreplay.

Related options: BASE_URL, PERMISSIONS

More info:

Security Overview

`SNAPSHOTS_PER_PAGE`

Possible Values: [40]/100/...

Maximum number of Snapshots to render per page on the snapshot list views (both the admin index and the public index). Larger values speed up bulk browsing at the cost of heavier per-request rendering.

`FOOTER_INFO`

Possible Values: [Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests.]/...

Free-form text rendered in the footer of every archive page. Useful for adding a takedown contact, an org disclaimer, or attribution. Plain text — no HTML.

`CUSTOM_TEMPLATES_DIR`

Possible Values: [data/custom_templates]//path/to/custom_templates/...

Path to a directory containing custom HTML / CSS / image overrides for the default ArchiveBox templates. Files placed here shadow the built-in templates of the same path, letting you rebrand the UI without forking. See the Django template loader docs for the resolution order.

`REVERSE_PROXY_USER_HEADER`

Possible Values: [Remote-User]/X-Remote-User/X-Forwarded-User/...

HTTP header your reverse proxy (Authelia, oauth2-proxy, Authentik, nginx auth_request, etc.) sets to the authenticated username. ArchiveBox's ReverseProxyAuthMiddleware reads this header only when the request's source IP is inside REVERSE_PROXY_WHITELIST — otherwise the header is ignored to prevent direct-connect spoofing.

The header name is matched case-insensitively and normalized to the HTTP_* form Django exposes (e.g. Remote-User → HTTP_REMOTE_USER).

Related options: REVERSE_PROXY_WHITELIST, LOGOUT_REDIRECT_URL

`REVERSE_PROXY_WHITELIST`

Possible Values: [""]/172.16.0.0/16/10.0.0.5/32,fd00::/8/...

Comma-separated list of IPv4 / IPv6 addresses or CIDR networks that are trusted to set REVERSE_PROXY_USER_HEADER. When empty (the default), reverse-proxy auth is completely disabled — the header is never consulted no matter who set it.

When non-empty, only requests whose REMOTE_ADDR falls inside one of the listed networks have the header honored. Anything else falls back to standard session auth. The CIDR list is validated on every request; an invalid entry raises ImproperlyConfigured and breaks the server, so test changes carefully.
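The trust check can be sketched with Python's standard ipaddress module. The function names and the plain-dict request metadata are hypothetical; the real middleware raises ImproperlyConfigured on bad entries, whereas this sketch would surface the ValueError from ipaddress.

```python
import ipaddress

def to_meta_key(header_name):
    """Remote-User -> HTTP_REMOTE_USER, the form Django exposes in request.META."""
    return "HTTP_" + header_name.upper().replace("-", "_")

def trusted_username(meta, header_name="Remote-User", whitelist=""):
    """Return the proxy-supplied username only if the client IP is whitelisted."""
    networks = [
        ipaddress.ip_network(item.strip(), strict=False)
        for item in whitelist.split(",") if item.strip()
    ]
    if not networks:
        return None  # empty whitelist: reverse-proxy auth is disabled
    client = ipaddress.ip_address(meta["REMOTE_ADDR"])
    if not any(client.version == net.version and client in net for net in networks):
        return None  # untrusted source: ignore the header entirely
    return meta.get(to_meta_key(header_name)) or None

meta = {"REMOTE_ADDR": "172.16.4.2", "HTTP_REMOTE_USER": "alice"}
print(trusted_username(meta, whitelist="172.16.0.0/16"))  # alice
print(trusted_username(meta, whitelist=""))               # None
```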

Warning

Set this to the actual IP of your reverse proxy, never 0.0.0.0/0 or a public network. With a wide-open whitelist, anyone who can reach the server directly can forge any username they like via the header.

Related options: REVERSE_PROXY_USER_HEADER, LOGOUT_REDIRECT_URL

`LOGOUT_REDIRECT_URL`

Possible Values: [/]/https://example.com/some/other/app//accounts/logout-landing//...

URL users are redirected to after logging out. The default / keeps users on ArchiveBox; set this to an external URL when using reverse-proxy SSO so logout terminates the upstream session too (e.g. https://auth.example.com/logout).

LDAP Settings

Options for LDAP / Active Directory authentication via django-auth-ldap. Requires pip install archivebox[ldap], which in turn needs the system libldap / libsasl development headers to be installed.

`LDAP_ENABLED`

Possible Values: [False]/True

Master switch for LDAP authentication. When True, ArchiveBox loads the django-auth-ldap backend and validates that LDAP_SERVER_URI, LDAP_BIND_DN, LDAP_BIND_PASSWORD, and LDAP_USER_BASE are all set — startup fails fast otherwise.

pip install archivebox[ldap]

Then set these configuration values:

LDAP_ENABLED: True
LDAP_SERVER_URI: "ldap://ldap.example.com:3389"
LDAP_BIND_DN: "ou=archivebox,ou=services,dc=ldap.example.com"
LDAP_BIND_PASSWORD: "secret-bind-user-password"
LDAP_USER_BASE: "ou=users,ou=archivebox,ou=services,dc=ldap.example.com"
LDAP_USER_FILTER: "(uid=%(user)s)"
LDAP_USERNAME_ATTR: "username"
LDAP_FIRSTNAME_ATTR: "givenName"
LDAP_LASTNAME_ATTR: "sn"
LDAP_EMAIL_ATTR: "mail"
LDAP_CREATE_SUPERUSER: False

More info:

Related options: ADMIN_USERNAME, REVERSE_PROXY_USER_HEADER

`LDAP_SERVER_URI`

Possible Values: [None]/ldap://ldap.example.com:389/ldaps://ldap.example.com:636/...

URI of the LDAP server to bind against. Use ldaps:// for TLS or ldap:// for plaintext (plus optional StartTLS at the protocol level). Required when LDAP_ENABLED is True.

`LDAP_BIND_DN`

Possible Values: [None]/cn=archivebox,ou=services,dc=example,dc=com/...

Distinguished name of the service account used to perform user searches. This account only needs read access to the user subtree under LDAP_USER_BASE. Required when LDAP_ENABLED is True.

`LDAP_BIND_PASSWORD`

Possible Values: [None]/<bind-service-password>/...

Password for the LDAP_BIND_DN service account. Required when LDAP_ENABLED is True.

Warning

Treat this like any other service credential — keep it out of shell history and version control. Prefer setting it via the config file (which has owner-only permissions) over environment variables.

`LDAP_USER_BASE`

Possible Values: [None]/ou=users,dc=example,dc=com/...

Base DN under which to search for user entries. Required when LDAP_ENABLED is True. The search is performed as LDAP_BIND_DN with the filter from LDAP_USER_FILTER.

`LDAP_USER_FILTER`

Possible Values: [(uid=%(user)s)]/(sAMAccountName=%(user)s)/(&(objectClass=person)(mail=%(user)s))/...

LDAP search filter used to find a user entry at login. The literal token %(user)s is replaced with the username the user typed into the login form. Common values:

(uid=%(user)s) — OpenLDAP-style
(sAMAccountName=%(user)s) — Active Directory
(mail=%(user)s) — match by email
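The `%(user)s` token follows Python's named-placeholder string formatting, so the substitution can be sketched as below. The helper names are hypothetical, and the escaping step (per RFC 4515) is an assumption about what a careful implementation does before substituting user input; it is not taken from the ArchiveBox source.

```python
def escape_filter_value(value: str) -> str:
    """Escape the characters that are special inside an LDAP search filter (RFC 4515)."""
    replacements = {
        "\\": r"\5c",
        "*": r"\2a",
        "(": r"\28",
        ")": r"\29",
        "\x00": r"\00",
    }
    return "".join(replacements.get(ch, ch) for ch in value)

def render_user_filter(template: str, username: str) -> str:
    """Substitute the typed username into a filter template such as (uid=%(user)s)."""
    return template % {"user": escape_filter_value(username)}

print(render_user_filter("(uid=%(user)s)", "alice"))
print(render_user_filter("(sAMAccountName=%(user)s)", "alice"))
```

Escaping matters because an unescaped `*` or `)` typed into the login form would otherwise change the meaning of the search filter.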

`LDAP_USERNAME_ATTR`

Possible Values: [username]/uid/sAMAccountName/...

LDAP attribute on the user entry that becomes the local Django username. Must be unique within the directory.

`LDAP_FIRSTNAME_ATTR`

Possible Values: [givenName]/...

LDAP attribute mapped to Django's User.first_name.

`LDAP_LASTNAME_ATTR`

Possible Values: [sn]/...

LDAP attribute mapped to Django's User.last_name.

`LDAP_EMAIL_ATTR`

Possible Values: [mail]/userPrincipalName/...

LDAP attribute mapped to Django's User.email.

`LDAP_CREATE_SUPERUSER`

Possible Values: [False]/True

When True, every LDAP user who successfully authenticates is auto-promoted to Django superuser. Off by default — leave it off unless your directory's user base is already restricted to operators, since superusers can modify config, delete snapshots, and run server commands.

Warning

Combining LDAP_CREATE_SUPERUSER=True with a broad LDAP_USER_BASE (e.g. an entire company OU) effectively grants admin to every employee. Scope the user base or use group-based access control via django-auth-ldap's AUTH_LDAP_USER_FLAGS_BY_GROUP (configured in custom settings.py) instead.

Storage Settings

Options for the on-disk layout, file permissions, and temp/lib directories that ArchiveBox reads and writes during archiving.

`OUTPUT_PERMISSIONS`

Possible Values: [644]/755/...

Permissions to set on output files written into the ARCHIVE_DIR. The directory mode is derived from this by OR-ing in the execute bits (so 644 files imply 755 dirs). This subsumes the legacy DIR_OUTPUT_PERMISSIONS option (formerly a separate field defaulting to 755); directory mode is no longer settable on its own.

Note

Set this to 600 if you want archives to be readable only by the ArchiveBox user, or 664/775 if you need a shared group to read/write the data dir.

Related options: PUID / PGID, ENFORCE_ATOMIC_WRITES
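To make the file-mode to directory-mode derivation concrete, here is a small sketch. It assumes the rule is "add the execute (traverse) bit for every class that has the read bit", which matches the 644 to 755 and 664 to 775 examples above; the exact rule ArchiveBox applies may differ, and the function name is hypothetical.

```python
def dir_mode_from_file_mode(file_mode: str) -> int:
    """Derive a directory mode from an octal file-mode string like "644"."""
    mode = int(file_mode, 8)
    # Shift each read bit (4) down to the execute position (1) and OR it in.
    execute_bits = (mode & 0o444) >> 2
    return mode | execute_bits

for file_mode in ("644", "664", "600"):
    print(file_mode, "->", oct(dir_mode_from_file_mode(file_mode))[2:])
```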

`PUID` / `PGID`

Possible Values: [911]/1000/...

Note: These are Docker-only environment variables. They only take effect when set on the Docker entrypoint at container startup. Setting them in ArchiveBox.conf or via archivebox config --set has no effect. Outside Docker the UID/GID is auto-detected from the ownership of the data directory (or the running user) and cannot be overridden.

The UID/GID that the ArchiveBox process should run as (and that all files in the data dir should be owned by). Honored by the Docker entrypoint, which chowns the data dir and drops privileges before running ArchiveBox. Outside Docker, ArchiveBox refuses to run as root and instead drops to the user that owns the data dir.
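Outside Docker, the ownership-based detection mentioned above amounts to reading the owner of the data directory. The following is only a sketch of that idea using `os.stat` on a Unix-like system; the function name is hypothetical and the real detection logic may consider more cases.

```python
import os

def owner_of_data_dir(data_dir: str) -> tuple:
    """Return the (uid, gid) that owns data_dir, as reported by the filesystem."""
    info = os.stat(data_dir)
    return info.st_uid, info.st_gid

uid, gid = owner_of_data_dir(".")
print(f"data dir is owned by uid={uid} gid={gid}")
```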

`ENFORCE_ATOMIC_WRITES`

Possible Values: [True]/False

Whether to write output files atomically (write to a tempfile, then rename() into place) so that a crash or kill -9 mid-write can never leave a partial file in the archive. Disable only if you are debugging a filesystem that doesn't support atomic renames (some FUSE mounts).
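The tempfile-plus-rename pattern can be sketched with the standard library as follows. This is a generic illustration of the technique, not ArchiveBox's own helper; the function name is hypothetical.

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write data to path so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The tempfile must live in the same directory (same filesystem) as the
    # target, otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())  # make sure the bytes are on disk first
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        # Clean up the partial tempfile; the destination is left untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process dies before `os.replace`, only the hidden tempfile is left behind and the existing output file is unchanged.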

`TMP_DIR`

Possible Values: [`<DATA_DIR>/tmp/<machine_id>`]/`/tmp/archivebox/abc5d851`/...

Path for temporary files, the supervisord unix socket, and generated supervisor config. The default is a per-machine subdirectory under the data dir (tmp/<machine_id>) so multiple machines sharing the same data dir (e.g., over NFS) don't collide on socket files.

Warning

TMP_DIR must be a short, local path readable/writable by the ArchiveBox user. Unix socket paths have a hard ~96-character limit, so a deeply nested TMP_DIR will silently break the supervisor. It also must live on a real local filesystem (tmpfs/SSD) — FUSE, network mounts, and Docker bind mounts on macOS often cannot host unix sockets at all (see ALLOW_NO_UNIX_SOCKETS).

If ArchiveBox detects the configured TMP_DIR is unwritable or too long, it will auto-fall-back to /tmp/archivebox/<collection_id> at startup.

Related options: LIB_DIR, ALLOW_NO_UNIX_SOCKETS
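A quick way to sanity-check a candidate TMP_DIR against the socket path limit is sketched below. The 96-byte threshold is taken from the warning above; the socket filename and the function name are hypothetical placeholders for this example.

```python
import os

MAX_SOCKET_PATH_BYTES = 96  # approximate limit quoted in the warning above

def socket_path_fits(tmp_dir: str, socket_name: str = "supervisord.sock") -> bool:
    """Check whether a socket file inside tmp_dir stays within the path limit.

    socket_name is a made-up example filename, not necessarily the real one.
    """
    full_path = os.path.join(tmp_dir, socket_name)
    return len(full_path.encode("utf-8")) <= MAX_SOCKET_PATH_BYTES

print(socket_path_fits("/tmp/archivebox/abc5d851"))
print(socket_path_fits("/mnt/" + "deeply/nested/" * 10 + "tmp"))
```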

`LIB_DIR`

Possible Values: [`<DATA_DIR>/lib/<arch>-<os>`]/`/opt/archivebox/lib`/`~/.config/abx/lib`/...

Path for installed binary dependencies (chromium, single-file, yt-dlp, ripgrep, etc.) managed by abxpkg. The default is namespaced by architecture/OS (e.g. arm64-darwin, x86_64-linux-docker) so the same data dir can be safely mounted into containers with different CPU architectures without re-downloading binaries.

Note

LIB_DIR can grow to several GB. Put it on a fast local disk — running extractors off a network-mounted LIB_DIR will be painfully slow.

Related options: LIB_BIN_DIR, TMP_DIR

`LIB_BIN_DIR`

Possible Values: [<LIB_DIR>/bin]

Path where installed binaries are symlinked for a flat, shared lookup PATH. Both abxpkg and abx-dl build the executable-resolution environment from this directory at exec time, so anything dropped (or symlinked) here becomes available to all extractor hooks.

Almost no one needs to change this — it tracks LIB_DIR automatically when LIB_DIR is overridden.

`DATA_DIR`

Possible Values: [`<cwd>`]/`/data`/`~/archivebox-data`/...

The root of an ArchiveBox collection. Holds index.sqlite3, ArchiveBox.conf, the ARCHIVE_DIR, PERSONAS_DIR, sources/, logs/, cache/, etc.

Normally you do not set this explicitly — instead you cd into the data folder and run archivebox there, and DATA_DIR defaults to the current working directory. The DATA_DIR environment variable is available as an override (used internally by the test suite and some wrappers), but if it's set it must match the cwd or ArchiveBox will refuse to start — this is a guardrail against accidentally pointing two different processes at different roots.

Warning

ArchiveBox refuses to run as root, refuses to run from an unwritable directory, and refuses to run when DATA_DIR disagrees with the current working directory. Always cd into your data folder first.
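The "DATA_DIR must match the cwd" guardrail can be pictured with a few lines of `pathlib`. This is a sketch under the assumption that both paths are compared after resolving symlinks; the function name and error message are hypothetical.

```python
from pathlib import Path

def resolve_data_dir(environ: dict, cwd: str) -> Path:
    """Return the effective data dir, refusing a DATA_DIR override that disagrees with cwd."""
    cwd_path = Path(cwd).resolve()
    override = environ.get("DATA_DIR")
    if override and Path(override).resolve() != cwd_path:
        raise RuntimeError(
            f"DATA_DIR={override} does not match the current directory {cwd_path}; "
            "cd into your data folder first"
        )
    return cwd_path
```

With no override set, the function simply returns the current working directory, which mirrors the default behaviour described above.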

`ARCHIVE_DIR`

Possible Values: [<DATA_DIR>/archive]

Where Snapshot output directories are written. This is the heavy directory: every archived URL gets a subtree here. Override it when you want index/config to live on a small fast disk but snapshot data on bulk storage:

archivebox config --set ARCHIVE_DIR=/mnt/bulk/archivebox/archive

Relative paths are resolved against DATA_DIR.

Related options: USERS_DIR, DATA_DIR

`USERS_DIR`

Possible Values: [<ARCHIVE_DIR>/users]

Root of the per-user namespace inside the archive. Each ArchiveBox user gets a subdir (users/<username>/crawls/... and users/<username>/snapshots/...) so multiple users sharing one collection do not collide on output paths, and per-user retention/permission policies are easy to enforce at the filesystem level.

Relative paths are resolved against ARCHIVE_DIR.
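The relative-path rules for ARCHIVE_DIR and USERS_DIR compose naturally with `pathlib`, where joining an absolute path replaces the base. The sketch below illustrates that behaviour on POSIX-style paths; the function name is hypothetical and this is not ArchiveBox's own resolver.

```python
from pathlib import Path

def resolve_storage_dirs(data_dir: str, archive_dir: str = "archive", users_dir: str = "users") -> tuple:
    """Resolve ARCHIVE_DIR against DATA_DIR, then USERS_DIR against ARCHIVE_DIR."""
    # Path joining keeps a relative value under the base and lets an
    # absolute value override the base entirely.
    archive = Path(data_dir) / archive_dir
    users = archive / users_dir
    return archive, users

print(resolve_storage_dirs("/data"))
print(resolve_storage_dirs("/data", archive_dir="/mnt/bulk/archivebox/archive"))
```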

`PERSONAS_DIR`

Possible Values: [<DATA_DIR>/personas]

Where persona state lives: Chrome user-data-dirs, cookie jars, sessionstorage, and any other auth/profile state that should follow a "persona" across snapshots. Each persona owns a subdirectory here that gets bind-mounted (or pointed at via CHROME_USER_DATA_DIR) when extractors run on its behalf.

Warning

PERSONAS_DIR typically contains plaintext cookies and logged-in browser sessions. Treat it as secret material — set restrictive OUTPUT_PERMISSIONS (e.g. 600) and never commit it to git or include it in shared backups without encryption.

`CRAWL_DIR`

Possible Values: runtime-injected, default None

The output directory of the currently running crawl (e.g. <USERS_DIR>/<username>/crawls/YYYYMMDD/<domain>/<crawl-id>/). Crawl-level extractors (chrome launcher, parsers, etc.) write here.

You almost never set this yourself — the snapshot/crawl orchestrator injects it into the per-call config and passes it through to plugin hooks via the CRAWL_DIR environment variable. It is documented here for plugin authors who need to read config.CRAWL_DIR from inside a hook to locate sibling crawl-level outputs.

Related options: SNAP_DIR, USERS_DIR

`SNAP_DIR`

Possible Values: runtime-injected, default None

The output directory of the currently running snapshot (e.g. <USERS_DIR>/<username>/snapshots/YYYYMMDD/<domain>/<snapshot-uuid>/). Snapshot-level extractors (screenshot, pdf, dom, singlefile, etc.) write their output into per-plugin subdirectories of this path.

Like CRAWL_DIR, this is set per-call by the orchestrator and passed to hooks via the SNAP_DIR environment variable — it is not something users configure. Documented only so plugin authors know which config key to read inside a hook.

Related options: CRAWL_DIR, ARCHIVE_DIR
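For plugin authors, reading the injected directory from inside a hook might look roughly like this. The sketch relies only on the SNAP_DIR environment variable described above; the function name and the error handling are assumptions for illustration.

```python
import os
from pathlib import Path

def plugin_output_dir(plugin_name: str, environ=None) -> Path:
    """Return the per-plugin subdirectory of the current snapshot's output dir."""
    environ = os.environ if environ is None else environ
    snap_dir = environ.get("SNAP_DIR")
    if not snap_dir:
        # SNAP_DIR is injected by the orchestrator; without it we are not in a hook.
        raise RuntimeError("SNAP_DIR is not set; this code expects to run inside a hook")
    return Path(snap_dir) / plugin_name

example_env = {"SNAP_DIR": "/data/archive/users/alice/snapshots/example"}
print(plugin_output_dir("screenshot", example_env))
```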

`ALLOW_NO_UNIX_SOCKETS`

Possible Values: [False]/True

Alias: ARCHIVEBOX_ALLOW_NO_UNIX_SOCKETS

Skip the startup check that verifies TMP_DIR can host unix-domain sockets (a real bind() on a .sock file). Set to True only when running ArchiveBox on a filesystem that cannot back unix sockets — most commonly Docker Desktop on macOS with a host bind-mounted TMP_DIR, where the osxfs/virtiofs layer rejects bind() calls.

Warning

This disables a real safety check, not a cosmetic one. When unix sockets are unavailable some plugins that talk to long-lived helpers over .sock files (supervisord control socket, browser launcher RPC) may behave unpredictably. Prefer fixing TMP_DIR to point at a tmpfs/SSD inside the container; reach for ALLOW_NO_UNIX_SOCKETS only when that's genuinely not possible.

Related options: TMP_DIR
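If you want to test by hand whether a directory can host unix-domain sockets before reaching for this option, a probe along these lines performs a real bind() on a .sock file. It is a standalone sketch for Unix-like systems (the probe filename and function name are made up), not the check ArchiveBox itself runs.

```python
import os
import socket

def can_host_unix_sockets(directory: str) -> bool:
    """Try a real bind() on a throwaway .sock file inside directory."""
    sock_path = os.path.join(directory, "probe.sock")
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(sock_path)
        return True
    except OSError:
        # Covers unsupported filesystems, over-long paths, and missing directories.
        return False
    finally:
        sock.close()
        if os.path.exists(sock_path):
            os.unlink(sock_path)
```

Call it with the directory you intend to use as TMP_DIR; a False result suggests moving TMP_DIR to a local filesystem rather than disabling the startup check.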

Database Settings

Options for tuning the SQLite index database that backs ArchiveBox's snapshot, tag, and crawl metadata.

ArchiveBox stores all of its index metadata in a single SQLite database file (index.sqlite3 inside your data directory). The defaults are tuned for nearly all users — the knobs below mostly govern lock-contention behavior, which matters when multiple workers touch the database concurrently (e.g. supervised orchestrators, parallel archivebox add runs, container restarts that race against an in-flight write, or long-running web/admin processes alongside CLI commands).

Note

These are advanced operator tuning options. If you are not actively diagnosing database is locked errors or planning a non-default storage layout, you can safely leave everything in this section at its default.

`DATABASE_NAME`

Possible Values: [`<DATA_DIR>/index.sqlite3`]/`/absolute/path/to/index.sqlite3`/...

Absolute filesystem path to the SQLite index database file. Settable as the environment variable ARCHIVEBOX_DATABASE_NAME.

By default this resolves to index.sqlite3 inside your data directory and you should not need to change it. Override only when you have a specific reason — e.g. pointing a temporary process at a snapshot of the DB for testing, running multiple ArchiveBox instances out of the same data directory against separate indexes, or relocating the index file onto a different volume.

Warning

The data directory layout (snapshots, tags, archive folders on disk) is keyed off the index database. Pointing DATABASE_NAME at a database that does not match the surrounding data directory will produce broken references and missing archive folders.

`SQLITE_JOURNAL_MODE`

Possible Values: [WAL]/DELETE/TRUNCATE/PERSIST/MEMORY/OFF

SQLite journal mode, applied via PRAGMA journal_mode = ... on every new connection. Settable as ARCHIVEBOX_SQLITE_JOURNAL_MODE.

The default WAL (Write-Ahead Logging) lets readers and a single writer operate concurrently without blocking each other — readers see a stable snapshot while a write is in progress, instead of being serialized behind it. This is a substantial win for ArchiveBox, where the web UI, admin, and CLI workers frequently read the index while an extractor is writing.

Warning

Do not change this unless you have a specific reason. DELETE and TRUNCATE serialize all readers against any writer (much worse concurrency). MEMORY and OFF disable durable journaling and can corrupt the database on crash or power loss. WAL requires the database to live on a real local filesystem — it does not work correctly over network filesystems like NFS or SMB.
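To see what applying the journal mode on a new connection looks like, the sketch below uses Python's built-in `sqlite3` module directly. The function name is hypothetical and the code is independent of ArchiveBox's Django database backend.

```python
import sqlite3

ALLOWED_JOURNAL_MODES = {"WAL", "DELETE", "TRUNCATE", "PERSIST", "MEMORY", "OFF"}

def open_with_journal_mode(path: str, journal_mode: str = "WAL"):
    """Open a SQLite database and apply the journal mode on the new connection."""
    if journal_mode.upper() not in ALLOWED_JOURNAL_MODES:
        raise ValueError(f"unsupported journal mode: {journal_mode}")
    conn = sqlite3.connect(path)
    # The PRAGMA returns the mode that is actually in effect afterwards, which
    # can differ from the requested one (e.g. in-memory databases cannot use WAL).
    effective = conn.execute(f"PRAGMA journal_mode = {journal_mode}").fetchone()[0]
    return conn, effective
```

Checking the returned mode is a cheap way to notice when a filesystem or database type silently refuses WAL.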

`SQLITE_MMAP_SIZE`

Possible Values: [134217728] (128 MiB) on bare-metal, [0] (disabled) inside Docker / 0 / 268435456 / ...

Maximum number of bytes of the database file SQLite is allowed to map into memory via mmap(), applied via PRAGMA mmap_size = .... Settable as ARCHIVEBOX_SQLITE_MMAP_SIZE.

When mmap is enabled, SQLite reads pages directly from the OS page cache instead of issuing read() syscalls and copying into a userspace buffer — meaningfully faster page reads on large databases when there is RAM available to cache them. Setting this to 0 disables memory-mapped I/O entirely and falls back to regular read() calls.

Note: The default is 0 (disabled) inside Docker, because the container's reported memory limits often do not reflect the host page cache and large mmap regions can interact poorly with cgroup accounting. On bare-metal installs the default is 134217728 (128 MiB).

`SQLITE_TIMEOUT`

Possible Values: [30.0]/5.0/60.0/... (seconds, float)

Python sqlite3 connection-level busy timeout in seconds, passed as the timeout= argument when the Django backend opens a connection. Settable as ARCHIVEBOX_SQLITE_TIMEOUT.

This is the maximum amount of time the underlying Python driver will wait on a contended lock before raising OperationalError: database is locked. Raise it if you see spurious lock errors under sustained write contention and you would rather block than fail; lower it if you want callers to fail fast.

Related options: SQLITE_BUSY_TIMEOUT, SQLITE_LOCK_RETRY_TIMEOUT

`SQLITE_BUSY_TIMEOUT`

Possible Values: [30000]/5000/60000/... (milliseconds, integer)

SQLite-internal busy-wait timeout in milliseconds, applied via PRAGMA busy_timeout = ... on every new connection. Settable as ARCHIVEBOX_SQLITE_BUSY_TIMEOUT.

This is SQLite's own retry-on-busy loop, sitting one layer below SQLITE_TIMEOUT: when a statement encounters a write lock, SQLite will sleep and retry internally for up to this many milliseconds before returning SQLITE_BUSY to the Python driver. The default (30000 = 30 seconds) is deliberately matched to SQLITE_TIMEOUT.

Warning

Easy to confuse with SQLITE_TIMEOUT: this one is in milliseconds, that one is in seconds. Keep them aligned in real time when adjusting either.
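The seconds-versus-milliseconds mismatch is easiest to avoid by deriving one value from the other. The sketch below does that with the standard `sqlite3` module; the function name is hypothetical, and it only illustrates the unit conversion rather than how ArchiveBox wires the two settings.

```python
import sqlite3

def connect_with_aligned_timeouts(path: str, timeout_seconds: float = 30.0):
    """Open a connection whose driver timeout and PRAGMA busy_timeout agree in real time."""
    busy_timeout_ms = int(timeout_seconds * 1000)  # seconds -> milliseconds
    conn = sqlite3.connect(path, timeout=timeout_seconds)
    conn.execute(f"PRAGMA busy_timeout = {busy_timeout_ms}")
    return conn, busy_timeout_ms

conn, busy_ms = connect_with_aligned_timeouts(":memory:", timeout_seconds=30.0)
print(busy_ms, conn.execute("PRAGMA busy_timeout").fetchone()[0])
conn.close()
```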

`SQLITE_LOCK_RETRY_TIMEOUT`

Possible Values: [60.0]/0/120.0/... (seconds, float)

Total wall-clock budget in seconds that ArchiveBox's own retry loop will spend re-attempting a single locked statement before aborting it. Settable as ARCHIVEBOX_SQLITE_LOCK_RETRY_TIMEOUT.

When the SQLite driver eventually surfaces a database is locked error (after SQLITE_BUSY_TIMEOUT / SQLITE_TIMEOUT have already elapsed), ArchiveBox wraps the cursor in a higher-level retry loop that logs the lock holders and re-issues the statement. This is the maximum total time spent in that outer loop, across all retries, before giving up and raising. Set to 0 to disable the cap and retry indefinitely.

Note

The outer retry only applies to statements that are not inside an explicit transaction.atomic() block. Statements inside an explicit transaction propagate the error to the caller immediately, since silently retrying would re-execute statements the caller already considered committed.

Related options: SQLITE_LOCK_RETRY_INTERVAL

`SQLITE_LOCK_RETRY_INTERVAL`

Possible Values: [5.0]/1.0/10.0/... (seconds, float, must be > 0)

Sleep duration in seconds between successive attempts inside the ArchiveBox lock-retry loop. Settable as ARCHIVEBOX_SQLITE_LOCK_RETRY_INTERVAL.

Lower values retry more aggressively (useful if you expect locks to clear quickly and want to minimize end-to-end latency); higher values reduce log noise and wasted CPU when locks are typically held for a long time. Must be strictly greater than 0.

Related options: SQLITE_LOCK_RETRY_TIMEOUT
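How the retry timeout and retry interval interact can be sketched as a small loop. This is an illustrative model of the behaviour described above, assuming the cap is checked after each failed attempt; the function name and the injectable `sleep` / `clock` parameters are hypothetical, and it omits the lock-holder logging and the transaction.atomic() exclusion.

```python
import sqlite3
import time

def run_with_lock_retry(statement, retry_timeout=60.0, retry_interval=5.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Re-run statement() while it fails with "database is locked".

    retry_timeout is the total budget in seconds (0 means retry indefinitely);
    retry_interval is the pause between attempts and must be > 0.
    """
    if retry_interval <= 0:
        raise ValueError("retry_interval must be strictly greater than 0")
    started = clock()
    while True:
        try:
            return statement()
        except sqlite3.OperationalError as err:
            if "database is locked" not in str(err):
                raise  # unrelated database error: do not retry
            if retry_timeout > 0 and clock() - started >= retry_timeout:
                raise  # budget exhausted: surface the lock error
            sleep(retry_interval)
```

With the defaults (60.0 and 5.0), a persistently locked statement is re-attempted about every five seconds for roughly a minute before the error is raised.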

Search Settings

Options for full-text search backend configuration.

ArchiveBox can index Snapshot text/HTML output into a searchable index that powers the search bar in the Web UI and the archivebox search <query> CLI command. Multiple backend engines are supported — pick the one that best matches your collection size, available system resources, and tolerance for extra moving parts.

Note

Each backend has its own tuning knobs (e.g. Sonic host/port, ripgrep flags, SQLite FTS database path). Those backend-specific options now live with the plugin that implements them — see the abx-plugins docs for the full per-backend schema.

`SEARCH_BACKEND_ENGINE`

Possible Values: [ripgrep]/sqlite/sonic

Which search backend engine to use when running archivebox search and rendering the Web UI search bar.

ripgrep (default) — Pure filesystem grep across each Snapshot's archived output (HTML, text, metadata) via the search_backend_ripgrep plugin. No extra daemon, no extra database to maintain — just install rg and it works. Slow on very large collections (each query re-scans the disk) but always 100% correct: results reflect what's actually on disk right now, no stale index. Best choice for small-to-medium collections (≲50k snapshots) and for users who don't want to run extra services.
sonic — Fast, suggest-style fuzzy search via a running Sonic daemon (configured via the search_backend_sonic plugin). ArchiveBox pushes text into Sonic at index time and queries it at search time. Sub-millisecond queries even at very large scale, but you have to run and maintain the Sonic process (Docker compose has it built in). Best choice for large collections (≳100k snapshots) when query latency matters.
sqlite — FTS5 full-text index stored alongside ArchiveBox's main index.sqlite3, configured via the search_backend_sqlite plugin. No extra processes, no extra binary — uses the SQLite already shipped with Python. Faster than ripgrep on large collections, slightly slower than sonic, but no daemon to babysit. Good middle ground for users who want a real index without operational overhead.

Note: Backend-specific tuning (Sonic host/port/password, ripgrep flag overrides, SQLite FTS database path, indexer batch size, etc.) lives in each search-backend plugin's own config schema — see the abx-plugins docs for the full per-backend option list.

Shell Options

Options around the format & behavior of CLI output.

Most of the values in this section are auto-detected from your terminal at startup, but each can be overridden explicitly via env var, ArchiveBox.conf, or archivebox config --set — useful for CI logs, cron jobs, log files, and Docker stdout where the auto-detection isn't what you want.

`DEBUG`

Possible Values: [False]/True

Enable verbose debug mode for the entire ArchiveBox process. Automatically set to True when --debug is passed on the command line; otherwise honors the env var / config value.

When enabled this turns on:

Full Python tracebacks (instead of the trimmed friendly version) on any error
Django SQL query logging to stderr
Template auto-reload (no caching) for the web UI
Verbose plugin / hook lifecycle logging
Extra detail in archivebox version, archivebox status, and crash reports

Warning

Do not leave DEBUG=True enabled on a production / publicly-reachable server. It exposes tracebacks with file paths, SQL queries, and environment details that can leak sensitive info to anyone who triggers an error page.

Related options: USE_COLOR, SHOW_PROGRESS

`USE_COLOR`

Possible Values: [True (auto-detected)]/False

Whether to colorize console output with ANSI escape codes. Defaults to True when stdout is a TTY (interactive terminal) and False otherwise.

Override to force-off when piping archivebox output into a log file or cron-mail wrapper that doesn't strip ANSI codes (otherwise you'll see ^[[31m...^[[0m litter throughout your logs). Override to force-on for tools like script(1) or some CI runners that don't report as a TTY but do render ANSI correctly.

USE_COLOR=False archivebox add https://example.com >> archive.log

Related options: SHOW_PROGRESS, DEBUG

`SHOW_PROGRESS`

Possible Values: [True (auto-detected)]/False

Whether to render live progress bars during long-running operations (archiving, indexing, migrations). Defaults to True when stdout is a TTY, False otherwise.

Override to force-off in environments where the auto-detection is fooled into thinking it has a TTY (some Docker setups, Kubernetes log collectors, tmux/screen pipes) but the redrawing carriage-return output ends up as garbage in your logs.

SHOW_PROGRESS=False archivebox add < urls.txt

Related options: USE_COLOR
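The "auto-detect from the TTY unless explicitly overridden" behaviour shared by USE_COLOR and SHOW_PROGRESS can be sketched as follows. The function name and the set of accepted truthy strings are assumptions for this example, not ArchiveBox's actual parser.

```python
import sys

TRUTHY_STRINGS = {"1", "true", "yes", "on"}  # accepted spellings assumed for this sketch

def resolve_tty_flag(name: str, environ: dict, stream=None) -> bool:
    """Return the explicit override for name if set, else whether stream is a TTY."""
    stream = sys.stdout if stream is None else stream
    raw = environ.get(name)
    if raw is not None:
        return raw.strip().lower() in TRUTHY_STRINGS
    return stream.isatty()

# An explicit override wins regardless of where stdout is pointed.
print(resolve_tty_flag("USE_COLOR", {"USE_COLOR": "False"}))
```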

Plugin Configuration

Important

Per-plugin configuration has moved to its own documentation site. This Configuration.md doc covers only ArchiveBox's core settings. For everything that lives inside a plugin — extractor toggles, binary paths, timeouts, args, user agents, cookies, persona scoping, etc. — see:

➡️ https://archivebox.github.io/abx-plugins/

That site is regenerated from each plugin's config.json schema on every release, so it stays in sync with the code. Looking for WGET_ARGS, CHROME_USER_DATA_DIR, SCREENSHOT_RESOLUTION, YTDLP_EXTRA_ARGS, SINGLEFILE_*, SONIC_HOST, etc.? They all live there now.

Shared core options that plugins fall back to

A handful of core options (documented above on this page) act as the fallback default for every plugin that has a matching per-extractor override. If you set the core option, every plugin honors it; if you also set the plugin-specific override, that wins for just that one plugin.

| Core option (this doc) | Plugin-level overrides (see abx-plugins) |
| --- | --- |
| `TIMEOUT` | `WGET_TIMEOUT`, `CHROME_TIMEOUT`, `YTDLP_TIMEOUT`, `SINGLEFILE_TIMEOUT`, `TITLE_TIMEOUT`, `FAVICON_TIMEOUT`, ... |
| `CHECK_SSL_VALIDITY` | `WGET_CHECK_SSL_VALIDITY`, `YTDLP_CHECK_SSL_VALIDITY`, `GALLERYDL_CHECK_SSL_VALIDITY`, `CHROME_CHECK_SSL_VALIDITY`, ... |
| `USER_AGENT` | `WGET_USER_AGENT`, `CHROME_USER_AGENT`, `SINGLEFILE_USER_AGENT`, ... |
| `COOKIES_FILE` | `WGET_COOKIES_FILE`, `YTDLP_COOKIES_FILE`, `GALLERYDL_COOKIES_FILE`, `SINGLEFILE_COOKIES_FILE`, ... |
| `RESOLUTION` | `SCREENSHOT_RESOLUTION`, `PDF_RESOLUTION`, `CHROME_RESOLUTION` |
| `DEFAULT_PERSONA` | per-plugin persona scoping (browser profile / cookie jar selection) |

Tip

The resolution order for any plugin-tunable option is always: 1. <PLUGIN>_<OPTION> (explicit per-plugin override) → 2. the matching shared core option above → 3. the plugin's own hardcoded default.

So setting TIMEOUT=120 once at the top of your ArchiveBox.conf raises the timeout for every extractor at once; setting CHROME_TIMEOUT=300 on top of that lifts it further for just Chrome.
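The three-step resolution order can be expressed as a short lookup. This is a sketch over a plain dict with a hypothetical function name, meant only to mirror the order described in the tip.

```python
def resolve_plugin_option(plugin: str, option: str, config: dict, plugin_default=None):
    """Resolve an option for one plugin using the order described above."""
    specific_key = f"{plugin}_{option}"
    if specific_key in config:      # 1. explicit per-plugin override
        return config[specific_key]
    if option in config:            # 2. matching shared core option
        return config[option]
    return plugin_default           # 3. the plugin's own hardcoded default

config = {"TIMEOUT": 120, "CHROME_TIMEOUT": 300}
print(resolve_plugin_option("CHROME", "TIMEOUT", config, plugin_default=60))
print(resolve_plugin_option("WGET", "TIMEOUT", config, plugin_default=60))
print(resolve_plugin_option("WGET", "TIMEOUT", {}, plugin_default=60))
```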

Listing & setting plugin options

All plugin options can be set via the same three mechanisms as core options — env var, ArchiveBox.conf, or archivebox config --set — and inspected with archivebox config:

archivebox config                                  # show every option (core + every installed plugin)
archivebox config --get SCREENSHOT_RESOLUTION      # read one value
archivebox config --set SCREENSHOT_RESOLUTION=1920,1080
archivebox config --search wget                    # search options by name/description

Why is plugin config documented separately?

Plugin schemas evolve on their own release cadence: new extractors ship between ArchiveBox releases, options are added or renamed as hooks are revised, and pinning their docs to the core release schedule produced unavoidable drift. The per-plugin doc is auto-generated from each plugin's config.json schema at build time, so it never lags behind the code.

Configuration

General Settings

ONLY_NEW

TIMEOUT

RESOLUTION

CHECK_SSL_VALIDITY

USER_AGENT

COOKIES_FILE

DEFAULT_PERSONA

ACTIVE_PERSONA

URL_DENYLIST

URL_ALLOWLIST

TAG_SEPARATOR_PATTERN

CRAWL_MAX_URLS

CRAWL_MAX_SIZE

CRAWL_TIMEOUT

CRAWL_MAX_CONCURRENT_SNAPSHOTS

SNAPSHOT_MAX_SIZE

DELETE_AFTER

PERMISSIONS

PLUGINS

ENABLED_PLUGINS

Server Settings

ADMIN_USERNAME / ADMIN_PASSWORD

PUBLIC_INDEX / PUBLIC_ADD_VIEW

SECRET_KEY

BIND_ADDR

BASE_URL

SERVER_SECURITY_MODE

SNAPSHOTS_PER_PAGE

FOOTER_INFO

CUSTOM_TEMPLATES_DIR

REVERSE_PROXY_USER_HEADER

REVERSE_PROXY_WHITELIST

LOGOUT_REDIRECT_URL

LDAP Settings

LDAP_ENABLED

LDAP_SERVER_URI

LDAP_BIND_DN

LDAP_BIND_PASSWORD

LDAP_USER_BASE

LDAP_USER_FILTER

LDAP_USERNAME_ATTR

LDAP_FIRSTNAME_ATTR

LDAP_LASTNAME_ATTR

LDAP_EMAIL_ATTR

LDAP_CREATE_SUPERUSER

Storage Settings

OUTPUT_PERMISSIONS

PUID / PGID

ENFORCE_ATOMIC_WRITES

TMP_DIR

LIB_DIR

LIB_BIN_DIR

DATA_DIR

ARCHIVE_DIR

USERS_DIR

PERSONAS_DIR

CRAWL_DIR

SNAP_DIR

ALLOW_NO_UNIX_SOCKETS

Database Settings

DATABASE_NAME

SQLITE_JOURNAL_MODE

SQLITE_MMAP_SIZE

SQLITE_TIMEOUT

SQLITE_BUSY_TIMEOUT

SQLITE_LOCK_RETRY_TIMEOUT

SQLITE_LOCK_RETRY_INTERVAL

Search Settings

SEARCH_BACKEND_ENGINE

Shell Options

DEBUG

USE_COLOR

SHOW_PROGRESS

Plugin Configuration

➡️ https://archivebox.github.io/abx-plugins/

Shared core options that plugins fall back to

Listing & setting plugin options
