ArchiveBox/docs/Scheduled-Archiving.md
Nick Sweeting 46b547b88d docs: fix stale config refs across wiki pages + minor typo fixes
2026-05-31 03:29:27 -07:00


Scheduled Archiving

ArchiveBox now stores schedules in the database, and the orchestrator turns each one into a queued Crawl record when it becomes due. While archivebox server is running, you no longer need host cron, user crontabs, or a separate archivebox_scheduler container.

How It Works

  1. archivebox schedule ... creates a CrawlSchedule record plus a sealed template Crawl.
  2. The long-running global orchestrator inside archivebox server watches enabled schedules.
  3. When a schedule becomes due, the orchestrator creates a new queued Crawl.
  4. That queued crawl is processed the same way as UI/API-submitted work.
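The flow above can be pictured with a small toy model. This is only a sketch under stated assumptions: the `CrawlSchedule` and `Orchestrator` classes below are hypothetical stand-ins written for illustration, not ArchiveBox's real models, and a fixed interval replaces the cron expression to keep the example short.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class CrawlSchedule:
    """Hypothetical stand-in for a stored schedule (not ArchiveBox's real model)."""
    url: str
    interval: timedelta   # simplification: fixed interval instead of a cron expression
    next_run: datetime
    enabled: bool = True


@dataclass
class Orchestrator:
    """Toy orchestrator that turns due schedules into queued crawls."""
    schedules: list[CrawlSchedule] = field(default_factory=list)
    queue: list[str] = field(default_factory=list)

    def tick(self, now: datetime) -> int:
        """Queue one crawl per enabled schedule that is due; return how many were queued."""
        queued = 0
        for schedule in self.schedules:
            if schedule.enabled and schedule.next_run <= now:
                self.queue.append(schedule.url)
                # Advance past `now` so a long outage queues one crawl, not a backlog.
                while schedule.next_run <= now:
                    schedule.next_run += schedule.interval
                queued += 1
        return queued
```

Calling `tick()` repeatedly from a long-running loop mirrors step 2. A one-shot command would never call it, which matches the behavior described for archivebox add.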

One-shot foreground flows such as archivebox add ... continue to process only the crawl they were asked to run. They do not also sweep and execute unrelated scheduled crawls.

CLI Usage

cd ~/archivebox/data

archivebox schedule --every=daily --depth=1 https://example.com/feed.xml
archivebox schedule --every='0 */6 * * *' https://example.com/feed.xml
archivebox schedule --show
archivebox schedule --clear
archivebox schedule --run-all
archivebox schedule --foreground

Accepted schedule formats:

  • Aliases: minute, hour, day, week, month, year, daily, weekly, monthly, yearly
  • Cron expressions: e.g. 0 */6 * * *
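To make the two formats concrete, here is a minimal sketch of how an alias could be expanded to a five-field cron expression and checked against a timestamp. The alias mapping is an assumption chosen for illustration (ArchiveBox's actual expansion may differ), and the matcher is deliberately simplified: it requires all five fields to match, whereas standard cron treats day-of-month and day-of-week as alternatives when both are restricted.

```python
from datetime import datetime

# Assumed alias-to-cron mapping, for illustration only.
ALIASES = {
    "minute": "* * * * *",
    "hour": "0 * * * *",
    "day": "0 0 * * *", "daily": "0 0 * * *",
    "week": "0 0 * * 0", "weekly": "0 0 * * 0",
    "month": "0 0 1 * *", "monthly": "0 0 1 * *",
    "year": "0 0 1 1 *", "yearly": "0 0 1 1 *",
}


def normalize(every: str) -> str:
    """Expand an alias, or check that a literal expression has five fields."""
    text = every.strip()
    expr = ALIASES.get(text.lower(), text)
    if len(expr.split()) != 5:
        raise ValueError(f"not an alias or five-field cron expression: {every!r}")
    return expr


def field_matches(spec: str, value: int, low: int = 0) -> bool:
    """Match one field; supports '*', 'a', 'a-b', '*/n', 'a-b/n', and comma lists."""
    for part in spec.split(","):
        rng, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if rng == "*":
            if (value - low) % step_n == 0:
                return True
            continue
        lo, _, hi = rng.partition("-")
        lo_n = int(lo)
        hi_n = int(hi) if hi else lo_n
        if lo_n <= value <= hi_n and (value - lo_n) % step_n == 0:
            return True
    return False


def cron_matches(every: str, when: datetime) -> bool:
    """Return True if `when` (to the minute) satisfies the alias or expression."""
    minute, hour, dom, month, dow = normalize(every).split()
    cron_dow = (when.weekday() + 1) % 7  # cron: 0 = Sunday; Python: 0 = Monday
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(dom, when.day, low=1)
        and field_matches(month, when.month, low=1)
        and field_matches(dow, cron_dow)
    )
```

With this sketch, `cron_matches("0 */6 * * *", ...)` is true only at minute 0 of hours 0, 6, 12, and 18.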

archivebox schedule --run-all enqueues every enabled schedule immediately.

archivebox schedule --foreground runs the global orchestrator in the foreground. This is useful when archivebox server is not running and you want a dedicated long-running scheduler/worker process without the web UI.

Running archivebox schedule --every=day with no import_path creates a recurring maintenance schedule that queues archivebox://update crawls.

Docker Compose

With the new orchestrator flow, you only need the main archivebox service:

services:
  archivebox:
    image: archivebox/archivebox:dev
    command: server --quick-init 0.0.0.0:8000
    volumes:
      - ./data:/data

Create schedules with:

docker compose run --rm archivebox schedule --every=weekly --depth=1 https://example.com/feed.xml
docker compose run --rm archivebox schedule --show

If the main archivebox server container is already running, its orchestrator will pick up future scheduled runs automatically. There is no scheduler sidecar to restart.

Examples

Archive a Twitter mirror once a week:

archivebox schedule --every=weekly --depth=1 'https://nitter.net/ArchiveBoxApp'

Archive a subreddit and linked discussions once a week:

archivebox config --set URL_ALLOWLIST='^http(s)?:\/\/(.+)?teddit\.net\/?.*$'
archivebox schedule --every=weekly --depth=1 'https://teddit.net/r/DataHoarder/'
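Before relying on an allowlist pattern, it can help to try it against a few sample URLs. The sketch below does that with Python's `re` module. It assumes search-style, case-sensitive matching, which may not be exactly how ArchiveBox applies the setting, and the `is_allowed` helper is hypothetical.

```python
import re

# Pattern copied from the URL_ALLOWLIST example above.
ALLOW = re.compile(r"^http(s)?:\/\/(.+)?teddit\.net\/?.*$")


def is_allowed(url: str) -> bool:
    """Return True if the URL matches the allowlist pattern (assumed search semantics)."""
    return ALLOW.search(url) is not None


samples = [
    "https://teddit.net/r/DataHoarder/",
    "https://www.teddit.net/r/DataHoarder/comments/abc",
    "https://example.com/article",
    "https://notteddit.net/",
]
for url in samples:
    print(f"{is_allowed(url)!s:5}  {url}")
```

The last sample also matches, because the optional prefix group `(.+)?` does not have to end in a dot. If only teddit.net and its subdomains should pass, `(.+\.)?` is a stricter prefix.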

Archive Hacker News every day:

archivebox config --set URL_DENYLIST='^http(s)?:\/\/(.+\.)?(youtube\.com|amazon\.com)\/.*$'
archivebox schedule --every=daily --depth=1 'https://news.ycombinator.com'
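Because `|` has the lowest precedence in a regular expression, a deny pattern covering several domains behaves differently depending on whether the domains share one group. The sketch below compares a grouped pattern with an ungrouped variant on sample URLs. It assumes search-style matching, which may differ from how ArchiveBox applies the setting, and the `is_denied` helper is hypothetical.

```python
import re

# Both domains share one group, so the scheme prefix and trailing path apply to each.
GROUPED = re.compile(r"^http(s)?:\/\/(.+\.)?(youtube\.com|amazon\.com)\/.*$")

# Ungrouped: parsed as "<scheme + youtube.com>" OR "<amazon.com + path>",
# so the second branch can match anywhere in the URL.
UNGROUPED = re.compile(r"^http(s)?:\/\/(.+\.)?(youtube\.com)|(amazon\.com)\/.*$")


def is_denied(url: str, pattern: re.Pattern = GROUPED) -> bool:
    """Return True if the URL matches the deny pattern (assumed search semantics)."""
    return pattern.search(url) is not None


samples = [
    "https://www.youtube.com/watch?v=abc",
    "https://amazon.com/dp/123",
    "https://news.ycombinator.com/item?id=1",
    "https://example.com/?ref=amazon.com/",
]
for url in samples:
    print(f"{is_denied(url)!s:5}  {is_denied(url, UNGROUPED)!s:5}  {url}")
```

Only the last sample differs: the grouped pattern lets it through, while the ungrouped variant denies it because `amazon.com/` appears in the query string.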

Queue a daily maintenance update:

archivebox schedule --every=day