Clean rebuild of docs/apidocs/archivebox/ via autodoc2:
- Removes 19 stale module pages whose source files no longer exist
(cli_utils, host_utils, schedule_utils — renamed to *_util; actors,
ideas, debugging, folders, legacy, progress_layout, tests_piping,
config_tags, personas.runtime/views, orchestrator*, worker, tasks).
- Adds 29 new module pages for code that was added since the previous
generation but not yet documented.
- Updates 100 existing pages to reflect API surface changes (e.g.
ENABLED_PLUGINS Field removed, SNAPSHOTS_PER_PAGE clamp removed and
its default bumped to 50).
156 source modules <-> 156 apidoc files (zero drift). Build clean
under sphinx-build -W --keep-going.
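For illustration, a zero-drift check like the one reported above can be
sketched as a set comparison. This is a minimal sketch, not the script
used for this commit: module_name() and find_drift() are hypothetical
helpers, the example paths are made up, and it assumes each apidoc page
is named after the dotted module path.

```python
from pathlib import PurePosixPath


def module_name(py_path: str) -> str:
    """Turn 'archivebox/misc/util.py' into 'archivebox.misc.util'."""
    parts = PurePosixPath(py_path).with_suffix("").parts
    if parts[-1] == "__init__":
        # A package's __init__.py documents the package itself.
        parts = parts[:-1]
    return ".".join(parts)


def find_drift(source_files, apidoc_pages):
    """Return (modules without a page, pages without a module)."""
    modules = {module_name(path) for path in source_files}
    pages = set(apidoc_pages)
    return sorted(modules - pages), sorted(pages - modules)


# Hypothetical example: one module was renamed but its old page remains.
missing, stale = find_drift(
    ["archivebox/__init__.py", "archivebox/cli_util.py"],
    ["archivebox", "archivebox.cli_utils"],
)
print("missing pages:", missing)
print("stale pages:", stale)
```

Both lists being empty corresponds to the "zero drift" state.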
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
admin_snapshots.py:571 had min(max(50, SNAPSHOTS_PER_PAGE), 500), and
admin_archiveresults.py:501 had min(max(5, SNAPSHOTS_PER_PAGE), 5000).
Both clamps silently overrode out-of-range values: the documented
default of 40 sat below the Snapshot admin's floor of 50 and never took
effect there, and the ArchiveResult admin reused the same setting
without the docs mentioning it.
- Drop both clamps; admin changelists now use SNAPSHOTS_PER_PAGE as-is.
- Bump the default in common.py from 40 to 50 (matches what users were
actually seeing in the admin under the old floor).
- Add ge=1 validation so non-positive values are rejected at config
parse time instead of producing broken pagination.
- Update Configuration.md: new default 50, clarify the option drives
both Snapshot and ArchiveResult admin changelists plus the public
index, and that it must be >= 1.
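A minimal sketch of why the old floor hid the documented default. The
two clamp expressions are the ones quoted above; the function names and
the plain-Python validate_snapshots_per_page() stand-in for the ge=1
check are hypothetical, not the code in common.py.

```python
def old_snapshot_admin_per_page(configured: int) -> int:
    # Old Snapshot admin expression: floor of 50, ceiling of 500.
    return min(max(50, configured), 500)


def old_archiveresult_admin_per_page(configured: int) -> int:
    # Old ArchiveResult admin expression: floor of 5, ceiling of 5000.
    return min(max(5, configured), 5000)


def validate_snapshots_per_page(value: int) -> int:
    # Hypothetical stand-in for the ge=1 constraint: reject values below 1.
    if value < 1:
        raise ValueError("SNAPSHOTS_PER_PAGE must be >= 1")
    return value


print(old_snapshot_admin_per_page(40))       # 50: the old default was raised
print(old_archiveresult_admin_per_page(40))  # 40: within the 5..5000 range
print(validate_snapshots_per_page(50))       # 50: new default passes as-is
```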
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Clean up several inaccuracies and some over-documentation in one pass:
- ONLY_NEW: rewrite completely. The old prose ("ArchiveBox will never
re-download sites that have already succeeded previously") was carried
over from 0.7.x and is wrong in 0.9.x: setting ONLY_NEW=False (or
--no-only-new) explicitly creates a new Snapshot and re-runs every
extractor. The section now describes the actual behavior: a URL is
either skipped entirely or given a new Snapshot.
- CRAWL_MAX_CONCURRENT_SNAPSHOTS: fix the "each concurrent Snapshot
launches its own Chrome instance" claim. Chrome is crawl-scoped by
default (CHROME_ISOLATION="crawl") — concurrent Snapshots share the
crawl's Chrome via tabs, not separate browser processes.
- BASE_URL: drop the "admin.admin.admin.<host> compounding bug"
reference. Config docs shouldn't explain legacy bugs.
- Remove derived/runtime-only options that are NOT user-settable:
ACTIVE_PERSONA (set by persona resolver), CRAWL_DIR/SNAP_DIR (injected
by orchestrator per-call), DATA_DIR (derived from cwd), ARCHIVE_DIR
(derived from DATA_DIR/archive), USERS_DIR (derived from ARCHIVE_DIR),
PERSONAS_DIR (derived from DATA_DIR), LIB_BIN_DIR (tracks LIB_DIR),
DATABASE_NAME (derived from DATA_DIR/index.sqlite3). Backward-compat
<a id="..."></a> anchors preserved for all of them above the nearest
surviving heading so external links still resolve.
- LIB_DIR: fix default path. The doc claimed "<DATA_DIR>/lib/<arch>-<os>"
but constants.py:117 uses platformdirs.user_config_path("abx") / "lib"
— the XDG user-config dir, not inside the data folder. Updated to the
actual default.
- ENABLED_PLUGINS section dropped (option removed in a separate commit);
anchor redirected to PLUGINS.
- Drop the "Pydantic config" implementation-detail mention in PUID/PGID.
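The ONLY_NEW behavior described above can be sketched as a small
decision function. This is an illustration only: plan_import() and its
arguments are hypothetical names, and it assumes "already archived" is
decided by a simple URL membership test, which the real code may not do.

```python
def plan_import(urls, already_archived, *, only_new):
    """Return (url, action) pairs; action is 'skip' or 'new_snapshot'."""
    plan = []
    for url in urls:
        if only_new and url in already_archived:
            # ONLY_NEW=True: a previously archived URL is skipped entirely.
            plan.append((url, "skip"))
        else:
            # Unseen URL, or ONLY_NEW=False: a new Snapshot is created.
            plan.append((url, "new_snapshot"))
    return plan


seen = {"https://example.com/a"}
urls = ["https://example.com/a", "https://example.com/b"]
print(plan_import(urls, seen, only_new=True))
print(plan_import(urls, seen, only_new=False))
```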
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This rewrite (now reapplied on top of the wiki subtree) covers the full
session's work on Configuration.md:
- Add crawl/snapshot limits (CRAWL_MAX_URLS/SIZE/TIMEOUT,
CRAWL_MAX_CONCURRENT_SNAPSHOTS, SNAPSHOT_MAX_SIZE), DELETE_AFTER,
PERMISSIONS, PLUGINS/ENABLED_PLUGINS/ACTIVE_PERSONA.
- Add new Database Settings section (SQLITE_* tuning + DATABASE_NAME).
- Add SERVER_SECURITY_MODE deep-dive (4 modes, host-layout table).
- Add Storage path overrides (DATA_DIR, ARCHIVE_DIR, USERS_DIR,
PERSONAS_DIR, CRAWL_DIR, SNAP_DIR, ALLOW_NO_UNIX_SOCKETS).
- Remove ALLOWED_HOSTS + CSRF_TRUSTED_ORIGINS as user-settable; both
auto-derived from BASE_URL + SERVER_SECURITY_MODE. Backward-compat
anchors preserved on BASE_URL with the 0.7.3 -> 0.9 legacy upgrade note.
- Remove the entire Plugin Settings tree (~200 options, 41 subsections);
replace with prominent redirect to https://archivebox.github.io/abx-plugins/
and a "shared core options that plugins fall back to" table.
- Add 231 backward-compat <a id="..."></a> anchors so old URLs to plugin
sections / removed options / multi-option headers all still resolve
(e.g. #wget_args -> Plugin Configuration section, #public_snapshots ->
PERMISSIONS, #ssl_enabled -> Plugin Configuration, #admin_username ->
ADMIN_USERNAME/PASSWORD heading, #dir_output_permissions ->
OUTPUT_PERMISSIONS, #url_blacklist -> URL_DENYLIST).
- Fix wrong default: PUBLIC_ADD_VIEW is False, not True.
- Drop the 7 TRAFILATURA_OUTPUT_* per-format flags (replaced by a single
TRAFILATURA_OUTPUT_FORMATS in the plugin) and SSL_ENABLED/SSL_TIMEOUT
(wrong plugin namespace); anchors redirected to Plugin Configuration.
- Reframe COOKIES_FILE as low-level escape hatch; personas are the
preferred auth path.
- Link every named plugin to its specific anchor on the abx-plugins page
(e.g. WGET_TIMEOUT -> #wget, SONIC_HOST -> #search_backend_sonic).
- Strip implementation-detail mentions (Pydantic, etc.).
- Slim Shell Options down to the user-settable ones (DEBUG, USE_COLOR,
SHOW_PROGRESS); drop IS_TTY/IN_DOCKER/IN_QEMU.
- Restructure: General -> Server (+LDAP) -> Storage -> Database (new) ->
Search -> Shell -> Plugin Configuration.
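As a rough sketch of the backward-compat anchor pattern, the helper
below emits empty <a id="..."></a> tags directly above a heading so that
old fragment URLs land on the surviving section. The function name and
the "####" heading level are assumptions made for illustration; only the
anchor markup and the #url_blacklist -> URL_DENYLIST mapping come from
the notes above.

```python
from html import escape


def heading_with_legacy_anchors(heading: str, legacy_ids: list[str]) -> str:
    """Render a Markdown heading preceded by one empty anchor per old id."""
    anchors = "".join(
        f'<a id="{escape(old_id, quote=True)}"></a>\n' for old_id in legacy_ids
    )
    # The heading level is a guess; use whatever the target section uses.
    return f"{anchors}#### {heading}\n"


print(heading_with_legacy_anchors("URL_DENYLIST", ["url_blacklist"]))
```

Because the anchors carry no visible content, the rendered page is
unchanged while #url_blacklist keeps resolving.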
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove ALLOWED_HOSTS + CSRF_TRUSTED_ORIGINS as documented user options;
both are now auto-derived from BASE_URL + SERVER_SECURITY_MODE. Backward-compat
anchors preserved above BASE_URL with a brief 0.7.3->0.9 legacy upgrade note.
- Drop the lone pydantic reference in PUID/PGID; users don't care about
implementation details, only names/defaults/behavior/why.
- Reframe COOKIES_FILE as the low-level escape hatch and surface personas as
the preferred auth path (TIP admonition + persona-first example).
- Link every named plugin override to its specific anchor on the abx-plugins
page (e.g. WGET_TIMEOUT -> #wget, SONIC_HOST -> #search_backend_sonic);
retain the generic top-of-page link only where no single plugin is named.
- Drop spurious CURL_* override examples (no curl plugin exists in abx-plugins).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Snapshot detail page: embed scoped live-progress monitor (same-origin
/progress.json on whichever host the page is served from); hide admin
action buttons when scoped; per-snapshot perms via can_view_snapshot.
- crawl_file API: respect crawl-level permissions; PUBLIC/UNLISTED served
to guests, PRIVATE returns 404 for non-admin/non-owner.
- CrawlRunner: replace allow_paused_snapshot_maintenance with
allow_maintenance_on_inactive_crawl so SEALED crawls don't short-circuit
the cancellation guard for legitimate maintenance hooks (search backend
backfill, fs migration, etc.). Fixes infinite STARTED loop on snapshots
with queued search_backend results.
- Universal `--init` flag: works on any subcommand (server, update, add,
shell, install, ...). Detected at module load, stripped from argv, and
consumed in the dispatcher so subprocesses inherit a clean env.
- supervisord_util.run_runner_worker: route Ctrl+C through
supervisor.signalProcess(name, "SIGINT") instead of raw os.kill on a
cached pid, gated on statename=RUNNING. Prevents killing unrelated
processes when the worker's pid has been reused by the OS.
- Login page: remove non-functional password-reset links; add
has_real_admin_users template tag to gate the bootstrap hint.
- Add page: hide underline on the "Get the extension" link.
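The supervisord change above can be sketched roughly as follows. This is
not the code in supervisord_util: interrupt_worker() is a hypothetical
helper, FakeSupervisor is a test double, and "worker_runner" is a
made-up process name. It assumes the caller passes the "supervisor"
namespace of supervisord's XML-RPC proxy, which provides getProcessInfo
and signalProcess.

```python
def interrupt_worker(supervisor_rpc, name: str) -> bool:
    """Ask supervisord to SIGINT `name`, but only while it is RUNNING."""
    info = supervisor_rpc.getProcessInfo(name)
    if info.get("statename") != "RUNNING":
        # Not running: send nothing, so a recycled pid is never signalled.
        return False
    # supervisord resolves the name to the live pid; none is cached here.
    supervisor_rpc.signalProcess(name, "SIGINT")
    return True


class FakeSupervisor:
    """Test double standing in for the XML-RPC `supervisor` namespace."""

    def __init__(self, statename: str):
        self.statename = statename
        self.signals = []

    def getProcessInfo(self, name):
        return {"name": name, "statename": self.statename}

    def signalProcess(self, name, signal):
        self.signals.append((name, signal))
        return True


running = FakeSupervisor("RUNNING")
stopped = FakeSupervisor("STOPPED")
print(interrupt_worker(running, "worker_runner"), running.signals)
print(interrupt_worker(stopped, "worker_runner"), stopped.signals)
```

Addressing the process by name leaves pid lookup to supervisord, which
is what avoids signalling an unrelated process after pid reuse.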
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>