17 Commits

Author SHA1 Message Date
Nick Sweeting 8b57085827 chore: commit local archivebox changes 2026-06-14 11:44:38 -07:00
Nick Sweeting 303e99258c Simplify Docker data ownership detection 2026-06-10 09:13:02 -07:00
Nick Sweeting 4fa90e484a release: archivebox 0.9.34rc71 2026-06-07 20:51:28 -07:00
Nick Sweeting c075d654d8 Consolidate runtime config handling 2026-06-01 15:03:40 -07:00
Nick Sweeting 46b547b88d docs: fix stale config refs across wiki pages + minor typo fixes
Sweep of all prose doc pages to fix references that were stale, wrong,
or pointed at anchors/options that no longer exist in 0.9.x.

Critical (non-functional examples + factual errors):
- All `PUBLIC_SNAPSHOTS=...` examples (Security-Overview, Publishing-
  Your-Archive, Usage) replaced with `PERMISSIONS=public|private`.
- Setting-up-Authentication: drop the "edit CSRF_TRUSTED_ORIGINS in
  archivebox/core/settings.py source" advice (no longer user-settable);
  update auth-permissions list to use PERMISSIONS instead of
  PUBLIC_SNAPSHOTS.
- Security-Overview: SAVE_ARCHIVE_DOT_ORG (with extra underscores)
  was never real; use ARCHIVEDOTORG_ENABLED.
- Docker/Install/Usage: FETCH_TITLE/FETCH_SCREENSHOT/FETCH_PDF/FETCH_DOM
  were never aliases (only FETCH_MEDIA is); replace with real
  <PLUGIN>_ENABLED.
- Troubleshooting: CHROME_BINARY default is `chromium`, not
  `chromium-browser`. Also fixed deprecated `brew cask upgrade
  chromium-browser` -> `brew upgrade --cask chromium`.
- Docker: typo MAX_MEDIA_SIZE -> MEDIA_MAX_SIZE.

Broken Configuration anchors (must be lowercase on GitHub wiki):
- Security-Overview: #FOOTER_INFO / #OUTPUT_PERMISSIONS / #COOKIES_FILE
  -> lowercase.
- Setting-up-Authentication: combined #public_index--public_snapshots--public_add_view
  -> individual #public_index / #public_add_view / #permissions.
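
The lowercase-anchor fix above could be automated roughly as follows. This is only a sketch: the `lowercase_anchors` helper and its regex are hypothetical and not part of the ArchiveBox tooling, and it assumes anchors appear as Markdown link targets of the form `(Page#FRAGMENT)`.

```python
import re

# Matches a Markdown link target such as "(Configuration#FOOTER_INFO)".
# Assumption: link targets contain no whitespace or nested parentheses.
ANCHOR_RE = re.compile(r"\((?P<page>[^()#\s]*)#(?P<frag>[^()\s]+)\)")


def lowercase_anchors(markdown: str) -> str:
    """Lowercase only the #fragment part of each Markdown link target."""
    return ANCHOR_RE.sub(
        lambda m: f"({m.group('page')}#{m.group('frag').lower()})",
        markdown,
    )


if __name__ == "__main__":
    print(lowercase_anchors("See [FOOTER_INFO](Configuration#FOOTER_INFO)."))
```

The link text is left untouched; only the fragment inside the parentheses changes.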

Plugin option references now link to abx-plugins:
- CHROME_USER_DATA_DIR / CHROME_BINARY / CHROME_SANDBOX -> /#chrome
- RIPGREP_BINARY -> /#search_backend_ripgrep
- WGET_ENABLED / DOM_ENABLED / SAVE_WGET / SAVE_DOM -> respective anchors
- ARCHIVEDOTORG_ENABLED -> /#archivedotorg
- FAVICON_PROVIDER / FAVICON_ENABLED -> /#favicon
- MEDIA_ENABLED -> /#media

Legacy aliases:
- Scheduled-Archiving: URL_WHITELIST/URL_BLACKLIST -> URL_ALLOWLIST/
  URL_DENYLIST; dropped non-existent `--overwrite` schedule flag.

Dead source links removed:
- Usage: archivebox/main.py + archivebox/config.py (split to cli/ and
  config/common.py).
- Security-Overview: archivebox/extractors/*.py -> plugin anchors.
- Install: dead Configuration#dependency-options and
  Configuration#archive-method-toggles anchors -> abx-plugins reference.

Typo fixes (codespell):
- preferrably -> preferably, necesary -> necessary, Rasberry ->
  Raspberry, sytem -> system, Dissallow -> Disallow, whats -> what's,
  filesytem -> filesystem.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 03:29:27 -07:00
Nick Sweeting deb0940f5f fix(snapshots): honor SNAPSHOTS_PER_PAGE without silent clamping; bump default to 50
admin_snapshots.py:571 had min(max(50, SNAPSHOTS_PER_PAGE), 500), and
admin_archiveresults.py:501 had min(max(5, SNAPSHOTS_PER_PAGE), 5000).
Both clamps silently overrode the configured value: the documented
default of 40 was unreachable in the Snapshot admin, and the
ArchiveResult admin reused the same setting even though the docs never
mentioned it.

- Drop both clamps; admin changelists now use SNAPSHOTS_PER_PAGE as-is.
- Bump the default in common.py from 40 to 50 (matches what users were
  actually seeing in the admin under the old floor).
- Add ge=1 validation so non-positive values are rejected at config
  parse time instead of producing broken pagination.
- Update Configuration.md: new default 50, clarify the option drives
  both Snapshot and ArchiveResult admin changelists plus the public
  index, and that it must be >= 1.
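
To illustrate why the old clamps made the documented default unreachable, here is a minimal sketch. The two `old_*` functions reproduce the expressions quoted above; the function names and the `validate_snapshots_per_page` helper are hypothetical stand-ins, not the actual ArchiveBox code (the real check is described only as `ge=1` validation at config parse time).

```python
def old_snapshot_admin_page_size(configured: int) -> int:
    # Expression quoted from admin_snapshots.py: floor of 50, ceiling of 500.
    return min(max(50, configured), 500)


def old_archiveresult_admin_page_size(configured: int) -> int:
    # Expression quoted from admin_archiveresults.py: floor of 5, ceiling of 5000.
    return min(max(5, configured), 5000)


def validate_snapshots_per_page(value: int) -> int:
    # Hypothetical equivalent of a ge=1 constraint: reject non-positive values
    # up front and otherwise return the value unchanged (no clamping).
    if value < 1:
        raise ValueError("SNAPSHOTS_PER_PAGE must be >= 1")
    return value


if __name__ == "__main__":
    print(old_snapshot_admin_page_size(40))       # the old default of 40 was raised to 50
    print(old_archiveresult_admin_page_size(40))  # 40 passed through unchanged here
    print(validate_snapshots_per_page(40))        # new behavior: used as-is
```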

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 03:02:34 -07:00
Nick Sweeting 01d6ae2082 docs: Configuration.md accuracy pass + remove internal-only options
Several inaccuracies + over-documentation cleaned up in one pass:

- ONLY_NEW: completely rewritten. The old prose ("ArchiveBox will never
  re-download sites that have already succeeded previously") was carried
  over from 0.7.x and is wrong in 0.9.x — setting ONLY_NEW=False (or
  --no-only-new) explicitly creates a new Snapshot and re-runs every
  extractor. Now describes the actual behavior: skip URL entirely vs.
  create a new Snapshot for it.
- CRAWL_MAX_CONCURRENT_SNAPSHOTS: fix the "each concurrent Snapshot
  launches its own Chrome instance" claim. Chrome is crawl-scoped by
  default (CHROME_ISOLATION="crawl") — concurrent Snapshots share the
  crawl's Chrome via tabs, not separate browser processes.
- BASE_URL: drop the "admin.admin.admin.<host> compounding bug"
  reference. Config docs shouldn't explain legacy bugs.
- Remove derived/runtime-only options that are NOT user-settable:
  ACTIVE_PERSONA (set by persona resolver), CRAWL_DIR/SNAP_DIR (injected
  by orchestrator per-call), DATA_DIR (derived from cwd), ARCHIVE_DIR
  (derived from DATA_DIR/archive), USERS_DIR (derived from ARCHIVE_DIR),
  PERSONAS_DIR (derived from DATA_DIR), LIB_BIN_DIR (tracks LIB_DIR),
  DATABASE_NAME (derived from DATA_DIR/index.sqlite3). Backward-compat
  <a id="..."></a> anchors preserved for all of them above the nearest
  surviving heading so external links still resolve.
- LIB_DIR: fix default path. The doc claimed "<DATA_DIR>/lib/<arch>-<os>"
  but constants.py:117 uses platformdirs.user_config_path("abx") / "lib"
  — the XDG user-config dir, not inside the data folder. Updated to the
  actual default.
- ENABLED_PLUGINS section dropped (option removed in a separate commit);
  anchor redirected to PLUGINS.
- Drop the "Pydantic config" implementation-detail mention in PUID/PGID.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 02:56:49 -07:00
Nick Sweeting 249cbcdc82 fix: run crawl delete test through orchestrator 2026-05-31 02:39:51 -07:00
Nick Sweeting 113fae8cd1 docs: rewrite Configuration.md (remove plugin tree, add Database/crawl limits, redirect to abx-plugins)
This rewrite (now reapplied on top of the wiki subtree) covers the full
session's work on Configuration.md:

- Add crawl/snapshot limits (CRAWL_MAX_URLS/SIZE/TIMEOUT,
  CRAWL_MAX_CONCURRENT_SNAPSHOTS, SNAPSHOT_MAX_SIZE), DELETE_AFTER,
  PERMISSIONS, PLUGINS/ENABLED_PLUGINS/ACTIVE_PERSONA.
- Add new Database Settings section (SQLITE_* tuning + DATABASE_NAME).
- Add SERVER_SECURITY_MODE deep-dive (4 modes, host-layout table).
- Add Storage path overrides (DATA_DIR, ARCHIVE_DIR, USERS_DIR,
  PERSONAS_DIR, CRAWL_DIR, SNAP_DIR, ALLOW_NO_UNIX_SOCKETS).
- Remove ALLOWED_HOSTS + CSRF_TRUSTED_ORIGINS as user-settable; both
  auto-derived from BASE_URL + SERVER_SECURITY_MODE. Backward-compat
  anchors preserved on BASE_URL with the 0.7.3 -> 0.9 legacy upgrade note.
- Remove the entire Plugin Settings tree (~200 options, 41 subsections);
  replace with prominent redirect to https://archivebox.github.io/abx-plugins/
  and a "shared core options that plugins fall back to" table.
- Add 231 backward-compat <a id="..."></a> anchors so old URLs to plugin
  sections / removed options / multi-option headers all still resolve
  (e.g. #wget_args -> Plugin Configuration section, #public_snapshots ->
  PERMISSIONS, #ssl_enabled -> Plugin Configuration, #admin_username ->
  ADMIN_USERNAME/PASSWORD heading, #dir_output_permissions ->
  OUTPUT_PERMISSIONS, #url_blacklist -> URL_DENYLIST).
- Fix wrong default: PUBLIC_ADD_VIEW is False, not True.
- Drop the 7 TRAFILATURA_OUTPUT_* per-format flags (replaced by a single
  TRAFILATURA_OUTPUT_FORMATS in the plugin) and SSL_ENABLED/SSL_TIMEOUT
  (wrong plugin namespace); their anchors redirect to Plugin Configuration.
- Reframe COOKIES_FILE as low-level escape hatch; personas are the
  preferred auth path.
- Link every named plugin to its specific anchor on the abx-plugins page
  (e.g. WGET_TIMEOUT -> #wget, SONIC_HOST -> #search_backend_sonic).
- Strip implementation-detail mentions (Pydantic, etc.).
- Slim Shell Options down to the user-settable ones (DEBUG, USE_COLOR,
  SHOW_PROGRESS); drop IS_TTY/IN_DOCKER/IN_QEMU.
- Restructure: General -> Server (+LDAP) -> Storage -> Database (new) ->
  Search -> Shell -> Plugin Configuration.
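
The backward-compat anchors described above could be generated along these lines. This is a sketch under assumptions: the `heading_with_legacy_anchors` helper and the `###` heading level are hypothetical, and only the `<a id="..."></a>` form and the example IDs come from this commit message.

```python
def heading_with_legacy_anchors(heading: str, legacy_ids: list[str]) -> str:
    """Emit one empty <a id="..."></a> per legacy anchor, then the heading.

    Placing the anchors directly above the surviving heading lets old
    #fragment URLs scroll to the section that replaced the removed one.
    """
    anchors = "".join(f'<a id="{anchor_id}"></a>\n' for anchor_id in legacy_ids)
    return f"{anchors}{heading}"


if __name__ == "__main__":
    # e.g. old links to #public_snapshots should land on the PERMISSIONS section.
    print(heading_with_legacy_anchors("### PERMISSIONS", ["public_snapshots"]))
```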

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 02:24:36 -07:00
Nick Sweeting a70cb46032 Merge commit 'caa680a26f4a7ccaac13465e8ab7d407e524dfb8' as 'docs' 2026-05-31 02:24:01 -07:00
Nick Sweeting da5d6d6bdd chore: remove docs/ in prep for git subtree bootstrap from wiki repo
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 02:23:13 -07:00
Nick Sweeting 7ea387887c docs: tighten Configuration.md plugin-link targets and remove non-user-settable options
- Remove ALLOWED_HOSTS + CSRF_TRUSTED_ORIGINS as documented user options;
  both are now auto-derived from BASE_URL + SERVER_SECURITY_MODE. Backward-compat
  anchors preserved above BASE_URL with a brief 0.7.3->0.9 legacy upgrade note.
- Drop the lone pydantic reference in PUID/PGID; users don't care about
  implementation details, only names/defaults/behavior/why.
- Reframe COOKIES_FILE as the low-level escape hatch and surface personas as
  the preferred auth path (TIP admonition + persona-first example).
- Link every named plugin override to its specific anchor on the abx-plugins
  page (e.g. WGET_TIMEOUT -> #wget, SONIC_HOST -> #search_backend_sonic);
  retain the generic top-of-page link only where no single plugin is named.
- Drop spurious CURL_* override examples (no curl plugin exists in abx-plugins).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-31 02:10:46 -07:00
Nick Sweeting fe421fdb79 release: v0.9.33rc53 2026-05-31 01:41:43 -07:00
Nick Sweeting b0a47e8bf5 wip: snapshot live progress, universal --init, runner perms, supervisord SIGINT
- Snapshot detail page: embed scoped live-progress monitor (same-origin
  /progress.json on whichever host the page is served from); hide admin
  action buttons when scoped; per-snapshot perms via can_view_snapshot.
- crawl_file API: respect crawl-level permissions; PUBLIC/UNLISTED served
  to guests, PRIVATE returns 404 for non-admin/non-owner.
- CrawlRunner: replace allow_paused_snapshot_maintenance with
  allow_maintenance_on_inactive_crawl so SEALED crawls don't short-circuit
  the cancellation guard for legitimate maintenance hooks (search backend
  backfill, fs migration, etc.). Fixes infinite STARTED loop on snapshots
  with queued search_backend results.
- Universal `--init` flag: works on any subcommand (server, update, add,
  shell, install, ...). Detected at module load, stripped from argv, and
  consumed in the dispatcher so subprocesses inherit a clean env.
- supervisord_util.run_runner_worker: route Ctrl+C through
  supervisor.signalProcess(name, "SIGINT") instead of raw os.kill on a
  cached pid, gated on statename=RUNNING. Prevents killing unrelated
  processes when the worker's pid has been reused by the OS.
- Login page: remove non-functional password-reset links; add
  has_real_admin_users template tag to gate the bootstrap hint.
- Add page: hide underline on the "Get the extension" link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 04:45:15 -07:00
Nick Sweeting 66e5e06c7f chore: checkpoint deploy loop changes 2026-05-28 07:12:56 -07:00
Nick Sweeting 08de08c64c release: archivebox 0.9.33rc17 2026-05-28 07:02:30 -07:00
Nick Sweeting 30c841d477 Update dev docs and MHTML preview handling 2026-05-24 22:41:16 -07:00