archivebox add and other entry points now seed Crawl.urls as
CrawlSeed JSONL at depth=0 (the input layer). max_depth=depth for
direct URLs; max_depth=depth+1 only for stdin/import text, where the
synthetic archivebox://internal root occupies depth=0. The runner
also accepts one plain URL per line for ORM/crawl-create/schedule
callers, so every Crawl row goes through the same expansion path
without scattering CrawlSeed knowledge across the codebase.
Tests updated to match the restored convention.
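For reviewers, here is a rough sketch of the two rules described above: accepting either CrawlSeed JSONL rows or one plain URL per line, and the depth/max_depth convention. SeedRow, parse_seed_lines and effective_max_depth are hypothetical names for illustration, not the runner's actual API, and rejecting unknown record types is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedRow:
    """Hypothetical stand-in for one expanded seed (not the real model)."""
    url: str
    depth: int = 0


def parse_seed_lines(text: str) -> list[SeedRow]:
    """Accept CrawlSeed JSONL rows or bare URLs, one per line."""
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("{"):
            record = json.loads(line)
            if record.get("type") != "CrawlSeed":
                # Assumption: anything other than a CrawlSeed row is an error.
                raise ValueError(f"unexpected record type: {record.get('type')!r}")
            rows.append(SeedRow(url=record["url"], depth=int(record.get("depth", 0))))
        else:
            # Plain URL line, as used by ORM/crawl-create/schedule callers.
            rows.append(SeedRow(url=line, depth=0))
    return rows


def effective_max_depth(depth: int, has_synthetic_root: bool) -> int:
    """depth for direct URLs; depth+1 when a synthetic root occupies depth=0."""
    return depth + 1 if has_synthetic_root else depth
```

Both input styles end up as the same list of rows, which is the point of funnelling every caller through a single expansion path.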
Direct URL inputs from CLI/UI/API now seed Crawl.urls as explicit
{type:CrawlSeed,url,depth} JSONL rows; raw stdin/UI/API import text
stays verbatim. The runner's create_initial_snapshots() is now the
single place that either expands seed rows or creates the synthetic
archivebox://internal root + staticfile/stdin.txt. As a result, add
paths no longer perform DB/FS side effects, and the parser hooks run
through the same Snapshot lifecycle as every other extractor.
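As an illustration of the row shape described above, a minimal sketch of how direct URLs might be serialized into the Crawl.urls text. seeds_to_jsonl is a hypothetical helper name; only the {type, url, depth} keys come from this commit.

```python
import json


def seeds_to_jsonl(urls, depth=0):
    """Serialize direct URL inputs as one CrawlSeed JSON object per line."""
    return "\n".join(
        json.dumps({"type": "CrawlSeed", "url": url, "depth": depth})
        for url in urls
    )
```

Raw import text would bypass a helper like this entirely and be stored unchanged, leaving the runner to create the synthetic root for it.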
Renames (no functional change, just consistency with the rest of the codebase):
- cli/cli_utils.py → cli/cli_util.py
- core/host_utils.py → core/host_util.py
- core/tag_utils.py → core/tag_util.py
- crawls/schedule_utils.py → crawls/schedule_util.py
- machine/env_utils.py → machine/env_util.py
Functional fixes:
- archivebox add --index-only now materializes Snapshot rows synchronously
via crawl.create_snapshots_from_urls() instead of just queueing the Crawl
and leaving the index empty. The previous behavior broke every test that
expected --index-only to populate the index, since the runner is never
started in index-only mode.
- config/collection.py: add _coerce_from_str_dict as the inverse of
_coerce_to_str_dict so JSON-encoded INI values are decoded back to native
dict/list types when mirrored into Machine.config (a JSONField). Without
this, downstream consumers like MachineEvent / abx-dl get raw JSON
strings where they expect dicts.
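The round trip described in the config/collection.py item can be sketched roughly as follows. This is an assumption about the helpers' shape, not the actual implementation: the function names are written without the leading underscore to mark them as stand-ins, and only values that parse as JSON objects or arrays are decoded.

```python
import json


def coerce_to_str_dict(values):
    """Encode non-string values as JSON so they fit INI-style string storage."""
    return {k: v if isinstance(v, str) else json.dumps(v) for k, v in values.items()}


def coerce_from_str_dict(values):
    """Decode JSON-encoded dict/list strings back to native types."""
    decoded = {}
    for key, value in values.items():
        if isinstance(value, str) and value.strip()[:1] in ("{", "["):
            try:
                decoded[key] = json.loads(value)
                continue
            except json.JSONDecodeError:
                pass  # looks like JSON but is not; keep the raw string
        decoded[key] = value
    return decoded
```

Plain strings pass through untouched, so a value such as "[draft" is left as-is rather than raising.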
Plus matching admin / middleware / model touch-ups, the registration
password_change_form template, and assorted small cleanups the user
worked through while validating the deploy path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Snapshot detail page: embed scoped live-progress monitor (same-origin
/progress.json on whichever host the page is served from); hide admin
action buttons when scoped; per-snapshot perms via can_view_snapshot.
- crawl_file API: respect crawl-level permissions; PUBLIC/UNLISTED served
to guests, PRIVATE returns 404 for non-admin/non-owner.
- CrawlRunner: replace allow_paused_snapshot_maintenance with
allow_maintenance_on_inactive_crawl so SEALED crawls don't short-circuit
the cancellation guard for legitimate maintenance hooks (search backend
backfill, fs migration, etc.). Fixes infinite STARTED loop on snapshots
with queued search_backend results.
- Universal `--init` flag: works on any subcommand (server, update, add,
shell, install, ...). Detected at module load, stripped from argv, and
consumed in the dispatcher so subprocesses inherit a clean env.
- supervisord_util.run_runner_worker: route Ctrl+C through
supervisor.signalProcess(name, "SIGINT") instead of raw os.kill on a
cached pid, gated on statename=RUNNING. Prevents killing unrelated
processes when the worker's pid has been reused by the OS.
- Login page: remove non-functional password-reset links; add
has_real_admin_users template tag to gate the bootstrap hint.
- Add page: hide underline on the "Get the extension" link.
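For the supervisord_util.run_runner_worker item above, the guard can be sketched as below. getProcessInfo and signalProcess are methods of supervisor's XML-RPC API; interrupt_worker and the way the supervisor proxy is obtained are assumptions for illustration.

```python
def interrupt_worker(supervisor, name):
    """Send SIGINT through supervisor, but only if it reports the worker RUNNING.

    `supervisor` is assumed to be the `supervisor` namespace of an XML-RPC
    proxy. Asking supervisor to deliver the signal avoids os.kill on a cached
    pid that the OS may have reassigned to an unrelated process.
    """
    info = supervisor.getProcessInfo(name)
    if info.get("statename") != "RUNNING":
        return False  # worker is not running; nothing to signal
    supervisor.signalProcess(name, "SIGINT")
    return True
```

Because supervisor resolves the process by name at signal time, a stale pid never enters the picture.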
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>