mirror of
https://github.com/clockworklabs/SpacetimeDB.git
synced 2026-06-27 16:30:35 -04:00
fb0a458d4f
# Description of Changes

AI app generation benchmark comparing SpacetimeDB vs PostgreSQL (Express + Socket.io + Drizzle ORM). Same AI model (Claude Sonnet 4.6), same prompts, same chat app, two backends. Upgraded through 12 feature levels, manually graded at each level, bugs fixed, all costs measured via OpenTelemetry.

Results viewable at: https://spacetimedb.com/llms-benchmark-sequential-upgrade

## Benchmark harness (`tools/llm-sequential-upgrade/`)

- `run.sh`: orchestrates headless Claude Code sessions for code generation, sequential upgrades, and bug fixes. Tracks all API costs via OTel. Supports `--upgrade`, `--fix`, `--composed-prompt`, `--resume-session` modes.
- `grade.sh` / `grade-agents.sh`: grading harnesses for manual testing of generated apps.
- `docker-compose.otel.yaml`: OTel collector + PostgreSQL services.
- `generate-report.mjs` / `parse-telemetry.mjs`: aggregate per-session telemetry into cost reports.
- Backend guidelines in `backends/`: SpacetimeDB SDK reference, config templates, server setup docs, PostgreSQL setup with Drizzle/Socket.io guidance.

**After https://github.com/clockworklabs/SpacetimeDB/pull/4740 merges, we will likely want to update this so that it reads backend and SDK guidance from SKILLS**

## Two complete benchmark runs

**Run 1 (20260403):** Original methodology.

**Run 2 (20260406):** Refined methodology with domain bias removed from SpacetimeDB SDK docs and PostgreSQL instructions made feature-spec-neutral.

**Note: no meaningful changes in results were observed with these changes. Domain familiarity biases were very small and almost certainly not the cause of STDB's major gains over PG stack.**

Each run contains full L1-L12 app source for both backends, level snapshots preserving state before each upgrade, and per-session OTel cost summaries.

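As a rough illustration of how per-session cost summaries could roll up into per-backend totals and fix-session counts, here is a minimal Python sketch. The record fields (`backend`, `kind`, `cost_usd`) and all values are hypothetical placeholders, not the actual schema of `cost-summary.json` and not the logic of `generate-report.mjs`.

```python
from collections import defaultdict

# Hypothetical per-session records. Field names and values are illustrative
# only and do not reflect the real cost-summary.json schema or real costs.
sessions = [
    {"backend": "spacetimedb", "kind": "upgrade", "cost_usd": 1.25},
    {"backend": "spacetimedb", "kind": "fix", "cost_usd": 0.50},
    {"backend": "postgres", "kind": "upgrade", "cost_usd": 1.75},
    {"backend": "postgres", "kind": "fix", "cost_usd": 0.25},
    {"backend": "postgres", "kind": "fix", "cost_usd": 0.50},
]


def summarize(records):
    """Sum cost and count fix sessions per backend."""
    totals = defaultdict(lambda: {"total_cost_usd": 0.0, "fix_sessions": 0})
    for rec in records:
        entry = totals[rec["backend"]]
        entry["total_cost_usd"] += rec["cost_usd"]
        if rec["kind"] == "fix":
            entry["fix_sessions"] += 1
    return dict(totals)


summary = summarize(sessions)
for backend, stats in summary.items():
    print(backend, stats)
```

The same shape of roll-up, applied to the real per-session summaries, is what a reviewer would compare against the committed report totals.
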
## 12 feature levels

| Level | Feature |
|---|---|
| L1 | Basic Chat + Typing + Read Receipts + Unread Counts |
| L2 | Scheduled Messages |
| L3 | Ephemeral Messages |
| L4 | Message Reactions |
| L5 | Message Editing with History |
| L6 | Real-Time Permissions (kick, ban, promote) |
| L7 | Rich User Presence |
| L8 | Message Threading |
| L9 | Private Rooms + Direct Messages |
| L10 | Room Activity Indicators |
| L11 | Draft Sync |
| L12 | Anonymous to Registered Migration |

## Results

| Metric | Run 1 (20260403) | Run 2 (20260406) |
|---|---|---|
| **SpacetimeDB total cost** | $13.33 | $12.62 |
| **PostgreSQL total cost** | $17.80 | $19.68 |
| **SpacetimeDB bugs** | 5 | 2 |
| **PostgreSQL bugs** | 19 | 8 |
| **SpacetimeDB fix sessions** | 4 | 1 |
| **PostgreSQL fix sessions** | 17 | 10 |

Both runs agree: SpacetimeDB apps are cheaper to build, have fewer bugs, and require fewer fix iterations. The refined methodology (Run 2) widened the cost gap and **confirmed the advantage is structural, not an artifact of domain-biased SDK docs.**

## Performance benchmark (`perf-benchmark/`)

Stress throughput tool that fires concurrent writers at peak saturation against the AI-generated `send_message` handlers.

| Tier | SpacetimeDB (avg) | PostgreSQL (avg) | Ratio |
|---|---|---|---|
| AI-generated (as-shipped) | 5,267 msgs/sec | 694 msgs/sec | 7.6x |
| PG rate limit removed | 5,267 msgs/sec | 1,070 msgs/sec | 4.9x |
| Optimized (same features kept) | 25,278 msgs/sec | 1,139 msgs/sec | 22x |

The gap widens with optimization because SpacetimeDB's bottleneck is fixable code patterns in the reducer while PostgreSQL's bottleneck is architectural (sequential network round-trips to an external database). Optimized reference code with all features preserved is in `perf-benchmark/results/optimized-reference/`.

## Data handling

Per-session cost summaries (`cost-summary.json`, `COST_REPORT.md`, `metadata.json`) are committed.
Raw OTel telemetry (`raw-telemetry.jsonl`) containing PII is excluded via `.gitignore` and stored privately.

# API and ABI breaking changes

None. All changes are in `tools/llm-sequential-upgrade/`. No production code, library, or SDK changes.

# Expected complexity level and risk

**1 - Trivial.** Self-contained benchmarking tooling and data. No interaction with production code.

# Testing

- [x] L1-L12 upgrades completed on all 4 apps (2 backends x 2 runs) with OTel cost capture
- [x] All levels manually graded after each upgrade; bugs filed and fixed via the harness
- [x] Methodology refinement between runs validated (domain bias removal, feature-neutral instructions)
- [x] Stress benchmarks run across both runs x 3 tiers (as-shipped, rate-limit-removed, optimized)
- [x] Optimized benchmarks verified to preserve all original features
- [x] Sensitive data (PII in raw telemetry) removed from repo and gitignored
- [ ] Reviewer: spot-check that METRICS_DATA.json / METRICS_REPORT.json numbers match the telemetry cost-summary.json files

---------

Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
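
As a quick sanity check on the figures reported in the Results and Performance benchmark sections, the following Python sketch recomputes the throughput ratios and the per-run cost gaps from the published numbers. The variable and function names are illustrative; only the numeric inputs come from the tables in this description.

```python
# Throughput (msgs/sec) copied from the performance benchmark table:
# (SpacetimeDB avg, PostgreSQL avg) per tier.
TIERS = {
    "AI-generated (as-shipped)": (5267, 694),
    "PG rate limit removed": (5267, 1070),
    "Optimized (same features kept)": (25278, 1139),
}

# Total cost in USD copied from the results table:
# (SpacetimeDB, PostgreSQL) per run.
COSTS = {
    "Run 1": (13.33, 17.80),
    "Run 2": (12.62, 19.68),
}


def throughput_ratio(stdb: float, pg: float) -> float:
    """SpacetimeDB throughput expressed as a multiple of PostgreSQL throughput."""
    return stdb / pg


ratios = {tier: throughput_ratio(s, p) for tier, (s, p) in TIERS.items()}
cost_gaps = {run: round(pg - stdb, 2) for run, (stdb, pg) in COSTS.items()}

for tier, ratio in ratios.items():
    print(f"{tier}: {ratio:.1f}x")
for run, gap in cost_gaps.items():
    print(f"{run}: PostgreSQL cost exceeds SpacetimeDB by ${gap:.2f}")
```

Rounded to one decimal place, the first two ratios match the table; the optimized tier comes out slightly above the 22x shown, which the table rounds to a whole number. The cost gap is larger in Run 2 than in Run 1, consistent with the statement that the refined methodology widened it.
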