Matt Rossman 65fab30935 feat(ai): judge tool inputs, add storage guidance and permissive RLS evals (#46168)
Adding broad RLS policies to public buckets can cause users to expose
more than they expected, like the ability to list all profile pictures
on an app. This patches Assistant with knowledge to follow our latest
guidance on restrictive RLS policies for storage buckets
https://github.com/supabase/supabase/pull/46172

**Changes**
- Adds Storage bucket evals for public website assets and avatar access
patterns to distinguish public vs private bucket use cases
- Adds eval for overly permissive table policies
- Adds `storage` knowledge so Assistant distinguishes public buckets,
private buckets, object reads, and object listing.
- Adds `includeToolCallInputs` option for scorer transcripts so LLM
judges can evaluate proposed SQL/tool actions.
- Bumps max step count to 10 since storage knowledge may incur another
tool call (also 10 is recommended
[here](https://vercel.com/academy/ai-sdk/multi-step-and-generative-ui#why-multi-step-is-required)
for complex multi-tool scenarios)

**References**
- https://supabase.com/docs/guides/storage/buckets/fundamentals#public-buckets
- https://supabase.com/docs/guides/storage/security/access-control
- https://github.com/supabase/supabase/pull/46172

**Notes:**
- These prompt tweaks are not meant to be exhaustive fixes; they are
mainly hotfixes intended to hold us over until these cases can be
addressed more deeply in skills/docs and tracked in a central evals

Closes AI-676
Closes AI-756

## Summary by CodeRabbit

* **New Features**
  * Added Storage knowledge resource for the assistant covering Supabase Storage access patterns and RLS guidance.
  * Added three evaluation cases: two for Storage (marketing assets, avatars) and one for RLS policy generation for user profiles.

* **Improvements**
  * Evaluators now include tool call inputs when judging conversations.
  * Assistant prompts and generation enhanced with richer Storage/RLS guidance and extended streaming limits.

* **Tests**
  * Added test ensuring tool call inputs are included in serialized thread context.

2026-05-29 09:55:23 -04:00

Studio Assistant Evals

We use Braintrust to evaluate Assistant behaviors against a tracked dataset (offline evals) and against live traces (online evals).

Offline Evals

Add offline eval test cases to dataset.ts. If needed, add new scorers (see below) for the specific dimension you wish to test. Expect to update and run offline evals whenever you add new Assistant behaviors.

You may wish to run offline evals when:

  • You updated the eval suite with a new test case or scorer
  • You changed Assistant's behavior and want to check for improvements/regressions
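
As a rough illustration of what adding a test case involves, the sketch below defines one case and selects it by prompt. The `EvalCase` shape, the `requiredTools` and `category` fields, and the `execute_sql` tool name are assumptions made for this example; only the `input.prompt` field is implied by the filter command later in this README, so check dataset.ts for the real structure.

```typescript
// Hypothetical test-case shape for illustration; the real type lives in dataset.ts.
type EvalCase = {
  input: { prompt: string };
  expected?: { requiredTools?: string[] }; // assumed field
  metadata?: { category?: string[] }; // assumed field
};

const dataset: EvalCase[] = [
  {
    input: { prompt: "How many projects do I have?" },
    expected: { requiredTools: ["execute_sql"] }, // tool name is a placeholder
    metadata: { category: ["sql"] },
  },
  {
    input: { prompt: "Create a public bucket for marketing assets" },
    metadata: { category: ["storage"] },
  },
];

// Select cases whose prompt starts with a given prefix, similar in spirit
// to narrowing a run to a single case.
const selected = dataset.filter((c) =>
  c.input.prompt.startsWith("How many projects")
);
```

Keeping prompts distinctive makes it easier to target a single case when iterating locally.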

Running Offline Evals in CI

Add the run-evals label to a PR in the repo, and Braintrust's GitHub Action will run the evals and post a summary comment (example).

You can find detailed results in the "Experiments" tab of the "Assistant" project on Braintrust.

Running Offline Evals in Local Dev

Within apps/studio

# To set up WASM files
pnpm evals:setup

# Run all evals and upload results to Braintrust
pnpm evals:upload

# Run all evals without uploading results
pnpm evals:run

# Run and upload a single test case
pnpm braintrust eval evals/assistant.eval.ts --filter "input.prompt=How many projects"

Upload results when you want to inspect Experiments or Logs in the Braintrust dashboard or API. You can use developer tools such as the Braintrust MCP or the bt CLI to analyze results with an agent.

Scorers

Scorers look at a thread or task output and assign a score, either deterministically or via LLM-as-a-judge. They can optionally compare the output against expected values.

Define scorers in scorer.ts and include them in assistant.eval.ts to run them in offline evals.
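
A minimal sketch of a deterministic scorer follows, assuming a scorer is a function that receives the task output plus optional expected values and returns a named score between 0 and 1. The `ScorerArgs` and `Score` types, the `requiredTools` field, and the `RequiredToolsCalled` name are hypothetical and not taken from scorer.ts.

```typescript
// Hypothetical shapes for illustration; the real types live in scorer.ts.
type ToolCall = { toolName: string; input: unknown };
type ScorerArgs = {
  output: { text: string; toolCalls: ToolCall[] };
  expected?: { requiredTools?: string[] }; // assumed field
};
type Score = { name: string; score: number | null };

// Deterministic scorer: fraction of expected tools that were actually called.
function requiredToolsCalled({ output, expected }: ScorerArgs): Score {
  const required = expected?.requiredTools ?? [];
  // With no expectation there is nothing to check, so return a null score.
  if (required.length === 0) {
    return { name: "RequiredToolsCalled", score: null };
  }
  const called = new Set(output.toolCalls.map((c) => c.toolName));
  const hits = required.filter((tool) => called.has(tool)).length;
  return { name: "RequiredToolsCalled", score: hits / required.length };
}
```

An LLM-as-a-judge scorer would have the same outer shape but would derive the score from a model's verdict on the thread rather than from a direct comparison.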

Updating Online Scorers

Online scorers run as serverless functions on Braintrust infrastructure. They're deployed from the scorer-online.ts script. Since they score production traces, they can't rely on ground-truth expected values, so structure scoring logic and LLM prompts accordingly. Not every scorer needs to be an online scorer.

To opt in to online scoring, add the scorer to scorer-online-manifest.json and add a corresponding handler in scorer-online.ts.
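
The pairing of a manifest entry with a handler can be sketched as below. The manifest schema (`scorers`, `name`, `slug`), the `OnlineHandler` signature, and the `conciseness` scorer are all assumptions for illustration; the actual formats in scorer-online-manifest.json and scorer-online.ts may differ.

```typescript
// Hypothetical manifest contents; the real JSON schema may differ.
const manifest: { scorers: { name: string; slug: string }[] } = {
  scorers: [{ name: "Conciseness", slug: "conciseness" }],
};

// Online handlers see only the trace output, never expected values.
type OnlineArgs = { output: string };
type OnlineHandler = (args: OnlineArgs) => number | null;

const handlers: Record<string, OnlineHandler> = {
  // Placeholder logic: full score up to 500 characters, then decreasing.
  conciseness: ({ output }) =>
    output.length === 0 ? null : Math.min(1, 500 / output.length),
};

// Report manifest entries that have no matching handler.
function missingHandlers(): string[] {
  return manifest.scorers
    .map((s) => s.slug)
    .filter((slug) => !(slug in handlers));
}
```

A check like `missingHandlers()` is one way to catch a manifest entry that was added without its handler before deploying.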

Testing & Deploying Online Scorers

Add the preview-scorers label to a PR to deploy branch-prefixed scorers to the "Assistant (Staging Scorers)" Braintrust project (example). From that project dashboard, you can manually test the scorer against a trace from any project.

After merge to master, preview scorers are cleaned up automatically and the scorers deploy to production in the "Assistant" Braintrust project. Update the "Online Scoring" automation in the Logs page to include the new scorer function.