Adding broad RLS policies to public buckets can cause users to expose more than they expected, such as the ability to list all profile pictures on an app. This patches Assistant with knowledge to follow our latest guidance on restrictive RLS policies for storage buckets: https://github.com/supabase/supabase/pull/46172

**Changes**

- Adds Storage bucket evals for public website assets and avatar access patterns to distinguish public vs private bucket use cases.
- Adds an eval for overly permissive table policies.
- Adds `storage` knowledge so Assistant distinguishes public buckets, private buckets, object reads, and object listing.
- Adds an `includeToolCallInputs` option for scorer transcripts so LLM judges can evaluate proposed SQL/tool actions.
- Bumps max step count to 10 since storage knowledge may incur another tool call (10 is also recommended [here](https://vercel.com/academy/ai-sdk/multi-step-and-generative-ui#why-multi-step-is-required) for complex multi-tool scenarios).

**References**

- https://supabase.com/docs/guides/storage/buckets/fundamentals#public-buckets
- https://supabase.com/docs/guides/storage/security/access-control
- https://github.com/supabase/supabase/pull/46172

**Notes:**

- These prompt tweaks are not meant to be exhaustive fixes; they are mainly hotfixes intended to hold us over until these cases can be addressed more deeply in skills/docs and tracked in central evals.

Closes AI-676
Closes AI-756

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added Storage knowledge resource for the assistant covering Supabase Storage access patterns and RLS guidance.
  * Added three evaluation cases: two for Storage (marketing assets, avatars) and one for RLS policy generation for user profiles.
* **Improvements**
  * Evaluators now include tool call inputs when judging conversations.
  * Assistant prompts and generation enhanced with richer Storage/RLS guidance and extended streaming limits.
* **Tests**
  * Added test ensuring tool call inputs are included in serialized thread context.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Studio Assistant Evals
We use Braintrust to evaluate Assistant behaviors against a tracked dataset (offline evals) and against live traces (online evals).
Offline Evals
Add offline eval test cases to dataset.ts. If needed, add new scorers (see below) for the specific dimension you wish to test. Expect to update and run offline evals when adding new Assistant behaviors.
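As a rough illustration of what a test case might look like, here is a minimal sketch. Only the `input.prompt` field is implied by the `--filter` example in this README; the `AssistantEvalCase` type, the `expected` and `metadata` fields, the prompts, and the tool name are hypothetical and may not match the real dataset.ts.

```typescript
// Hypothetical shape for illustration; the real types in dataset.ts may differ.
type AssistantEvalCase = {
  input: { prompt: string }
  // Assumed optional fields, not confirmed by the source:
  expected?: { requiredTools?: string[] }
  metadata?: { category?: string[] }
}

const exampleCases: AssistantEvalCase[] = [
  {
    input: { prompt: 'How many projects do I have?' },
    expected: { requiredTools: ['list_projects'] }, // hypothetical tool name
  },
  {
    input: { prompt: 'Set up a storage bucket for user avatars' },
    metadata: { category: ['storage', 'rls'] },
  },
]

// Returns every prompt that appears more than once in the dataset.
function findDuplicatePrompts(cases: AssistantEvalCase[]): string[] {
  const seen = new Set<string>()
  const duplicates = new Set<string>()
  for (const c of cases) {
    const prompt = c.input.prompt.trim()
    if (seen.has(prompt)) duplicates.add(prompt)
    seen.add(prompt)
  }
  return Array.from(duplicates)
}

console.log(findDuplicatePrompts(exampleCases))
```

Keeping prompts unique is a design choice in this sketch: it makes a single case easier to target when filtering on `input.prompt`.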
You may wish to run offline evals when:
- You updated the eval suite with a new test case or scorer
- You changed Assistant's behavior and want to check for improvements/regressions
Running Offline Evals in CI
Add the run-evals label on a PR to the repo and Braintrust's GitHub Action will run evals and post a summary comment (example).
You can find detailed results in the "Experiments" tab of the "Assistant" project on Braintrust.
Running Offline Evals in Local Dev
Within apps/studio:
# To set up WASM files
pnpm evals:setup
# Run all evals and upload results to Braintrust
pnpm evals:upload
# Run all evals without uploading results
pnpm evals:run
# Run and upload a single test case
pnpm braintrust eval evals/assistant.eval.ts --filter "input.prompt=How many projects"
Upload results when you want to inspect Experiments or Logs in the Braintrust dashboard or API. You can use developer tools like Braintrust MCP or bt CLI to analyze results with an agent.
Scorers
Scorers look at a thread or task output and assign a score deterministically or via LLM-as-a-judge. Optionally they can consider expected values.
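For an LLM judge to assess proposed SQL or tool actions, the thread has to be serialized into a transcript that includes what each tool call was asked to do. The sketch below shows one way this could work. The `includeToolCallInputs` option name comes from the PR description above; the `ThreadMessage` type, the `serializeThread` function, and the tool name are hypothetical and are not the repo's implementation.

```typescript
// Hypothetical message shape; the real thread type may differ.
type ThreadMessage =
  | { role: 'user' | 'assistant'; text: string }
  | { role: 'tool-call'; toolName: string; input: unknown }

// Flattens a thread into plain text for an LLM judge.
// Tool call inputs are only included when the option is set.
function serializeThread(
  messages: ThreadMessage[],
  opts: { includeToolCallInputs?: boolean } = {}
): string {
  return messages
    .map((m) => {
      if (m.role === 'tool-call') {
        const base = `[tool-call] ${m.toolName}`
        return opts.includeToolCallInputs ? `${base} ${JSON.stringify(m.input)}` : base
      }
      return `[${m.role}] ${m.text}`
    })
    .join('\n')
}

const sampleThread: ThreadMessage[] = [
  { role: 'user', text: 'Let users read avatars' },
  {
    role: 'tool-call',
    toolName: 'execute_sql', // hypothetical tool name
    input: { sql: 'create policy p on storage.objects for select using (true);' },
  },
]

console.log(serializeThread(sampleThread, { includeToolCallInputs: true }))
```

Without the option, the judge would only see that a tool was called, not the SQL it proposed.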
Define scorers in scorer.ts and include them in assistant.eval.ts to run them in offline evals.
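A deterministic scorer that uses expected values could look roughly like the sketch below. It assumes a scorer returns a named score between 0 and 1, or `null` when the scorer does not apply. The `TaskOutput` and `ExpectedValues` types, the `requiredToolsScorer` function, and the tool names are hypothetical and written without the Braintrust SDK so the block stays self-contained.

```typescript
// Hypothetical types; the real task output and expected shapes may differ.
type ToolCall = { toolName: string; input: unknown }
type TaskOutput = { text: string; toolCalls: ToolCall[] }
type ExpectedValues = { requiredTools?: string[] }
type ScoreResult = { name: string; score: number | null }

// Scores the fraction of expected tools that the Assistant actually called.
// Returns a null score when the test case lists no required tools.
function requiredToolsScorer(args: {
  output: TaskOutput
  expected?: ExpectedValues
}): ScoreResult {
  const required = args.expected?.requiredTools ?? []
  if (required.length === 0) return { name: 'RequiredTools', score: null }

  const called = new Set(args.output.toolCalls.map((c) => c.toolName))
  const hits = required.filter((tool) => called.has(tool)).length
  return { name: 'RequiredTools', score: hits / required.length }
}

const sampleScore = requiredToolsScorer({
  output: { text: 'Done', toolCalls: [{ toolName: 'load_knowledge', input: {} }] },
  expected: { requiredTools: ['load_knowledge', 'execute_sql'] },
})
console.log(sampleScore)
```

Returning `null` instead of 0 for inapplicable cases keeps those cases from dragging down the average.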
Updating Online Scorers
Online scorers run as serverless functions on Braintrust infrastructure. They're deployed from the scorer-online.ts script. Since these score against production traces, they can't rely on ground truth expected values. Structure scoring logic and LLM prompts accordingly. Not every scorer needs to be an online scorer.
To opt in to online scoring, add the scorer to scorer-online-manifest.json and add a corresponding handler in scorer-online.ts.
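To illustrate scoring without expected values, here is a minimal reference-free sketch that judges a trace purely on its own content. It flags proposed policies that use `USING (true)`, which is a crude regex heuristic chosen for illustration. The `ProductionTrace` type, the `permissivePolicyScorer` function, and the assumption that SQL arrives in a tool call's `input.sql` field are all hypothetical; this is not the handler in scorer-online.ts.

```typescript
// Hypothetical trace shape; real production traces may be structured differently.
type TracedToolCall = { toolName: string; input: { sql?: string } }
type ProductionTrace = { toolCalls: TracedToolCall[] }

// Reference-free check: needs no expected value, only the trace itself.
// Returns null when the trace proposes no policies, 0 when any policy
// uses `USING (true)`, and 1 otherwise.
function permissivePolicyScorer(trace: ProductionTrace): {
  name: string
  score: number | null
} {
  const policies = trace.toolCalls
    .map((call) => call.input.sql ?? '')
    .filter((sql) => /create\s+policy/i.test(sql))

  if (policies.length === 0) return { name: 'NoPermissivePolicy', score: null }

  const permissive = policies.some((sql) => /using\s*\(\s*true\s*\)/i.test(sql))
  return { name: 'NoPermissivePolicy', score: permissive ? 0 : 1 }
}

const broadTrace: ProductionTrace = {
  toolCalls: [
    {
      toolName: 'execute_sql', // hypothetical tool name
      input: { sql: 'create policy p on storage.objects for select using (true);' },
    },
  ],
}
console.log(permissivePolicyScorer(broadTrace))
```

A regex like this will miss equivalent permissive conditions, so an LLM judge prompt may be a better fit for production use.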
Testing & Deploying Online Scorers
Add the preview-scorers label to a PR to deploy branch-prefixed scorers to the "Assistant (Staging Scorers)" Braintrust project (example). From that project dashboard, you can manually test the scorer against a trace from any project.
After merge to master, preview scorers are automatically cleaned up and deployed to production in the "Assistant" Braintrust project. Update the "Online Scoring" automation in the Logs page to include the new scorer function.