supabase/apps/docs/internals/markdown-sources.ts
Pamela Chia 20290c71bd fix(docs): stop named-bot markdown 404s on guides (#47337)
## Summary

Since the guides UA-redirect shipped (GROWTH-811), named LLM bots
requesting `/docs/guides/*` get rewritten to the markdown handler, which
returns a 404 when no `.md` file exists. About 90K of those 404s per day
land on real pages that serve HTML 200 fine: the bot gets nothing on a
page that works.

The root cause is that the docs middleware hardcoded
`hasMarkdownVariant: true` for every guide path, so it never checked
whether a `.md` actually existed. I fixed it in two layers:

1. A build-time slug manifest makes `hasMarkdownVariant` truthful. Guide
pages with no `.md` now fall through to HTML 200 instead of a 404. This
is content-source-agnostic and future-proof: a new content source can
never silently regress to a 404.
2. A second generator pass emits real markdown for the troubleshooting
collection (the largest source, ~70% of the 404 volume), so those bots
get clean markdown rather than just HTML.
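
The first layer can be pictured as a set lookup. The following is a minimal sketch under stated assumptions, not the actual middleware: `GUIDE_PREFIX`, `toGuideSlug`, the function form of `hasMarkdownVariant`, and the sample slugs are all hypothetical names chosen for illustration.

```typescript
// Hypothetical sketch: names and sample slugs are illustrative, not the real middleware.
const GUIDE_PREFIX = '/docs/guides/'

// Stand-in for the generated manifest of slugs that have a `.md` file.
const sampleManifest: ReadonlySet<string> = new Set([
  'auth/quickstart',
  'troubleshooting/some-error',
])

// Turn a request pathname into a guide slug, or null if it is not a guide path.
function toGuideSlug(pathname: string): string | null {
  if (!pathname.startsWith(GUIDE_PREFIX)) return null
  const slug = pathname
    .slice(GUIDE_PREFIX.length)
    .replace(/\.md$/, '')
    .replace(/\/+$/, '')
  return slug.length > 0 ? slug : null
}

// Truthful replacement for a hardcoded `true`: only slugs in the manifest count.
function hasMarkdownVariant(pathname: string, manifest: ReadonlySet<string>): boolean {
  const slug = toGuideSlug(pathname)
  return slug !== null && manifest.has(slug)
}
```

With this shape, a guide path that is missing from the set yields `false`, so the request can fall through to the HTML handler instead of being rewritten to a markdown file that does not exist.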

## Changes

- Add a shared `markdown-sources` module: a single source of truth for
which slugs get a `.md` (guides + troubleshooting), so the generator
output and the manifest cannot drift.
- Generate markdown for the troubleshooting collection (196 pages, TOML
frontmatter parsed via `smol-toml`), written under
`public/markdown/guides/troubleshooting/`.
- Emit a build-time slug manifest (a gitignored generated `.ts` module,
regenerated in `prebuild`, `predev`, and `pretypecheck`, mirroring the
existing `__generated__/graphql.ts` lifecycle).
- Gate the middleware's `hasMarkdownVariant` on the manifest: serve HTML
200 instead of a 404 for guide paths with no markdown variant.
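
As a rough illustration of the manifest step, the sketch below renders a list of sources into the text of a generated `.ts` module and checks the "one `.md` per manifest slug" invariant. The `SourceEntry` shape, the export name `MARKDOWN_SLUGS`, and both function names are assumptions made for this example; the real generator may be structured differently.

```typescript
// Hypothetical sketch: the export name and helper names are assumptions.
interface SourceEntry {
  slug: string
  outPath: string
}

// Render the manifest as TypeScript module source, sorted so output is deterministic.
function renderManifestModule(sources: SourceEntry[]): string {
  const slugs = [...new Set(sources.map((s) => s.slug))].sort()
  const lines = slugs.map((slug) => `  ${JSON.stringify(slug)},`)
  return [
    '// Generated file. Do not edit.',
    'export const MARKDOWN_SLUGS: ReadonlySet<string> = new Set([',
    ...lines,
    '])',
    '',
  ].join('\n')
}

// Invariant check: the set of expected output paths equals the set of written paths.
function manifestMatchesOutputs(sources: SourceEntry[], writtenPaths: string[]): boolean {
  const expected = new Set(sources.map((s) => s.outPath))
  const written = new Set(writtenPaths)
  return expected.size === written.size && [...expected].every((p) => written.has(p))
}
```

Deriving both the manifest and the output paths from the same source list is what keeps the two from drifting.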

This PR intentionally does not generate markdown for the ai-prompts,
YAML config, and externally-fetched (splinter) sources. The HTML
fallback covers them now; generating their markdown is follow-up work.

## Testing

Local verification (deterministic, against the real manifest and the
real negotiation function):
- Manifest invariant holds: 744 manifest slugs equal 744 generated `.md`
files.
- Generator emits 196 troubleshooting files with zero warnings,
frontmatter stripped, no leaked delimiters.
- Negotiation decision matrix, 6/6:
  - covered slug + bot UA: markdown
  - uncovered real page + bot UA: pass (HTML 200)
  - nonexistent + bot UA: pass
  - browser: HTML
  - covered + `.md` suffix: markdown
  - uncovered + `.md` suffix: pass
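
The six cases in the matrix reduce to one rule: serve markdown only when the request asks for it and the slug is covered. A minimal sketch follows, assuming a hypothetical `negotiate` function and input shape; the real negotiation function's signature is not shown in this commit message.

```typescript
// Hypothetical sketch of the six-case matrix; types and names are illustrative.
type Decision = 'markdown' | 'pass'

interface NegotiationInput {
  covered: boolean // slug is present in the manifest
  isNamedBot: boolean // User-Agent matched a named LLM bot
  hasMdSuffix: boolean // URL ended in `.md`
}

// 'pass' means the request continues to the normal handler (HTML 200 or 404).
function negotiate({ covered, isNamedBot, hasMdSuffix }: NegotiationInput): Decision {
  const wantsMarkdown = isNamedBot || hasMdSuffix
  return wantsMarkdown && covered ? 'markdown' : 'pass'
}
```

A nonexistent page is simply an uncovered slug here, so it passes through and the normal handler returns its usual 404.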

Verified on the Vercel preview deploy:
- [x] `User-Agent: ChatGPT-User` on a troubleshooting page returns `200
text/markdown` (real markdown body, frontmatter stripped).
- [x] `User-Agent: ChatGPT-User` on an uncovered real page
(`ai-tools/ai-prompts/code-format-sql`) returns `200 text/html` (was
404).
- [x] Browser request to the same uncovered page returns `200 text/html`
(unchanged for humans).
- [x] `User-Agent: ChatGPT-User` on a covered standard guide returns
`200 text/markdown` (no regression).
- [x] `User-Agent: ChatGPT-User` on a nonexistent guide URL returns
`404` (correct).

Known limitation: an explicit `.md`-suffix request on an uncovered page
still 404s by design (an explicit markdown request for a page that has
no markdown). The ~90K/day volume is plain-URL UA-based, so it is
unaffected.

Post-deploy, I will re-run the request-grain 404 reclassification in the
GROWTH-915 BQ workspace to confirm fixable guide markdown 404s drop to
near zero.

## Linear
- fixes GROWTH-946


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
  * Added generated markdown slug tracking for docs guides, improving
    markdown availability detection.
  * Added automated manifest generation and validation during docs build
    and CI workflows.

* **Bug Fixes**
  * Improved guide markdown negotiation so only supported guide slugs are
    treated as having a markdown variant.
  * Standardized markdown source handling for guides and troubleshooting
    pages.

* **Tests**
  * Added coverage for guide and troubleshooting slug generation.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Alaister Young <10985857+alaister@users.noreply.github.com>
2026-06-27 22:00:14 -07:00


import path from 'node:path'

import { globby } from 'globby'

export type FrontmatterFormat = 'yaml' | 'toml'

// One source file that gets a generated `.md`, plus where that file is written.
export interface MarkdownSource {
  sourceFile: string
  slug: string
  outPath: string
  frontmatter: FrontmatterFormat
}

// Root directory for all generated markdown output.
const OUTPUT_ROOT = 'public/markdown/guides'
const GUIDES_GLOB = 'content/guides/**/!(_)*.mdx'
const TROUBLESHOOTING_GLOB = 'content/troubleshooting/!(_)*.mdx'

// Guide slug: the path under content/guides/ with the .mdx extension removed.
export function guideSlug(sourceFile: string): string {
  return sourceFile.replace(/^content\/guides\//, '').replace(/\.mdx$/, '')
}

// Troubleshooting slug: `troubleshooting/` plus the file name without .mdx.
export function troubleshootingSlug(sourceFile: string): string {
  return `troubleshooting/${path.basename(sourceFile, '.mdx')}`
}

// Collect every source that gets a `.md`: guides (YAML frontmatter) first,
// then troubleshooting pages (TOML frontmatter).
export async function collectMarkdownSources(): Promise<MarkdownSource[]> {
  const [guideFiles, troubleshootingFiles] = await Promise.all([
    globby([GUIDES_GLOB]),
    globby([TROUBLESHOOTING_GLOB]),
  ])

  const guides: MarkdownSource[] = guideFiles.map((sourceFile) => {
    const slug = guideSlug(sourceFile)
    return { sourceFile, slug, outPath: `${OUTPUT_ROOT}/${slug}.md`, frontmatter: 'yaml' }
  })

  const troubleshooting: MarkdownSource[] = troubleshootingFiles.map((sourceFile) => {
    const slug = troubleshootingSlug(sourceFile)
    return { sourceFile, slug, outPath: `${OUTPUT_ROOT}/${slug}.md`, frontmatter: 'toml' }
  })

  return [...guides, ...troubleshooting]
}