Most retrieval-augmented features are easy to demo and hard to operate. The first version works because the document corpus is small and the prompts are loose. The hundredth version fails because the corpus drifted, the prompts got tuned to a moment in time, and nobody owns the eval pipeline.
We use a 12-point checklist on every RAG feature before it leaves staging. It's intentionally boring: ground-truth dataset, eval harness, retrieval recall floor, generation quality floor, cost ceiling per request, latency budget, refusal rate, fallback path, audit log shape, redaction boundary, escalation path, and rollback plan.
The point of the list isn't to slow things down. It's to force the team to write down the conditions under which the feature should be turned off — before anyone has the political reason not to.