Every product experimentation team doing causal inference on LLM-based features eventually hits the same wall: when the provider ships a new model version, there's no holdout.
Your infrastructure team upgrades every workspace from Claude 4.5 to Claude 4.6 overnight. All 50 production workspaces get the new model at the same time. A week later, task completion climbs ac...