The familiar starting point is a vague alert that lands at the worst possible time. A spike in HTTP 500s appears, Grafana shows three dashboards in red, Prometheus is scraping happily but not telling a coherent story, and the on-call engineer starts hopping between logs, traces, node metrics, and cloud consoles trying to answer one basic question: what broke?
That situation...