A war story about thundering herds, cascading failures, and the 15 seconds that broke everything.
It was a Wednesday afternoon when the alerts started firing. Our production API — a Ruby on Rails application serving thousands of clients — had gone completely unresponsive. Every single endpoint was returning...