We recently released our On-call product, and as part of that we had to think a lot about redundancy and 'failing safely'.
Here's how we achieve it, and how we're thinking about it. I'm interested in whether other examples of this exist in the wild - I'd love to know more about how, e.g., Datadog achieves this.
It’s a fun problem to solve, and one I’ve come across before when trying to alert on a monitoring tool being down, but it’s slightly different when the tool is your own product.
Hopefully interesting if you’ve hit similar puzzles before.