More monitoring is not necessarily better monitoring

- What gets measured gets fixed. Are we measuring the meaningful stuff, or incidentals?
  - Meaningful metrics: quality, quantity and cost
  - Incidentals: swap, disk space, anything targeted at a specific “problem” condition
- There should be (at least) 3 levels of “severity”, each with an appropriate corresponding action:
  - Major site problem - VIP not working, broken images, error messages, can’t place orders, etc.
  - Minor site problem - Slow response, error in rare cases, monitoring broken
  - Not customer-facing - One server or a small number have a problem but VIP is OK
  - Minor - Node is not down but some resources are out of bounds (swap, disk, CPU, etc.)
  - Forensic, correlating - Should not notify anyone, but can be used for additional information
- “Critical” is not necessarily important (e.g. NTP can be “critical” but not worth notifying anyone)
- Too many minor alarms can mask important problems and dull our response
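
As a rough illustration of tying each severity level to a corresponding action, the sketch below maps the levels listed above to a response. The action names (`page_now`, `page_business_hours`, `ticket`, `record_only`) are assumptions made up for this example. The only rule taken directly from the notes is that the forensic level never notifies anyone.

```python
from enum import Enum


class Severity(Enum):
    """Severity levels as named in the notes."""
    MAJOR_SITE = "major site problem"
    MINOR_SITE = "minor site problem"
    NOT_CUSTOMER_FACING = "not customer-facing"
    MINOR_NODE = "minor (node resources out of bounds)"
    FORENSIC = "forensic, correlating"


# Hypothetical action per level. Only "forensic never notifies"
# comes from the notes; the rest are placeholders to be tuned.
ACTIONS = {
    Severity.MAJOR_SITE: "page_now",
    Severity.MINOR_SITE: "page_business_hours",
    Severity.NOT_CUSTOMER_FACING: "ticket",
    Severity.MINOR_NODE: "ticket",
    Severity.FORENSIC: "record_only",
}


def action_for(severity):
    """Return the configured action for a severity level."""
    return ACTIONS[severity]


def should_notify(severity):
    """A level notifies a human unless its action is record-only."""
    return ACTIONS[severity] != "record_only"
```

Keeping the mapping in one table makes it easy to see at a glance how many monitors can actually wake someone up, which is the point of separating "critical" from "important".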

Problem: monitoring environment too complex for one person to fully understand

- Recent monitoring changes put us at risk
- Best subject matter experts don’t understand the monitors well
- Monitoring specialist doesn’t know the specifics of each service
- Monitoring specialist is not the best person to have the primary responsibility for monitoring
- Someone should review the service and tune the monitors, using the monitoring specialist as a resource
- Every monitor should be well described, including impact, probable causes, troubleshooting info
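
One way to enforce the "well described" requirement is to keep a small record per monitor and flag the ones with gaps. This is a minimal sketch: the `MonitorDoc` class, its field names, and `missing_fields` are hypothetical and not tied to any particular monitoring tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MonitorDoc:
    """Hypothetical description record for one monitor."""
    name: str
    impact: str = ""
    probable_causes: List[str] = field(default_factory=list)
    troubleshooting: str = ""


def missing_fields(doc):
    """Return the names of description fields that are still empty."""
    missing = []
    if not doc.impact.strip():
        missing.append("impact")
    if not doc.probable_causes:
        missing.append("probable_causes")
    if not doc.troubleshooting.strip():
        missing.append("troubleshooting")
    return missing
```

A service owner reviewing their monitors could run this over every record and work through whatever it reports as incomplete.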

- Lots of monitors in red/yellow/orange state for a long time means it’s tough to find interesting/relevant problems.
- Chronic, frequent problems that don’t need a response are good candidates for nuking.
- Sheer number of monitors makes it hard to evaluate whether monitoring is working right.
- Periodic review of disabled monitors (automatic report would be better).
- Disabled notifications should be tracked by a ticket; when closing, reopen if the monitor is still disabled.
- Don’t disable “checking”, only disable “notifications”; the system should keep checking/recording.
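
The automatic report on disabled monitors could look something like the sketch below. It assumes monitor state can be exported into simple records. The `MonitorState` fields and the `review_disabled` function are hypothetical names for this example, not part of any real monitoring API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MonitorState:
    """Hypothetical snapshot of one monitor's settings."""
    name: str
    checking_enabled: bool
    notifications_enabled: bool
    ticket_id: Optional[str] = None  # ticket tracking the mute, if any
    ticket_open: bool = False


def review_disabled(monitors):
    """Return (monitor name, finding) pairs that need follow-up."""
    findings = []
    for m in monitors:
        # Checks should keep running and recording even when muted.
        if not m.checking_enabled:
            findings.append((m.name, "checking disabled; mute notifications instead"))
        if not m.notifications_enabled:
            if m.ticket_id is None:
                findings.append((m.name, "notifications disabled without a ticket"))
            elif not m.ticket_open:
                findings.append((m.name, "ticket closed but still muted; reopen " + m.ticket_id))
    return findings
```

Run on a schedule, a report like this replaces the manual periodic review and catches tickets that were closed while the monitor stayed muted.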