More monitoring is not necessarily better monitoring
What gets measured gets fixed.
Are we measuring the meaningful stuff, or incidentals?
Meaningful metrics: quality, quantity and cost
Incidentals – swap, disk space, anything targeted at a specific “problem” condition
There should be (at least) 3 levels of “severity” with appropriate corresponding actions
Major site problem – VIP not working, broken images, error messages, can’t place orders, etc.
Minor site problem – Slow response, error in rare cases, monitoring broken
Not customer-facing – One server or a small number have a problem but VIP is OK
Minor – Node is not down but some resources are out-of-bounds (swap, disk, CPU, etc.)
Forensic, correlating – Should not notify anyone, but can be used for additional information
“Critical” is not necessarily important (e.g. NTP can be “critical” but not worth notifying anyone)
Too many minor alarms can mask important problems and dull our response
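A minimal sketch of the severity levels above, keeping the tool's own state ("critical") separate from how much we actually care. The level names come from the list; the action names in ACTIONS, the Alert fields and the check names are assumptions made up for illustration, not anything a particular monitoring tool provides.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    MAJOR_SITE = 1           # VIP not working, broken images, can't place orders
    MINOR_SITE = 2           # slow response, rare errors, monitoring broken
    NOT_CUSTOMER_FACING = 3  # a few servers have a problem but the VIP is OK
    MINOR = 4                # node is up, some resource is out-of-bounds
    FORENSIC = 5             # recorded for correlation, never notifies


# Hypothetical actions: the post only says each level needs an
# "appropriate corresponding action", not what that action is.
ACTIONS = {
    Severity.MAJOR_SITE: "page_on_call",
    Severity.MINOR_SITE: "notify_team",
    Severity.NOT_CUSTOMER_FACING: "open_ticket",
    Severity.MINOR: "daily_digest",
    Severity.FORENSIC: "record_only",
}


@dataclass
class Alert:
    check: str           # hypothetical check name
    tool_state: str      # what the monitoring tool reports, e.g. "critical"
    severity: Severity   # how much we actually care


def action_for(alert: Alert) -> str:
    """Route on severity only; the tool's own state does not decide."""
    return ACTIONS[alert.severity]


def notifies_someone(alert: Alert) -> bool:
    return action_for(alert) != "record_only"


ntp = Alert("ntp_offset", "critical", Severity.FORENSIC)
vip = Alert("vip_http", "critical", Severity.MAJOR_SITE)
print(action_for(ntp), notifies_someone(ntp))
print(action_for(vip), notifies_someone(vip))
```

Both alerts are "critical" as far as the tool is concerned, but only the VIP check reaches a person.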
Problem: monitoring environment too complex for one person to fully understand
Recent monitoring changes put us at risk
Best subject matter experts don’t understand the monitors well
Monitoring specialist doesn’t know the specifics of each service
Monitoring specialist is not the best person to have the primary responsibility for monitoring
Someone should review the service and tune the monitors, using monitoring specialist as a resource
Every monitor should be well described, including impact, probable causes, troubleshooting info
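One way to enforce "well described" is to refuse (or at least flag) monitors whose description is incomplete. This is only a sketch: the MonitorDoc class, its field names and the example monitor are hypothetical, chosen to mirror the three items named above (impact, probable causes, troubleshooting info).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MonitorDoc:
    name: str
    impact: str = ""
    probable_causes: List[str] = field(default_factory=list)
    troubleshooting: str = ""


def missing_fields(doc: MonitorDoc) -> List[str]:
    """Return the names of description fields that are still empty."""
    missing = []
    if not doc.impact.strip():
        missing.append("impact")
    if not doc.probable_causes:
        missing.append("probable_causes")
    if not doc.troubleshooting.strip():
        missing.append("troubleshooting")
    return missing


# Hypothetical monitor with only the impact filled in.
doc = MonitorDoc(name="checkout_latency", impact="Customers see slow checkout")
print(missing_fields(doc))
```

A report of monitors with a non-empty result gives the service owner a concrete list to work through with the monitoring specialist.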
Lots of monitors sitting in a red/yellow/orange state for a long time make it tough to find the interesting, relevant problems.
Chronic, frequent problems that don’t need a response are good candidates for nuking.
Sheer number of monitors makes it hard to evaluate whether monitoring is working right
Periodic review of disabled monitors (automatic report would be better).
Disabled notifications should be tracked by a ticket; when closing, reopen if the monitor is still disabled.
Don’t disable “checking”, only “notifications”; the system should keep checking and recording.
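The last three points can be sketched together: muting a monitor silences notifications but not checks, every mute carries a ticket, and a ticket cannot be closed while its monitor is still muted. Everything here is an assumption for illustration: the Monitor class, the ticket ID format and the 7-day report threshold are made up, not features of any specific tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Monitor:
    name: str
    notifications_enabled: bool = True
    mute_ticket: Optional[str] = None        # hypothetical ticket ID
    muted_since: Optional[datetime] = None
    history: List[Tuple[datetime, bool]] = field(default_factory=list)

    def mute(self, ticket: str, now: datetime) -> None:
        """Silence notifications only; a ticket is required."""
        self.notifications_enabled = False
        self.mute_ticket = ticket
        self.muted_since = now

    def record(self, ok: bool, now: datetime) -> bool:
        """Always record the result; return True if someone should be notified."""
        self.history.append((now, ok))
        return (not ok) and self.notifications_enabled


def muted_report(monitors: List[Monitor], now: datetime,
                 older_than: timedelta = timedelta(days=7)) -> List[str]:
    """Names of monitors muted for at least `older_than` (assumed threshold)."""
    return [m.name for m in monitors
            if not m.notifications_enabled
            and m.muted_since is not None
            and now - m.muted_since >= older_than]


def can_close_ticket(ticket: str, monitors: List[Monitor]) -> bool:
    """False while any monitor muted under this ticket is still muted."""
    return not any(m.mute_ticket == ticket and not m.notifications_enabled
                   for m in monitors)


disk = Monitor("disk_space")
disk.mute("TICKET-1", datetime(2024, 1, 1))
now = datetime(2024, 1, 10)
print(disk.record(False, now), len(disk.history))
print(muted_report([disk], now), can_close_ticket("TICKET-1", [disk]))
```

The failed check is still recorded in the history even though nobody is notified, so the forensic data survives the mute.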
Without knowing specifically what you’re monitoring (though I can guess readily enough), it sounds like you’ve got most of the bases covered with your thoughts there. Sounds like there’s a lot going on.
Like you wouldn’t believe.
Fuck yeah, dude. This whole thing should be sent to jw and C, stat. ;)