More monitoring is not necessarily better monitoring

- What gets measured gets fixed. Are we measuring the meaningful stuff, or incidentals?
  - Meaningful metrics: quality, quantity and cost
  - Incidentals: swap, disk space, anything targeted at a specific “problem” condition
- There should be (at least) 3 levels of “severity” with appropriate corresponding actions
  - Major site problem - VIP not working, broken images, error messages, can’t place orders, etc.
  - Minor site problem - Slow response, error in rare cases, monitoring broken
  - Not customer-facing - One server or a small number have a problem but VIP is OK
  - Minor - Node is not down but some resources are out of bounds (swap, disk, CPU, etc.)
  - Forensic, correlating - Should not notify anyone, but can be used for additional information
- “Critical” is not necessarily important (e.g. NTP can be “critical” but not worth notifying anyone)
- Too many minor alarms can mask important problems and dull our response
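The severity levels above can be sketched as a small routing table. This is a minimal illustration, not a description of any particular monitoring tool: the check names, the action strings, and the choice to treat unknown checks as forensic are all assumptions, since the notes only say that each level should have an appropriate corresponding action.

```python
from enum import Enum


class Severity(Enum):
    MAJOR_SITE = 1           # VIP not working, can't place orders
    MINOR_SITE = 2           # slow response, errors in rare cases
    NOT_CUSTOMER_FACING = 3  # a few servers have a problem, VIP is OK
    MINOR_NODE = 4           # node is up, a resource is out of bounds
    FORENSIC = 5             # kept for correlation, never notifies


# Assumed actions, chosen only for illustration.
ACTIONS = {
    Severity.MAJOR_SITE: "page on-call now",
    Severity.MINOR_SITE: "notify on-call in business hours",
    Severity.NOT_CUSTOMER_FACING: "open ticket",
    Severity.MINOR_NODE: "daily report",
    Severity.FORENSIC: "log only",
}

# Hypothetical check names, each assigned a severity by customer impact.
CHECK_SEVERITY = {
    "vip_http": Severity.MAJOR_SITE,
    "checkout_latency": Severity.MINOR_SITE,
    "single_node_http": Severity.NOT_CUSTOMER_FACING,
    "disk_space": Severity.MINOR_NODE,
    "ntp_offset": Severity.FORENSIC,
}


def action_for(check_name: str, tool_label: str = "") -> str:
    """Return the action for a check.

    tool_label is whatever the monitoring tool calls the state
    (e.g. "critical"); it is deliberately ignored, because the
    action depends on impact, not on the tool's label.
    Unknown checks fall back to FORENSIC (an assumption).
    """
    severity = CHECK_SEVERITY.get(check_name, Severity.FORENSIC)
    return ACTIONS[severity]


if __name__ == "__main__":
    # NTP reports "critical" but still notifies no one.
    print(action_for("ntp_offset", tool_label="critical"))
    print(action_for("vip_http", tool_label="critical"))
```

Keeping the tool's own label out of the routing decision is one way to keep a "critical" NTP alarm from paging anyone while a broken VIP still does.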
...