Monthly Archives: May 2007

Hawaii, here we come

Friends, we have made reservations for Hawaii! We are going to the Big Island, from October 14 to October 20. That’s six nights, checking in Sunday and checking out Saturday.

I know I have asked a few of you if you’d like to go with us, so now would be a good time to confirm your availability and request time off from work, etc.

We have a short list (family mostly) that we want with us if possible, but not all of them will be available and so far there’s plenty of room. If you’d like to join us, let me know and I’ll get back to you about whether there’s still space. I have reserved 4 bedrooms total (2 with king, 2 with double-queen), but there’s a chance to get more rooms if we act quickly.

The rooms are pre-paid using our timeshare, so you would only need to cover your travel, meals and entertainment (there’s a kitchen unit in the room).

Here’s a link to the resort info: http://www.hiltongrandvacations.com/hilton-waikoloa-timeshare.php

Thoughts on monitoring

More monitoring is not necessarily better monitoring
What gets measured gets fixed.
Are we measuring the meaningful stuff, or incidentals?
Meaningful metrics: quality, quantity and cost
Incidentals – swap, disk space, anything targeted at a specific “problem” condition
There should be (at least) 3 levels of “severity” with appropriate corresponding actions
Major site problem – VIP not working, broken images, error messages, can’t place orders, etc.
Minor site problem – Slow response, error in rare cases, monitoring broken
Not customer-facing – One server or a small number have a problem but VIP is OK
Minor – Node is not down but some resources are out of bounds (swap, disk, CPU, etc.)
Forensic, correlating – Should not notify anyone, but can be used for additional information
“Critical” is not necessarily important (e.g. NTP can be “critical” but not worth notifying anyone)
Too many minor alarms can mask important problems and dull our response
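
To make the severity idea concrete, here is a rough Python sketch of the levels above mapped to actions. The only rule taken from the notes is that forensic data should not notify anyone; the other actions ("page", "email", "ticket", "dashboard") and all the names are my own assumptions, not a real tool's configuration.

```python
from enum import Enum


class Severity(Enum):
    MAJOR_SITE = 1           # VIP not working, broken images, can't place orders
    MINOR_SITE = 2           # slow response, errors in rare cases, monitoring broken
    NOT_CUSTOMER_FACING = 3  # a few servers have a problem but the VIP is OK
    MINOR_NODE = 4           # node is up but swap, disk, or CPU is out of bounds
    FORENSIC = 5             # kept for correlation only


# Hypothetical actions; only "FORENSIC never notifies" comes from the notes.
ACTIONS = {
    Severity.MAJOR_SITE: "page",
    Severity.MINOR_SITE: "email",
    Severity.NOT_CUSTOMER_FACING: "ticket",
    Severity.MINOR_NODE: "dashboard",
    Severity.FORENSIC: "record",
}

# Actions that interrupt a person (an assumption for this sketch).
NOTIFYING_ACTIONS = {"page", "email", "ticket"}


def action_for(severity: Severity) -> str:
    """Return the action assigned to a severity level."""
    return ACTIONS[severity]


def notifies_a_person(severity: Severity) -> bool:
    """True if the assigned action reaches a human rather than just a log."""
    return ACTIONS[severity] in NOTIFYING_ACTIONS
```

The point of the table is that a tool's own label ("critical", "warning") never appears in it; what a check does to people is decided by where it lands in our levels.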

Problem: monitoring environment too complex for one person to fully understand
Recent monitoring changes put us at risk
Best subject matter experts don’t understand the monitors well
Monitoring specialist doesn’t know the specifics of each service
Monitoring specialist is not the best person to have the primary responsibility for monitoring
Someone should review the service and tune the monitors, using monitoring specialist as a resource
Every monitor should be well described, including impact, probable causes, troubleshooting info
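
One possible way to enforce the "well described" rule is to keep the description next to the monitor and flag the blanks. This is only a sketch; the class and field names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MonitorDoc:
    """Description attached to one monitor (hypothetical structure)."""
    name: str
    impact: str = ""
    probable_causes: List[str] = field(default_factory=list)
    troubleshooting: str = ""

    def missing_fields(self) -> List[str]:
        """List the description fields that are still empty."""
        missing = []
        if not self.impact.strip():
            missing.append("impact")
        if not self.probable_causes:
            missing.append("probable_causes")
        if not self.troubleshooting.strip():
            missing.append("troubleshooting")
        return missing


doc = MonitorDoc(name="vip-http", impact="Customers cannot place orders")
print(doc.missing_fields())
```

A service owner reviewing their monitors could start from the ones where this list is not empty.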

Lots of monitors in red/yellow/orange state for a long time means it’s tough to find interesting/relevant problems.
Chronic, frequent problems that don’t need a response are good candidates for nuking.
Sheer number of monitors makes it hard to evaluate whether monitoring is working right
Periodic review of disabled monitors (automatic report would be better).
Disabled notifications should be tracked by a ticket; when the ticket is closed, reopen it if the monitor is still disabled.
Don’t disable “checking”, only disable “notifications”, system should keep checking/recording.
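
The last few points can be sketched in Python as follows: disabling a monitor only silences notifications, every check result is still recorded, and a report lists what is silenced and under which ticket. The names and the ticket id format are made up for the example; this is not how any particular monitoring system works.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class Monitor:
    name: str
    notifications_enabled: bool = True
    disabled_ticket: Optional[str] = None  # hypothetical ticket id
    history: List[Tuple[datetime, bool]] = field(default_factory=list)

    def disable_notifications(self, ticket: str) -> None:
        """Silence notifications, and require a ticket to track it."""
        self.notifications_enabled = False
        self.disabled_ticket = ticket

    def record_check(self, when: datetime, ok: bool) -> bool:
        """Always record the result; return True only if someone should be notified."""
        self.history.append((when, ok))
        return (not ok) and self.notifications_enabled


def disabled_report(monitors: List[Monitor]) -> List[Tuple[str, Optional[str]]]:
    """Name and tracking ticket of every monitor with notifications off."""
    return [(m.name, m.disabled_ticket) for m in monitors if not m.notifications_enabled]


def ticket_must_reopen(monitor: Monitor) -> bool:
    """A closed ticket should be reopened while notifications are still off."""
    return not monitor.notifications_enabled
```

Because the history keeps growing while notifications are off, the same data can later show which monitors fail chronically and are candidates for removal.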