[Note: as of this writing, Greg is still unable to kill selected Wisconsin residents with the power of his mind.]
More information on this outage: according to the logs, the dhcp service stopped running at about 21:37, and I think I've figured out why the service stopped.
The logs show a login as “root” from “maintain” at 21:30, a few minutes before the outage. This root session is still open as of this email. Additionally, there is a log file in /home/[username redacted] that records, once per minute, whether the daemon is running and who is logged in — and that file starts at the same time as the outage.
So, given that we have a limited track record in this configuration, it’s a bit early to make generalizations, but our first and only official outage of dhcp appears to be self-inflicted. Here’s a quick breakdown of root cause and solution.
Problem: dhcpd dying for any reason
Solution: Monitoring for this condition is still our first and best defense. We will be adding a cron-based script to check for this condition and automatically restart the daemon, but this is not a replacement for monitoring, since there are cases where a restart script won’t work (machine out of resources, config file fatal error). We will have the restart cron in place before pushing all ranges to production.
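A minimal sketch of what such a check-and-restart cron job could look like (the pidfile path and the init command here are assumptions, not our actual script):

```shell
#!/bin/sh
# Hypothetical dhcpd watchdog sketch -- run from cron every few minutes.
# The pidfile path and restart command are assumptions for illustration.

daemon_alive () {
    # kill -0 sends no signal; it only tests whether the PID recorded
    # in the pidfile still belongs to a running process.
    [ -f "$1" ] && kill -0 "$(cat "$1")" 2>/dev/null
}

if daemon_alive /var/run/dhcpd.pid; then
    :  # dhcpd is running; nothing to do
else
    # A restart will NOT help if the machine is out of resources or the
    # config file has a fatal error, which is why monitoring stays the
    # first line of defense.
    if ! /etc/init.d/dhcpd start >/dev/null 2>&1; then
        # cron mails anything written to stderr to the job's owner.
        echo "dhcpd-watchdog: restart failed, needs operator attention" >&2
    fi
fi
```

The point of the `daemon_alive` check is to make the script a no-op in the common case, so running it every minute costs nothing.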
Problem: root access on maintain affords root access on dhcp1-2
Solution: If there are people who need root access on maintain but not on dhcp1-2, then we need to lock down the config-push process so that trusted users on maintain can’t gain privileged access on dhcp1-2. Our current implementation isn’t locked down to this level because we assume the access lists can easily be made consistent. If the access lists need to be different, we will assess the risk and act accordingly.
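As one illustration of that kind of lockdown (assuming the push could run over ssh — nothing here reflects our current rsh-based setup), a forced command on the receiving side would let maintain trigger exactly one action on dhcp1-2 and nothing else:

```
# Hypothetical line in the push account's ~/.ssh/authorized_keys on
# dhcp1-2: this key may only run the named install script, with no tty
# and no port forwarding, regardless of what the client requests.
command="/usr/local/sbin/install-dhcpd-conf",no-pty,no-port-forwarding ssh-rsa AAAA... push@maintain
```

The option names (command=, no-pty, no-port-forwarding) are standard sshd authorized_keys options; the script path and key are made up for the example.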
Problem: staff members taking advantage of security holes
Solution: This is a behavior problem and should be addressed by management. If someone has been granted user access and not root access, taking advantage of a loophole to gain root access should be pretty clearly off-limits as well. Please clarify the access policy as appropriate.
Problem: staff members with root access killing production services
Solution: Any changes to a running production server that might affect the service should be announced ahead of time and done during an appropriate maintenance window. Please clarify the policy and take any appropriate action.
Two more root shells were open on dhcp1. I have killed them.
These are the files that were accessed at the time the root shells were started (showing access time, not mod time):
-rwsr-xr-x 1 root root 29788 May 24 07:39 /sbin/dnsqa
-rwsr-xr-x 1 root root 7836 May 24 07:39 /usr/bin/rsh
Of these, “dnsqa” is probably the suspicious one: a setuid binary owned by root runs with root privileges no matter who invokes it. I have chowned it back to [username redacted], so the file no longer grants root.
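For illustration, here is a rough way to sweep a filesystem for other setuid files — a generic sketch, not what we ran on dhcp1; the directory and the baseline idea are illustrative:

```shell
#!/bin/sh
# Hypothetical setuid sweep -- one way to spot unexpected binaries like
# dnsqa before someone uses them.

list_setuid () {
    # -perm -4000 matches any file with the setuid bit set; -xdev keeps
    # the scan on a single filesystem.
    find "$1" -xdev -type f -perm -4000 -exec ls -l {} \; 2>/dev/null || true
}

# A cron job could diff this listing against a known-good baseline and
# mail any new entries to an operator.
list_setuid /usr/bin
```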
There are probably more exploits we will find over time that allow non-root users to gain root, but I don’t know how much effort we should expend trying to find and close all of them. Our policy so far has been to trust non-root users not to abuse the system to gain more access than they have been given, so this type of local-user exploit is normally not much of a concern. This is typical of other non-SGI locations as well, so we are not alone. We should decide which direction to go:
1. we assume that local users are trustworthy
2. we expend more resources trying to make the system more bulletproof (similar to ISPs who give shell accounts)
3. we decide not to have any local accounts at all for non-root users
Let me know what you think…