[Note: as of this writing, Greg is still unable to kill selected Wisconsin residents with the power of his mind.]
Monday
More information on this outage. According to the log, it looks like the dhcp service stopped running about 21:37. I think I figured out why the service stopped.
The logs show a login as “root” from “maintain” at 21:30, a few minutes before the outage. This root session is still open as of this email. Additionally, there is a log file in /home/[username redacted] that records, once per minute, whether the daemon is running and who is logged in, and that file starts at the same time as the outage.
So, given that we have a limited track record in this configuration, it’s a bit early to make generalizations, but our first and only official outage of dhcp appears to be self-inflicted. Here’s a quick breakdown of root cause and solution.
Problem: dhcpd dying for any reason
Solution: Monitoring for this condition is still our first and best defense. We will be adding a cron-based script to check for this condition and restart the daemon automatically, but this is not a replacement for monitoring, since there are cases where a restart script won’t work (machine out of resources, fatal error in the config file). We will have the restart cron job in place before pushing all ranges to production.
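For reference, here’s a minimal sketch of the kind of cron check we have in mind. The paths (/var/run/dhcpd.pid, /etc/init.d/dhcpd, /usr/local/sbin/check-dhcpd) are assumptions and will need to match the actual build:

#!/bin/sh
# check-dhcpd: restart dhcpd if it has died. Meant to run from cron once a minute.
PIDFILE=/var/run/dhcpd.pid
if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    exit 0   # daemon is alive, nothing to do
fi
logger -t check-dhcpd "dhcpd not running, attempting restart"
/etc/init.d/dhcpd restart || logger -t check-dhcpd "restart failed, needs a human"

And the matching crontab entry (assuming a system crontab with a user field):

* * * * * root /usr/local/sbin/check-dhcpd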
Problem: root access on maintain affords root access on dhcp1-2
Solution: If there are people who need root access on maintain but not on dhcp1-2, then we need to lock down the config-push process so that trusted users on maintain can’t gain privileged access on dhcp1-2. Our current implementation isn’t locked down to this level because we assume the access lists can easily be made consistent. If the access lists need to be different, we will assess the risk and act accordingly.
Problem: staff members taking advantage of security holes
Solution: This is a behavior problem and should be addressed by management. If someone has been granted user access and not root access, taking advantage of a loophole to gain root access should be pretty clearly off-limits as well. Please clarify the access policy as appropriate.
Problem: staff members with root access killing production services
Solution: Any changes to a running production server that might affect the service should be announced ahead of time and done during an appropriate maintenance window. Please clarify the policy and take any appropriate action.
Tuesday
Two more root shells were open on dhcp1. I have killed them.
These are the files that were accessed at the time the root shells were started (showing access time, not mod time):
-rwsr-xr-x 1 root root 29788 May 24 07:39 /sbin/dnsqa
-rwsr-xr-x 1 root root 7836 May 24 07:39 /usr/bin/rsh
Of these, “dnsqa” is probably the suspicious one (it is setuid root), so I have chowned it back to [username redacted] so that the file no longer grants root privileges.
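For the record, neutralizing a setuid binary like this is a one-liner, and a periodic sweep will turn up any others. The commands below are illustrative; the real username is redacted above, so “user” is a stand-in:

chown user:user /sbin/dnsqa   # once it isn't owned by root, the setuid bit grants nothing extra
chmod u-s /sbin/dnsqa         # belt and braces: drop the setuid bit outright
find / -xdev -user root -perm -4000 -type f -ls   # list every remaining setuid-root file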
There are probably more exploits we will find over time that allow non-root users to gain root, but I don’t know how much effort we should expend trying to find and close all of them. Our policy so far has been that we trust non-root users not to abuse the system to try to gain more access than they have been given, so this type of local-user exploit is normally not much of a concern. This is also typical of other non-SGI locations, so we are not alone. We need to decide which direction to go:
1. we assume that local users are trustworthy
2. we expend more resources trying to make the system more bulletproof (similar to ISPs who give shell accounts)
3. we decide not to have any local accounts at all for non-root users
Let me know what you think…
gregc
Personally I’d say that for network-critical machines like DHCP or DNS servers, no non-root users should have a local account on the machine, especially when the machine in question has a history of people planting exploits to gain root privs from a normal account (in quite a few organisations this alone would be sufficient for the machine to be taken out of service and have the OS rebuilt).
After all, given root-level access to DHCP or DNS servers it would be very easy to perform some nasty attacks on end users: for example, poisoning the name servers’ caches, or using DHCP to hand clients the addresses of evil name servers the attacker controls.
I’d also say that a policy of trusting non-root users to not abuse the system is not a great one as it relies on your users being friendly. Personally I’d rather assume that all of your users are out to get you; this can often end up being true if an outside attacker breaks into the account of a user or a user decides to cause some damage to your systems.
Completely in agreement. For mostly historical reasons, we have a policy of giving root access to the “IT Architecture” team while the servers are being built and tested, and then removing root access before going to production. In the past they have been “the brains of the operation” but we’re trying to assert more independence and move them to an advisory role rather than an active role.
Said users are supposed to be trusted staff members, but when they pull jackass stunts like killing the server because they disagree with our decision not to use a daemontools-style wrapper/sentinel thingy… well, hopefully management will start to question the usefulness of the policy (and if we’re lucky, to question the usefulness of the staff member in question as well).
The jackass in question is supposed to be a trusted member of the IT organization, but his recent actions have eroded almost all trust. We are taking action to protect the servers, but I also want to make it painfully obvious to management that this guy is being a jackass and incurring extra costs to the company because of it.
Some might say, if he’s fooled me twice, shame on me. I’m leaving the system open to additional exploits, but if they are actually exploited, it’s his ass on the line (or in the queue) and not likely mine. I’m currently in the “give him enough rope” stage of the game.
The logs show a login as “root” from “maintain” at 21:30
Well, there’s problem #1. Why, oh why, do you even allow logins as root? Makes it absolutely impossible to audit who’s accessing and making changes to your systems. Quit it. There’s almost never a good excuse for it.
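If it’s OpenSSH handling the remote logins, the fix is one line in sshd_config (restart sshd afterwards):

# /etc/ssh/sshd_config
PermitRootLogin no
# or, if an automated root key really must stay around for the config push:
# PermitRootLogin forced-commands-only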
It really sounds like you guys need some help from above… for someone to sit down and say “these are the policies” and enforce them. It sounds like you have a real “too many cooks” problem that won’t be solved unless someone of authority is willing to take a stand on it.
Ah, turf wars. Fun.
*snuggles*
p.s. You AIMable?
aim is “nekodojo” (all IM info is in my lj userinfo)
Yes, you are right. The root ssh key is intended so that another server can push the config file and restart the daemon. It is restricted to only the pushing host. Trouble is, at the time, ${dickhead} had sudo ability on one and not the other.
We could further limit the key so that it is only able to execute rsync and dhcpd restart… but in our estimation, if someone is able to push a (possibly empty) config file and restart the service, that’s as much damage as could probably be done. Plus I want to make the point to management that we should be able to trust these jokers, not continually have to defend machines against our own people for fsck sake. *sigh*
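If we do go that route, the usual trick is a forced command on the push key in root’s authorized_keys on dhcp1-2. The key blob, wrapper path, and comment field below are placeholders:

from="maintain",command="/usr/local/sbin/dhcp-push-wrapper",no-port-forwarding,no-X11-forwarding,no-pty ssh-rsa AAAA...(key)... push@maintain

The wrapper then honors only the two operations the push actually needs:

#!/bin/sh
# dhcp-push-wrapper: forced command for the push key. Allows the rsync server
# side and a dhcpd restart; refuses and logs anything else.
# (A stricter version would also pin down the rsync destination path.)
case "$SSH_ORIGINAL_COMMAND" in
    "rsync --server"*)
        exec $SSH_ORIGINAL_COMMAND ;;
    "/etc/init.d/dhcpd restart")
        exec /etc/init.d/dhcpd restart ;;
    *)
        logger -t dhcp-push-wrapper "refused: $SSH_ORIGINAL_COMMAND"
        exit 1 ;;
esac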
1. we assume that local users are trustworthy
Fallacious argument. You are to assume all local users are untrustworthy and configure appropriately. Trust in this sense doesn’t mean “I trust you”; it means “I am giving you the ability to completely destroy my machine”. This assumption leads to the dark side; no good will come of it :).
2. we expend more resources trying to make the system more bulletproof (similar to ISPs who give shell accounts)
This is more sensible, given that you determine who needs to do what and configure appropriately.
3. we decide not to have any local accounts at all for non-root users
Even if you have users in a group that can su to root, you’re not really preventing anything; you’re still granting trust.
Personally, my policy would be “You abuse root, you’re gone.” Period, end of story. Mind you, that also entails having the tools in place to capture such abuse.
I don’t envy you. While it is a nice assumption to make that your staff is trustworthy, it’s still a risky one.
Good luck :)