Tag Archives: work


This week at work we’ve been experimenting with MogileFS, a “filing system” capable of storing and retrieving many files across many hosts. It’s not a true file system, meaning that you can’t mount it, list it with “ls”, or “cat” the files in it, but using a custom Perl module you can put files in and take them back out.

Some interesting facts about it: It’s based on HTTP GET/PUT, so I believe it would be a good complement to other systems we have at work. Without going into too much detail, it’s no surprise to anyone that Shutterfly receives lots of picture files from users and stores them for later printing :) Also, the MogileFS tracker module takes care of ensuring that files are replicated the way you want; for example, if you want 2 copies of each file living on different nodes, the file will be duplicated appropriately after the initial upload. If one of the nodes fails, the other nodes that hold the same data as the lost one will duplicate those items again so that the replication count is maintained.
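To make the replication behavior concrete, here is a toy sketch in Python. It is not MogileFS code and does not use its API; the function name `rebalance`, the host names, and the "least-loaded host first" rule are all assumptions made up for illustration. It only tracks which hosts hold which file, and tops each file back up to the requested number of copies after a host disappears.

```python
def rebalance(placement, live_hosts, copies=2):
    """Return a new {file: set(hosts)} map with `copies` replicas per file.

    `placement` maps file names to the hosts currently holding them
    (an empty set means a freshly uploaded file with no replicas yet).
    This is a hypothetical model, not how the MogileFS tracker is written.
    """
    live = sorted(live_hosts)
    if copies > len(live):
        raise ValueError("not enough live hosts for the requested copy count")

    # Forget replicas that lived on hosts which are no longer available.
    result = {name: set(held) & set(live) for name, held in placement.items()}

    # Count how many replicas each live host already stores.
    load = {host: 0 for host in live}
    for held in result.values():
        for host in held:
            load[host] += 1

    for name in sorted(result):
        held = result[name]
        if placement[name] and not held:
            # Every copy was on a failed host; nothing is left to copy from.
            raise ValueError(f"all replicas of {name} were lost")
        while len(held) < copies:
            # Pick the least-loaded live host that does not hold the file yet.
            candidates = [host for host in live if host not in held]
            target = min(candidates, key=lambda host: (load[host], host))
            held.add(target)
            load[target] += 1
    return result
```

Calling `rebalance` once with empty sets models the initial upload; calling it again with a shorter host list models a node failure followed by re-replication.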
Rest of this is probably boring to many


[Note: as of this writing, Greg is still unable to kill selected Wisconsin residents with the power of his mind.]


More information on this outage. According to the log, it looks like the dhcp service stopped running about 21:37. I think I figured out why the service stopped.

The logs show a login as “root” from “maintain” at 21:30, a few minutes before the outage. This root session is still open as of this email. Additionally, there is a log file in /home/[username redacted] showing whether the daemon is running and who is logged in once per minute, and that file starts at the same time as the outage.

So, given that we have a limited track record in this configuration, it’s a bit early to make generalizations, but our first and only official outage of dhcp appears to be self-inflicted. Here’s a quick breakdown of root cause and solution.

Problem: dhcpd dying for any reason
Solution: Monitoring for this condition is still our first and best defense. We will be adding a cron-based script to check for this condition and automatically restart, but this is not a replacement for monitoring, since there are cases where a restart script won’t work (machine out of resources, config file fatal error). We will have the restart cron in place before pushing all ranges to production.
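A rough sketch of what such a check-and-restart script could look like, in Python. This is not the script we deployed; the process name match via `pgrep -x dhcpd` and especially the restart command (`service dhcpd restart`) are assumptions and would need to be replaced with whatever the host actually uses. The "failed" result is the case where monitoring still has to take over.

```python
import subprocess


def dhcpd_is_running():
    # pgrep exits with status 0 when at least one process matches the name.
    result = subprocess.run(["pgrep", "-x", "dhcpd"], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def restart_dhcpd():
    # Assumed restart command; substitute the init script used on the host.
    result = subprocess.run(["service", "dhcpd", "restart"])
    return result.returncode == 0


def check_and_restart(is_running=dhcpd_is_running, restart=restart_dhcpd):
    """Return "ok", "restarted", or "failed" after one check cycle."""
    if is_running():
        return "ok"
    # Try one restart, then verify the daemon really came back.
    if restart() and is_running():
        return "restarted"
    # Restart did not help (bad config, no resources): a human must look.
    return "failed"
```

Run from cron once a minute, anything other than "ok" should be logged or mailed, since a "failed" result is exactly the situation a restart script cannot fix on its own.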

Problem: root access on maintain affords root access on dhcp1-2
Solution: If there are people who need root access on maintain but not on dhcp1-2, then we need to lock down the config-push process so that trusted users on maintain can’t gain privileged access on dhcp1-2. Our current implementation isn’t locked down to this level because we assume the access lists can easily be made consistent. If the access lists need to be different, we will assess the risk and act accordingly.

Problem: staff members taking advantage of security holes
Solution: This is a behavior problem and should be addressed by management. If someone has been granted user access and not root access, taking advantage of a loophole to gain root access should be pretty clearly off-limits as well. Please clarify the access policy as appropriate.

Problem: staff members with root access killing production services
Solution: Any changes to a running production server that might affect the service should be announced ahead of time and done during an appropriate maintenance window. Please clarify the policy and take any appropriate action.


Two more root shells were open on dhcp1. I have killed them.

These are the files that were accessed at the time the root shells were started (showing access time, not mod time):
-rwsr-xr-x 1 root root 29788 May 24 07:39 /sbin/dnsqa
-rwsr-xr-x 1 root root 7836 May 24 07:39 /usr/bin/rsh

Of these, “dnsqa” is probably the suspicious one, so I have chowned it back to [username redacted] so that file won’t have root privileges.
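For reference, a sweep for other setuid binaries like the two listed above could be sketched in Python as follows. The function name `find_setuid_files` and its parameters are invented for this example; it simply walks a directory tree and reports regular files that have the setuid bit set and belong to a given owner (uid 0, i.e. root, by default).

```python
import os
import stat


def find_setuid_files(root, owner_uid=0):
    """Return sorted paths under `root` that are setuid and owned by owner_uid."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are not followed.
                info = os.lstat(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
            is_regular = stat.S_ISREG(info.st_mode)
            is_setuid = bool(info.st_mode & stat.S_ISUID)
            if is_regular and is_setuid and info.st_uid == owner_uid:
                hits.append(path)
    return sorted(hits)
```

Comparing the output of something like `find_setuid_files("/sbin")` against a known-good list would flag additions such as “dnsqa”, though it says nothing about other ways of gaining root.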

There are probably more exploits we will find over time that allow non-root users to gain root, but I don’t know how much effort we should expend trying to find and close all of them. Our policy so far has been that we trust non-root users not to abuse the system to try to gain more access than they have been given, so this type of local-user exploit is normally not much of a concern. This is also typical of other non-SGI locations, so we are not alone. We should make a decision as to which direction to go:
1. we assume that local users are trustworthy
2. we expend more resources trying to make the system more bulletproof (similar to ISPs who give shell accounts)
3. we decide not to have any local accounts at all for non-root users

Let me know what you think…

seeking dns consultants

My company is seeking some DNS advice. This will probably be of the “consulting service” type of arrangement, where the consultants prepare a quote, work on a specific task, and bill by invoice. (If we can’t find a consulting service we like, we may also consider hiring an hourly contractor for a short period of time, probably 3-6 weeks).

Approximate prerequisites are:
Knows more about DNS than I do
Exposure to multiple (at least 5) corporate IT environments with a role in managing or advising DNS infrastructure, preferably multiple “mixed” (unix/windows) shops
Familiar with recent trends in DNS (such as dynamic updates, Active Directory, etc.) and familiar with DNS best practices.
Able to intelligently discuss choices that most IT organizations make with regard to DNS and articulate the pros and cons (i.e. don’t say “this is the only way”, rather say “here are the tradeoffs”)

Tell me if you know of a consulting service that might meet our demanding requirements :) For now we are only considering established consulting services with references, but if we get to the point of looking for hourly contracts I’ll probably post again.

I didn’t come here to tell you how this is going to end.

I didn’t come here to tell you how this is going to end. I came here, to tell you how it’s going to begin. I’m going to hang up this phone and then I’m going to show these people what you don’t want them to see. I’m going to show them a world, without you.

Some years from now, I may look back and wish that I had written more about this period, this week, this month.

The beginning of the end

I have accepted a contract position with SGI, to work on sendmail and spam filtering. I am extremely happy about this.

I will be terminating at AV on 4/15 (or maybe 4/16?) I am happy about this as well.

Unfortunately I had already signed up to have my transition extended to 5/31, so I will most likely lose my severance package due to voluntarily (wry grin) leaving earlier than my planned term date. That means I forfeit the 3 months of pay I would have gotten by sticking it out another 6 weeks. But I don’t want to risk losing the new opportunity. I will still get my “please stay until 4/15 bonus” on 4/15 though (which is about 6 weeks’ pay) and I will get my vacation cashed out (3+ weeks).

It would be really cool if my boss could sort of lose the 5/31 extension paperwork and let me go out with full severance on 4/15. I would even be willing to help them out after hours if he did that. If not, if he just says here is your 6 weeks bonus, you forfeit the rest, well, that’s the risk I took by accepting the extension. But if they need anything after that time I will just laugh.

So. The beginning of the end. I need to celebrate next weekend.

Thanks to everyone who has supported me, especially by sending me leads :)