Weekend Report

Last week started off OK, but by the time we got to Wednesday things were on the way downhill. I had a flat tire just as I was pulling into the office driveway. I waited until my afternoon meetings to go out and change it. Surprise! There was one funny lug bolt that needed a special key piece that I don’t have.

Public service announcement: If you have a special anti-theft lug bolt on each wheel, take some time to check and make sure you have the key adapter in your car. Do this ahead of time, before you end up needing it. If someone else put the tires on for you, check that they put the key back in the trunk with the wrench.

As it turns out, the AAA guy was not able to get it off either, and the tire would not inflate at all (due to riding on the rim for the last quarter-mile or so). The dealership has the part, but there are like 10 different patterns, so you can’t just get a ride and go pick it up; they have to actually see the car to match it. (The manual has helpful advice for this circumstance: be sure to write down the code number, visible on the key itself, sometime before losing the key.)

After getting towed to the dealer, I got the $20 part and paid the $70 to the tow guy for the mileage beyond 5 miles. The towing guy was really nice. Also, my friend JT from work drove along with me in case I needed a ride home; that was extra cool.

This week was index deployment week, which involves a lot of meetings. There were also a couple of meetings having to do with last week’s network maintenance. I don’t like having a lot of meetings… I prefer to be available to people as a resource, take on some of the overflow work, and in general keep people happy and focused. Lots of meetings with people outside my group make me feel sort of cut off. Perhaps I should take a laptop to the meetings and stay connected with IRC.

Most of the non-meeting time on Thursday and Friday went to helping deploy a new product. It was delivered to us a week and a half ago, but we didn’t have a chance to work on it until now, and Friday night was the deadline. So Trip and I stayed quite late on Thursday night to try to figure out why it wasn’t working (Trip was doing most of the heavy lifting and I was mainly moral support and research assistant). The data rate had been stuck at 4 megabits, and we were able to get it up to like 8 or 9 (the theoretical max is 50 because we are running two copies at once to the same machine). This was not enough to meet our deadline, but it would probably finish Saturday, and our other deployment was likely to run over a bit too. So we called it “ok for now” but decided we should regroup with the R&D folks Friday morning when they got back in the office.

Friday was a lot of stuff relating to the index deployment, which went pretty smoothly other than the new product. For the sake of discussion, let’s refer to the new product as Opique (not its real name). The engineers mostly told us what we had figured out on our own the night before (about allocating more memory for TCP buffers). There was one engineer who seems to always want to blame shit on other people (we’ll call him Firzin, though that is not his real name). He kept going on about why we waited a week and a half to start. I conceded that was our mistake, but also mentioned that this product has never seen any QA and the type of problems we were finding should have been found by testing beforehand. The engineer who created the thing (we’ll call him Prishanth, again not his real name) seemed strangely silent through the whole affair.
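For anyone wondering why TCP buffer memory matters for a bulk copy: a single connection can only keep about one buffer's worth of data in flight per round trip, so the buffer needs to be at least the bandwidth-delay product to fill the link. Here is a minimal sketch of that arithmetic. The round-trip time and buffer size below are made-up illustration values, not measurements from our network, and the function names are hypothetical.

```python
def bdp_bytes(bandwidth_mbps, rtt_ms):
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    if bandwidth_mbps <= 0 or rtt_ms <= 0:
        raise ValueError("bandwidth and round-trip time must be positive")
    bits_in_flight = bandwidth_mbps * 1_000_000 * rtt_ms / 1000
    return round(bits_in_flight / 8)


def max_mbps_for_buffer(buffer_bytes, rtt_ms):
    """Upper bound on one connection's throughput for a given buffer size."""
    if buffer_bytes <= 0 or rtt_ms <= 0:
        raise ValueError("buffer size and round-trip time must be positive")
    return buffer_bytes * 8 / (rtt_ms / 1000) / 1_000_000


# Hypothetical numbers: a 50 megabit target over an assumed 80 ms round trip.
needed = bdp_bytes(50, 80)
# Hypothetical 64 KiB buffer over the same assumed round trip.
ceiling = max_mbps_for_buffer(64 * 1024, 80)
print(f"buffer needed: {needed} bytes; a 64 KiB buffer caps at {ceiling:.1f} Mbit/s")
```

With those assumed numbers, a small buffer caps a single connection at single-digit megabits no matter how fast the link is, which is the general shape of the problem, though not necessarily our exact situation.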

Later in the afternoon, we discovered that part of the reason our copies were slow was some bad network hardware (one piece was replaced post-haste; the other was not flaky enough to be called an “emergency,” so it will be replaced later). After the hardware was replaced, our copies continued without a hitch, but with only modest improvement. Apparently the problem with the network had started the night before, and (shocker!) there don’t seem to be any monitors for packet loss around, just monitors that tell you if a piece of equipment stops responding completely. Our network infrastructure is mostly clean and healthy, but when something goes wrong it is not easy to discover. Le sigh.
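The gap described above is the difference between an up/down check and a loss check. A minimal sketch of the distinction, assuming you already have probe counts (packets sent and replies received) from some source; the function names and the 1% warning threshold are hypothetical choices, not anything from our monitoring setup:

```python
def packet_loss_percent(sent, received):
    """Percentage of probes that got no reply."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    lost = max(sent - received, 0)
    return 100.0 * lost / sent


def classify_link(sent, received, warn_percent=1.0):
    """Three-state health check instead of a plain up/down check."""
    if received == 0:
        return "down"       # what an up/down monitor already catches
    if packet_loss_percent(sent, received) >= warn_percent:
        return "degraded"   # the flaky-hardware case that slips through
    return "ok"


print(classify_link(100, 100))
print(classify_link(100, 97))
print(classify_link(100, 0))
```

An up/down monitor only ever reports the first and last of those states, so a link dropping a few percent of its packets looks perfectly healthy.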

Anyway, we stayed a bit late Friday, dealing with a couple of pieces of equipment that had gone down completely and needed their copies restarted. The deployment looked as if it would edge into Saturday morning, so we all went home. However, I kept working on the Opique product from home, monitoring the transfers, sending status reports, creating multiple scripts to monitor the data rate, creating spreadsheets, etc. Finally I went to bed at 1 am and set the alarm for 6:00.
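The data-rate monitoring boils down to two bits of arithmetic: bytes moved between two samples gives a rate, and bytes remaining divided by that rate gives an estimated finish time. This is not one of the actual scripts from that night, just a minimal sketch of the idea with hypothetical function names and sample numbers:

```python
def rate_mbps(bytes_start, bytes_end, seconds):
    """Average transfer rate in megabits per second between two samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (bytes_end - bytes_start) * 8 / seconds / 1_000_000


def eta_hours(bytes_remaining, mbps):
    """Hours left at the current rate."""
    if mbps <= 0:
        raise ValueError("rate must be positive")
    seconds = bytes_remaining * 8 / (mbps * 1_000_000)
    return seconds / 3600


# Hypothetical samples: 60 MB copied during a 60 second window.
current = rate_mbps(0, 60_000_000, 60)
# Hypothetical 3.6 GB still left to copy.
print(f"{current:.1f} Mbit/s, about {eta_hours(3_600_000_000, current):.1f} hours to go")
```

Sampling the destination file size every minute or so and feeding it through something like this is enough to fill in a status report.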

Saturday morning, we deployed the index at 7:00 am. Everything went swimmingly. After flipping on, I went back to working on Opique, which ate up much of my day Saturday, except for a brief trip to Fry’s. The copying was still going on all day Saturday and looked like it would not be done until Sunday (and New York would probably be Monday). One server crashed Saturday night and I took great pains to restart the copy partway through, something the product doesn’t support as-is, but I was able to modify the script to support partial copying.

Sunday, more of the same, checking the copies. Most of them were done, though there was one server that had crashed (same one again), so I summoned Dale to take a look in the afternoon. I spent most of the morning and afternoon puzzling over the flip-on script, which is supposed to switch on the new data just copied. After spending some time reading the directions and playing with it, I took a look at the source to see why it wasn’t flipping on, and it turns out there is no flip-on feature (like, at all). So flipping on consisted of logging on to each server, renaming a file, and creating a symlink to it, then stopping and starting the program. Sounds like a pretty easy thing to script, but it just wasn’t. After staring at the scripts we were given for an hour or two, I bypassed the script entirely and managed to get it flipped on manually. Finally, I found that two of the machines in the second stage copy wouldn’t start copying, and this turned out to be another bug. It turns out that if your config file contains “opique.1” and “opique.11”, then doing a grep for “opique.1” will match both lines and get you duplicates. Imagine that. Someone should probably check for that type of thing in QA, but wait, there was no QA this time; we skipped it in honor of Lent. (OK, that was cynical, but finding problems with this product is like shooting fish in a barrel.)
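For the curious, here is the grep bug in miniature. A plain search for “opique.1” matches any line that contains that text, including “opique.11”; matching the name as a whole word (roughly what `grep -w` does) avoids it. The config layout and host names below are made up for illustration, since the real file format isn’t shown here.

```python
import re


def substring_matches(lines, name):
    """Mimic a plain text search: any line containing the name matches."""
    return [line for line in lines if name in line]


def whole_name_matches(lines, name):
    """Match the name only when it is not glued to other word characters."""
    pattern = re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)")
    return [line for line in lines if pattern.search(line)]


# Hypothetical config lines: "<instance name> <host>".
config = ["opique.1 hostA", "opique.11 hostB", "opique.2 hostC"]

print(substring_matches(config, "opique.1"))   # picks up opique.11 as well
print(whole_name_matches(config, "opique.1"))  # only the intended line
```

The `re.escape` call matters too: in a regular expression an unescaped “.” matches any character, so the pattern would be even looser than it looks.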

OK, so here I am, most of my weekend shot, laundry not done, and it’s about time to go to bed. Thanks for listening to the ramble. :)