Tuesday, November 29, 2005
A Five-Hour Phone Call
At around 3:15 PM, I got a call from a project manager asking if I could join a conference call. "Sure." I called in and started talking to the project manager and a couple of other people. They had deployed a new version of our software yesterday, and today problems were being reported. I was one of three people with detailed technical knowledge of the system, and I was the only one available at the time, so they hoped I could diagnose and correct the problem.
For the next half hour or so, I asked and answered questions. The problem was strange, and we couldn't come up with any theories to explain it. The worst part was that the new software version worked fine on about half of the systems where it was deployed, but didn't work on the other half. If none of them worked, it would be easy to point at the new version as being the problem, but the fact that some of them worked led us to think it might have something to do with misconfiguration, or the communications infrastructure, or user error, or something else.
After half an hour, one of the other two guys who understand the system joined the call. So we went over everything with him. Log files were e-mailed and files were transferred over the Internet so that we could all see the same information. As we added his expertise to the mix, we didn't get any closer to an explanation; we just developed a longer list of questions.
One thing we really needed was for a trusted onsite person to try a few things with one of the machines and tell us exactly what was happening. A technician was on his way to one of the problem sites. He said he would be there in about twenty minutes. So we waited. And waited. We continued discussing things and examining log files, but after an hour and a half had passed, we decided to find out where that technician was. It turned out that other (unrelated) customer emergencies had arisen, and the tech had been diverted. But he was now on his way to the original site. He said he would be there in about twenty minutes.
It took longer than twenty minutes, but the tech did reach the site and describe the situation to us. After a little investigation, we determined that this particular system had been misconfigured. The tech changed a setting, and the system started working. Hooray! Unfortunately, this only fixed one particular system. It didn't explain the problems we were seeing with the other systems.
So, we kept at it. At around the three-hour mark, the third guy-who-knows-something joined the call. So we went over everything again. More logs were e-mailed and more files were transferred. Guy-who-knows-something #2 dropped out at the four-hour mark (around 7:15 PM). The next 45 minutes were pretty quiet, as Guy #3 examined log files and software code. The only conversation was an occasional "Is everyone still there?" to break the silence. Finally, Guy #3 said the magic words:
"I think I've found the problem."
He described his theory, and it made sense. On some of the systems, an operation was taking 15.2 seconds to complete. Another part of the system was only willing to wait for 15 seconds for the operation to complete, so it was giving up and attempting a retry of the operation. Thus, nothing ever got completed, just because of a 0.2-second overrun.
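The failure mode described above can be sketched in a few lines of Python. This is only an illustration, not the actual system's code: the function name, the `max_attempts` cap, and the 20-second value are all hypothetical, and the sketch assumes each retry starts the operation over from scratch.

```python
def attempts_until_success(operation_seconds, timeout_seconds, max_attempts=5):
    """Return the attempt number on which the operation completes,
    or None if every attempt times out.

    Assumptions (not from the real system): each retry restarts the
    operation from the beginning, and an operation that finishes exactly
    at the timeout counts as completed.
    """
    for attempt in range(1, max_attempts + 1):
        if operation_seconds <= timeout_seconds:
            return attempt
        # The caller gives up here and retries; the partial work is lost.
    return None


# A 15.2-second operation never fits inside a 15-second wait.
print(attempts_until_success(15.2, 15.0))

# With a longer (hypothetical) 20-second wait, the first attempt completes.
print(attempts_until_success(15.2, 20.0))
```

The point of the sketch is that the size of the overrun doesn't matter: 0.2 seconds over the limit fails just as completely as 20 seconds over, because every retry hits the same wall.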
A quick change was made to a configuration file to increase the amount of time the system would wait, the system was restarted, and everything worked. As we all said our thank-yous and goodbyes, I looked at the timer on my phone, which indicated the call had lasted four hours and fifty-seven minutes.
So that's how I spent my afternoon and early evening: holding a phone to my ear, feeling my butt getting numb, staring at a computer screen. But I didn't have to travel anywhere, and the problem turned out to be something that wasn't blamed on me, so it could have been worse.
Your call sounded grueling. I spend a bit of time on the phone myself, though nothing close to what you just endured, usually just hour-long teleconferences.
I eventually got myself a headset for my phone, and it really makes things easier. I used to only use it for teleconferences, but I find that I use it all the time now, as it keeps my hands free and I can do other things while I'm on the phone. (It's a whole lot better than a speaker phone.)
You probably already have one on order by now, though....
I complain of "death by conference call" when I'm on a call that lasts over one hour.
Something I didn't mention is that Guy #2 and Guy #3 are contractors, not employees of my company. They aren't always around when we need them, so as long as we had them connected, I think it was a good decision to keep everybody on the line (even though it was painful).
Honestly, we solved more problems in that five-hour call than we have in the past five months. That of course doesn't mean five-hour calls are a good thing; it just shows how bad things are in general.