Monday, July 18, 2005

 

It Still Must Be a Software Problem

At the end of our last episode ("It Must Be a Software Problem"), we had machines with coin acceptors with blown fuses, and a promise from the hardware people to get to the bottom of that problem. When I arrived at work this morning, I found that all three test machines had coin acceptors with lit LEDs. Nobody told me what happened, so I assumed that magical elves had fixed everything while I was away. I wanted to test the machines to verify that everything worked, but I got dragged into a few meetings.

Around 4:30, the project manager stopped by and let me know that the machine up in the QA lab was not accepting coins. He first talked to the hardware guy, who assured him the hardware was in working order. Would I take a look?

I trudge upstairs to the QA lab and talk to the QA guy. He says that coins aren't being accepted. So I put a coin in, and guess what? The machine swallows it up. The coin has been "accepted" and has fallen into the coin bin inside the machine, just like it's supposed to do. So what's the problem? Well, the machine is "accepting" the coins, but it is not crediting the customer with the money. The coin just disappears, and the display still shows "$ 0.00" (actually, it shows "€ 0.00" instead of dollars, but some browsers still can't display the Euro symbol, and I don't want to answer questions about "that funny-looking 'e' thing").

I know that this machine has been through a lot over the past couple of days, so I decide to reboot it. After rebooting, coins still get swallowed without credit. Grumbling, I walk toward the stairs to see if I can get my development machine to exhibit the same behavior, when the QA guy stops me. The machine is now giving credit! We run a few coins through, and they are all working now.

It's nice that it's working, but I need to figure out why it takes a while before the coin acceptor starts working. I go back to my cubicle and fire up my development machine. It is not accepting any coins; they just keep falling out the coin return. I reboot a few times, but still no good.

I go to the third machine. It accepts coins and provides credit the first time I try. It works perfectly.

So now, before I can work on the main problem (figuring out why the QA machine is doing what it's doing), I first have to fix my development machine. I immediately suspect a hardware problem, as this machine worked fine before the magical hardware elves "fixed" it, but I have to put on my prosecutor-gathering-evidence hat. Besides, the hardware guys all go home at 4 o'clock, so they won't be around to help me anyway.

The first thing I do is a casual check to verify that all the wires are hooked up. It looks like they are, although some of them go deep into the machine where I can't really see them without disassembly. So I modify my software so that it reports in detail all the communications between the main processor and the coin acceptor. I find that the processor is sending messages out, but is getting no replies back. I look at the coin acceptor, and its little LEDs are blinking when messages go back and forth, so it looks like the coin acceptor thinks it is sending replies back to the processor.
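For readers who want to picture that kind of traffic trace, here is a minimal sketch in Python. It is an illustration only, not the code from the machine: the class names (TracingLink, DeadReturnLine, EchoLine), the trace_poll helper, and the two-byte poll message are all hypothetical, and the real protocol is not shown in this post.

```python
class TracingLink:
    """Wraps a byte transport and records every message sent and received."""

    def __init__(self, transport, log):
        self.transport = transport  # anything with write(bytes) and read(n)
        self.log = log              # list that collects the trace lines

    def send(self, payload: bytes) -> None:
        self.log.append("TX " + payload.hex())
        self.transport.write(payload)

    def receive(self, size: int) -> bytes:
        reply = self.transport.read(size)
        # An empty read means no reply arrived (for example, a timeout).
        self.log.append("RX " + (reply.hex() if reply else "<no reply>"))
        return reply


class DeadReturnLine:
    """Hypothetical transport whose return wire is disconnected."""

    def write(self, payload: bytes) -> int:
        return len(payload)  # outgoing bytes leave without complaint

    def read(self, size: int) -> bytes:
        return b""           # nothing ever comes back


class EchoLine:
    """Hypothetical healthy transport that answers by echoing the request."""

    def __init__(self):
        self.buffer = b""

    def write(self, payload: bytes) -> int:
        self.buffer += payload
        return len(payload)

    def read(self, size: int) -> bytes:
        reply, self.buffer = self.buffer[:size], self.buffer[size:]
        return reply


def trace_poll(transport):
    """Send one made-up poll message and return the resulting trace."""
    log = []
    link = TracingLink(transport, log)
    link.send(b"\x01\x02")  # placeholder bytes, not a real coin acceptor command
    link.receive(2)
    return log


if __name__ == "__main__":
    for line in trace_poll(DeadReturnLine()):
        print(line)
```

With the dead return line, the trace shows a TX entry followed by a "no reply" RX entry, which is the one-way pattern described above. Running the same trace against a link known to be good is what separates a software fault from a wiring fault.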

So, either the signal isn't getting back to the processor, or I've done something dumb with the software that is hiding the replies. I rebuild the software, install it on the third machine, and test there. On that machine, I see two-way communications. Conclusion: the software is good; the problem is with the return communications link.

I wanted to avoid disassembling the machine, but now that's the only choice. I take things apart, occasionally running back and forth between my machine and the working machine for comparison. After several minutes, I find the problem. The I/O cable consists of three wires, and two of those wires have come loose from the connector on one end. And unfortunately, there are seven possible holes where these two loose wires could go, so I have to run back and forth a couple of times to see how the other machine is hooked up.

I put everything back together, and now my machine works. So that problem is solved, but I still have the original problem (why doesn't the QA machine work?). I decide to call it a day and look at the remaining problem tomorrow. My final act is to send the hardware guys a friendly e-mail requesting that they provide a more solid cable so that these wires won't get loose again. (I don't expect them to provide one, but I want the issue documented.)

Today I spent about two and a half hours tracking down some loose wires.

People ask me why I never get anything done.

I still don't think it's a software problem.


Comments:
It seems to me that you're doing the hardware guy's job.

If they know they can dump it on you and say 'it's a software problem' and you'll end up troubleshooting the hardware problems for them, then they win. After all, they get to go home at 4PM.

In a perfect world the hardware team would be ashamed of their lack of skill, but I'm sure they're high-fiving themselves at the watering hole, saying things like "I love this company, the software guys fix all our f-ups" and "gee, I love it that we get to leave on time" and "What's that funny-looking E on the display anyway??"

(Yes, we have our share of dismal work dodgers in my office too, but I can't really vent at them, so let me vent at yours by proxy.)
 
It can't all be blamed on the hardware people, some of whom are very competent and hard-working. The head of the hardware department has always wanted his own team of programmers to take care of developing device drivers, diagnostic utilities, and related software before passing everything over to the application developers, but he isn't allowed to do that. All he can do is procure and assemble the hardware, and then wait for us to tell him what's wrong with it.

But I can put full blame on them for the shoddy workmanship and cheap, unreliable components they are always giving us, and for their attitude. Every time we find something that doesn't work, it's somehow our fault.
 
I'm sure you keep software change logs... do they not do the same for hardware? It seems like this is a left-hand/right-hand problem: you had no idea what they had just done, so you had to start over with the troubleshooting. Crossing those gaps between the real world and software is always a messy process, and it does not sound like you are getting much help.
 
 