Thursday, July 14, 2005
It Must Be a Software Problem
"Kris, your coin acceptors are not working. Did you bother to test this stuff at all before releasing it?"
That was, in essence, the first e-mail that greeted me at work today. The background: I am working on the software for a vending machine. Earlier versions of the vending machine only took paper bills, but we are going to be rolling this product out to Europe, so the machine has to be able to take Euro coins. A "coin acceptor" is the device in a vending machine that has the slot where the customer inserts money. The coin acceptor either "accepts" the coin, meaning it drops into a bin inside the machine and the machine bumps up the customer's credit, or the coin acceptor "rejects" the coin, meaning it falls back out the return slot. The CPU in the machine communicates with the coin acceptor via a serial line to tell the acceptor which coins to accept, and to find out when coins have been accepted. We recently released a version of the vending-machine software that makes the coin acceptor work, but the testers couldn't get the coin acceptor to accept any coins.
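For readers who think better in code, here is a minimal sketch of the accept/reject behavior described above. It is a toy model, not the real software: the class names, the `enable` and `insert` methods, and the coin values in cents are all made up for illustration, and the real serial protocol is not shown.

```python
from dataclasses import dataclass, field

# Hypothetical coin values in euro cents, chosen only for this illustration.
EURO_COINS_CENTS = {1, 2, 5, 10, 20, 50, 100, 200}


@dataclass
class CoinAcceptor:
    """Toy stand-in for the device at the other end of the serial line."""
    enabled: set = field(default_factory=set)

    def enable(self, coins):
        # The CPU tells the acceptor which coins to accept.
        self.enabled = set(coins)

    def insert(self, coin_cents):
        # True: the coin drops into the bin. False: it falls out the return slot.
        return coin_cents in self.enabled


@dataclass
class VendingMachine:
    acceptor: CoinAcceptor
    credit_cents: int = 0

    def on_coin(self, coin_cents):
        # Bump the customer's credit only when the acceptor accepts the coin.
        accepted = self.acceptor.insert(coin_cents)
        if accepted:
            self.credit_cents += coin_cents
        return accepted


acceptor = CoinAcceptor()
acceptor.enable(EURO_COINS_CENTS)
machine = VendingMachine(acceptor)
machine.on_coin(50)    # accepted
machine.on_coin(200)   # accepted
machine.on_coin(25)    # not an enabled coin, so it is rejected
print(machine.credit_cents)  # 250
```

Note that nothing in a model like this can explain a dark power light, which is where the rest of this story comes in.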
The coin acceptor does work on the machine I used while developing the software, and I tested the coin-acceptor operation extensively and successfully on that machine. However, the coin acceptors were not working on two recently delivered machines, one of which went to the testers and the other to another software developer. We discovered this problem on Tuesday, took a quick look at the hardware, and noticed that on my (working) machine, an LED on the coin acceptor lit up when the power was turned on, and other LEDs flickered when communications were taking place, whereas on the other (non-working) machines, the LEDs never lit up at all. I concluded that the coin acceptors probably weren't getting power, and asked the hardware people to verify that everything was hooked up correctly.
Problems like this are common when we have prototypes of new hardware, so it doesn't bother me much when the hardware doesn't work. But yesterday, without even looking at the machines, the hardware people replied "It must be a software problem." So that's why the ball was back in my court this morning.
"It must be a software problem" is something we hear a lot from the hardware group. Similarly, we often hear "it must be a client-side software problem" from the server-side software development group. Both groups are usually wrong when they make those claims. That's not because the client-side software developers never make mistakes; it's because all problem reports come to us first, and we check things very thoroughly before pointing the finger in another direction. So, when we get this kind of response, it really pisses us off.
I knew I'd have one of those days where I spend several hours proving to somebody else that it is their stuff that's wrong, not mine. I've gotten very good at gathering evidence and presenting a case; I feel like a criminal prosecutor at times like this. I don't like participating in this adversarial system, but it's the only mechanism that seems to work.
To me, it seems obvious that if the power light doesn't turn on, then the power isn't connected. I don't know of any software I can write that will spontaneously produce a flow of electricity through an external electronic device, so there wasn't much I could do as a software developer. I decided to start by firing off an e-mail describing the problem in simple terms and asking for a reasonable amount of help: "Dear hardware guys, I am obviously a clueless ninny, so could you please provide me with any technical contact information you have for the coin acceptor vendor? Also, because the power light does not turn on when we turn on the power, could you send somebody over to check the wiring?"
From past experience with the hardware people, I didn't expect any help for a while (they work in another building), so I started taking apart the machines. I figured I would swap the coin acceptor from the "good" machine with the one in the "bad" machine. That should provide a clue as to whether the problem was in the coin acceptor or somewhere else in the machine. So I grabbed my Phillips-head screwdriver, my narrow slot-head screwdriver, and my needle-nose pliers, and started unscrewing and unplugging things.
I was pleasantly surprised when a couple of the hardware guys walked in as I was almost finished removing one of the coin acceptors. They hooked up a multimeter to the "bad" machine, and found 24 volts of power across the pins, as expected. They also checked the serial data lines, and found them to be wired correctly as well. They asked me a couple of times whether I was sure I had the right software loaded, and I assured them that the software in the "bad" machine was identical to that in the "good" machine. I pointed out the unlit power light a couple of times, but they didn't seem interested. I sympathized with them—it really did look like everything was hooked up correctly.
So we walked over to the "good" machine, and they ran the same checks that they had run on the other machine. This was an important step, because after turning the machine on and off a few times, they noticed that the LED turned on and off along with the power. They started accepting the possibility that it might be a hardware problem. They decided that swapping the coin acceptors might be a good idea.
So we did some swapping and testing of various things, and the result was a unanimous conclusion that the "bad" machine had a "bad" coin acceptor. The power light just wouldn't turn on, no matter what they tried. But this raised a new mystery: there was another "bad" machine in the QA lab. One bad coin acceptor might be just bad luck, but two bad coin acceptors out of three seemed unlikely. So we took a long walk upstairs to the QA lab.
The third machine had the same dark power light that the other bad machine did. While they were checking things, I noticed something marked "FUSE" on the coin acceptor's circuit board. I didn't expect this to be the problem, but I asked whether it made sense to check the fuses to see whether they were good.
They checked, and whaddya know, the two bad machines' coin acceptors had blown fuses! So, OK, we just need to replace the fuses. Yeah, it's weird that two fuses would blow, but we'll just have to try and see what happens, right?
A few hours later, they came back with two replacement fuses, taken from some other coin acceptors that were in the warehouse. We replace the blown fuse in a coin acceptor, and turn on the power. The power light goes on. Hooray! Then, the light gets dim, and then goes dark. We check the fuse, and it's blown.
OK, there must be something wrong with this coin acceptor, causing it to blow fuses, right? Well, the thing is that this coin acceptor is the one that was originally the "good" coin acceptor. We had put its fuse in the "bad" coin acceptor, and that coin acceptor suddenly became a "good" coin acceptor. We've also tried two different machine cabinets, so we're pretty sure there's nothing wrong with the wiring in the cabinets. So we seem to have one magic fuse that makes any coin acceptor work, and three blown fuses.
The hardware guy puzzled over this for a while, said "Oh, what the hell?" and then put the second replacement fuse into a coin acceptor and turned it on. Again, we see light for a couple of seconds, and then darkness. So that's one good fuse and four blown fuses.
All of the fuses are rated for the same current. Did the vendor get a batch of bad fuses? Is the one "good" fuse too tolerant? Are we going to get these things fixed before our shipment deadline?
Nobody knows the answer yet, but I'm pretty sure it's not a software problem.
And for the record, whenever something mechanical goes wrong on any computerized equipment in my home or at work, I automatically blame the hardware. Unless it's on a public computer at the library -- then I blame the patron!