Wednesday, July 27, 2005
OK, Now We Really Do Have Software Problems
I'm sure readers are on the edges of their seats waiting for more information on my coin acceptors, so here's the latest.
The hardware people did some diagnosis on the failing units, trying to determine why the fuses were blowing on the coin acceptors. It turns out that when the device is powered up, there is a capacitor that initially draws a lot of current. Our one "magic fuse" that doesn't blow is a slow-blow fuse, meaning that the current needs to exceed the fuse's rating for a "long time" (several milliseconds) before it will blow. For some reason, the vendor installed quick-blow fuses in this batch of units, so that initial inrush of high current caused them to blow immediately.
The hardware guys sent their findings to the vendor. The vendor sent us a replacement circuit board that is fuseless. I was asked to test this new board, but unfortunately, its power connector is different from the old one, so I can't just plug it into my machine in place of the old board. So I'm waiting for the hardware guys to give me new wiring.
With the hardware issues off my plate, I finally had time to look into the issues that might be software-related. There were three problems found during testing:
- The coin acceptor won't accept any coins until the machine has been on for "a while" (about a minute).
- The first coin that is accepted after power-up will not be credited to the customer. (That is, the machine takes the money but doesn't let the customer buy anything.) The second coin and subsequent coins are credited properly.
- Even when it is accepting coins, it is not very reliable. A customer often has to insert a valid coin over and over again before the machine will finally take it.
So, it's my job to figure out what's wrong. The code that deals with the coin acceptor was written by another programmer, who is no longer with the company. A few weeks ago, I was assigned to figure out how the stuff works and to fix whatever is wrong with it. I got it basically working, but now I need to delve deeper to figure out these newly uncovered problems.
Reading code written by someone else can be challenging. Programmers write software for two "audiences:" the computer and other programmers. Obviously, the code must make sense to the computer, or it won't work and is therefore useless. However, it is also important to write the code such that other programmers can read it, so that those other programmers can fix bugs or add new features. What makes sense to one programmer can be complete gibberish to another. With complex code, it's necessary to get into the other guy's head to really understand it. It is often easier for a programmer to completely rewrite a program from scratch than it is to figure out how someone else's existing program works.
Fortunately, the guy who initially wrote this stuff wasn't a bad programmer, and it wasn't too hard to figure out how everything worked. I found a solution to issue #1 (no coins accepted for a while) pretty quickly: at startup, the machine is doing a lot of things, and the coin-acceptor initialization code didn't run until a lot of those other things finished. I changed it so that it wouldn't wait for those other things.
Issue #2 (first accepted coin is ignored) took a little more time. It turns out that after the coin acceptor is powered up, the first query for its status from the machine will result in a different kind of response than the typical response. The machine's software also had a special way of dealing with the first response, and the interaction of these two types of special startup handling caused the first coin-acceptance event to be ignored. The original programmer never noticed this during development because we don't power down the machines very often, and so the special coin-acceptor power-up behavior was rarely seen.
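To make that kind of interaction concrete, here is a minimal sketch of how two startup special cases can combine to swallow one coin. Everything in it is an assumption made up for illustration: the `FakeAcceptor` and `Host` classes, the reply format, and the host's "treat the first events as stale" rule are hypothetical, not the real protocol or the original code.

```python
class FakeAcceptor:
    """Hypothetical device: its first reply after power-up is a special one."""

    def __init__(self):
        self.first_poll = True
        self.events = []  # values of all coins accepted since power-up

    def insert(self, value):
        self.events.append(value)

    def poll(self):
        if self.first_poll:
            self.first_poll = False
            return {"kind": "power_up"}  # device-side special case
        return {"kind": "status", "events": list(self.events)}


class Host:
    """Hypothetical machine-side handler with its own startup special case."""

    def __init__(self, sync_on_power_up):
        self.sync_on_power_up = sync_on_power_up
        self.synced = False
        self.seen = 0    # number of events already processed
        self.credit = 0

    def handle(self, reply):
        if reply["kind"] == "power_up":
            if self.sync_on_power_up:
                # Fixed behavior: a freshly powered device has no stale events.
                self.synced = True
            return
        events = reply["events"]
        if not self.synced:
            # Host-side special case: assume the first events seen are
            # leftovers from before the host started, and discard them.
            if events:
                self.seen = len(events)
                self.synced = True
            return
        self.credit += sum(events[self.seen:])
        self.seen = len(events)


def run(sync_on_power_up):
    dev, host = FakeAcceptor(), Host(sync_on_power_up)
    host.handle(dev.poll())  # special power-up reply
    host.handle(dev.poll())  # ordinary idle status
    for coin in (25, 25):
        dev.insert(coin)
        host.handle(dev.poll())
    return host.credit


print(run(sync_on_power_up=False))  # first coin is swallowed
print(run(sync_on_power_up=True))   # both coins are credited
```

Each special case looks harmless on its own; the lost coin only shows up when the device has just been powered up, which matches why a bug like this is easy to miss during development.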
The solution to issue #3 (valid coins not accepted reliably) took some experimentation. My initial guesses were that (a) the machine was not instructing the coin acceptor to accept coins properly, (b) the machine was not querying status fast enough, or (c) the coin acceptor's default security level was too high. I'll explain each of these.
When a coin acceptor is initially turned on, it will reject all coins. The machine must send it a series of commands to enable acceptance of coins. I was able to eliminate possibility (a) by observing the commands being sent to the coin acceptor at startup and verifying that they were correct.
After the coin acceptor is commanded to accept coins, the machine must query the coin acceptor at least once per second to check whether coins have been inserted. If this query does not happen at least once per second, the coin acceptor will assume that the machine is no longer working properly, and will stop accepting any coins. I was able to eliminate possibility (b) by observing the queries, which were reliably happening three or four times per second.
So that left possibility (c). Coin acceptors provide a set of security options that determine how strict they are in determining whether an inserted coin is valid. They can be set at lower security levels, which accept coins readily, but make it more likely that counterfeit coins can slip through, or they can be set at higher security levels, which filter out the counterfeits but also reject valid coins that aren't quite perfect. High security levels can be frustrating for customers, but low security levels get us in trouble when the retailers find counterfeit coins in the machine at the end of the day.
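Going back to possibility (b) for a moment, the once-per-second polling rule can be sketched with a simulated clock. The `FakeAcceptor` class and its method names are hypothetical; only the one-second limit comes from the description above.

```python
class FakeAcceptor:
    """Hypothetical stand-in for the device, using simulated milliseconds."""

    POLL_TIMEOUT_MS = 1000  # a query is needed at least once per second

    def __init__(self):
        self.accepting = False
        self.last_poll_ms = 0

    def enable(self, now_ms):
        self.accepting = True
        self.last_poll_ms = now_ms

    def _expire_if_starved(self, now_ms):
        # If the host has gone quiet for too long, stop taking coins.
        if self.accepting and now_ms - self.last_poll_ms > self.POLL_TIMEOUT_MS:
            self.accepting = False

    def poll(self, now_ms):
        self._expire_if_starved(now_ms)
        self.last_poll_ms = now_ms
        return self.accepting

    def would_accept_coin(self, now_ms):
        self._expire_if_starved(now_ms)
        return self.accepting


def still_accepting(poll_interval_ms, duration_ms=5000):
    """Poll at a fixed interval, then report whether a coin would be taken."""
    dev = FakeAcceptor()
    dev.enable(0)
    for now in range(poll_interval_ms, duration_ms + 1, poll_interval_ms):
        dev.poll(now)
    return dev.would_accept_coin(duration_ms)


print(still_accepting(300))   # about three polls per second
print(still_accepting(1500))  # too slow: the device gives up
```

In this simplified model the device never recovers once it stops accepting; a real unit would presumably need to be re-enabled by the machine.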
I wrote code to send the "query security level" and "modify security level" messages to the coin acceptor, and experimented with some of the settings. Setting a low security level did indeed make the coin acceptor accept coins more readily, which was good, because otherwise possibility (c) wouldn't have been the problem and I'd have had to stay late trying to imagine more possibilities.
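The trade-off between security levels can be pictured as a tolerance window around a coin's expected measurements. This is only a toy model: the level names, the tolerance numbers, and the single "deviation" value are all invented for illustration, and a real coin acceptor measures coins in ways this sketch does not attempt to describe.

```python
# Hypothetical tolerances: how far a coin's measured "signature" may
# deviate from the nominal value and still be accepted.
TOLERANCE_BY_LEVEL = {"low": 0.10, "medium": 0.05, "high": 0.02}


def accepts(level, deviation):
    """Return True if a coin with this deviation passes at this level."""
    return abs(deviation) <= TOLERANCE_BY_LEVEL[level]


worn_valid_coin = 0.04   # made-up value: a real coin that is a bit worn
good_counterfeit = 0.08  # made-up value: a slug that is close, but not that close

for level in ("low", "medium", "high"):
    print(level, accepts(level, worn_valid_coin), accepts(level, good_counterfeit))
```

With these made-up numbers, the low level takes both coins, the high level rejects both, and only the middle setting separates them, which is why picking one level for every machine is not obvious.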
But now what should I do? I could have the machine set a particular security level that worked well for me, but that security level might not be appropriate for other machines. I could read the desired security level out of a configuration file, which makes it possible for onsite technicians to tweak things as needed, but techs don't like futzing with configuration files. I could also add some nifty user-interface screens allowing a technician to adjust the security settings by clicking some buttons, but that would take a day or two of work, and this project is already behind schedule.
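For what it's worth, the configuration-file option could be as small as the sketch below. The section name, key name, default, and valid range are all hypothetical; the idea is just that a missing or bad setting falls back to a known default, so techs only touch the file when they need to.

```python
import configparser

DEFAULT_LEVEL = 2           # hypothetical default
VALID_LEVELS = range(1, 4)  # hypothetical: 1 = lenient ... 3 = strict


def load_security_level(config_text):
    """Read the desired security level, falling back to the default."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    try:
        level = parser.getint(
            "coin_acceptor", "security_level", fallback=DEFAULT_LEVEL
        )
    except ValueError:
        # The value is present but is not an integer.
        return DEFAULT_LEVEL
    return level if level in VALID_LEVELS else DEFAULT_LEVEL


print(load_security_level("[coin_acceptor]\nsecurity_level = 1\n"))
print(load_security_level(""))  # nothing configured: use the default
```

The same function could sit behind the button-driven screens later, since those would only need to rewrite the one value in the file.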
I decided to not decide. I sent an e-mail to the higher-ups presenting the options, asking them what they wanted. On Thursday, I'll see what the answer is.