Debugging software by eliminating the impossible
How many times have you seen software exhibiting completely impossible results? In theory software is completely deterministic, but in practice it often seems capriciously demonic. But all is not lost: the detection methods of Sherlock Holmes can help you discover the hidden order beneath the chaos.
Sherlock Holmes famously stated that “once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.” And what is true for (fictional) crimes is also true for software. The basic process you follow to find a problem is:
- Come up with a list of potential causes.
- Test each potential cause in isolation, ruling them out one by one.
- Whatever cause you can’t rule out is the likely cause, even if it seems improbable.
To see how this might work in practice, here’s a bug my colleague Phil and I encountered over at my day job, back when I was working on building a microservices architecture.
The Case of the Missing Stats
I was working on a client library, Phil was working on the server. Phil was testing out a new feature where the client would send messages to the server, containing certain statistics. When he ran the client the server did get messages, but the messages only ever had empty lists of stats.
Someone had kidnapped the stats, and we had to find them.
Phil was using the following components, each of which was a potential suspect:
- A local server with in-development code.
- Python 3.4.
- The latest version of the client.
- The latest version of the test script.
Eliminating the impossible
Our next step was to isolate each possible cause and falsify it.
Theory #1: the client was broken
The client code had never been used with a real server; perhaps it was buggy? I checked to see if there were unit tests, and there were some checking for existence of stats. Maybe the unit tests were broken though.
We ran the client with Python 3.5 on my computer using the same test script Phil had used and recorded traffic to the production server. Python 3.5 and 3.4 are similar enough that it seemed OK to change that variable at the same time.
The messages sent to the server did include the expected stats. The client was apparently not the problem, nor was the test script.
Theory #2: Python version
We tried Python 2.7, just for kicks; stats were still there.
Theory #3: Phil’s computer
Maybe Phil’s computer was cursed? Phil gave me an SSH login to his computer, I set up a new environment and ran the client against the production server using the Python 3.4 on his computer.
Once again we saw stats being sent.
Theory #4: the server was broken
You may have noticed that so far we were testing against the production server, and Phil had been testing against his in-development server. The server seemed an unlikely cause, however: the client unilaterally sent messages to the server, so the server version shouldn’t have mattered.
However, having eliminated all other causes, that was the next thing to check. We ran the client against Phil’s in-development server… and suddenly the stats were missing from the client transmission logs.
We had found the kidnapper. Now we needed to figure out how the crime had been committed.
Recreating the crime
So far we’d assumed that when the client talked to the dev server the messages did not include stats. Now that we could reproduce the problem we noticed that it wasn’t that the messages didn’t include stats; rather, we were sending fewer messages.
Messages with stats were failing to be sent. A quick check of the logs indicated an encoding error: we were failing to encode messages that had stats, so they were never sent. (We should have checked the logs much much earlier in the process, as it turns out.)
Reading the code suggested the problem: the in-development server was feeding the client bogus data earlier on. When the client tried to send a message to the server that included stats it needed to use some of that bogus data, and it failed to encode and the message got dropped. If the client sent a message to the server with an empty list of stats the bogus data was not needed, so encoding and sending succeeded.
The server turned out to be the culprit after all, even though it seemed to be the most improbable cause at the outset. Or at least, the first order culprit; a root-cause analysis suggested that some problems in our protocol design were the real cause.
You too can be a scientific software detective
Our debugging process could have been better: we didn’t really check only one change at a time, and we neglected the obvious step of checking the logs. But the basic process worked:
- Isolate a possible cause.
- Falsify it, demonstrating it can’t be the real cause.
- Repeat until only one cause is left.
Got an impossible bug? Put on your imaginary detective hat, stick an imaginary detective pipe in your mouth, and catch that culprit.