Why and how you should test your software

This is a major rewrite second draft of a talk I'll be giving at PyCon 2017. Thanks to all those who commented on the first draft. Once again I would appreciate feedback, comments and contrasting points of view.

Why should you test your software? How should you test your software? Some people have easy answers to these questions.

Yolo programmers don't bother testing at all, happy to live in the moment. More serious programmers will tell you that you test your software in order to produce a Quality Product. To produce a Quality Product you must always write unit tests and integration tests and do QA. Neglect any of these and your code will fall into a bug-infested abyss.

While I'm much more sympathetic to this second view, I don't think it's a sufficient answer. Given how different software projects can be from each other it seems unlikely that one set of answers will fit everyone:

  • A mobile game is not the same as a medical device is not the same as an online store.
  • Every organization is different: a bootstrapped startup has very different resources than a massive multinational, and neither are very much like NASA.
  • Every project goes through different stages and phases. A product with no users is in a very different position than a product used by thousands of businesses.

What you need is not a single answer, but a way to choose the answers that match your situation and your needs. We'll start by considering the means you have available to you for testing. Then we'll consider why you would to test your software. Finally, we'll combine means and goals to see how you can choose to test your software.

What means of testing can you use?

As a starting point, let's consider the different means of testing available to you. Is the following code a test?

def test_add():
    assert add(2, 2) == 5

I would say that, yes, that's clearly a test. Says so right in the function name, even. The test proves that add() does what it ought to do: add two numbers and give us the result.

You've noticed, of course, that this test is wrong. Luckily our development process has another step: code review. You, dear reader, can act as a code reviewer and tell me that my code is wrong, that 2 + 2 is 4, not 5.

Is code review a form of testing? If you're trying to verify your code matches a specification, well, this is a test. You have a specification for arithmetic in your head ("2 + 2 = 4") and you are checking that the code follows it.

Let's consider code review as a form of testing, alongside automated unit tests. Even if they're both tests, they're also quite different. What is the core difference between them?

One form of testing is automated, the other is done by a human.

An automated test is consistent and repeatable. You can write this:

def test_add_twice():
    for i in range(10000000):
        assert add(i, i) == 2 * i

And the computer will run the exact same code every time. The code will make sure add() consistently returns that particular result for those particular inputs. A human would face some difficulties in manually verifying ten million different computations: boredom, distraction, errors, slowness.

On the other hand, a human can read this code and tell you it's buggy:

def add(a, b):
    return a + b + 1

Where the computer does what it's told, for good or for bad, a human can provide meaning. Only a human can tell what the software is for.

Now we can categorize tests by the means used: humans test for meaning, whereas automated tests ensure consistency.

Why should you test your software?

Next, let's consider goals.

The first possible goal of testing your software is to make sure it meets a specification. This goal is what most programmers think of when testing is discussed: it covers things like unit testing and manual testing by QA. And as we saw it also covers code reviews. Your software has certain required functionality, the specification, and you want to make sure it actually does so now and in the future.

Some of the requirements are high-level: an online store wants customers to be able to order a product they've added to their shopping cart. Other requirements are low-level implementation details, mostly of interest only to the programmers. You might want the verify_creditcard() to accept a credit card number as a string and throw a InvalidCreditCard exception if the credit card is invalid.

The sum of all these requirements is the specification. It might be written down in great detail, or it might be a notion in your head (e.g. "2 + 2 is 4"). Regardless, you test your software to make sure it does what it's supposed to.

Sometimes, however, testing can have a different goal. In his book "Lean Startup" Eric Ries talks about building software only to discover that no one actually wanted to use it. Spending much time testing to ensure your software meets the specification is a waste of time if no one will ever use your software.

Ries argues that you first need to figure out if a product will succeed, by testing out what he calls a "Minimum Viable Product" with potential users and customers. This is a very different form of testing: it's not about verifying whether your software meets the specification, it's about learning something you hadn't known before.

The second possible goal of testing your software is in order to gain knowledge. Let's look at another form of testing with this goal. "A/B testing" is a form of testing where you try out two variations and see which produces a better result. Perhaps you are testing a redesign of your website: you show the current design to 90% of your visitors, the new design to 10% of your visitors, and see which results in more signups for your product.

Notice that you have two specifications, and you've already implemented them both. The point of the test is to figure out which works better, to learn something new, not to verify if the implementation meets the specification.

Now we have an answer for why you should test: either to verify you meet the specification or to gain new knowledge.

How should you test your software?

Combine the two goals we've come up with (gaining knowledge and matching the specification) and the two means of testing (human and software) and you get four different forms of testing, each providing a more specific testing goal:

Goal: Knowledge
Means: Humans Understanding Users Understanding Runtime Behavior Means: Software
Correct Functionality Stable Functionality
Goal: Implement the specification

You must choose the appropriate ones to use for your particular needs and situation. Let's go through these four types of testing one by one and see when you should each.

Understanding Users

  • Will people buy your product?
  • Will a design change result in more signups?
  • Will users understand how your software works?

These are all questions that cannot be answered by comparing your software to a specification. Instead you need empirical knowledge: you need to observe what actual human beings do when presented with your software.

Relevant testing techniques include:

  • Usability tests.
  • Minimum viable product testing ("Lean Startup").
  • A/B testing.

Understanding Runtime Behavior

  • How does your software behave under load?
  • Does your software have race conditions?
  • Does your software break with unexpected inputs?

These questions can't always be answered by comparing your software to a specification. Once your software is complex enough you can't fully understand or predict how it will behave. You need to observe it actually running to understand its behavior.

Relevant testing techniques include:

  • Stress tests and soak tests.
  • Gathering tracebacks and exceptions from your production logs.

Correct Functionality

  • Does your software actually match the specification?
  • Does it do what it's supposed to?

It's tempting to say that automated tests can prove this, but remember the unit test that checked that 2 + 2 is 5. On a more fundamental level, software can technically match a specification and completely fail to achieve the goal of the specification. Only a human can understand the meaning of the specification and decide if the software matches it.

Relevant testing techniques include:

  • Manual user interface tests (e.g. a QA person using your website).
  • Code review.

Stable Functionality

  • Does your public API return the same results for the same inputs?
  • Is your code providing the guarantees it ought to provide?

Humans are not a good way to test this. Humans are pretty good at ignoring small changes: if a button changes from "Send Now" to "Send now" you might not even notice at all. In contrast, software will break if your API changes from sendNow() to send_now(), or if the return type changes subtly.

This means a public API, an API that other software relies on, needs stability in order to be correct. Writing automated tests for private APIs, or code that is changing rapidly, will result in high maintenance costs as you continuously update your tests.

Relevant testing techniques include:

  • Unit tests, integration tests, and other similar automated API tests.
  • Automated user interface tests (e.g. Selenium tests for websites).
  • Compiler checks and linters.

Applying the model

So what is all this good for?

Choosing how to test

First, our initial purpose: the model can help you choose what form of testing to do based on your goals.

Consider a startup building a product that it's not sure anyone wants. Writing automated tests may prove a waste of time, since it focuses on implementing a specification before figuring out what users actually want.

One possible alternative the Lean Startup methodology, which focuses on experiments or tests whose goal is finding what product will meet customers' needs. That means focusing on the Understanding Users quadrant. Once a product is chosen you spending the time to have more than the most minimal of specifications, at which point much more resources are applied to Correct Functionality and Stable functionality.

Recognizing when you've chosen the wrong type of testing

Second, the model can help you change course if you're using the wrong type of testing. For example, consider a hypothetical startup that writes tax preparation software (the details are inspired by a real company.) They wrote Selenium tests for their web UI at the same time were rapidly making significant changes to their UI.

Even with the tests their application was still buggy, and every time they changed the UI the tests would break. The tests didn't seem to improve quality, and they wasted developer time maintaining them. What were they doing wrong?

Their problem was that their system really has two parts:

  1. The tax engine, which is fairly stable: the tax code changes only once per year. Errors in the tax engine are a significant problem for users, and incompatible API changes are a problem for developers. This suggests the need for Stable Functionality tests, e.g. unit tests talking directly to the tax calculation engine. Correct Functionality could be ensured by code review and feedback from a tax accountant.
  2. The web-based user interface. The UI was changing constantly, which suggests Stable Functionality wasn't a goal yet. Correct Functionality was still a goal, so the UI should have been tested manually by humans (e.g. the programmers as they wrote the code.)

A basis for discussing testing

Finally, the model provides a shared terminology that can help you discuss testing in its broadest sense and its many differing goals.

  • Instead of getting into fruitless arguments about whether manual testing is a good idea or unit testing is a good idea, you can start with a model that shows very clearly the differences between them.
  • You can also discuss testing with other parts of the company (e.g. marketing) that are starting from a very different point of view.

Summary

Why should you test your software? Either to gain knowledge or to implement a specification.

How can you test your software? With humans or with software.

How should you test your software? Depending on your particular situation, choose the relevant forms of testing for understanding users, understanding runtime behavior, stable functionality or consistent functionality.

Questions or suggestions? Add a comment to the Hacker News or Reddit discussions linked below.


Broken software, bad job offers: you can learn from two decades of my mistakes. Join 1500 other programmers and learn how to avoid a new mistake every week.