Make Tests Independent

Thursday, April 30, 2020

I changed a piece of code and 75 tests broke…

It was a simple feature. I just added a GDPR approval checkbox when creating a new account. And suddenly half our tests broke. Tests for reporting, tests for core features, all kinds of tests for all kinds of things. Because every one of them needed to create an account.

So I added a default to the test mocks. I updated lots of expecteds. I added a startup step to some base classes. And the tests all passed again. And then there was a bug in production.

It turns out that 3 of those tests were failing because of an actual bug. My GDPR change interacted poorly with reporting, but I didn’t realize that. My tests told me, but I couldn’t see that.

I need my tests to only fail when they really should fail. But I still want everything in the system to be tested. How do I do that?

Squeeze your tests…

Our goal is to make our tests entirely independent. Any bug or change would cause exactly one test to fail. The problem is that lots of our tests are over-ripe: they execute a bunch of code that they don’t verify as a way to get to the verification they want to do. We call this extra testing the juice.

To get there, we squeeze the extra juice out of each test. We make it narrower so that it has only one reason to fail: the one that was intended when it was written. We still want to verify the juice, so we create tests for just the extra juice. The ultimate goal is that each test has only one reason to fail, meaning future changes will break only one test.
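Here is a minimal sketch of what squeezing can look like, using the account story from the top of this post. Everything in it is hypothetical and only for illustration: `Account`, `create_account`, and `report_line` are made-up stand-ins, not code from a real system. The over-ripe test runs account creation just to reach reporting. After squeezing, the juice (account creation) has its own tests, and the reporting test builds its input directly.

```python
import unittest
from dataclasses import dataclass


@dataclass
class Account:
    # Hypothetical domain object, assumed for this sketch.
    email: str
    gdpr_approved: bool


def create_account(email, gdpr_approved):
    """Hypothetical account creation: the 'juice' that many tests run."""
    if not gdpr_approved:
        raise ValueError("GDPR approval is required")
    return Account(email=email, gdpr_approved=gdpr_approved)


def report_line(account):
    """Hypothetical reporting feature: what the report test is really about."""
    return f"{account.email}: active"


class OverRipeReportTest(unittest.TestCase):
    def test_report_line(self):
        # Over-ripe: executes create_account without verifying it,
        # so any change to account creation can break this test.
        account = create_account("a@example.com", gdpr_approved=True)
        self.assertEqual(report_line(account), "a@example.com: active")


class CreateAccountJuiceTest(unittest.TestCase):
    # Juice tests: verify account creation once, on its own.
    def test_requires_gdpr_approval(self):
        with self.assertRaises(ValueError):
            create_account("a@example.com", gdpr_approved=False)

    def test_stores_fields(self):
        account = create_account("a@example.com", gdpr_approved=True)
        self.assertEqual(account.email, "a@example.com")
        self.assertTrue(account.gdpr_approved)


class SqueezedReportTest(unittest.TestCase):
    def test_report_line(self):
        # Less ripe: builds the input directly, so only a reporting
        # change gives this test a reason to fail.
        account = Account(email="a@example.com", gdpr_approved=True)
        self.assertEqual(report_line(account), "a@example.com: active")
```

In a real cleanup, `SqueezedReportTest` would replace `OverRipeReportTest`; both appear here only so the before and after sit side by side. Saved as a module, the sketch runs with `python -m unittest`.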

Access the recipe for squeezing juice out of over-ripe tests as well as other recipes coming in the future!

Problem sound familiar but not happening right now?

You can also identify the code and tests that are likely to become over-ripe on future stories. The juicing recipe also contains instructions for detecting the problem and fixing it before it bites you.

Tests have fewer false positives…

It’s better:
You have narrowed your over-ripe test by extracting one extra piece of common code, or in other words, one extra juicy bit. Now you are executing those juicy bits far fewer times (you usually find a lot of exact duplicates), so your tests run faster. And each one will fail for fewer reasons.

It’s also no worse:
You have created a new test for what you extracted that verifies exactly what had been verified before. This means you don’t lose any coverage.

Benefits:

Test suite runs faster.
Stories get done faster because you fix fewer false failures.
Ship fewer bugs because you listen to your test failures.

Downsides:

The setup in ripe tests can get out of sync with the assertions in your new juice test. This can result in integration bugs that make it past your test suite.

Solving the downside (optional)

First, make sure it’s an actual problem in this case. Most code used in setting up many tests changes rarely, so it may be cheaper to keep the risk than to fix it.

If it is a problem, then 1) apply Extract Method to the setup statements in the ripe test, 2) apply Extract Method to the assertions in the juice test, and 3) create a new test that calls the extracted setup method and then the extracted assertion method. This verifies that your ripe setup stays in sync with your juice tests.
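The three steps might look roughly like this. It is a sketch under assumptions, not a prescribed layout: `Account`, `create_account`, `report_line`, `ripe_setup_account`, and `assert_valid_new_account` are all hypothetical names invented for the example.

```python
import unittest
from dataclasses import dataclass


@dataclass
class Account:
    # Hypothetical domain object, assumed for this sketch.
    email: str
    gdpr_approved: bool


def create_account(email, gdpr_approved):
    """Hypothetical production code covered by the juice test."""
    if not gdpr_approved:
        raise ValueError("GDPR approval is required")
    return Account(email=email, gdpr_approved=gdpr_approved)


def report_line(account):
    """Hypothetical feature covered by the ripe test."""
    return f"{account.email}: active"


def ripe_setup_account():
    """Step 1: setup statements extracted from the ripe test."""
    return Account(email="a@example.com", gdpr_approved=True)


def assert_valid_new_account(test, account):
    """Step 2: assertions extracted from the juice test."""
    test.assertIn("@", account.email)
    test.assertTrue(account.gdpr_approved)


class CreateAccountJuiceTest(unittest.TestCase):
    def test_create_account(self):
        account = create_account("a@example.com", gdpr_approved=True)
        assert_valid_new_account(self, account)


class RipeReportTest(unittest.TestCase):
    def test_report_line(self):
        account = ripe_setup_account()
        self.assertEqual(report_line(account), "a@example.com: active")


class SetupStaysInSyncTest(unittest.TestCase):
    def test_ripe_setup_satisfies_juice_assertions(self):
        # Step 3: the hand-built setup must pass the same assertions
        # that the juice test applies to a really created account.
        assert_valid_new_account(self, ripe_setup_account())
```

If someone later adds a rule to `assert_valid_new_account` that the hand-built setup does not satisfy, `SetupStaysInSyncTest` fails and points straight at the drift.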

Going too far

This can get out of hand. You might see partial duplication between setup methods and want to merge them or create a smart builder. Don’t do it! You are seeing a ripple from one of your god classes. Fix the god class’s interaction with this test instead.

Demo the value to team and management…

Show three things at your sprint demo:

Example: One specific change (over-ripe test, then juice test + less-ripe test).
Progress: A chart of the number of overly-tested code chunks in your system over time (starting before your change and ending after your efforts), plus the number of work-hours it took you.
Impact: A quantitative measure showing the impact of your change.

The example is obvious; the other two require some explanation.

Progress Measure - Calculate

Do the hotspot analysis (see the recipe for details). Size each hotspot by the number of tests executing it. Ignore all hotspots with 3 tests or fewer, and add the rest to get a number. This number is your exposure: the number of extra-juicy spots. Track it over time.
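As a rough sketch of the arithmetic, assuming the hotspot analysis has already given you a count of tests per hotspot: the function name, the dictionary shape, and the sample counts below are all made up for illustration, and I am reading "add the rest" as summing the sizes of the hotspots that remain.

```python
def exposure(tests_per_hotspot, ignore_at_or_below=3):
    """Sum the sizes of all hotspots executed by more than 3 tests.

    tests_per_hotspot maps a hotspot name to the number of tests that
    execute it (hypothetical input shape, assumed for this sketch).
    """
    return sum(
        count
        for count in tests_per_hotspot.values()
        if count > ignore_at_or_below
    )


# Made-up sample data, not measurements from a real system.
sample = {
    "create_account": 75,
    "login": 12,
    "export_csv": 3,   # ignored: 3 tests or fewer
    "parse_date": 2,   # ignored: 3 tests or fewer
}

print(exposure(sample))  # 87
```

Recording this number each sprint gives you the data points for the progress chart.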

Demonstrating a progress measure and showing you can improve it will cause those at the demo to ask about impact. After that discussion they’ll start talking about funding if they see a high value-to-cost ratio.

Impact Measure - Track

Which of these matters most to your management?

Reduced cost overruns on stories.
Reduced regression bugs that were caught by tests but shipped anyway.

Whichever you choose is your impact measure. Now you can calculate a baseline. How much has this happened over the last quarter, and what did it cost the organization?

Because stories done near each other in time tend to be near each other in code, future work is likely to hit this same over-ripe test cluster. Therefore, if you fix half this cluster, you can expect to reduce your impact measure by half for near-term stories.
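The near-term estimate is simple proportional arithmetic. Here is a sketch with a hypothetical function name and a made-up baseline, assuming the impact really does scale linearly with the share of the cluster you have fixed:

```python
def projected_near_term_impact(baseline, fraction_of_cluster_fixed):
    """Estimate the impact measure after fixing part of the cluster.

    Assumes impact shrinks in proportion to the fraction fixed, which
    is the rule of thumb from the text, not a measured relationship.
    """
    if not 0.0 <= fraction_of_cluster_fixed <= 1.0:
        raise ValueError("fraction_of_cluster_fixed must be between 0 and 1")
    return baseline * (1.0 - fraction_of_cluster_fixed)


# Made-up baseline: 40 hours of story overrun last quarter.
print(projected_near_term_impact(40.0, 0.5))  # 20.0
```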

Over the long term you will hit all the juicy spots. Thus the long-term impact is approximately how much you improve your progress measure. That will return the conversation to the bigger picture of the progress measure that drives your funding.