This One Subtle Bug You Might Not See Coming

Sometimes bugs are obvious. You used a ++ where you should have used a --. Other times, bugs are much less obvious. A regular expression was missing an escape of a .. Then there are those bugs that are so subtle and insidious that they deserve their own blog post just to document and excoriate them. This post is about one of those bugs.

Fortunately for me, this wasn't my bug. It was a bug hidden in a Programmers.SE question. The questioner was blaming himself for not seeing or understanding a change in an external library, and he was desperately looking for a way to write better code to account for such changes in the external library. What did that library do? There was a method that returned a long, where the long was milliseconds since epoch. In the update, the method was changed so that it still returned a long, but the long now represented nanoseconds since epoch.

Putting aside the discussion about whether a long has enough precision, this change is a special type of awful. The semantic type of the return value has changed, but the programmatic type did not. The result is code using the library is now broken, since it expects milliseconds, but still compiles and runs, because it is getting a long value. Unit tests will all still pass. Integration or higher level tests might fail, or they might not. Even if those tests fail, tracking down the reason would not be straightforward.

But, but, but… it's a documented change! Sorry, that argument doesn't work in the real world. If my code compiles and runs against the update, that's all the documentation I should need (and most often will use). Only if it breaks will I then go to the docs to see what changes were made and what changes I have to make. Change logs and release notes are nice, helpful, and important, but they should be thought of as secondary resources to the code itself.

How do we fix this problem then? We use abstractions. There are a few different ways to approach this. The most important aspect to these solutions is that the code should continue to work seamlessly, or it should break as spectacularly and as soon as possible.

First, a seamless change. The method we want to change is as follows:

// returns milliseconds since epoch
public long getTimestamp() {
    // do some stuff
}


We now want to provide the timestamp in nanoseconds. Instead of changing the return value of the original method, we make whatever adjustments are necessary to have it keep returning milliseconds. We then create a new method that returns the nanosecond value:

// returns nanoseconds since epoch
public long getTimestampInNanoseconds() {
    // do some stuff
}


Old code can keep using the old method, and new code can use the new method. The downside is that we now have two methods cluttering up the interface, and we have to maintain them both. If the code has enough common components, they can be written such that they reference each other or a separate private helper, but this can get out of hand if many methods need to be added or changed.

Ok, now let's break the code.

The easiest way to break the code is to change the return type of the method. This will make all the dependent code fail to compile, breaking in a spectacular stream of compilation errors. By forcing a compilation failure, we force the developer to fall back to the secondary documentation mentioned earlier. We also can direct the developer to (nearly) every place that a change needs to be made, since the compiler will spew out the file and line where the compilation error is at.

What do we change the return type to? Well, returning a long is a form of primitive obsession, and we should be instead returning a value with a commensurate semantic meaning of the units of the return value. We can start by simply converting the return value from a long to a new MillisecondsSinceEpoch value (this step is optional, but is useful for the example):

// returns milliseconds since epoch
public MillisecondsSinceEpoch getTimestamp() {
    // do some stuff
}


When we then want to change it to return a NanosecondsSinceEpoch value, most of the code that uses it would fail to compile, pointing directly at the problem:

// returns nanoseconds since epoch
public NanosecondsSinceEpoch getTimestamp() {
    // do some stuff
}


We've now failed fast and spectacularly, sending our developers to the more informative secondary documents. There are downsides to this approach too though. The first is that the change in return value may not catch all the usages. If MillisecondsSinceEpoch and NanosecondsSinceEpoch both have a method toLong() that returns the value as a long, I could write a line like:

long milliseconds = timestamper.getTimestamp().toLong();


This would not report an error of the return type of getTimestamp() changes because the code is equally valid for both return types. By following the Law of Demeter, we can discourage constructs like that, but they still happen and we need to be robust to that if we can be.

The second downside is that, if the return type needs to change again (say, to PicosecondsSinceEpoch), we need to go through the same process again. All the code will fail to compile, and the developer will need to make another set of very similar changes. We can see that above in the optional step of changing from long to MillisecondsSinceEpoch to NanosecondsSinceEpoch, which is why I included that optional step. This could get tiring for developers.

There is a better solution, though. We still need to make the first spectacular failure, but we can then be robust to future changes. Rather than MillisecondsSinceEpoch and NanosecondsSinceEpoch classes, we can create a single new class, TimeSinceEpoch that encapsulates the representation away completely:

public TimeSinceEpoch getTimestamp() {
    // do some stuff
}


Once again, the code will fail to compile in a spectacular stream of compilation errors, and the developer will have to make the changes to fix everything, but we offer an API that is more useful and more robust through the TimeSinceEpoch interface. For example, rather than worrying whether the method returns milliseconds or nanoseconds, the TimeSinceEpoch class can provide methods for both representations:

public class TimeSinceEpoch {
    // …

    public long getMilliseconds() {
        // do some stuff
    }


    public long getNanoseconds() {
        // do some stuff
    }

    // …


Now, the compiler can catch all the places the code needs to change, since the methods make the units unambiguous. And now, if we need to add a getPicoseconds() method to the class, the existing code can be completely oblivious. This provides a robust, lasting solution with the assurance of fast failure and a one-time conversion pain.

It would have been nice if the code used a robust abstraction all along. We could have gone blithely into new versions without ever feeling that conversion pain. So should we always use an abstraction to wrap every value? In this case, it would have helped, but the truth is that we are looking at this retrospectively. At the outset, would we have thought that the timestamp precision would change? Probably not. In that case, returning a long value is reasonable, and wrapping that in a class would be considered over-engineering. Lots of people would have written that code. I would have written that code. When the change becomes painfully apparent, though, we should think about reaching for that abstraction. When a thing changes once, there's a good chance it will change again, and we should find solutions that prepare us for those changes.

Comments

Popular posts from this blog

The Timeline of Errors

Assuming the Bugs

Assuming Debugging