Assuming Debugging

Bugs are bad assumptions. Debugging is the process of testing our assumptions until we discover the invalid one(s). Once found, we can then correct those bad assumptions. The following code generates blog post author statistics for a popular blog site. The statistics show which days of the week authors are making posts. Most of the authors post every day, and extra on Fridays and Saturdays to capture the weekend audience. It has at least one bug.


@GET
@Path("/authors/{email}/weekly-stats")
public UserWeeklyStats calculateWeeklyStats(@PathParam("email") final String email) {
    final User user = this.userService.byEmail(email);
    final Map<LocalDate, List<Post>> postsByDate =
            this.postService.thisWeekGroupedByDateFor(user.id());

    if (postsByDate.isEmpty()) {
        return null;
    }

    final Map<DayOfWeek, List<Post>> postsByDayOfWeek = new HashMap<>();
    postsByDate.forEach((date, post) ->
        postsByDayOfWeek
                .computeIfAbsent(date.getDayOfWeek(), (key) -> new ArrayList<>())
                .add(post));

    final Map<DayOfWeek, Integer> postCountByDayOfWeek = new HashMap<>();
    DayOfWeek favoriteDayOfWeek = DayOfWeek.SUNDAY;
    int maxPosts = 0;
    for (final DayOfWeek dayOfWeek : DayOfWeek.values()) {
        final int postCountForDayOfWeek = postsByDayOfWeek.get(dayOfWeek).size();

        if (postCountForDayOfWeek > maxPosts) {
            favoriteDayOfWeek = dayOfWeek;
            maxPosts = postCountForDayOfWeek;
        }

        postCountByDayOfWeek.put(dayOfWeek, Integer.valueOf(maxPosts));

        if (dayOfWeek == LocalDate.now().getDayOfWeek()) {
            break;
        }
    }

    return UserWeeklyStats.build()
        .withPostCountsByDayOfWeek(postCountByDayOfWeek)
        .withFavoriteWeekDay(favoriteDayOfWeek, maxPosts)
        .done();
}

We've received several tickets relating to this block of code. Some users report incorrect statistics, while others encounter 500 errors trying to see the numbers. Below are some of the bugs we've seen submitted.

  • Author statistics incorrect for Saturday
  • Author page does not load stats on Sunday mornings
  • Some West-Coast author pages do not load in the mornings
  • Bookmarked author page shows error

The first inclination might be to start reading through the code to see if we can spot the error. Or we might try to reproduce all the issues, probably with some but not all success. Instead, let's look at the assumptions we are making about this method, and see if we can glean knowledge that way.

We discover our assumptions by looking over the code and identifying them. We can write them down if necessary, or hold them in our head if they are few enough. Strong assumptions, like our compiler working, we can ignore for a first pass. It is far more likely that the bugs are in our code, and it is much better if they are because we can fix them.

Looking at the first few lines, we see that this is an HTTP GET endpoint. That enpoint has a user's email as a path parameter. That email address is extracted by our framework and passed in as a parameter to the method. Like the compiler, we can assume that the framework we are using is not the source of the bugs. Though it may have bugs, it is unlikely as this framework is widely used and bugs like this have mostly been ironed out. We see that our code passes this parameter to a service that will find the user. Here, the code assumes that the email string must actually be an email address, and that the email address corresponds to an author. It does no checking to ensure that an email is provided at all, and that it is actually an email address. This doesn't appear to correspond to any of our above bugs, but we should note it as a potential bug. If a user calls /authors//weekly-stats or /authors/asdf/weekly-stats, we probably want to fail fast and return a relevant error message to them.

We also see that our code assumes that we get back a valid user for the given email address. We do not check if the returned user is valid, or if the service method throws an error. We look through our list of bugs to see if any of them apply. The only one that seems tangentially related is the one about bookmarked author pages. Authors are free to change their email addresses in the system, and the bug report notes that it seems to only happen to authors that have changed their emails. Perhaps their bookmarks use their old email addresses. There is probably a larger fix to track old email addresses in the system, but at least we can return a clearer message to the user.

As we keep reading, we see that we get the author's posts for this week. If there are no author posts, we return a null and skip any further processing. In some code bases, this is fine, but we try to never return null values to the front-end in this code base. Because of that contract, the UI code is often written to assume some sort of value is always returned. In this case, we have an assumption we actually want to enforce, and we are breaking that assumption. There are a few tactics to fix this. We could return an error result saying there are no posts. We could wrap the result in an optional-type wrapper that will indicate if there is a value. We could return a default empty result that the UI could blissfully process unawares. These are all valid options, and which one is right depends on the exact situation. Looking through our ticket list, we see that failing to load on Sunday mornings would very likely be caused by this bad assumption.

Next we see that the code reads through this week's posts from the author, and groups them by the day of the week they were posted. This code safely handles each new day of the week it encounters by creating a new list for posts if it isn't already in the map. We then see values being set up for use in a loop. We notice that the code assumes that the day of the week for the most posts is Sunday by default. This default value assumption is likely wrong if the user has no posts. This is probably not a critical assumption, and we may note it and ignore it for now.

We come to a loop over the days of the week. It gets the number of posts for each day of the week and stores it in a map. It also calculates the author's most popular day of the week to post. This code makes several assumptions about the frequency of author posts. We loop over every day of the week and get the size of the list. The assumption here is that the author had a post every day of the week. If the author did not post one day, getting the list from the map will return null, and the code will throw an error. Most authors post at least once every day on the site, but they post throughout the day. If a user tries to see the authors statistics before an author has posted on a given day, they could encounter this error. It would affect authors on the West Coast more often, as they post later in the day due to time zone differences. The original developer was clever enough to notice this for later days in the week with the final conditional in the loop, but did not apply that same thought to passed days. We can strongly link this bad assumption to that bug.

Looking further at the code, we see the assumptions about picking the author's favorite day to post. That code assumes that the sizes will always make sense (will not be negative), will be reasonably comparable, and can be tracked to find the maximum to find the author's favorite. These are all safe assumptions, and this code is correct.

Finally we come to the line to set the number of posts for each day of the week. There is a pretty obvious bug in this line. We are passing in the maximum number of posts so far instead of the number of posts for this day. The line should be postCountByDayOfWeek.put(dayOfWeek, Integer.valueOf(postCountForDayOfWeek));. If the number of posts ascends for every day of the week, this code would work like it should. If the number of posts does not increase every day, then days after the previous maximum will be wrong. Authors post most articles on Fridays and Saturdays, but sometimes they post more on Friday than Saturday. Because of this, the number for Saturday will be wrong, instead showing Friday's number again. We see that we have a ticket for this.

Some might contend that this bug is not a bad assumption, but more likely a simple typo. It probably is a simple oversight by the developer intially, but there are many assumptions from there that are invalid. We assume that developers diligently test their code changes, manually or automatically. This developer did not check for a series of posts that is not strictly increasing, or else they would have caught this. Additionally, we assume that reviewers will also diligently read and test the code. That also did not happen in this case. Developers are people, and they sometimes make mistakes. They make mistakes and cannot catch every bug. To assume they can is wrong. Instead, we can make efforts to embed our assumptions into the code itself.

This is the second in a series of posts on assumptions, bugs, debuggability, and maintainabilty.

Comments

Popular posts from this blog

The Timeline of Errors

Assuming the Bugs