Break Down Code to Make It DRYer

US Beef cuts

We are always looking to make our code DRYer. Code that is not repeated is easier to maintain. We can make a change in just one place rather than hunting for different incarnations in many different places. Making code DRYer is not always easy, and sometimes how to remove the duplication is not apparent.

Let's say we have the following two functions:


public List<Stats> processLines(File file) {
    List<Stats> stats = new ArrayList<>();

    for (String line : lines(file)) {
        Data data = extractDataFrom(line);
        Stats dataStats = calculateStatsFor(data);
        stats.add(dataStats);
    }

    return stats;
}

public List<Stats> processLines(List<File> files) {
    List<Stats> stats = new ArrayList<>();

    int fileNumber = 0;
    for (File file : files) {
        for (String line : lines(file)) {
            Data data = extractDataFrom(line, fileNumber);
            Stats dataStats = calculateStatsFor(data);
            stats.add(dataStats);
        }
        fileNumber++;
    }

    return stats;
}

We can immediately see that these two functions are very similar. The main difference between them is that one handles a single file while the other handles multiple files. It almost is the same, except for that call to extractDataFrom. In the first function, it takes one parameter: the line read from the file. In the second function, it takes two: the line read and the index of the file in the list of files. This means that the two methods are just different enough that they cannot be combined simply. We have to work a little harder. We can use two possible strategies for removing the duplication.

One is a special case of many

We don't see above the different implementations of extractDataFrom. We do know that the two overloads look like they do almost the same thing, much like the two functions we do see. Perhaps, if we scroll around in this class, we might see the following code:


private Data extractDataFrom(String line) {
    extractDataFrom(line, 0);
}

private Data extractDataFrom(String line, int fileNumber) {
    // Do some data extraction
}

This is an instance where one is a special case of many. The single file call is the same as the multi-file call with a list containing one file. Because of this, we can easily remove the duplication:


public List<Stats> processLines(File file) {
    return processLines(Arrays.asList(file));
}

public List<Stats> processLines(List<File> files) {
    List<Stats> stats = new ArrayList<>();

    int fileNumber = 0;
    for (File file : files) {
        for (String line : lines(file)) {
            Data data = extractDataFrom(line, fileNumber);
            Stats dataStats = calculateStatsFor(data);
            stats.add(dataStats);
        }
        fileNumber++;
    }

    return stats;
}

We will often see cases where one is a special case of many, and we can remove redundancy by making the one case into a many case. This often happens because we don't always see that we have these two cases that are the same. It just takes a careful eye to spot the duplication and remove it.

Sometimes things are not that simple.

Encapsulate What Varies


Instead, we scroll through the file and find that the methods are defined as:


private Data extractDataFrom(String line) {
    // Do some data extraction one way
}

private Data extractDataFrom(String line, int fileNumber) {
    // Do some data extraction another, completely different way
}

The single case is not a special instance of the many case now. They are handled in different ways that must be maintained. We can't use the trick we used above. We have to look more closely at the duplication we want to get rid of.

We first list out what our two functions are doing that overlaps:
  1. With each file (this is done differently between the functions!)
  2. Get each line in that file
  3. Extract the Data from that line (this is done differently between the functions!)
  4. Calculate the Stats for each Data
  5. Add the Stats to the list of Stats
  6. Return the list of Stats

We see that, essentially, the steps are all the same except for the loop over the files and the extract step. Those common steps are the duplication. What we need to do now is to encapsulate what varies. Once we've encapsulated so that both functions become essentially the same, we can remove the duplication. To do this, we will need to rearrange and refactor a bit. We can make the code look like:


public List<Stats> processLines(File file) {
    List<Data> datas = new ArrayList<>();

    for (String line : lines(file)) {
        datas.add(extractDataFrom(line));
    }

    List<Stats> stats = new ArrayList<>();

    for (Data data : datas) {
        Stats dataStats = calculateStatsFor(data);
        stats.add(dataStats);
    }

    return stats;
}

public List<Stats> processLines(List<File> files) {
    List<Data> datas = new ArrayList<>();

    int fileNumber = 0;
    for (File file : files) {
        for (String line : lines(file)) {
            datas.add(extractDataFrom(line, fileNumber));
        }
        fileNumber++;
    }

    List<Stats> stats = new ArrayList<>();

    for (Data data : datas) {
        Stats dataStats = calculateStatsFor(data);
        stats.add(dataStats);
    }

    return stats;
}

We now have two chunks of identical code at the bottom of each function, and all of the code that varies at the top. We have started to encapsulate away the extraction of the data from lines into a simple list of datas to process the stats for. I acknowledge that there is a change in the memory performance of this code from the original, but I am going to hand-wave that for the moment and return to it later.

Since we now have duplicate code, we can pull it out and pass in the datas as a parameter, fully encapsulating how they are derived:


public List<Stats> processLines(File file) {
    List<Data> datas = new ArrayList<>();

    for (String line : lines(file)) {
        datas.add(extractDataFrom(line));
    }

    return processDatas(datas);
}

public List<Stats> processLines(List<File> files) {
    List<Data> datas = new ArrayList<>();

    int fileNumber = 0;
    for (File file : files) {
        for (String line : lines(file)) {
            datas.add(extractDataFrom(line, fileNumber));
        }
        fileNumber++;
    }

    return processDatas(datas);
}

private List<Stats> processDatas(Iterable<Data> datas) {
    List<Stats> stats = new ArrayList<>();

    for (Data data : datas) {
        Stats dataStats = calculateStatsFor(data);
        stats.add(dataStats);
    }

    return stats;
}

This is looking pretty good. The processDatas function takes an Iterable of Datas to be more flexible, in case we want to ever change the data structure we are using. This is an improvement, and might be good enough. However, there are still some smells with this code. First, we hand-waved the memory performance, but if the files are very large we could run out of memory where we would not have before. Second, we originally rearranged these methods by separating a unique part and a common part. This allowed us to do some encapsulation, but it also indicates that these functions may be doing to much. They may not be single responsibility.

We've started to encapsulate what varies by putting the Datas into a list, but a list doesn't fully encapsulate. It is just a general data structure for a collection of similar things. Is there a way to fully encapsulate the difference using polymorphism?

We originally had processDatas take an Iterable to provide future flexibility. We can take advantage of that and hide the complexity of reading the files behind the simple Iterable interface. First, we can extract out a simple Iterable/Iterator for the single file case:


public class SingleFileLineToDataIterable implements Iterable<Data> {
    private File file;

    public SingleFileLineToDataIterable(File file) {
        this.file = file;
    }

    public Iterator<Data> iterator() {
        return new SingleFileLineToDataIterator(lines(file).iterator());
    }

    private static class SingleFileLineToDataIterator implements Iterator<Data> {
        private Iterator<String> lines;

        public SingleFileLineToDataIterator(Iterator<String> lines) {
            this.lines = lines;
        }

        public boolean hasNext() {
            return lines.hasNext();
        }

        public Data next() {
            return extractDataFrom(lines.next());
        }

        // Note: This method moved here from our other class
        private Data extractDataFrom(String line) {
            // Do some data extraction one way
        }
    }
}

We can create a SingleFileLineToDataIterable with our file and pass it to the processDatas function:


public List<Stats> processLines(File file) {
    return processDatas(new SingleFileLineToDataIterable(file));
}

We can also extract out an Iterable/Iterator for the multi-file case:


public class MultiFileLineToDataIterable implements Iterable<Data> {
    private Iterable<File> files;

    public MiltiFileLineToDataIterable(Iterable<File> files) {
        this.files = files;
    }

    public Iterator<Data> iterator() {
        return new MultiFileLineToDataIterator(files.iterator());
    }

    private static class MultiFileLineToDataIterator implements Iterator<Data> {
        private Iterator<File> files;
        private int fileNumber = 0;
        private Iterator<String> currentLines = Collections.emptyIterator();

        public MultiFileLineToDataIterator(Iterator<File> files) {
            this.files = files;
        }

        public boolean hasNext() {
            if (currentLines.hasNext()) {
                return true;
            }

            if (!files.hasNext()) {
                return false;
            }

            currentLines = lines(files.next()).iterator();
            fileNumber += 1;

            return hasNext();
        }

        public Data next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }

            return extractDataFrom(currentLines.next(), fileNumber);
        }

        // Note: This method moved here from our other class
        private Data extractDataFrom(String line, int fileNumber) {
            // Do some data extraction another, completely different way
        }
    }
}

We can create a MultiFileLineToDataIterable with our files and pass it to the processDatas function:


public List<Stats> processLines(List<File> files) {
    processDatas(new MultiFileLineToDataIterable(files));
}

At this point, we should inline calls to both overloads of processLines where we can. If we can remove all the calls, we can delete them; if not, we inline what we can and mark them deprecated. In doing so, we've pushed the responsibility of reading the files out of this class completely, improving the maintainability. We've removed the duplication while encapsulating what varies. We also maintain the original memory performance reading data in the same way as we did originally.


This blog post is based on this StackExchange question.

Comments

Popular posts from this blog

The Timeline of Errors

Magic Numbers, Semantics, and Compiler Errors

Assuming Debugging