Every time a bug is fixed, developers leave a trace – in the version database when they commit the fix, or in the bug database when they close the bug. In this chapter, we learn how to mine these repositories for past changes and bugs, and how to map them to individual modules and functions, highlighting those project components that have seen the most changes and fixes over time.
from bookutils import YouTubeVideo
# YouTubeVideo("w4u5gCgPlmg")
Prerequisites
import bookutils
import Tracking
To use the code provided in this chapter, write
>>> from debuggingbook.ChangeExplorer import <identifier>
and then make use of the following features.
This chapter provides two classes `ChangeCounter` and `FineChangeCounter` that allow mining and visualizing the distribution of changes in a given `git` repository.

`ChangeCounter` is initialized as

change_counter = ChangeCounter(repository)

where `repository` is either a directory containing a `git` clone (i.e., it contains a `.git` directory) or the URL of a `git` repository. Additional arguments are passed to the underlying `RepositoryMining` class from the PyDriller Python package. A `filter` keyword argument, if given, is a predicate that takes a modification (from PyDriller) and returns True if it should be included.
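As an illustration, a `filter` predicate restricting mining to Python files might look as follows. This is a sketch only: `SimpleNamespace` stands in for a PyDriller modification object, and the `filename` attribute is an assumption about its interface.

```python
# Hypothetical filter predicate: include only modifications to Python files.
# `SimpleNamespace` stands in for a PyDriller modification object;
# the `filename` attribute is an assumption about its interface.
from types import SimpleNamespace

def python_files_only(m):
    """Return True if modification `m` affects a Python file."""
    return m.filename.endswith('.py')

# Usage (hypothetical): ChangeCounter(repository, filter=python_files_only)
print(python_files_only(SimpleNamespace(filename='ChangeCounter.py')))  # True
print(python_files_only(SimpleNamespace(filename='README.md')))         # False
```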
In a change counter, all elements in the repository are represented as nodes – tuples $(f_1, f_2, ..., f_n)$ that denote a hierarchy: Each $f_i$ is a directory holding $f_{i+1}$, with $f_n$ being the actual file.
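For illustration, a file path can be turned into such a node tuple by splitting it at path separators – a minimal sketch, assuming POSIX-style paths:

```python
def path_to_node(path):
    """Convert a POSIX-style file path into a node tuple (f1, f2, ..., fn)."""
    return tuple(path.split('/'))

print(path_to_node('notebooks/PICS/Slicer-synopsis-1.png'))
# ('notebooks', 'PICS', 'Slicer-synopsis-1.png')
print(path_to_node('README.md'))
# ('README.md',)
```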
A `change_counter` provides a number of attributes. `changes` is a mapping of nodes to the number of changes in that node:
>>> change_counter.changes[('README.md',)]
7
The `messages` attribute holds all commit messages related to that node:
>>> change_counter.messages[('README.md',)]
['first commit',
'Adjusted to debuggingbook',
'New Twitter handle: @Debugging_Book',
'Doc update',
'Doc update',
'Doc update',
'Doc update']
The `sizes` attribute holds the (last) size of the respective element:
>>> change_counter.sizes[('README.md',)]
10701
`FineChangeCounter` acts like `ChangeCounter`, but also retrieves statistics for elements within the respective files; it has been tested for C, Python, and Jupyter Notebooks and should provide sufficient results for programming languages with similar syntax.
The `map()` method of `ChangeCounter` and `FineChangeCounter` produces an interactive tree map that allows exploring the elements of a repository. The redder (darker) a rectangle, the more changes it has seen; the larger a rectangle, the larger its size in bytes.
>>> fine_change_counter.map()
We can easily identify the most frequently changed files:
all_nodes = list(change_counter.changes.keys())
all_nodes.sort(key=lambda node: change_counter.changes[node], reverse=True)
[(node, change_counter.changes[node]) for node in all_nodes[:4]]
[(('notebooks', 'PICS', 'Slicer-synopsis-1.png'), 183), (('notebooks', 'PICS', 'StatisticalDebugger-synopsis-1.png'), 176), (('notebooks', 'Slicer.ipynb'), 109), (('notebooks', 'PICS', 'Repairer-synopsis-1.png'), 73)]
# ignore
all_notebooks = [node for node in change_counter.changes.keys()
                 if len(node) == 2 and node[1].endswith('.ipynb')]
all_notebooks.sort(key=lambda node: change_counter.changes[node], reverse=True)
from bookutils import quiz
quiz("Which two notebooks have seen the most changes over time?",
     [
         f"`{all_notebooks[3][1].split('.')[0]}`",
         f"`{all_notebooks[1][1].split('.')[0]}`",
         f"`{all_notebooks[2][1].split('.')[0]}`",
         f"`{all_notebooks[0][1].split('.')[0]}`",
     ], [1234 % 3, 3702 / 1234])
Indeed, these two are the two most frequently changed notebooks:
all_notebooks[0][1].split('.')[0], all_notebooks[1][1].split('.')[0]
('Slicer', 'Repairer')
Knowing which files have been changed most is useful in debugging, because any change increases the chance to introduce a new bug. Even more important, however, is the question of how frequently a file was fixed in the past, as this is an important indicator for its bug-proneness.
(One may think that fixing several bugs reduces the number of bugs, but unfortunately, a file which has seen several fixes in the past is likely to see fixes in the future, too. This is because the bug-proneness of a software component very much depends on the requirements it has to fulfill, and if these requirements are unclear, complex, or frequently change, this translates into many fixes.)
How can we tell changes from fixes?
The way these two are linked very much depends on the project – and the discipline of developers when it comes to commit messages. Branches and merges bring additional challenges.
For the `debuggingbook` project, identifying fixes is easy. The discipline is that if a change fixes a bug, its commit message is prefixed with `Fix:`. We can use this to introduce a `FixCounter` class specific to our `debuggingbook` project.
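The convention boils down to a simple predicate on commit messages. Here, `is_fix()` is a hypothetical helper shown for illustration only, not part of the chapter's classes:

```python
def is_fix(msg):
    """Return True if commit message `msg` marks a fix,
    following the debuggingbook convention of a 'Fix:' prefix.
    (Hypothetical helper, for illustration only.)"""
    return msg.startswith("Fix:")

print(is_fix("Fix: correct off-by-one in mapping"))  # True
print(is_fix("Doc update"))                          # False
```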
class FixCounter(ChangeCounter):
    def include(self, m):
        """Include all modifications whose commit messages start with 'Fix:'"""
        return super().include(m) and m and m.msg.startswith("Fix:")
As a twist to our default `ChangeCounter` class, we include the "fix" messages in the tree map rectangles.
class FixCounter(FixCounter):
    def map_node_text(self, node):
        if node and node in self.messages:
            return "<br>".join(self.messages[node])
        return ""

    def map_hoverinfo(self):
        return 'label'
This is the tree map showing fixes:
fix_counter = debuggingbook_change_counter(FixCounter)
fix_counter.map()
In programming projects, individual files typically consist of smaller units such as functions, classes, and methods. We want to determine which of these units are frequently changed (and fixed). For this, we need to break down individual files into smaller parts, and then determine which of these parts would be affected by a change.
Our first task is a simple means to split a (programming) file into smaller parts, each with their own locations. First, we need to know what kind of content a file contains. To this end, we use the Python magic package. (The "magic" in the name does not refer to some "magic" functionality, but to the practice of having files start with "magic" bytes that indicate their type.)
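Such "magic" bytes can be inspected directly. Here is a minimal, stdlib-only sketch of the idea – a toy lookup over two well-known signatures (PNG and PDF), not a replacement for the `magic` package:

```python
# Toy magic-byte lookup. The PNG and PDF signatures are standardized;
# real file type detection should use the `magic` package.
SIGNATURES = {
    b'\x89PNG\r\n\x1a\n': 'PNG image',
    b'%PDF': 'PDF document',
}

def guess_type(data):
    """Guess a file type from its leading 'magic' bytes."""
    for signature, file_type in SIGNATURES.items():
        if data.startswith(signature):
            return file_type
    return 'unknown'

print(guess_type(b'\x89PNG\r\n\x1a\n' + b'...'))  # 'PNG image'
print(guess_type(b'%PDF-1.7 ...'))                # 'PDF document'
```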
import magic
The `magic` package easily guesses that a file contains C code:
magic.from_buffer('''
#include <stdio.h>
int main(int argc, char *argv[]) {
    printf("Hello, world!\n");
}
''')
'C source, ASCII text'
It also works well for Python code:
magic.from_buffer('''
def foo():
    print("Hello, world!")
''')
'Python script, ASCII text executable'
Jupyter Notebooks, however, are identified as SGML documents:
magic.from_buffer(open(os.path.join(current_repo(), 'notebooks', 'ChangeExplorer.ipynb')).read())
'exported SGML document, UTF-8 Unicode text, with very long lines'
We define a set of delimiters for these languages. Each entry consists of three regular expressions: the first identifies the language (matching the `magic` output), the second matches the beginning of an element, and the third matches its end. For Python, for instance, any line starting with `def` or `class` denotes the start of some unit; any line starting with something else denotes the end of a unit. For Jupyter, the delimiters do the same, yet encoded into JSON. The definitions for C are likely to work for a wide range of languages that all use `{` and `}` to delimit units.
import re
DELIMITERS = [
    (
        # Python
        re.compile(r'^python.*'),

        # Beginning of element
        re.compile(r'^(async\s+)?(def|class)\s+(?P<name>\w+)\W.*'),

        # End of element
        re.compile(r'^[^#\s]')
    ),
    (
        # Jupyter Notebooks
        re.compile(r'^(json|exported sgml|jupyter).*'),
        re.compile(r'^\s+"(async\s+)?(def|class)\s+(?P<name>\w+)\W'),
        re.compile(r'^(\s+"[^#\s\\]|\s+\])')
    ),
    (
        # C source code (actually, any { }-delimited language)
        re.compile(r'^(c |c\+\+|c#|java|perl|php).*'),
        re.compile(r'^[^\s].*\s+(?P<name>\w+)\s*[({].*'),
        re.compile(r'^[}]')
    )
]
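As a quick check, the Python "beginning of element" pattern captures the element name in its `name` group; the regular expression is repeated here so the sketch is self-contained:

```python
import re

# Same "beginning of element" pattern as in DELIMITERS above
PYTHON_BEGIN = re.compile(r'^(async\s+)?(def|class)\s+(?P<name>\w+)\W.*')

for line in ['def foo(x):', 'class Bar(Base):', 'async def baz():', '    return x']:
    match = PYTHON_BEGIN.match(line)
    print(repr(line), '->', match.group('name') if match else None)
```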
The function `rxdelim()` returns suitable delimiters for a given content, using `DELIMITERS`.
def rxdelim(content):
    """Return suitable begin and end delimiters for the content `content`.
    If no matching delimiters are found, return `None, None`."""
    tp = magic.from_buffer(content).lower()

    for rxtp, rxbegin, rxend in DELIMITERS:
        if rxtp.match(tp):
            return rxbegin, rxend

    return None, None
The function `elem_mapping()` returns a list of the individual elements as found in the file, indexed by line numbers (starting with 1).
def elem_mapping(content, log=False):
    """Return a list of the elements in `content`, indexed by line number."""
    rxbegin, rxend = rxdelim(content)
    if rxbegin is None:
        return []

    mapping = [None]
    current_elem = None
    lineno = 0

    for line in content.split('\n'):
        lineno += 1

        match = rxbegin.match(line)
        if match:
            current_elem = match.group('name')
        elif rxend.match(line):
            current_elem = None

        mapping.append(current_elem)

        if log:
            print(f"{lineno:3} {str(current_elem):15} {line}")

    return mapping
Here is an example of how `elem_mapping()` works. During execution (with `log` set to `True`), we already see the elements associated with individual line numbers.
some_c_source = """
#include <stdio.h>

int foo(int x) {
    return x;
}

struct bar {
    int x, y;
}

int main(int argc, char *argv[]) {
    return foo(argc);
}

"""
some_c_mapping = elem_mapping(some_c_source, log=True)
  1 None
  2 None            #include <stdio.h>
  3 None
  4 foo             int foo(int x) {
  5 foo                 return x;
  6 None            }
  7 None
  8 bar             struct bar {
  9 bar                 int x, y;
 10 None            }
 11 None
 12 main            int main(int argc, char *argv[]) {
 13 main                return foo(argc);
 14 None            }
 15 None
 16 None
In the actual mapping, we can access the individual units for any line number:
some_c_mapping[1], some_c_mapping[8]
(None, 'bar')
Here's how this works for Python:
some_python_source = """
def foo(x):
    return x

class bar(blue):
    x = 25
    def f(x):
        return 26

def main(argc):
    return foo(argc)

"""
some_python_mapping = elem_mapping(some_python_source, log=True)
  1 None
  2 foo             def foo(x):
  3 foo                 return x
  4 foo
  5 bar             class bar(blue):
  6 bar                 x = 25
  7 bar                 def f(x):
  8 bar                     return 26
  9 bar
 10 main            def main(argc):
 11 main                return foo(argc)
 12 main
 13 main
# some_jupyter_source = open("Slicer.ipynb").read()
# some_jupyter_mapping = elem_mapping(some_jupyter_source, log=False)
Using a mapping from `elem_mapping()`, we can determine which elements are affected by a change. The `changed_elems_by_mapping()` function returns the set of affected elements.
def changed_elems_by_mapping(mapping, start, length=0):
    """Within `mapping`, return the set of elements affected by a change
    starting in line `start` and extending over `length` additional lines"""
    elems = set()
    for line in range(start, start + length + 1):
        if line < len(mapping) and mapping[line]:
            elems.add(mapping[line])

    return elems
Here's an example of `changed_elems_by_mapping()`, applied to the Python content above:
changed_elems_by_mapping(some_python_mapping, start=2, length=4)
{'bar', 'foo'}
The function `elem_size()` returns the size of an element (say, a function).
def elem_size(elem, source):
    """Within `source`, return the size of `elem`"""
    source_lines = [''] + source.split('\n')
    size = 0
    mapping = elem_mapping(source)

    for line_no in range(len(mapping)):
        if mapping[line_no] == elem or mapping[line_no] is elem:
            size += len(source_lines[line_no] + '\n')

    return size
elem_size('foo', some_python_source)
26
assert sum(elem_size(name, some_python_source)
           for name in ['foo', 'bar', 'main']) == len(some_python_source)
Given an old version and a new version of a (text) file, we can use the `diff_match_patch` module to determine differences, and from these, the affected lines:
from ChangeDebugger import diff # minor dependency
from diff_match_patch import diff_match_patch
def changed_elems(old_source, new_source):
    """Determine the elements affected by the change from `old_source` to `new_source`"""
    patches = diff(old_source, new_source)

    old_mapping = elem_mapping(old_source)
    new_mapping = elem_mapping(new_source)

    elems = set()

    for patch in patches:
        old_start_line = patch.start1 + 1
        new_start_line = patch.start2 + 1

        for (op, data) in patch.diffs:
            length = data.count('\n')

            if op == diff_match_patch.DIFF_INSERT:
                elems |= changed_elems_by_mapping(old_mapping, old_start_line)
                elems |= changed_elems_by_mapping(new_mapping, new_start_line, length)
            elif op == diff_match_patch.DIFF_DELETE:
                elems |= changed_elems_by_mapping(old_mapping, old_start_line, length)
                elems |= changed_elems_by_mapping(new_mapping, new_start_line)

            old_start_line += length
            new_start_line += length

    return elems
Here is how `changed_elems()` works. We define a "new" version of `some_python_source`:
some_new_python_source = """
def foo(y):
    return y

class qux(blue):
    x = 25
    def f(x):
        return 26

def main(argc):
    return foo(argc)

"""
changed_elems(some_python_source, some_new_python_source)
{'bar', 'foo', 'qux'}
Note that the list of changed elements includes added as well as deleted elements.
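As an aside: if `diff_match_patch` is unavailable, changed line ranges can also be computed with Python's standard `difflib` module. This is an alternative sketch, not the chapter's implementation:

```python
import difflib

def changed_line_ranges(old_source, new_source):
    """Return ((old_start, old_end), (new_start, new_end)) pairs of
    1-based line numbers that differ between the two sources."""
    old_lines = old_source.split('\n')
    new_lines = new_source.split('\n')
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)

    ranges = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != 'equal':  # 'replace', 'delete', or 'insert'
            ranges.append(((i1 + 1, i2), (j1 + 1, j2)))
    return ranges

print(changed_line_ranges("a\nb\nc", "a\nB\nc"))
# [((2, 2), (2, 2))]
```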
We introduce a class `FineChangeCounter` that, like `ChangeCounter`, counts changes for individual files; however, `FineChangeCounter` adds additional nodes for all elements affected by the change. For a file consisting of multiple elements, this has the same effect as if the file were a directory, and the elements were all contained as individual files in this directory.
class FineChangeCounter(ChangeCounter):
    def update_elems(self, node, m):
        old_source = m.source_code_before if m.source_code_before else ""
        new_source = m.source_code if m.source_code else ""

        for elem in changed_elems(old_source, new_source):
            elem_node = node + (elem,)

            self.update_size(elem_node, elem_size(elem, new_source))
            self.update_changes(elem_node, m.msg)
Retrieving fine-grained changes takes a bit more time, since all files have to be parsed...
with Timer() as t:
    fine_change_counter = debuggingbook_change_counter(FineChangeCounter)

t.elapsed_time()
164.12128805500106
... but the result is very much worth it. We can now zoom into individual files and compare the change counts for the individual functions.
fine_change_counter.map()
Like before, we can access the most frequently changed elements. This is the most frequently changed item in the book:
elem_nodes = [node for node in fine_change_counter.changes.keys()
              if len(node) == 3 and node[1].endswith('.ipynb')]
elem_nodes.sort(key=lambda node: fine_change_counter.changes[node], reverse=True)
[(node, fine_change_counter.changes[node]) for node in elem_nodes[:1]]
[(('notebooks', 'Slicer.ipynb', 'Slicer'), 7)]
from bookutils import quiz
quiz("Which is the _second_ most changed element?",
     [
         f"`{elem_nodes[3][2]}` in `{elem_nodes[3][1].split('.ipynb')[0]}`",
         f"`{elem_nodes[1][2]}` in `{elem_nodes[1][1].split('.ipynb')[0]}`",
         f"`{elem_nodes[2][2]}` in `{elem_nodes[2][1].split('.ipynb')[0]}`",
         f"`{elem_nodes[0][2]}` in `{elem_nodes[0][1].split('.ipynb')[0]}`",
     ], 1975308642 / 987654321)
Indeed, here is the list of the five most frequently changed elements:
[(node, fine_change_counter.changes[node]) for node in elem_nodes[:5]]
[(('notebooks', 'Slicer.ipynb', 'Slicer'), 7), (('notebooks', 'Slicer.ipynb', 'DependencyTracker'), 4), (('notebooks', 'Template.ipynb', 'int_fuzzer'), 3), (('notebooks', 'Debugger.ipynb', 'debug'), 3), (('notebooks', 'Tracer.ipynb', 'ConditionalTracer'), 3)]
Now it is time to apply these tools to your own projects. Which are the most frequently changed (and fixed) elements? Why is that so? What can you do to improve things? All these are consequences of debugging – helping you have fewer bugs in the future!
This chapter provides two classes `ChangeCounter` and `FineChangeCounter` that allow mining and visualizing the distribution of changes in a given `git` repository.

`ChangeCounter` is initialized as

change_counter = ChangeCounter(repository)

where `repository` is either a directory containing a `git` clone (i.e., it contains a `.git` directory) or the URL of a `git` repository. Additional arguments are passed to the underlying `RepositoryMining` class from the PyDriller Python package. A `filter` keyword argument, if given, is a predicate that takes a modification (from PyDriller) and returns True if it should be included.
In a change counter, all elements in the repository are represented as nodes – tuples $(f_1, f_2, ..., f_n)$ that denote a hierarchy: Each $f_i$ is a directory holding $f_{i+1}$, with $f_n$ being the actual file.
A `change_counter` provides a number of attributes. `changes` is a mapping of nodes to the number of changes in that node:
change_counter.changes[('README.md',)]
7
The `messages` attribute holds all commit messages related to that node:
change_counter.messages[('README.md',)]
['first commit', 'Adjusted to debuggingbook', 'New Twitter handle: @Debugging_Book', 'Doc update', 'Doc update', 'Doc update', 'Doc update']
The `sizes` attribute holds the (last) size of the respective element:
change_counter.sizes[('README.md',)]
10701
`FineChangeCounter` acts like `ChangeCounter`, but also retrieves statistics for elements within the respective files; it has been tested for C, Python, and Jupyter Notebooks and should provide sufficient results for programming languages with similar syntax.
The `map()` method of `ChangeCounter` and `FineChangeCounter` produces an interactive tree map that allows exploring the elements of a repository. The redder (darker) a rectangle, the more changes it has seen; the larger a rectangle, the larger its size in bytes.
fine_change_counter.map()
Subclassing offers several ways to customize what to mine and how to visualize it. See the chapter for details.
Here are all the classes defined in this chapter:
# ignore
from ClassDiagram import display_class_hierarchy
# ignore
display_class_hierarchy([FineChangeCounter, FixCounter],
                        public_methods=[
                            ChangeCounter.__init__,
                            ChangeCounter.map  # FIXME: Why is `map()` not highlighted?
                        ],
                        project='debuggingbook')