# Where the Bugs are¶

Every time a bug is fixed, developers leave a trace – in the version database when they commit the fix, or in the bug database when they close the bug. In this chapter, we learn how to mine these repositories for past changes and bugs, and how to map them to individual modules and functions, highlighting those project components that have seen most changes and fixes over time.

In [1]:
from bookutils import YouTubeVideo

Out[1]:

Prerequisites

In [2]:
import bookutils

In [3]:
import Tracking


## Synopsis¶

>>> from debuggingbook.ChangeCounter import <identifier>


and then make use of the following features.

This chapter provides two classes ChangeCounter and FineChangeCounter that allow to mine and visualize the distribution of changes in a given git repository.

ChangeCounter is initialized as

change_counter = ChangeCounter(repository)


where repository is either

• a directory containing a git clone (i.e., it contains a .git directory)
• the URL of a git repository.

Additional arguments are being passed to the underlying Repository class from the PyDriller Python package. A filter keyword argument, if given, is a predicate that takes a modification (from PyDriller) and returns True if it should be included.

In a change counter, all elements in the repository are represented as nodes – tuples $(f_1, f_2, ..., f_n)$ that denote a hierarchy: Each $f_i$ is a directory holding $f_{i+1}$, with $f_n$ being the actual file.

A change_counter provides a number of attributes. changes is a mapping of nodes to the number of changes in that node:

>>> change_counter.changes.get(('README.md',), None)
13


The messages attribute holds all commit messages related to that node:

>>> change_counter.messages.get(('README.md',), None)
['Fix: corrected rule for rendered notebooks (#24)\nNew: strip out any <iframe> tags\nNew: when rendering .md files, replace videos by proper image',
'Doc update',
'Doc update',
'New: show badges at top of GitHub project page',
'New: prefer Unicode arrows over LaTeX ones',
'Update',
'Doc update',
'Doc update',
'Doc update',
'Doc update']


The sizes attribute holds the (last) size of the respective element:

>>> change_counter.sizes.get(('README.md',), None)
13025


FineChangeCounter acts like ChangeCounter, but also retrieves statistics for elements within the respective files; it has been tested for C, Python, and Jupyter Notebooks and should provide sufficient results for programming languages with similar syntax.

The map() method of ChangeCounter and FineChangeCounter produces an interactive tree map that allows to explore the elements of a repository. The redder (darker) a rectangle, the more changes it has seen; the larger a rectangle, the larger its size in bytes.

>>> fine_change_counter.map()


The included classes offer several methods that can be overridden in subclasses to customize what to mine and how to visualize it. See the chapter for details.

Here are all the classes defined in this chapter:

## Mining Change Histories¶

The history of any software project is a history of change. Any nontrivial project thus comes with a version database to organize and track changes; and possibly also with an issue database to organize and track issues.

Over time, these databases hold plenty of information about the project: Who changed what, when, and why? This information can be mined from existing databases and analyzed to answer questions such as

• Which parts in my project were most frequently or recently changed?
• How many files does the average change touch?
• Where in my project were the most bugs fixed?

To answer such questions, we can mine change and bug histories for past changes and fixes. This involves digging through version databases such as git and issue trackers such as RedMine or Bugzilla and extracting all their information. Fortunately for us, there is ready-made infrastructure for some of this.

## Mining with PyDriller¶

PyDriller is a Python package for mining change histories. Its Repository class takes a git version repository and allows to access all the individual changes ("modifications"), together with committers, affected files, commit messages, and more.

In [4]:
from pydriller import Repository  # https://pydriller.readthedocs.io/

In [5]:
from pydriller.domain.commit import Commit

In [6]:
from pydriller.domain.commit import ModifiedFile


To use Repository, we need to pass it

• the URL of a git repository; or
• the directory name where a cloned git repository can be found.

In general, cloning a git repository locally (with git clone URL) and then analyzing it locally will be faster and require less network resources.

Let us apply Repository on the repository of this book. The function current_repo() returns the directory in which a .git subdirectory is stored – that is, the root of a cloned git repository.

In [7]:
import os

In [8]:
# ignore
from typing import Callable, Optional, Type, Tuple, Any
from typing import Dict, Union, Set, List

In [9]:
def current_repo() -> Optional[str]:
path = os.getcwd()
while True:
if os.path.exists(os.path.join(path, '.git')):
return os.path.normpath(path)

# Go one level up
new_path = os.path.normpath(os.path.join(path, '..'))
if new_path != path:
path = new_path
else:
return None

return None

In [10]:
current_repo()

Out[10]:
'/Users/zeller/Projects/debuggingbook'

This gives us a repository miner for the book:

In [11]:
from datetime import datetime

In [12]:
book_miner = Repository(current_repo(), to=datetime(2020, 10, 1))


The to argument limits the range of time we want to look at.

You can also specify a URL instead, but this will access the repository via the network and generally be much slower.

In [13]:
DEBUGGINGBOOK_REMOTE_REPO = 'https://github.com/uds-se/debuggingbook.git'
# book_miner = Repository(DEBUGGINGBOOK_REMOTE_REPO)

In [14]:
# ignore
if 'CI' in os.environ:
# The CI git clone is shallow, so access full repo remotely
book_miner = Repository(DEBUGGINGBOOK_REMOTE_REPO,
to=datetime(2020, 10, 1))


traverse_commits() is a generator that returns one commit after another. Let us fetch the very first commit made to the book:

In [15]:
book_commits = book_miner.traverse_commits()
book_first_commit = next(book_commits)


Each commit has a number of attributes telling us more about the commit.

In [16]:
[attr for attr in dir(book_first_commit) if not attr.startswith('_')]

Out[16]:
['author',
'author_date',
'author_timezone',
'branches',
'committer',
'committer_date',
'committer_timezone',
'deletions',
'dmm_unit_complexity',
'dmm_unit_interfacing',
'dmm_unit_size',
'files',
'hash',
'in_main_branch',
'insertions',
'lines',
'merge',
'modified_files',
'msg',
'parents',
'project_name',
'project_path']

For instance, the msg attribute lets us know about the commit message:

In [17]:
book_first_commit.msg

Out[17]:
'first commit'

whereas the author attribute gets us the name and email of the person who made the commit:

In [18]:
[attr for attr in dir(book_first_commit.author) if not attr.startswith('_')]

Out[18]:
['email', 'name']
In [19]:
book_first_commit.author.name, book_first_commit.author.email

Out[19]:
('Andreas Zeller', '[email protected]')

A commit consists of multiple modifications to possibly multiple files. The commit modified_files attribute returns a list of modifications.

In [20]:
book_first_commit.modified_files

Out[20]:
[<pydriller.domain.commit.ModifiedFile at 0x7faf177c1240>]

For each modification, we can retrieve the files involved as well as several statistics:

In [21]:
[attr for attr in dir(book_first_commit.modified_files[0]) if not attr.startswith('_')]

Out[21]:
['added_lines',
'change_type',
'changed_methods',
'complexity',
'deleted_lines',
'diff',
'diff_parsed',
'filename',
'language_supported',
'methods',
'methods_before',
'new_path',
'nloc',
'old_path',
'source_code',
'source_code_before',
'token_count']

Let us see which file was created with this modification:

In [22]:
book_first_commit.modified_files[0].new_path

Out[22]:
'README.md'

The source_code attribute holds the entire file contents after the modification.

In [23]:
print(book_first_commit.modified_files[0].source_code)

# debuggingbook



We see that the debuggingbook project started with a very simple commit, namely the addition of an (almost empty) README.md file.

The attribute source_code_before holds the previous source code. We see that it is None – the file was just created.

In [24]:
print(book_first_commit.modified_files[0].source_code_before)

None


Let us have a look at the second commit. We see that it is much more substantial already.

In [25]:
book_second_commit = next(book_commits)

In [26]:
[m.new_path for m in book_second_commit.modified_files]

Out[26]:
['Chapters.makefile',
'Makefile',
'debuggingbook.bib',
'ipypublish',
'ipypublish_plugins',
'notebooks/.ipynb_checkpoints/index-checkpoint.ipynb',
'notebooks/index.ipynb',
'utils']

We fetch the modification for the README.md file:

In [27]:
readme_modification = [m for m in book_second_commit.modified_files if m.new_path == 'README.md'][0]


The source_code_before attribute holds the previous version (which we already have seen):

In [28]:
print(readme_modification.source_code_before)

# debuggingbook



The source_code attribute holds the new version – now a complete "README" file. (Compare this first version to the current README text.)

In [29]:
print(readme_modification.source_code[:400])

# About this Book

__Welcome to "The Debugging Book"!__

Software has bugs, and finding bugs can involve lots of effort.  This book addresses this problem by _automating_ software debugging, specifically by _locating errors and their causes automatically_.  Recent years have seen the development of novel techniques that lead to dramatic improvements in test generation and software testing.  They


The diff attribute holds the differences between the old and the new version.

In [30]:
print(readme_modification.diff[:100])

@@ -1 +1,157 @@
-# debuggingbook
+
+
+__Welcome to "The Debugging Book"!__
+
+So


The diff_parsed attribute even lists added and deleted lines:

In [31]:
readme_modification.diff_parsed['added'][:10]

Out[31]:
[(1, ''),
(3, ''),
(4, '__Welcome to "The Debugging Book"!__'),
(5, ''),
(6,
'Software has bugs, and finding bugs can involve lots of effort.  This book addresses this problem by _automating_ software debugging, specifically by _locating errors and their causes automatically_.  Recent years have seen the development of novel techniques that lead to dramatic improvements in test generation and software testing.  They now are mature enough to be assembled in a book – even with executable code.'),
(7, ''),
(8, ''),
(9, ''),
(10, '## A Textbook for Paper, Screen, and Keyboard')]

With all this information, we can track all commits and modifications and establish statistics over which files were changed (and possibly even fixed) most. This is what we will do in the next section.

In [32]:
# ignore
del book_miner  # Save a bit of memory


## Counting Changes¶

We start with a simple ChangeCounter class that, given a repository, counts for each file how frequently it was changed.

We represent file names as nodes – a tuple $(f_1, f_2, ..., f_n)$ that denotes a hierarchy: Each $f_i$ is a directory holding $f_{i+1}$, with $f_n$ being the actual file. Here is what this notebook looks as a node:

In [33]:
tuple('debuggingbook/notebooks/ChangeCounter.ipynb'.split('/'))

Out[33]:
('debuggingbook', 'notebooks', 'ChangeCounter.ipynb')
In [34]:
Node = Tuple


The constructor takes the repository to be analyzed and sets the internal counters.

In [35]:
from collections import defaultdict

In [36]:
import warnings

In [37]:
from git.exc import GitCommandError  # type: ignore

In [38]:
class ChangeCounter:
"""Count the number of changes for a repository."""

def __init__(self, repo: str, *,
filter: Optional[Callable[[Commit], bool]] = None,
log: bool = False,
**kwargs: Any) -> None:
"""
Constructor.
repo is a git repository (as URL or directory).
filter is a predicate that takes a modification and returns True
if it should be considered (default: consider all).
log turns on logging if set.
kwargs are passed to the Repository() constructor.
"""
self.repo = repo
self.log = log

if filter is None:
def filter(m: ModifiedFile) -> bool:
return True
assert filter is not None

self.filter = filter

# A node is an tuple (f_1, f_2, f_3, ..., f_n) denoting
# a folder f_1 holding a folder f_2 ... holding a file f_n.

# Mapping node -> #of changes
self.changes: Dict[Node, int] = defaultdict(int)

# Mapping node -> list of commit messages
self.messages: Dict[Node, List[str]] = defaultdict(list)

# Mapping node -> last size seen
self.sizes: Dict[Node, Union[int, float]] = {}

self.mine(**kwargs)


The method mine() does all the heavy lifting of mining. It retrieves all commits and all modifications from the repository, passing the modifications through the update_stats() method.

In [39]:
class ChangeCounter(ChangeCounter):
def mine(self, **kwargs: Any) -> None:
"""Gather data from repository. To be extended in subclasses."""
miner = Repository(self.repo, **kwargs)

for commit in miner.traverse_commits():
try:
self.mine_commit(commit)
except GitCommandError as err:
# Warn about failing git commands, but continue
warnings.warn(str(err))

def mine_commit(self, commit: Commit) -> None:
for m in commit.modified_files:
m.committer = commit.committer
m.committer_date = commit.committer_date
m.msg = commit.msg

if self.include(m):
self.update_stats(m)


The include() method allows to filter modifications. For simplicity, we copy the most relevant attributes of the commit over to the modification, such that the filter can access them, too.

In [40]:
class ChangeCounter(ChangeCounter):
def include(self, m: ModifiedFile) -> bool:
"""
Return True if the modification m should be included
(default: the filter predicate given to the constructor).
"""
return self.filter(m)


For each such node, update_stats() then invokes update_size(), update_changes(), and update_elems().

In [41]:
class ChangeCounter(ChangeCounter):
def update_stats(self, m: ModifiedFile) -> None:
"""
Update counters with modification m.
Can be extended in subclasses.
"""
if not m.new_path:
return

node = tuple(m.new_path.split('/'))

self.update_size(node, len(m.source_code) if m.source_code else 0)
self.update_changes(node, m.msg)

self.update_elems(node, m)


update_size() simply saves the last size of the item being modified. Since we progress from first to last commit, this reflects the size of the newest version.

In [42]:
class ChangeCounter(ChangeCounter):
def update_size(self, node: Tuple, size: int) -> None:
"""
Update counters for node with size.
Can be extended in subclasses.
"""
self.sizes[node] = size


update_changes() increases the counter changes for the given node node, and adds the current commit message commit_msg to its list. This makes

• size a mapping of nodes to their size
• changes a mapping of nodes to the number of changes they have seen
• commit_msg a mapping of nodes to the list of commit messages that have affected them.
In [43]:
class ChangeCounter(ChangeCounter):
def update_changes(self, node: Tuple, commit_msg: str) -> None:
"""
Update stats for node changed with commit_msg.
Can be extended in subclasses.
"""
self.changes[node] += 1

self.messages[node].append(commit_msg)


The update_elems() method is reserved for later use, when we go and count fine-grained changes.

In [44]:
class ChangeCounter(ChangeCounter):
def update_elems(self, node: Tuple, m: ModifiedFile) -> None:
"""
Update counters for subelements of node with modification m.
To be defined in subclasses.
"""
pass


Let us put ChangeCounter to action – on the current (debuggingbook) repository.

In [45]:
DEBUGGINGBOOK_REPO = current_repo()

In [46]:
DEBUGGINGBOOK_REPO

Out[46]:
'/Users/zeller/Projects/debuggingbook'

The function debuggingbook_change_counter instantiates a ChangeCounter class (or any subclass) with the debuggingbook repository, mining all the counters as listed above. Since mining all history takes quite some time, its parameter start_date allows to set a starting date (default: March 1, 2021); changes before that date will be ignored.

In [47]:
DEBUGGINGBOOK_START_DATE: datetime = datetime(2021, 3, 1)

In [48]:
NUM_WORKERS = 4  # Number of threads to be run in parallel

In [49]:
def debuggingbook_change_counter(
cls: Type,
start_date: datetime = DEBUGGINGBOOK_START_DATE) -> Any:
"""
Instantiate a ChangeCounter (sub)class cls with the debuggingbook repo.
Only mines changes after start_date (default: DEBUGGINGBOOK_START_DATE)
"""

def filter(m: ModifiedFile) -> bool:
"""
Do not include
* the docs/ directory; it only holds generated Web pages
* the notebooks/shared/ package; this is infrastructure
* the synopsis pictures; these are all generated
"""
return (m.new_path and
not m.new_path.startswith('docs/') and
not m.new_path.startswith('notebooks/shared/') and
'-synopsis-' not in m.new_path)

return cls(DEBUGGINGBOOK_REPO,
filter=filter,
since=start_date,
num_workers=NUM_WORKERS)


Let us set change_counter to this ChangeCounter instance. This can take a few minutes.

In [50]:
from Timer import Timer

In [51]:
with Timer() as t:
change_counter = debuggingbook_change_counter(ChangeCounter)

t.elapsed_time()

Out[51]:
145.40182915900368

The attribute changes of our ChangeCounter now is a mapping of nodes to the respective number of changes. Here are the first 10 entries:

In [52]:
list(change_counter.changes.keys())[:10]

Out[52]:
[('notebooks', '__init__.py'),
('.github', 'workflows', 'check.yml'),
('requirements.txt',),
('.github', 'workflows', 'check-code.yml'),
('.github', 'workflows', 'check-imports.yml'),
('.github', 'workflows', 'check-types.yml'),
('beta', 'code', 'debuggingbook'),
('code', 'debuggingbook'),
('notebooks', 'Assertions.ipynb'),
('notebooks', 'ChangeCounter.ipynb')]

This is the number of changes to the Chapters.makefile file which lists the book chapters:

In [53]:
change_counter.changes.get(('Chapters.makefile',), None)

Out[53]:
20

The messages attribute holds all the messages:

In [54]:
change_counter.messages.get(('Chapters.makefile',), None)

Out[54]:
['New: can now run notebooks and check HTML as part of CI',
"Fix: Mark 'Assertions' as new, too",
'New: publish Slicer',
'New: (Incomplete) chapters on performance and concurrency debugging',
'New: moved StackInspector in its own module',
"New: mark (pretty much) all chapters as 'ready'",
'Doc fix',
'New: release first chapters',
"New: have a 'shared' directory for material shared between fuzzingbook and debuggingbook; avoid cross-project links",
"New: 'make shared' syncs the 'shared' folder",
'Fix: DEPENDENCIES_PART was missing in PUBLIC_CHAPTERS',
'New release: ChangeDebugger',
'New: publish StatisticalDebugger',
'New: public chapter on dynamic invariants',
'New: publish DDSetDebugger',
'New: PerformanceDebugger goes live',
'New: Publish repair chapter']
In [55]:
for node in change_counter.changes:
assert len(change_counter.messages[node]) == change_counter.changes[node]


The sizes attribute holds the final size:

In [56]:
change_counter.sizes.get(('Chapters.makefile',), None)

Out[56]:
3591

## Visualizing Past Changes¶

To explore the number of changes across all project files, we visualize them as a tree map. A tree map visualizes hierarchical data using nested rectangles. In our visualization, each directory is shown as a rectangle containing smaller rectangles. The size of a rectangle is relative to its size (in bytes); and the color of a rectangle is relative to the number of changes it has seen.

We use the easyplotly package to easily create a treemap.

In [57]:
import easyplotly as ep
import plotly.graph_objects as go

In [58]:
import math


The method map_node_sizes() returns a size for the node – any number will do. By default, we use a logarithmic scale, such that smaller files are not totally visually eclipsed by larger files.

In [59]:
class ChangeCounter(ChangeCounter):
def map_node_sizes(self,scale: str = 'log') -> \
Dict[Node, Union[int, float]]:
"""
Return a mapping of nodes to sizes.
"""

if scale == 'log':
# Default: use log scale
return {node: math.log(size+1)
for node, size in self.sizes.items()}

elif scale == 'sqrt':
# Alternative: use sqrt size
return {node: math.sqrt(size)
for node, size in self.sizes.items()}

elif scale == 'abs':
# Alternative: use absolute size
return self.sizes

else:
raise ValueError(f"Unknown scale: {scale}; "
f"use one of [log, sqrt, abs]")


The method map_node_color() returns a color for the node – again, as a number. The smallest and largest numbers returned indicate beginning and end in the given color scale, respectively.

In [60]:
class ChangeCounter(ChangeCounter):
def map_node_color(self, node: Node) -> Optional[int]:
"""
Return a color of the node, as a number.
"""
return self.changes.get(node)


The method map_node_text() shows a text to be displayed in the rectangle; we set this to the number of changes.

In [61]:
class ChangeCounter(ChangeCounter):
def map_node_text(self, node: Node) -> Optional[str]:
"""
Return the text to be shown for the node (default: #changes).
"""
change = self.changes.get(node)
return str(change) if change is not None else None


The methods map_hoverinfo() and map_colorscale() set additional map parameters. For details, see the easyplotly documentation.

In [62]:
class ChangeCounter(ChangeCounter):
def map_hoverinfo(self) -> str:
"""
Return the text to be shown when hovering over a node.
"""
return 'label+text'

def map_colorscale(self) -> str:
"""
Return the colorscale for the map. To be overloaded in subclasses.
"""
return 'YlOrRd'


With all this, the map() function creates a tree map of the repository, using the easyplotly Treemap constructor.

In [63]:
class ChangeCounter(ChangeCounter):
def map(self) -> go.Figure:
"""Produce an interactive tree map of the repository."""
treemap = ep.Treemap(
self.map_node_sizes(),
text=self.map_node_text,
hoverinfo=self.map_hoverinfo(),
marker_colors=self.map_node_color,
marker_colorscale=self.map_colorscale(),
root_label=self.repo,
branchvalues='total'
)

fig = go.Figure(treemap)
fig.update_layout(margin=dict(l=0, r=0, t=30, b=0))

return fig


This is what the tree map for debuggingbook looks like.

• Click on any rectangle to enlarge it.
• Click outside of the rectangle to return to a wider view.
• Hover over a rectangle to get further information.
In [64]:
change_counter = debuggingbook_change_counter(ChangeCounter)

In [65]:
change_counter.map()