!git clone https://github.com/gotec/git2net-tutorials
import os
os.chdir('git2net-tutorials')
!pip install -r requirements.txt
os.chdir('..')
!git clone https://github.com/gotec/git2net git2net4analysis
import git2net
import os
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In this final tutorial, we take a look at how you can use `git2net` to compute the complexity of files in a repository and use it as a proxy for the productivity of the developers contributing to it.
As before, we start by mining a new database, for which we also disambiguate author aliases.
# We assume a clone of git2net's repository exists in the folder below following the first tutorial.
git_repo_dir = 'git2net4analysis'
# Here, we specify the database in which we will store the results of the mining process.
sqlite_db_file = 'git2net4analysis.db'
# Remove the database if it already exists.
if os.path.exists(sqlite_db_file):
    os.remove(sqlite_db_file)
git2net.mine_git_repo(git_repo_dir, sqlite_db_file)
git2net.disambiguate_aliases_db(sqlite_db_file)
By calling the function `git2net.compute_complexity()`, we can now create a new table `complexity` in our database, which contains the complexity of all modified files for all commits.
Let's create this table and have a look at what it contains.
git2net.compute_complexity(git_repo_dir, sqlite_db_file)
with sqlite3.connect(sqlite_db_file) as con:
    complexity = pd.read_sql("SELECT * FROM complexity", con)
complexity.head()
As we can see, the table contains information identifying each commit/file pair (i.e., `commit_hash`, `old_path`, `new_path`).
It further contains the number of edit events (`events`), i.e., additions, deletions, and replacements, that the commit made in a given file, and the total Levenshtein edit distance (`levenshtein_distance`) of these edits.
Finally, the table contains the Halstead effort (`HE`), the cyclomatic complexity (`CCN`), the number of lines of code (`NLOC`), the number of tokens (`TOK`), and the number of functions (`FUN`) in all modified files before (`*_pre`) and after (`*_post`) each commit.
As we show in this publication, we can use the absolute value of the change in complexity (`*_delta`) as a proxy for the productivity of developers in Open Source software projects.
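As a quick note on how we read these columns (our interpretation; see git2net's documentation for the exact definitions): each `*_delta` column is the post-commit value minus the pre-commit value, and the productivity proxy is its absolute value. A toy illustration with made-up numbers:

```python
# Made-up pre- and post-commit Halstead effort values for one commit/file pair.
HE_pre = 120.0
HE_post = 150.0

# The delta relates the measurements before and after the commit ...
HE_delta = HE_post - HE_pre

# ... and the proxy takes its absolute value, so that both increases and
# decreases in complexity count as work done by the developer.
HE_absdelta = abs(HE_delta)

print(HE_delta, HE_absdelta)
```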
Let's look at the changes in complexity for the `git2net` project.
To do this, we first create a dataframe listing the absolute changes in complexity for all contributors to `git2net` over time.
with sqlite3.connect(sqlite_db_file) as con:
    complexity = pd.read_sql("""SELECT commit_hash,
                                       events,
                                       levenshtein_distance,
                                       HE_delta,
                                       CCN_delta,
                                       NLOC_delta,
                                       TOK_delta,
                                       FUN_delta
                                FROM complexity""", con)
# We compute the absolute differences.
complexity['HE_absdelta'] = np.abs(complexity.HE_delta)
complexity['CCN_absdelta'] = np.abs(complexity.CCN_delta)
complexity['NLOC_absdelta'] = np.abs(complexity.NLOC_delta)
complexity['TOK_absdelta'] = np.abs(complexity.TOK_delta)
complexity['FUN_absdelta'] = np.abs(complexity.FUN_delta)
# We aggregate the per-file rows to one row per commit; the signed delta
# columns are dropped here, as the aggregation keeps only the listed columns.
complexity = complexity.groupby('commit_hash') \
                       .agg({'events': 'sum',
                             'levenshtein_distance': 'sum',
                             'HE_absdelta': 'sum',
                             'CCN_absdelta': 'sum',
                             'NLOC_absdelta': 'sum',
                             'TOK_absdelta': 'sum',
                             'FUN_absdelta': 'sum'}).reset_index()
# We add a counter for the commits.
complexity['commits'] = 1
with sqlite3.connect(sqlite_db_file) as con:
    commits = pd.read_sql("SELECT hash, author_date, author_id FROM commits", con)
complexity = pd.merge(complexity, commits, left_on='commit_hash', right_on='hash', how='left')
complexity = complexity.set_index(pd.DatetimeIndex(complexity['author_date']))
complexity = complexity.sort_index()
complexity.drop(columns=['hash', 'commit_hash', 'author_date'], inplace=True)
complexity.head()
We can then plot these changes, e.g., as cumulative sums as shown below.
with sqlite3.connect(sqlite_db_file) as con:
    commits = pd.read_sql("SELECT author_name, author_id FROM commits", con)
commits.drop_duplicates(inplace=True)

author_id_name_map = {}
for idx, group in commits.groupby('author_id'):
    author_id_name_map[idx] = list(group.author_name)
fig, axs = plt.subplots(4,2, figsize=(14,10))
axs[0,0].set_title('Commits')
axs[0,1].set_title('Events')
axs[1,0].set_title('Levenshtein Distance')
axs[1,1].set_title('Halstead Effort')
axs[2,0].set_title('Cyclomatic Complexity')
axs[2,1].set_title('Lines of Code')
axs[3,0].set_title('Tokens')
axs[3,1].set_title('Functions')
for author_id, group in complexity.groupby('author_id'):
    group_cs = group.cumsum()
    axs[0,0].plot(group_cs.index, group_cs.commits, label=author_id_name_map[author_id][0])
    axs[0,1].plot(group_cs.index, group_cs.events)
    axs[1,0].plot(group_cs.index, group_cs.levenshtein_distance)
    axs[1,1].plot(group_cs.index, group_cs.HE_absdelta)
    axs[2,0].plot(group_cs.index, group_cs.CCN_absdelta)
    axs[2,1].plot(group_cs.index, group_cs.NLOC_absdelta)
    axs[3,0].plot(group_cs.index, group_cs.TOK_absdelta)
    axs[3,1].plot(group_cs.index, group_cs.FUN_absdelta)
plt.tight_layout()
axs[0,0].legend(loc='upper center', bbox_to_anchor=(1, 1.4), ncol=8)
plt.show()
Can you create a similar plot with a rolling window instead of the cumulative sum?
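If you want a hint, here is a minimal sketch of the key step on synthetic data (the dates, the column name, and the 30-day window below are made up for illustration): in the plotting loop above, you would replace `group.cumsum()` with something like `group.rolling('30D').sum()`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the per-commit dataframe built above: one row per
# day with a made-up 'NLOC_absdelta' column on a DatetimeIndex.
rng = np.random.default_rng(42)
dates = pd.date_range('2021-01-01', periods=120, freq='D')
activity = pd.DataFrame(
    {'NLOC_absdelta': rng.integers(0, 50, size=len(dates)).astype(float)},
    index=dates)

# A time-based rolling sum replaces the cumulative sum: each point now
# reflects only the activity of the preceding 30 days, not the whole history.
rolling = activity['NLOC_absdelta'].rolling('30D').sum()

print(rolling.tail())
```

Unlike the cumulative sum, this view lets periods of inactivity show up as a drop back towards zero, which makes bursts of development activity easier to spot.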
With this, we conclude both this tutorial and the series of tutorials for `git2net`.
We hope you found them helpful.
Enjoy using `git2net`, and best of luck with your research!

If you have any feedback or find bugs in the code, please let us know at `gotec/git2net`.
`git2net` is developed as an Open Source project, which means your ideas and inputs are highly welcome.
Feel free to share the project and contribute yourself.
You can get started right away with the repository you just downloaded!