!git clone https://github.com/gotec/git2net-tutorials
import os
os.chdir('git2net-tutorials')
!pip install -r requirements.txt
os.chdir('..')
!git clone https://github.com/gotec/git2net git2net4analysis
import sqlite3
import pandas as pd
import os
import git2net
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta
from collections import defaultdict
In this tutorial, we will take a closer look at the database mined by git2net. Before doing so, we again mine a new database from scratch to start with a clean base.
# We assume a clone of git2net's repository exists in the folder below following the first tutorial.
git_repo_dir = 'git2net4analysis'
# Here, we specify the database in which we will store the results of the mining process.
sqlite_db_file = 'git2net4analysis.db'
# Remove the database if it already exists
if os.path.exists(sqlite_db_file):
    os.remove(sqlite_db_file)
git2net.mine_git_repo(git_repo_dir, sqlite_db_file)
git2net.disambiguate_aliases_db(sqlite_db_file)
con = sqlite3.connect(sqlite_db_file)
curr = con.cursor()
curr.execute("SELECT name FROM sqlite_master WHERE type='table';")
curr.fetchall()
Similarly, we can use SQL to query for the names of the contributors that we saw in the third tutorial.
curr.execute("SELECT DISTINCT author_name FROM commits")
curr.fetchall()
Next, let's look at an alternative approach using the Python package pandas. With this approach, we first load the database into a pandas.DataFrame(). The evaluation is then much more comfortable, as the pandas API includes a vast range of tools for statistical analysis and visualisation.
with sqlite3.connect(sqlite_db_file) as con:
    commits = pd.read_sql_query("SELECT * FROM commits", con)
commits.head(5)
With head(5), we list all extracted features contained in commits for the first five commits.
How many records/commits does the history contain?
commits.shape
The first number is the number of records; the second is the number of columns (i.e., features).
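As a quick illustration of what shape returns, here is a sketch on a small made-up frame (the data is hypothetical, not from the database):

```python
import pandas as pd

# A tiny, made-up frame standing in for the commits table
toy = pd.DataFrame({'hash': ['a1', 'b2', 'c3'],
                    'author_name': ['Alice', 'Bob', 'Alice']})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_records, n_features = toy.shape
print(n_records, n_features)  # 3 2
```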
With the following operation, we can then obtain the same list of contributors that we already saw from the SQL statement above.
commits.author_name.unique()
How many commits did the individual authors contribute to the overall project?
commits.groupby('author_name')['hash'].count()
That's easy!
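If you prefer to see the most active contributors first, the same groupby result can be sorted. A sketch with made-up author names (the column names match those used above):

```python
import pandas as pd

# Hypothetical commits frame with the same columns as above
commits_toy = pd.DataFrame({'hash': list('abcde'),
                            'author_name': ['Alice', 'Bob', 'Alice', 'Alice', 'Bob']})

# Count commits per author and sort in descending order
counts = (commits_toy.groupby('author_name')['hash']
          .count()
          .sort_values(ascending=False))
print(counts)
# Alice    3
# Bob      2
```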
Let's look at a final example that applies the features of pandas. At which times are git2net commits usually submitted?
commits['timestamp'] = pd.to_datetime(commits['author_date'], format="%Y-%m-%d %H:%M:%S")
commits['hours'] = commits.timestamp.dt.hour
commits['days'] = commits.timestamp.dt.dayofweek
fig, ax = plt.subplots(figsize=(18,6))
sns.heatmap(commits.groupby(['days', 'hours'])['timestamp'].count().unstack(), annot=True, ax=ax)
plt.show()
Of course, GitHub and GitLab offer similar visualisations, but with git2net and pandas you can filter your data set yourself.
The diagram shows the week from Monday (0) to Sunday (6).
Someone seems to work at the weekend too :)
Find out who it is!
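One way to find the weekend committer is to filter on the weekday column created above and count commits per author. A minimal sketch with made-up data (the real analysis would use the commits frame from the database):

```python
import pandas as pd

# Hypothetical commits with precomputed weekday numbers (Monday=0 ... Sunday=6)
commits_toy = pd.DataFrame({
    'author_name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'days': [1, 5, 2, 6],  # in this made-up data, Bob commits on Sat and Sun
})

# Keep only weekend commits and count them per author
weekend = commits_toy[commits_toy['days'] >= 5]
weekend_counts = weekend['author_name'].value_counts()
print(weekend_counts.index[0])  # Bob
```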
Finally, let's put everything together and analyse how the editing of own and foreign code evolves in the repository of git2net, similarly to the analyses we performed here.
To do so, we first get the required data from the SQLite database.
Specifically, we need to obtain all instances where code is edited.
These are recorded as edits of type replacement in the database. We require the original (now deleted) line's author and the current commit's author for all these edits. This data is not recorded in the edits table; instead, you can find it in the commits table. Hence, we query for the corresponding commits' hashes and obtain information on the author and the commit time for all commits from the commits table.
with sqlite3.connect(sqlite_db_file) as con:
    edits = pd.read_sql("""SELECT commit_hash,
                                  original_commit_deletion,
                                  levenshtein_dist
                           FROM edits
                           WHERE edit_type='replacement'""", con).drop_duplicates()
    commits = pd.read_sql("""SELECT hash,
                                    author_id,
                                    author_date
                             FROM commits""", con)
print('Edits')
display(edits.head())
print('--------------------------------------')
print('Commits')
display(commits.head())
Now we join the two resulting data frames into a single one containing information on who edited how much of whose code, and when.
edit_info = pd.merge(edits, commits, how='left', left_on='commit_hash', right_on='hash') \
.drop(columns=['commit_hash', 'hash'])
edit_info = pd.merge(edit_info, commits, how='left', left_on='original_commit_deletion',
right_on='hash', suffixes=('', '_before')) \
.drop(columns=['original_commit_deletion', 'hash', 'author_date_before'])
edit_info = edit_info[['author_id_before', 'author_id', 'author_date', 'levenshtein_dist']]
edit_info.index = pd.DatetimeIndex(edit_info.author_date)
edit_info = edit_info.drop(columns=['author_date'])
edit_info
Now, we can compute the Levenshtein distances for changes in both own and foreign code for a rolling time window. In a final step, we normalise them to allow a better comparison and plot them over time.
windowsize = timedelta(days=365)
increment = timedelta(days=30)
plot_data = defaultdict(list)
time = min(edit_info.index) + windowsize
while time < max(edit_info.index):
    mask = (edit_info.index > time - windowsize) & (edit_info.index <= time)
    wdata = edit_info.loc[mask]
    self_changes_dist = 0
    foreign_changes_dist = 0
    for idx, row in wdata.iterrows():
        self_changes_dist += row['levenshtein_dist'] * (row['author_id_before'] == row['author_id'])
        foreign_changes_dist += row['levenshtein_dist'] * (row['author_id_before'] != row['author_id'])
    plot_data['time'].append(time)
    plot_data['self_changes_dist'].append(self_changes_dist)
    plot_data['foreign_changes_dist'].append(foreign_changes_dist)
    time += increment
plot_data['self_changes_dist_norm'] = [s / (s + f) for s, f in zip(plot_data['self_changes_dist'],
plot_data['foreign_changes_dist'])]
plot_data['foreign_changes_dist_norm'] = [f / (s + f) for s, f in zip(plot_data['self_changes_dist'],
plot_data['foreign_changes_dist'])]
plot_data = pd.DataFrame(plot_data)
plt.figure(figsize=(14,5))
ax = plt.subplot(1,2,1)
plot_data.plot(x='time', y=['self_changes_dist', 'foreign_changes_dist'], ax=ax, ylabel="Levenshtein distance")
ax = plt.subplot(1,2,2)
plot_data.plot(x='time', y=['self_changes_dist_norm', 'foreign_changes_dist_norm'], kind='area', ax=ax,
ylabel="relative Levenshtein distance")
plt.show()
As you can see, code ownership in the repository of git2net is very high. However, this is to be expected since, as we saw earlier, most of the code was written by a single person.
With minor modifications to the code above, you can also estimate how much code each contributor owns at every point in time.
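One way to start is to attribute each edit's Levenshtein distance to the author of the current commit and accumulate it per author over time. This is only a rough proxy for ownership, not the exact accounting git2net itself performs; the sketch below uses made-up author ids and dates:

```python
import pandas as pd

# Hypothetical edit records, shaped like the edit_info frame built above
edit_toy = pd.DataFrame({
    'author_id': [1, 2, 1, 1],
    'levenshtein_dist': [10, 5, 3, 2],
}, index=pd.to_datetime(['2021-01-01', '2021-02-01', '2021-03-01', '2021-04-01']))

# Per-author distance over time: pivot to one column per author, then
# take the running total as a crude proxy for how much code each one owns
ownership = (edit_toy
             .pivot_table(index=edit_toy.index, columns='author_id',
                          values='levenshtein_dist', aggfunc='sum')
             .fillna(0)
             .cumsum())

# Normalise each row to shares so the authors can be compared over time
share = ownership.div(ownership.sum(axis=1), axis=0)
print(share)  # in the last row, author 1 holds 0.75 and author 2 holds 0.25
```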
git2net also works with the git repositories behind Overleaf projects.
So go ahead and try it with your most recent paper :)