!git clone https://github.com/gotec/git2net-tutorials
import os
os.chdir('git2net-tutorials')
!pip install -r requirements.txt
os.chdir('..')
!git clone https://github.com/gotec/git2net git2net4analysis
import sqlite3
import pandas as pd
import os
import git2net
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta
from collections import defaultdict
In this tutorial, we will take a closer look at the database mined by git2net. Before doing so, we again mine a new database from scratch to start with a clean base.
# We assume a clone of git2net's repository exists in the folder below following the first tutorial.
git_repo_dir = 'git2net4analysis'
# Here, we specify the database in which we will store the results of the mining process.
sqlite_db_file = 'git2net4analysis.db'
# Remove the database if it already exists
if os.path.exists(sqlite_db_file):
    os.remove(sqlite_db_file)
git2net.mine_git_repo(git_repo_dir, sqlite_db_file)
git2net.disambiguate_aliases_db(sqlite_db_file)
con = sqlite3.connect(sqlite_db_file)
curr = con.cursor()
curr.execute("SELECT name FROM sqlite_master WHERE type='table';")
curr.fetchall()
Similarly, we can use SQL to query for the names of the contributors that we saw in the third tutorial.
curr.execute("SELECT DISTINCT author_name FROM commits")
curr.fetchall()
Next, let's look at an alternative approach using the Python package pandas. With this approach, we first load the database into a pandas.DataFrame(). The evaluation is then much more comfortable, as the pandas API includes a vast range of tools for statistical analysis and visualisation.
with sqlite3.connect(sqlite_db_file) as con:
    commits = pd.read_sql_query("SELECT * FROM commits", con)
commits.head(5)
With head(5), we list all extracted features contained in commits for the first five commits.
How many records/commits does the history contain?
commits.shape
The first number is the number of records; the second is the number of columns (i.e., features).
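As a quick illustration of what shape returns, here is a sketch on a small made-up frame (the data is hypothetical, not from the database):

```python
import pandas as pd

# A tiny, made-up frame standing in for the commits table
toy = pd.DataFrame({'hash': ['a1', 'b2', 'c3'],
                    'author_name': ['Alice', 'Bob', 'Alice']})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_records, n_features = toy.shape
print(n_records, n_features)  # 3 2
```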
With the following operation, we can then obtain the same list of contributors that we already saw from the SQL statement above.
commits.author_name.unique()
How many commits did the individual authors contribute to the overall project?
commits.groupby('author_name')['hash'].count()
That's easy!
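If you prefer to see the most active contributors first, the same groupby result can be sorted. A sketch with made-up author names (the column names match those used above):

```python
import pandas as pd

# Hypothetical commits frame with the same columns as above
commits_toy = pd.DataFrame({'hash': list('abcde'),
                            'author_name': ['Alice', 'Bob', 'Alice', 'Alice', 'Bob']})

# Count commits per author and sort in descending order
counts = (commits_toy.groupby('author_name')['hash']
          .count()
          .sort_values(ascending=False))
print(counts)
# Alice    3
# Bob      2
```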
Let's look at a final example that applies the features of pandas. At which times are git2net commits usually submitted?
commits['timestamp'] = pd.to_datetime(commits['author_date'], format="%Y-%m-%d %H:%M:%S")
commits['hours'] = commits.timestamp.dt.hour
commits['days'] = commits.timestamp.dt.dayofweek
fig, ax = plt.subplots(figsize=(18,6))
sns.heatmap(commits.groupby(['days', 'hours'])['timestamp'].count().unstack(), annot=True, ax=ax)
plt.show()
Of course, GitHub and GitLab offer similar visualisations, but with git2net and pandas you can filter your data set yourself.
The diagram shows the week from Monday (0) to Sunday (6).
Someone seems to work at the weekend too :)
Find out who it is!
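One way to find the weekend committer is to filter on the weekday column created above and count commits per author. A minimal sketch with made-up data (the real analysis would use the commits frame from the database):

```python
import pandas as pd

# Hypothetical commits with precomputed weekday numbers (Monday=0 ... Sunday=6)
commits_toy = pd.DataFrame({
    'author_name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'days': [1, 5, 2, 6],  # in this made-up data, Bob commits on Sat and Sun
})

# Keep only weekend commits and count them per author
weekend = commits_toy[commits_toy['days'] >= 5]
weekend_counts = weekend['author_name'].value_counts()
print(weekend_counts.index[0])  # Bob
```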
Finally, let's put everything together and analyse how the editing of own and foreign code evolves in the repository of git2net, similarly to the analyses we performed here.
To do so, we first get the required data from the SQLite database.
Specifically, we need to obtain all instances where code is edited.
These are recorded as edits of type replacement in the database. We require the original (now deleted) line's author and the current commit's author for all these edits. This data is not recorded in the edits table; instead, you can find it in the commits table. Hence, we query for the corresponding commits' hashes and obtain information on the author and the commit time for all commits from the commits table.
with sqlite3.connect(sqlite_db_file) as con:
    edits = pd.read_sql("""SELECT commit_hash,
                                  original_commit_deletion,
                                  levenshtein_dist
                           FROM edits
                           WHERE edit_type='replacement'""", con).drop_duplicates()
    commits = pd.read_sql("""SELECT hash,
                                    author_id,
                                    author_date
                             FROM commits""", con)
print('Edits')
display(edits.head())
print('--------------------------------------')
print('Commits')
display(commits.head())
Now we join the two resulting data frames into a single one containing information on who edited how much of whose code, and when.
edit_info = pd.merge(edits, commits, how='left', left_on='commit_hash', right_on='hash') \
.drop(columns=['commit_hash', 'hash'])
edit_info = pd.merge(edit_info, commits, how='left', left_on='original_commit_deletion',
right_on='hash', suffixes=('', '_before')) \
.drop(columns=['original_commit_deletion', 'hash', 'author_date_before'])
edit_info = edit_info[['author_id_before', 'author_id', 'author_date', 'levenshtein_dist']]
edit_info.index = pd.DatetimeIndex(edit_info.author_date)
edit_info = edit_info.drop(columns=['author_date'])
edit_info
Now, we can compute the Levenshtein distances for changes in both own and foreign code for a rolling time window. In a final step, we normalise them to allow a better comparison and plot them over time.
windowsize = timedelta(days=365)
increment = timedelta(days=30)
plot_data = defaultdict(list)
time = min(edit_info.index) + windowsize
while time < max(edit_info.index):
    mask = (edit_info.index > time - windowsize) & (edit_info.index <= time)
    wdata = edit_info.loc[mask]
    self_changes_dist = 0
    foreign_changes_dist = 0
    for idx, row in wdata.iterrows():
        self_changes_dist += row['levenshtein_dist'] * (row['author_id_before'] == row['author_id'])
        foreign_changes_dist += row['levenshtein_dist'] * (row['author_id_before'] != row['author_id'])
    plot_data['time'].append(time)
    plot_data['self_changes_dist'].append(self_changes_dist)
    plot_data['foreign_changes_dist'].append(foreign_changes_dist)
    time += increment
plot_data['self_changes_dist_norm'] = [s / (s + f) for s, f in zip(plot_data['self_changes_dist'],
plot_data['foreign_changes_dist'])]
plot_data['foreign_changes_dist_norm'] = [f / (s + f) for s, f in zip(plot_data['self_changes_dist'],
plot_data['foreign_changes_dist'])]
plot_data = pd.DataFrame(plot_data)
plt.figure(figsize=(14,5))
ax = plt.subplot(1,2,1)
plot_data.plot(x='time', y=['self_changes_dist', 'foreign_changes_dist'], ax=ax, ylabel="Levenshtein distance")
ax = plt.subplot(1,2,2)
plot_data.plot(x='time', y=['self_changes_dist_norm', 'foreign_changes_dist_norm'], kind='area', ax=ax,
ylabel="relative Levenshtein distance")
plt.show()
As you can see, code ownership in the repository of git2net is very high. However, this is to be expected since, as we saw earlier, most of the code was written by a single person.
With minor modifications to the code above, you can also estimate how much code each contributor owns at every point in time.
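One way to start is to attribute each edit's Levenshtein distance to the author of the current commit and accumulate it per author over time. This is only a rough proxy for ownership, not the exact accounting git2net itself performs; the sketch below uses made-up author ids and dates:

```python
import pandas as pd

# Hypothetical edit records, shaped like the edit_info frame built above
edit_toy = pd.DataFrame({
    'author_id': [1, 2, 1, 1],
    'levenshtein_dist': [10, 5, 3, 2],
}, index=pd.to_datetime(['2021-01-01', '2021-02-01', '2021-03-01', '2021-04-01']))

# Per-author distance over time: pivot to one column per author, then
# take the running total as a crude proxy for how much code each one owns
ownership = (edit_toy
             .pivot_table(index=edit_toy.index, columns='author_id',
                          values='levenshtein_dist', aggfunc='sum')
             .fillna(0)
             .cumsum())

# Normalise each row to shares so the authors can be compared over time
share = ownership.div(ownership.sum(axis=1), axis=0)
print(share)  # in the last row, author 1 holds 0.75 and author 2 holds 0.25
```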
git2net also works with the git repositories behind Overleaf projects.
So go ahead and try it with your most recent paper :)