Note: this tutorial was particularly modeled, and therefore owes a lot, to the excellent materials offered in:
In particular I've reused the excellent images from the Pro Git book that John had already selected and downloaded, as well as some of his outline. But this version of the tutorial aims to be 100% reproducible by being executed directly as an IPython notebook and is hosted itself on github so that others can more easily make improvements to it by collaborating on Github. Many thanks to John and Emanuele for making their materials available online.
After writing this document, I discovered J.R. Johansson's tutorial on version control that is also written as a fully reproducible notebook and is also aimed at a scientific audience. It has a similar spirit to this one, and is part of his excellent series Lectures on Scientific Computing with Python that is entirely available as Jupyter Notebooks.
“Revision control, also known as version control, source control or software configuration management (SCM), is the management of changes to documents, programs, and other information stored as computer files.”
Reproducibility?
What do (good) version control tools give you?
paper_v5_john_jane_final_oct22_really_final.tex
by email again!)The commit: a snapshot of work at a point in time
Credit: ProGit book, by Scott Chacon, CC License.
A repository: a group of linked commits
Note: these form a Directed Acyclic Graph (DAG), with nodes identified by their hash.
A hash: a fingerprint of the content of each commit and its parent
from hashlib import sha1
# Our first commit
data1 = b'This is the start of my paper.'
meta1 = b'date: 1/1/17'
hash1 = sha1(data1 + meta1).hexdigest( )
print('Hash:', hash1)
# Our second commit, linked to the first
data2 = b'Some more text in my paper...'
meta2 = b'date: 1/2/1'
# Note we add the parent hash here!
hash2 = sha1(data2 + meta2 + hash1.encode()).hexdigest()
print('Hash:', hash2)
And this is pretty much the essence of Git!
The minimal amount of configuration for git to work without pestering you is to tell it who you are. You should run a version of these commands in your shell:
git config --global user.name "Your Name"
git config --global user.email "your.email@yourplace.org"
And while we're at it, we also turn on the use of color, which is very useful
git config --global color.ui "auto"
Set git to use the credential memory cache so we don't have to retype passwords too frequently.
Github offers in its help pages instructions on how to configure the credentials helper for Mac OSX, Windows and Linux.
The rests of this tutorial is structured in the following way: after the above brief overview of key concepts you need to understand in order for git to really make sense, we will then dive into hands-on work. We will discuss 5 "stages of git" with scenarios of increasing sophistication and complexity, introducing the necessary commands for each stage:
In reality, this tutorial only covers stages 1-4, since for #5 there are many software develoment-oriented tutorials and documents of very high quality online. But most scientists start working alone with a few files or with a small team, so I feel it's important to build first the key concepts and practices based on problems scientists encounter in their everyday life and without the jargon of the software world. Once you've become familiar with 1-4, the excellent tutorials that exist about collaborating on github on open-source projects should make sense.