After completing this week's lecture and tutorial work, you will be able to:
This worksheet covers parts of the Version Control chapter of the online textbook. You should read this chapter before attempting the worksheet.
### Run this cell before continuing.
source("tests.R")
source("cleanup.R")
Question 1.1 Multiple Choice:
{points: 1}
Which of the reasons listed below is not a good reason to use version control?
A. Version control tools provide transparency on how a project evolved by tracking the history of documents, and who made what changes to those documents.
B. Version control tools usually include a remote/cloud repository hosting service that can act as a backup of your local files (i.e., the files on your computer).
C. In practice, most data science projects involve collaboration on documents that contain code (e.g., Jupyter notebooks), and version control tools facilitate collaboration on such documents.
D. Version control tools check the accuracy of your code.
Assign your answer to an object called answer1.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_1.1()
Question 1.2 True or false:
{points: 1}
Git is a remote/cloud repository hosting service where you can back up and share your files with collaborators.
Assign your answer to an object called answer1.2. Make sure your answer is written in lowercase and is surrounded by quotation marks (e.g. "true" or "false").
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_1.2()
For the rest of this worksheet, you will create a toy data science project on GitHub to practice using Git and GitHub. We will ask you questions about what you are doing along the way to test your understanding.
If you do not already have a free GitHub.com account, visit GitHub.com and sign up for one. Store your username and password in a secure place (we recommend using a password manager for this, such as LastPass or 1Password).
On GitHub.com, create a new repository and name it toy_ds_project. You can decide whether to make it private or public. Ensure that you select “Add a README file.” This task corresponds to this step in the textbook.
Question 2.1 Multiple Choice:
Which statement below is not true about GitHub repositories?
{points: 1}
A. Immediately after a repository is created on GitHub.com using the website, the repository exists only on GitHub.com and does not exist on your computer (i.e., you need to do something to get a copy of it on your computer).
B. Only the creator of a GitHub repository, and people the creator specifies, can edit the files in the repository. This is true even when the repository is public.
C. If the repository is public, anyone on the web can view it.
D. If the repository is public, anyone on the web can edit it.
E. A GitHub repository is like a folder on Dropbox or Google Drive, but it is different in that it has special properties for version control.
Assign your answer to an object called answer2.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_2.1()
Edit the README.md file in your toy_ds_project repository on GitHub.com using the pen tool. Write "project creation date:" and list today's date.
Edit the README.md file again. Write "author" and list your name as the author. Commit this change and use the commit message "added project author".
Note: you can visit the version of your repository at any stage in its history by clicking on the <> buttons! Give it a try!
Question 3.1 True or false:
{points: 1}
Even though commit messages are required to edit a file using the pen tool on GitHub.com, what you write in the message is not important.
Assign your answer to an object called answer3.1. Make sure your answer is written in lowercase and is surrounded by quotation marks (e.g. "true" or "false").
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_3.1()
For our data science project, we need to put a copy of our repository somewhere we can run and test the code we write (otherwise, we won't know that our code works!). We can use the course JupyterHub for this!
Clone a copy of this GitHub repository to the course JupyterHub using the Jupyter Git extension. This task corresponds to this step in the textbook.
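The worksheet asks you to clone with the Jupyter Git extension, but the same operation can be sketched on the command line. The example below is only an illustration under stated assumptions: it assumes Git is installed, and it uses a hypothetical local bare repository (remote_toy_ds_project.git) as a stand-in for your repository on GitHub.com so that it runs without a network connection or credentials.

```shell
set -eu

# Work in a throwaway directory so nothing in your real workspace is touched.
work_dir="$(mktemp -d)"
cd "$work_dir"

# Hypothetical stand-in for the repository on GitHub.com: a local bare repository.
git init --quiet --bare remote_toy_ds_project.git

# Clone the stand-in remote into a folder named toy_ds_project.
# (Git warns on stderr that the repository is empty; that is expected here.)
git clone --quiet remote_toy_ds_project.git toy_ds_project 2>/dev/null

# Show which remote the new local copy is linked to.
cd toy_ds_project
git remote -v
```

With a real repository, you would pass the URL shown on your repository's page on GitHub.com to git clone instead of the local path used here.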
Question 4.1 True or false:
{points: 1}
The definition of cloning a repository is to copy/download the entire contents (files, project history, and location of the remote repository) of a remote GitHub.com repository to a computer (e.g., your workspace on a JupyterHub, or your laptop).
Assign your answer to an object called answer4.1. Make sure your answer is written in lowercase and is surrounded by quotation marks (e.g. "true" or "false").
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_4.1()
Now that your repository exists in your workspace on the course JupyterHub, you can create a new Jupyter notebook with an R kernel and write some code! To help this project move along, we show you below how to create and save a new Jupyter notebook, and we provide some code to put in it.
To create a new Jupyter notebook with an R kernel in your toy_ds_project repository, use the file navigation menu of Jupyter so that you are inside the toy_ds_project folder:
Once there, click on new R notebook:
Next, right-click on the filename and click on "Rename" to rename the file to marg_vs_divorce_viz.ipynb.
Add the code below to the notebook and run it to display the data visualization. Feel free to add a narrative to the notebook if you like, commenting on the question being asked, the data visualization results, and whether correlation means causation. When you are done, save the notebook.
library(tidyverse)
library(cowplot)
library(scales)

# Data sourced from the Spurious Correlations website (http://www.tylervigen.com/spurious-correlations)
should_have_bought_butter <- tibble(
  margarine_consumption = c(8.2, 7, 6.5, 5.3, 5.2,
                            4, 4.6, 4.5, 4.2, 3.7),
  maine_divorce_rate = c(5, 4.7, 4.6, 4.4, 4.3,
                         4.1, 4.2, 4.2, 4.2, 4.1),
  year = c(2000, 2001, 2002, 2003, 2004,
           2005, 2006, 2007, 2008, 2009)
)

# Top panel: margarine consumption over time (x-axis ticks and labels hidden)
marg_vs_time <- should_have_bought_butter |>
  ggplot(aes(x = year, y = margarine_consumption)) +
  geom_line(colour = "Blue") +
  labs(x = "", y = "Margarine consumption \n(lbs per capita)",
       title = "Divorce rate in Maine correlates with margarine consumption") +
  theme_bw() +
  theme(axis.ticks.x = element_blank(),
        axis.text.x = element_blank()) +
  theme(text = element_text(size = 20)) +
  scale_y_continuous(labels = number_format(accuracy = 0.01))

# Bottom panel: divorce rate in Maine over time
divorce_rate_vs_time <- should_have_bought_butter |>
  ggplot(aes(x = year, y = maine_divorce_rate)) +
  geom_line(colour = "Red") +
  labs(x = "Year", y = "Divorce rate in Maine \n(per 1000)") +
  scale_x_continuous(breaks = 0:2100) +
  theme_bw() +
  theme(text = element_text(size = 20))

# Set the plot size for the notebook and stack the two panels vertically
options(repr.plot.width = 11, repr.plot.height = 8)
plot_grid(marg_vs_time, divorce_rate_vs_time, ncol = 1)
Now we would like to start the process of putting marg_vs_divorce_viz.ipynb under version control and eventually push this file to our remote repository on GitHub.com. The first step is to add the changes to this file (creating it and adding the code) to the Git staging area. Go ahead and use the Jupyter Git extension to do this now. This task corresponds to this step in the textbook.
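If you are curious what staging looks like outside the Jupyter Git extension, the sketch below shows a rough command-line equivalent. It assumes Git is installed and works in a throwaway repository created with git init; the notebook file it creates is a hypothetical one-line placeholder, not your real notebook.

```shell
set -eu

# Create a throwaway repository so your real project is not touched.
work_dir="$(mktemp -d)"
git init --quiet "$work_dir/toy_ds_project"
cd "$work_dir/toy_ds_project"

# Hypothetical placeholder standing in for the real notebook file.
printf '{}\n' > marg_vs_divorce_viz.ipynb

# Add the new file to the staging area.
git add marg_vs_divorce_viz.ipynb

# List the state of the working directory and staging area in short form.
git status --short
```

In the short status format, the first column describes the staging area, so a newly staged file is marked there with an A.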
Question 6.1 Multiple Choice:
{points: 1}
Git has a distinct step of adding files to the staging area because:
A. Not all changes we make (i.e., files we create or edit) are ones that we want to push to our remote GitHub repository.
B. It allows us to edit multiple files at once, but associate particular commit messages with particular files (so that the commit messages can more specifically reflect the changes that were made).
C. This is technically required of all version control software.
D. A and C.
E. A and B.
Assign your answer to an object called answer6.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_6.1()
Question 7.1 True or false:
{points: 1}
When we commit our changes to Git, the snapshot of changes, the commit message, the time and date stamp and the user who committed the changes are all saved to the Git history on GitHub.
Assign your answer to an object called answer7.1. Make sure your answer is written in lowercase and is surrounded by quotation marks (e.g. "true" or "false").
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_7.1()
Finally, we are ready to send our changes (creating and adding code to marg_vs_divorce_viz.ipynb) to our remote repository through a process we call "pushing". Go ahead and do this now. This task corresponds to this step in the textbook.
After pushing your work to the remote repository on GitHub, visit your repository on GitHub.com and check out what your awesome toy project looks like!
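For reference, the commit and push steps you just performed with the Jupyter Git extension can be sketched on the command line as follows. This is an illustration under stated assumptions rather than the required workflow: it assumes Git is installed, uses a hypothetical local bare repository in place of GitHub.com, a placeholder notebook file, and made-up author details and commit message.

```shell
set -eu

# Throwaway setup: a local bare repository stands in for GitHub.com (hypothetical).
work_dir="$(mktemp -d)"
cd "$work_dir"
git init --quiet --bare remote_toy_ds_project.git
git clone --quiet remote_toy_ds_project.git toy_ds_project 2>/dev/null
cd toy_ds_project

# Made-up author details, set for this repository only.
git config user.name "Example Student"
git config user.email "student@example.com"

# Placeholder notebook file, staged and then committed with a message.
printf '{}\n' > marg_vs_divorce_viz.ipynb
git add marg_vs_divorce_viz.ipynb
git commit --quiet -m "add margarine vs divorce visualization"

# Push the current branch to the remote named origin.
git push --quiet origin HEAD 2>/dev/null

# Show the history that the stand-in remote now holds.
git -C "$work_dir/remote_toy_ds_project.git" log --all --oneline
```

When pushing to a real repository on GitHub.com, Git will also ask you to authenticate, which the Jupyter Git extension handles through its own prompts.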
Question 8.1 Multiple Choice:
Which statement below is not true?
{points: 1}
A. Cloning and pulling a GitHub repository are the exact same thing.
B. Pushing with Git is the act of sending changes that were committed to Git to a remote repository, for example, on GitHub.com.
C. Pulling with Git is the act of collecting changes that exist in a remote repository, for example, on GitHub.com, that do not yet exist on the local computer you are working on (i.e., your workspace on the JupyterHub or your laptop).
D. You should push your work to GitHub anytime you want to share your work with others, or when you are done with a work session and want to back up your work.
Assign your answer to an object called answer8.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_8.1()
One of the advantages of using version control tools, such as Git and GitHub, is how they let you collaborate. Let's get some practice starting down this path. Add one or more of your group members to your GitHub repository as a collaborator. This task corresponds to this step in the textbook.
Question 9.1 True or false:
{points: 1}
You can clone or pull from any public remote repository on GitHub.com; however, you can only push to public remote repositories on GitHub.com that you own or are a collaborator on.
Assign your answer to an object called answer9.1. Make sure your answer is written in lowercase and is surrounded by quotation marks (e.g. "true" or "false").
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_9.1()
If you want to practice more Git & GitHub skills for collaboration, ask someone in your room if you can collaborate and send an edit to their project. To do this, they will need to add you as a collaborator, and then you will need to clone their repository to your JupyterHub. After that, you can edit some files (or create a whole new one), save your work, and then use the Jupyter Git extension to add, commit, and push your changes to their remote GitHub repository.
It's easy for project communications to get lost in email or whatever messaging platform you use to communicate with your team. GitHub issues are an excellent tool explicitly designed for project collaboration as they are "attached" to the project's remote GitHub repository. Your task here is to go to the issue tab for your project and create an issue about something you might want to improve about your project. This task corresponds to this step in the textbook.
Question 10.1 Multiple Choice:
{points: 1}
Which statement below is not a reason why GitHub issues are an ideal medium for project-specific communications?
A. Issues are part of each GitHub repository, and thus "attached" to the project.
B. Issues only persist while they are open, and are immediately deleted when they are closed.
C. Issues are easily searchable using GitHub’s search tools.
D. All issues are accessible to all project collaborators, so no one is left out of the conversation.
E. Issues can be set up so that team members get email notifications when a new issue is created or a new post is made in an issue thread.
Assign your answer to an object called answer10.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_10.1()
Visit a group member's GitHub repository and leave a polite but constructive message on how they could improve their project.