Many users manage files, run programs, and otherwise interact with their computers only through a graphical user interface (GUI), clicking on applications or other files to open them. Some activities, though, including some that are important for this workshop, are performed more easily by typing instructions and hitting the Enter key within an interface known as the command line (sometimes also called the shell, the terminal, or the command prompt; we will use these terms interchangeably). In this workshop we use the command line primarily for three things: to install software, to change the directory within which we’re working, and to run Python programs. This tutorial describes each of these briefly. For more information about using the command line, see the Command line crash course (link at the bottom of the page) or The Unix Shell from Software Carpentry.
Each operating system makes a terminal available by default, without requiring special installation:
A window will open that displays a command line, a place where you can type instructions to be executed on the computer, with a prompt that might look something like this on a Mac OS Terminal:
or this in Windows PowerShell:
PS C:\Users\Tara L Andrews>
or this in a Linux terminal:
The prompt (which ends in the examples above with either a dollar sign [“$”] or a right angle bracket [“>”]) tells you that you can type instructions to be executed. When you do that, the terminal will sometimes print output or messages in response to what you’ve told it to do (not all terminal commands produce output or messages in the terminal window), and it will then usually present another command prompt so that you can enter a new command.
Now you are ready to type the commands that come next. To run a command, type it into the command prompt window and hit the Enter key. You can practice by typing
echo 'hello, world' and then hitting the Enter key, and the system should echo back to you the string “hello, world”. (You can use single or double straight quotation marks around the phrase you want to echo, but whichever one you use at the beginning must match the one you use at the end.)
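As a small illustration of the quoting rule just described, both of the following forms print the same text. This is only a sketch of the practice command; nothing here is specific to the workshop.

```shell
# Single and double straight quotation marks both work,
# as long as the opening and closing marks match.
echo 'hello, world'
echo "hello, world"
```

If the opening and closing marks do not match, the shell assumes the phrase is unfinished and waits for more input instead of running the command.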
Windows users: Some of you may have used cmd.exe in the past to work at the command line. We recommend PowerShell (or, for Windows 10 users, bash) because it uses many of the same commands that have always been in use on Unix-like systems (including Linux and Mac OS), and so makes it easier for you to follow generic command-line instructions such as those we will be giving in the workshop. If you choose to stick to cmd.exe, you do so at your own risk, and the commands described below may not all be available.
To get started with CollateX you’ll need to install CollateX itself, along with a few supporting programs. The details are in our installation instructions, but the short version is that after you install Python, you should open a command prompt and run the following commands (but see the note below for Windows users concerning the Levenshtein installation):
pip install collatex
pip install python-levenshtein
pip install graphviz
pip (as in pip install) is the command that you use for installing and updating Python packages, and the command line is the simplest method for some (but not all) software installation. Our workshop also requires you to install Python (before you run the
pip commands above) and Graphviz, and those installations are normally performed with GUI installers that you download, and not at the command line. See the installation instructions for information about installing all software used in the workshop.
The files on your computer are organized into a hierarchical file system, and one special type of file is a directory, which, instead of containing text or other data, functions as a sort of folder that can contain other files (including other directories). At the top of the hierarchy is a root directory, which may contain files of many types, as well as subdirectories (which, in turn, may contain all of the same types of items as the root directory, all the way down). Your Desktop, the place where you download files, your Trash, and other items that look like folders in a graphical file manager (such as the Windows Explorer or the Mac OS Finder) are directories. A folder is a graphical and conceptual metaphor for a directory; thus we tend to talk about directories when working on the command line and to talk about folders when working in a graphical user interface (a GUI). But we are referring to the same thing: a directory in the file system, represented as a folder in the GUI.
User accounts on a computer are part of the hierarchical file system. On Mac OS, the root directory typically contains a subdirectory called “Users”, which contains a subdirectory for each user who has an account on the machine (the name of the subdirectory is typically the same as the user’s login name). The main directory for you, as a user on the system with your own unique login id, is called your home directory, and when you open a command prompt it typically opens inside your home directory. Don’t confuse the home directory with the root directory; the root directory is at the top of the file system for the entire machine, and your home directory is located somewhere under the root directory and is located above all of your individual files. Note also that your home directory isn’t the same as your Desktop; on most operating systems your Desktop (which looks like your main point of entry in the graphical file interface) is really a subdirectory of your home directory, and your home directory may contain other files and subdirectories in addition to your Desktop.
You can verify this by typing
pwd (which stands for “print working directory”) at the command line. When user “djb” opens a new command line on his Mac and types this command, it responds “/Users/djb”. This representation is called a path, and the way to read it is:
The hierarchy described above (with a few additional files) can be illustrated as follows (yellow backgrounds are directories; blue are non-directory files of various types):
Any file (including subdirectories) can be described with a path that begins in the root directory, and the
pwd command causes that path to be printed on the command line. When you use the command line, you are always located inside a particular directory (called your current working directory), which matters because some operations need to be performed in specific locations within the hierarchy.
In the terminology commonly used to describe file systems, where directory A contains directory B, we say that A is the parent directory of B and B is a child of A. We’ll use these terms below when we discuss file-system navigation.
Digression on file system hygiene: As is illustrated in the image above, you should keep the files related to a specific project inside their own subdirectory (perhaps subdivided across sub-subdirectories depending on file type, chapter, or something else). There are several reasons not to keep all of your files for all of your projects in the same place (common locations for this type of project-management mistake are on the Desktop or in your Documents folder), the most important of which is that it becomes difficult to distinguish one from another, and you run the risk of overwriting a file for one project with a file for another project that happens to have the same filename. There’s nothing wrong with creating a subdirectory for each of your projects on your Desktop or inside your Documents folder, although many users prefer to create those project-specific subdirectories inside their home directories because they don’t all need to be visible right on the Desktop (which can get crowded) and they can hold files that aren’t documents, and therefore don’t naturally belong under a Documents hierarchy. What’s most important is not to dump all of the files for all of your projects into the same directory in the hierarchy.
In order to have all our materials organized in the same way, we encourage you to create a directory for this workshop. You can create it wherever you want, but an appropriate place is your home directory. Give it whatever name you prefer; we will call it “Workshop”. So now you should have a directory called “Workshop” inside your home directory (or similar). Inside “Workshop” there will be three sub-directories: “Notebooks”, “Scripts”, and “fixtures”. The sub-directory “Notebooks” will contain the Jupyter notebooks we will create, as introduced in the Jupyter notebook tutorial; the sub-directory “Scripts” will contain our Python programs, as explained in the Collate outside the notebook tutorial; we will download the sub-directory “fixtures” from the workshop website, and it contains the sample texts we will be working on (Darwin, Woolf). For the moment, don’t worry about the sub-directories; just create the directory “Workshop”.
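If you would rather create the directory from the command line than in a graphical file manager, a minimal sketch follows. It assumes the standard mkdir ("make directory") command, which this tutorial does not otherwise cover, and it uses a throwaway scratch directory (the hypothetical variable base) as a stand-in for your home directory so that running it leaves your real files untouched.

```shell
# "base" is a scratch location standing in for your home directory (an assumption
# made for this sketch; in practice you would work in your home directory itself).
base=$(mktemp -d)

mkdir "$base/Workshop"    # create the workshop directory
cd "$base/Workshop"       # move into it
pwd                       # confirm where you are
```

To do the same thing for real, open a terminal (which typically starts in your home directory) and type mkdir Workshop.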
Some commands can be performed in any directory, and for those commands, where you’re located doesn’t matter. For example, when you install Python modules using the
pip commands above, they will be installed correctly no matter where you are. But when you type
pwd, what you see depends on where you’re located.
We’ll introduce commands that need to be performed in specific directories later in the workshop, and we’ll describe here just how to change your location at the command prompt. It’s easy to get lost when you’re first learning command-line navigation, and you can always use
pwd to ask the system to tell you where you are at the moment.
The command to change directories is
cd. By itself, it moves you to your home directory, but if you follow it with a path, it will move you to the directory specified by the path. For example, if you’re in your home directory on the Mac OS system described above,
cd /Users will take you to the “Users” subdirectory under the root directory. That is, if your home directory is “/Users/djb”, it will take you to your parent directory. And if your project about Shakespeare is located in “/Users/djb/projects/shakespeare”,
cd /Users/djb/projects/shakespeare will take you there.
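The following sketch shows cd with a full path, followed by pwd to confirm the new location. Because "/Users/djb" will not exist on your machine, it builds a comparable "projects/shakespeare" hierarchy inside a scratch directory (the variable name scratch is just an illustration).

```shell
# Build a small practice hierarchy in a throwaway location.
scratch=$(mktemp -d)
mkdir -p "$scratch/projects/shakespeare"

# cd followed by a path moves you to that directory...
cd "$scratch/projects/shakespeare"

# ...and pwd reports the full path of where you are now.
pwd
where=$(pwd)
```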
Two dots (
..) are a special path step that represents the parent of the current location. This means that if you type
cd .. when you’re in “/Users/djb/projects/shakespeare”, you’ll move to “/Users/djb/projects”. And if you type
cd ../.. when you’re in “/Users/djb/projects/shakespeare”, you’ll move up two levels in the hierarchy, first to the parent of your current location and then to the parent of that intermediate one. Note that there may be more than one way to move around the hierarchy, so if you’re in “/Users/djb/projects/shakespeare”, you can get to “/Users/djb” in at least three ways:
cd /Users/djb,
cd ../.., and (if you are user “djb”)
cd (Windows users, see below).
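The two-dot step can be tried out safely in a scratch hierarchy. This sketch again uses a temporary directory in place of "/Users/djb"; the variable names are illustrative only.

```shell
scratch=$(mktemp -d)
mkdir -p "$scratch/projects/shakespeare"
cd "$scratch/projects/shakespeare"

cd ..               # up one level, to "projects"
one_up=$(pwd)

cd shakespeare      # back down into the child directory
cd ../..            # up two levels in a single command
two_up=$(pwd)

echo "$one_up"
echo "$two_up"
```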
The character “~” is a special path step that represents your home directory. There’s no point in typing
cd ~ to go to your home directory because
cd alone will do that, but to go to, say, your own “projects” subdirectory from anywhere in the hierarchy, you can use
You can toggle easily between two directories by using
cd -, which takes you to the directory you left most recently.
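A short sketch of toggling follows, using two scratch directories with the hypothetical names first and second. It applies to bash and similar Unix shells.

```shell
first=$(mktemp -d)
second=$(mktemp -d)

cd "$first"
cd "$second"

cd -      # returns to the first directory and prints its path
cd -      # toggles back to the second directory
pwd
```

The same pattern works with the tilde: a path such as ~/projects begins in your home directory no matter where you currently are.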
For Windows users: PowerShell does not understand
cd alone, nor
cd -. For moving to your home directory, you can type
For the workshop you’ll develop your Python programs either in the PyCharm Integrated Development Environment (IDE) or in Jupyter Notebook, which is installed automatically with Anaconda Python, and we’ll show you how to use both environments. But once you’ve written a Python program that you want to use to collate your witnesses, you’ll run it from the command line, which avoids the inconvenience and processing overhead of the development platform.
To run a Python program, you type
python myfile.py, replacing “myfile.py” with the name of your Python program. But the command as written will only work if “myfile.py” is inside the directory in which you are running the command, and this will not always be the case. For example, you might want to use “myfile.py” to process several different projects, which means that you might want to run it inside the directories for each of those projects without having to put a separate copy of the identical Python script in each project subdirectory.
By default Python will look for “myfile.py” only in your current working directory, the one in which your command prompt is located. You can nonetheless run a file located anywhere else by specifying a longer path. For example,
python /opt/bin/myfile.py will run a file called “myfile.py” that is located in the “bin” subdirectory of the “opt” subdirectory of the root directory. And
python ../myfile.py will run “myfile.py” if it is located in the parent directory of your current working directory.
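The sketch below runs one script from a different directory, first with a relative path and then with a full path. It makes two assumptions that you should adjust for your own machine: the interpreter may be called python3 rather than python on some systems (the sketch picks whichever it finds), and the one-line "myfile.py" it creates is a stand-in for a real collation script.

```shell
# Use whichever interpreter name exists here (an assumption; the tutorial writes "python").
PY=$(command -v python3 || command -v python)

# A scratch hierarchy: the script lives one level above the project directory.
scratch=$(mktemp -d)
mkdir "$scratch/project"
printf 'print("collating")\n' > "$scratch/myfile.py"

cd "$scratch/project"
"$PY" ../myfile.py            # relative path: the script is in the parent directory
"$PY" "$scratch/myfile.py"    # full path: works from any working directory
out=$("$PY" ../myfile.py)
```

Either form lets several project directories share a single copy of the script.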
Typing long names of files and directories is tedious, and your command line makes it easier for you by supporting filename completion. If you type
cd followed by the beginning of a path and hit the Tab key, if there is an unambiguous way to complete the next step in the path, the system will fill it in for you. For example, if you are inside “/Users/djb/”, you have a subdirectory inside that directory called “projects”, and you type
cd pr and hit the Tab key, the system will complete the word “projects” for you. Here are the details:
cd pr), it will extend the match through the “o”, but only that far. If you hit the Tab key a second time, it will show you the possible matches, and if you then type enough letters to resolve the ambiguity and hit the Tab key again, it will complete fully.
To avoid retyping long commands, you can reuse commands you’ve already typed by accessing them in your history. Here are some useful shortcuts:
history, and you can limit that to the most recent previous commands with
history 10 (replacing the “10” with the number of commands you want to see). The commands are numbered, and you can reuse one by typing an exclamation mark followed by the command number. For example, if you want to rerun command #22, type
!22. Windows users: this won’t work in PowerShell, and the same applies to the two following points.
! followed by that string. For example,
!py will rerun the most recent command that began with “py”. Caution: It’s possible to rerun a damaging command by accident if you misremember the command history. For example, you might type
rm temporary_file.txt to remove a specific file called “temporary_file.txt” (
rm means “remove”), and then
rm * in a different scratch directory to delete all files in that directory (the asterisk means “all files in the current directory”). If you forget about that second command and then type
!rm and Enter, intending to rerun the first one, you’ll actually rerun the second one if it’s more recent, and delete files you wanted to keep (and there is no reliable “undelete” operation to recover from this type of mistake). You can protect yourself partially by getting in the habit of always typing “rm” commands in full, or always following the “rm” with “ -i” (that is, something like
rm -i *), which will give you a chance to change your mind by showing you the filenames and asking you to confirm.
r, and then typing a few characters. This searches backward through the history for commands that contain whatever you’ve typed (which can occur anywhere in the command, and not only at the beginning). When you find the command you like, you can either rerun it with Enter or edit it as described above.
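The rm -i safeguard mentioned above can be tried without risk in a scratch directory. In normal use you would type the answers yourself; in this sketch they are piped in ("n" for the first file, "y" for the second) only so that the example can run unattended. The filename "keep.txt" is hypothetical.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch keep.txt temporary_file.txt

# rm -i asks for confirmation file by file, in alphabetical order here:
# keep.txt gets the answer "n", temporary_file.txt gets the answer "y".
printf 'n\ny\n' | rm -i *

ls    # lists what is left in the directory
```

Because each deletion needs an explicit yes, an accidental rerun of rm -i * from the history gives you a chance to back out.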
Some applications will let you use characters in filenames that are not easily processed by other applications, and because in Digital Humanities we often process the same files with multiple applications, it is safest and wisest to use the most restricted inventory of filename characters. In our own work we observe the following conventions:
Close the terminal and create a directory in your home directory; don’t forget the file naming conventions. Now open a terminal and navigate to the new directory, using the command
cd, as described above. Keep browsing your files until you feel able to move around comfortably using
cd (remember that
cd .. will bring you to the parent directory and that you can use
pwd if you get lost).