By Tyler Ransom, Duke University Social Science Research Institute

Julia's `DataFrames`

package is largely mirrored after the `data.frame()`

package in R. The key underlying principle is that data frames allow for storage of mixed data types (e.g. strings and numbers) in the same object. Data frames also allow for a missing data type, which is `NA`

in Julia. Other statistical software packages such as SAS, Stata, SPSS, and Matlab offer similar features in their data storage.

This tutorial serves to familiarize Julia users with the primary syntax and capaibilities of Julia's `DataFrames`

package. There will be emphasis on making connections with Stata's syntax and features, but anyone with experience in statistical programming will be able to make connections to their preferred language.

First, let's call the packages we'll need for this demonstration. We'll be using Julia version 0.4.1 with DataFrames version 0.6.10 and FreqTables version 0.0.1.

In [ ]:

```
using DataFrames
using FreqTables
```

Now let's read in some sample data --- the `auto`

dataset from Stata (in CSV form: https://github.com/jmxpearson/duke-julia-ssri-2016/auto.csv). In Julia, the `readtable()`

function converts delimited text files into data frames.

There are a number of options for configuring the read-in operation, but for now we'll use a simple comma-separated file with standard configurations.

Notice that whatever variable name you choose on the left-hand side of the equals sign will be the name of your data frame moving forward.

In [2]:

```
auto = readtable("auto.csv");
```

Next, let's look at the variables that are in our data frame. The `showcols`

function accomplishes this task. This is very similar to Stata's `describe`

command.

In [3]:

```
showcols(auto)
```

The output of `showcols()`

shows us that we have 74 observations, 12 variables, the name and format of each of our variables, and the number of missing observations for each.

We can also get the length and width of our data frame using the `size()`

function:

In [112]:

```
num_obs = size(auto,1)
```

Out[112]:

In [113]:

```
num_vars = size(auto,2)
```

Out[113]:

Next, let's look at some of our variables. We do this by either referencing the name with a `":"`

in front, or with the column number:

In [114]:

```
auto[:price]
```

Out[114]:

In [115]:

```
auto[2]
```

Out[115]:

In [116]:

```
auto[:,[:price,:mpg]]
```

Out[116]:

We can also use the `head()`

and `tail()`

functions to view the first `k`

and last `k`

observations for all variables in our data frame:

In [117]:

```
head(auto,4)
```

Out[117]:

In [118]:

```
tail(auto,4)
```

Out[118]:

We can also list observations of certain variables indexed by their row number:

In [119]:

```
auto[[1;2;4;15],[:headroom,:trunk]]
```

Out[119]:

We can also list observations that meet some condition. For example, suppose we want to look at the `headroom`

and `trunk`

space for all cars that achieve less than 20 miles per gallon:

In [120]:

```
auto[(auto[:,:mpg].<20),[:headroom,:trunk]]
```

Out[120]:

We can look at summary statistics in a few different ways. First, notice that the `showcols()`

function reported the number of `NA`

or missing values for each variable.

The `describe()`

function also displays missing value frequencies and percentages, in addition to reporting the min/max, mean, median, number of unique observations, and quartiles for each variable in the data frame:

In [4]:

```
describe(auto)
```

We can also display the mean for each variable using the `colwise()`

function. However, this function will return an error if we include string variables.

In [122]:

```
colwise(mean,auto)
```

In [171]:

```
round(colwise(mean,auto[:,2:end]),3)
```

Out[171]:

Notice that this returns `NA`

for variables with at least one missing observation.

We can also compute frequencies of categorical variables, using a couple of different functions:

`countmap()`

returns cell counts as a dictionary:

In [124]:

```
countmap(auto[:foreign])
```

Out[124]:

We can also use the `by`

structure, coupled with the `nrow`

function:

In [125]:

```
by(auto,:foreign,nrow)
```

Out[125]:

For cross-tabulations, we require the `FreqTables`

package, which was loaded earlier.

In [126]:

```
freqtable(auto, :rep78, :foreign, subset=!isna(auto[:rep78]))
```

Out[126]:

Notice that, at the moment, the Julia's `DataFrames`

is substantially lagging other languages in terms of computing cross-tabulations and contingency tables.

Suppose we want to delete observations in our dataset according to some rule. This amounts to *keeping* the complement of the rule. For example, if we want to drop all observations of the data frame where a variable is missing, we index the rows we want to keep with `!isna()`

and select all columns:

In [5]:

```
auto1 = auto[!isna(auto[:,:rep78]), :];
```

In [6]:

```
showcols(auto1)
```

Notice that we now have 5 fewer observations in the new data frame, and that there are no missing values.

We can drop variables in two different ways:

First, by using the complement of a `keep`

statement:

In [7]:

```
auto1 = auto1[setdiff(names(auto1), [:price,:mpg])];
```

In [8]:

```
showcols(autod1)
```

Second, we can drop in-place using the `delete!()`

function, which overwrites the data frame.

In [139]:

```
delete!(auto1,[:weight,:length]);
```

In [140]:

```
showcols(auto1)
```

We can keep variables simply by indexing the variable names or column numbers of interest:

In [141]:

```
auto2 = auto[:,[:make,:mpg,:displacement,:gear_ratio]];
```

In [142]:

```
showcols(auto2)
```

Suppose we want to rename the variables in the "kept" data frame from directly above. This is easily accomplished with the `rename!()`

function:

In [143]:

```
rename!(auto2,[:make,:displacement],[:make_name,:CCs]);
```

In [144]:

```
showcols(auto2)
```

Cloning a variable is easily done as follows:

In [146]:

```
auto2[:mpg_same] = auto2[:mpg];
```

In [147]:

```
showcols(auto2)
```

To generate a new variable using a function of one or more existing variables, the syntax is a bit more involved. For instance, suppose we want to create a new variable called `mpgSquared`

, which is equal to `mpg`

squared:

In [148]:

```
auto2[:mpgSquared] = map(temp -> temp.^2, auto[:mpg]);
```

We use `map()`

to accomplish the task, which takes as arguments a function (using arrow notation) and an input (`auto[:mpg]`

). Note that the argument of the function (`temp`

, here) can be any name. I used `temp`

to emphasize that it is a variable of local scope, and thus purley temporary.

Finally, note that there is a `"."`

before the caret symbol, indicating that this is a vectorized operation. Failure to include the `"."`

will result in an error.

We can verify that the function worked as expected:

In [149]:

```
head(auto2[:,[:mpg,:mpgSquared]])
```

Out[149]:

We can use this framework to generate new variables using any mathematical function. For example, a dummy variable that is equal to 1 if `:mpg`

is less than 20 and `:gear_ratio`

is less than 3, and 0 otherwise:

In [157]:

```
auto2[:dummy_var] = map((tempx,tempy) -> (tempx.<20) & (tempy.<3), auto2[:mpg], auto2[:gear_ratio]);
```

In [158]:

```
showcols(auto2)
```

Note that the type of the new dummy variable is `Bool`

instead of `Int64`

.

Suppose we want to change the ordering of the variables of our data frame. This is most easily done as follows:

In [159]:

```
auto2 = auto2[:,[2;3;4;1;5:end]];
```

In [160]:

```
showcols(auto2)
```

We can also sort the observations in our data frame by any number of columns and any number of methods (ascending or descending). Performance of the `sort!()`

function closely mirrors Stata's `gsort`

capabilities.

Below, we will sort ascending by `:mpg`

and descending by `:make_name`

:

In [169]:

```
sort!(auto2,cols=[:mpg,:make_name],rev=[false,true]);
```

In [170]:

```
head(auto2,4)
```

Out[170]:

Julia's `DataFrames`

allows for reshaping of longitudinal datasets in a similar fashion as other statistical software programs.

Let's start by hand-creating a "wide" panel dataset with 3 individuals and 3 time periods:

In [173]:

```
reshape1 = DataFrame(id = 1:3, sex = [0;1;0],
inc1980 = [5000;2000;3000],
inc1981 = [5500;2200;2000],
inc1982 = [6000;4400;1000])
```

Out[173]:

We can reshape this data frame to "long" format by using the `stack()`

command provided by the `DataFrames`

package:

In [174]:

```
longform1A = stack(reshape1, [:inc1980, :inc1981, :inc1982], [:id, :sex])
```

Out[174]:

Now we have three replications of each `:id`

and `:sex`

(the time-invariant columns), as well as two new columns, labeled `:variable`

and `:value`

.

We can sort the new dataframe so that it is in a more readable format:

In [175]:

```
sort!(longform1A, cols = [:id, :variable])
```

Out[175]:

And we can also reshape back to "wide" format using the `unstack()`

function:

In [176]:

```
wideform1A = unstack(longform1A, :variable, :value)
```

Out[176]:

It's worth noting that this method does not work very well when there are multiple time-varying variables per `:id`

. We'll discuss this in detail a bit later.

`DataFrames`

also has functions that allow the user to merge two data frames together. There are many different types of possible merges, all accessible via the `join()`

function.

The different types of merges depend on if the user wants to keep unmatched observations from either data frame, not on whether or not the identification is duplicated in the merging data frames (i.e. each type of merge can be used for both one-to-one merges and many-to-one merges).

The basic syntax is `c = join(a, b, on = [:id1, :id2], kind = symbol)`

, where `a`

and `b`

are data frames each with the identifiers `:id1`

and `id2`

, and `kind`

is a symbol that can take any of the following 7 values:

Let's show how to do each of these merges using a simple set of data frames.

`:inner`

: The output contains rows for values of the key that exist in both the first (left) and second (right) arguments to join*(this is the*`keep(match)`

option in Stata)`:left`

: The output contains rows for values of the key that exist in the first (left) argument to join, whether or not that value exists in the second (right) argument*(this is the*`keep(master)`

option in Stata)`:right`

: The output contains rows for values of the key that exist in the second (right) argument to join, whether or not that value exists in the first (left) argument*(this is the*`keep(using)`

option in Stata)`:outer`

: The output contains rows for values of the key that exist in the first (left) or second (right) argument to join*(this is the Stata default)*

`:semi`

: Like an inner join, but output is restricted to columns from the first (left) argument to join*(this is the*`keep(match master)`

option in Stata)`:anti`

: The output contains rows for values of the key that exist in the first (left) but not the second (right) argument to join. As with semi joins, output is restricted to columns from the first (left) argument*(there is no natural stata equivalent for this)*`:cross`

: The output is the cartesian product of rows from the first (left) and second (right) arguments to join*(this is equivalent to Stata's*. Note also that`append`

command)`:cross`

is the only merge type that does not require an identifier in each data frame

Let's show how to do each of these merges using a simple set of data frames.

In [178]:

```
name = DataFrame(ID = [1, 2, 3, 4, 5, 6], Name = ["John", "Jane", "Mark", "Ann", "Vlad", "Maria"])
```

Out[178]:

In [179]:

```
jobs = DataFrame(ID = [1, 2, 3, 4, 5, 6], Job = ["Lawyer", "Doctor", "Mechanic", "Doctor", "Judge", "Pilot"])
```

Out[179]:

In [186]:

```
siblings = DataFrame(ID = [1, 1, 2, 3, 5, 5, 5, 6],
Sibling = ["Eric", "Ryan", "Jennifer", "Heather", "Carl", "Dmitri", "Andrei", "Pedro"])
```

Out[186]:

Let's do a simple `:inner`

merge on the first `name`

and `jobs`

data frames:

In [183]:

```
mergedNameJobs = join(name,jobs, on = :ID, kind = :inner)
```

Out[183]:

Now let's see what happens when we merge `name`

with `siblings`

, under a variety of `join`

types:

In [184]:

```
mergedNameSibsInner = join(name,siblings, on = :ID, kind = :inner)
```

Out[184]:

With the `:inner`

join, those who don't have siblings are removed from the merged data frame.

In [185]:

```
mergedNameSibsOuter = join(name,siblings, on = :ID, kind = :outer)
```

Out[185]:

When we instead do an `:outer`

join, we see that Ann, who doesn't have any siblings, shows as `NA`

under `:Sibling`

.

In [187]:

```
mergedNameSibsLeft = join(name,siblings, on = :ID, kind = :left)
```

Out[187]:

In [188]:

```
mergedNameSibsOuter = join(name,siblings, on = :ID, kind = :right)
```

Out[188]:

In [191]:

```
mergedNameSibsSemi = join(name,siblings, on = :ID, kind = :semi)
```

Out[191]:

In [192]:

```
mergedNameSibsAnti = join(name,siblings, on = :ID, kind = :anti)
```

Out[192]:

In [193]:

```
mergedNameSibsCross = join(name,siblings, kind = :cross)
```

Out[193]:

I mentioned previously that the reshaping method outlined previously does not work very well when there are multiple time-varying variables per :id. With the `join()`

functions in hand, this is possible, though not ideal compared to other software packages.

Let's revisit our previous example, except now with two time-varying variables (inc* and ue*):

In [194]:

```
reshape2 = DataFrame(id = 1:3, sex = [0;1;0], inc1980 = [5000;2000;3000],
inc1981 = [5500;2200;2000],inc1982 = [6000;4400;1000],
ue1980 = [0;1;0], ue1981 = [1;0;0], ue1982 = [0;0;1])
```

Out[194]:

If we try to reshape this using a similar `stack()`

call as before, we get:

In [195]:

```
longform2 = stack(reshape2, [:inc1980, :inc1981, :inc1982, :ue1980, :ue1981, :ue1982],
[:id, :sex])
```

Out[195]:

The inc* and ue* values are stacked, so that we have double the number of observations we would like to have.

The remedy for this is to do the reshaping separately for each type of variable, and then merge together.

Conversion from data frames to regular `Julia`

arrays may be required for use of libraries outside of the `DataFrames`

and `GLM`

world.

To convert, simply type

`arrayName = convert(Array,dataFrameName)`

But be aware that any `NA`

elements of the data frame will cause an error to be thrown (because `Julia`

's regular arrays do not know the `NA`

type).