An Introduction to Gadfly

Gadfly is an easy-to-use visualization package for Julia, the new high-level, high-performance language for technical computing. It follows grammar-of-graphics principles to simplify translating your ideas into plots: mapping how y changes with x across levels of z.

This tutorial aims to make Gadfly approachable through a series of examples. It complements Daniel Jones's Gadfly Manual and my reference sheets for Gadfly and Julia.

Translating your ideas into plots is more efficient with dataframes, but we'll start with one- and multi-dimensional arrays because your data may already be in that format. After that we'll look at combining data into dataframes and how that can make visualizing your data easier.

Starting Up

One of the easiest ways to use Julia is with an IPython Notebook. This allows you to edit the code, add annotations, and keep your plots, just as I'm doing here (see Appendix One for installation instructions). This notebook is available on GitHub, so you can copy and paste from it or use it as you wish. To start IJulia, open a terminal, change to the directory in which you are saving your notebooks (and perhaps your data), and enter this command:

 ipython notebook --profile julia

That will open an IPython Dashboard, where you can open an existing notebook from that directory or begin afresh with New Notebook. In Julia, when you want to use a package, you start by entering "using packagename" and then wait a few seconds for it to load. Let's begin:

In [3]:
# we only need the package Gadfly today
using Gadfly
In [4]:
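# First read in your files

# readdlm is enough here because each file holds a single column of data
# The notes below apply to readtable, which is used in the DataFrames section:
# if the file's separator is a comma then you don't need to specify it
# similarly, it's the default to read the first row as a header row
# if it was a tab separated file with a header row we would use:
#   mydat = readtable("filename.csv", separator='\t')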

d_age = collect(readdlm("f_age.csv"))
d_sex = collect(readdlm("f_sex.csv"))
d_dbp = collect(readdlm("f_dBP.csv")) ;

# open 3 files and store them
# collect is used to create one dimensional arrays instead of 2d arrays because
#  each of these files has one column of data
#  If we'd had two columns in f_sex.csv then we'd skip the collect() and address
#    the columns as d_sex[:,1] and d_sex[:,2]
# note the semicolon on the last readdlm line to stop Julia printing the final output

# let's just check what we read into the arrays
print("sa ", size(d_age), " ss ", size(d_sex), " sd ", size(d_dbp))
# and let's have a look at the first few rows of column 1 for each one
d_age[1:6]
sa (50,) ss (50,) sd (50,)
Out[4]:
6-element Array{Float64,1}:
 39.0
 46.0
 48.0
 61.0
 46.0
 43.0
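The comments in the cell above turn on the difference between a one-column 2d array and a plain one dimensional array. Here is a minimal sketch of that distinction, assuming a recent Julia 1.x release where readdlm lives in the DelimitedFiles standard library and vec is the usual way to flatten a single column (both are assumptions; this notebook was written against an older Julia). The inline text stands in for the data files and reuses the first three Age and dBP values shown in this notebook.

```julia
using DelimitedFiles  # assumption: Julia 1.x, where readdlm is in this standard library

# Inline stand-ins for the data files (first rows shown in this notebook)
one_col = readdlm(IOBuffer("39\n46\n48\n"))           # one column  -> 3x1 array
two_col = readdlm(IOBuffer("39 70\n46 81\n48 80\n"))  # two columns -> 3x2 array

ages    = vec(one_col)    # flatten the single column into a 1d array
age_col = two_col[:, 1]   # first column of the 2d array
dbp_col = two_col[:, 2]   # second column of the 2d array

println(size(one_col), " ", size(ages), " ", size(two_col))
```

Indexing with [:, n] picks out a whole column, which is why a two-column file would not need flattening at all.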
In [6]:
# I can do that one at a time or use a trick instead
# [array1 array2 array3] with spaces between the output arrays 
# concatenates them into 3 columns and displays them
[d_age[1:6] d_sex[1:6] d_dbp[1:6]]
Out[6]:
6x3 Array{Any,2}:
 39.0  "F"   70.0
 46.0  "M"   81.0
 48.0  "F"   80.0
 61.0  "M"   95.0
 46.0  "M"   84.0
 43.0  "M"  110.0
In [7]:
# I'm interested in the age distribution so let's plot a histogram
plot(x=d_age, Geom.histogram)
Out[7]:
In [8]:
# It's good practice to check coarse and fine histograms
plot(x=d_age, Geom.histogram(bincount=25))
Out[8]:
In [9]:
# And let's look at a box plot.  First narrow the plot:
set_default_plot_size(6cm, 10cm)
plot(y=d_age, Geom.boxplot, Theme(boxplot_spacing=10mm))
Out[9]:
In [10]:
# what if I want to compare men with women?
set_default_plot_size(8cm, 10cm)
plot(x=d_sex, y=d_age, Geom.boxplot, Theme(boxplot_spacing=15mm))
Out[10]:
In [11]:
# If I want a summary of the statistics for the sample it's easy to get it
[mean(d_age), std(d_age), mode(d_age), "", quantile(d_age,[0.75,0.5,0.25])]
Out[11]:
7-element Array{Any,1}:
 47.86  
  8.1941
 43.0   
   ""   
 52.75  
 46.0   
 42.0   
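The same kind of summary can be sketched on its own, without Gadfly loaded. This is only an illustration under two assumptions: a recent Julia 1.x release, where mean, std and quantile come from the Statistics standard library, and a stand-in sample made of the first six ages shown in Out[4] rather than the full 50 rows. The mode is left out because, as far as I know, it needs an extra statistics package in current Julia.

```julia
using Statistics  # assumption: Julia 1.x, where these functions live in this standard library

# First six ages from Out[4]; a stand-in for the full sample
ages = [39.0, 46.0, 48.0, 61.0, 46.0, 43.0]

stats = (mean = mean(ages),
         sd = std(ages),                                # sample standard deviation (n - 1)
         quartiles = quantile(ages, [0.25, 0.5, 0.75]))

println(stats)
```

Keeping the results in a named tuple makes each value available by name, for example stats.sd, instead of by position in a mixed array.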
In [12]:
# ok let's do a scatter plot
# resize the plot to something larger
set_default_plot_size(20cm, 12cm)
plot(x=1:50, y=d_age)

# note that rather than enter the number of rows I could have used the size function
# and entered  plot(x=1:size(d_age,1), y=d_age)
Out[12]:
In [14]:
# Having looked at the plot, we decide to add an estimated confidence interval
# and a loess smoothing of the data.  Plus let's make the labels more relevant.
# For the confidence intervals we add Geom.errorbar and calculate a min and max.
# For the smoothing we add Geom.smooth, but we also have to add Geom.point: although
#   it's the default, it is replaced as soon as any other Geom is given.  If we wanted
#   a line we could use Geom.line.
# Let's show which respondents are male and which are female with the color aesthetic.
# With color set, 2 smoothing lines are drawn (comment out the color line, the 4th
#   line of the plot call, and only one is drawn).
# Finally, notice that the plot command isn't on one line anymore - the parentheses contain it.

plot(x=1:size(d_age,1), y=d_age, 
  Guide.xlabel("Respondent"), Guide.ylabel("Age"),
  Geom.errorbar, ymin=d_age-1.96*std(d_age), ymax=d_age+1.96*std(d_age),
  color=collect(d_sex), Guide.colorkey("Sex"),
  Geom.smooth, Geom.point)
Out[14]:
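The ymin and ymax values passed to Geom.errorbar are just arrays, so it can help to look at them by themselves. The sketch below is an illustration, not the notebook's own code: it assumes a recent Julia 1.x release (where std is in the Statistics standard library and array-minus-scalar needs the dotted operators .- and .+) and uses the first six ages from Out[4] as a stand-in sample. Note that every point gets the same half-width, 1.96 standard deviations of the whole sample.

```julia
using Statistics  # assumption: Julia 1.x

ages = [39.0, 46.0, 48.0, 61.0, 46.0, 43.0]  # first six ages from Out[4]

half_width = 1.96 * std(ages)  # one half-width shared by every point
ymin = ages .- half_width      # dotted operators broadcast over the array
ymax = ages .+ half_width

# Print each observation with its lower and upper bound
for (a, lo, hi) in zip(ages, ymin, ymax)
    println(a, ": ", round(lo, digits=1), " to ", round(hi, digits=1))
end
```

Computing the bounds once and naming them keeps the plot call shorter, since ymin and ymax can then be passed as ymin=ymin, ymax=ymax.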
In [15]:
# The other major chart type is the bar chart, comparing y with x, so let's
# compare blood pressure with age
set_default_plot_size(20cm, 12cm)
plot(x=d_age, y=d_dbp, Geom.bar, Geom.smooth)
Out[15]:
In [16]:
# and using color to identify sex
plot(x=d_age, y=d_dbp, color=d_sex, Geom.bar(position=:dodge))
Out[16]:
In [17]:
# We don't have pie charts currently but you might prefer a normalized stacked bar chart anyway
#plot(x=d_age, y=d_dbp, Geom.normbar)

Finally, for this section: what if we wanted to save one of our charts in a drawing or a document format? Gadfly makes that easy. All you need to do is wrap your plot in a call to draw.

In [18]:
draw(PNG("myplot.png", 6inch, 3inch), plot(x=d_age, y=d_dbp, Geom.bar))

# or with a plot object:
p = plot(x=1:size(d_age,1), y=d_age, 
      Guide.xlabel("Respondent"), Guide.ylabel("Age"),
      Geom.errorbar, ymin=d_age-1.96*std(d_age), ymax=d_age+1.96*std(d_age),
      color=collect(d_sex), Guide.colorkey("Sex"),
      Geom.smooth, Geom.point)

draw(PDF("myplot.pdf", 6inch, 3inch), p)

Using DataFrames

A DataFrame is like a table or a spreadsheet: columns of data in an array, usually with headings at the top.

If you use headings then you will be able to choose the columns to plot by heading, which is a little easier than remembering all the column numbers. Gadfly will also use the headings as axis and key labels for the visualization unless you override them.

In [19]:
# we want to use Gadfly and DataFrames today
using Gadfly; using DataFrames
In [20]:
# First read your file into a dataframe with readtable
df = readtable("filename.csv")

# if the file's separator was a comma then you don't need to specify it
# if it was a tab separated file with no header row we would use:
#   mydat = readtable("filenameNoHeader.csv", separator='\t', header=false)

# let's just check what we read into df
print("size is ", size(df))
# and let's have a look at the first few rows and columns
#  because it's a single frame no tricks are needed to display it
df[1:3, 1:size(df,2)]
size is (50,7)
Out[20]:
3x7 DataFrame:
        IX Sex Age   sBP  dBP Drink   BMI
[1,]     0   1  39 106.0 70.0     0 26.97
[2,]     1   2  46 121.0 81.0     0 28.73
[3,]     2   1  48 127.5 80.0     1 25.34
In [21]:
# I'd prefer M and F to 2 and 1,  similarly Y and N for Drink
df["Sex"]=ifelse(df["Sex"].==1, "F", "M") 
df["Drink"]=ifelse(df["Drink"].==1, "Y", "N")
df[1:6, 1:size(df,2)]
Out[21]:
6x7 DataFrame:
        IX Sex Age   sBP   dBP Drink   BMI
[1,]     0 "F"  39 106.0  70.0   "N" 26.97
[2,]     1 "M"  46 121.0  81.0   "N" 28.73
[3,]     2 "F"  48 127.5  80.0   "Y" 25.34
[4,]     3 "M"  61 150.0  95.0   "Y" 28.58
[5,]     4 "M"  46 130.0  84.0   "Y"  23.1
[6,]     5 "M"  43 180.0 110.0   "N"  30.3
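The recoding step above can be tried on plain arrays as well. The following is a minimal sketch, assuming a recent Julia 1.x release in which ifelse is applied element by element with the dot syntax (ifelse.(...)); the notebook's own call without the dot belongs to the older Julia and DataFrames versions it was written for. The code arrays are stand-ins: the first three values of each appear in Out[20], and the rest are read back from the recoded rows in Out[21].

```julia
# Stand-in codes for the first six rows (see Out[20] and Out[21])
sex_codes   = [1, 2, 1, 2, 2, 2]   # 1 = female, 2 = male
drink_codes = [0, 0, 1, 1, 1, 0]   # 1 = drinks, 0 = does not

# Element-wise recode: pick the first label where the test is true, else the second
sex   = ifelse.(sex_codes .== 1, "F", "M")
drink = ifelse.(drink_codes .== 1, "Y", "N")

println(sex)
println(drink)
```

Because the test is against 1 only, any other code falls through to the second label, so it is worth checking the distinct values in a column before recoding it this way.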
In [22]:
# I'm interested in the age distribution so let's plot a histogram
plot(df, x="Age", Geom.histogram(bincount=6))
Out[22]:
In [23]:
# It's good practice to check coarse and fine histograms
plot(df, x=3, Geom.histogram(bincount=15))
# note that I just entered the column number instead of its heading
Out[23]:
In [24]:
# and let's look at box plots again, this time with the original two side by side
# plus a third by Drink (Sex and Drink were already recoded to letters in In [21])
hstack( plot(df, y="Age", Geom.boxplot), 
        plot(df, x="Sex", y="Age", Geom.boxplot),
        plot(df, x="Drink", y="Age", Geom.boxplot) 
)
[Figure: three box plots side by side - Age overall, Age by Sex (F, M), and Age by Drink (N, Y)]
Out[24]:
In [25]:
# It's quite a bit easier with dataframes.  Similarly we can display all stats
describe(df)
IX
Min      0.0
1st Qu.  12.25
Median   24.5
Mean     24.5
3rd Qu.  36.75
Max      49.0
NAs      0
NA%      0.0%

Sex
Length  50
Type    ASCIIString
NAs     0
NA%     0.0%
Unique  2

Age
Min      35.0
1st Qu.  42.0
Median   46.0
Mean     47.86
3rd Qu.  52.75
Max      65.0
NAs      0
NA%      0.0%

sBP
Min      96.0
1st Qu.  119.5
Median   131.5
Mean     133.5
3rd Qu.  145.75
Max      206.0
NAs      0
NA%      0.0%

dBP
Min      63.0
1st Qu.  76.375
Median   84.25
Mean     84.86
3rd Qu.  90.75
Max      121.0
NAs      0
NA%      0.0%

Drink
Length  50
Type    ASCIIString
NAs     0
NA%     0.0%
Unique  2

BMI
Min      18.59
1st Qu.  23.25
Median   26.15
Mean     26.374199999999995
3rd Qu.  28.472499999999997
Max      40.11
NAs      0
NA%      0.0%

In [26]:
# And the standard deviations
["Age" std(df["Age"]) "sBP" std(df["sBP"]) "dBP" std(df["dBP"]) "BMI" std(df["BMI"])]
Out[26]:
1x8 Array{Any,2}:
 "Age"  8.1941  "sBP"  22.6628  "dBP"  12.5921  "BMI"  4.41216
In [27]:
# ok let's do a scatter plot
# resize the plot to something larger
set_default_plot_size(20cm, 12cm)
plot(df, x="IX", y="sBP")
Out[27]:
In [28]:
# Having looked at the plot, we decide to add an estimated confidence interval
# and a loess smoothing of the data.  Plus let's make the labels more relevant.
# For the confidence intervals we add Geom.errorbar and calculate a min and max.
# For the smoothing we add Geom.smooth, but we also have to add Geom.point: although
#   it's the default, it is replaced as soon as any other Geom is given.  If we wanted
#   a line we could use Geom.line.
# Let's show which respondents are male and which are female with the color aesthetic.
# With color set, 2 smoothing lines are drawn (comment out the color line, the 4th
#   line of the plot call, and only one is drawn).
# Finally, notice that the plot command isn't on one line anymore - the parentheses contain it.

plot(df, x="IX", y="sBP",
  Guide.xlabel("Respondent"), Guide.ylabel("Blood Pressure"),
  Geom.errorbar, ymin=df["sBP"]-1.96*std(df["sBP"]), ymax=df["sBP"]+1.96*std(df["sBP"]),
  color="Sex", # Guide.colorkey("Sex"),
  Geom.smooth, Geom.point)
Out[28]:
In [29]:
# The other major chart type is the bar chart, comparing y with x, so let's
# compare blood pressure with age
set_default_plot_size(20cm, 12cm)
plot(df, x="Age", y="sBP", Geom.bar, Geom.smooth)
Out[29]:
In [30]:
# and using color to identify sex
plot(df, x="Age", y="sBP", color="Sex", Geom.bar(position=:dodge))
Out[30]:
In [31]:
# We don't have pie charts currently but you might prefer a normalized stacked bar chart anyway
#plot(x=d_age, y=d_dbp, Geom.normbar)
In [32]:
# It's easy to save your visualizations as png, pdf or ps files
#  draw(format("filename.formatsuffix", width, height),(plot object or command))
draw(PNG("myplot.png", 6inch, 3inch), plot(df, x="Age", y="dBP", Geom.bar))

# or with a plot object:
p = plot(df, x="IX", y="Age", 
      Guide.xlabel("Respondent"), # Guide.ylabel("Age"),
      Geom.errorbar, ymin=df["Age"]-1.96*std(df["Age"]), ymax=df["Age"]+1.96*std(df["Age"]),
      color="Sex", # Guide.colorkey("Sex"),
      Geom.smooth, Geom.point)

draw(PDF("myplot.pdf", 6inch, 3inch), p)

Appendix One - Installing and Updating

There are three components to install: Julia, IPython, and the Julia packages you need.

Instructions for installing Julia are here.

Instructions for installing IPython to support Julia are here.

Adding packages is simple. Here are a couple of lines to install the most likely packages (reinstallation does no harm if you're not sure what you have).

Pkg.add("IJulia") ; Pkg.add("DataFrames") ; Pkg.add("Gadfly")
Pkg.add("Stats") ; Pkg.add("GLM") ; Pkg.add("Distributions")

From time to time it pays to check that your packages are up to date. Do that now with:

Pkg.update()

That's it. If you have any issues, see the Julia page above, and if you're still confused then just ask at the Julia Users Group.