In this example we'll read data about prevalance of baby names in the US from a bunch of CSV files and try to analyze to data to figure out interesting trends from it.
require 'daru'
require 'gnuplotrb'
true
The data is contained in multiple CSV files on a per year basis.
Here's what a raw CSV file looks like for the year 1951. The first column is the name, second the sex and the third the number of births that took place with that name.
Find the data here
CSV.read("data/yob1951.txt").first(10)
[["Linda", "F", "73933"], ["Mary", "F", "65689"], ["Patricia", "F", "56422"], ["Deborah", "F", "42043"], ["Barbara", "F", "40588"], ["Susan", "F", "40207"], ["Nancy", "F", "30335"], ["Karen", "F", "27986"], ["Sandra", "F", "27656"], ["Kathleen", "F", "26703"]]
All the CSV files are loaded into a DataFrame, a new column 'year' added to them for identification, and then concatenated to produce one large DataFrame.
data_frame = Daru::DataFrame.from_csv("data/yob1951.txt",
headers: ['name', 'sex', 'births'])
data_frame['year'] = [1951] * data_frame.size
(1952..2014).each do |year|
temp = Daru::DataFrame.from_csv("data/yob#{year}.txt",
headers: ['name', 'sex', 'births'])
temp['year'] = [year] * temp.size
data_frame = data_frame.concat(temp)
end
data_frame.vectors = Daru::Index.new(['births','name', 'sex','year'])
data_frame
Daru::DataFrame:110707160 rows: 1353205 cols: 4 | ||||
---|---|---|---|---|
births | name | sex | year | |
0 | 73933 | Linda | F | 1951 |
1 | 65689 | Mary | F | 1951 |
2 | 56422 | Patricia | F | 1951 |
3 | 42043 | Deborah | F | 1951 |
4 | 40588 | Barbara | F | 1951 |
5 | 40207 | Susan | F | 1951 |
6 | 30335 | Nancy | F | 1951 |
7 | 27986 | Karen | F | 1951 |
8 | 27656 | Sandra | F | 1951 |
9 | 26703 | Kathleen | F | 1951 |
10 | 24773 | Carol | F | 1951 |
11 | 24070 | Donna | F | 1951 |
12 | 23927 | Sharon | F | 1951 |
13 | 21700 | Brenda | F | 1951 |
14 | 19835 | Diane | F | 1951 |
15 | 18559 | Pamela | F | 1951 |
16 | 17683 | Margaret | F | 1951 |
17 | 17061 | Debra | F | 1951 |
18 | 16949 | Janet | F | 1951 |
19 | 16289 | Cynthia | F | 1951 |
20 | 15947 | Janice | F | 1951 |
21 | 15187 | Carolyn | F | 1951 |
22 | 15103 | Elizabeth | F | 1951 |
23 | 15016 | Christine | F | 1951 |
24 | 14929 | Judith | F | 1951 |
25 | 14005 | Judy | F | 1951 |
26 | 13909 | Shirley | F | 1951 |
27 | 13468 | Joyce | F | 1951 |
28 | 12818 | Betty | F | 1951 |
29 | 11987 | Cheryl | F | 1951 |
30 | 10947 | Gloria | F | 1951 |
31 | 10930 | Rebecca | F | 1951 |
... | ... | ... | ... | ... |
1353204 | 5 | Zyrin | M | 2014 |
Pivot the DataFrame on the year as the row and sex as the column, using the 'births' column for aggregation.
This tells us the number of male and female births per year.
pivoted = data_frame.pivot_table(
index: ['year'], vectors: ['sex'], agg: :sum, values: 'births')
Daru::DataFrame:132293920 rows: 64 cols: 2 | ||
---|---|---|
["births", "F"] | ["births", "M"] | |
[1951] | 1800042 | 1881080 |
[1952] | 1854698 | 1944277 |
[1953] | 1880326 | 1969777 |
[1954] | 1941682 | 2037374 |
[1955] | 1954664 | 2057918 |
[1956] | 2007512 | 2113694 |
[1957] | 2044160 | 2155866 |
[1958] | 2010884 | 2120712 |
[1959] | 2023044 | 2133509 |
[1960] | 2022093 | 2132717 |
[1961] | 2017316 | 2122502 |
[1962] | 1966548 | 2068945 |
[1963] | 1927217 | 2031755 |
[1964] | 1894594 | 1993270 |
[1965] | 1765001 | 1861378 |
[1966] | 1691868 | 1783964 |
[1967] | 1650764 | 1744527 |
[1968] | 1640103 | 1738928 |
[1969] | 1686947 | 1789732 |
[1970] | 1748147 | 1859594 |
[1971] | 1663475 | 1769201 |
[1972] | 1521185 | 1622666 |
[1973] | 1458139 | 1559338 |
[1974] | 1467413 | 1573105 |
[1975] | 1457699 | 1562207 |
[1976] | 1465096 | 1569904 |
[1977] | 1532997 | 1643684 |
[1978] | 1531658 | 1642250 |
[1979] | 1605051 | 1721947 |
[1980] | 1659933 | 1783876 |
[1981] | 1667465 | 1790907 |
[1982] | 1692678 | 1813970 |
... | ... | ... |
[2014] | 1768775 | 1901376 |
The row and column names are rather inconvienient so we rename them into something better.
pivoted.index = Daru::DateTimeIndex.date_range(:start => '1951', :periods => pivoted.size, freq: 'YEAR')
pivoted.vectors = Daru::Index.new(['F', 'M'])
pivoted
Daru::DataFrame:132293920 rows: 64 cols: 2 | ||
---|---|---|
F | M | |
1951-01-01T00:00:00+00:00 | 1800042 | 1881080 |
1952-01-01T00:00:00+00:00 | 1854698 | 1944277 |
1953-01-01T00:00:00+00:00 | 1880326 | 1969777 |
1954-01-01T00:00:00+00:00 | 1941682 | 2037374 |
1955-01-01T00:00:00+00:00 | 1954664 | 2057918 |
1956-01-01T00:00:00+00:00 | 2007512 | 2113694 |
1957-01-01T00:00:00+00:00 | 2044160 | 2155866 |
1958-01-01T00:00:00+00:00 | 2010884 | 2120712 |
1959-01-01T00:00:00+00:00 | 2023044 | 2133509 |
1960-01-01T00:00:00+00:00 | 2022093 | 2132717 |
1961-01-01T00:00:00+00:00 | 2017316 | 2122502 |
1962-01-01T00:00:00+00:00 | 1966548 | 2068945 |
1963-01-01T00:00:00+00:00 | 1927217 | 2031755 |
1964-01-01T00:00:00+00:00 | 1894594 | 1993270 |
1965-01-01T00:00:00+00:00 | 1765001 | 1861378 |
1966-01-01T00:00:00+00:00 | 1691868 | 1783964 |
1967-01-01T00:00:00+00:00 | 1650764 | 1744527 |
1968-01-01T00:00:00+00:00 | 1640103 | 1738928 |
1969-01-01T00:00:00+00:00 | 1686947 | 1789732 |
1970-01-01T00:00:00+00:00 | 1748147 | 1859594 |
1971-01-01T00:00:00+00:00 | 1663475 | 1769201 |
1972-01-01T00:00:00+00:00 | 1521185 | 1622666 |
1973-01-01T00:00:00+00:00 | 1458139 | 1559338 |
1974-01-01T00:00:00+00:00 | 1467413 | 1573105 |
1975-01-01T00:00:00+00:00 | 1457699 | 1562207 |
1976-01-01T00:00:00+00:00 | 1465096 | 1569904 |
1977-01-01T00:00:00+00:00 | 1532997 | 1643684 |
1978-01-01T00:00:00+00:00 | 1531658 | 1642250 |
1979-01-01T00:00:00+00:00 | 1605051 | 1721947 |
1980-01-01T00:00:00+00:00 | 1659933 | 1783876 |
1981-01-01T00:00:00+00:00 | 1667465 | 1790907 |
1982-01-01T00:00:00+00:00 | 1692678 | 1813970 |
... | ... | ... |
2014-01-01T00:00:00+00:00 | 1768775 | 1901376 |
The number of male births vs. female births can then be plotted against each other using the GnuplotRB gem.
GnuplotRB::Plot.new(
[pivoted['F'], with: 'lines', title: 'F'],
[pivoted['M'], with: 'lines', title: 'M'], title: 'Total births by sex and year')
To further prod into the data, lets perform an SQL style GROUP BY operation on the DataFrame on the 'year' column so that the DataFrame is divided into groups according to year.
groups_by_year = data_frame.group_by(['year'])
nil
This code interates over all the groups created by year and selects the rows that contain a particular name in them from.
In this manner we collect rows from every group on a per year basis and create a DataFrame for each name which tells the number of births of a name during a particular year.
For this example we'll choose the names James, Robert, Jessica and Sophia.
pieces = []
['James', 'Robert', 'Jessica', 'Sophia'].each do |name|
rows = []
groups_by_year.each_group do |group|
rows << group.row[group['name'].index_of(name)]
end
pieces << Daru::DataFrame.rows(rows)
end
["James", "Robert", "Jessica", "Sophia"]
Perform some basic preprocessing/cleaning on the DataFrame.
pieces.each do |df|
df['year'].map! {|e| DateTime.new(e) }
df.set_index('year')
df.rename df['name'][0]
df.delete_vector 'name'
end
[ #<Daru::DataFrame:257429020 @name = James @size = 64> births sex 1951-01-01 259 F 1952-01-01 261 F 1953-01-01 237 F 1954-01-01 226 F 1955-01-01 246 F 1956-01-01 249 F 1957-01-01 281 F 1958-01-01 252 F 1959-01-01 296 F 1960-01-01 288 F 1961-01-01 286 F 1962-01-01 262 F 1963-01-01 311 F 1964-01-01 317 F 1965-01-01 303 F ... ... ... , #<Daru::DataFrame:83461800 @name = Robert @size = 64> births sex 1951-01-01 208 F 1952-01-01 195 F 1953-01-01 228 F 1954-01-01 191 F 1955-01-01 235 F 1956-01-01 214 F 1957-01-01 255 F 1958-01-01 235 F 1959-01-01 237 F 1960-01-01 271 F 1961-01-01 289 F 1962-01-01 229 F 1963-01-01 256 F 1964-01-01 278 F 1965-01-01 255 F ... ... ... , #<Daru::DataFrame:350000980 @name = Jessica @size = 64> births sex 1951-01-01 466 F 1952-01-01 451 F 1953-01-01 495 F 1954-01-01 423 F 1955-01-01 386 F 1956-01-01 406 F 1957-01-01 476 F 1958-01-01 529 F 1959-01-01 523 F 1960-01-01 559 F 1961-01-01 669 F 1962-01-01 867 F 1963-01-01 1120 F 1964-01-01 1172 F 1965-01-01 1530 F ... ... ... , #<Daru::DataFrame:128568600 @name = Sophia @size = 64> births sex 1951-01-01 153 F 1952-01-01 111 F 1953-01-01 131 F 1954-01-01 112 F 1955-01-01 152 F 1956-01-01 121 F 1957-01-01 187 F 1958-01-01 227 F 1959-01-01 275 F 1960-01-01 262 F 1961-01-01 324 F 1962-01-01 485 F 1963-01-01 523 F 1964-01-01 470 F 1965-01-01 507 F ... ... ... ]
plots = []
pieces.each do |df|
plot = GnuplotRB::Plot.new([
df['births'], with: 'lines', title: df.name])
plot.format_x = '%Y'
plots << plot
end
GnuplotRB::Multiplot.new(*plots).tap do |mp|
mp.layout = [2,2]
mp.format_x = '%Y'
mp.xtics = 'nomirror rotate by -45'
mp.title = 'Prevalence of certain names according to year'
mp.xlabel = 'Year'
mp.ylabel = 'Occurences'
mp.xrange = '"1945-01-01":"2016-01-01"'
end