Notebook

Daru Use Case: Analyzing baby names over a period of time.¶

In this example we'll read data about prevalance of baby names in the US from a bunch of CSV files and try to analyze to data to figure out interesting trends from it.

In [1]:

require 'daru'
require 'gnuplotrb'

Out[1]:

true

The data is contained in multiple CSV files on a per year basis.

Here's what a raw CSV file looks like for the year 1951. The first column is the name, second the sex and the third the number of births that took place with that name.

Find the data here

In [2]:

CSV.read("data/yob1951.txt").first(10)

Out[2]:

[["Linda", "F", "73933"], ["Mary", "F", "65689"], ["Patricia", "F", "56422"], ["Deborah", "F", "42043"], ["Barbara", "F", "40588"], ["Susan", "F", "40207"], ["Nancy", "F", "30335"], ["Karen", "F", "27986"], ["Sandra", "F", "27656"], ["Kathleen", "F", "26703"]]

All the CSV files are loaded into a DataFrame, a new column 'year' added to them for identification, and then concatenated to produce one large DataFrame.

In [3]:

data_frame = Daru::DataFrame.from_csv("data/yob1951.txt", 
  headers: ['name', 'sex', 'births'])
data_frame['year'] = [1951] * data_frame.size

(1952..2014).each do |year|
  temp = Daru::DataFrame.from_csv("data/yob#{year}.txt", 
    headers: ['name', 'sex', 'births'])
  temp['year'] = [year] * temp.size
  data_frame = data_frame.concat(temp)
end
data_frame.vectors = Daru::Index.new(['births','name', 'sex','year'])

data_frame

Out[3]:

Daru::DataFrame:110707160 rows: 1353205 cols: 4
	births	name	sex	year
0	73933	Linda	F	1951
1	65689	Mary	F	1951
2	56422	Patricia	F	1951
3	42043	Deborah	F	1951
4	40588	Barbara	F	1951
5	40207	Susan	F	1951
6	30335	Nancy	F	1951
7	27986	Karen	F	1951
8	27656	Sandra	F	1951
9	26703	Kathleen	F	1951
10	24773	Carol	F	1951
11	24070	Donna	F	1951
12	23927	Sharon	F	1951
13	21700	Brenda	F	1951
14	19835	Diane	F	1951
15	18559	Pamela	F	1951
16	17683	Margaret	F	1951
17	17061	Debra	F	1951
18	16949	Janet	F	1951
19	16289	Cynthia	F	1951
20	15947	Janice	F	1951
21	15187	Carolyn	F	1951
22	15103	Elizabeth	F	1951
23	15016	Christine	F	1951
24	14929	Judith	F	1951
25	14005	Judy	F	1951
26	13909	Shirley	F	1951
27	13468	Joyce	F	1951
28	12818	Betty	F	1951
29	11987	Cheryl	F	1951
30	10947	Gloria	F	1951
31	10930	Rebecca	F	1951
...	...	...	...	...
1353204	5	Zyrin	M	2014

Pivot the DataFrame on the year as the row and sex as the column, using the 'births' column for aggregation.

This tells us the number of male and female births per year.

In [4]:

pivoted = data_frame.pivot_table(
  index: ['year'], vectors: ['sex'], agg: :sum, values: 'births')

Out[4]:

Daru::DataFrame:132293920 rows: 64 cols: 2
	["births", "F"]	["births", "M"]
[1951]	1800042	1881080
[1952]	1854698	1944277
[1953]	1880326	1969777
[1954]	1941682	2037374
[1955]	1954664	2057918
[1956]	2007512	2113694
[1957]	2044160	2155866
[1958]	2010884	2120712
[1959]	2023044	2133509
[1960]	2022093	2132717
[1961]	2017316	2122502
[1962]	1966548	2068945
[1963]	1927217	2031755
[1964]	1894594	1993270
[1965]	1765001	1861378
[1966]	1691868	1783964
[1967]	1650764	1744527
[1968]	1640103	1738928
[1969]	1686947	1789732
[1970]	1748147	1859594
[1971]	1663475	1769201
[1972]	1521185	1622666
[1973]	1458139	1559338
[1974]	1467413	1573105
[1975]	1457699	1562207
[1976]	1465096	1569904
[1977]	1532997	1643684
[1978]	1531658	1642250
[1979]	1605051	1721947
[1980]	1659933	1783876
[1981]	1667465	1790907
[1982]	1692678	1813970
...	...	...
[2014]	1768775	1901376

The row and column names are rather inconvienient so we rename them into something better.

In [5]:

pivoted.index = Daru::DateTimeIndex.date_range(:start => '1951', :periods => pivoted.size, freq: 'YEAR')
pivoted.vectors = Daru::Index.new(['F', 'M'])
pivoted

Out[5]:

Daru::DataFrame:132293920 rows: 64 cols: 2
	F	M
1951-01-01T00:00:00+00:00	1800042	1881080
1952-01-01T00:00:00+00:00	1854698	1944277
1953-01-01T00:00:00+00:00	1880326	1969777
1954-01-01T00:00:00+00:00	1941682	2037374
1955-01-01T00:00:00+00:00	1954664	2057918
1956-01-01T00:00:00+00:00	2007512	2113694
1957-01-01T00:00:00+00:00	2044160	2155866
1958-01-01T00:00:00+00:00	2010884	2120712
1959-01-01T00:00:00+00:00	2023044	2133509
1960-01-01T00:00:00+00:00	2022093	2132717
1961-01-01T00:00:00+00:00	2017316	2122502
1962-01-01T00:00:00+00:00	1966548	2068945
1963-01-01T00:00:00+00:00	1927217	2031755
1964-01-01T00:00:00+00:00	1894594	1993270
1965-01-01T00:00:00+00:00	1765001	1861378
1966-01-01T00:00:00+00:00	1691868	1783964
1967-01-01T00:00:00+00:00	1650764	1744527
1968-01-01T00:00:00+00:00	1640103	1738928
1969-01-01T00:00:00+00:00	1686947	1789732
1970-01-01T00:00:00+00:00	1748147	1859594
1971-01-01T00:00:00+00:00	1663475	1769201
1972-01-01T00:00:00+00:00	1521185	1622666
1973-01-01T00:00:00+00:00	1458139	1559338
1974-01-01T00:00:00+00:00	1467413	1573105
1975-01-01T00:00:00+00:00	1457699	1562207
1976-01-01T00:00:00+00:00	1465096	1569904
1977-01-01T00:00:00+00:00	1532997	1643684
1978-01-01T00:00:00+00:00	1531658	1642250
1979-01-01T00:00:00+00:00	1605051	1721947
1980-01-01T00:00:00+00:00	1659933	1783876
1981-01-01T00:00:00+00:00	1667465	1790907
1982-01-01T00:00:00+00:00	1692678	1813970
...	...	...
2014-01-01T00:00:00+00:00	1768775	1901376

The number of male births vs. female births can then be plotted against each other using the GnuplotRB gem.

In [6]:

GnuplotRB::Plot.new(
  [pivoted['F'], with: 'lines', title: 'F'],
  [pivoted['M'], with: 'lines', title: 'M'], title: 'Total births by sex and year')

Out[6]:

To further prod into the data, lets perform an SQL style GROUP BY operation on the DataFrame on the 'year' column so that the DataFrame is divided into groups according to year.

In [7]:

groups_by_year = data_frame.group_by(['year'])
nil

This code interates over all the groups created by year and selects the rows that contain a particular name in them from.

In this manner we collect rows from every group on a per year basis and create a DataFrame for each name which tells the number of births of a name during a particular year.

For this example we'll choose the names James, Robert, Jessica and Sophia.

In [8]:

pieces = []
['James', 'Robert', 'Jessica', 'Sophia'].each do |name|
  rows = []
  groups_by_year.each_group do |group|
    rows << group.row[group['name'].index_of(name)]
  end
  pieces << Daru::DataFrame.rows(rows)
end

Out[8]:

["James", "Robert", "Jessica", "Sophia"]

Perform some basic preprocessing/cleaning on the DataFrame.

In [9]:

pieces.each do |df|
  df['year'].map! {|e| DateTime.new(e) }
  df.set_index('year')
  df.rename df['name'][0]
  df.delete_vector 'name'
end

Out[9]:

[
#<Daru::DataFrame:257429020 @name = James @size = 64>
               births        sex 
1951-01-01        259          F 
1952-01-01        261          F 
1953-01-01        237          F 
1954-01-01        226          F 
1955-01-01        246          F 
1956-01-01        249          F 
1957-01-01        281          F 
1958-01-01        252          F 
1959-01-01        296          F 
1960-01-01        288          F 
1961-01-01        286          F 
1962-01-01        262          F 
1963-01-01        311          F 
1964-01-01        317          F 
1965-01-01        303          F 
       ...        ...        ... 
, 
#<Daru::DataFrame:83461800 @name = Robert @size = 64>
               births        sex 
1951-01-01        208          F 
1952-01-01        195          F 
1953-01-01        228          F 
1954-01-01        191          F 
1955-01-01        235          F 
1956-01-01        214          F 
1957-01-01        255          F 
1958-01-01        235          F 
1959-01-01        237          F 
1960-01-01        271          F 
1961-01-01        289          F 
1962-01-01        229          F 
1963-01-01        256          F 
1964-01-01        278          F 
1965-01-01        255          F 
       ...        ...        ... 
, 
#<Daru::DataFrame:350000980 @name = Jessica @size = 64>
               births        sex 
1951-01-01        466          F 
1952-01-01        451          F 
1953-01-01        495          F 
1954-01-01        423          F 
1955-01-01        386          F 
1956-01-01        406          F 
1957-01-01        476          F 
1958-01-01        529          F 
1959-01-01        523          F 
1960-01-01        559          F 
1961-01-01        669          F 
1962-01-01        867          F 
1963-01-01       1120          F 
1964-01-01       1172          F 
1965-01-01       1530          F 
       ...        ...        ... 
, 
#<Daru::DataFrame:128568600 @name = Sophia @size = 64>
               births        sex 
1951-01-01        153          F 
1952-01-01        111          F 
1953-01-01        131          F 
1954-01-01        112          F 
1955-01-01        152          F 
1956-01-01        121          F 
1957-01-01        187          F 
1958-01-01        227          F 
1959-01-01        275          F 
1960-01-01        262          F 
1961-01-01        324          F 
1962-01-01        485          F 
1963-01-01        523          F 
1964-01-01        470          F 
1965-01-01        507          F 
       ...        ...        ... 
]

In [10]:

plots = []
pieces.each do |df|
  plot = GnuplotRB::Plot.new([
    df['births'], with: 'lines', title: df.name])
  plot.format_x = '%Y'
  plots << plot
end

GnuplotRB::Multiplot.new(*plots).tap do |mp|
  mp.layout = [2,2]
  mp.format_x = '%Y'
  mp.xtics = 'nomirror rotate by -45'
  mp.title = 'Prevalence of certain names according to year'
  mp.xlabel = 'Year'
  mp.ylabel = 'Occurences'
  mp.xrange = '"1945-01-01":"2016-01-01"'
end

Out[10]: