Data science intro with panthera¶

Clojure + Pandas + Numpy = 💖¶

I'll show how to get the most out of the Pandas and Clojure ecosystems at the same time.

This intro is based on this Kaggle notebook; you can follow along with that one if you come from the Python world.

Env setup¶

The easiest way to go is the provided Docker image, but if you want to set up your own machine just follow along.

System install¶

If you want to install everything at the system level you should do something equivalent to what we do below:

sudo apt-get update
sudo apt-get install libpython3.6-dev
pip3 install numpy pandas

conda¶

To work within a conda environment just create a new one with:

conda create -n panthera python=3.6 numpy pandas
conda activate panthera

Then start your REPL from the activated conda environment. This is the best way to install the requirements for panthera, because in the process you also get MKL along with Numpy.

Here¶

Let's just add panthera to our classpath and we're good to go!

In [1]:

(require '[clojupyter.misc.helper :as helper])
(helper/add-dependencies '[panthera "0.1-alpha.11"])
:ok

Out[1]:

:ok

Now require panthera's main API namespace and define a little helper to better inspect data-frames:

In [ ]:

(require '[panthera.panthera :as pt])

In [3]:

(require '[clojupyter.display :as display])
(require '[libpython-clj.python :as py])

(defn show
  [obj]
  (display/html
    (py/call-attr obj "to_html")))

Out[3]:

#'user/show

In [4]:

(helper/add-dependencies '[metasoarous/oz "1.5.4"])
(require '[oz.notebook.clojupyter :as oz])

Out[4]:

nil

A brief primer¶

We will work with Pokemons! Datasets are available here.

We can read data into panthera from various formats; one of the most used readers is read-csv. Most panthera functions accept a data-frame or a series as the first argument, then one or more required arguments, and finally a map of options.

To see which options are available you can check the docs, or even the original Pandas docs. Just remember that keywords are converted to their Python names automatically (for example :index-col becomes index_col), while strings are passed as they are, so you have to use the original Python name.
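As a rough illustration of that keyword rule, here is a minimal sketch in plain Clojure. kw->py-name is a hypothetical helper written only for this example, not panthera's actual implementation; it assumes nothing more than "hyphens become underscores, strings pass through untouched".

```clojure
(require '[clojure.string :as str])

(defn kw->py-name
  "Hypothetical helper: turns a keyword such as :index-col into the
  Python-style name \"index_col\". Strings are returned unchanged."
  [k]
  (if (keyword? k)
    (str/replace (name k) "-" "_")
    k))

(kw->py-name :index-col)  ; => "index_col"
(kw->py-name "index_col") ; => "index_col"
(kw->py-name :nrows)      ; => "nrows"
```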

Below as an example we read-csv our file, but we want to get only the first 10 rows, so we pass a map to the function like {:nrows 10}.

In [5]:

(show (pt/read-csv "../resources/pokemon.csv" {:nrows 10}))

Out[5]:

	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	60	62	63	80	80	60	1	False
2	3	Venusaur	Grass	Poison	80	82	83	100	100	80	1	False
3	4	Mega Venusaur	Grass	Poison	80	100	123	122	120	80	1	False
4	5	Charmander	Fire	NaN	39	52	43	60	50	65	1	False
5	6	Charmeleon	Fire	NaN	58	64	58	80	65	80	1	False
6	7	Charizard	Fire	Flying	78	84	78	109	85	100	1	False
7	8	Mega Charizard X	Fire	Dragon	78	130	111	130	85	100	1	False
8	9	Mega Charizard Y	Fire	Flying	78	104	78	159	115	100	1	False
9	10	Squirtle	Water	NaN	44	48	65	50	64	43	1	False

The cool thing is that we can chain operations: the thread-first macro (->) is our friend!

Below we read the whole csv, get the correlation matrix and then show it:

In [6]:

(-> (pt/read-csv "../resources/pokemon.csv")
    pt/corr
    show)

Out[6]:

	#	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
#	1.000000	0.097712	0.102664	0.094691	0.089199	0.085596	0.012181	0.983428	0.154336
HP	0.097712	1.000000	0.422386	0.239622	0.362380	0.378718	0.175952	0.058683	0.273620
Attack	0.102664	0.422386	1.000000	0.438687	0.396362	0.263990	0.381240	0.051451	0.345408
Defense	0.094691	0.239622	0.438687	1.000000	0.223549	0.510747	0.015227	0.042419	0.246377
Sp. Atk	0.089199	0.362380	0.396362	0.223549	1.000000	0.506121	0.473018	0.036437	0.448907
Sp. Def	0.085596	0.378718	0.263990	0.510747	0.506121	1.000000	0.259133	0.028486	0.363937
Speed	0.012181	0.175952	0.381240	0.015227	0.473018	0.259133	1.000000	-0.023121	0.326715
Generation	0.983428	0.058683	0.051451	0.042419	0.036437	0.028486	-0.023121	1.000000	0.079794
Legendary	0.154336	0.273620	0.345408	0.246377	0.448907	0.363937	0.326715	0.079794	1.000000
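For readers new to Clojure, the sketch below shows what the thread-first macro does. It uses plain numbers instead of data-frames so that it runs without panthera; the names threaded and nested exist only for this example.

```clojure
;; -> inserts the previous result as the FIRST argument of each following form.
(def threaded
  (-> 5
      (+ 3)   ; (+ 5 3)  => 8
      (* 2)   ; (* 8 2)  => 16
      str))   ; (str 16) => "16"

;; The same computation written inside-out, without the macro.
(def nested
  (str (* (+ 5 3) 2)))

(= threaded nested) ; => true
```

In the same way, the cell above is just another way of writing (show (pt/corr (pt/read-csv "../resources/pokemon.csv"))).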

Since we'll be using pokemon.csv a lot, let's give it a name. defonce is great here, because re-evaluating the cell won't read the file again:

In [7]:

(defonce pokemon (pt/read-csv "../resources/pokemon.csv"))

Out[7]:

#'user/pokemon

Let's see how plotting goes:

In [8]:

(defn heatmap 
  [data x y z]
  {:data {:values data}
   :width 500
   :height 500
   :encoding {:x {:field x
                  :type "nominal"}
              :y {:field y
                  :type "nominal"}}
   :layer [{:mark "rect"
            :encoding {:color {:field z
                               :type "quantitative"}}}
           {:mark "text"
            :encoding {:text {:field z
                              :type "quantitative"
                              :format ".2f"}
                       :color {:value "white"}}}]})

Out[8]:

#'user/heatmap

In [9]:

(-> pokemon
    pt/corr
    pt/reset-index
    (pt/melt {:id-vars :index})
    pt/->clj
    (heatmap :index :variable :value)
    oz/view!)

Out[9]:

What we did was plot a heatmap of the correlation matrix shown above. Don't worry too much about all the steps we took; we'll go through them one by one later on!

What if we have already read our data but want to see only the first few rows? We have the head function for that:

In [10]:

(show (pt/head pokemon))

Out[10]:

	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	60	62	63	80	80	60	1	False
2	3	Venusaur	Grass	Poison	80	82	83	100	100	80	1	False
3	4	Mega Venusaur	Grass	Poison	80	100	123	122	120	80	1	False
4	5	Charmander	Fire	NaN	39	52	43	60	50	65	1	False

In [11]:

(show (pt/head pokemon 10))

Out[11]:

	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	60	62	63	80	80	60	1	False
2	3	Venusaur	Grass	Poison	80	82	83	100	100	80	1	False
3	4	Mega Venusaur	Grass	Poison	80	100	123	122	120	80	1	False
4	5	Charmander	Fire	NaN	39	52	43	60	50	65	1	False
5	6	Charmeleon	Fire	NaN	58	64	58	80	65	80	1	False
6	7	Charizard	Fire	Flying	78	84	78	109	85	100	1	False
7	8	Mega Charizard X	Fire	Dragon	78	130	111	130	85	100	1	False
8	9	Mega Charizard Y	Fire	Flying	78	104	78	159	115	100	1	False
9	10	Squirtle	Water	NaN	44	48	65	50	64	43	1	False

Another nice thing we can do is get the column names:

In [12]:

(pt/names pokemon)

Out[12]:

Index(['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',
       'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')

When you see output like the above, it means that the data is still a Python object. That's fine as long as you keep working within panthera, but what if you want to do something with the column names in Clojure?

In [13]:

(vec (pt/names pokemon))

Out[13]:

["#" "Name" "Type 1" "Type 2" "HP" "Attack" "Defense" "Sp. Atk" "Sp. Def" "Speed" "Generation" "Legendary"]

That's it! Just call vec and now you have a nice Clojure vector that you can work with.

N.B.: many Python objects can be treated directly as the corresponding Clojure collections. For instance, in this case we can do something like the following:

In [14]:

(doseq [a (pt/names pokemon)] (println a))

#
Name
Type 1
Type 2
HP
Attack
Defense
Sp. Atk
Sp. Def
Speed
Generation
Legendary

Out[14]:

nil

Some plotting¶

Plotting is a nice way to learn how to munge data: you get fast visual feedback, and the results are usually nice to look at!

Let's plot Speed and Defense:

In [15]:

(defn line-plot
  [data x y & [color]]
  (let [spec {:data {:values data}
              :mark "line"
              :width 600
              :height 300
              :encoding {:x {:field x
                             :type "quantitative"}
                         :y {:field y
                             :type "quantitative"}
                         :color {}}}]
    (if color
      (assoc-in spec [:encoding :color] {:field color
                                         :type "nominal"})
      (assoc-in spec [:encoding :color] {:value "blue"}))))

Out[15]:

#'user/line-plot

In [16]:

(-> pokemon
    (pt/subset-cols :# :Speed :Defense)
    (pt/melt {:id-vars :#})
    pt/->clj
    (line-plot :# :value :variable)
    oz/view!)

Out[16]:

Let's look at the operations above:

subset-cols: we use this to, well, subset columns. We can choose N columns by label, and we get a 'new' data-frame with only the selected columns
melt: this transforms the data-frame from wide to long format (for more info about it see further below)
->clj: this turns data-frames and series into a Clojure vector of maps

subset-cols is pretty straightforward:

In [17]:

(-> pokemon (pt/subset-cols :Speed :Attack) pt/head show)

Out[17]:

	Speed	Attack
0	45	49
1	60	62
2	80	82
3	80	100
4	65	52

In [18]:

(-> pokemon (pt/subset-cols :Speed :Attack :HP :#) pt/head show)

Out[18]:

	Speed	Attack	HP	#
0	45	49	45	1
1	60	62	60	2
2	80	82	80	3
3	80	100	80	4
4	65	52	39	5

In [19]:

(-> pokemon (pt/subset-cols :# :Attack) pt/head)

Out[19]:

   #  Attack
0  1      49
1  2      62
2  3      82
3  4     100
4  5      52

->clj tries to work out the best way to turn panthera data structures into Clojure ones. Note in the outputs below that column names come back as lower-case keywords:

In [20]:

(-> pokemon (pt/subset-cols :Speed) pt/head pt/->clj)

Out[20]:

[{:speed 45} {:speed 60} {:speed 80} {:speed 80} {:speed 65}]

In [21]:

(-> pokemon (pt/subset-cols :Speed :HP) pt/head pt/->clj)

Out[21]:

[{:speed 45, :hp 45} {:speed 60, :hp 60} {:speed 80, :hp 80} {:speed 80, :hp 80} {:speed 65, :hp 39}]

Now we want to see what happens when we plot Attack vs Defense:

In [22]:

(defn scatter
  [data x y & [color]]
  (let [spec {:data {:values data}
              :mark "point"
              :width 600
              :height 300
              :encoding {:x {:field x
                             :type "quantitative"}
                         :y {:field y
                             :type "quantitative"}
                         :color {}}}]
    (if color
      (assoc-in spec [:encoding :color] {:field color
                                         :type "nominal"})
      (assoc-in spec [:encoding :color] {:value "dodgerblue"}))))

Out[22]:

#'user/scatter

In [23]:

(-> pokemon
    (pt/subset-cols :Attack :Defense)
    pt/->clj
    (scatter :attack :defense)
    oz/view!)

Out[23]:

And now the Speed histogram:

In [24]:

(defn hist
  [data x & [color]]
  (let [spec {:data {:values data}
              :mark "bar"
              :width 600
              :height 300
              :encoding {:x {:field x
                             :bin {:maxbins 50}
                             :type "quantitative"}
                         :y {:aggregate "count"
                             :type "quantitative"}
                         :color {}}}]
    (if color
      (assoc-in spec [:encoding :color] {:field color
                                         :type "nominal"})
      (assoc-in spec [:encoding :color] {:value "dodgerblue"}))))

Out[24]:

#'user/hist

In [25]:

(-> pokemon
    (pt/subset-cols :Speed)
    pt/->clj
    (hist :speed)
    oz/view!)

Out[25]:

Data-frames basics¶

Creation¶

How do we create data-frames? Above we read a csv, but what if the data we want to work with is already in the runtime? Nothing could be easier:

In [26]:

(show (pt/data-frame [{:a 1 :b 2} {:a 3 :b 4}]))

Out[26]:

	a	b
0	1	2
1	3	4

What if we don't care about column names, or we'd prefer to add them to an already generated data-frame?

In [27]:

(show (pt/data-frame (to-array-2d [[1 2] [3 4]])))

Out[27]:

	0	1
0	1	2
1	3	4

Columns of data-frames are just series:

In [28]:

(-> pokemon (pt/subset-cols "Defense") pt/pytype)

Out[28]:

:series

In [29]:

(pt/series [1 2 3])

Out[29]:

0    1
1    2
2    3
dtype: int64

The column name is the name of the series:

In [30]:

(pt/series [1 2 3] {:name :my-series})

Out[30]:

0    1
1    2
2    3
Name: my-series, dtype: int64

Filtering¶

One of the most straightforward ways to filter data-frames is with booleans. We have filter-rows, which takes either a series of booleans or a function that generates one:

In [31]:

(-> pokemon
    (pt/filter-rows #(-> % (pt/subset-cols "Defense") (pt/gt 200)))
    show)

Out[31]:

	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
224	225	Mega Steelix	Steel	Ground	75	125	230	55	95	30	2	False
230	231	Shuckle	Bug	Rock	20	10	230	10	230	5	2	False
333	334	Mega Aggron	Steel	NaN	70	140	230	60	80	50	3	False

gt is exactly what you think it is: >. Check the Basic concepts notebook to better understand how math works in panthera.

Now we have to bring Numpy into the equation. Let's say we want to filter the data-frame on 2 conditions at the same time. We can do that using npy:

In [32]:

(require '[panthera.numpy :refer [npy]])

Out[32]:

nil

In [33]:

(defn my-filter
  [col1 col2]
  (npy :logical-and 
       {:args [(-> pokemon
                   (pt/subset-cols col1)
                   (pt/gt 200))
               (-> pokemon
                   (pt/subset-cols col2)
                   (pt/gt 100))]}))

Out[33]:

#'user/my-filter

In [34]:

(-> pokemon
    (pt/filter-rows (my-filter :Defense :Attack))
    show)

Out[34]:

	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
224	225	Mega Steelix	Steel	Ground	75	125	230	55	95	30	2	False
333	334	Mega Aggron	Steel	NaN	70	140	230	60	80	50	3	False
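To see what is going on conceptually, here is a minimal sketch of boolean-mask filtering in plain Clojure. It is only an analogy: rows is a small hand-copied sample of the data shown in this notebook, filter-by-mask is a hypothetical helper, and Numpy's logical_and works on whole arrays rather than on Clojure vectors.

```clojure
(def rows
  [{:name "Mega Steelix" :attack 125 :defense 230}
   {:name "Shuckle"      :attack 10  :defense 230}
   {:name "Bulbasaur"    :attack 49  :defense 49}])

;; One boolean per row, playing the role of the series returned by pt/gt.
(def defense-mask (mapv #(> (:defense %) 200) rows)) ; [true true false]
(def attack-mask  (mapv #(> (:attack %) 100) rows))  ; [true false false]

;; Element-wise AND of the two masks.
(def both-mask (mapv #(and %1 %2) defense-mask attack-mask)) ; [true false false]

(defn filter-by-mask
  "Hypothetical helper: keeps the rows whose matching mask entry is true."
  [rows mask]
  (vec (for [[row keep?] (map vector rows mask)
             :when keep?]
         row)))

(mapv :name (filter-by-mask rows both-mask)) ; => ["Mega Steelix"]
```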

panthera.numpy works a little differently from the rest of panthera: usually npy is all you need to access any numpy function.

For instance:

In [35]:

(-> pokemon
    (pt/subset-cols :Defense)
    ((npy :log))
    pt/head)

Out[35]:

0    3.891820
1    4.143135
2    4.418841
3    4.812184
4    3.761200
Name: Defense, dtype: float64

Above we just calculated the log of the whole Defense column! Remember that npy operations are vectorized, so they (or the equivalent panthera functions) are usually faster than their Clojure counterparts. The exception is more complicated operations, where Clojure would probably be faster.

Now let's try to do some more complicated things:

In [36]:

(/ (pt/sum (pt/subset-cols pokemon :Speed)) 
   (pt/n-rows pokemon))

Out[36]:

27311/400
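If you come from Python, the 27311/400 result may look odd. Clojure's / keeps integer division exact by returning a ratio; the short sketch below (plain Clojure, with mean-speed as a name chosen for this example) shows how to turn it into a decimal number.

```clojure
;; Dividing two integers yields an exact ratio, reduced to lowest terms.
(def mean-speed (/ 54622 800)) ; => 27311/400

(ratio? mean-speed)  ; => true

;; Convert explicitly when a floating point number is wanted...
(double mean-speed)  ; => 68.2775

;; ...or make one operand a double from the start.
(/ 27311 400.0)      ; => 68.2775
```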

Above we see how we can combine operations on series, but of course that's just a mean, and we have a function for that!

In [37]:

(defn col-mean
  [col]
  (pt/mean (pt/subset-cols pokemon col)))

Out[37]:

#'user/col-mean

Now we would like to add a new column that says high when the value is above the mean, and low otherwise.

npy is really helpful here:

In [38]:

(npy :where {:args [(pt/gt (pt/head (pt/subset-cols pokemon :Speed)) (col-mean :Speed))
                    "high"
                    "low"]})

Out[38]:

['low' 'low' 'high' 'high' 'low']

But this is pretty ugly and we can't chain it with other functions. It is pretty easy to wrap it into a chainable function:

In [39]:

(defn where
  [& args]
  (npy :where {:args args}))

Out[39]:

#'user/where

In [40]:

(-> pokemon
    (pt/subset-cols :Speed)
    pt/head
    (pt/gt (col-mean :Speed))
    (where "high" "low"))

Out[40]:

['low' 'low' 'high' 'high' 'low']

That seems to work! Let's add a new column to our data-frame:

In [41]:

(def speed-level
  (-> pokemon
    (pt/subset-cols :Speed)
    (pt/gt (col-mean :Speed))
    (where "high" "low")))

(-> pokemon
    (pt/assign {:speed-level speed-level})
    (pt/subset-cols :speed_level :Speed)
    (pt/head 10)
    show)

Out[41]:

	speed_level	Speed
0	low	45
1	low	60
2	high	80
3	high	80
4	low	65
5	high	80
6	high	100
7	high	100
8	high	100
9	low	43

Of course we didn't actually add speed_level to pokemon; we created a new data-frame. Everything here is as immutable as possible. Let's check if this is really the case:

In [42]:

(vec (pt/names pokemon))

Out[42]:

["#" "Name" "Type 1" "Type 2" "HP" "Attack" "Defense" "Sp. Atk" "Sp. Def" "Speed" "Generation" "Legendary"]
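This mirrors how ordinary Clojure collections behave. The sketch below is plain Clojure on a single map (not panthera code, and the row is a hand-written sample): assoc returns a new map and leaves the original untouched, just as assign returned a new data-frame above.

```clojure
(def row {:name "Bulbasaur" :speed 45})

;; assoc does not modify row; it returns a new map with the extra key.
(def row-with-level (assoc row :speed-level "low"))

row            ; => {:name "Bulbasaur", :speed 45}
row-with-level ; => {:name "Bulbasaur", :speed 45, :speed-level "low"}
```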

Inspecting data¶

Other than head we have tail

In [43]:

(show (pt/tail pokemon))

Out[43]:

	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
795	796	Diancie	Rock	Fairy	50	100	150	100	150	50	6	True
796	797	Mega Diancie	Rock	Fairy	50	160	110	160	110	110	6	True
797	798	Hoopa Confined	Psychic	Ghost	80	110	60	150	130	70	6	True
798	799	Hoopa Unbound	Psychic	Dark	80	160	60	170	130	80	6	True
799	800	Volcanion	Fire	Water	80	110	120	130	90	70	6	True

We can always check the shape of the data structure we're interested in. shape returns the row and column counts:

In [44]:

(pt/shape pokemon)

Out[44]:

(800, 12)

If you want just one of the two you can either use one of n-rows or n-cols, or get the required value by index:

In [45]:

(pt/n-rows pokemon)

Out[45]:

In [46]:

((pt/shape pokemon) 0)

Out[46]:

Exploratory data analysis¶

Now we can move to something a little more interesting: some data analysis.

One of the first things we might want to do is look at some frequencies. value-counts is our friend:

In [47]:

(-> pokemon
    (pt/subset-cols "Type 1")
    (pt/value-counts {:dropna false}))

Out[47]:

Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Rock         44
Electric     44
Ground       32
Ghost        32
Dragon       32
Dark         31
Poison       28
Fighting     27
Steel        27
Ice          24
Fairy        17
Flying        4
Name: Type 1, dtype: int64

As we can see we get counts by group automatically and this can come in handy!

There's also a nice way to see many stats at once for all the numeric columns: describe

In [48]:

(show (pt/describe pokemon))

Out[48]:

	#	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation
count	800.0000	800.000000	800.000000	800.000000	800.000000	800.000000	800.000000	800.00000
mean	400.5000	69.258750	79.001250	73.842500	72.820000	71.902500	68.277500	3.32375
std	231.0844	25.534669	32.457366	31.183501	32.722294	27.828916	29.060474	1.66129
min	1.0000	1.000000	5.000000	5.000000	10.000000	20.000000	5.000000	1.00000
25%	200.7500	50.000000	55.000000	50.000000	49.750000	50.000000	45.000000	2.00000
50%	400.5000	65.000000	75.000000	70.000000	65.000000	70.000000	65.000000	3.00000
75%	600.2500	80.000000	100.000000	90.000000	95.000000	90.000000	90.000000	5.00000
max	800.0000	255.000000	190.000000	230.000000	194.000000	230.000000	180.000000	6.00000

If you need some of these stats only for some columns, chances are that there's a function for that!

In [49]:

(-> (pt/subset-cols pokemon :HP)
    ((juxt pt/mean pt/std pt/minimum pt/maximum)))

Out[49]:

[69.25875 25.53466903233207 1 255]
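If juxt is new to you, the sketch below shows the same pattern in plain Clojure. It takes several functions and returns one function that applies all of them to the same argument. The mean function and the hp vector (the first five HP values shown earlier) are defined here only for illustration; they are not part of panthera.

```clojure
(defn mean
  "Arithmetic mean of a non-empty collection of numbers, as a double."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(def hp [45 60 80 80 39])

;; juxt builds a function returning a vector with one result per function.
((juxt mean #(apply min %) #(apply max %)) hp)
;; => [60.8 39 80]
```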

Reshaping data¶

One of the most common operations with rectangular data is reshaping it to make other operations easier.

R users know exactly what I mean by tidy data; if you have no idea what that is, check the link. The main point is that while most people are used to working with double-entry tables (like the one built above with describe), it is much easier to work with long data: one row per observation and one column per variable.

In panthera, melt is the workhorse for this process:

In [50]:

(-> pokemon pt/head show)

Out[50]:

	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	60	62	63	80	80	60	1	False
2	3	Venusaur	Grass	Poison	80	82	83	100	100	80	1	False
3	4	Mega Venusaur	Grass	Poison	80	100	123	122	120	80	1	False
4	5	Charmander	Fire	NaN	39	52	43	60	50	65	1	False

In [51]:

(-> pokemon pt/head (pt/melt {:id-vars "Name" :value-vars ["Attack" "Defense"]}) show)

Out[51]:

	Name	variable	value
0	Bulbasaur	Attack	49
1	Ivysaur	Attack	62
2	Venusaur	Attack	82
3	Mega Venusaur	Attack	100
4	Charmander	Attack	52
5	Bulbasaur	Defense	49
6	Ivysaur	Defense	63
7	Venusaur	Defense	83
8	Mega Venusaur	Defense	123
9	Charmander	Defense	43

Above we told panthera that we wanted to melt our data-frame and that we would like to have the column Name act as the main id, while we're interested in the value of Attack and Defense.
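To make the wide-to-long idea concrete, here is a minimal sketch that does the same reshaping on a plain Clojure vector of maps. melt-rows is a hypothetical stand-in written for this example; it is not how panthera or Pandas implement melt, and it only handles a single id column.

```clojure
(defn melt-rows
  "Hypothetical helper: for every value column, emits one row per input
  row holding the id, the column name and the value."
  [rows id-var value-vars]
  (vec (for [v value-vars
             row rows]
         {id-var    (get row id-var)
          :variable v
          :value    (get row v)})))

(def wide
  [{:name "Bulbasaur" :attack 49 :defense 49}
   {:name "Ivysaur"   :attack 62 :defense 63}])

(melt-rows wide :name [:attack :defense])
;; => [{:name "Bulbasaur", :variable :attack, :value 49}
;;     {:name "Ivysaur", :variable :attack, :value 62}
;;     {:name "Bulbasaur", :variable :defense, :value 49}
;;     {:name "Ivysaur", :variable :defense, :value 63}]
```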

This makes it much easier to group values by some variable:

In [52]:

(-> pokemon 
    pt/head 
    (pt/melt {:id-vars "Name" :value-vars ["Attack" "Defense"]}) 
    (pt/groupby :variable)
    pt/mean)

Out[52]:

          value
variable       
Attack     69.0
Defense    72.2
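The same group-then-average step can be sketched in plain Clojure on long data. The values are copied from the melted output above; mean-by is a hypothetical helper for this example and says nothing about how Pandas implements groupby.

```clojure
;; Long-format rows: one observation per row.
(def long-rows
  (concat (map #(hash-map :variable "Attack" :value %) [49 62 82 100 52])
          (map #(hash-map :variable "Defense" :value %) [49 63 83 123 43])))

(defn mean-by
  "Groups rows by group-key and averages value-key within each group."
  [rows group-key value-key]
  (into {}
        (map (fn [[group group-rows]]
               [group (/ (reduce + (map value-key group-rows))
                         (double (count group-rows)))]))
        (group-by group-key rows)))

(mean-by long-rows :variable :value)
;; => {"Attack" 69.0, "Defense" 72.2}
```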

If you've ever used Excel you already know about pivot, which is the opposite of melt:

In [53]:

(-> pokemon 
    pt/head 
    (pt/melt {:id-vars "Name" :value-vars ["Attack" "Defense"]}) 
    (pt/pivot {:index "Name" :columns "variable" :values "value"})
    show)

Out[53]:

variable	Attack	Defense
Name
Bulbasaur	49	49
Charmander	52	43
Ivysaur	62	63
Mega Venusaur	100	123
Venusaur	82	83

What if we have more than one data-frame? We can combine them however we want!

In [54]:

(show 
  (pt/concatenate
    [(pt/head pokemon)
     (pt/tail pokemon)]
    {:axis 0
     :ignore-index true}))

Out[54]:

	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	60	62	63	80	80	60	1	False
2	3	Venusaur	Grass	Poison	80	82	83	100	100	80	1	False
3	4	Mega Venusaur	Grass	Poison	80	100	123	122	120	80	1	False
4	5	Charmander	Fire	NaN	39	52	43	60	50	65	1	False
5	796	Diancie	Rock	Fairy	50	100	150	100	150	50	6	True
6	797	Mega Diancie	Rock	Fairy	50	160	110	160	110	110	6	True
7	798	Hoopa Confined	Psychic	Ghost	80	110	60	150	130	70	6	True
8	799	Hoopa Unbound	Psychic	Dark	80	160	60	170	130	80	6	True
9	800	Volcanion	Fire	Water	80	110	120	130	90	70	6	True

Let's take a second to discuss some options:

:axis: most panthera operations can be applied either by rows or by columns. We decide which with this keyword, where 0 = rows and 1 = columns
:ignore-index: panthera works by index. With true, the original row labels are dropped and the result is renumbered from 0, as in the output above. To better understand what kinds of indexes there are and most of their quirks, check Basic concepts

To better understand :axis let's look at another example:

In [55]:

(show
  (pt/concatenate
    (repeat 2 (pt/head pokemon))
    {:axis 1}))

Out[55]:

	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	45	49	49	65	65	45	1	False	1	Bulbasaur	Grass	Poison	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	60	62	63	80	80	60	1	False	2	Ivysaur	Grass	Poison	60	62	63	80	80	60	1	False
2	3	Venusaur	Grass	Poison	80	82	83	100	100	80	1	False	3	Venusaur	Grass	Poison	80	82	83	100	100	80	1	False
3	4	Mega Venusaur	Grass	Poison	80	100	123	122	120	80	1	False	4	Mega Venusaur	Grass	Poison	80	100	123	122	120	80	1	False
4	5	Charmander	Fire	NaN	39	52	43	60	50	65	1	False	5	Charmander	Fire	NaN	39	52	43	60	50	65	1	False

Types, types everywhere¶

There are many dedicated types, but no worries, there are nice ways to deal with them.

In [56]:

(pt/dtype pokemon)

Out[56]:

#              int64
Name          object
Type 1        object
Type 2        object
HP             int64
Attack         int64
Defense        int64
Sp. Atk        int64
Sp. Def        int64
Speed          int64
Generation     int64
Legendary       bool
dtype: object

I guess there isn't much to say about :int64 and :bool, but :object surely looks more interesting. When panthera (numpy included) finds either strings or something it doesn't know how to deal with, it falls back to the loosest type available, which is :object.

:objects are usually bloated. If we want to save some overhead and it makes sense to treat the values as categorical, we can convert them to :category:

In [57]:

(-> pokemon
    (pt/subset-cols "Type 1")
    (pt/astype :category)
    pt/head)

Out[57]:

0    Grass
1    Grass
2    Grass
3    Grass
4     Fire
Name: Type 1, dtype: category
Categories (18, object): [Bug, Dark, Dragon, Electric, ..., Psychic, Rock, Steel, Water]

In [58]:

(-> pokemon
    (pt/subset-cols "Speed")
    (pt/astype :float)
    pt/head)

Out[58]:

0    45.0
1    60.0
2    80.0
3    80.0
4    65.0
Name: Speed, dtype: float64

Dealing with missing data¶

One of the most painful operations for data scientists and engineers is dealing with the unknown: NaN (or nil, Null, etc).

panthera tries to make this as painless as possible:

In [59]:

(-> pokemon
    (pt/subset-cols "Type 2")
    (pt/value-counts {:dropna false}))

Out[59]:

NaN         386
Flying       97
Ground       35
Poison       34
Psychic      33
Fighting     26
Grass        25
Fairy        23
Steel        22
Dark         20
Dragon       18
Ice          14
Ghost        14
Water        14
Rock         14
Fire         12
Electric      6
Normal        4
Bug           3
Name: Type 2, dtype: int64

We could check for NaN in other ways as well:

In [60]:

(-> pokemon (pt/subset-cols "Type 2") ((juxt pt/hasnans? (comp pt/all? pt/not-na?))))

Out[60]:

[true false]

One of the ways to deal with missing data is to just drop the rows:

In [61]:

(-> pokemon
    (pt/dropna {:subset ["Type 2"]})
    (pt/subset-cols "Type 2")
    (pt/value-counts {:dropna false}))

Out[61]:

Flying      97
Ground      35
Poison      34
Psychic     33
Fighting    26
Grass       25
Fairy       23
Steel       22
Dark        20
Dragon      18
Ice         14
Rock        14
Water       14
Ghost       14
Fire        12
Electric     6
Normal       4
Bug          3
Name: Type 2, dtype: int64

But let's say we want to replace missing observations with a flag or value of some kind. We can do that easily with fill-na:

In [62]:

(-> pokemon
    (pt/subset-cols "Type 2")
    (pt/fill-na :empty)
    (pt/head 10))

Out[62]:

0    Poison
1    Poison
2    Poison
3    Poison
4     empty
5     empty
6    Flying
7    Dragon
8    Flying
9     empty
Name: Type 2, dtype: object
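For comparison, here is how the same two strategies look on a plain Clojure vector. This is only an analogy: it assumes nil stands in for NaN, the vector holds the first ten Type 2 values shown above, and fill-nil and drop-nil are hypothetical helpers rather than panthera functions.

```clojure
(def type-2
  ["Poison" "Poison" "Poison" "Poison" nil nil "Flying" "Dragon" "Flying" nil])

(defn fill-nil
  "Replaces every nil in xs with fill."
  [xs fill]
  (mapv #(if (nil? %) fill %) xs))

(defn drop-nil
  "Removes every nil from xs."
  [xs]
  (filterv some? xs))

(fill-nil type-2 "empty")
;; => ["Poison" "Poison" "Poison" "Poison" "empty" "empty" "Flying" "Dragon" "Flying" "empty"]

(count (drop-nil type-2)) ; => 7
```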

Time and dates¶

Programmers hate time, that's a fact. panthera tries to make this experience as painless as possible:

In [63]:

(def times
  ["1992-01-10","1992-02-10","1992-03-10","1993-03-15","1993-03-16"])

(pt/->datetime times)

Out[63]:

DatetimeIndex(['1992-01-10', '1992-02-10', '1992-03-10', '1993-03-15',
               '1993-03-16'],
              dtype='datetime64[ns]', freq=None)

In [64]:

(-> pokemon
    pt/head
    (pt/set-index (pt/->datetime times))
    show)

Out[64]:

	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
1992-01-10	1	Bulbasaur	Grass	Poison	45	49	49	65	65	45	1	False
1992-02-10	2	Ivysaur	Grass	Poison	60	62	63	80	80	60	1	False
1992-03-10	3	Venusaur	Grass	Poison	80	82	83	100	100	80	1	False
1993-03-15	4	Mega Venusaur	Grass	Poison	80	100	123	122	120	80	1	False
1993-03-16	5	Charmander	Fire	NaN	39	52	43	60	50	65	1	False

In [65]:

(-> pokemon
    pt/head
    (pt/set-index (pt/->datetime times))
    (pt/select-rows "1993-03-16" :loc))

Out[65]:

#                      5
Name          Charmander
Type 1              Fire
Type 2               NaN
HP                    39
Attack                52
Defense               43
Sp. Atk               60
Sp. Def               50
Speed                 65
Generation             1
Legendary          False
Name: 1993-03-16 00:00:00, dtype: object

In [66]:

(-> pokemon
    pt/head
    (pt/set-index (pt/->datetime times))
    (pt/select-rows (pt/slice "1992-03-10" "1993-03-16") :loc)
    show)

Out[66]:

	#	Name	Type 1	Type 2	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
1992-03-10	3	Venusaur	Grass	Poison	80	82	83	100	100	80	1	False
1993-03-15	4	Mega Venusaur	Grass	Poison	80	100	123	122	120	80	1	False
1993-03-16	5	Charmander	Fire	NaN	39	52	43	60	50	65	1	False
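Note that the slice in the last cell keeps both end points, as the output shows. As a rough analogy in plain Clojure (using java.time, not the way Pandas implements label slicing), an inclusive date range filter could look like the sketch below; between-inclusive is a hypothetical helper for this example.

```clojure
(import 'java.time.LocalDate)

(def times
  ["1992-01-10" "1992-02-10" "1992-03-10" "1993-03-15" "1993-03-16"])

(defn between-inclusive
  "Keeps the ISO date strings d for which start <= d <= end."
  [start end dates]
  (let [s (LocalDate/parse start)
        e (LocalDate/parse end)]
    (filterv (fn [d]
               (let [ld (LocalDate/parse d)]
                 ;; inside the range = neither before start nor after end
                 (not (or (.isBefore ld s) (.isAfter ld e)))))
             dates)))

(between-inclusive "1992-03-10" "1993-03-16" times)
;; => ["1992-03-10" "1993-03-15" "1993-03-16"]
```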
