Data science intro with panthera

Clojure + Pandas + Numpy = 💖

I'll show how to get the most out of the Pandas and Clojure ecosystems at the same time.

This intro is based on this Kaggle notebook; you can follow along with it if you come from the Python world.

Env setup

The easiest way to go is the provided Docker image, but if you want to set up your own machine just follow along.

System install

If you want to install everything at the system level you should do something equivalent to what we do below:

sudo apt-get update
sudo apt-get install libpython3.6-dev
pip3 install numpy pandas

conda

To work within a conda environment just create a new one with:

conda create -n panthera python=3.6 numpy pandas
conda activate panthera

Then start your REPL from the activated conda environment. This is the best way to install the requirements for panthera, because in the process you get MKL along with Numpy.

Here

Let's just add panthera to our classpath and we're good to go!

In [1]:
(require '[clojupyter.misc.helper :as helper])
(helper/add-dependencies '[panthera "0.1-alpha.11"])
:ok
Out[1]:
:ok

Now require panthera's main API namespace and define a little helper to better inspect data-frames

In [ ]:
(require '[panthera.panthera :as pt])
In [3]:
(require '[clojupyter.display :as display])
(require '[libpython-clj.python :as py])

(defn show
  [obj]
  (display/html
    (py/call-attr obj "to_html")))
Out[3]:
#'user/show
In [4]:
(helper/add-dependencies '[metasoarous/oz "1.5.4"])
(require '[oz.notebook.clojupyter :as oz])
Out[4]:
nil

A brief primer

We will work with Pokemons! Datasets are available here.

We can read data into panthera from various formats; one of the most used functions is read-csv. Most panthera functions take a data-frame or a series as first argument, followed by one or more required arguments and then a map of options.

To see which options are available you can check the docs, or even the original Pandas docs. Just remember that keywords are converted to their Python form automatically (for example :index-col becomes index_col), while strings must use the original Python name.
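
The keyword conversion can be pictured with a small plain-Clojure sketch. kw->py-name is a hypothetical helper written only for illustration; it is not panthera's actual implementation, which may well handle more cases.

```clojure
(require '[clojure.string :as str])

(defn kw->py-name
  "Hypothetical illustration: keywords get their dashes turned into
  underscores, strings are passed through untouched."
  [k]
  (if (keyword? k)
    (str/replace (name k) "-" "_")
    k))

(kw->py-name :index-col)   ; keyword: dashes become underscores
(kw->py-name "index_col")  ; string: used as-is
```

Under this assumption, {:index-col 0} and {"index_col" 0} would end up naming the same Pandas option.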

Below as an example we read-csv our file, but we want to get only the first 10 rows, so we pass a map to the function like {:nrows 10}.

In [5]:
(show (pt/read-csv "../resources/pokemon.csv" {:nrows 10}))
Out[5]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False
3 4 Mega Venusaur Grass Poison 80 100 123 122 120 80 1 False
4 5 Charmander Fire NaN 39 52 43 60 50 65 1 False
5 6 Charmeleon Fire NaN 58 64 58 80 65 80 1 False
6 7 Charizard Fire Flying 78 84 78 109 85 100 1 False
7 8 Mega Charizard X Fire Dragon 78 130 111 130 85 100 1 False
8 9 Mega Charizard Y Fire Flying 78 104 78 159 115 100 1 False
9 10 Squirtle Water NaN 44 48 65 50 64 43 1 False

The cool thing is that we can chain operations: the thread-first macro (->) is our friend!

Below we read the whole csv, get the correlation matrix and then show it

In [6]:
(-> (pt/read-csv "../resources/pokemon.csv")
    pt/corr
    show)
Out[6]:
                    #         HP     Attack    Defense    Sp. Atk    Sp. Def      Speed  Generation  Legendary
#            1.000000   0.097712   0.102664   0.094691   0.089199   0.085596   0.012181    0.983428   0.154336
HP           0.097712   1.000000   0.422386   0.239622   0.362380   0.378718   0.175952    0.058683   0.273620
Attack       0.102664   0.422386   1.000000   0.438687   0.396362   0.263990   0.381240    0.051451   0.345408
Defense      0.094691   0.239622   0.438687   1.000000   0.223549   0.510747   0.015227    0.042419   0.246377
Sp. Atk      0.089199   0.362380   0.396362   0.223549   1.000000   0.506121   0.473018    0.036437   0.448907
Sp. Def      0.085596   0.378718   0.263990   0.510747   0.506121   1.000000   0.259133    0.028486   0.363937
Speed        0.012181   0.175952   0.381240   0.015227   0.473018   0.259133   1.000000   -0.023121   0.326715
Generation   0.983428   0.058683   0.051451   0.042419   0.036437   0.028486  -0.023121    1.000000   0.079794
Legendary    0.154336   0.273620   0.345408   0.246377   0.448907   0.363937   0.326715    0.079794   1.000000
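
If the thread-first macro is new to you, the following plain-Clojure sketch (no panthera involved, the data is made up) shows that -> simply rewrites a chain into nested calls, placing each result as the first argument of the next form:

```clojure
;; Threaded form: reads top to bottom
(def threaded
  (-> {:a 1}
      (assoc :b 2)
      (update :a inc)
      keys))

;; Equivalent nested form: reads inside out
(def nested
  (keys (update (assoc {:a 1} :b 2) :a inc)))

(= threaded nested) ; both forms produce the same result
```

This is why most panthera functions take the data-frame or series first: it lets them slot directly into a -> chain.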

Since we'll be using pokemon.csv a lot, let's give it a name, defonce is great here

In [7]:
(defonce pokemon (pt/read-csv "../resources/pokemon.csv"))
Out[7]:
#'user/pokemon

Let's see how plotting goes

In [8]:
(defn heatmap 
  [data x y z]
  {:data {:values data}
   :width 500
   :height 500
   :encoding {:x {:field x
                  :type "nominal"}
              :y {:field y
                  :type "nominal"}}
   :layer [{:mark "rect"
            :encoding {:color {:field z
                               :type "quantitative"}}}
           {:mark "text"
            :encoding {:text 
                       {:field z
                        :type "quantitative"
                        :format ".2f"}
                       :color {:value "white"}}}]})
Out[8]:
#'user/heatmap
In [9]:
(-> pokemon
    pt/corr
    pt/reset-index
    (pt/melt {:id-vars :index})
    pt/->clj
    (heatmap :index :variable :value)
    oz/view!)
Out[9]:

What we did is plot the heatmap of the correlation matrix shown above. Don't worry too much about all the steps we took, we'll be seeing all of them one by one later on!

What if we already read our data but we want to see only some rows? We have the head function for that

In [10]:
(show (pt/head pokemon))
Out[10]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False
3 4 Mega Venusaur Grass Poison 80 100 123 122 120 80 1 False
4 5 Charmander Fire NaN 39 52 43 60 50 65 1 False
In [11]:
(show (pt/head pokemon 10))
Out[11]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False
3 4 Mega Venusaur Grass Poison 80 100 123 122 120 80 1 False
4 5 Charmander Fire NaN 39 52 43 60 50 65 1 False
5 6 Charmeleon Fire NaN 58 64 58 80 65 80 1 False
6 7 Charizard Fire Flying 78 84 78 109 85 100 1 False
7 8 Mega Charizard X Fire Dragon 78 130 111 130 85 100 1 False
8 9 Mega Charizard Y Fire Flying 78 104 78 159 115 100 1 False
9 10 Squirtle Water NaN 44 48 65 50 64 43 1 False

Another nice thing we can do is to get columns names

In [12]:
(pt/names pokemon)
Out[12]:
Index(['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',
       'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')

When you see an output like the one above, it means that the data we have is still in Python. That's ok if you keep working within panthera, but what if you want to do something with column names using Clojure?

In [13]:
(vec (pt/names pokemon))
Out[13]:
["#" "Name" "Type 1" "Type 2" "HP" "Attack" "Defense" "Sp. Atk" "Sp. Def" "Speed" "Generation" "Legendary"]

That's it! Just call vec and now you have a nice Clojure vector that you can deal with.

N.B.: with many Python objects you can directly treat them as similar Clojure collections. For instance in this case we can do something like below

In [14]:
(doseq [a (pt/names pokemon)] (println a))
#
Name
Type 1
Type 2
HP
Attack
Defense
Sp. Atk
Sp. Def
Speed
Generation
Legendary
Out[14]:
nil

Some plotting

Plotting is a nice way to learn how to munge data: you get fast visual feedback and the results are usually nice to look at!

Let's plot Speed and Defense

In [15]:
(defn line-plot
  [data x y & [color]]
  (let [spec {:data {:values data}
              :mark "line"
              :width 600
              :height 300
              :encoding {:x {:field x
                             :type "quantitative"}
                         :y {:field y
                             :type "quantitative"}
                         :color {}}}]
    (if color
      (assoc-in spec [:encoding :color] {:field color
                                         :type "nominal"})
      (assoc-in spec [:encoding :color] {:value "blue"}))))
Out[15]:
#'user/line-plot
In [16]:
(-> pokemon
    (pt/subset-cols :# :Speed :Defense)
    (pt/melt {:id-vars :#})
    pt/->clj
    (line-plot :# :value :variable)
    oz/view!)
Out[16]:

Let's look at the operation above:

  • subset-cols: we use this to, well, subset columns. We can choose N columns by label and we get a 'new' data-frame with only the selected columns
  • melt: this transforms the data-frame from wide to long format (for more info about it see further below)
  • ->clj: this turns data-frames and series into a Clojure vector of maps

subset-cols is pretty straightforward:

In [17]:
(-> pokemon (pt/subset-cols :Speed :Attack) pt/head show)
Out[17]:
   Speed  Attack
0     45      49
1     60      62
2     80      82
3     80     100
4     65      52
In [18]:
(-> pokemon (pt/subset-cols :Speed :Attack :HP :#) pt/head show)
Out[18]:
   Speed  Attack  HP  #
0     45      49  45  1
1     60      62  60  2
2     80      82  80  3
3     80     100  80  4
4     65      52  39  5
In [19]:
(-> pokemon (pt/subset-cols :# :Attack) pt/head)
Out[19]:
   #  Attack
0  1      49
1  2      62
2  3      82
3  4     100
4  5      52

->clj tries to work out the best way to transform panthera data structures into Clojure ones

In [20]:
(-> pokemon (pt/subset-cols :Speed) pt/head pt/->clj)
Out[20]:
[{:speed 45} {:speed 60} {:speed 80} {:speed 80} {:speed 65}]
In [21]:
(-> pokemon (pt/subset-cols :Speed :HP) pt/head pt/->clj)
Out[21]:
[{:speed 45, :hp 45} {:speed 60, :hp 60} {:speed 80, :hp 80} {:speed 80, :hp 80} {:speed 65, :hp 39}]

Now we want to see what happens when we plot Attack vs Defense

In [22]:
(defn scatter
  [data x y & [color]]
  (let [spec {:data {:values data}
              :mark "point"
              :width 600
              :height 300
              :encoding {:x {:field x
                             :type "quantitative"}
                         :y {:field y
                             :type "quantitative"}
                         :color {}}}]
    (if color
      (assoc-in spec [:encoding :color] {:field color
                                         :type "nominal"})
      (assoc-in spec [:encoding :color] {:value "dodgerblue"}))))
Out[22]:
#'user/scatter
In [23]:
(-> pokemon
    (pt/subset-cols :Attack :Defense)
    pt/->clj
    (scatter :attack :defense)
    oz/view!)
Out[23]:

And now the Speed histogram

In [24]:
(defn hist
  [data x & [color]]
  (let [spec {:data {:values data}
              :mark "bar"
              :width 600
              :height 300
              :encoding {:x {:field x
                             :bin {:maxbins 50}
                             :type "quantitative"}
                         :y {:aggregate "count"
                             :type "quantitative"}
                         :color {}}}]
    (if color
      (assoc-in spec [:encoding :color] {:field color
                                         :type "nominal"})
      (assoc-in spec [:encoding :color] {:value "dodgerblue"}))))
Out[24]:
#'user/hist
In [25]:
(-> pokemon
    (pt/subset-cols :Speed)
    pt/->clj
    (hist :speed)
    oz/view!)
Out[25]:

Data-frames basics

Creation

How to create data-frames? Above we read a csv, but what if we already have some data in the runtime we want to deal with? Nothing easier than this:

In [26]:
(show (pt/data-frame [{:a 1 :b 2} {:a 3 :b 4}]))
Out[26]:
   a  b
0  1  2
1  3  4

What if we don't care about column names, or we'd prefer to add them to an already generated data-frame?

In [27]:
(show (pt/data-frame (to-array-2d [[1 2] [3 4]])))
Out[27]:
   0  1
0  1  2
1  3  4

Columns of data-frames are just series:

In [28]:
(-> pokemon (pt/subset-cols "Defense") pt/pytype)
Out[28]:
:series
In [29]:
(pt/series [1 2 3])
Out[29]:
0    1
1    2
2    3
dtype: int64

The column name is the name of the series:

In [30]:
(pt/series [1 2 3] {:name :my-series})
Out[30]:
0    1
1    2
2    3
Name: my-series, dtype: int64

Filtering

One of the most straightforward ways to filter data-frames is with booleans. We have filter-rows that takes either booleans or a function that generates booleans

In [31]:
(-> pokemon
    (pt/filter-rows #(-> % (pt/subset-cols "Defense") (pt/gt 200)))
    show)
Out[31]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
224 225 Mega Steelix Steel Ground 75 125 230 55 95 30 2 False
230 231 Shuckle Bug Rock 20 10 230 10 230 5 2 False
333 334 Mega Aggron Steel NaN 70 140 230 60 80 50 3 False

gt is exactly what you think it is: >. Check the Basic concepts notebook to better understand how math works in panthera.

Now we have to bring Numpy into the equation. Let's say we want to filter the data-frame on 2 conditions at the same time. We can do that using npy:

In [32]:
(require '[panthera.numpy :refer [npy]])
Out[32]:
nil
In [33]:
(defn my-filter
  [col1 col2]
  (npy :logical-and 
       {:args [(-> pokemon
                   (pt/subset-cols col1)
                   (pt/gt 200))
               (-> pokemon
                   (pt/subset-cols col2)
                   (pt/gt 100))]}))
Out[33]:
#'user/my-filter
In [34]:
(-> pokemon
    (pt/filter-rows (my-filter :Defense :Attack))
    show)
Out[34]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
224 225 Mega Steelix Steel Ground 75 125 230 55 95 30 2 False
333 334 Mega Aggron Steel NaN 70 140 230 60 80 50 3 False
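
The idea behind both filters is a boolean mask: one true/false value per row, with only the true rows kept. Here is a minimal plain-Clojure sketch of that idea, using a few values copied from the outputs above. filter-by-mask is a hypothetical helper for illustration only; panthera does this work in Python through Pandas and Numpy.

```clojure
;; A few rows copied from the outputs above
(def names   ["Bulbasaur" "Mega Steelix" "Shuckle" "Mega Aggron"])
(def defense [49 230 230 230])
(def attack  [49 125 10 140])

;; One boolean per row, similar in spirit to (pt/gt col 200)
(def defense-mask (mapv #(> % 200) defense))
(def attack-mask  (mapv #(> % 100) attack))

;; Element-wise AND of the two masks, similar in spirit to npy :logical-and
(def both-mask (mapv #(and %1 %2) defense-mask attack-mask))

(defn filter-by-mask
  "Hypothetical helper: keep the items whose mask entry is true."
  [mask items]
  (vec (keep-indexed (fn [i item] (when (nth mask i) item)) items)))

(filter-by-mask defense-mask names) ; rows with Defense > 200
(filter-by-mask both-mask names)    ; rows that also have Attack > 100
```

The first call keeps Mega Steelix, Shuckle and Mega Aggron; the second drops Shuckle, matching Out[31] and Out[34].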

panthera.numpy works a little differently from the rest of panthera: usually npy is all you need to access any numpy function.

For instance:

In [35]:
(-> pokemon
    (pt/subset-cols :Defense)
    ((npy :log))
    pt/head)
Out[35]:
0    3.891820
1    4.143135
2    4.418841
3    4.812184
4    3.761200
Name: Defense, dtype: float64

Above we just calculated the log of the whole Defense column! Remember that npy operations are vectorized, so usually it is faster to use them (or equivalent panthera ones) than Clojure ones (unless you're doing more complicated operations, then Clojure would probably be faster).
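
For comparison, this is what the element-by-element version looks like in plain Clojure. It is only a sketch on the first five Defense values (copied from the head output earlier), meant to show what the vectorized call computes, not how Numpy computes it:

```clojure
;; First five Defense values, copied from the head output
(def defense-head [49 63 83 123 43])

;; Natural log applied to one element at a time
(def logged (mapv #(Math/log (double %)) defense-head))

logged ; same values as Out[35], up to printing precision
```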

Now let's try to do some more complicated things:

In [36]:
(/ (pt/sum (pt/subset-cols pokemon :Speed)) 
   (pt/n-rows pokemon))
Out[36]:
27311/400

Above we see how we can combine operations on series, but of course that's just a mean, and we have a function for that!

In [37]:
(defn col-mean
  [col]
  (pt/mean (pt/subset-cols pokemon col)))
Out[37]:
#'user/col-mean
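
A side note on Out[36]: the result is printed as 27311/400 because Clojure's / keeps integer division exact and returns a ratio. A small plain-Clojure sketch, where 54622 is the Speed total implied by that output (27311/400 times 800 rows), not a value read from the data:

```clojure
;; 54622 is derived from Out[36]: 27311/400 * 800 rows
(def exact-mean (/ 54622 800))

(ratio? exact-mean)  ; true: integer division stays exact
(double exact-mean)  ; the same value as a floating-point number
```

Calling double on the ratio gives the familiar decimal form of the mean.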

Now we would like to add a new column that says high when the value is above the mean, and low for the opposite.

npy is really helpful here:

In [38]:
(npy :where {:args [(pt/gt (pt/head (pt/subset-cols pokemon :Speed)) (col-mean :Speed))
                    "high"
                    "low"]})
Out[38]:
['low' 'low' 'high' 'high' 'low']

But this is pretty ugly and we can't chain it with other functions. It is pretty easy to wrap it into a chainable function:

In [39]:
(defn where
  [& args]
  (npy :where {:args args}))
Out[39]:
#'user/where
In [40]:
(-> pokemon
    (pt/subset-cols :Speed)
    pt/head
    (pt/gt (col-mean :Speed))
    (where "high" "low"))
Out[40]:
['low' 'low' 'high' 'high' 'low']
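
To make the semantics concrete, here is a rough plain-Clojure equivalent of the cell above. where* is a hypothetical stand-in that works on ordinary Clojure collections, and 68.2775 is the Speed mean taken from Out[36]; neither is part of panthera.

```clojure
;; First five Speed values, copied from the head output
(def speed-head [45 60 80 80 65])
;; Mean of the full Speed column (27311/400 from Out[36])
(def mean-speed 68.2775)

(defn where*
  "Hypothetical stand-in for numpy's where on a plain Clojure collection:
  pick if-true where the mask is true, if-false otherwise."
  [mask if-true if-false]
  (mapv #(if % if-true if-false) mask))

(def levels
  (-> (mapv #(> % mean-speed) speed-head)
      (where* "high" "low")))

levels ; same labels as Out[40]
```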

That seems to work! Let's add a new column to our data-frame:

In [41]:
(def speed-level
  (-> pokemon
    (pt/subset-cols :Speed)
    (pt/gt (col-mean :Speed))
    (where "high" "low")))

(-> pokemon
    (pt/assign {:speed-level speed-level})
    (pt/subset-cols :speed_level :Speed)
    (pt/head 10)
    show)
Out[41]:
  speed_level  Speed
0         low     45
1         low     60
2        high     80
3        high     80
4         low     65
5        high     80
6        high    100
7        high    100
8        high    100
9         low     43

Of course we didn't actually add speed_level to pokemon; we created a new data-frame. Everything here is as immutable as possible. Let's check that this is really the case:

In [42]:
(vec (pt/names pokemon))
Out[42]:
["#" "Name" "Type 1" "Type 2" "HP" "Attack" "Defense" "Sp. Atk" "Sp. Def" "Speed" "Generation" "Legendary"]

Inspecting data

Other than head we have tail

In [43]:
(show (pt/tail pokemon))
Out[43]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
795 796 Diancie Rock Fairy 50 100 150 100 150 50 6 True
796 797 Mega Diancie Rock Fairy 50 160 110 160 110 110 6 True
797 798 Hoopa Confined Psychic Ghost 80 110 60 150 130 70 6 True
798 799 Hoopa Unbound Psychic Dark 80 160 60 170 130 80 6 True
799 800 Volcanion Fire Water 80 110 120 130 90 70 6 True

We can always check what's the shape of the data structure we're interested in. shape returns rows and columns count

In [44]:
(pt/shape pokemon)
Out[44]:
(800, 12)

If you want just one of the two you can either use one of n-rows or n-cols, or get the required value by index:

In [45]:
(pt/n-rows pokemon)
Out[45]:
800
In [46]:
((pt/shape pokemon) 0)
Out[46]:
800

Exploratory data analysis

Now we can move to something a little more interesting: some data analysis.

One of the first things we might want to do is to look at some frequencies. value-counts is our friend

In [47]:
(-> pokemon
    (pt/subset-cols "Type 1")
    (pt/value-counts {:dropna false}))
Out[47]:
Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Rock         44
Electric     44
Ground       32
Ghost        32
Dragon       32
Dark         31
Poison       28
Fighting     27
Steel        27
Ice          24
Fairy        17
Flying        4
Name: Type 1, dtype: int64

As we can see we get counts by group automatically and this can come in handy!
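
If you know Clojure's frequencies, value-counts will feel familiar. A minimal plain-Clojure sketch on the first five Type 1 values (copied from the head output earlier), sorted by descending count as in the output above:

```clojure
;; Type 1 of the first five rows
(def type-1-head ["Grass" "Grass" "Grass" "Grass" "Fire"])

(def counts
  (->> (frequencies type-1-head)  ; map of value -> count
       (sort-by val >)))          ; most frequent first

counts
```

The difference is that value-counts runs in Pandas and returns a series, so it keeps chaining with other panthera functions.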

There's also a nice way to see many stats at once for all the numeric columns: describe

In [48]:
(show (pt/describe pokemon))
Out[48]:
              #          HP      Attack     Defense     Sp. Atk     Sp. Def       Speed  Generation
count  800.0000  800.000000  800.000000  800.000000  800.000000  800.000000  800.000000   800.00000
mean   400.5000   69.258750   79.001250   73.842500   72.820000   71.902500   68.277500     3.32375
std    231.0844   25.534669   32.457366   31.183501   32.722294   27.828916   29.060474     1.66129
min      1.0000    1.000000    5.000000    5.000000   10.000000   20.000000    5.000000     1.00000
25%    200.7500   50.000000   55.000000   50.000000   49.750000   50.000000   45.000000     2.00000
50%    400.5000   65.000000   75.000000   70.000000   65.000000   70.000000   65.000000     3.00000
75%    600.2500   80.000000  100.000000   90.000000   95.000000   90.000000   90.000000     5.00000
max    800.0000  255.000000  190.000000  230.000000  194.000000  230.000000  180.000000     6.00000

If you need some of these stats only for some columns, chances are that there's a function for that!

In [49]:
(-> (pt/subset-cols pokemon :HP)
    ((juxt pt/mean pt/std pt/minimum pt/maximum)))
Out[49]:
[69.25875 25.53466903233207 1 255]
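
The trick above is juxt, which builds a function that applies several functions to the same argument and collects the results in a vector. A plain-Clojure sketch on the first five HP values (copied from the head output earlier); the mean helper is defined here only for the example:

```clojure
;; HP of the first five rows
(def hp-head [45 60 80 80 39])

(defn mean
  "Arithmetic mean of a collection of numbers, as a double."
  [xs]
  (/ (reduce + xs) (double (count xs))))

;; One pass over the same data with three different functions
(def stats
  ((juxt mean #(apply min %) #(apply max %)) hp-head))

stats ; [mean min max] of the five values
```

Note the double parentheses in the panthera cell: the inner form creates the function, the outer one lets -> call it on the series.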

Reshaping data

One of the most common operations on rectangular data is reshaping it in whatever way makes other operations easier.

R people know perfectly well what I mean when I talk about tidy data. If you have no idea what this is, check the link; the main point is that while most people are used to working with double entry matrices (like the one built above with describe), it is much easier to work with long data: one row per observation and one column per variable.

In panthera there's melt as a workhorse for this process

In [50]:
(-> pokemon pt/head show)
Out[50]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False
3 4 Mega Venusaur Grass Poison 80 100 123 122 120 80 1 False
4 5 Charmander Fire NaN 39 52 43 60 50 65 1 False
In [51]:
(-> pokemon pt/head (pt/melt {:id-vars "Name" :value-vars ["Attack" "Defense"]}) show)
Out[51]:
            Name  variable  value
0      Bulbasaur    Attack     49
1        Ivysaur    Attack     62
2       Venusaur    Attack     82
3  Mega Venusaur    Attack    100
4     Charmander    Attack     52
5      Bulbasaur   Defense     49
6        Ivysaur   Defense     63
7       Venusaur   Defense     83
8  Mega Venusaur   Defense    123
9     Charmander   Defense     43

Above we told panthera that we wanted to melt our data-frame and that we would like to have the column Name act as the main id, while we're interested in the value of Attack and Defense.

This makes it much easier to group values by some variable:

In [52]:
(-> pokemon 
    pt/head 
    (pt/melt {:id-vars "Name" :value-vars ["Attack" "Defense"]}) 
    (pt/groupby :variable)
    pt/mean)
Out[52]:
          value
variable       
Attack     69.0
Defense    72.2
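
To see what melt and groupby are doing, here is a sketch of the same two steps on a plain Clojure vector of maps, similar to what ->clj returns. melt-rows and mean-by are hypothetical helpers written for this example; they are not panthera functions and ignore everything Pandas does around indexes.

```clojure
;; The first five rows, reduced to the columns we need
(def rows
  [{:name "Bulbasaur"     :attack 49  :defense 49}
   {:name "Ivysaur"       :attack 62  :defense 63}
   {:name "Venusaur"      :attack 82  :defense 83}
   {:name "Mega Venusaur" :attack 100 :defense 123}
   {:name "Charmander"    :attack 52  :defense 43}])

(defn melt-rows
  "Hypothetical helper: wide to long, one output row per
  (value column, input row) pair."
  [rows id-var value-vars]
  (vec (for [v   value-vars
             row rows]
         {id-var    (get row id-var)
          :variable v
          :value    (get row v)})))

(defn mean-by
  "Hypothetical helper: mean of :value within each group."
  [group-key long-rows]
  (into {}
        (for [[k grp] (group-by group-key long-rows)]
          [k (/ (reduce + (map :value grp)) (double (count grp)))])))

(def long-rows (melt-rows rows :name [:attack :defense]))

(count long-rows)              ; 5 names x 2 variables
(mean-by :variable long-rows)  ; same means as Out[52]
```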

If you've ever used Excel you already know about pivot, which is the opposite of melt

In [53]:
(-> pokemon 
    pt/head 
    (pt/melt {:id-vars "Name" :value-vars ["Attack" "Defense"]}) 
    (pt/pivot {:index "Name" :columns "variable" :values "value"})
    show)
Out[53]:
variable       Attack  Defense
Name
Bulbasaur          49       49
Charmander         52       43
Ivysaur            62       63
Mega Venusaur     100      123
Venusaur           82       83

What if we have more than one data-frame? We can combine them however we want!

In [54]:
(show 
  (pt/concatenate
    [(pt/head pokemon)
     (pt/tail pokemon)]
    {:axis 0
     :ignore-index true}))
Out[54]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False
3 4 Mega Venusaur Grass Poison 80 100 123 122 120 80 1 False
4 5 Charmander Fire NaN 39 52 43 60 50 65 1 False
5 796 Diancie Rock Fairy 50 100 150 100 150 50 6 True
6 797 Mega Diancie Rock Fairy 50 160 110 160 110 110 6 True
7 798 Hoopa Confined Psychic Ghost 80 110 60 150 130 70 6 True
8 799 Hoopa Unbound Psychic Dark 80 160 60 170 130 80 6 True
9 800 Volcanion Fire Water 80 110 120 130 90 70 6 True

Just a second to discuss some options:

  • :axis: most panthera operations can be applied either by rows or by columns; this keyword decides which, where 0 = rows and 1 = columns
  • :ignore-index: panthera works by index; with true the original row labels are dropped and the result is renumbered from 0, as in the output above. To better understand what kinds of indexes there are and most of their quirks check Basic concepts

To better understand :axis let's look at another example

In [55]:
(show
  (pt/concatenate
    (repeat 2 (pt/head pokemon))
    {:axis 1}))
Out[55]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary # Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False
3 4 Mega Venusaur Grass Poison 80 100 123 122 120 80 1 False 4 Mega Venusaur Grass Poison 80 100 123 122 120 80 1 False
4 5 Charmander Fire NaN 39 52 43 60 50 65 1 False 5 Charmander Fire NaN 39 52 43 60 50 65 1 False
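
A rough mental model for the two axes, sketched on plain Clojure vectors of maps. This is only an analogy under simplifying assumptions: the sketch lines rows up by position, while Pandas lines them up by index, and the tiny data-sets below are made up from values seen earlier.

```clojure
(def top    [{:name "Bulbasaur"} {:name "Ivysaur"}])
(def bottom [{:name "Hoopa Unbound"} {:name "Volcanion"}])
(def extra  [{:hp 45} {:hp 60}])

;; Like :axis 0, rows are stacked one after the other
(def stacked (into top bottom))

;; Like :axis 1, columns are placed side by side, row by row
(def side-by-side (mapv merge top extra))

(count stacked)  ; more rows, same columns
side-by-side     ; same rows, more columns
```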

Types, types everywhere

There are many dedicated types, but no worries, there are nice ways to deal with them.

In [56]:
(pt/dtype pokemon)
Out[56]:
#              int64
Name          object
Type 1        object
Type 2        object
HP             int64
Attack         int64
Defense        int64
Sp. Atk        int64
Sp. Def        int64
Speed          int64
Generation     int64
Legendary       bool
dtype: object

I guess there isn't much to say about :int64 and :bool, but :object surely looks more interesting. When panthera (numpy included) finds either strings or something it doesn't know how to deal with, it falls back to the most generic type available, which is :object.

:objects are usually bloated. If we want to save some overhead and it makes sense to treat the values as categorical, we can convert them to :category

In [57]:
(-> pokemon
    (pt/subset-cols "Type 1")
    (pt/astype :category)
    pt/head)
Out[57]:
0    Grass
1    Grass
2    Grass
3    Grass
4     Fire
Name: Type 1, dtype: category
Categories (18, object): [Bug, Dark, Dragon, Electric, ..., Psychic, Rock, Steel, Water]
In [58]:
(-> pokemon
    (pt/subset-cols "Speed")
    (pt/astype :float)
    pt/head)
Out[58]:
0    45.0
1    60.0
2    80.0
3    80.0
4    65.0
Name: Speed, dtype: float64

Dealing with missing data

One of the most painful operations for data scientists and engineers is dealing with the unknown: NaN (or nil, Null, etc).

panthera tries to make this as painless as possible:

In [59]:
(-> pokemon
    (pt/subset-cols "Type 2")
    (pt/value-counts {:dropna false}))
Out[59]:
NaN         386
Flying       97
Ground       35
Poison       34
Psychic      33
Fighting     26
Grass        25
Fairy        23
Steel        22
Dark         20
Dragon       18
Ice          14
Ghost        14
Water        14
Rock         14
Fire         12
Electric      6
Normal        4
Bug           3
Name: Type 2, dtype: int64

We could check for NaN in other ways as well:

In [60]:
(-> pokemon (pt/subset-cols "Type 2") ((juxt pt/hasnans? (comp pt/all? pt/not-na?))))
Out[60]:
[true false]

One of the ways to deal with missing data is to just drop rows

In [61]:
(-> pokemon
    (pt/dropna {:subset ["Type 2"]})
    (pt/subset-cols "Type 2")
    (pt/value-counts {:dropna false}))
Out[61]:
Flying      97
Ground      35
Poison      34
Psychic     33
Fighting    26
Grass       25
Fairy       23
Steel       22
Dark        20
Dragon      18
Ice         14
Rock        14
Water       14
Ghost       14
Fire        12
Electric     6
Normal       4
Bug          3
Name: Type 2, dtype: int64

But let's say we want to replace missing observations with a flag or value of some kind, we can do that easily with fill-na

In [62]:
(-> pokemon
    (pt/subset-cols "Type 2")
    (pt/fill-na :empty)
    (pt/head 10))
Out[62]:
0    Poison
1    Poison
2    Poison
3    Poison
4     empty
5     empty
6    Flying
7    Dragon
8    Flying
9     empty
Name: Type 2, dtype: object
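
The two strategies, dropping and filling, have direct plain-Clojure counterparts. In this sketch the first ten Type 2 values are copied from the outputs above and NaN is represented as nil, which is an assumption made for the example rather than a statement about how ->clj converts missing values.

```clojure
;; Type 2 of the first ten rows, with nil standing in for NaN
(def type-2-head
  ["Poison" "Poison" "Poison" "Poison" nil nil "Flying" "Dragon" "Flying" nil])

;; Like dropna: missing observations are removed
(def dropped (vec (remove nil? type-2-head)))

;; Like fill-na: missing observations are replaced with a flag
(def filled (mapv #(if (nil? %) "empty" %) type-2-head))

(count dropped) ; fewer elements than the input
filled          ; same length, no missing values left
```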

Time and dates

Programmers hate time, that's a fact. Panthera tries to make this experience as painless as possible

In [63]:
(def times
  ["1992-01-10","1992-02-10","1992-03-10","1993-03-15","1993-03-16"])

(pt/->datetime times)
Out[63]:
DatetimeIndex(['1992-01-10', '1992-02-10', '1992-03-10', '1993-03-15',
               '1993-03-16'],
              dtype='datetime64[ns]', freq=None)
In [64]:
(-> pokemon
    pt/head
    (pt/set-index (pt/->datetime times))
    show)
Out[64]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
1992-01-10 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False
1992-02-10 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False
1992-03-10 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False
1993-03-15 4 Mega Venusaur Grass Poison 80 100 123 122 120 80 1 False
1993-03-16 5 Charmander Fire NaN 39 52 43 60 50 65 1 False
In [65]:
(-> pokemon
    pt/head
    (pt/set-index (pt/->datetime times))
    (pt/select-rows "1993-03-16" :loc))
Out[65]:
#                      5
Name          Charmander
Type 1              Fire
Type 2               NaN
HP                    39
Attack                52
Defense               43
Sp. Atk               60
Sp. Def               50
Speed                 65
Generation             1
Legendary          False
Name: 1993-03-16 00:00:00, dtype: object
In [66]:
(-> pokemon
    pt/head
    (pt/set-index (pt/->datetime times))
    (pt/select-rows (pt/slice "1992-03-10" "1993-03-16") :loc)
    show)
Out[66]:
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
1992-03-10 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False
1993-03-15 4 Mega Venusaur Grass Poison 80 100 123 122 120 80 1 False
1993-03-16 5 Charmander Fire NaN 39 52 43 60 50 65 1 False
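
Note that the slice above includes both endpoints: the row for 1993-03-16 is part of the result. A plain-Clojure sketch of the same inclusive range check using java.time; in-range? is a hypothetical helper written for this example and says nothing about how Pandas implements the lookup.

```clojure
(import 'java.time.LocalDate)

(def times
  ["1992-01-10" "1992-02-10" "1992-03-10" "1993-03-15" "1993-03-16"])

(defn in-range?
  "Hypothetical helper: true when d falls between start and end,
  both ends included."
  [^LocalDate start ^LocalDate end ^LocalDate d]
  (not (or (.isBefore d start) (.isAfter d end))))

(def selected
  (filterv (partial in-range?
                    (LocalDate/parse "1992-03-10")
                    (LocalDate/parse "1993-03-16"))
           (map #(LocalDate/parse %) times)))

(mapv str selected) ; the same three dates as in Out[66]
```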