Discretizers¶

This package supports discretization methods and mapping functions.

Installation¶

In [ ]:

Pkg.add("Discretizers")

Once the installation is complete you can use it anywhere by running

In [1]:

using Discretizers

Discretization¶

Categorical Labels¶

You can construct an object for mapping labels to integer indeces

In [2]:

data = [:cat, :dog, :dog, :cat, :cat, :elephant]
catdisc = CategoricalDiscretizer(data);

The resulting object can be used to encode your source labels to their categorical labels

In [3]:

println(":cat becomes: ", encode(catdisc, :cat))
println(":dog becomes: ", encode(catdisc, :dog))
println("data becomes: ", encode(catdisc, data))

:cat becomes: 1
:dog becomes: 2
data becomes: [1,2,2,1,1,3]

You can also transform back

In [4]:

println("1 becomes: ", decode(catdisc, 1))
println("2 becomes: ", decode(catdisc, 2))
println("[1,2,3] becomes: ", decode(catdisc, [1,2,3]))

1 becomes: cat
2 becomes: dog
[1,2,3] becomes: [:cat,:dog,:elephant]

The CategoricalDiscretizer works with any object type

In [5]:

CategoricalDiscretizer(["A", "B", "C"])
CategoricalDiscretizer([5000, 1200, 100])
CategoricalDiscretizer([:dog, "hello world", NaN]);

Linear Discretization¶

Linear discretization into a series of bins is supported as well

Here we construct a linear discretizer that maps $[0,0.5) \rightarrow 1$ and $[0.5,1] \rightarrow 2$

In [6]:

bin_edges = [0.0,0.5,1.0]
lindisc = LinearDiscretizer(bin_edges);

Encoding works the same way

In [7]:

println("0.2 becomes: ", encode(lindisc, 0.2))
println("0.7 becomes: ", encode(lindisc, 0.7))
println("0.5 becomes: ", encode(lindisc, 0.5))
println("it works on arrays: ", encode(lindisc, [0.0,0.8,0.2]))

0.2 becomes: 1
0.7 becomes: 2
0.5 becomes: 2
it works on arrays: [1,2,1]

Decoding is a bit different. Here we obtain the bin and sample from it uniformally

In [8]:

println("1 becomes: ", decode(lindisc, 1))
println("2 becomes: ", decode(lindisc, 2))
println("it works on arrays: ", decode(lindisc, [2,1,2]))

1 becomes: 0.16022506882123666
2 becomes: 0.9685583452997502
it works on arrays: [0.8603893534766647,0.07859144179702082,0.9584045710248809]

Some other functions are supported

In [9]:

println("number of labels: ", nlabels(catdisc), "  ", nlabels(lindisc))
println("bin centers:      ", bincenters(lindisc))
println("extrama of a bin: ", extrema(lindisc, 2))

number of labels: 3  2
bin centers:      [0.25,0.75]
extrama of a bin: (0.5,1.0)

Both discretizers can be constructed to map to other integer types

In [10]:

catdisc = CategoricalDiscretizer(data, Int32)
lindisc = LinearDiscretizer(bin_edges, UInt8)
encode(lindisc, 0.2)

Out[10]:

0x01

Discretization Algorithms¶

In many cases one would like to determine the bin edges for a Linear Discretizer automatically from data. This package supports several algorithms to do just that.

In [14]:

nbins = 3
data  = randn(1000)
edges = binedges(DiscretizeUniformWidth(nbins), data)

Out[14]:

4-element Array{Float64,1}:
 -3.59476
 -1.03547
  1.52381
  4.0831

DiscretizeUniformWidth takes the number of desired bins and breaks the range over the data into evenly spaced bins DiscretizeUniformCount takes the original data, sorts it, and breaks it into bins of even count

In [15]:

edges = binedges(DiscretizeUniformCount(nbins), data)

Out[15]:

4-element Array{Float64,1}:
 -3.59476 
 -0.431955
  0.409398
  4.0831

A third algorithm, MODL, was implemented to find optimal bins given both a continuous data set and a labelled discrete data set.

In [13]:

data = [randn(100), randn(100)+1.0]
labels = [fill(:cat, 100), fill(:dog, 100)]
integer_labels = encode(CategoricalDiscretizer([:cat, :dog]), labels)
edges = binedges(DiscretizeMODL_Optimal(), data, integer_labels)

Out[13]:

3-element Array{AbstractFloat,1}:
 -2.29837 
  0.119088
  3.40968

More information on MODL can be found here.

In [ ]: