using Pkg
Pkg.add("Discretizers")
Once the installation is complete, you can load the package in any session by running
using Discretizers
data = [:cat, :dog, :dog, :cat, :cat, :elephant]
catdisc = CategoricalDiscretizer(data);
The resulting object can be used to encode your source labels to their categorical labels
println(":cat becomes: ", encode(catdisc, :cat))
println(":dog becomes: ", encode(catdisc, :dog))
println("data becomes: ", encode(catdisc, data))
:cat becomes: 1
:dog becomes: 2
data becomes: [1,2,2,1,1,3]
You can also transform back
println("1 becomes: ", decode(catdisc, 1))
println("2 becomes: ", decode(catdisc, 2))
println("[1,2,3] becomes: ", decode(catdisc, [1,2,3]))
1 becomes: cat
2 becomes: dog
[1,2,3] becomes: [:cat,:dog,:elephant]
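To make the two-way mapping concrete, here is a minimal sketch of how such an encoder could be built from a pair of dictionaries. It is not the package's implementation: `SketchCategorical`, `sketch_encode`, and `sketch_decode` are hypothetical names, and the sketch assumes labels are assigned in order of first appearance, which is consistent with the output shown above.

```julia
# Hypothetical stand-in for illustration; not the package's implementation.
struct SketchCategorical{T}
    to_label::Dict{T,Int}    # source value => integer label
    from_label::Dict{Int,T}  # integer label => source value
end

# Assign labels 1, 2, 3, ... in order of first appearance (an assumption).
function SketchCategorical(values::AbstractVector{T}) where {T}
    to_label = Dict{T,Int}()
    from_label = Dict{Int,T}()
    for v in values
        if !haskey(to_label, v)
            label = length(to_label) + 1
            to_label[v] = label
            from_label[label] = v
        end
    end
    return SketchCategorical{T}(to_label, from_label)
end

# Forward and inverse lookups.
sketch_encode(d::SketchCategorical, v) = d.to_label[v]
sketch_decode(d::SketchCategorical, label::Integer) = d.from_label[label]

d = SketchCategorical([:cat, :dog, :dog, :cat, :cat, :elephant])
encoded = [sketch_encode(d, v) for v in [:cat, :dog, :elephant]]
decoded = [sketch_decode(d, i) for i in 1:3]
```

Keeping both dictionaries makes encoding and decoding constant-time lookups, at the cost of storing each distinct value twice.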
The CategoricalDiscretizer works with any object type
CategoricalDiscretizer(["A", "B", "C"])
CategoricalDiscretizer([5000, 1200, 100])
CategoricalDiscretizer([:dog, "hello world", NaN]);
Linear discretization into a series of bins is supported as well
Here we construct a linear discretizer that maps $[0,0.5) \rightarrow 1$ and $[0.5,1] \rightarrow 2$
bin_edges = [0.0,0.5,1.0]
lindisc = LinearDiscretizer(bin_edges);
Encoding works the same way
println("0.2 becomes: ", encode(lindisc, 0.2))
println("0.7 becomes: ", encode(lindisc, 0.7))
println("0.5 becomes: ", encode(lindisc, 0.5))
println("it works on arrays: ", encode(lindisc, [0.0,0.8,0.2]))
0.2 becomes: 1
0.7 becomes: 2
0.5 becomes: 2
it works on arrays: [1,2,1]
Decoding works differently: the label identifies a bin, and a value is sampled uniformly from within that bin, so the results below vary from run to run
println("1 becomes: ", decode(lindisc, 1))
println("2 becomes: ", decode(lindisc, 2))
println("it works on arrays: ", decode(lindisc, [2,1,2]))
1 becomes: 0.16022506882123666
2 becomes: 0.9685583452997502
it works on arrays: [0.8603893534766647,0.07859144179702082,0.9584045710248809]
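The encode and decode behaviour shown above can be sketched in plain Julia. This is an illustration under stated assumptions, not the package's code: `sketch_linear_encode` and `sketch_linear_decode` are hypothetical helpers, bins are taken to be half-open on the right except for the last one (as in the $[0,0.5)$ and $[0.5,1]$ example), and rejecting out-of-range values with an error is a choice made for this sketch only.

```julia
# Hypothetical helpers for illustration; not the package's implementation.

# Return the bin i with edges[i] <= x < edges[i+1]; the last bin also includes its right edge.
function sketch_linear_encode(edges::AbstractVector{<:Real}, x::Real)
    nbins = length(edges) - 1
    # Rejecting out-of-range input is an assumption of this sketch.
    edges[1] <= x <= edges[end] || throw(DomainError(x, "value lies outside the bin edges"))
    # searchsortedlast gives the index of the last edge <= x; clamp so x == edges[end] maps to the last bin.
    return min(searchsortedlast(edges, x), nbins)
end

# Draw a value uniformly from bin i.
function sketch_linear_decode(edges::AbstractVector{<:Real}, i::Integer)
    lo, hi = edges[i], edges[i + 1]
    return lo + rand() * (hi - lo)
end

sketch_edges = [0.0, 0.5, 1.0]
labels_out = [sketch_linear_encode(sketch_edges, x) for x in [0.2, 0.7, 0.5, 1.0]]
sample = sketch_linear_decode(sketch_edges, 2)
```

Because the edges are sorted, the binary search in `searchsortedlast` keeps encoding logarithmic in the number of bins.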
Some other functions are supported
println("number of labels: ", nlabels(catdisc), " ", nlabels(lindisc))
println("bin centers: ", bincenters(lindisc))
println("extrema of a bin: ", extrema(lindisc, 2))
number of labels: 3 2
bin centers: [0.25,0.75]
extrema of a bin: (0.5,1.0)
Both discretizers can be constructed to map to other integer types
catdisc = CategoricalDiscretizer(data, Int32)
lindisc = LinearDiscretizer(bin_edges, UInt8)
encode(lindisc, 0.2)
0x01
In many cases one would like to determine the bin edges for a LinearDiscretizer automatically from data. This package supports several algorithms that do so.
nbins = 3
data = randn(1000)
edges = binedges(DiscretizeUniformWidth(nbins), data)
4-element Array{Float64,1}:
 -3.59476
 -1.03547
  1.52381
  4.0831
DiscretizeUniformWidth takes the desired number of bins and divides the range of the data into bins of equal width. DiscretizeUniformCount sorts the data and divides it into bins that each contain the same number of samples.
edges = binedges(DiscretizeUniformCount(nbins), data)
4-element Array{Float64,1}:
 -3.59476
 -0.431955
  0.409398
  4.0831
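The two strategies can be sketched in a few lines of plain Julia. These are hypothetical helpers written for illustration, not the package's implementation; in particular, placing each interior equal-count edge halfway between neighbouring sorted samples is an assumption, and the package may treat ties and edge placement differently.

```julia
# Hypothetical helpers for illustration; not the package's implementation.

# Equal-width bins: nbins + 1 evenly spaced edges from the minimum to the maximum.
function sketch_uniform_width_edges(data::AbstractVector{<:Real}, nbins::Integer)
    lo, hi = extrema(data)
    return collect(range(lo, stop=hi, length=nbins + 1))
end

# Equal-count bins: sort the data and put interior edges halfway between
# neighbouring samples at roughly every (n / nbins)-th position.
function sketch_uniform_count_edges(data::AbstractVector{<:Real}, nbins::Integer)
    sorted = sort(data)
    n = length(sorted)
    n >= nbins || throw(ArgumentError("need at least as many samples as bins"))
    edges = Vector{Float64}(undef, nbins + 1)
    edges[1] = sorted[1]
    edges[end] = sorted[end]
    for k in 1:(nbins - 1)
        i = round(Int, k * n / nbins)  # index of the last sample in bin k
        edges[k + 1] = (sorted[i] + sorted[i + 1]) / 2
    end
    return edges
end

width_edges = sketch_uniform_width_edges([0.0, 1.0, 4.0, 9.0], 3)
count_edges = sketch_uniform_count_edges([6.0, 1.0, 4.0, 2.0, 5.0, 3.0], 3)
```

On skewed data the two differ noticeably: equal-width edges depend only on the extremes, while equal-count edges follow the density of the samples.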
A third algorithm, MODL, was implemented to find optimal bins given both a continuous data set and a labelled discrete data set.
data = [randn(100); randn(100) .+ 1.0]   # concatenate two samples; the second is shifted by 1.0
labels = [fill(:cat, 100); fill(:dog, 100)]   # one label per sample
integer_labels = encode(CategoricalDiscretizer([:cat, :dog]), labels)
edges = binedges(DiscretizeMODL_Optimal(), data, integer_labels)
3-element Array{AbstractFloat,1}:
 -2.29837
  0.119088
  3.40968
More information on MODL can be found here.