#!/usr/bin/env python
# coding: utf-8

# Use Voting Classifiers
# ======================
#
# A [Voting classifier](http://scikit-learn.org/stable/modules/ensemble.html#voting-classifier) model combines multiple different models (i.e., sub-estimators) into a single model, which is (ideally) stronger than any of the individual models alone.
#
# [Dask](http://ml.dask.org/joblib.html) provides the software to train individual sub-estimators on different machines in a cluster. This enables users to train more models in parallel than would have been possible on a single machine. Note that users will only observe this benefit if they have a distributed cluster with more resources than their single machine (because scikit-learn already enables users to parallelize training across the cores of a single machine).
#
# What follows is an example of how one would deploy a voting classifier model with Dask (using a local cluster).
#
# Dask logo

# In[ ]:

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import sklearn.datasets

# We create a synthetic dataset (with 1000 rows and 20 columns) that we can give to the voting classifier model.

# In[ ]:

X, y = sklearn.datasets.make_classification(n_samples=1_000, n_features=20)

# We specify the VotingClassifier as a list of (name, sub-estimator) tuples. Fitting the VotingClassifier on the data fits each of the sub-estimators in turn. We set the `n_jobs` argument to -1, which instructs scikit-learn to use all available cores (notice that we haven't used Dask yet).

# In[ ]:

classifiers = [
    ('sgd', SGDClassifier(max_iter=1000)),
    ('logisticregression', LogisticRegression()),
    ('svc', SVC(gamma='auto')),
]
clf = VotingClassifier(classifiers, n_jobs=-1)

# We call the classifier's fit method in order to train the classifier.

# In[ ]:

get_ipython().run_line_magic('time', 'clf.fit(X, y)')

# Creating a Dask [client](https://distributed.readthedocs.io/en/latest/client.html) provides performance and progress metrics via the dashboard. Because `Client` is given no arguments, its output refers to a [local cluster](http://distributed.readthedocs.io/en/latest/local-cluster.html) (not a distributed cluster).
#
# We can view the dashboard by clicking the link after running the cell.

# In[ ]:

import joblib
from distributed import Client

client = Client()
client

# To train the voting classifier, we call the classifier's fit method, but enclosed in joblib's `parallel_backend` context manager. This distributes training of the sub-estimators across the cluster.

# In[ ]:

get_ipython().run_cell_magic('time', '', 'with joblib.parallel_backend("dask"):\n    clf.fit(X, y)\n\nprint(clf)\n')

# Note that we see no advantage from using Dask here, because we are running on a local cluster rather than a distributed cluster and scikit-learn is already using all of the local machine's cores. If we were using a distributed cluster, Dask would enable us to take advantage of the multiple machines and train sub-estimators across them.
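
# As a minimal sketch of that distributed case: instead of creating `Client()` with no arguments, we would pass the address of a running Dask scheduler. The address below is a placeholder for illustration only; substitute the address of your own scheduler. The rest of the workflow is unchanged, since fitting inside the `parallel_backend("dask")` context sends the sub-estimator fits to whatever cluster the client is connected to.

# In[ ]:

# Sketch only -- assumes a Dask scheduler is already running at this (hypothetical) address.
# client = Client("tcp://scheduler-address:8786")
#
# with joblib.parallel_backend("dask"):
#     clf.fit(X, y)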