Using wrappers for the scikit-learn API¶

This tutorial shows how to use gensim models as part of your scikit-learn workflow with the help of the wrappers found in gensim.sklearn_api.

The wrappers currently available are:

LdaModel (gensim.sklearn_api.ldamodel.LdaTransformer), which implements gensim's LdaModel in a scikit-learn interface
LsiModel (gensim.sklearn_api.lsimodel.LsiTransformer), which implements gensim's LsiModel in a scikit-learn interface
RpModel (gensim.sklearn_api.rpmodel.RpTransformer), which implements gensim's Random Projections model in a scikit-learn interface
LdaSeqModel (gensim.sklearn_api.ldaseqmodel.LdaSeqTransformer), which implements gensim's LdaSeqModel in a scikit-learn interface
Word2Vec (gensim.sklearn_api.w2vmodel.W2VTransformer), which implements gensim's Word2Vec in a scikit-learn interface
AuthorTopicModel (gensim.sklearn_api.atmodel.AuthorTopicTransformer), which implements gensim's AuthorTopicModel in a scikit-learn interface
Doc2Vec (gensim.sklearn_api.d2vmodel.D2VTransformer), which implements gensim's Doc2Vec in a scikit-learn interface
Text2Bow (gensim.sklearn_api.text2bow.Text2BowTransformer), which implements gensim's Dictionary in a scikit-learn interface
TfidfModel (gensim.sklearn_api.tfidf.TfIdfTransformer), which implements gensim's TfidfModel in a scikit-learn interface
HdpModel (gensim.sklearn_api.hdp.HdpTransformer), which implements gensim's HdpModel in a scikit-learn interface

LDA Model¶

To use LdaModel, begin by importing the LdaTransformer wrapper:

In [1]:

from gensim.sklearn_api import LdaTransformer

Next, we create a dummy set of texts and convert it into a bag-of-words corpus:

In [2]:

from gensim.corpora import Dictionary
texts = [
    ['complier', 'system', 'computer'],
    ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
    ['graph', 'flow', 'network', 'graph'],
    ['loading', 'computer', 'system'],
    ['user', 'server', 'system'],
    ['tree', 'hamiltonian'],
    ['graph', 'trees'],
    ['computer', 'kernel', 'malfunction', 'computer'],
    ['server', 'system', 'computer']
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
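
As a rough illustration of what the bag-of-words conversion above produces, here is a minimal pure-Python sketch. The helpers build_vocab and to_bow are hypothetical names introduced for this example, not gensim functions, and the ids they assign need not match the ones chosen by gensim's Dictionary.

```python
from collections import Counter


def build_vocab(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(text, token2id):
    """Turn a list of tokens into a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())


toy_texts = [['graph', 'flow', 'network', 'graph'], ['graph', 'trees']]
toy_vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(text, toy_vocab) for text in toy_texts]
print(toy_corpus)  # [[(0, 2), (1, 1), (2, 1)], [(0, 1), (3, 1)]]
```

Each document becomes a sparse list of (id, count) pairs, which is the input format the wrappers in this tutorial are fitted on.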

Then fit the wrapper on the corpus and transform it to obtain the topic distribution of each document:

In [3]:

model = LdaTransformer(num_topics=2, id2word=dictionary, iterations=20, random_state=1)
model.fit(corpus)
model.transform(corpus)

Out[3]:

array([[0.84165996, 0.15834005],
       [0.716593  , 0.28340697],
       [0.11434125, 0.88565874],
       [0.80545014, 0.19454984],
       [0.39609504, 0.603905  ],
       [0.80124027, 0.19875973],
       [0.19269218, 0.80730784],
       [0.8466452 , 0.15335481],
       [0.67057097, 0.32942903]], dtype=float32)
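
Each row of the output above holds the topic weights of one document. As a small sketch in plain Python, using the first three rows copied from that output, the dominant topic of each document is the index of the largest entry in its row:

```python
# First three rows of the transform output shown above.
doc_topics = [
    [0.84165996, 0.15834005],
    [0.716593, 0.28340697],
    [0.11434125, 0.88565874],
]

# Index of the largest weight in each row.
dominant = [max(range(len(row)), key=row.__getitem__) for row in doc_topics]
# In this output, the weights of each row add up to roughly 1.
row_sums = [sum(row) for row in doc_topics]

print(dominant)  # [0, 0, 1]
```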

Integration with Sklearn¶

To give a more realistic example of using the wrapper with scikit-learn, we will use the 20 Newsgroups data set, loaded with scikit-learn's fetch_20newsgroups. We use only the categories rec.sport.baseball and sci.crypt and generate topics from them.

In [4]:

import numpy as np
from gensim import matutils
from gensim.models.ldamodel import LdaModel
from sklearn.datasets import fetch_20newsgroups
from gensim.sklearn_api.ldamodel import LdaTransformer

In [5]:

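# RandomState instance with a fixed seed; note that it is not used below,
# so creating it does not by itself make the results reproducible
rand = np.random.mtrand.RandomState(1)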
cats = ['rec.sport.baseball', 'sci.crypt']
data = fetch_20newsgroups(subset='train', categories=cats, shuffle=True)

Next, we use the loaded data to create our dictionary and corpus.

In [6]:

data_texts = [_.split() for _ in data.data]
id2word = Dictionary(data_texts)
corpus = [id2word.doc2bow(i.split()) for i in data.data]

Next, we pass id2word to our LDA wrapper and fit it on the corpus.

In [7]:

obj = LdaTransformer(id2word=id2word, num_topics=5, iterations=20)
lda = obj.fit(corpus)

Example for Using Grid Search¶

In [8]:

from sklearn.model_selection import GridSearchCV

The built-in score function of the LDA wrapper class provides two modes, perplexity and u_mass, for computing the scores of the candidate models. The preferred mode is specified with the scorer parameter of the wrapper, as follows:

In [9]:

obj = LdaTransformer(id2word=id2word, num_topics=2, iterations=5, scorer='u_mass') # here 'scorer' can be 'perplexity' or 'u_mass'
parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}

# set `scoring` to `None` to use the built-in score function of the `LdaTransformer` class
model = GridSearchCV(obj, parameters, cv=3, scoring=None)
model.fit(corpus)

model.best_params_

Out[9]:

{'iterations': 20, 'num_topics': 3}
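
To make the cost of the search above concrete, the following sketch enumerates the same parameter grid in plain Python. It only illustrates how an exhaustive grid search expands the grid into candidates; it is not scikit-learn's implementation, and the variable names are chosen for this example.

```python
from itertools import product

parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}

# Every combination of the listed values becomes one candidate setting.
names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

cv_folds = 3  # matches cv=3 in the cell above
n_fits = len(candidates) * cv_folds

print(len(candidates), n_fits)  # 12 36
```

With GridSearchCV's default refit behaviour, one more fit on the full data follows for the best candidate, so even this small grid trains the model several dozen times.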

You can also supply a custom scoring function of your choice through the scoring parameter of GridSearchCV. The example below uses the c_v mode of the CoherenceModel class to compute the scores of the candidate models.

In [10]:

from gensim.models.coherencemodel import CoherenceModel

# supplying a custom scoring function
def scoring_function(estimator, X, y=None):
    goodcm = CoherenceModel(model=estimator.gensim_model, texts=data_texts, dictionary=estimator.gensim_model.id2word, coherence='c_v')
    return goodcm.get_coherence()

obj = LdaTransformer(id2word=id2word, num_topics=5, iterations=5)
parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}

# set `scoring` as your custom scoring function
model = GridSearchCV(obj, parameters, cv=2, scoring=scoring_function)
model.fit(corpus)

model.best_params_

Out[10]:

{'iterations': 50, 'num_topics': 2}
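
A scoring callable is invoked as scorer(estimator, X, y) and must return a single number, where a higher value means a better model. The sketch below shows that contract in plain Python; ToyEstimator and toy_scorer are hypothetical stand-ins written for this illustration, and the scoring criterion is made up.

```python
class ToyEstimator:
    """Hypothetical stand-in for a topic model wrapper."""

    def __init__(self, num_topics):
        self.num_topics = num_topics

    def fit(self, X, y=None):
        # A real estimator would learn from X here.
        return self


def toy_scorer(estimator, X, y=None):
    # Made-up criterion for illustration only: prefer models with about 3 topics.
    return -abs(estimator.num_topics - 3)


X = [[(0, 1)], [(1, 2)]]  # placeholder corpus; the toy scorer ignores it
fitted = [ToyEstimator(k).fit(X) for k in (2, 3, 5, 10)]
best = max(fitted, key=lambda est: toy_scorer(est, X))
print(best.num_topics)  # 3
```

Because the search maximizes the score, a quantity where lower is better (for example a loss) should be negated before it is returned.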

Example of Using Pipeline¶

In [11]:

from sklearn.pipeline import Pipeline
from sklearn import linear_model

def print_features_pipe(clf, vocab, n=10):
    ''' Better printing for sorted list '''
    # FIXME: this function is broken
    coef = clf.named_steps['classifier'].coef_[0]
    print(coef)
    print('Positive features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[::-1][:n] if coef[j] > 0])))
    print('Negative features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[:n] if coef[j] < 0])))

In [12]:

id2word = Dictionary([_.split() for _ in data.data])
corpus = [id2word.doc2bow(i.split()) for i in data.data]

In [14]:

model = LdaTransformer(num_topics=15, id2word=id2word, iterations=10, random_state=37)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline([('features', model,), ('classifier', clf)])
pipe.fit(corpus, data.target)
# print_features_pipe(pipe, id2word.values())

print(pipe.score(corpus, data.target))

/home/misha/envs/gensim/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

0.6459731543624161
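
The pipeline above chains a transformer and a classifier. As a simplified sketch of that mechanism (not scikit-learn's actual Pipeline), the toy class below fits and applies each intermediate step in turn, fits the last step on the transformed data, and reports accuracy. ToyPipeline, LengthFeature and ThresholdClassifier are hypothetical names introduced for this example.

```python
class ToyPipeline:
    """Minimal stand-in for a pipeline of transformers plus a final classifier."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)

    def score(self, X, y):
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


class LengthFeature:
    """Hypothetical transformer: maps each document to its token count."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[len(doc)] for doc in X]


class ThresholdClassifier:
    """Hypothetical classifier: predicts 1 when the feature exceeds the training mean."""

    def fit(self, X, y=None):
        self.threshold_ = sum(row[0] for row in X) / len(X)
        return self

    def predict(self, X):
        return [int(row[0] > self.threshold_) for row in X]


docs = [['a'], ['a', 'b'], ['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'd', 'e']]
labels = [0, 0, 1, 1]
toy_pipe = ToyPipeline([('features', LengthFeature()), ('classifier', ThresholdClassifier())])
accuracy = toy_pipe.fit(docs, labels).score(docs, labels)
print(accuracy)  # 1.0
```

Note that, as in the cells of this tutorial, the score here is computed on the same data the pipeline was fitted on, so it measures training accuracy rather than performance on unseen documents.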

LSI Model¶

To use LsiModel, begin by importing the LsiTransformer wrapper:

In [15]:

from gensim.sklearn_api import LsiTransformer

Example of Using Pipeline¶

In [18]:

model = LsiTransformer(num_topics=15, id2word=id2word)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline([('features', model,), ('classifier', clf)])
pipe.fit(corpus, data.target)
# print_features_pipe(pipe, id2word.values())

print(pipe.score(corpus, data.target))

0.8657718120805369

Random Projections Model¶

To use RpModel, begin by importing the RpTransformer wrapper:

In [19]:

from gensim.sklearn_api import RpTransformer

Example of Using Pipeline¶

In [20]:

model = RpTransformer(num_topics=2)
np.random.seed(1)  # seed NumPy's global RNG for reproducibility
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline([('features', model,), ('classifier', clf)])
pipe.fit(corpus, data.target)
# print_features_pipe(pipe, id2word.values())

print(pipe.score(corpus, data.target))

0.5461409395973155

LDASeq Model¶

To use LdaSeqModel, begin by importing the LdaSeqTransformer wrapper:

In [21]:

from gensim.sklearn_api import LdaSeqTransformer

Example of Using Pipeline¶

In [22]:

test_data = data.data[0:2]
test_target = data.target[0:2]
id2word_ldaseq = Dictionary(map(lambda x: x.split(), test_data))
corpus_ldaseq = [id2word_ldaseq.doc2bow(i.split()) for i in test_data]

model = LdaSeqTransformer(id2word=id2word_ldaseq, num_topics=2, time_slice=[1, 1, 1], initialize='gensim')
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)  # l2 penalty used
pipe = Pipeline([('features', model,), ('classifier', clf)])
pipe.fit(corpus_ldaseq, test_target)
# print_features_pipe(pipe, id2word_ldaseq.values())

print(pipe.score(corpus_ldaseq, test_target))

/home/misha/git/gensim/gensim/models/ldaseqmodel.py:293: RuntimeWarning: divide by zero encountered in double_scalars
  convergence = np.fabs((bound - old_bound) / old_bound)

1.0

Word2Vec Model¶

To use the Word2Vec model, begin by importing the W2VTransformer wrapper:

In [23]:

from gensim.sklearn_api import W2VTransformer

Example of Using Pipeline¶

In [24]:

w2v_texts = [
    ['calculus', 'is', 'the', 'mathematical', 'study', 'of', 'continuous', 'change'],
    ['geometry', 'is', 'the', 'study', 'of', 'shape'],
    ['algebra', 'is', 'the', 'study', 'of', 'generalizations', 'of', 'arithmetic', 'operations'],
    ['differential', 'calculus', 'is', 'related', 'to', 'rates', 'of', 'change', 'and', 'slopes', 'of', 'curves'],
    ['integral', 'calculus', 'is', 'realted', 'to', 'accumulation', 'of', 'quantities', 'and', 'the', 'areas', 'under', 'and', 'between', 'curves'],
    ['physics', 'is', 'the', 'natural', 'science', 'that', 'involves', 'the', 'study', 'of', 'matter', 'and', 'its', 'motion', 'and', 'behavior', 'through', 'space', 'and', 'time'],
    ['the', 'main', 'goal', 'of', 'physics', 'is', 'to', 'understand', 'how', 'the', 'universe', 'behaves'],
    ['physics', 'also', 'makes', 'significant', 'contributions', 'through', 'advances', 'in', 'new', 'technologies', 'that', 'arise', 'from', 'theoretical', 'breakthroughs'],
    ['advances', 'in', 'the', 'understanding', 'of', 'electromagnetism', 'or', 'nuclear', 'physics', 'led', 'directly', 'to', 'the', 'development', 'of', 'new', 'products', 'that', 'have', 'dramatically', 'transformed', 'modern', 'day', 'society']
]

model = W2VTransformer(size=10, min_count=1)
model.fit(w2v_texts)

class_dict = {'mathematics': 1, 'physics': 0}
train_data = [
    ('calculus', 'mathematics'), ('mathematical', 'mathematics'), ('geometry', 'mathematics'), ('operations', 'mathematics'), ('curves', 'mathematics'),
    ('natural', 'physics'), ('nuclear', 'physics'), ('science', 'physics'), ('electromagnetism', 'physics'), ('natural', 'physics')
]

train_input = list(map(lambda x: x[0], train_data))
train_target = list(map(lambda x: class_dict[x[1]], train_data))

clf = linear_model.LogisticRegression(penalty='l2', C=0.1)
clf.fit(model.transform(train_input), train_target)
text_w2v = Pipeline([('features', model,), ('classifier', clf)])
score = text_w2v.score(train_input, train_target)

print(score)

0.9

AuthorTopic Model¶

To use the AuthorTopic model, begin by importing the AuthorTopicTransformer wrapper:

In [25]:

from gensim.sklearn_api import AuthorTopicTransformer

Example of Using Pipeline¶

In [26]:

from sklearn import cluster

atm_texts = [
    ['complier', 'system', 'computer'],
    ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
    ['graph', 'flow', 'network', 'graph'],
    ['loading', 'computer', 'system'],
    ['user', 'server', 'system'],
    ['tree', 'hamiltonian'],
    ['graph', 'trees'],
    ['computer', 'kernel', 'malfunction', 'computer'],
    ['server', 'system', 'computer'],
]
atm_dictionary = Dictionary(atm_texts)
atm_corpus = [atm_dictionary.doc2bow(text) for text in atm_texts]
author2doc = {'john': [0, 1, 2, 3, 4, 5, 6], 'jane': [2, 3, 4, 5, 6, 7, 8], 'jack': [0, 2, 4, 6, 8], 'jill': [1, 3, 5, 7]}

model = AuthorTopicTransformer(id2word=atm_dictionary, author2doc=author2doc, num_topics=10, passes=100)
model.fit(atm_corpus)

# create and train clustering model
clstr = cluster.MiniBatchKMeans(n_clusters=2)
authors_full = ['john', 'jane', 'jack', 'jill']
clstr.fit(model.transform(authors_full))

# stack together the two models in a pipeline
text_atm = Pipeline([('features', model,), ('cluster', clstr)])
author_list = ['jane', 'jack', 'jill']
ret_val = text_atm.predict(author_list)

print(ret_val)

[1 1 0]
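
The author2doc dictionary above maps each author to the indices of the documents they contributed to. As a sketch of this data structure, the inverse mapping from document index to authors can be derived in plain Python; invert_author2doc is a hypothetical helper written for this illustration, not part of the wrapper.

```python
author2doc = {
    'john': [0, 1, 2, 3, 4, 5, 6],
    'jane': [2, 3, 4, 5, 6, 7, 8],
    'jack': [0, 2, 4, 6, 8],
    'jill': [1, 3, 5, 7],
}


def invert_author2doc(mapping):
    """Build a document index -> list of authors dictionary."""
    doc2author = {}
    for author, doc_ids in mapping.items():
        for doc_id in doc_ids:
            doc2author.setdefault(doc_id, []).append(author)
    return doc2author


doc2author = invert_author2doc(author2doc)
print(doc2author[2])  # ['john', 'jane', 'jack']
```

Every one of the nine documents has at least one author, and several are shared, which is what lets the model relate authors through the documents they have in common.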

Doc2Vec Model¶

To use the Doc2Vec model, begin by importing the D2VTransformer wrapper:

In [27]:

from gensim.sklearn_api import D2VTransformer

Example of Using Pipeline¶

In [28]:

from gensim.models import doc2vec
d2v_sentences = [doc2vec.TaggedDocument(words, [i]) for i, words in enumerate(w2v_texts)]

model = D2VTransformer(min_count=1)
model.fit(d2v_sentences)

class_dict = {'mathematics': 1, 'physics': 0}
train_data = [
    (['calculus', 'mathematical'], 'mathematics'), (['geometry', 'operations', 'curves'], 'mathematics'),
    (['natural', 'nuclear'], 'physics'), (['science', 'electromagnetism', 'natural'], 'physics')
]
train_input = list(map(lambda x: x[0], train_data))
train_target = list(map(lambda x: class_dict[x[1]], train_data))

clf = linear_model.LogisticRegression(penalty='l2', C=0.1)
clf.fit(model.transform(train_input), train_target)
text_d2v = Pipeline([('features', model,), ('classifier', clf)])
score = text_d2v.score(train_input, train_target)

print(score)

1.0

Text2Bow Model¶

To use the Text2Bow model, begin by importing the Text2BowTransformer wrapper:

In [29]:

from gensim.sklearn_api import Text2BowTransformer

Example of Using Pipeline¶

In [30]:

text2bow_model = Text2BowTransformer()
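# note: np.random.seed(0) returns None, so random_state is None here;
# the call only seeds NumPy's global RNG as a side effect
lda_model = LdaTransformer(num_topics=2, passes=10, minimum_probability=0, random_state=np.random.seed(0))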
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)
text_t2b = Pipeline([('bow_model', text2bow_model), ('ldamodel', lda_model), ('classifier', clf)])
text_t2b.fit(data.data, data.target)
score = text_t2b.score(data.data, data.target)

print(score)

0.9723154362416108

TfIdf Model¶

To use the TfIdf model, begin by importing the TfIdfTransformer wrapper:

In [31]:

from gensim.sklearn_api import TfIdfTransformer

Example of Using Pipeline¶

In [32]:

tfidf_model = TfIdfTransformer()
tfidf_model.fit(corpus)
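# note: np.random.seed(0) returns None, so random_state is None here;
# the call only seeds NumPy's global RNG as a side effect
lda_model = LdaTransformer(num_topics=2, passes=10, minimum_probability=0, random_state=np.random.seed(0))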
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)
text_tfidf = Pipeline((('tfidf_model', tfidf_model), ('ldamodel', lda_model), ('classifier', clf)))
text_tfidf.fit(corpus, data.target)
score = text_tfidf.score(corpus, data.target)

print(score)

0.735738255033557
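
To illustrate what a tf-idf step does to a bag-of-words corpus before it reaches the LDA model, here is a minimal pure-Python sketch. It assumes one common weighting, term frequency times log2(N / document frequency) followed by L2 normalization; gensim's TfidfModel offers several weighting options, so its output may differ in detail. The function name tfidf is chosen for this example.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Reweight (term_id, count) documents by tf * log2(N / df), then L2-normalize."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term_id for doc in corpus for term_id, _ in doc)
    weighted = []
    for doc in corpus:
        vec = [(term_id, tf * math.log2(n_docs / df[term_id])) for term_id, tf in doc]
        # Terms that occur in every document get weight 0 and are dropped.
        vec = [(term_id, weight) for term_id, weight in vec if weight > 0.0]
        norm = math.sqrt(sum(weight * weight for _, weight in vec))
        weighted.append([(term_id, weight / norm) for term_id, weight in vec] if norm else [])
    return weighted


toy_corpus = [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
result = tfidf(toy_corpus)
print(result)  # [[(1, 1.0)], [(2, 1.0)]]
```

Term 0 appears in both toy documents, so it carries no discriminating information and disappears, while the terms unique to each document keep all the weight.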

HDP Model¶

To use the HDP model, begin by importing the HdpTransformer wrapper:

In [33]:

from gensim.sklearn_api import HdpTransformer

Example of Using Pipeline¶

In [34]:

model = HdpTransformer(id2word=id2word)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)
text_hdp = Pipeline([('features', model,), ('classifier', clf)])
text_hdp.fit(corpus, data.target)
score = text_hdp.score(corpus, data.target)

print(score)

0.8271812080536913
