This tutorial shows how to use gensim models as part of a scikit-learn workflow with the help of the wrappers found in gensim.sklearn_api.
The wrappers available (as of now) are:
LdaModel (gensim.sklearn_api.ldamodel.LdaTransformer), which implements gensim's LDA Model in a scikit-learn interface
LsiModel (gensim.sklearn_api.lsimodel.LsiTransformer), which implements gensim's LSI Model in a scikit-learn interface
RpModel (gensim.sklearn_api.rpmodel.RpTransformer), which implements gensim's Random Projections Model in a scikit-learn interface
LdaSeqModel (gensim.sklearn_api.ldaseqmodel.LdaSeqTransformer), which implements gensim's LdaSeqModel in a scikit-learn interface
Word2Vec (gensim.sklearn_api.w2vmodel.W2VTransformer), which implements gensim's Word2Vec in a scikit-learn interface
AuthorTopicModel (gensim.sklearn_api.atmodel.AuthorTopicTransformer), which implements gensim's AuthorTopicModel in a scikit-learn interface
Doc2Vec (gensim.sklearn_api.d2vmodel.D2VTransformer), which implements gensim's Doc2Vec in a scikit-learn interface
Text2Bow (gensim.sklearn_api.text2bow.Text2BowTransformer), which implements gensim's Dictionary in a scikit-learn interface
TfidfModel (gensim.sklearn_api.tfidf.TfIdfTransformer), which implements gensim's TfidfModel in a scikit-learn interface
HdpModel (gensim.sklearn_api.hdp.HdpTransformer), which implements gensim's HdpModel in a scikit-learn interface
To use LdaModel, begin by importing the LdaModel wrapper:
from gensim.sklearn_api import LdaTransformer
Next, we create a dummy set of texts and convert it into a corpus:
from gensim.corpora import Dictionary
texts = [
['complier', 'system', 'computer'],
['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
['graph', 'flow', 'network', 'graph'],
['loading', 'computer', 'system'],
['user', 'server', 'system'],
['tree', 'hamiltonian'],
['graph', 'trees'],
['computer', 'kernel', 'malfunction', 'computer'],
['server', 'system', 'computer']
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
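For readers new to the bag-of-words format, the following is a minimal pure-Python sketch of the idea behind the dictionary and corpus built above: every distinct token gets an integer id, and each document becomes a list of (token_id, count) pairs. The helpers build_vocab and to_bow are hypothetical names used only for illustration; this is not gensim's implementation, and gensim's Dictionary may assign different ids.

```python
from collections import Counter


def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in text:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(text, vocab):
    """Return sorted (token_id, count) pairs for the tokens known to the vocabulary."""
    counts = Counter(vocab[token] for token in text if token in vocab)
    return sorted(counts.items())


toy_texts = [['graph', 'flow', 'network', 'graph'], ['graph', 'trees']]
toy_vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(text, toy_vocab) for text in toy_texts]
print(toy_corpus)
```

The repeated token 'graph' in the first toy document shows up as a count of 2 rather than as two separate entries.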
Then we run the LdaModel on it:
model = LdaTransformer(num_topics=2, id2word=dictionary, iterations=20, random_state=1)
model.fit(corpus)
model.transform(corpus)
array([[0.84165996, 0.15834005], [0.716593 , 0.28340697], [0.11434125, 0.88565874], [0.80545014, 0.19454984], [0.39609504, 0.603905 ], [0.80124027, 0.19875973], [0.19269218, 0.80730784], [0.8466452 , 0.15335481], [0.67057097, 0.32942903]], dtype=float32)
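Each row of this output corresponds to one document and each column to one topic, so the weights in a row sum to roughly 1. As a small sketch of how such output might be read, the snippet below copies the first three rows shown above and picks the dominant topic per document. The helper dominant_topic is a hypothetical name, not part of gensim or scikit-learn.

```python
# First three rows of the transform output shown above.
doc_topics = [
    [0.84165996, 0.15834005],
    [0.716593, 0.28340697],
    [0.11434125, 0.88565874],
]


def dominant_topic(row):
    """Return the index of the largest topic weight in one row."""
    return max(range(len(row)), key=lambda k: row[k])


dominant = [dominant_topic(row) for row in doc_topics]
row_sums = [sum(row) for row in doc_topics]
print(dominant)
```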
To give a better example of how the wrapper can be used with scikit-learn, let's work with the 20 Newsgroups data set. We will use only the categories rec.sport.baseball and sci.crypt and generate topics from them.
import numpy as np
from gensim import matutils
from gensim.models.ldamodel import LdaModel
from sklearn.datasets import fetch_20newsgroups
from gensim.sklearn_api.ldamodel import LdaTransformer
rand = np.random.mtrand.RandomState(1) # set seed for getting same result
cats = ['rec.sport.baseball', 'sci.crypt']
data = fetch_20newsgroups(subset='train', categories=cats, shuffle=True)
Next, we use the loaded data to create our dictionary and corpus.
data_texts = [_.split() for _ in data.data]
id2word = Dictionary(data_texts)
corpus = [id2word.doc2bow(i.split()) for i in data.data]
Next, we just need to fit corpus and id2word to our Lda wrapper.
obj = LdaTransformer(id2word=id2word, num_topics=5, iterations=20)
lda = obj.fit(corpus)
from sklearn.model_selection import GridSearchCV
The inbuilt score function of the Lda wrapper class provides two modes, perplexity and u_mass, for computing the scores of the candidate models. The preferred mode for the scoring function is specified using the scorer parameter of the wrapper as follows:
obj = LdaTransformer(id2word=id2word, num_topics=2, iterations=5, scorer='u_mass') # here 'scorer' can be 'perplexity' or 'u_mass'
parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}
# set `scoring` to `None` to use the inbuilt score function of the `LdaTransformer` class
model = GridSearchCV(obj, parameters, cv=3, scoring=None)
model.fit(corpus)
model.best_params_
{'iterations': 20, 'num_topics': 3}
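To make the cost of this search concrete, here is a small sketch that expands the same parameter grid by hand. GridSearchCV evaluates every combination once per cross-validation fold (plus one final refit on the full data when refit is left at its default), so the grid above leads to 12 candidates and 36 cross-validation fits. The variable names candidates and n_fits are illustrative only.

```python
from itertools import product

parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}
cv = 3  # number of cross-validation folds, as in the search above

# Build one dict per combination of parameter values.
names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

# Each candidate is fitted once per fold.
n_fits = len(candidates) * cv
print(len(candidates), n_fits)
```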
You can also supply a custom scoring function of your choice using the scoring parameter of GridSearchCV. The example shown below uses the c_v mode of the CoherenceModel class for computing the scores of the candidate models.
from gensim.models.coherencemodel import CoherenceModel
# supplying a custom scoring function
def scoring_function(estimator, X, y=None):
    goodcm = CoherenceModel(model=estimator.gensim_model, texts=data_texts, dictionary=estimator.gensim_model.id2word, coherence='c_v')
    return goodcm.get_coherence()
obj = LdaTransformer(id2word=id2word, num_topics=5, iterations=5)
parameters = {'num_topics': (2, 3, 5, 10), 'iterations': (1, 20, 50)}
# set `scoring` as your custom scoring function
model = GridSearchCV(obj, parameters, cv=2, scoring=scoring_function)
model.fit(corpus)
model.best_params_
{'iterations': 50, 'num_topics': 2}
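The contract for a custom scorer is simple: it is a callable that takes (estimator, X, y) and returns a single number, where larger means better. The sketch below illustrates that contract with a toy stand-in instead of a real model. ToyEstimator and toy_scoring are hypothetical names, and the scoring rule is made up purely so the example can run without training anything.

```python
class ToyEstimator:
    """Hypothetical stand-in for a fitted model with one hyperparameter."""

    def __init__(self, num_topics):
        self.num_topics = num_topics


def toy_scoring(estimator, X, y=None):
    """Same call signature as scoring_function above; larger values are better."""
    # Made-up rule: pretend that 3 topics is ideal.
    return -abs(estimator.num_topics - 3)


fitted = [ToyEstimator(k) for k in (2, 3, 5, 10)]
scores = {est.num_topics: toy_scoring(est, X=None) for est in fitted}

# A search keeps the candidate with the highest score.
best_num_topics = max(scores, key=scores.get)
print(best_num_topics)
```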
from sklearn.pipeline import Pipeline
from sklearn import linear_model
def print_features_pipe(clf, vocab, n=10):
    ''' Better printing for sorted list '''
    # FIXME: this function is broken
    coef = clf.named_steps['classifier'].coef_[0]
    print(coef)
    print('Positive features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[::-1][:n] if coef[j] > 0])))
    print('Negative features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[:n] if coef[j] < 0])))
id2word = Dictionary([_.split() for _ in data.data])
corpus = [id2word.doc2bow(i.split()) for i in data.data]
model = LdaTransformer(num_topics=15, id2word=id2word, iterations=10, random_state=37)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1) # l2 penalty used
pipe = Pipeline([('features', model,), ('classifier', clf)])
pipe.fit(corpus, data.target)
# print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))
/home/misha/envs/gensim/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning. FutureWarning)
0.6459731543624161
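The pipeline above chains a feature step and a classifier: on fit, the data passes through the transformer before reaching the final estimator, and on score the same transformation is applied before predicting. The following is a rough pure-Python sketch of that mechanism under simplifying assumptions. MiniPipeline, DocLength, and MidpointClassifier are hypothetical toy classes, not scikit-learn's or gensim's implementations.

```python
class DocLength:
    """Toy transformer: maps a bag-of-words document to its total token count."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[sum(count for _, count in doc)] for doc in X]


class MidpointClassifier:
    """Toy classifier: thresholds one feature halfway between the class means."""

    def fit(self, X, y):
        zeros = [row[0] for row, label in zip(X, y) if label == 0]
        ones = [row[0] for row, label in zip(X, y) if label == 1]
        mean_ones = sum(ones) / len(ones)
        self.threshold_ = (sum(zeros) / len(zeros) + mean_ones) / 2
        self.ones_above_ = mean_ones > self.threshold_
        return self

    def predict(self, X):
        return [int((row[0] > self.threshold_) == self.ones_above_) for row in X]


class MiniPipeline:
    """Chains transformers and a final estimator, given as (name, object) pairs."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)

    def score(self, X, y):
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


toy_corpus = [[(0, 1)], [(0, 1), (1, 1)], [(0, 3), (1, 2)], [(2, 4), (3, 2)]]
toy_target = [0, 0, 1, 1]
toy_pipe = MiniPipeline([('features', DocLength()), ('classifier', MidpointClassifier())])
toy_pipe.fit(toy_corpus, toy_target)
toy_score = toy_pipe.score(toy_corpus, toy_target)
print(toy_score)
```

As in the tutorial, the score here is computed on the same data used for fitting, so it measures fit to the training set rather than generalization.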
To use LsiModel, begin by importing the LsiModel wrapper:
from gensim.sklearn_api import LsiTransformer
model = LsiTransformer(num_topics=15, id2word=id2word)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1) # l2 penalty used
pipe = Pipeline([('features', model,), ('classifier', clf)])
pipe.fit(corpus, data.target)
# print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))
0.8657718120805369
To use RpModel, begin by importing the RpModel wrapper:
from gensim.sklearn_api import RpTransformer
model = RpTransformer(num_topics=2)
np.random.seed(1)  # seed NumPy's global generator to get the same result on every run
clf = linear_model.LogisticRegression(penalty='l2', C=0.1) # l2 penalty used
pipe = Pipeline([('features', model,), ('classifier', clf)])
pipe.fit(corpus, data.target)
# print_features_pipe(pipe, id2word.values())
print(pipe.score(corpus, data.target))
0.5461409395973155
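A random projection maps each high-dimensional bag-of-words vector to a few dimensions by multiplying it with a randomly drawn matrix, which is why results vary between runs unless the random generator is seeded. The sketch below illustrates the idea in pure Python. The helpers make_projection and project are hypothetical, and the Gaussian entries are an assumption made for the sketch rather than a description of how gensim builds its matrix.

```python
import random


def make_projection(num_terms, num_topics, seed):
    """Draw a num_topics x num_terms matrix from a seeded generator."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(num_terms)] for _ in range(num_topics)]


def project(bow, projection):
    """Multiply a sparse (token_id, count) document by the projection matrix."""
    return [sum(row[token_id] * count for token_id, count in bow) for row in projection]


bow = [(0, 2), (3, 1)]
first = project(bow, make_projection(num_terms=5, num_topics=2, seed=1))
second = project(bow, make_projection(num_terms=5, num_topics=2, seed=1))
other = project(bow, make_projection(num_terms=5, num_topics=2, seed=2))
print(first == second, first == other)
```

With the same seed the two projections agree exactly, while a different seed gives a different low-dimensional representation of the same document.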
To use LdaSeqModel, begin by importing the LdaSeqModel wrapper:
from gensim.sklearn_api import LdaSeqTransformer
test_data = data.data[0:2]
test_target = data.target[0:2]
id2word_ldaseq = Dictionary(map(lambda x: x.split(), test_data))
corpus_ldaseq = [id2word_ldaseq.doc2bow(i.split()) for i in test_data]
model = LdaSeqTransformer(id2word=id2word_ldaseq, num_topics=2, time_slice=[1, 1, 1], initialize='gensim')
clf = linear_model.LogisticRegression(penalty='l2', C=0.1) # l2 penalty used
pipe = Pipeline([('features', model,), ('classifier', clf)])
pipe.fit(corpus_ldaseq, test_target)
# print_features_pipe(pipe, id2word_ldaseq.values())
print(pipe.score(corpus_ldaseq, test_target))
/home/misha/git/gensim/gensim/models/ldaseqmodel.py:293: RuntimeWarning: divide by zero encountered in double_scalars
  convergence = np.fabs((bound - old_bound) / old_bound)
1.0
To use the Word2Vec model, begin by importing the Word2Vec wrapper:
from gensim.sklearn_api import W2VTransformer
w2v_texts = [
['calculus', 'is', 'the', 'mathematical', 'study', 'of', 'continuous', 'change'],
['geometry', 'is', 'the', 'study', 'of', 'shape'],
['algebra', 'is', 'the', 'study', 'of', 'generalizations', 'of', 'arithmetic', 'operations'],
['differential', 'calculus', 'is', 'related', 'to', 'rates', 'of', 'change', 'and', 'slopes', 'of', 'curves'],
['integral', 'calculus', 'is', 'realted', 'to', 'accumulation', 'of', 'quantities', 'and', 'the', 'areas', 'under', 'and', 'between', 'curves'],
['physics', 'is', 'the', 'natural', 'science', 'that', 'involves', 'the', 'study', 'of', 'matter', 'and', 'its', 'motion', 'and', 'behavior', 'through', 'space', 'and', 'time'],
['the', 'main', 'goal', 'of', 'physics', 'is', 'to', 'understand', 'how', 'the', 'universe', 'behaves'],
['physics', 'also', 'makes', 'significant', 'contributions', 'through', 'advances', 'in', 'new', 'technologies', 'that', 'arise', 'from', 'theoretical', 'breakthroughs'],
['advances', 'in', 'the', 'understanding', 'of', 'electromagnetism', 'or', 'nuclear', 'physics', 'led', 'directly', 'to', 'the', 'development', 'of', 'new', 'products', 'that', 'have', 'dramatically', 'transformed', 'modern', 'day', 'society']
]
model = W2VTransformer(size=10, min_count=1)
model.fit(w2v_texts)
class_dict = {'mathematics': 1, 'physics': 0}
train_data = [
('calculus', 'mathematics'), ('mathematical', 'mathematics'), ('geometry', 'mathematics'), ('operations', 'mathematics'), ('curves', 'mathematics'),
('natural', 'physics'), ('nuclear', 'physics'), ('science', 'physics'), ('electromagnetism', 'physics'), ('natural', 'physics')
]
train_input = list(map(lambda x: x[0], train_data))
train_target = list(map(lambda x: class_dict[x[1]], train_data))
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)
clf.fit(model.transform(train_input), train_target)
text_w2v = Pipeline([('features', model,), ('classifier', clf)])
score = text_w2v.score(train_input, train_target)
print(score)
0.9
To use the AuthorTopic model, begin by importing the AuthorTopic wrapper:
from gensim.sklearn_api import AuthorTopicTransformer
from sklearn import cluster
atm_texts = [
['complier', 'system', 'computer'],
['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
['graph', 'flow', 'network', 'graph'],
['loading', 'computer', 'system'],
['user', 'server', 'system'],
['tree', 'hamiltonian'],
['graph', 'trees'],
['computer', 'kernel', 'malfunction', 'computer'],
['server', 'system', 'computer'],
]
atm_dictionary = Dictionary(atm_texts)
atm_corpus = [atm_dictionary.doc2bow(text) for text in atm_texts]
author2doc = {'john': [0, 1, 2, 3, 4, 5, 6], 'jane': [2, 3, 4, 5, 6, 7, 8], 'jack': [0, 2, 4, 6, 8], 'jill': [1, 3, 5, 7]}
model = AuthorTopicTransformer(id2word=atm_dictionary, author2doc=author2doc, num_topics=10, passes=100)
model.fit(atm_corpus)
# create and train clustering model
clstr = cluster.MiniBatchKMeans(n_clusters=2)
authors_full = ['john', 'jane', 'jack', 'jill']
clstr.fit(model.transform(authors_full))
# stack together the two models in a pipeline
text_atm = Pipeline([('features', model,), ('cluster', clstr)])
author_list = ['jane', 'jack', 'jill']
ret_val = text_atm.predict(author_list)
print(ret_val)
[1 1 0]
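The author2doc mapping used above links each author to the indices of their documents in atm_corpus. It can help to look at the same information from the document side. The sketch below inverts the mapping in plain Python; invert_author2doc is a hypothetical helper written for this illustration.

```python
author2doc = {
    'john': [0, 1, 2, 3, 4, 5, 6],
    'jane': [2, 3, 4, 5, 6, 7, 8],
    'jack': [0, 2, 4, 6, 8],
    'jill': [1, 3, 5, 7],
}


def invert_author2doc(author2doc):
    """Build a mapping from document index to the sorted list of its authors."""
    doc2author = {}
    for author, doc_ids in author2doc.items():
        for doc_id in doc_ids:
            doc2author.setdefault(doc_id, []).append(author)
    return {doc_id: sorted(authors) for doc_id, authors in sorted(doc2author.items())}


doc2author = invert_author2doc(author2doc)
print(doc2author[0])
```

Every one of the nine documents has at least two authors, which is what lets the model share topic information across authors.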
To use the Doc2Vec model, begin by importing the Doc2Vec wrapper:
from gensim.sklearn_api import D2VTransformer
from gensim.models import doc2vec
d2v_sentences = [doc2vec.TaggedDocument(words, [i]) for i, words in enumerate(w2v_texts)]
model = D2VTransformer(min_count=1)
model.fit(d2v_sentences)
class_dict = {'mathematics': 1, 'physics': 0}
train_data = [
(['calculus', 'mathematical'], 'mathematics'), (['geometry', 'operations', 'curves'], 'mathematics'),
(['natural', 'nuclear'], 'physics'), (['science', 'electromagnetism', 'natural'], 'physics')
]
train_input = list(map(lambda x: x[0], train_data))
train_target = list(map(lambda x: class_dict[x[1]], train_data))
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)
clf.fit(model.transform(train_input), train_target)
text_d2v = Pipeline([('features', model,), ('classifier', clf)])
score = text_d2v.score(train_input, train_target)
print(score)
1.0
To use the Text2Bow model, begin by importing the Text2Bow wrapper:
from gensim.sklearn_api import Text2BowTransformer
text2bow_model = Text2BowTransformer()
lda_model = LdaTransformer(num_topics=2, passes=10, minimum_probability=0, random_state=np.random.seed(0))
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)
text_t2b = Pipeline([('bow_model', text2bow_model), ('ldamodel', lda_model), ('classifier', clf)])
text_t2b.fit(data.data, data.target)
score = text_t2b.score(data.data, data.target)
print(score)
0.9723154362416108
To use the TfIdf model, begin by importing the TfIdf wrapper:
from gensim.sklearn_api import TfIdfTransformer
tfidf_model = TfIdfTransformer()
tfidf_model.fit(corpus)
lda_model = LdaTransformer(num_topics=2, passes=10, minimum_probability=0, random_state=np.random.seed(0))
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)
text_tfidf = Pipeline([('tfidf_model', tfidf_model), ('ldamodel', lda_model), ('classifier', clf)])
text_tfidf.fit(corpus, data.target)
score = text_tfidf.score(corpus, data.target)
print(score)
0.735738255033557
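The TF-IDF step in this pipeline reweights the raw counts so that tokens occurring in many documents contribute less. As a rough illustration, the sketch below assumes one common variant: term count times log2(N / document frequency), followed by L2 normalization of each document. The function tfidf is a hypothetical helper, and gensim's TfidfModel is configurable, so its output may differ in detail.

```python
import math


def tfidf(corpus):
    """Reweight (token_id, count) documents by count * log2(N / df), then L2-normalize."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted = []
    for doc in corpus:
        vec = [(t, c * math.log2(n_docs / doc_freq[t])) for t, c in doc]
        norm = math.sqrt(sum(w * w for _, w in vec))
        # A document whose tokens all appear everywhere ends up with no weight.
        weighted.append([(t, w / norm) for t, w in vec] if norm else [])
    return weighted


toy_corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 2)], [(1, 1), (2, 1), (3, 1)]]
toy_tfidf = tfidf(toy_corpus)
for doc in toy_tfidf:
    print([(t, round(w, 3)) for t, w in doc])
```

In the last toy document, token 3 appears in no other document and therefore receives the largest weight of the three tokens.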
To use the HDP model, begin by importing the HDP wrapper:
from gensim.sklearn_api import HdpTransformer
model = HdpTransformer(id2word=id2word)
clf = linear_model.LogisticRegression(penalty='l2', C=0.1)
text_hdp = Pipeline([('features', model,), ('classifier', clf)])
text_hdp.fit(corpus, data.target)
score = text_hdp.score(corpus, data.target)
print(score)
0.8271812080536913