NLTK (Natural Language Toolkit) is a Python package for working with human language data. It provides multiple libraries and functions for processing text, tokenizing, generating derived word forms, and more.
Gensim is a Python package used for topic modeling, representing text as vectors, and similarity analysis.
We will use these two packages to build a sentiment classifier.
# Install any packages we are missing with pip or conda, for example:
#!pip install --quiet nltk gensim wordcloud
import pandas as pd
import numpy as np
We will use a dataset of tweets about Covid.
cov_text_df = pd.read_csv('../Datos/Corona_NLP_test.csv')
display(cov_text_df)
UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment | |
---|---|---|---|---|---|---|
0 | 1 | 44953 | NYC | 02-03-2020 | TRENDING: New Yorkers encounter empty supermar... | Extremely Negative |
1 | 2 | 44954 | Seattle, WA | 02-03-2020 | When I couldn't find hand sanitizer at Fred Me... | Positive |
2 | 3 | 44955 | NaN | 02-03-2020 | Find out how you can protect yourself and love... | Extremely Positive |
3 | 4 | 44956 | Chicagoland | 02-03-2020 | #Panic buying hits #NewYork City as anxious sh... | Negative |
4 | 5 | 44957 | Melbourne, Victoria | 03-03-2020 | #toiletpaper #dunnypaper #coronavirus #coronav... | Neutral |
... | ... | ... | ... | ... | ... | ... |
3793 | 3794 | 48746 | Israel ?? | 16-03-2020 | Meanwhile In A Supermarket in Israel -- People... | Positive |
3794 | 3795 | 48747 | Farmington, NM | 16-03-2020 | Did you panic buy a lot of non-perishable item... | Negative |
3795 | 3796 | 48748 | Haverford, PA | 16-03-2020 | Asst Prof of Economics @cconces was on @NBCPhi... | Neutral |
3796 | 3797 | 48749 | NaN | 16-03-2020 | Gov need to do somethings instead of biar je r... | Extremely Negative |
3797 | 3798 | 48750 | Arlington, Virginia | 16-03-2020 | I and @ForestandPaper members are committed to... | Extremely Positive |
3798 rows × 6 columns
cov_text_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3798 entries, 0 to 3797
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   UserName       3798 non-null   int64
 1   ScreenName     3798 non-null   int64
 2   Location       2964 non-null   object
 3   TweetAt        3798 non-null   object
 4   OriginalTweet  3798 non-null   object
 5   Sentiment      3798 non-null   object
dtypes: int64(2), object(4)
memory usage: 178.2+ KB
In this specific case we are only interested in the original tweets and their corresponding sentiment, so we will drop the remaining columns.
cov_text_df = cov_text_df.iloc[:,4:]
cov_text_df.head(5)
OriginalTweet | Sentiment | |
---|---|---|
0 | TRENDING: New Yorkers encounter empty supermar... | Extremely Negative |
1 | When I couldn't find hand sanitizer at Fred Me... | Positive |
2 | Find out how you can protect yourself and love... | Extremely Positive |
3 | #Panic buying hits #NewYork City as anxious sh... | Negative |
4 | #toiletpaper #dunnypaper #coronavirus #coronav... | Neutral |
Let's look at a few example tweets.
for text in cov_text_df.OriginalTweet.head():
    print("tweet: ", text, '\n')
tweet: TRENDING: New Yorkers encounter empty supermarket shelves (pictured, Wegmans in Brooklyn), sold-out online grocers (FoodKick, MaxDelivery) as #coronavirus-fearing shoppers stock up https://t.co/Gr76pcrLWh https://t.co/ivMKMsqdT1

tweet: When I couldn't find hand sanitizer at Fred Meyer, I turned to #Amazon. But $114.97 for a 2 pack of Purell??!!Check out how #coronavirus concerns are driving up prices. https://t.co/ygbipBflMY

tweet: Find out how you can protect yourself and loved ones from #coronavirus. ?

tweet: #Panic buying hits #NewYork City as anxious shoppers stock up on food&medical supplies after #healthcare worker in her 30s becomes #BigApple 1st confirmed #coronavirus patient OR a #Bloomberg staged event? https://t.co/IASiReGPC4 #QAnon #QAnon2018 #QAnon2020 #Election2020 #CDC https://t.co/29isZOewxu

tweet: #toiletpaper #dunnypaper #coronavirus #coronavirusaustralia #CoronaVirusUpdate #Covid_19 #9News #Corvid19 #7NewsMelb #dunnypapergate #Costco One week everyone buying baby milk powder the next everyone buying up toilet paper. https://t.co/ScZryVvsIh
We can now create some new columns and compute a few interesting statistics.
cov_text_df["num_words"] = cov_text_df['OriginalTweet'].apply(lambda x: len(str(x).split()))
print('maximum number of words in a tweet:', cov_text_df["num_words"].max())
print('minimum number of words in a tweet:', cov_text_df["num_words"].min())
maximum number of words in a tweet: 62
minimum number of words in a tweet: 2
cov_text_df["num_unique_words"] = cov_text_df['OriginalTweet'].apply(lambda x: len(set(str(x).split())))
print('maximum number of unique words in a tweet:', cov_text_df['num_unique_words'].max())
print('average number of unique words per tweet:', cov_text_df['num_unique_words'].mean())
maximum number of unique words in a tweet: 52
average number of unique words per tweet: 30.135071090047393
cov_text_df["num_chars"] = cov_text_df['OriginalTweet'].apply(lambda x: len(str(x)))
print('maximum number of characters in a tweet:', cov_text_df["num_chars"].max())
maximum number of characters in a tweet: 342
import string
cov_text_df["num_punctuations"] =cov_text_df['OriginalTweet'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))
print('número máximo de puntuaciones en un tweet:',cov_text_df["num_punctuations"].max())
maximum number of punctuation marks in a tweet: 44
Now let's look at the different sentiments we are going to classify.
np.unique(cov_text_df['Sentiment'])
array(['Extremely Negative', 'Extremely Positive', 'Negative', 'Neutral', 'Positive'], dtype=object)
We now need to give them numeric values so they can be processed.
labels_dict = {
'Extremely Negative': 0,
'Negative': 1,
'Neutral': 2,
'Positive': 3,
'Extremely Positive': 4
}
# Assign the numeric label for each row's sentiment
for index in range(len(cov_text_df)):
    key = cov_text_df.iloc[index]['Sentiment']
    cov_text_df.at[index, 'Sentiment_hot'] = labels_dict[key]
cov_text_df.head(3)
OriginalTweet | Sentiment | num_words | num_unique_words | num_chars | num_punctuations | Sentiment_hot | |
---|---|---|---|---|---|---|---|
0 | TRENDING: New Yorkers encounter empty supermar... | Extremely Negative | 23 | 23 | 228 | 21 | 0.0 |
1 | When I couldn't find hand sanitizer at Fred Me... | Positive | 30 | 29 | 193 | 17 | 3.0 |
2 | Find out how you can protect yourself and love... | Extremely Positive | 13 | 13 | 73 | 3 | 4.0 |
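As an aside, the row-by-row loop above can be replaced by a vectorized mapping; a minimal sketch assuming the same `labels_dict` (note that `.map` keeps the labels as integers, whereas the loop yields floats because the column is created with `.at`):

# Hypothetical one-step alternative to the labelling loop above
cov_text_df['Sentiment_hot'] = cov_text_df['Sentiment'].map(labels_dict)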
Stopwords are words we need to leave out of our analysis because they provide no relevant information; they are usually connectors or words that tend to repeat a lot.
NLTK has several lists for different cases, particularly for different languages, although we could also define our own stopword list (a small sketch of that appears below).
To use these lists, we first have to download them.
import nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/cjtorresj/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True
from nltk.corpus import stopwords
eng_stopwords = set(stopwords.words("english"))
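Since we could also define our own stopword list, here is a minimal sketch extending NLTK's English set with a few invented, domain-specific terms:

# Illustrative only: add custom, domain-specific stopwords to the NLTK set
custom_stopwords = eng_stopwords.union({'covid', 'coronavirus', 'amp'})
print(len(eng_stopwords), len(custom_stopwords))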
A good first approach for spotting relevant words in the text is a "word cloud". We use the stopword list to filter the words and get a general idea of the text.
from wordcloud import WordCloud as wc
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
def generate_wordcloud(text):
    # Build the cloud, filtering out English stopwords, and display it
    wordcloud = wc(relative_scaling=1.0, stopwords=eng_stopwords).generate(text)
    fig, ax = plt.subplots(1, 1, figsize=(10, 10))
    ax.imshow(wordcloud, interpolation='bilinear')
    ax.axis("off")
    ax.margins(x=0, y=0)
    plt.show()
text =" ".join(cov_text_df.OriginalTweet)
generate_wordcloud(text)
We can also look at tweet trends based on the specific sentiment.
import seaborn as sns
sns.violinplot(data=cov_text_df,x="Sentiment_hot", y="num_words")
<AxesSubplot:xlabel='Sentiment_hot', ylabel='num_words'>
The most important step in NLP is tokenization: separating or "deconstructing" the text into pieces. These pieces can be words, sentences, symbols, etc.
In this specific case, given the nature of a tweet, we will only tokenize by words.
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
[nltk_data] Downloading package punkt to /home/cjtorresj/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
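As a quick aside, the two tokenizers imported above do different jobs: `sent_tokenize` splits text into sentences, while `word_tokenize` splits it into words. A minimal sketch on an invented sentence:

# Illustrative example (invented text): sentence vs. word tokenization
sample = "Shelves are empty. People keep buying toilet paper."
print(sent_tokenize(sample))  # expect roughly: ['Shelves are empty.', 'People keep buying toilet paper.']
print(word_tokenize(sample))  # expect roughly: ['Shelves', 'are', 'empty', '.', 'People', ...]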
We define an example case.
main_text = cov_text_df.OriginalTweet[35]
print(main_text)
Remember, with all the media deflection stories about bog rolls, panic-buying, food shortages, Covid-19, Irish bridges etc etc. We've still got Brexit, Austerity, poverty, a crashing economy & the worst UK Govt in living memory. All depressing stuff & it aint gonna end well.
Before tokenizing, it is a good idea to remove the punctuation.
import re
def remove_punct(text):
    # Strip punctuation characters, then strip digits
    text = "".join([char for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    return text
cov_text_df['TweetPunct'] = cov_text_df['OriginalTweet'].apply(lambda x: remove_punct(x))
cov_text_df.head()
OriginalTweet | Sentiment | num_words | num_unique_words | num_chars | num_punctuations | Sentiment_hot | TweetPunct | |
---|---|---|---|---|---|---|---|---|
0 | TRENDING: New Yorkers encounter empty supermar... | Extremely Negative | 23 | 23 | 228 | 21 | 0.0 | TRENDING New Yorkers encounter empty supermark... |
1 | When I couldn't find hand sanitizer at Fred Me... | Positive | 30 | 29 | 193 | 17 | 3.0 | When I couldnt find hand sanitizer at Fred Mey... |
2 | Find out how you can protect yourself and love... | Extremely Positive | 13 | 13 | 73 | 3 | 4.0 | Find out how you can protect yourself and love... |
3 | #Panic buying hits #NewYork City as anxious sh... | Negative | 37 | 37 | 318 | 24 | 1.0 | Panic buying hits NewYork City as anxious shop... |
4 | #toiletpaper #dunnypaper #coronavirus #coronav... | Neutral | 26 | 24 | 252 | 18 | 2.0 | toiletpaper dunnypaper coronavirus coronavirus... |
We apply word tokenization to the example case and then to the dataset as a whole.
main_text = cov_text_df.TweetPunct[35]
words = word_tokenize(main_text)
print(words)
['Remember', 'with', 'all', 'the', 'media', 'deflection', 'stories', 'about', 'bog', 'rolls', 'panicbuying', 'food', 'shortages', 'Covid', 'Irish', 'bridges', 'etc', 'etc', 'Weve', 'still', 'got', 'Brexit', 'Austerity', 'poverty', 'a', 'crashing', 'economy', 'amp', 'the', 'worst', 'UK', 'Govt', 'in', 'living', 'memory', 'All', 'depressing', 'stuff', 'amp', 'it', 'aint', 'gon', 'na', 'end', 'well']
cov_text_df['TweetTokenized'] = cov_text_df['TweetPunct'].apply(lambda x: word_tokenize(x.lower()))
cov_text_df.head()
OriginalTweet | Sentiment | num_words | num_unique_words | num_chars | num_punctuations | Sentiment_hot | TweetPunct | TweetTokenized | |
---|---|---|---|---|---|---|---|---|---|
0 | TRENDING: New Yorkers encounter empty supermar... | Extremely Negative | 23 | 23 | 228 | 21 | 0.0 | TRENDING New Yorkers encounter empty supermark... | [trending, new, yorkers, encounter, empty, sup... |
1 | When I couldn't find hand sanitizer at Fred Me... | Positive | 30 | 29 | 193 | 17 | 3.0 | When I couldnt find hand sanitizer at Fred Mey... | [when, i, couldnt, find, hand, sanitizer, at, ... |
2 | Find out how you can protect yourself and love... | Extremely Positive | 13 | 13 | 73 | 3 | 4.0 | Find out how you can protect yourself and love... | [find, out, how, you, can, protect, yourself, ... |
3 | #Panic buying hits #NewYork City as anxious sh... | Negative | 37 | 37 | 318 | 24 | 1.0 | Panic buying hits NewYork City as anxious shop... | [panic, buying, hits, newyork, city, as, anxio... |
4 | #toiletpaper #dunnypaper #coronavirus #coronav... | Neutral | 26 | 24 | 252 | 18 | 2.0 | toiletpaper dunnypaper coronavirus coronavirus... | [toiletpaper, dunnypaper, coronavirus, coronav... |
We use the stopword list again to filter out the least meaningful words.
stopWords = set(stopwords.words('english'))
wordsFiltered = []
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)
print(wordsFiltered)
['Remember', 'media', 'deflection', 'stories', 'bog', 'rolls', 'panicbuying', 'food', 'shortages', 'Covid', 'Irish', 'bridges', 'etc', 'etc', 'Weve', 'still', 'got', 'Brexit', 'Austerity', 'poverty', 'crashing', 'economy', 'amp', 'worst', 'UK', 'Govt', 'living', 'memory', 'All', 'depressing', 'stuff', 'amp', 'aint', 'gon', 'na', 'end', 'well']
We define a function that performs this same filtering, and then apply it to the whole dataset.
def remove_stopwords(words, stopwords):
    wordsFiltered = []
    for w in words:
        if w not in stopwords:
            wordsFiltered.append(w)
    return wordsFiltered
cov_text_df['TweetNoStop'] = cov_text_df['TweetTokenized'].apply(lambda x: remove_stopwords(x, stopWords))
cov_text_df.head()
OriginalTweet | Sentiment | num_words | num_unique_words | num_chars | num_punctuations | Sentiment_hot | TweetPunct | TweetTokenized | TweetNoStop | |
---|---|---|---|---|---|---|---|---|---|---|
0 | TRENDING: New Yorkers encounter empty supermar... | Extremely Negative | 23 | 23 | 228 | 21 | 0.0 | TRENDING New Yorkers encounter empty supermark... | [trending, new, yorkers, encounter, empty, sup... | [trending, new, yorkers, encounter, empty, sup... |
1 | When I couldn't find hand sanitizer at Fred Me... | Positive | 30 | 29 | 193 | 17 | 3.0 | When I couldnt find hand sanitizer at Fred Mey... | [when, i, couldnt, find, hand, sanitizer, at, ... | [couldnt, find, hand, sanitizer, fred, meyer, ... |
2 | Find out how you can protect yourself and love... | Extremely Positive | 13 | 13 | 73 | 3 | 4.0 | Find out how you can protect yourself and love... | [find, out, how, you, can, protect, yourself, ... | [find, protect, loved, ones, coronavirus] |
3 | #Panic buying hits #NewYork City as anxious sh... | Negative | 37 | 37 | 318 | 24 | 1.0 | Panic buying hits NewYork City as anxious shop... | [panic, buying, hits, newyork, city, as, anxio... | [panic, buying, hits, newyork, city, anxious, ... |
4 | #toiletpaper #dunnypaper #coronavirus #coronav... | Neutral | 26 | 24 | 252 | 18 | 2.0 | toiletpaper dunnypaper coronavirus coronavirus... | [toiletpaper, dunnypaper, coronavirus, coronav... | [toiletpaper, dunnypaper, coronavirus, coronav... |
In many cases, words with the same meaning appear in different forms or conjugations. As humans we recognize this, but to the computer they are completely different words. To mitigate this problem we apply a "stemming" algorithm, which reduces each word to its most basic, elemental form.
NLTK ships these algorithms directly. We will use the Porter stemmer, the best-known of them, which works by stripping suffixes.
from nltk.stem import PorterStemmer
main_text = cov_text_df.TweetNoStop[35]
ps = PorterStemmer()
for word in main_text:
    print(ps.stem(word))
rememb media deflect stori bog roll panicbuy food shortag covid irish bridg etc etc weve still got brexit auster poverti crash economi amp worst uk govt live memori depress stuff amp aint gon na end well
As we can see, some words end up misspelled or cut in the wrong place. This happens because these algorithms do not see each word in its specific context; they are simply strings the algorithm is applied to. This is the main drawback of stemming.
def stemming(text):
    text = [ps.stem(word) for word in text]
    return text
cov_text_df['TweetStemmed'] = cov_text_df['TweetNoStop'].apply(lambda x: stemming(x))
cov_text_df.head()
OriginalTweet | Sentiment | num_words | num_unique_words | num_chars | num_punctuations | Sentiment_hot | TweetPunct | TweetTokenized | TweetNoStop | TweetStemmed | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | TRENDING: New Yorkers encounter empty supermar... | Extremely Negative | 23 | 23 | 228 | 21 | 0.0 | TRENDING New Yorkers encounter empty supermark... | [trending, new, yorkers, encounter, empty, sup... | [trending, new, yorkers, encounter, empty, sup... | [trend, new, yorker, encount, empti, supermark... |
1 | When I couldn't find hand sanitizer at Fred Me... | Positive | 30 | 29 | 193 | 17 | 3.0 | When I couldnt find hand sanitizer at Fred Mey... | [when, i, couldnt, find, hand, sanitizer, at, ... | [couldnt, find, hand, sanitizer, fred, meyer, ... | [couldnt, find, hand, sanit, fred, meyer, turn... |
2 | Find out how you can protect yourself and love... | Extremely Positive | 13 | 13 | 73 | 3 | 4.0 | Find out how you can protect yourself and love... | [find, out, how, you, can, protect, yourself, ... | [find, protect, loved, ones, coronavirus] | [find, protect, love, one, coronaviru] |
3 | #Panic buying hits #NewYork City as anxious sh... | Negative | 37 | 37 | 318 | 24 | 1.0 | Panic buying hits NewYork City as anxious shop... | [panic, buying, hits, newyork, city, as, anxio... | [panic, buying, hits, newyork, city, anxious, ... | [panic, buy, hit, newyork, citi, anxiou, shopp... |
4 | #toiletpaper #dunnypaper #coronavirus #coronav... | Neutral | 26 | 24 | 252 | 18 | 2.0 | toiletpaper dunnypaper coronavirus coronavirus... | [toiletpaper, dunnypaper, coronavirus, coronav... | [toiletpaper, dunnypaper, coronavirus, coronav... | [toiletpap, dunnypap, coronaviru, coronavirusa... |
Lemmatization is another way of normalizing words, with the difference that its algorithm also relies on a database to arrive at a "lemma", the word that represents the meaning of all its derived forms.
Here the "WordNet" database is used, which contains an extensive list of words together with their conjugations and derived forms.
nltk.download('wordnet')
nltk.download('omw-1.4')
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/cjtorresj/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/cjtorresj/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
True
from nltk.stem import WordNetLemmatizer
wn = WordNetLemmatizer()
for word in main_text:
    print(wn.lemmatize(word))
remember medium deflection story bog roll panicbuying food shortage covid irish bridge etc etc weve still got brexit austerity poverty crashing economy amp worst uk govt living memory depressing stuff amp aint gon na end well
We can see that now every token is a real word, and derived forms such as "stories" and "bridges" are reduced to their singular lemmas.
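One caveat: `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed, which is why verb forms such as "crashing" or "living" were left unchanged above. A minimal sketch of the optional `pos` argument:

# Lemmatizing with an explicit part of speech ('v' = verb)
print(wn.lemmatize('crashing'))           # 'crashing' (treated as a noun by default)
print(wn.lemmatize('crashing', pos='v'))  # 'crash'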
def lemmatizer(text):
    text = [wn.lemmatize(word) for word in text]
    return text
cov_text_df['TweetLemmatized'] = cov_text_df['TweetNoStop'].apply(lambda x: lemmatizer(x))
cov_text_df.head()
OriginalTweet | Sentiment | num_words | num_unique_words | num_chars | num_punctuations | Sentiment_hot | TweetPunct | TweetTokenized | TweetNoStop | TweetStemmed | TweetLemmatized | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TRENDING: New Yorkers encounter empty supermar... | Extremely Negative | 23 | 23 | 228 | 21 | 0.0 | TRENDING New Yorkers encounter empty supermark... | [trending, new, yorkers, encounter, empty, sup... | [trending, new, yorkers, encounter, empty, sup... | [trend, new, yorker, encount, empti, supermark... | [trending, new, yorkers, encounter, empty, sup... |
1 | When I couldn't find hand sanitizer at Fred Me... | Positive | 30 | 29 | 193 | 17 | 3.0 | When I couldnt find hand sanitizer at Fred Mey... | [when, i, couldnt, find, hand, sanitizer, at, ... | [couldnt, find, hand, sanitizer, fred, meyer, ... | [couldnt, find, hand, sanit, fred, meyer, turn... | [couldnt, find, hand, sanitizer, fred, meyer, ... |
2 | Find out how you can protect yourself and love... | Extremely Positive | 13 | 13 | 73 | 3 | 4.0 | Find out how you can protect yourself and love... | [find, out, how, you, can, protect, yourself, ... | [find, protect, loved, ones, coronavirus] | [find, protect, love, one, coronaviru] | [find, protect, loved, one, coronavirus] |
3 | #Panic buying hits #NewYork City as anxious sh... | Negative | 37 | 37 | 318 | 24 | 1.0 | Panic buying hits NewYork City as anxious shop... | [panic, buying, hits, newyork, city, as, anxio... | [panic, buying, hits, newyork, city, anxious, ... | [panic, buy, hit, newyork, citi, anxiou, shopp... | [panic, buying, hit, newyork, city, anxious, s... |
4 | #toiletpaper #dunnypaper #coronavirus #coronav... | Neutral | 26 | 24 | 252 | 18 | 2.0 | toiletpaper dunnypaper coronavirus coronavirus... | [toiletpaper, dunnypaper, coronavirus, coronav... | [toiletpaper, dunnypaper, coronavirus, coronav... | [toiletpap, dunnypap, coronaviru, coronavirusa... | [toiletpaper, dunnypaper, coronavirus, coronav... |
With the preprocessing done, we can build the word cloud again and look at the differences.
text =" ".join(sum(cov_text_df.TweetLemmatized,[]))
generate_wordcloud(text)
We can also look at the word cloud for each sentiment.
for key in labels_dict.keys():
    print(key)
    text = " ".join(sum(cov_text_df[cov_text_df['Sentiment'] == key].TweetLemmatized, []))
    generate_wordcloud(text)
Extremely Negative
Negative
Neutral
Positive
Extremely Positive
With the texts now "clean", we need to transform them so that the computer can process them. The first step is to build a "dictionary" with all the unique words that appear in our tweets.
Gensim provides a dictionary class specialized for text.
from gensim import corpora
mydict = corpora.Dictionary(cov_text_df['TweetLemmatized'])
print("Total unique words:")
print(len(mydict.token2id))
print("\nSample data from dictionary:")
i = 0
for key in mydict.token2id.keys():
    print("Word: {} - ID: {} ".format(key, mydict.token2id[key]))
    if i == 3:
        break
    i += 1
Total unique words:
13025

Sample data from dictionary:
Word: brooklyn - ID: 0
Word: coronavirusfearing - ID: 1
Word: empty - ID: 2
Word: encounter - ID: 3
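To make the (token_id, count) pairs used below a little more concrete, here is a toy sketch with two invented documents: `doc2bow` looks each token up in the dictionary and returns its id together with its count in that document.

# Toy, illustrative example of Dictionary and doc2bow
toy_docs = [['empty', 'shelves', 'empty'], ['panic', 'buying']]
toy_dict = corpora.Dictionary(toy_docs)
print(toy_dict.token2id)              # e.g. {'empty': 0, 'shelves': 1, 'buying': 2, 'panic': 3}
print(toy_dict.doc2bow(toy_docs[0]))  # e.g. [(0, 2), (1, 1)]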
"term frequency–inverse document frequency" es una estadística con la que, a partir de la frecuencia de aparición de una palabra, podemos dar una idea de qué tan importante es esta palabra dentro de un documento o grupo de documentos.
Se observa la presencia de la palabra en cada documento específico, junto con el número de documentos en los que aparece. De esta forma podemos darle un peso o "importancia" a cada palabra.
Para esto, necesitamos entonces encontrar la frecuencia de las palabras en cada tweet, esto se hace creando una "Bag of Words" a partir de nuestro diccionario de palabras.
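As a rough sketch of the weighting (this reflects gensim's `TfidfModel` defaults, before its final length normalization), a term $t$ in a document $d$ from a corpus of $N$ documents gets

$$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot \log_2\frac{N}{\mathrm{df}(t)},$$

where $\mathrm{tf}(t,d)$ is the count of $t$ in $d$ (the Bag-of-Words value) and $\mathrm{df}(t)$ is the number of documents that contain $t$; each document's weight vector is then normalized to unit length.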
from gensim.models import TfidfModel
main_text = cov_text_df.TweetLemmatized[35]
mydict.doc2bow(main_text)
[(66, 1), (94, 1), (100, 2), (125, 1), (191, 1), (200, 1), (385, 1), (431, 1), (514, 1), (518, 1), (519, 1), (533, 1), (539, 1), (540, 1), (541, 1), (542, 1), (543, 1), (544, 1), (545, 1), (546, 2), (547, 1), (548, 1), (549, 1), (550, 1), (551, 1), (552, 1), (553, 1), (554, 1), (555, 1), (556, 1), (557, 1), (558, 1), (559, 1), (560, 1)]
cov_text_df['corpus'] = cov_text_df['TweetLemmatized'].apply(lambda x: mydict.doc2bow(x))
cov_text_df.head()
OriginalTweet | Sentiment | num_words | num_unique_words | num_chars | num_punctuations | Sentiment_hot | TweetPunct | TweetTokenized | TweetNoStop | TweetStemmed | TweetLemmatized | corpus | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TRENDING: New Yorkers encounter empty supermar... | Extremely Negative | 23 | 23 | 228 | 21 | 0.0 | TRENDING New Yorkers encounter empty supermark... | [trending, new, yorkers, encounter, empty, sup... | [trending, new, yorkers, encounter, empty, sup... | [trend, new, yorker, encount, empti, supermark... | [trending, new, yorkers, encounter, empty, sup... | [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1... |
1 | When I couldn't find hand sanitizer at Fred Me... | Positive | 30 | 29 | 193 | 17 | 3.0 | When I couldnt find hand sanitizer at Fred Mey... | [when, i, couldnt, find, hand, sanitizer, at, ... | [couldnt, find, hand, sanitizer, fred, meyer, ... | [couldnt, find, hand, sanit, fred, meyer, turn... | [couldnt, find, hand, sanitizer, fred, meyer, ... | [(20, 1), (21, 1), (22, 1), (23, 1), (24, 1), ... |
2 | Find out how you can protect yourself and love... | Extremely Positive | 13 | 13 | 73 | 3 | 4.0 | Find out how you can protect yourself and love... | [find, out, how, you, can, protect, yourself, ... | [find, protect, loved, ones, coronavirus] | [find, protect, love, one, coronaviru] | [find, protect, loved, one, coronavirus] | [(22, 1), (25, 1), (35, 1), (36, 1), (37, 1)] |
3 | #Panic buying hits #NewYork City as anxious sh... | Negative | 37 | 37 | 318 | 24 | 1.0 | Panic buying hits NewYork City as anxious shop... | [panic, buying, hits, newyork, city, as, anxio... | [panic, buying, hits, newyork, city, anxious, ... | [panic, buy, hit, newyork, citi, anxiou, shopp... | [panic, buying, hit, newyork, city, anxious, s... | [(13, 1), (15, 1), (22, 1), (38, 1), (39, 1), ... |
4 | #toiletpaper #dunnypaper #coronavirus #coronav... | Neutral | 26 | 24 | 252 | 18 | 2.0 | toiletpaper dunnypaper coronavirus coronavirus... | [toiletpaper, dunnypaper, coronavirus, coronav... | [toiletpaper, dunnypaper, coronavirus, coronav... | [toiletpap, dunnypap, coronaviru, coronavirusa... | [toiletpaper, dunnypaper, coronavirus, coronav... | [(22, 1), (36, 1), (42, 2), (61, 1), (62, 1), ... |
Now we can create the TF-IDF model, which applies each word's weight, or importance, within every tweet.
tfidf_model = TfidfModel(cov_text_df['corpus'].to_list())
tfidf_model[cov_text_df.corpus[35]]
[(66, 0.012423809010707252), (94, 0.1338817964128143), (100, 0.13983597038555867), (125, 0.16603184971976692), (191, 0.0366615798460057), (200, 0.13215623253112269), (385, 0.13764709833941313), (431, 0.10504736264011318), (514, 0.1338817964128143), (518, 0.11763697092276443), (519, 0.16442432675660798), (533, 0.20787055195634596), (539, 0.19732557033072964), (540, 0.2583100574103834), (541, 0.17792498874417698), (542, 0.2365869448105144), (543, 0.19732557033072964), (544, 0.2583100574103834), (545, 0.2583100574103834), (546, 0.24435189549504926), (547, 0.14043469714728926), (548, 0.21486383221064542), (549, 0.14117213886112506), (550, 0.2365869448105144), (551, 0.13052074006425948), (552, 0.2021566259407788), (553, 0.11907813640808443), (554, 0.13160153085003132), (555, 0.10699154104827106), (556, 0.15871040074104084), (557, 0.14514451536505926), (558, 0.11336421039251729), (559, 0.16951764186212184), (560, 0.16772630707104322)]
These will be the inputs to our classification model. But there is a problem: tweets have variable lengths. We therefore need to expand each tweet's weights into a sparse vector of fixed length equal to the total number of words in the dictionary.
from gensim.matutils import corpus2csc

# Build a (vocabulary x tweets) sparse matrix from the TF-IDF corpus, densify it, and transpose to (tweets x vocabulary)
tfidf = pd.DataFrame(corpus2csc(tfidf_model[cov_text_df['corpus'].to_list()], num_terms=len(mydict.token2id)).toarray()).T
tfidf
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13015 | 13016 | 13017 | 13018 | 13019 | 13020 | 13021 | 13022 | 13023 | 13024 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.229625 | 0.293409 | 0.120685 | 0.2543 | 0.293409 | 0.208048 | 0.293409 | 0.293409 | 0.293409 | 0.132441 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
1 | 0.000000 | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
2 | 0.000000 | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
3 | 0.000000 | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
4 | 0.000000 | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3793 | 0.000000 | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
3794 | 0.000000 | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
3795 | 0.000000 | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
3796 | 0.000000 | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.261817 | 0.261817 | 0.261817 | 0.261817 | 0.261817 | 0.261817 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
3797 | 0.000000 | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.309055 | 0.309055 | 0.309055 | 0.309055 |
3798 rows × 13025 columns
Our data is now officially ready. Let's define a simple classification model.
from sklearn.tree import DecisionTreeClassifier
clf_decision_tfidf = DecisionTreeClassifier(random_state=2)
# Fit the model
clf_decision_tfidf.fit(tfidf, cov_text_df['Sentiment_hot'])
DecisionTreeClassifier(random_state=2)
We can now look inside the model in more detail and see which words were most important when making the classification.
importances = list(clf_decision_tfidf.feature_importances_)
feature_importances = [(feature, round(importance, 10)) for feature, importance in zip(mydict.token2id.keys(), importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
top_i = 0
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))
    if top_i == 20:
        break
    top_i += 1
Variable: covid                Importance: 0.0525923825
Variable: coronavirus          Importance: 0.0276501719
Variable: panic                Importance: 0.0192730414
Variable: food                 Importance: 0.0185708846
Variable: store                Importance: 0.0169241621
Variable: supermarket          Importance: 0.0167774608
Variable: grocery              Importance: 0.014298809
Variable: people               Importance: 0.0139979201
Variable: stock                Importance: 0.0138018322
Variable: like                 Importance: 0.0121993182
Variable: shopping             Importance: 0.0103365326
Variable: please               Importance: 0.0098437708
Variable: help                 Importance: 0.0095271886
Variable: online               Importance: 0.0092700455
Variable: price                Importance: 0.0091998153
Variable: go                   Importance: 0.0087511969
Variable: u                    Importance: 0.0078519802
Variable: buy                  Importance: 0.0072813202
Variable: safe                 Importance: 0.0072168044
Variable: stop                 Importance: 0.006921214
Variable: good                 Importance: 0.0066767934