
`TextBlob`: another module for NLP tasks (`NLTK` + `pattern`)

textblob is a text-processing library for Python that supports Natural Language Processing tasks such as morphological analysis, entity extraction, sentiment analysis, machine translation, etc.

It is built on top of two other libraries you already know, NLTK and pattern, and its main advantage is that it lets you combine both tools through a simpler interface.

We will follow this tutorial to learn how to use some of its features. The first step is to import the TextBlob object, which gives us access to all the tools it includes.

In [1]:

from textblob import TextBlob

Let's create our first textblob through the TextBlob object. Think of textblobs as a kind of Python string, analyzed and enriched with some extra features.

In [2]:

texto = '''In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles and San Francisco counties make an important point about the lightly regulated sharing economy. The consumers who participate deserve a  very clear picture of the risks they're taking'''
t = TextBlob(texto)

In [3]:

print(t.sentences)

[Sentence("In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles and San Francisco counties make an important point about the lightly regulated sharing economy."), Sentence("The consumers who participate deserve a  very clear picture of the risks they're taking")]

Processing sentences, words, and entities

We can segment our sample text into sentences and words simply by accessing the .sentences and .words properties. Let's print them:

In [4]:

# print the sentences
for sentence in t.sentences:
    print(sentence)
    print("--------------")

# and the words
print(t.words)
print(texto.split())

In new lawsuits brought against the ride-sharing companies Uber and Lyft, the top prosecutors in Los Angeles and San Francisco counties make an important point about the lightly regulated sharing economy.
--------------
The consumers who participate deserve a  very clear picture of the risks they're taking
--------------
['In', 'new', 'lawsuits', 'brought', 'against', 'the', 'ride-sharing', 'companies', 'Uber', 'and', 'Lyft', 'the', 'top', 'prosecutors', 'in', 'Los', 'Angeles', 'and', 'San', 'Francisco', 'counties', 'make', 'an', 'important', 'point', 'about', 'the', 'lightly', 'regulated', 'sharing', 'economy', 'The', 'consumers', 'who', 'participate', 'deserve', 'a', 'very', 'clear', 'picture', 'of', 'the', 'risks', 'they', "'re", 'taking']
['In', 'new', 'lawsuits', 'brought', 'against', 'the', 'ride-sharing', 'companies', 'Uber', 'and', 'Lyft,', 'the', 'top', 'prosecutors', 'in', 'Los', 'Angeles', 'and', 'San', 'Francisco', 'counties', 'make', 'an', 'important', 'point', 'about', 'the', 'lightly', 'regulated', 'sharing', 'economy.', 'The', 'consumers', 'who', 'participate', 'deserve', 'a', 'very', 'clear', 'picture', 'of', 'the', 'risks', "they're", 'taking']
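
The two lists above differ in how they treat punctuation and contractions: `.words` drops the comma after `Lyft` and splits `they're` into `they` and `'re`, whereas `str.split()` only cuts on whitespace. The following sketch imitates that behaviour with a regular expression on a shortened sentence. It is a rough approximation written for illustration, not the tokenizer TextBlob actually uses; the pattern and the `simple_tokenize` name are assumptions.

```python
import re

# Assumed pattern: either a clitic such as 're, or a word that may
# contain internal hyphens (ride-sharing). Punctuation is skipped.
TOKEN_RE = re.compile(r"'\w+|\w+(?:-\w+)*")


def simple_tokenize(text):
    """Hypothetical helper: return word-like tokens, dropping punctuation."""
    return TOKEN_RE.findall(text)


sample = "Uber and Lyft, the ride-sharing companies, know the risks they're taking."

# Whitespace splitting keeps punctuation glued to the words.
print(sample.split())

# The regex-based version separates the clitic and discards punctuation.
print(simple_tokenize(sample))
```

In the first list `Lyft,` and `they're` survive as single items; in the second they appear as `Lyft`, `they` and `'re`.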

The .noun_phrases property gives us access to the list of entities (strictly speaking, noun phrases) contained in our textblob. This is how it works.

In [5]:

print("the sample text contains", len(t.noun_phrases), "entities")
for element in t.noun_phrases:
    print("-", element)

the sample text contains 8 entities
- new lawsuits
- uber
- lyft
- top prosecutors
- los angeles
- san francisco
- important point
- clear picture

In [6]:

# playing with lemmas, singulars and plurals
for word in t.words:
    if word.endswith("s"):
        print(word.lemmatize(), word, word.singularize())
    else:
        print(word.lemmatize(), word, word.pluralize())

In In Ins
new new news
lawsuit lawsuits lawsuit
brought brought broughts
against against againsts
the the thes
ride-sharing ride-sharing ride-sharings
company companies company
Uber Uber Ubers
and and ands
Lyft Lyft Lyfts
the the thes
top top tops
prosecutor prosecutors prosecutor
in in ins
Los Los Lo
Angeles Angeles Angele
and and ands
San San Sans
Francisco Francisco Franciscoes
county counties county
make make makes
an an some
important important importants
point point points
about about abouts
the the thes
lightly lightly lightlies
regulated regulated regulateds
sharing sharing sharings
economy economy economies
The The Thes
consumer consumers consumer
who who whoes
participate participate participates
deserve deserve deserves
a a some
very very veries
clear clear clears
picture picture pictures
of of ofs
the the thes
risk risks risk
they they they
're 're 'res
taking taking takings

In [7]:

# how can we make lemmatization smarter?
for word, tag in t.tags:
    # lemmatize nouns only
    if tag == "NN":
        print(word, word.lemmatize(), word.pluralize())
    elif tag == "NNS":
        print(word, word.lemmatize(), word.singularize())

    # and verb forms
    if tag.startswith("VB"):
        print(word, word.lemmatize("v"))

lawsuits lawsuit lawsuit
brought bring
companies company company
prosecutors prosecutor prosecutor
counties county county
make make
point point points
regulated regulate
sharing share
economy economy economies
consumers consumer consumer
participate participate
deserve deserve
picture picture pictures
risks risk risk
re re res
taking take
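
The cell above branches on Penn Treebank tags by hand. One way to generalise it, sketched here, is a small mapping from the tag prefix to the one-letter part-of-speech codes used by WordNet (`n`, `v`, `a`, `r`). The helper name `penn_to_wordnet` and the choice to return `None` for any other tag are assumptions made for illustration; the sample tags are taken from the text used in this notebook.

```python
def penn_to_wordnet(tag):
    """Hypothetical helper: map a Penn Treebank tag to a WordNet POS code."""
    if tag.startswith("NN"):
        return "n"  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return None     # determiners, prepositions, etc. are left alone


tagged = [
    ("lawsuits", "NNS"),
    ("brought", "VBN"),
    ("important", "JJ"),
    ("lightly", "RB"),
    ("the", "DT"),
]

for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

The returned code could then be passed as the argument of `lemmatize`, in the same way as the `"v"` used for verb forms, skipping the words for which the helper returns `None`.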

Syntactic parsing

Although other parsers can be used, by default the .parse() method calls the morphosyntactic parser from the pattern.en module, which you already know.

In [8]:

# syntactic parsing: does this ring a bell from pattern?
print(t.parse())

In/IN/B-PP/B-PNP new/JJ/B-NP/I-PNP lawsuits/NNS/I-NP/I-PNP brought/VBN/B-VP/I-PNP against/IN/B-PP/B-PNP the/DT/B-NP/I-PNP ride-sharing/JJ/I-NP/I-PNP companies/NNS/I-NP/I-PNP Uber/NNP/I-NP/I-PNP and/CC/O/O Lyft/NNP/B-NP/O ,/,/O/O the/DT/B-NP/O top/JJ/I-NP/O prosecutors/NNS/I-NP/O in/IN/B-PP/B-PNP Los/NNP/B-NP/I-PNP Angeles/NNP/I-NP/I-PNP and/CC/I-NP/I-PNP San/NNP/I-NP/I-PNP Francisco/NNP/I-NP/I-PNP counties/NNS/I-NP/I-PNP make/VB/B-VP/O an/DT/B-NP/O important/JJ/I-NP/O point/NN/I-NP/O about/IN/B-PP/O the/DT/O/O lightly/RB/B-VP/O regulated/VBN/I-VP/O sharing/VBG/I-VP/O economy/NN/B-NP/O ././O/O
The/DT/B-NP/O consumers/NNS/I-NP/O who/WP/O/O participate/VB/B-VP/O deserve/VBP/I-VP/O a/DT/B-NP/O very/RB/I-NP/O clear/JJ/I-NP/O picture/NN/I-NP/O of/IN/B-PP/B-PNP the/DT/B-NP/I-PNP risks/NNS/I-NP/I-PNP they/PRP/I-NP/I-PNP '/POS/O/O re/NN/B-NP/O taking/VBG/B-VP/O
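
Each token in the output above is written as `word/POS/chunk/PNP`: the word, its part-of-speech tag, its chunk tag and its prepositional noun phrase tag, separated by slashes. As a minimal sketch, the string can be split back into tuples with plain Python. The `read_parse` helper is hypothetical, and the sample string is assembled from a few tokens of that output.

```python
def read_parse(parsed):
    """Hypothetical helper: turn 'word/POS/chunk/PNP' tokens into tuples."""
    rows = []
    for line in parsed.splitlines():      # one line per sentence
        for token in line.split():        # tokens are space-separated
            # Split from the right so the three tags are always isolated.
            word, pos, chunk, pnp = token.rsplit("/", 3)
            rows.append((word, pos, chunk, pnp))
    return rows


sample = (
    "In/IN/B-PP/B-PNP new/JJ/B-NP/I-PNP lawsuits/NNS/I-NP/I-PNP "
    "brought/VBN/B-VP/I-PNP ,/,/O/O"
)

for row in read_parse(sample):
    print(row)

# Example query: keep only the words tagged as nouns.
nouns = [word for word, pos, _, _ in read_parse(sample) if pos.startswith("NN")]
print(nouns)
```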

Machine translation

From any text processed with TextBlob, we can access a fairly good machine translator through the .translate method. Note how we use it: the target language is mandatory, while the source language can be detected from the input text.

In [9]:

# from Chinese to English and Spanish
oracion_zh = u"中国探月工程 亦稱嫦娥工程，是中国启动的第一个探月工程，于2003年3月1日正式启动"
t_zh = TextBlob(oracion_zh)
print(t_zh.translate(from_lang="zh-CN", to="en"))
print(t_zh.translate(from_lang="zh-CN", to="es"))

print("--------------")
t_es = TextBlob(u"La deuda pública ha marcado nuevos récords en España en el tercer trimestre")
print(t_es.translate(to="el"))
print(t_es.translate(to="ru"))
print(t_es.translate(to="eu"))
print(t_es.translate(to="fi"))
print(t_es.translate(to="fr"))
print(t_es.translate(to="nl"))
print(t_es.translate(to="gl"))
print(t_es.translate(to="ca"))
print(t_es.translate(to="zh"))
print(t_es.translate(to="la"))

# it does not work as well with slang
print("--------------")
t_ita = TextBlob("Sono andato a Milano e mi sono divertito un bordello.")
print(t_ita.translate(to="en"))
print(t_ita.translate(to="es"))

Chinese Lunar Exploration Program , also known as Chang E project is the start of the first Chinese lunar exploration program , on March 1, 2003 officially launched
Programa de Exploración Lunar chino , también conocido como proyecto Chang E es el inicio del primer programa de exploración lunar chino, el 01 de marzo 2003 lanzó oficialmente
--------------
Το δημόσιο χρέος έχει θέσει νέα ρεκόρ στην Ισπανία κατά το τρίτο τρίμηνο
Государственный долг установить новые рекорды в Испании в третьем квартале
Zor publikoa Espainian erregistro berriak ezarri ditu , hirugarren hiruhilekoan
Julkinen velka on asettanut uusia ennätyksiä Espanjassa kolmannella neljänneksellä
La dette publique a établi de nouveaux records en Espagne au troisième trimestre
De overheidsschuld is nieuwe records in Spanje in het derde kwartaal
A débeda pública estableceu novos récords en España no terceiro trimestre
El deute públic ha marcat nous rècords a Espanya en el tercer trimestre
公共债务已在第三季度设定在西班牙的新纪录
Constituit novam in Hispania gestis publice in tertia quartam
--------------
I went to Milan and I enjoyed it a brothel .
Fui a Milán y me gustó mucho un burdel .

WordNet

textblob, or more precisely any object of the Word class, gives us access to the information in WordNet.

In [10]:

# WordNet
from textblob import Word
from textblob.wordnet import VERB

# how many synsets does "car" have?
word = Word("car")
print(word.synsets)

# give me the synsets of the word "hack" as a verb
print(Word("hack").get_synsets(pos=VERB))

# print the list of definitions of "car"
print(Word("car").definitions)

# walk up the hypernym hierarchy
for s in word.synsets:
    print(s.hypernym_paths())

[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]
[Synset('chop.v.05'), Synset('hack.v.02'), Synset('hack.v.03'), Synset('hack.v.04'), Synset('hack.v.05'), Synset('hack.v.06'), Synset('hack.v.07'), Synset('hack.v.08')]
['a motor vehicle with four wheels; usually propelled by an internal combustion engine', 'a wheeled vehicle adapted to the rails of railroad', 'the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant', 'where passengers ride up and down', 'a conveyance for passengers or freight on a cable railway']
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('instrumentality.n.03'), Synset('container.n.01'), Synset('wheeled_vehicle.n.01'), Synset('self-propelled_vehicle.n.01'), Synset('motor_vehicle.n.01'), Synset('car.n.01')], [Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('instrumentality.n.03'), Synset('conveyance.n.03'), Synset('vehicle.n.01'), Synset('wheeled_vehicle.n.01'), Synset('self-propelled_vehicle.n.01'), Synset('motor_vehicle.n.01'), Synset('car.n.01')]]
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('instrumentality.n.03'), Synset('container.n.01'), Synset('wheeled_vehicle.n.01'), Synset('car.n.02')], [Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('instrumentality.n.03'), Synset('conveyance.n.03'), Synset('vehicle.n.01'), Synset('wheeled_vehicle.n.01'), Synset('car.n.02')]]
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('structure.n.01'), Synset('area.n.05'), Synset('room.n.01'), Synset('compartment.n.02'), Synset('car.n.03')]]
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('structure.n.01'), Synset('area.n.05'), Synset('room.n.01'), Synset('compartment.n.02'), Synset('car.n.04')]]
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('structure.n.01'), Synset('area.n.05'), Synset('room.n.01'), Synset('compartment.n.02'), Synset('cable_car.n.01')]]
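
Each hypernym path printed above runs from the root `entity.n.01` down to the synset itself, so two senses can be compared by looking at how much of their paths they share. The sketch below does this with plain lists of synset names copied from the first path of `car.n.01` and the first path of `car.n.02`. The `shared_hypernyms` helper is hypothetical and works on strings rather than on real `Synset` objects.

```python
def shared_hypernyms(path_a, path_b):
    """Hypothetical helper: common prefix of two root-to-synset paths."""
    shared = []
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared.append(a)
    return shared


# Synset names copied from the first hypernym path of each sense.
car_n_01 = [
    "entity.n.01", "physical_entity.n.01", "object.n.01", "whole.n.02",
    "artifact.n.01", "instrumentality.n.03", "container.n.01",
    "wheeled_vehicle.n.01", "self-propelled_vehicle.n.01",
    "motor_vehicle.n.01", "car.n.01",
]
car_n_02 = [
    "entity.n.01", "physical_entity.n.01", "object.n.01", "whole.n.02",
    "artifact.n.01", "instrumentality.n.03", "container.n.01",
    "wheeled_vehicle.n.01", "car.n.02",
]

common = shared_hypernyms(car_n_01, car_n_02)
print(common[-1])   # most specific shared ancestor: wheeled_vehicle.n.01
print(len(common))  # number of shared nodes: 8
```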

Sentiment analysis

In [11]:

# sentiment analysis
opinion1 = TextBlob("This new restaurant is great. I had so much fun!! :-P")
print(opinion1.sentiment)

opinion2 = TextBlob("Google News to close in Spain.")
print(opinion2.sentiment)


print(opinion1.sentiment.polarity)

if opinion1.sentiment.subjectivity > 0.5:
    print("Hey, this is an opinion")

Sentiment(polarity=0.5387784090909091, subjectivity=0.6011363636363636)
Sentiment(polarity=0.0, subjectivity=0.0)
0.5387784090909091
Hey, this is an opinion
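
The `sentiment` property returns a pair in which polarity ranges from -1.0 (negative) to 1.0 (positive) and subjectivity from 0.0 (objective) to 1.0 (subjective). A rough sketch of how the two scores might be turned into a label follows. The `Sentiment` tuple here is a local stand-in built with `namedtuple`, the `label` helper is hypothetical, and the 0.5 and 0.1 thresholds are arbitrary assumptions; the scores are the ones printed above.

```python
from collections import namedtuple

# Local stand-in with the same two fields as the tuple printed above.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])


def label(sentiment, min_subjectivity=0.5, min_polarity=0.1):
    """Hypothetical helper: map (polarity, subjectivity) to a coarse label."""
    # Texts that are not subjective enough are not treated as opinions.
    if sentiment.subjectivity <= min_subjectivity:
        return "not an opinion"
    if sentiment.polarity >= min_polarity:
        return "positive opinion"
    if sentiment.polarity <= -min_polarity:
        return "negative opinion"
    return "neutral opinion"


# Scores copied from the two examples above.
print(label(Sentiment(0.5387784090909091, 0.6011363636363636)))
print(label(Sentiment(0.0, 0.0)))
```

With these thresholds the restaurant review is labelled a positive opinion and the news headline is not treated as an opinion at all.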

Other curiosities

In [12]:

# spelling correction
b1 = TextBlob("I havv goood speling!")
print(b1.correct())

b2 = TextBlob("Mi naem iz Jonh!")
print(b2.correct())

b3 = TextBlob("Boyz dont cri")
print(b3.correct())

b4 = TextBlob("psychological posesion achivemen comitment")
print(b4.correct())

I have good spelling!
I name in On!
Boy dont cry
psychological position achievement commitment
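
TextBlob's documentation describes `correct()` as based on Peter Norvig's spelling corrector. The sketch below illustrates the core idea of that approach, generating every string one edit away from the input and keeping the most frequent known word, using a tiny made-up vocabulary. `WORD_COUNTS`, its counts and the helper names are assumptions for illustration only; the real corrector relies on a much larger frequency list and may differ in its details.

```python
import string

# Toy frequency list (assumed values, for illustration only).
WORD_COUNTS = {"have": 50, "good": 40, "spelling": 5, "cry": 8}


def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, else the input."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing known nearby: leave the word untouched
    return max(candidates, key=WORD_COUNTS.get)


for w in ["havv", "goood", "speling", "cri", "dont"]:
    print(w, "->", correct(w))
```

A word such as `dont` has no vocabulary entry within one edit here, so the sketch leaves it unchanged, which is the same kind of limitation visible in the `Boy dont cry` output above.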

To infinity and beyond

This brief overview only covers what TextBlob offers by default. If you need to customize the tools, take a look at the advanced documentation.
