This demo walks you through an example session using the widget and related visualizers provided in the jupyter
sub-module of Text Extensions for Pandas.
import os
import regex
import sys
import numpy as np
import pandas as pd
# And of course we need the text_extensions_for_pandas library itself.
try:
    import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
    # If we're running from within the project source tree and the parent Python
    # environment doesn't have the text_extensions_for_pandas package, use the
    # version in the local source tree.
    if not os.getcwd().endswith("notebooks"):
        raise e
    if ".." not in sys.path:
        sys.path.insert(0, "..")
    import text_extensions_for_pandas as tp
This demo uses the CoNLL-2003 dataset, a dataset for named entity recognition (also called named entity extraction). We will look at a token classification problem: analyzing the tokens, the building blocks of natural language, in this dataset so that we can process them and feed them into a machine learning algorithm. The dataset classifies entities into four categories: locations (LOC), persons (PER), organizations (ORG), and miscellaneous (MISC).
Our goal is to load some data from this dataset, do some basic processing and analysis, and make corrections where necessary.
We will use Text Extensions for Pandas to download and parse the CoNLL dataset into dataframes to work with.
# Download and cache the data set.
# NOTE: This data set is licensed for research use only. Be sure to adhere
# to the terms of the license when using this data set!
data_set_info = tp.io.conll.maybe_download_conll_data("outputs")
data_set_info
{'train': 'outputs/eng.train', 'dev': 'outputs/eng.testa', 'test': 'outputs/eng.testb'}
gold_standard = tp.io.conll.conll_2003_to_dataframes(
    data_set_info["test"], ["pos", "phrase", "ent"], [False, True, True])
gold_standard = [
    df.drop(columns=["pos", "phrase_iob", "phrase_type"])
    for df in gold_standard
]
Once we have our dataset downloaded and parsed, we can prepare our dataframe for visualization.
tokens = gold_standard[0]
tokens
| | span | ent_iob | ent_type | sentence | line_num |
---|---|---|---|---|---|
0 | [0, 10): '-DOCSTART-' | O | None | [0, 10): '-DOCSTART-' | 0 |
1 | [11, 17): 'SOCCER' | O | None | [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... | 2 |
2 | [17, 18): '-' | O | None | [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... | 3 |
3 | [19, 24): 'JAPAN' | B | LOC | [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... | 4 |
4 | [25, 28): 'GET' | O | None | [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... | 5 |
... | ... | ... | ... | ... | ... |
415 | [2178, 2182): 'each' | O | None | [2138, 2197): 'All four teams are level with o... | 437 |
416 | [2183, 2187): 'from' | O | None | [2138, 2197): 'All four teams are level with o... | 438 |
417 | [2188, 2191): 'one' | O | None | [2138, 2197): 'All four teams are level with o... | 439 |
418 | [2192, 2196): 'game' | O | None | [2138, 2197): 'All four teams are level with o... | 440 |
419 | [2196, 2197): '.' | O | None | [2138, 2197): 'All four teams are level with o... | 441 |
420 rows × 5 columns
entity_mentions = tp.io.conll.iob_to_spans(tokens)
entity_mentions.head()
| | span | ent_type |
---|---|---|
0 | [19, 24): 'JAPAN' | LOC |
1 | [40, 45): 'CHINA' | PER |
2 | [66, 77): 'Nadim Ladki' | PER |
3 | [78, 84): 'AL-AIN' | LOC |
4 | [86, 106): 'United Arab Emirates' | LOC |
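The conversion from per-token IOB tags to entity mentions can be sketched in plain Python. This is only an illustration of the idea, not the implementation behind `tp.io.conll.iob_to_spans`. The function name `iob_to_entity_spans` and the tuple layout are assumptions made for this sketch, and the token offsets for "United Arab Emirates" are inferred from the mention span [86, 106) rather than read from the table.

```python
from typing import List, Optional, Tuple

# (begin, end, iob_tag, entity_type) for a handful of tokens.
Token = Tuple[int, int, str, Optional[str]]
# (begin, end, entity_type) for one entity mention.
Mention = Tuple[int, int, Optional[str]]


def iob_to_entity_spans(tokens: List[Token]) -> List[Mention]:
    """Merge runs of B/I-tagged tokens into entity mentions (hypothetical helper)."""
    mentions: List[Mention] = []
    current = None  # [begin, end, entity_type] of the mention being built
    for begin, end, iob, ent_type in tokens:
        if iob == "B":
            # A "B" tag always starts a new mention; close any open one first.
            if current is not None:
                mentions.append((current[0], current[1], current[2]))
            current = [begin, end, ent_type]
        elif iob == "I" and current is not None:
            # An "I" tag extends the open mention to the end of this token.
            current[1] = end
        else:
            # An "O" tag (or a stray "I") closes any open mention.
            if current is not None:
                mentions.append((current[0], current[1], current[2]))
            current = None
    if current is not None:
        mentions.append((current[0], current[1], current[2]))
    return mentions


sample_tokens: List[Token] = [
    (19, 24, "B", "LOC"),    # JAPAN
    (25, 28, "O", None),     # GET
    (86, 92, "B", "LOC"),    # United   (offsets inferred)
    (93, 97, "I", "LOC"),    # Arab     (offsets inferred)
    (98, 106, "I", "LOC"),   # Emirates (offsets inferred)
]

sample_mentions = iob_to_entity_spans(sample_tokens)
print(sample_mentions)
```

The three tokens of "United Arab Emirates" collapse into a single mention covering characters 86 to 106, which matches row 4 of the table above.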
sentences = tokens["sentence"].unique()
entity_sentence_pairs = tp.spanner.contain_join(pd.Series(sentences), entity_mentions["span"], "sentence", "span")
entity_mentions = entity_mentions.merge(entity_sentence_pairs)
entity_mentions["sentence_id"] = entity_mentions["sentence"].array.begin
entity_mentions.head()
| | span | ent_type | sentence | sentence_id |
---|---|---|---|---|
0 | [19, 24): 'JAPAN' | LOC | [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... | 11 |
1 | [40, 45): 'CHINA' | PER | [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... | 11 |
2 | [66, 77): 'Nadim Ladki' | PER | [66, 77): 'Nadim Ladki' | 66 |
3 | [78, 84): 'AL-AIN' | LOC | [78, 117): 'AL-AIN, United Arab Emirates 1996-... | 78 |
4 | [86, 106): 'United Arab Emirates' | LOC | [78, 117): 'AL-AIN, United Arab Emirates 1996-... | 78 |
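A containment join pairs each outer span with every inner span that lies entirely within it. The following is a minimal sketch of that idea on plain `(begin, end)` tuples; it is not how `tp.spanner.contain_join` is implemented, and the name `contain_join_sketch` is hypothetical. The offsets are copied from the tables above.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end), end exclusive


def contain_join_sketch(outer: List[Span], inner: List[Span]) -> List[Tuple[Span, Span]]:
    """Return (outer, inner) pairs where the outer span fully contains the inner one."""
    return [
        (o, i)
        for o in outer
        for i in inner
        if o[0] <= i[0] and i[1] <= o[1]
    ]


sentence_spans = [(11, 65), (66, 77), (78, 117)]
mention_spans = [(19, 24), (40, 45), (66, 77), (78, 84), (86, 106)]

pairs = contain_join_sketch(sentence_spans, mention_spans)

# As in the notebook, use the begin offset of the sentence as a sentence ID.
sentence_ids = [sentence[0] for sentence, _ in pairs]
print(pairs)
print(sentence_ids)
```

This nested loop is quadratic in the number of spans, which is fine for a single document but is only meant to show the containment test itself.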
We can take a closer look at what the span column might look like in context by viewing the column alone as the SpanArray datatype.
entity_mentions["sentence"].unique()
-DOCSTART-
SOCCER- JAPAN GET LUCKY WIN, CHINA IN SURPRISE DEFEAT.
Nadim Ladki
AL-AIN, United Arab Emirates 1996-12-06
Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday.
But China saw their luck desert them in the second match of the group, crashing to a surprise 2-0 defeat to newcomers Uzbekistan.
China controlled most of the match and saw several chances missed until the 78th minute when Uzbek striker Igor Shkvyrin took advantage of a misdirected defensive header to lob the ball over the advancing Chinese keeper and into an empty net.
Oleg Shatskiku made sure of the win in injury time, hitting an unstoppable left foot shot from just outside the area.
The former Soviet republic was playing in an Asian Cup finals tie for the first time.
Despite winning the Asian Games title two years ago, Uzbekistan are in the finals as outsiders.
Two goals from defensive errors in the last six minutes allowed Japan to come from behind and collect all three points from their opening meeting against Syria.
Takuya Takagi scored the winner in the 88th minute, rising to head a Hiroshige Yanagimoto cross towards the Syrian goal which goalkeeper Salem Bitar appeared to have covered but then allowed to slip into the net.
It was the second costly blunder by Syria in four minutes.
Defender Hassan Abbas rose to intercept a long ball into the area in the 84th minute but only managed to divert it into the top corner of Bitar's goal.
Nader Jokhadar had given Syria the lead with a well-struck header in the seventh minute.
Japan then laid siege to the Syrian penalty area for most of the game but rarely breached the Syrian defence.
Bitar pulled off fine saves whenever they did.
Japan coach Shu Kamo said: ' ' The Syrian own goal proved lucky for us.
The Syrians scored early and then played defensively and adopted long balls which made it hard for us. '
'
Japan, co-hosts of the World Cup in 2002 and ranked 20th in the world by FIFA, are favourites to regain their title here.
Hosts UAE play Kuwait and South Korea take on Indonesia on Saturday in Group A matches.
All four teams are level with one point each from one game.
We don't really want to visualize every column in our dataframe as we're only interested in viewing the entity classifications. The next step is to drop any columns we don't care about.
Now that our data is prepared for analysis, we can load it up in our widget.
widget = tp.jupyter.DataFrameWidget(entity_mentions.drop(columns=["sentence"]))
widget.display()
Output(_dom_classes=('tep--dfwidget--output',))
If we want to view this widget interactively, we can pass in the additional parameter interactive_columns with a list of the column names that should become interactive widgets.
One thing you may notice in the above widgets is that the column ent_type is editable via a text box. This is fine, but there is a more appropriate way to interact with categorical data.
categorical = pd.Categorical(entity_mentions["ent_type"], categories=["PER", "LOC", "ORG", "MISC"])
entity_mentions["ent_type"] = categorical
tp.jupyter.DataFrameWidget(entity_mentions.drop(columns=["sentence", "sentence_id"]), interactive_columns=["ent_type"]).display()
Output(_dom_classes=('tep--dfwidget--output',))
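The effect of the `pd.Categorical` conversion can be seen on a small example. A categorical column holds a fixed set of allowed values, which is presumably what allows an editor to offer a closed list of choices instead of free text. The label "GPE" below is a hypothetical value that is not one of the four CoNLL-2003 entity types; it is included only to show what happens to a value outside the declared categories.

```python
import pandas as pd

allowed_types = ["PER", "LOC", "ORG", "MISC"]

# "GPE" is deliberately not in allowed_types (hypothetical label).
labels = pd.Categorical(["LOC", "PER", "GPE"], categories=allowed_types)

# Each value is stored as an integer code into the category list;
# values outside the list get the code -1 and read back as missing.
codes = labels.codes.tolist()
missing = pd.Series(labels).isna().tolist()

print(list(labels.categories))
print(codes)
print(missing)
```

Because out-of-vocabulary values silently become missing, it is worth checking for missing values after converting a free-text column this way.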
corrected_entities = entity_mentions.copy(True)
new_types = corrected_entities["ent_type"].copy()
new_types[widget.selected] = "ORG"
corrected_entities["new_type"] = new_types
corrected_entities
| | span | ent_type | sentence | sentence_id | new_type |
---|---|---|---|---|---|
0 | [19, 24): 'JAPAN' | LOC | [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... | 11 | LOC |
1 | [40, 45): 'CHINA' | PER | [11, 65): 'SOCCER- JAPAN GET LUCKY WIN, CHINA ... | 11 | PER |
2 | [66, 77): 'Nadim Ladki' | PER | [66, 77): 'Nadim Ladki' | 66 | PER |
3 | [78, 84): 'AL-AIN' | LOC | [78, 117): 'AL-AIN, United Arab Emirates 1996-... | 78 | LOC |
4 | [86, 106): 'United Arab Emirates' | LOC | [78, 117): 'AL-AIN, United Arab Emirates 1996-... | 78 | LOC |
5 | [118, 123): 'Japan' | LOC | [118, 244): 'Japan began the defence of their ... | 118 | LOC |
6 | [151, 160): 'Asian Cup' | MISC | [118, 244): 'Japan began the defence of their ... | 118 | MISC |
7 | [196, 201): 'Syria' | LOC | [118, 244): 'Japan began the defence of their ... | 118 | LOC |
8 | [249, 254): 'China' | LOC | [245, 374): 'But China saw their luck desert t... | 245 | LOC |
9 | [363, 373): 'Uzbekistan' | LOC | [245, 374): 'But China saw their luck desert t... | 245 | LOC |
10 | [375, 380): 'China' | LOC | [375, 617): 'China controlled most of the matc... | 375 | LOC |
11 | [468, 473): 'Uzbek' | MISC | [375, 617): 'China controlled most of the matc... | 375 | MISC |
12 | [482, 495): 'Igor Shkvyrin' | PER | [375, 617): 'China controlled most of the matc... | 375 | PER |
13 | [580, 587): 'Chinese' | MISC | [375, 617): 'China controlled most of the matc... | 375 | MISC |
14 | [618, 632): 'Oleg Shatskiku' | PER | [618, 735): 'Oleg Shatskiku made sure of the w... | 618 | PER |
15 | [747, 753): 'Soviet' | MISC | [736, 821): 'The former Soviet republic was pl... | 736 | MISC |
16 | [781, 790): 'Asian Cup' | MISC | [736, 821): 'The former Soviet republic was pl... | 736 | MISC |
17 | [842, 853): 'Asian Games' | MISC | [822, 917): 'Despite winning the Asian Games t... | 822 | MISC |
18 | [875, 885): 'Uzbekistan' | LOC | [822, 917): 'Despite winning the Asian Games t... | 822 | LOC |
19 | [982, 987): 'Japan' | LOC | [918, 1078): 'Two goals from defensive errors ... | 918 | LOC |
20 | [1072, 1077): 'Syria' | LOC | [918, 1078): 'Two goals from defensive errors ... | 918 | LOC |
21 | [1079, 1092): 'Takuya Takagi' | PER | [1079, 1291): 'Takuya Takagi scored the winner... | 1079 | PER |
22 | [1148, 1168): 'Hiroshige Yanagimoto' | PER | [1079, 1291): 'Takuya Takagi scored the winner... | 1079 | PER |
23 | [1187, 1193): 'Syrian' | MISC | [1079, 1291): 'Takuya Takagi scored the winner... | 1079 | MISC |
24 | [1216, 1227): 'Salem Bitar' | PER | [1079, 1291): 'Takuya Takagi scored the winner... | 1079 | PER |
25 | [1328, 1333): 'Syria' | LOC | [1292, 1350): 'It was the second costly blunde... | 1292 | LOC |
26 | [1360, 1372): 'Hassan Abbas' | PER | [1351, 1502): 'Defender Hassan Abbas rose to i... | 1351 | PER |
27 | [1489, 1494): 'Bitar' | PER | [1351, 1502): 'Defender Hassan Abbas rose to i... | 1351 | PER |
28 | [1503, 1517): 'Nader Jokhadar' | PER | [1503, 1591): 'Nader Jokhadar had given Syria ... | 1503 | PER |
29 | [1528, 1533): 'Syria' | LOC | [1503, 1591): 'Nader Jokhadar had given Syria ... | 1503 | LOC |
30 | [1592, 1597): 'Japan' | LOC | [1592, 1701): 'Japan then laid siege to the Sy... | 1592 | LOC |
31 | [1621, 1627): 'Syrian' | MISC | [1592, 1701): 'Japan then laid siege to the Sy... | 1592 | MISC |
32 | [1686, 1692): 'Syrian' | MISC | [1592, 1701): 'Japan then laid siege to the Sy... | 1592 | MISC |
33 | [1702, 1707): 'Bitar' | PER | [1702, 1748): 'Bitar pulled off fine saves whe... | 1702 | PER |
34 | [1749, 1754): 'Japan' | LOC | [1749, 1820): 'Japan coach Shu Kamo said: ' ' ... | 1749 | LOC |
35 | [1761, 1769): 'Shu Kamo' | PER | [1749, 1820): 'Japan coach Shu Kamo said: ' ' ... | 1749 | PER |
36 | [1784, 1790): 'Syrian' | MISC | [1749, 1820): 'Japan coach Shu Kamo said: ' ' ... | 1749 | MISC |
37 | [1825, 1832): 'Syrians' | MISC | [1821, 1925): 'The Syrians scored early and th... | 1821 | MISC |
38 | [1928, 1933): 'Japan' | LOC | [1928, 2049): 'Japan, co-hosts of the World Cu... | 1928 | LOC |
39 | [1951, 1960): 'World Cup' | MISC | [1928, 2049): 'Japan, co-hosts of the World Cu... | 1928 | MISC |
40 | [2001, 2005): 'FIFA' | ORG | [1928, 2049): 'Japan, co-hosts of the World Cu... | 1928 | ORG |
41 | [2056, 2059): 'UAE' | LOC | [2050, 2137): 'Hosts UAE play Kuwait and South... | 2050 | LOC |
42 | [2065, 2071): 'Kuwait' | LOC | [2050, 2137): 'Hosts UAE play Kuwait and South... | 2050 | LOC |
43 | [2076, 2087): 'South Korea' | LOC | [2050, 2137): 'Hosts UAE play Kuwait and South... | 2050 | LOC |
44 | [2096, 2105): 'Indonesia' | LOC | [2050, 2137): 'Hosts UAE play Kuwait and South... | 2050 | LOC |
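The correction step above copies the label column, overwrites the selected rows, and stores the result in a new column so that the original labels stay available for comparison. The sketch below reproduces that pattern in plain pandas. It assumes the selection is a boolean mask aligned with the rows; what `widget.selected` actually returns is not verified here, and the `selected` mask and the `text` column are hypothetical stand-ins. Relabeling the CHINA row as LOC is purely an illustration.

```python
import pandas as pd

allowed_types = ["PER", "LOC", "ORG", "MISC"]

mentions = pd.DataFrame({
    "text": ["JAPAN", "CHINA", "Nadim Ladki"],
    "ent_type": pd.Categorical(["LOC", "PER", "PER"], categories=allowed_types),
})

# Hypothetical stand-in for a row selection: True marks the rows to correct.
selected = pd.Series([False, True, False], index=mentions.index)

# Work on a copy so the original labels are left untouched.
new_types = mentions["ent_type"].copy()
new_types.loc[selected] = "LOC"  # must be one of the declared categories
mentions["new_type"] = new_types

# Rows where the corrected label differs from the original one.
changed = mentions[mentions["ent_type"].astype(str) != mentions["new_type"].astype(str)]
print(changed)
```

Keeping both columns makes it easy to list exactly which rows were corrected, as the final filter shows.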