An n-gram is a contiguous sequence of n items (typically words or characters) from a given text or speech. In the context of a text corpus, n-grams are used to analyze patterns of word usage and co-occurrence by grouping items into chunks of size n. For example, a 1-gram (unigram) considers individual words, while a 2-gram (bigram) examines pairs of consecutive words. N-grams are particularly useful in natural language processing (NLP) for tasks like text prediction, language modeling, and understanding the structure and context of a large corpus. This notebook shows how to create n-grams within the Text-Fabric environment.
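As a minimal illustration in plain Python (independent of Text-Fabric, with hypothetical example tokens), bigrams can be produced from a list of tokens by slicing:
# Minimal sketch: extracting bigrams (n = 2) from a list of tokens
tokens = ['In', 'the', 'beginning', 'was', 'the', 'Word']
n = 2
bigrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
print(bigrams)
# [('In', 'the'), ('the', 'beginning'), ('beginning', 'was'), ('was', 'the'), ('the', 'Word')]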
%load_ext autoreload
%autoreload 2
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use
# load the N1904 app and data
N1904 = use("CenterBLC/N1904", version="1.0.0", hoist=globals())
Locating corpus resources ...
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 27 | 5102.93 | 100 |
chapter | 260 | 529.92 | 100 |
verse | 7944 | 17.34 | 100 |
sentence | 8011 | 17.20 | 100 |
group | 8945 | 7.01 | 46 |
clause | 42506 | 8.36 | 258 |
wg | 106868 | 6.88 | 533 |
phrase | 69007 | 1.90 | 95 |
subphrase | 116178 | 1.60 | 135 |
word | 137779 | 1.00 | 100 |
Display is setup for viewtype syntax-view
See here for more information on viewtypes
# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)
N1904.dh(N1904.getCss())
We'll extract n-grams of words along with their POS tags.
In the following script we rely upon the following Text-Fabric features:
otype: to select all chapter nodes and their word descendants
unicode: the Greek word in Unicode
sp: the part of speech of the word
lemma: the lemma of the word
morph: the morphological code of the word
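As a quick sanity check, these features can be inspected on a single word node (a minimal sketch, assuming the corpus has been loaded with use(...) as above):
# Inspect the relevant features on the first word node of the corpus
firstWord = F.otype.s('word')[0]   # first word node
print(F.unicode.v(firstWord))      # the Greek word in Unicode
print(F.sp.v(firstWord))           # its part-of-speech tag
print(F.lemma.v(firstWord))        # its lemma
print(F.morph.v(firstWord))        # its morphological code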
Set the size of the n-gram you wish to extract: for example, n = 2 for bigrams or n = 3 for trigrams. A lower n (e.g., 1 or 2) provides a more fine-grained analysis, focusing on smaller units, which is useful for basic statistics or initial exploration. A higher n (e.g., 3 or more) captures more structured relationships, which is useful for more complex linguistic or computational tasks.
# Setting the size of the n-gram
n = 3 # Trigrams; change to 2 for bigrams, 4 for 4-grams, etc.
The following script first retrieves all chapter nodes with F.otype.s('chapter') and gets their word descendants using L.d(chapterNode, 'word'). The extractNGrams(words, n) function generates n-grams from these word lists; extracting per chapter ensures that no n-gram crosses a chapter boundary. The features unicode, sp, lemma, and morph are used to build each n-gram item. Each n-gram is stored as a list of word dictionaries, and the first five are printed for verification.
# Function to extract n-grams from a list of words
def extractNGrams(words, n):
return [words[i:i+n] for i in range(len(words) - n + 1)]
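# Example: extractNGrams(['a', 'b', 'c', 'd'], 2) -> [['a', 'b'], ['b', 'c'], ['c', 'd']]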
# Collect all n-grams in a list
allNGrams = []
# Iterate over all chapters in the New Testament
for chapterNode in F.otype.s('chapter'):
wordsInChapter = L.d(chapterNode, 'word')
nGramsInChapter = extractNGrams(wordsInChapter, n)
for nGram in nGramsInChapter:
nGramData = []
for wordNode in nGram:
wordText = F.unicode.v(wordNode) # Greek word in Unicode
posTag = F.sp.v(wordNode) # Part of Speech
lemma = F.lemma.v(wordNode) # Lemma of the word
morph = F.morph.v(wordNode) # Morphological code
# Collect data for each word in the n-gram
nGramData.append({
'wordText': wordText,
'posTag': posTag,
'lemma': lemma,
'morph': morph
})
# Add the n-gram data to the list
allNGrams.append(nGramData)
# verification - Print the first 5 n-grams
for nGramData in allNGrams[:5]:
words = [wordData['wordText'] for wordData in nGramData]
posTags = [wordData['posTag'] for wordData in nGramData]
print(f"Words: {' '.join(words)}")
print(f"POS Tags: {posTags}")
print('-' * 50)
Words: Βίβλος γενέσεως Ἰησοῦ
POS Tags: ['subs', 'subs', 'subs']
--------------------------------------------------
Words: γενέσεως Ἰησοῦ Χριστοῦ
POS Tags: ['subs', 'subs', 'subs']
--------------------------------------------------
Words: Ἰησοῦ Χριστοῦ υἱοῦ
POS Tags: ['subs', 'subs', 'subs']
--------------------------------------------------
Words: Χριστοῦ υἱοῦ Δαυεὶδ
POS Tags: ['subs', 'subs', 'subs']
--------------------------------------------------
Words: υἱοῦ Δαυεὶδ υἱοῦ
POS Tags: ['subs', 'subs', 'subs']
--------------------------------------------------
Once the n-grams are obtained, we can perform various analyses on the extracted n-grams. This section provides a few examples.
The following script provides a statistical overview of the most frequent n-grams.
from collections import Counter
import pandas as pd
from IPython.display import display, HTML
# Convert n-grams to tuples of word texts for counting
nGramTuples = [tuple(wordData['wordText'] for wordData in nGram) for nGram in allNGrams]
# Count the frequency of each n-gram
nGramFrequency = Counter(nGramTuples)
# Prepare the data for the DataFrame
nGramData = [{'N-Gram': ' '.join(nGram), 'Frequency': freq} for nGram, freq in nGramFrequency.most_common(10)]
# Create a pandas DataFrame
df = pd.DataFrame(nGramData)
# Display the title and the DataFrame
display(HTML("<h3>Most common n-grams</h3>"))
display(df)
 | N-Gram | Frequency |
---|---|---|
0 | ὁ Υἱὸς τοῦ | 60 |
1 | ὁ δὲ εἶπεν | 54 |
2 | τοῦ Κυρίου ἡμῶν | 47 |
3 | λέγω ὑμῖν ὅτι | 42 |
4 | ὁ δὲ Ἰησοῦς | 42 |
5 | Υἱὸς τοῦ ἀνθρώπου | 41 |
6 | τοῦ Θεοῦ καὶ | 40 |
7 | αὐτοῖς ὁ Ἰησοῦς | 39 |
8 | ὁ Ἰησοῦς εἶπεν | 38 |
9 | δὲ ἐγέννησεν τὸν | 37 |
The following script provides a similar overview of the most common POS tag sequences.
from collections import Counter
import pandas as pd
from IPython.display import display, HTML
# Extract POS tag sequences from n-grams
posSequences = [tuple(wordData['posTag'] for wordData in nGram) for nGram in allNGrams]
# Count the frequency of POS tag sequences
posSequenceFrequency = Counter(posSequences)
# Prepare the data for the DataFrame
posSequenceData = [{'POS Tag Sequence': ' '.join(posSeq), 'Frequency': freq} for posSeq, freq in posSequenceFrequency.most_common(10)]
# Create a pandas DataFrame
df_pos = pd.DataFrame(posSequenceData)
# Display the title and the DataFrame
display(HTML("<h3>Most common POS tag sequences</h3>"))
display(df_pos)
 | POS Tag Sequence | Frequency |
---|---|---|
0 | prep art subs | 3568 |
1 | verb art subs | 3513 |
2 | art subs pron | 3390 |
3 | art subs art | 2846 |
4 | subs art subs | 2765 |
5 | art subs verb | 2666 |
6 | art subs conj | 2402 |
7 | verb prep art | 1837 |
8 | conj art subs | 1818 |
9 | subs conj verb | 1719 |
N-grams can be used to improve POS tagging by identifying common contexts in which words appear. For instance, if a word is often preceded by a definite article, it is likely a noun.
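As a small illustration of this point, the following sketch counts the POS tags of words that immediately follow an article, building on the allNGrams list created above and assuming 'art' is the POS tag for articles (as seen in the table above). Because consecutive n-grams overlap, each adjacent word pair is counted up to n-1 times; the relative proportions are still indicative.
from collections import Counter
# Sketch: count POS tags of words immediately following an article ('art'),
# using the allNGrams list built earlier (overlapping n-grams inflate the
# absolute counts, but the proportions remain telling)
posAfterArticle = Counter(
    nGram[i + 1]['posTag']
    for nGram in allNGrams
    for i in range(len(nGram) - 1)
    if nGram[i]['posTag'] == 'art'
)
print(posAfterArticle.most_common(5))
N-grams can also be used for disambiguation. Suppose you encounter the word "ἕως", which (according to the MACULA XML Treebank) can, in the context of the Greek New Testament, be either a preposition, a conjunction, or an adverb. The following script generates a pie chart displaying the frequency distribution of the POS tags for the word "ἕως".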
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
# Define the target word in unicode
targetWord = 'ἕως'
# Prepare list to store the POS tags for "ἕως" when it's the last word
posTagsForTargetWord = []
# Find n-grams where "ἕως" is the last word and collect its POS tag
for nGramData in allNGrams:
words = [wordData['wordText'] for wordData in nGramData]
if words[-1] == targetWord: # Only include if "ἕως" is the last word
posTagsForTargetWord.append(nGramData[-1]['posTag']) # Collect the POS tag of "ἕως"
# Count the frequency of each POS tag for "ἕως"
posTagFrequency = Counter(posTagsForTargetWord)
# Prepare data for the pie chart
labels = list(posTagFrequency.keys())
sizes = list(posTagFrequency.values())
# Function to display absolute numbers and percentages on the pie chart
def absolute_and_percentage(pct, allvals):
absolute = int(pct/100. * sum(allvals))
return f"{absolute} ({pct:.1f}%)"
# Plot the pie chart
plt.figure(figsize=(7,7))
plt.pie(sizes, labels=labels, autopct=lambda pct: absolute_and_percentage(pct, sizes),
startangle=140, shadow=False)
plt.title("Distribution of POS tags for 'ἕως' as the last n-gram word")
plt.axis('equal') # Equal aspect ratio ensures that pie chart is drawn as a circle.
# Show the pie chart
plt.show()
This script analyzes the correlation between the part-of-speech (POS) tag sequences preceding the word "ἕως" and the POS tag of "ἕως" itself when it appears as the last word of an n-gram. It iterates through the n-grams, extracting the POS tags that precede "ἕως" and the POS tag assigned to "ἕως". These sequences are then grouped and counted, and a heatmap is generated to visualize how the preceding POS tag sequences are distributed across the POS tags of "ἕως", revealing the grammatical patterns leading up to its usage.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
# Define the target word in unicode
targetWord = 'ἕως'
# Prepare lists to store preceding POS tags and the POS of "ἕως"
precedingPOSTags = []
finalPOSForTargetWord = []
# Find n-grams where "ἕως" is the last word
for nGramData in allNGrams:
words = [wordData['wordText'] for wordData in nGramData]
if words[-1] == targetWord: # Only include if "ἕως" is the last word
posTags = [wordData['posTag'] for wordData in nGramData]
precedingPOSTags.append(' '.join(posTags[:-1])) # Collect preceding POS tags
finalPOSForTargetWord.append(posTags[-1]) # Collect POS tag of "ἕως"
# Create a DataFrame to store the relationships
df = pd.DataFrame({
'Preceding POS Tags': precedingPOSTags,
'Final POS (ἕως)': finalPOSForTargetWord
})
# Count occurrences of preceding POS tag sequences grouped by the POS of "ἕως"
groupedData = df.groupby(['Final POS (ἕως)', 'Preceding POS Tags']).size().unstack(fill_value=0)
# Plot a heatmap to visualize the relationship between preceding POS tags and the POS of "ἕως"
plt.figure(figsize=(12, 8))
sns.heatmap(groupedData, annot=True, fmt="d", cmap="Blues", cbar=True)
plt.title("Preceding POS tag sequences correlated with POS of 'ἕως'")
plt.xlabel("Preceding POS tags")
plt.ylabel("Final POS (ἕως)")
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.show()
The scripts in this notebook require (besides text-fabric) the following Python libraries to be installed in your environment:
pandas
matplotlib
seaborn
The collections module is part of the Python standard library and does not need to be installed separately. You can install any missing library from within Jupyter Notebook using either pip or pip3.
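For example, the three third-party libraries can be installed in one go from a notebook cell (a sketch; package names as listed above):
# Install any missing third-party libraries from within the notebook
!pip install pandas matplotlib seaborn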
The following cell displays the active Anaconda environment along with a list of all installed packages and their versions within that environment.
import subprocess
from IPython.display import display, HTML
# Display the active conda environment
!conda env list | findstr "*"
# Run conda list and capture the output
condaListOutput = subprocess.check_output("conda list", shell=True).decode("utf-8")
# Wrap the output with <details> and <summary> HTML tags
htmlOutput = "<details><summary>Click to view installed packages</summary><pre>"
htmlOutput += condaListOutput
htmlOutput += "</pre></details>"
# Display the HTML in the notebook
display(HTML(htmlOutput))
Text-Fabric * C:\Users\tonyj\anaconda3\envs\Text-Fabric
# packages in environment at C:\Users\tonyj\anaconda3\envs\Text-Fabric:
(collapsed list of installed packages; versions relevant to this notebook: python 3.12.7, text-fabric 12.5.5, pandas 2.2.3, matplotlib 3.9.2, seaborn 0.13.2)