An n-gram is a contiguous sequence of n items (typically words or characters) from a given text or speech. In the context of a text corpus, n-grams are used to analyze patterns of word usage and co-occurrence by grouping items into chunks of size n. For example, a 1-gram (unigram) considers individual words, while a 2-gram (bigram) examines pairs of consecutive words. N-grams are particularly useful in natural language processing (NLP) for tasks like text prediction, language modeling, and understanding the structure and context of a large corpus. This notebook shows how to create n-grams within the Text-Fabric environment.
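As a minimal illustration in plain Python (independent of Text-Fabric, with hypothetical example tokens), bigrams can be produced from a list of tokens by slicing:
# Minimal sketch: extracting bigrams (n = 2) from a list of tokens
tokens = ['In', 'the', 'beginning', 'was', 'the', 'Word']
n = 2
bigrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
print(bigrams)
# [('In', 'the'), ('the', 'beginning'), ('beginning', 'was'), ('was', 'the'), ('the', 'Word')]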
%load_ext autoreload
%autoreload 2
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use
# load the N1904 app and data
N1904 = use("CenterBLC/N1904", version="1.0.0", hoist=globals())
Locating corpus resources ...
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 27 | 5102.93 | 100 |
chapter | 260 | 529.92 | 100 |
verse | 7944 | 17.34 | 100 |
sentence | 8011 | 17.20 | 100 |
group | 8945 | 7.01 | 46 |
clause | 42506 | 8.36 | 258 |
wg | 106868 | 6.88 | 533 |
phrase | 69007 | 1.90 | 95 |
subphrase | 116178 | 1.60 | 135 |
word | 137779 | 1.00 | 100 |
Display is setup for viewtype syntax-view
See here for more information on viewtypes
# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)
N1904.dh(N1904.getCss())
We'll extract n-grams of words along with their POS tags.
In the following script we rely upon the following Text-Fabric features:
otype: to select all chapter nodes and their word descendants
unicode: the Greek word in Unicode
sp: the part of speech of the word
lemma: the lemma of the word
morph: the morphological code of the word
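As a quick sanity check, these features can be inspected on a single word node (a minimal sketch, assuming the corpus has been loaded with use(...) as above):
# Inspect the relevant features on the first word node of the corpus
firstWord = F.otype.s('word')[0]   # first word node
print(F.unicode.v(firstWord))      # the Greek word in Unicode
print(F.sp.v(firstWord))           # its part-of-speech tag
print(F.lemma.v(firstWord))        # its lemma
print(F.morph.v(firstWord))        # its morphological code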
Set the size of the n-gram you wish to extract: for example, n = 2 for bigrams or n = 3 for trigrams. A lower n (e.g., 1 or 2) provides a more fine-grained analysis, focusing on smaller units, which is useful for basic statistics or initial exploration. A higher n (e.g., 3 or more) captures more structured relationships, which is useful for more complex linguistic or computational tasks.
# Setting the size of the n-gram
n = 3 # Trigrams; change to 2 for bigrams, 4 for 4-grams, etc.
The following script first retrieves all chapter nodes with F.otype.s('chapter') and gets their word descendants using L.d(chapterNode, 'word'). The extractNGrams(words, n) function generates n-grams from these word lists; extracting per chapter ensures that no n-gram crosses a chapter boundary. The features unicode, sp, lemma, and morph are used to build each n-gram item. Each n-gram is stored as a list of word dictionaries, and the first five are printed for verification.
# Function to extract n-grams from a list of words
def extractNGrams(words, n):
return [words[i:i+n] for i in range(len(words) - n + 1)]
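# Example: extractNGrams(['a', 'b', 'c', 'd'], 2) -> [['a', 'b'], ['b', 'c'], ['c', 'd']]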
# Collect all n-grams in a list
allNGrams = []
# Iterate over all chapters in the New Testament
for chapterNode in F.otype.s('chapter'):
wordsInChapter = L.d(chapterNode, 'word')
nGramsInChapter = extractNGrams(wordsInChapter, n)
for nGram in nGramsInChapter:
nGramData = []
for wordNode in nGram:
wordText = F.unicode.v(wordNode) # Greek word in Unicode
posTag = F.sp.v(wordNode) # Part of Speech
lemma = F.lemma.v(wordNode) # Lemma of the word
morph = F.morph.v(wordNode) # Morphological code
# Collect data for each word in the n-gram
nGramData.append({
'wordText': wordText,
'posTag': posTag,
'lemma': lemma,
'morph': morph
})
# Add the n-gram data to the list
allNGrams.append(nGramData)
# verification - Print the first 5 n-grams
for nGramData in allNGrams[:5]:
words = [wordData['wordText'] for wordData in nGramData]
posTags = [wordData['posTag'] for wordData in nGramData]
print(f"Words: {' '.join(words)}")
print(f"POS Tags: {posTags}")
print('-' * 50)
Words: Βίβλος γενέσεως Ἰησοῦ
POS Tags: ['subs', 'subs', 'subs']
--------------------------------------------------
Words: γενέσεως Ἰησοῦ Χριστοῦ
POS Tags: ['subs', 'subs', 'subs']
--------------------------------------------------
Words: Ἰησοῦ Χριστοῦ υἱοῦ
POS Tags: ['subs', 'subs', 'subs']
--------------------------------------------------
Words: Χριστοῦ υἱοῦ Δαυεὶδ
POS Tags: ['subs', 'subs', 'subs']
--------------------------------------------------
Words: υἱοῦ Δαυεὶδ υἱοῦ
POS Tags: ['subs', 'subs', 'subs']
--------------------------------------------------
Once the n-grams are obtained, we can perform various analyses on the extracted n-grams. This section provides a few examples.
The following script provides a statistical overview of the most frequent n-grams.
from collections import Counter
import pandas as pd
from IPython.display import display, HTML
# Convert n-grams to tuples of word texts for counting
nGramTuples = [tuple(wordData['wordText'] for wordData in nGram) for nGram in allNGrams]
# Count the frequency of each n-gram
nGramFrequency = Counter(nGramTuples)
# Prepare the data for the DataFrame
nGramData = [{'N-Gram': ' '.join(nGram), 'Frequency': freq} for nGram, freq in nGramFrequency.most_common(10)]
# Create a pandas DataFrame
df = pd.DataFrame(nGramData)
# Display the title and the DataFrame
display(HTML("<h3>Most common n-grams</h3>"))
display(df)
 | N-Gram | Frequency |
---|---|---|
0 | ὁ Υἱὸς τοῦ | 60 |
1 | ὁ δὲ εἶπεν | 54 |
2 | τοῦ Κυρίου ἡμῶν | 47 |
3 | λέγω ὑμῖν ὅτι | 42 |
4 | ὁ δὲ Ἰησοῦς | 42 |
5 | Υἱὸς τοῦ ἀνθρώπου | 41 |
6 | τοῦ Θεοῦ καὶ | 40 |
7 | αὐτοῖς ὁ Ἰησοῦς | 39 |
8 | ὁ Ἰησοῦς εἶπεν | 38 |
9 | δὲ ἐγέννησεν τὸν | 37 |
The following script provides a similar overview of the most common POS tag sequences.
from collections import Counter
import pandas as pd
from IPython.display import display, HTML
# Extract POS tag sequences from n-grams
posSequences = [tuple(wordData['posTag'] for wordData in nGram) for nGram in allNGrams]
# Count the frequency of POS tag sequences
posSequenceFrequency = Counter(posSequences)
# Prepare the data for the DataFrame
posSequenceData = [{'POS Tag Sequence': ' '.join(posSeq), 'Frequency': freq} for posSeq, freq in posSequenceFrequency.most_common(10)]
# Create a pandas DataFrame
df_pos = pd.DataFrame(posSequenceData)
# Display the title and the DataFrame
display(HTML("<h3>Most common POS tag sequences</h3>"))
display(df_pos)
 | POS Tag Sequence | Frequency |
---|---|---|
0 | prep art subs | 3568 |
1 | verb art subs | 3513 |
2 | art subs pron | 3390 |
3 | art subs art | 2846 |
4 | subs art subs | 2765 |
5 | art subs verb | 2666 |
6 | art subs conj | 2402 |
7 | verb prep art | 1837 |
8 | conj art subs | 1818 |
9 | subs conj verb | 1719 |
N-grams can be used to improve POS tagging by identifying common contexts in which words appear. For instance, if a word is often preceded by a definite article, it is likely a noun.
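As a small illustration of this point, the following sketch counts the POS tags of words that immediately follow an article, building on the allNGrams list created above and assuming 'art' is the POS tag for articles (as seen in the table above). Because consecutive n-grams overlap, each adjacent word pair is counted up to n-1 times; the relative proportions are still indicative.
from collections import Counter
# Sketch: count POS tags of words immediately following an article ('art'),
# using the allNGrams list built earlier (overlapping n-grams inflate the
# absolute counts, but the proportions remain telling)
posAfterArticle = Counter(
    nGram[i + 1]['posTag']
    for nGram in allNGrams
    for i in range(len(nGram) - 1)
    if nGram[i]['posTag'] == 'art'
)
print(posAfterArticle.most_common(5))
N-grams can also be used for disambiguation. Suppose you encounter the word "ἕως", which (according to the MACULA XML Treebank) can, in the context of the Greek New Testament, be either a preposition, a conjunction, or an adverb. The following script generates a pie chart displaying the frequency distribution of the POS tags for the word "ἕως".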
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
# Define the target word in unicode
targetWord = 'ἕως'
# Prepare list to store the POS tags for "ἕως" when it's the last word
posTagsForTargetWord = []
# Find n-grams where "ἕως" is the last word and collect its POS tag
for nGramData in allNGrams:
words = [wordData['wordText'] for wordData in nGramData]
if words[-1] == targetWord: # Only include if "ἕως" is the last word
posTagsForTargetWord.append(nGramData[-1]['posTag']) # Collect the POS tag of "ἕως"
# Count the frequency of each POS tag for "ἕως"
posTagFrequency = Counter(posTagsForTargetWord)
# Prepare data for the pie chart
labels = list(posTagFrequency.keys())
sizes = list(posTagFrequency.values())
# Function to display absolute numbers and percentages on the pie chart
def absolute_and_percentage(pct, allvals):
absolute = int(pct/100. * sum(allvals))
return f"{absolute} ({pct:.1f}%)"
# Plot the pie chart
plt.figure(figsize=(7,7))
plt.pie(sizes, labels=labels, autopct=lambda pct: absolute_and_percentage(pct, sizes),
startangle=140, shadow=False)
plt.title("Distribution of POS tags for 'ἕως' as the last n-gram word")
plt.axis('equal') # Equal aspect ratio ensures that pie chart is drawn as a circle.
# Show the pie chart
plt.show()
This script analyzes the correlation between the part-of-speech (POS) tag sequences preceding the word "ἕως" and the POS tag of "ἕως" itself when it appears as the last word of an n-gram. It iterates through the n-grams, extracting the POS tags that precede "ἕως" and the POS tag assigned to "ἕως". These sequences are then grouped and counted, and a heatmap is generated to visualize how the preceding POS tag sequences are distributed across the POS tags of "ἕως", revealing the grammatical patterns leading up to its usage.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
# Define the target word in unicode
targetWord = 'ἕως'
# Prepare lists to store preceding POS tags and the POS of "ἕως"
precedingPOSTags = []
finalPOSForTargetWord = []
# Find n-grams where "ἕως" is the last word
for nGramData in allNGrams:
words = [wordData['wordText'] for wordData in nGramData]
if words[-1] == targetWord: # Only include if "ἕως" is the last word
posTags = [wordData['posTag'] for wordData in nGramData]
precedingPOSTags.append(' '.join(posTags[:-1])) # Collect preceding POS tags
finalPOSForTargetWord.append(posTags[-1]) # Collect POS tag of "ἕως"
# Create a DataFrame to store the relationships
df = pd.DataFrame({
'Preceding POS Tags': precedingPOSTags,
'Final POS (ἕως)': finalPOSForTargetWord
})
# Count occurrences of preceding POS tag sequences grouped by the POS of "ἕως"
groupedData = df.groupby(['Final POS (ἕως)', 'Preceding POS Tags']).size().unstack(fill_value=0)
# Plot a heatmap to visualize the relationship between preceding POS tags and the POS of "ἕως"
plt.figure(figsize=(12, 8))
sns.heatmap(groupedData, annot=True, fmt="d", cmap="Blues", cbar=True)
plt.title("Preceding POS tag sequences correlated with POS of 'ἕως'")
plt.xlabel("Preceding POS tags")
plt.ylabel("Final POS (ἕως)")
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.show()
The scripts in this notebook require (besides text-fabric) the following Python libraries to be installed in your environment:
pandas
matplotlib
seaborn
The collections module is part of the Python standard library and does not need to be installed separately. You can install any missing library from within Jupyter Notebook using either pip or pip3.
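For example, the three third-party libraries can be installed in one go from a notebook cell (a sketch; package names as listed above):
# Install any missing third-party libraries from within the notebook
!pip install pandas matplotlib seaborn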
The following cell displays the active Anaconda environment along with a list of all installed packages and their versions within that environment.
import subprocess
from IPython.display import display, HTML
# Display the active conda environment
!conda env list | findstr "*"
# Run conda list and capture the output
condaListOutput = subprocess.check_output("conda list", shell=True).decode("utf-8")
# Wrap the output with <details> and <summary> HTML tags
htmlOutput = "<details><summary>Click to view installed packages</summary><pre>"
htmlOutput += condaListOutput
htmlOutput += "</pre></details>"
# Display the HTML in the notebook
display(HTML(htmlOutput))
Text-Fabric * C:\Users\tonyj\anaconda3\envs\Text-Fabric
# packages in environment at C:\Users\tonyj\anaconda3\envs\Text-Fabric:
(collapsed list of installed packages; versions relevant to this notebook: python 3.12.7, text-fabric 12.5.5, pandas 2.2.3, matplotlib 3.9.2, seaborn 0.13.2)