The surface text of the Torah is divided into pisqot (units that can be compared to paragraphs). This division consists of two types of sections, petuchot (open sections) and setumot (closed sections), which are marked by the Hebrew letters פ (pe) and ס (samekh), respectively.
These markings help structure the text and convey interpretative cues within the Torah. In this notebook we will perform some statistical analysis on these surface text features.
Detailed information regarding petuchot and setumot can be found in “The Text of the Tanak” by Russell Fuller [1].
This notebook uses the ETCBC BHSA as the dataset representing the Hebrew text of the TeNaCh.
%load_ext autoreload
%autoreload 2
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment.
from tf.fabric import Fabric
from tf.app import use
# Load the BHSA app and data
BHS = use("etcbc/BHSA", hoist=globals())
Locating corpus resources ...
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 39 | 10938.21 | 100 |
chapter | 929 | 459.19 | 100 |
lex | 9230 | 46.22 | 100 |
verse | 23213 | 18.38 | 100 |
half_verse | 45179 | 9.44 | 100 |
sentence | 63717 | 6.70 | 100 |
sentence_atom | 64514 | 6.61 | 100 |
clause | 88131 | 4.84 | 100 |
clause_atom | 90704 | 4.70 | 100 |
phrase | 253203 | 1.68 | 100 |
phrase_atom | 267532 | 1.59 | 100 |
subphrase | 113850 | 1.42 | 38 |
word | 426590 | 1.00 | 100 |
BHSA = Biblia Hebraica Stuttgartensia Amstelodamensis
Note: The Text-Fabric feature documentation can be found at ETCBC GitHub
# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)
BHS.dh(BHS.getCss())
Occurrences of פ (pe) and ס (samekh), which mark section breaks in the BHSA Text-Fabric dataset, are stored in the trailer feature.
To begin, we generate a frequency table for this feature; note that it covers the full TeNaCh, not only the Torah. As the output shows, the Hebrew letters appear in transliterated form, with a trailing P and S representing pe and samekh, respectively.
F.trailer.freqList()
((' ', 236930), ('', 121801), ('&', 42275), ('00 ', 20146), ('05 ', 2266), ('00_S ', 1892), ('00_P ', 1165), ('_S ', 76), (' 05 ', 17), ('_P ', 13), ('00_N ', 7), ('00_N_P ', 1), ('00_N_S ', 1))
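As a rough cross-check, the totals for the full TeNaCh can be derived from the frequency list itself. The sketch below copies the tuples from the output above and sums every trailer value that contains `_P` or `_S`. The helper name `count_marker` is made up for this illustration and is not part of Text-Fabric.

```python
# Frequency list copied from the F.trailer.freqList() output above
trailer_freqs = (
    (' ', 236930), ('', 121801), ('&', 42275), ('00 ', 20146), ('05 ', 2266),
    ('00_S ', 1892), ('00_P ', 1165), ('_S ', 76), (' 05 ', 17), ('_P ', 13),
    ('00_N ', 7), ('00_N_P ', 1), ('00_N_S ', 1),
)


def count_marker(freqs, marker):
    """Sum the frequencies of all trailer values that contain `marker`."""
    return sum(count for trailer, count in freqs if marker in trailer)


# Hypothetical helper applied to both section markers
petucha_total = count_marker(trailer_freqs, '_P')
setuma_total = count_marker(trailer_freqs, '_S')

print(f"petuchot in the TeNaCh: {petucha_total}")  # 1165 + 13 + 1 = 1179
print(f"setumot in the TeNaCh:  {setuma_total}")   # 1892 + 76 + 1 = 1969
```

These totals cover all 39 books; the queries that follow restrict the count to the five books of the Torah.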
# Find the parashot petuchot
petuchaQuery = '''
book book=Genesis|Exodus|Leviticus|Numeri|Deuteronomium
  word trailer~_P
'''
petuchaResults = BHS.search(petuchaQuery)
0.32s 294 results
# Find the parashot setumot
setumaQuery = '''
book book=Genesis|Exodus|Leviticus|Numeri|Deuteronomium
  word trailer~_S
'''
setumaResults = BHS.search(setumaQuery)
0.33s 392 results
These two scripts count the occurrences of petuchot and setumot per book. They generate formatted tables that summarize these counts.
# Import necessary libraries
from collections import defaultdict

# Initialize a dictionary to store counts per book
petuchaCounts = defaultdict(int)

# Iterate over the results and count petuchot per book
for book, petucha in petuchaResults:
    petuchaCounts[book] += 1

# Sort the book nodes (ascending node order corresponds to the canonical book order)
sortedBooks = sorted(petuchaCounts.keys())

# Display the results in a formatted table
print(f"{'Book':<20}{'Number of Petuchot'}")
print('-' * 35)
for book in sortedBooks:
    print(f"{F.book.v(book):<20}{petuchaCounts[book]}")
Book                Number of Petuchot
-----------------------------------
Genesis             42
Exodus              70
Leviticus           55
Numeri              95
Deuteronomium       32
# Import necessary libraries
from collections import defaultdict

# Initialize a dictionary to store counts per book
setumaCounts = defaultdict(int)

# Iterate over the results and count setumot per book
for book, setuma in setumaResults:
    setumaCounts[book] += 1

# Sort the book nodes (ascending node order corresponds to the canonical book order)
sortedBooks = sorted(setumaCounts.keys())

# Display the results in a formatted table
print(f"{'Book':<20}{'Number of Setumot'}")
print('-' * 35)
for book in sortedBooks:
    print(f"{F.book.v(book):<20}{setumaCounts[book]}")
Book                Number of Setumot
-----------------------------------
Genesis             50
Exodus              94
Leviticus           49
Numeri              64
Deuteronomium       135
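The same kind of tally can be written more compactly with `collections.Counter`. The sketch below runs on a few made-up `(book, word)` tuples that only mimic the shape of the search results. The node numbers are placeholders, not BHSA data, and book names are used instead of book nodes purely for readability.

```python
from collections import Counter

# Hypothetical (book, word) pairs standing in for BHS.search() results
sample_results = [
    ('Genesis', 101), ('Genesis', 245), ('Exodus', 3020),
    ('Exodus', 3111), ('Exodus', 3500), ('Leviticus', 5200),
]

# Count how many result tuples belong to each book
counts = Counter(book for book, _word in sample_results)

# Print a small table; a Counter keeps the order of first appearance
print(f"{'Book':<20}{'Sections'}")
print('-' * 35)
for book, n in counts.items():
    print(f"{book:<20}{n}")
```

With real search results the keys would be book nodes, so the book name would still have to be looked up with `F.book.v(book)` when printing.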
The following script creates a statistical overview of the petucha length per book.
# Import necessary libraries
import pandas as pd

# Function to get a reference string from a verse node
def getVerseReference(node):
    section = T.sectionFromNode(node)
    return f"{section[0]} {section[1]}:{section[2]}" if section else 'Unknown Reference'

# Function to process each petucha and append it to the list
def addPetuchaInfo(petuchaList, index, startWordNode, endWordNode, length, bookName):
    startVerseNodes = L.u(startWordNode, otype='verse')
    endVerseNodes = L.u(endWordNode, otype='verse')
    startRefStr = getVerseReference(startVerseNodes[0]) if startVerseNodes else 'Unknown Reference'
    endRefStr = getVerseReference(endVerseNodes[0]) if endVerseNodes else 'Unknown Reference'
    petuchaList.append({
        'Index': index,
        'StartRef': startRefStr,
        'EndRef': endRefStr,
        'Length': length,
        'Book': bookName
    })

# Initialize variables
petuchaInfo = []
currentPetuchaLength = 0
currentPetuchaStartWord = None
index = 1

# Find all words in the Torah
wordsInTorahQuery = '''
book book=Genesis|Exodus|Leviticus|Numeri|Deuteronomium
  word
'''
wordsInTorah = BHS.search(wordsInTorahQuery)

# Iterate over all words in the Torah
for bookNode, wordNode in wordsInTorah:
    # Get the trailer feature of the word
    trailer = F.trailer.v(wordNode) or ''  # prevent 'NoneType' errors

    # If starting a new petucha, record the start word.
    # Note: a section is attributed to the book in which it starts.
    if currentPetuchaStartWord is None:
        currentPetuchaStartWord = wordNode
        currentBookName = F.book.v(bookNode)

    # Increment the length counter
    currentPetuchaLength += 1

    # Check if the word ends a petucha (represented by 'P' in the trailer)
    if 'P' in trailer:
        addPetuchaInfo(petuchaInfo, index, currentPetuchaStartWord, wordNode, currentPetuchaLength, currentBookName)
        # Reset the variables for the next petucha
        currentPetuchaLength = 0
        currentPetuchaStartWord = None
        index += 1

# Handle any remaining words after the last petucha marker
if currentPetuchaLength > 0 and currentPetuchaStartWord is not None:
    addPetuchaInfo(petuchaInfo, index, currentPetuchaStartWord, wordNode, currentPetuchaLength, currentBookName)

# Convert the petuchaInfo list to a pandas DataFrame for analysis
df = pd.DataFrame(petuchaInfo)

# Define the desired book order
orderedBooks = ['Genesis', 'Exodus', 'Leviticus', 'Numeri', 'Deuteronomium']

# Display per-book statistics using a specified formatting
print("\nStatistical overview of petucha lengths per book:")
bookStats = df.groupby('Book')['Length'].describe().round(2)
bookStats['count'] = bookStats['count'].astype(int)
bookStats['min'] = bookStats['min'].astype(int)
bookStats['max'] = bookStats['max'].astype(int)

# Calculate the total row across all books.
# Note: mean, std and quartiles are unweighted averages of the per-book values,
# not statistics computed over the pooled sections.
totalStats = pd.DataFrame({
    'count': [int(bookStats['count'].sum())],
    'mean': [round(bookStats['mean'].mean(), 2)],
    'std': [round(bookStats['std'].mean(), 2)],
    'min': [int(bookStats['min'].min())],
    '25%': [round(bookStats['25%'].mean(), 2)],
    '50%': [round(bookStats['50%'].mean(), 2)],
    '75%': [round(bookStats['75%'].mean(), 2)],
    'max': [int(bookStats['max'].max())]
}, index=['Total'])

# Concatenate the total row with the bookStats DataFrame
bookStats = pd.concat([bookStats, totalStats])

# Reorder bookStats into the canonical book order, followed by the total row
bookStats = bookStats.reindex(orderedBooks + ['Total'])

# Configure display options to show all data on a single line for each book
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', 1000)  # Wide display to avoid line wrapping
print(bookStats)
0.46s 112927 results

Statistical overview of petucha lengths per book:
               count    mean     std  min     25%    50%     75%   max
Genesis           43  670.63  918.78   49  121.00  281.0  788.50  4775
Exodus            70  342.16  283.41   35  130.25  266.5  454.25  1685
Leviticus         55  309.85  309.83   49  116.00  189.0  377.00  1518
Numeri            95  266.60  363.68   31   81.00  160.0  358.50  2646
Deuteronomium     32  555.31  474.97   42  194.00  434.5  715.00  1703
Total            295  428.91  470.13   31  128.45  266.2  538.65  4775
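The length computation in the script above boils down to a single pass over the words: count words until a trailer contains the marker, record the count, and start again. The following minimal sketch isolates that logic on invented trailer values. The function name `section_lengths` and the toy data are assumptions made for illustration, not part of Text-Fabric or the BHSA.

```python
def section_lengths(trailers, marker):
    """Return the word counts of consecutive sections.

    A section ends at every word whose trailer contains `marker`;
    words after the last marker form a final, unterminated section.
    """
    lengths = []
    current = 0
    for trailer in trailers:
        current += 1
        if marker in (trailer or ''):
            lengths.append(current)
            current = 0
    if current:
        # Remaining words after the last marker
        lengths.append(current)
    return lengths


# Made-up trailer values for nine consecutive words (not BHSA data)
toy_trailers = [' ', ' ', '00_P ', ' ', '00 ', ' ', '00_S ', ' ', '00_P ']

print(section_lengths(toy_trailers, 'P'))  # [3, 6]
print(section_lengths(toy_trailers, 'S'))  # [7, 2]
```

As in the script, each marker is evaluated independently of the other: a petucha marker does not end a setuma-delimited span, and vice versa.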
The following script creates a scatter plot displaying the length distribution of the petucha sections. Hovering over a datapoint provides more details, such as the word count and the start and end verse. This script uses the data created by the previous script.
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.transform import factor_cmap
from bokeh.palettes import Category10
# Add a petucha index for plotting
df['petuchaIndex'] = df.index + 1
# Prepare data for Bokeh
source = ColumnDataSource(df)
# Define the color palette
books = df['Book'].unique()
palette = Category10[len(books)]
color_map = factor_cmap('Book', palette=palette, factors=books)
# Create the figure
output_notebook() # To display the plot in a Jupyter notebook
p = figure(
    width=1000,
    height=700,
    title='Petucha lengths in the Torah (in words)',
    x_axis_label='Petucha index',
    y_axis_label='Length (in words)',
    tools="pan,wheel_zoom,box_zoom,reset,save"
)

# Add the scatter plot using scatter()
p.scatter(
    x='petuchaIndex',
    y='Length',
    source=source,
    size=8,
    color=color_map,
    legend_field='Book',
    marker='circle',
    line_color='black',
    fill_alpha=0.8
)

# Add hover tool
hover = HoverTool()
hover.tooltips = [
    ('Petucha index', '@Index'),
    ('Length', '@Length'),
    ('Start verse', '@StartRef'),
    ('End verse', '@EndRef'),
    (' ', ' ')  # to get a blank line when multiple datapoints are grouped when hovering
]
p.add_tools(hover)
# Customize legend
p.legend.location = 'top_right'
p.legend.click_policy = 'hide'
# Show the plot
show(p)
The following script creates a statistical overview of the setuma length per book.
# Import necessary libraries
import pandas as pd

# Function to get a reference string from a verse node
def getVerseReference(node):
    section = T.sectionFromNode(node)
    return f"{section[0]} {section[1]}:{section[2]}" if section else 'Unknown Reference'

# Function to process each setuma and append it to the list
def addSetumaInfo(setumaList, index, startWordNode, endWordNode, length, bookName):
    startVerseNodes = L.u(startWordNode, otype='verse')
    endVerseNodes = L.u(endWordNode, otype='verse')
    startRefStr = getVerseReference(startVerseNodes[0]) if startVerseNodes else 'Unknown Reference'
    endRefStr = getVerseReference(endVerseNodes[0]) if endVerseNodes else 'Unknown Reference'
    setumaList.append({
        'Index': index,
        'StartRef': startRefStr,
        'EndRef': endRefStr,
        'Length': length,
        'Book': bookName
    })

# Initialize variables
setumaInfo = []
currentSetumaLength = 0
currentSetumaStartWord = None
index = 1

# Find all words in the Torah
wordsInTorahQuery = '''
book book=Genesis|Exodus|Leviticus|Numeri|Deuteronomium
  word
'''
wordsInTorah = BHS.search(wordsInTorahQuery)

# Iterate over all words in the Torah
for bookNode, wordNode in wordsInTorah:
    # Get the trailer feature of the word
    trailer = F.trailer.v(wordNode) or ''  # prevent 'NoneType' errors

    # If starting a new setuma, record the start word.
    # Note: a section is attributed to the book in which it starts.
    if currentSetumaStartWord is None:
        currentSetumaStartWord = wordNode
        currentBookName = F.book.v(bookNode)

    # Increment the length counter
    currentSetumaLength += 1

    # Check if the word ends a setuma (represented by 'S' in the trailer)
    if 'S' in trailer:
        addSetumaInfo(setumaInfo, index, currentSetumaStartWord, wordNode, currentSetumaLength, currentBookName)
        # Reset the variables for the next setuma
        currentSetumaLength = 0
        currentSetumaStartWord = None
        index += 1

# Handle any remaining words after the last setuma marker
if currentSetumaLength > 0 and currentSetumaStartWord is not None:
    addSetumaInfo(setumaInfo, index, currentSetumaStartWord, wordNode, currentSetumaLength, currentBookName)

# Convert the setumaInfo list to a pandas DataFrame for analysis
df = pd.DataFrame(setumaInfo)

# Define the desired book order
orderedBooks = ['Genesis', 'Exodus', 'Leviticus', 'Numeri', 'Deuteronomium']

# Display per-book statistics using a specified formatting
print("\nStatistical overview of setuma lengths per book:")
bookStats = df.groupby('Book')['Length'].describe().round(2)
bookStats['count'] = bookStats['count'].astype(int)
bookStats['min'] = bookStats['min'].astype(int)
bookStats['max'] = bookStats['max'].astype(int)

# Calculate the total row across all books.
# Note: mean, std and quartiles are unweighted averages of the per-book values,
# not statistics computed over the pooled sections.
totalStats = pd.DataFrame({
    'count': [int(bookStats['count'].sum())],
    'mean': [round(bookStats['mean'].mean(), 2)],
    'std': [round(bookStats['std'].mean(), 2)],
    'min': [int(bookStats['min'].min())],
    '25%': [round(bookStats['25%'].mean(), 2)],
    '50%': [round(bookStats['50%'].mean(), 2)],
    '75%': [round(bookStats['75%'].mean(), 2)],
    'max': [int(bookStats['max'].max())]
}, index=['Total'])

# Concatenate the total row with the bookStats DataFrame
bookStats = pd.concat([bookStats, totalStats])

# Reorder bookStats into the canonical book order, followed by the total row
bookStats = bookStats.reindex(orderedBooks + ['Total'])

# Configure display options to show all data on a single line for each book
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', 1000)  # Wide display to avoid line wrapping
print(bookStats)
0.49s 112927 results

Statistical overview of setuma lengths per book:
               count    mean     std  min     25%    50%     75%   max
Genesis           51  570.84  867.37    7   37.00  239.0  720.50  4209
Exodus            94  250.84  345.09    2   29.75  102.5  305.50  1650
Leviticus         49  363.63  470.58    8   23.00  172.0  494.00  1848
Numeri            64  363.44  556.47    4   47.75  115.5  384.25  2522
Deuteronomium    135  141.90  268.65    2   18.00   59.0  138.00  2000
Total            393  338.13  501.63    2   31.10  137.6  408.45  4209
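One caveat when reading the `Total` row: its mean, standard deviation and quartiles are computed as plain averages of the five per-book values, which generally differs from the statistics of all sections pooled together. The sketch below illustrates the difference for the mean only, using small invented lengths. The book labels and numbers are placeholders, not BHSA data.

```python
from statistics import mean

# Invented section lengths per book (not BHSA data)
lengths_by_book = {
    'BookA': [100, 200, 300],  # three sections, per-book mean 200
    'BookB': [1000],           # one section, per-book mean 1000
}

# Unweighted average of the per-book means (the approach used for the Total row)
mean_of_means = mean(mean(values) for values in lengths_by_book.values())

# Mean over all sections pooled together
pooled = [n for values in lengths_by_book.values() for n in values]
pooled_mean = mean(pooled)

print(f"mean of per-book means: {mean_of_means:.1f}")  # 600.0
print(f"pooled mean:            {pooled_mean:.1f}")    # 400.0
```

The two figures coincide only when every book contributes the same number of sections, so a book with few, long sections pulls the averaged value upward.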
The following script creates a scatter plot displaying the length distribution of the setuma sections. Hovering over a datapoint provides more details, such as the word count and the start and end verse. This script uses the data created by the previous script.
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.transform import factor_cmap
from bokeh.palettes import Category10
# Add a setuma index for plotting
df['setumaIndex'] = df.index + 1
# Prepare data for Bokeh
source = ColumnDataSource(df)
# Define the color palette
books = df['Book'].unique()
palette = Category10[len(books)]
color_map = factor_cmap('Book', palette=palette, factors=books)
# Create the figure
output_notebook() # To display the plot in a Jupyter notebook
p = figure(
    width=1000,
    height=700,
    title='Setuma lengths in the Torah (in words)',
    x_axis_label='Setuma index',
    y_axis_label='Length (in words)',
    tools="pan,wheel_zoom,box_zoom,reset,save"
)

# Add the scatter plot using scatter()
p.scatter(
    x='setumaIndex',
    y='Length',
    source=source,
    size=8,
    color=color_map,
    legend_field='Book',
    marker='circle',
    line_color='black',
    fill_alpha=0.8
)

# Add hover tool
hover = HoverTool()
hover.tooltips = [
    ('Setuma index', '@Index'),
    ('Length', '@Length'),
    ('Start verse', '@StartRef'),
    ('End verse', '@EndRef'),
    (' ', ' ')  # to get a blank line when multiple datapoints are grouped when hovering
]
p.add_tools(hover)
# Customize legend
p.legend.location = 'top_right'
p.legend.click_policy = 'hide'
# Show the plot
show(p)
[1] Russell Fuller, “The Text of the Tanak,” in A History of Biblical Interpretation: The Medieval through the Reformation Periods, ed. Alan J. Hauser, Duane F. Watson, and Schuyler Kaufman (Grand Rapids, MI; Cambridge, U.K.: William B. Eerdmans Publishing Company, 2009), 206.
The scripts in this notebook require (besides text-fabric) the following Python libraries to be available in the environment:
collections (part of the Python standard library)
pandas
bokeh
IPython
You can install any missing library from within Jupyter Notebook using either pip or pip3.
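Before installing anything, it can help to check which of these libraries are already importable. The sketch below uses the standard-library function `importlib.util.find_spec` for that purpose. The helper name `missing_modules` is invented for this example, and the import names in `required` are assumptions (`tf` is the name under which Text-Fabric is imported earlier in this notebook).

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names from `module_names` that cannot be imported here."""
    return [name for name in module_names if find_spec(name) is None]


# Assumed import names for the libraries listed above
required = ['tf', 'collections', 'pandas', 'bokeh', 'IPython']

missing = missing_modules(required)
print('Missing modules:', ', '.join(missing) if missing else 'none')
```

Any name reported as missing can then be installed with pip; note that the import name and the package name may differ, as with `tf` and `text-fabric`.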