
Splitting 'ref' into book, chapter, verse and word

The XML source data contains an attribute called 'ref', which encodes the book, chapter, verse and word of each node. See the following snippet:

<Node xml:id="n40015030011" ref="MAT 15:30!11" Cat="adj" Start="10" End="10" StrongNumber="5185" UnicodeLemma="τυφλός" Gender="Masculine" Number="Plural" FunctionalTag="A-APM" Type="" morphId="40015030011" NormalizedForm="τυφλούς" Case="Accusative" Unicode="τυφλούς," FormalTag="A-APM" nodeId="400150300110010" Gloss="blind" LexDomain="024001" LN="24.38">τυφλούς,</Node>
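As a minimal sketch of how the 'ref' value might be read from such a node, the example below parses an abbreviated copy of the snippet with Python's standard xml.etree.ElementTree module. The shortened attribute list and the variable names are assumptions made for illustration; the real source files may call for a different loading strategy.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the node shown above (most attributes omitted for brevity).
snippet = (
    '<Node xml:id="n40015030011" ref="MAT 15:30!11" Cat="adj" '
    'Gloss="blind">τυφλούς,</Node>'
)

node = ET.fromstring(snippet)

# The 'ref' attribute is returned as a plain string.
ref = node.get("ref")
print(ref)
```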

This small Jupyter Notebook explains how this compound variable 'ref' can be split into its four constituent components:

In [5]:

import re
input = "MAT 15:30!11"
x = re.sub(r'[!: ]', " ", input).split()
print(x)

['MAT', '15', '30', '11']

Explanation of the code:

The code begins by importing re, Python's regular expression module.

The variable input is initialized with the string "MAT 15:30!11". In this string the book, chapter, verse and word are separated by a space, a colon and an exclamation mark, respectively. Note that the name input shadows Python's built-in input() function for the rest of the session.
The re.sub() function replaces certain characters in the input string with a space character (" "). The regular expression pattern [!: ] is a character class that matches a single exclamation mark (!), colon (:) or space. Each matched character is replaced with a space, which gives the intermediate string "MAT 15 30 11".
The split() method is then called directly on this intermediate string. Since no delimiter is passed to split(), it splits on runs of whitespace and returns a list of substrings.
The resulting list, not the intermediate string, is assigned to the variable x.
Finally, the list is printed using the print() function.
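The steps above can also be sketched in two alternative ways. The first uses re.split() to split on the separators in a single call. The second uses a pattern with named groups, which additionally checks the format and converts the numeric parts to integers. The names REF_PATTERN and parse_ref are hypothetical helpers introduced here, and the sketch assumes every 'ref' value follows the 'BOOK chapter:verse!word' layout shown in the snippet.

```python
import re

# Alternative 1: split directly on the separator characters.
parts = re.split(r"[!: ]", "MAT 15:30!11")
print(parts)  # ['MAT', '15', '30', '11']

# Alternative 2 (hypothetical helper): validate the layout with named groups.
# Assumes the form 'BOOK chapter:verse!word', e.g. "MAT 15:30!11".
REF_PATTERN = re.compile(
    r"(?P<book>\w+) (?P<chapter>\d+):(?P<verse>\d+)!(?P<word>\d+)"
)

def parse_ref(ref):
    """Return (book, chapter, verse, word) with the numbers as integers."""
    match = REF_PATTERN.fullmatch(ref)
    if match is None:
        raise ValueError(f"unexpected ref format: {ref!r}")
    return (
        match["book"],
        int(match["chapter"]),
        int(match["verse"]),
        int(match["word"]),
    )

print(parse_ref("MAT 15:30!11"))  # ('MAT', 15, 30, 11)
```

The named-group variant raises a ValueError for strings that do not match the assumed layout, whereas the split-based approaches return whatever pieces they find.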

The following site can be used to build and verify a regular expression: regex101.com (choose the 'Python' flavor)
