Version: 0.2 (26 September 2024); added comparison of description and datatype.
Version: 0.1 (25 September 2024); implementation of enhancement feature 17.
The main steps in producing the comparison are:
Your environment should (for obvious reasons) include the Python package Text-Fabric. If not installed yet, it can be installed using pip.
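Whether the package is importable can be checked from within the notebook before running the remaining cells. The sketch below uses only the standard library. The helper name `is_installed` is made up for illustration, and it assumes the package is imported under the module name `tf` (as the import statements in this notebook do) and is published on PyPI as `text-fabric`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# 'tf' is the module name used by the imports in this notebook (assumption
# stated above); the pip command is printed as a hint, not executed.
if is_installed("tf"):
    print("Text-Fabric is available.")
else:
    print("Text-Fabric not found; install it with: pip install text-fabric")
```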
Furthermore, the Text-Fabric datasets must be accessible, either from an online resource or from a locally stored copy. There are no further requirements, as the script basically operates stand-alone.
Set the version number and creation date of this script.
Run the following cell to store details on the script version into memory.
scriptVersion="0.2"
scriptDate="26 September 2024"
Set some parameters used by the script.
Review the options in the following cell and execute the cell.
# This switch can be set to 'True' if you want additional information, such as dictionary entries to be printed. For basic output, set this switch to 'False'.
verbose=False
# Limit the number of entries in the frequency tables per node type (set to 0 for 'no limit')
tableLimit=10
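To illustrate the effect of tableLimit: later in the script each frequency list is sliced to at most that many entries, and a value of 0 keeps the full list. The following is a minimal sketch with made-up data; the helper name `limitEntries` and the sample values are hypothetical and not part of the script.

```python
def limitEntries(frequencyList, tableLimit):
    """Keep the first tableLimit entries; a limit of 0 means 'no limit'."""
    return frequencyList[:tableLimit] if tableLimit > 0 else frequencyList


# Made-up (value, count) pairs, ordered by descending count
sample = [("noun", 50), ("verb", 30), ("adj", 12), ("adv", 8)]

print(limitEntries(sample, 2))  # only the two most frequent entries
print(limitEntries(sample, 0))  # the complete list
```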
Load the Text-Fabric code in this notebook by running the following two cells.
%load_ext autoreload
%autoreload 2
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use
In this phase, the two Text-Fabric datasets are loaded. Which datasets are loaded is specified in the parameters, following this pattern:
Ax = use ("{GitHub user name}/{repository name}", version="{version}")
In this notebook, we load the two datasets into two objects, named A1 and A2. A consequence of working with two Text-Fabric datasets in the same Python environment is that each must be addressed individually when using advanced API functions. The invocation must therefore omit the hoist=globals() option, which would otherwise place the API names of one dataset in the global namespace, where the second load would overwrite them.
For other possible storage locations and other load options, see the documentation for the function use.
# Load the app and data from the first dataset
A1 = use ("tonyjurg/Nestle1904LFT", version="0.7")
Locating corpus resources ...
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 27 | 5102.93 | 100 |
chapter | 260 | 529.92 | 100 |
verse | 7943 | 17.35 | 100 |
sentence | 8011 | 17.20 | 100 |
wg | 105430 | 6.85 | 524 |
word | 137779 | 1.00 | 100 |
# Load the app and data from the second version in the set for comparison
A2 = use ("saulocantanhede/tfgreek2", version="0.5.7")
Locating corpus resources ...
Name | # of nodes | # slots / node | % coverage |
---|---|---|---|
book | 27 | 5102.93 | 100 |
chapter | 260 | 529.92 | 100 |
verse | 7944 | 17.34 | 100 |
sentence | 19703 | 13.82 | 198 |
group | 8945 | 7.01 | 46 |
clause | 30814 | 7.17 | 160 |
wg | 106868 | 6.88 | 533 |
phrase | 69007 | 1.90 | 95 |
subphrase | 116178 | 1.60 | 135 |
word | 137779 | 1.00 | 100 |
Display is setup for viewtype syntax-view
See here for more information on viewtypes
Execute the following cell to create dictionaries containing all relevant information for the loaded node and edge features of the two datasets.
import time
# Initialize both APIs
api1 = A1.api
api2 = A2.api
# Initialize empty dictionaries to store feature data for both APIs
featureDict1 = {}
featureDict2 = {}
# Define some critical variables if not already defined by execution of step 2.2
if 'tableLimit' not in globals(): tableLimit = 10
if 'verbose' not in globals(): verbose = False
if 'scriptVersion' not in globals(): scriptVersion="not set"
if 'scriptDate' not in globals(): scriptDate="not set"
overallTime = time.time()
def getFeatureDescription(metaData):
    """
    Retrieves the description of a feature from its metadata.
    """
    return metaData.get('description', "No feature description")

def setDataType(metaData):
    """
    Determines the data type of a feature based on its metadata.
    """
    if 'valueType' in metaData:
        return "String" if metaData["valueType"] == 'str' else "Integer"
    return "Unknown"

def processFeature(feature, featureType, featureMethod, api, featureDict):
    """
    Processes a single feature and updates the feature dictionary.

    Parameters:
        feature (str): The name of the feature to process.
        featureType (str): Type of the feature ('Node' or 'Edge').
        featureMethod (function): Method to retrieve feature data.
        api: The API instance being processed.
        featureDict (dict): The dictionary to store feature data.
    """
    # Obtain the meta data
    featureMetaData = featureMethod(feature).meta
    featureDescription = getFeatureDescription(featureMetaData)
    dataType = setDataType(featureMetaData)
    # Initialize dictionary to store feature frequency data
    featureFrequencyDict = {}
    # Skip specific features based on type
    if not (featureType == 'Node' and feature == 'otype') and not (featureType == 'Edge' and feature == 'oslots'):
        for nodeType in api.F.otype.all:
            frequencyLists = featureMethod(feature).freqList(nodeType)
            if not isinstance(frequencyLists, int):
                if len(frequencyLists) != 0:
                    featureFrequencyDict[nodeType] = {
                        'nodetype': nodeType,
                        'freq': frequencyLists[:tableLimit] if tableLimit > 0 else frequencyLists
                    }
            elif isinstance(frequencyLists, int):
                if frequencyLists != 0:
                    featureFrequencyDict[nodeType] = {
                        'nodetype': nodeType,
                        'freq': [("Link", frequencyLists)]
                    }
    # Add processed feature data to the main dictionary
    featureDict[feature] = {
        'name': feature,
        'descr': featureDescription,
        'type': featureType,
        'datatype': dataType,
        'freqlist': featureFrequencyDict
    }

def process_api(api, featureDict, api_label):
    """
    Processes all node and edge features for a given API and populates the feature dictionary.

    Parameters:
        api: The API instance to process.
        featureDict (dict): The dictionary to store feature data.
        api_label (str): Label for the API (used in print statements).
    """
    print(f'Analyzing Node Features for {api_label}: ', end='')
    for nodeFeature in api.Fall():
        if not verbose:
            print('.', end='')  # Progress indicator
        processFeature(nodeFeature, 'Node', api.Fs, api, featureDict)
        if verbose:
            print(f'\nFeature {nodeFeature} = {featureDict[nodeFeature]}\n')
    print('\n')  # Newline after node features
    print(f'Analyzing Edge Features for {api_label}: ', end='')
    for edgeFeature in api.Eall():
        if not verbose:
            print('.', end='')  # Progress indicator
        processFeature(edgeFeature, 'Edge', api.Es, api, featureDict)
        if verbose:
            print(f'\nFeature {edgeFeature} = {featureDict[edgeFeature]}\n')
    print('\n')  # Newline after edge features

########################################################
# MAIN FUNCTION #
########################################################
# Gather generic information for first dataset (stored in API1)
print('Gathering generic details for first dataset')
# Initialize default values
corpusName1 = A1.appName
liveName1 = 'not set'
versionName1 = A1.version
# Locate corpus information for first dataset (stored in API1)
if A1.provenance:
    for parts in A1.provenance[0]:
        if isinstance(parts, tuple):
            key, value = parts[0], parts[1]
            if verbose: print(f'API1 General info: {key}={value}')
            if key == 'corpus': corpusName1 = value
            if key == 'version': versionName1 = value
            # Value for live is a tuple
            if key == 'live': liveName1 = value[1]
# Repeat the generic information gathering for API2 if needed
print('Gathering generic details for second dataset')
# Initialize default values for API2
corpusName2 = A2.appName
liveName2 = 'not set'
versionName2 = A2.version
# Locate corpus information for API2
if A2.provenance:
    for parts in A2.provenance[0]:
        if isinstance(parts, tuple):
            key, value = parts[0], parts[1]
            if verbose: print(f'API2 General info: {key}={value}')
            if key == 'corpus': corpusName2 = value
            if key == 'version': versionName2 = value
            # Value for live is a tuple
            if key == 'live': liveName2 = value[1]
# Process both APIs
process_api(api1, featureDict1, api_label="first dataset (stored in API1)")
process_api(api2, featureDict2, api_label="second dataset (stored in API2)")
print(f'Finished in {time.time() - overallTime:.2f} seconds.')
Gathering generic details for first dataset
Gathering generic details for second dataset
Analyzing Node Features for first dataset (stored in API1): .......................................................
Analyzing Edge Features for first dataset (stored in API1): .
Analyzing Node Features for second dataset (stored in API2): .......................................................
Analyzing Edge Features for second dataset (stored in API2): ....
Finished in 21.06 seconds.
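For orientation, each entry that the cell above stores in featureDict1 or featureDict2 has the shape sketched below. The keys mirror those written by processFeature; the feature name, the values and the counts are invented for illustration, and the helper `topValue` is hypothetical. The sketch also assumes that each frequency list is ordered from most to least frequent.

```python
# Hypothetical entry: keys follow processFeature, values are invented
exampleEntry = {
    'name': 'lemma',
    'descr': 'No feature description',
    'type': 'Node',
    'datatype': 'String',
    'freqlist': {
        'word': {
            'nodetype': 'word',
            'freq': [('valueA', 120), ('valueB', 45)],
        },
    },
}


def topValue(entry, nodeType):
    """Return the first (value, count) pair stored for a node type, or None."""
    perType = entry['freqlist'].get(nodeType)
    return perType['freq'][0] if perType else None


print(topValue(exampleEntry, 'word'))   # first pair stored for 'word'
print(topValue(exampleEntry, 'verse'))  # no data stored for 'verse'
```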
Execute the following cell to create a detailed report indicating the delta between the two datasets.
from IPython.display import display, HTML
from datetime import datetime
# Get current date and time
current_time = datetime.now()
formatted_time = current_time.strftime("%Y-%m-%d %H:%M:%S")
# Function to compare two feature dictionaries and report datatype, freqlist, descr, and type differences
def compare_feature_dicts(dict1, dict2):
    """
    Compares two feature dictionaries and returns a report of differences,
    filtering out identical entries in both datasets, and comparing 'datatype', 'descr', 'type', and 'freqlist'.
    """
    report = {
        'only_in_dict1': [],
        'only_in_dict2': [],
        'differences_in_common': {}
    }
    keys1 = set(dict1.keys())
    keys2 = set(dict2.keys())
    report['only_in_dict1'] = sorted(keys1 - keys2)
    report['only_in_dict2'] = sorted(keys2 - keys1)
    common_features = keys1 & keys2
    for feature in common_features:
        differences = {}
        feature1 = dict1[feature]
        feature2 = dict2[feature]
        for key in ['descr', 'type', 'datatype']:
            value1 = feature1.get(key, None)
            value2 = feature2.get(key, None)
            if value1 != value2:
                differences[key] = {'Dataset1': value1, 'Dataset2': value2}
        freqlist1 = feature1.get('freqlist', {})
        freqlist2 = feature2.get('freqlist', {})
        freqlist_diff = {}
        for nodetype in freqlist1.keys() | freqlist2.keys():
            freq1 = dict(freqlist1.get(nodetype, {}).get('freq', []))
            freq2 = dict(freqlist2.get(nodetype, {}).get('freq', []))
            diff1 = [t for t in freq1.items() if t not in freq2.items()]
            diff2 = [t for t in freq2.items() if t not in freq1.items()]
            if diff1 or diff2:
                freqlist_diff[nodetype] = {'Dataset1': diff1, 'Dataset2': diff2}
        if freqlist_diff:
            differences['freqlist'] = freqlist_diff
        if differences:
            report['differences_in_common'][feature] = differences
    return report

# Function to generate HTML delta report using <details>, <summary>, and nested <ul><li> for structure with collapsible nodetypes
def generate_html_delta_report(report):
    """
    Generates an HTML delta report from the comparison with collapsible sections using <details>, <summary>, and nested <ul><li> elements,
    making both features and nodetypes collapsible.
    """
    html = []
    html.append("""
    <!DOCTYPE html>
    <html lang='en'>
    <head>
        <meta charset='UTF-8'>
        <meta name='viewport' content='width=device-width, initial-scale=1.0'>
        <title>Delta Report</title>
        <style>
            body { font-family: Arial, sans-serif; margin: 20px; }
            .only-in-1 { color: #E74C3C; }
            .only-in-2 { color: #E67E22; }
            .diff-key { color: #8E44AD; }
            .freq-type { color: #16A085; }
            .freq-value { color: #D35400; }
            details { margin-bottom: 15px; }
            summary { cursor: pointer; font-weight: bold; color: #2980B9; }
            ul { list-style: none; padding-left: 0; } /* Enforce bullet removal */
            li { margin-left: 20px; } /* Add margin for list indentation */
            .api1 { color: #3498DB; }
            .api2 { color: #1ABC9C; }
            button { margin-bottom: 15px; padding: 10px 15px; cursor: pointer; }
        </style>
        <script>
            function toggleAllDetails(expand) {
                var detailsElements = document.querySelectorAll("details");
                detailsElements.forEach(details => {
                    details.open = expand;
                });
            }
            function toggleSpecificLevel(levelClass, expand) {
                var detailsElements = document.querySelectorAll(levelClass);
                detailsElements.forEach(details => {
                    details.open = expand;
                });
            }
        </script>
    </head>
    <body>
    """)
    # get details on the two Text-Fabric dataset
    liveName1=f'{A1.appName} - {A1.version}'
    if A1.provenance:
        for parts in A1.provenance[0]:
            if isinstance(parts, tuple):
                key, value = parts[0], parts[1]
                if key == 'live':
                    liveName1=value[0]
                    break
    liveName2=f'{A2.appName} - {A2.version}'
    if A2.provenance:
        for parts in A2.provenance[0]:
            if isinstance(parts, tuple):
                key, value = parts[0], parts[1]
                if key == 'live':
                    liveName2=value[0]
                    break
    html.append("<h1>Delta Report</h1>")
    html.append(f"<p>Dataset 1: <span class='only-in-1'>{liveName1}</span></p>")
    html.append(f"<p>Dataset 2: <span class='only-in-2'>{liveName2}</span></p>")
    # Add buttons to expand or collapse all details
    html.append("<button onclick='toggleAllDetails(false)'>Collapse All</button>")
    html.append("<button onclick='toggleSpecificLevel(\".level2\", true);toggleSpecificLevel(\".level3\", false);toggleSpecificLevel(\".level4\", false)'>Expand up to second level</button>")
    html.append("<button onclick='toggleSpecificLevel(\".level2\", true);toggleSpecificLevel(\".level3\", true);toggleSpecificLevel(\".level4\", false)'>Expand up to third level</button>")
    html.append("<button onclick='toggleAllDetails(true)'>Expand All</button>")
    # check for node name and number-range differences
    # Initialize empty dictionaries
    nodeIntervals1 = {}
    nodeIntervals2 = {}
    # Fill the dictionaries
    for nodeType in api1.F.otype.all:
        nodeIntervals1[nodeType] = api1.F.otype.sInterval(nodeType)
    for nodeType in api2.F.otype.all:
        nodeIntervals2[nodeType] = api2.F.otype.sInterval(nodeType)
    # Calculate key (node name) differences
    nodes_in_1_not_in_2 = set(nodeIntervals1.keys()) - set(nodeIntervals2.keys())
    nodes_in_2_not_in_1 = set(nodeIntervals2.keys()) - set(nodeIntervals1.keys())
    # Check if either set is not empty and print if true
    if nodes_in_1_not_in_2 or nodes_in_2_not_in_1:
        if nodes_in_1_not_in_2:
            html.append("<details class='level2' open><summary> Nodenames only in Dataset 1</summary><ul>")
            for node in nodes_in_1_not_in_2:
                html.append(f"<li class='only-in-1'>{node}</li>")
            html.append("</ul></details>")
        if nodes_in_2_not_in_1:
            html.append("<details class='level2' open><summary> Nodenames only in Dataset 2</summary><ul>")
            for node in nodes_in_2_not_in_1:
                html.append(f"<li class='only-in-2'>{node}</li>")
            html.append("</ul></details>")
    # Compare tuple content for node number differences
    common_keys = set(nodeIntervals1.keys()) & set(nodeIntervals2.keys())
    different_values = {key: {'nodeIntervals1': nodeIntervals1[key], 'nodeIntervals2': nodeIntervals2[key]}
                        for key in common_keys if nodeIntervals1[key] != nodeIntervals2[key]}
    if different_values:
        html.append("<details class='level2' open><summary>Differences in nodenumber range for common nodenames</summary><ul>")
        for key, diff in different_values.items():
            html.append(f"<li><details class='level3'><summary>Nodename {key}</summary><ul>")
            html.append(f"<li>Dataset 1: <span class='only-in-1'>{diff['nodeIntervals1']}</span></li>")
            html.append(f"<li>Dataset 2: <span class='only-in-2'>{diff['nodeIntervals2']}</span></li>")
            html.append(f"</ul></details></li>")
        html.append("</ul></details>")
    # check for feature differences
    # Features only in dict1
    if report['only_in_dict1']:
        html.append("<details class='level2' open><summary>Features only in Dataset 1</summary><ul>")
        for feature in report['only_in_dict1']: html.append(f"<li class='only-in-1'>{feature}</li>")
        html.append("</ul></details>")
    # Features only in dict2
    if report['only_in_dict2']:
        html.append("<details class='level2' open><summary>Features only in Dataset 2</summary><ul>")
        for feature in report['only_in_dict2']: html.append(f"<li class='only-in-2'>{feature}</li>")
        html.append("</ul></details>")
    # Differences in common features
    if report['differences_in_common']:
        html.append("<details class='level2' open>")
        html.append("<summary>Differences in Common Features</summary>")
        html.append("<ul>")
        for feature, diffs in report['differences_in_common'].items():
            html.append(f"<li><details class='level3'><summary>Feature: {feature}</summary>")
            html.append("<ul>")
            for key, change in diffs.items():
                if key in ['descr', 'type', 'datatype']:
                    html.append(f"<li><strong style='color: #2980B9; '>{key.capitalize()} Difference:</strong>")
                    html.append("<ul>")
                    html.append(f"<li>Dataset 1: <span class='only-in-1'>{change['Dataset1']}</span></li>")
                    html.append(f"<li>Dataset 2: <span class='only-in-2'>{change['Dataset2']}</span></li>")
                    html.append("</ul></li>")
                elif key == 'freqlist':
                    freqlist = change
                    html.append("<li><details class='level4'><summary>Frequency List Differences</summary><ul>")
                    for nodetype, freq_diff in freqlist.items():
                        html.append(f"<li><details class='level4'><summary>Nodetype: {nodetype}</summary>")
                        html.append("<ul>")
                        dataset1_val = ', '.join([f"{t[0]}: {t[1]}" for t in freq_diff['Dataset1']]) if freq_diff['Dataset1'] else 'None'
                        dataset2_val = ', '.join([f"{t[0]}: {t[1]}" for t in freq_diff['Dataset2']]) if freq_diff['Dataset2'] else 'None'
                        html.append(f"<li>Dataset 1: <span class='only-in-1'>{dataset1_val}</span></li>")
                        html.append(f"<li>Dataset 2: <span class='only-in-2'>{dataset2_val}</span></li>")
                        html.append("</ul></details></li>")
                    html.append("</ul></details></li>")
            html.append("</ul></details></li>")
        html.append("</ul></details>")
    html.append(f"<p><small>Created on {formatted_time} with <a href='https://github.com/tonyjurg/Doc4TF/blob/main/tools/determineDeltaBetweenVersions.ipynb'>Doc4TF tool displayDeltaBetweenVersions</a> version {scriptVersion}.</small></p>")
    html.append("</body></html>")
    return "\n".join(html)

# Function to display HTML report in Jupyter Notebook
def display_html_report(report_html):
    display(HTML(report_html))

# Compare the dictionaries
delta_report = compare_feature_dicts(featureDict1, featureDict2)
# Generate the HTML delta report
report_html = generate_html_delta_report(delta_report)
# Display the report in the Jupyter Notebook
display_html_report(report_html)
Dataset 1: tonyjurg/Nestle1904LFT/tf v:0.7(rv0.8=#g95357e8bf298b090341cf277596be01f7f1f5ce9 offline under C:/Users/tonyj/text-fabric-data/github)
Dataset 2: saulocantanhede/tfgreek2/tf v:0.5.7(r0.5.8=#1a251a4a8daacae4cd5e02294a95d806b3964000 offline under C:/Users/tonyj/text-fabric-data/github)
Created on 2024-09-26 21:07:30 with Doc4TF tool displayDeltaBetweenVersions version 0.2.
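The comparison performed in the cell above rests on two simple operations: set differences on the feature names, and a pairwise check of the (value, count) entries in the frequency lists. The stand-alone sketch below mirrors that logic on toy data. The helper name `diffFrequencies`, the assignment of feature names to datasets, and all counts are invented for illustration and do not reflect the real datasets.

```python
def diffFrequencies(freq1, freq2):
    """Return the (value, count) pairs that occur in only one of the two lists."""
    d1, d2 = dict(freq1), dict(freq2)
    only1 = [item for item in d1.items() if item not in d2.items()]
    only2 = [item for item in d2.items() if item not in d1.items()]
    return only1, only2


# Invented feature-name sets for two datasets
features1 = {'lemma', 'gloss', 'wgtype'}
features2 = {'lemma', 'gloss', 'sp'}
print(sorted(features1 - features2))  # names only in the first set
print(sorted(features2 - features1))  # names only in the second set

# Invented frequency lists for one shared feature and node type
freqA = [('noun', 50), ('verb', 30)]
freqB = [('noun', 50), ('verb', 31), ('adj', 4)]
print(diffFrequencies(freqA, freqB))
```

Note that a value whose count changed shows up on both sides of the result, once with each count, which is how the report distinguishes changed entries from identical ones.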
from IPython.display import HTML
import base64
def create_download_link(html_content, file_name):
    # Encode the HTML content to base64
    b64_html = base64.b64encode(html_content.encode()).decode()
    # Create the HTML download link
    download_link = f'''
    <a download="{file_name}" href="data:text/html;base64,{b64_html}">
        <button>Download HTML File</button>
    </a>
    '''
    return HTML(download_link)

# Display the download link in the notebook
create_download_link(report_html, 'report.html')
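The download link works by embedding the complete report in the href attribute as a base64-encoded data URI. The round trip can be checked without a notebook, as in this minimal sketch; the helper name `toDataUri` and the sample HTML are illustrative only.

```python
import base64


def toDataUri(htmlContent):
    """Encode HTML text as a base64 data URI, the form used in the link's href."""
    b64 = base64.b64encode(htmlContent.encode()).decode()
    return f"data:text/html;base64,{b64}"


uri = toDataUri("<p>Delta Report</p>")
# Everything after the first comma is the base64 payload
payload = uri.split(",", 1)[1]
print(base64.b64decode(payload).decode())  # prints the original HTML text
```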
Version 0.2 (26 September 2024):
Added functionality:
- comparing description and datatype
- dynamically show/hide parts of the output
- create a download link
Version 0.1 (25 September 2024):
- initial implementation of enhancement feature 17.
Licensed under Creative Commons Attribution 4.0 International (CC BY 4.0)