This notebook processes the example handbook (CAHRC_HR_Manual.txt) using a simple heuristic: if a line of text with fewer than 5 words is followed by paragraphs of text, then assume the short line is a topic (i.e. the topic of a question an employee might ask) and that the paragraphs are the answer to that question (called action_text for lack of a better term).
When a topic and its action_text are found, they are stored in Cloud Datastore as a key-value pair, with the topic as the key and the action_text as the value.
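The heuristic described above can be tried without Datastore. The following is a minimal sketch, not the notebook's own cell: the function name `extract_topics`, its parameters, and the sample handbook text are all made up for illustration. It also counts words with `split()` rather than `split(' ')`, so word counts may differ slightly on lines with repeated spaces.

```python
import io

def extract_topics(lines, max_topic_words=4, min_answer_words=6):
    """Yield (topic, action_text) pairs using the short-line heuristic.

    Hypothetical helper: a non-blank line with fewer than 5 words is a
    candidate topic, and the lines with more than 5 words that follow it
    form the action_text.
    """
    lines = iter(lines)
    for line in lines:
        words = line.split()
        # Skip blank lines and lines too long to be a topic.
        if not words or len(words) > max_topic_words:
            continue
        action_text = ''
        for following in lines:
            # A blank or short line ends the answer. That line is consumed
            # here and not re-examined as a possible topic.
            if len(following.split()) < min_answer_words:
                break
            action_text += following
        # Only keep topics that were actually followed by answer text.
        if action_text:
            yield line.strip().lower(), action_text

# Made-up sample text standing in for the handbook file.
sample = io.StringIO(
    "Vacation Policy\n"
    "Employees earn paid vacation time for every month they work.\n"
    "Requests must be sent to a manager two weeks ahead.\n"
    "\n"
    "Dress Code\n"
    "\n"
    "Sick Leave\n"
    "Call your manager before your shift starts when you are ill.\n"
)
pairs = dict(extract_topics(sample))
for topic, action_text in pairs.items():
    print('{}: {}'.format(topic, action_text))
```

In this sample, "Dress Code" is dropped because no paragraph follows it, which is the behaviour the heuristic implies for headings without body text.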
!pip uninstall -y google-cloud-datastore
!pip install google-cloud-datastore
Hit Reset Session > Restart, then resume with the following cells.
from google.cloud import datastore
datastore_client = datastore.Client()
with open('CAHRC_HR_Manual.txt', 'r') as employee_handbook:
    while True:
        topic = employee_handbook.readline()
        if not topic:
            break  # end of file
        # A non-blank line with fewer than 5 words is treated as a topic.
        # strip() is used for the blank-line test so it works whether the
        # line ends in '\r\n' or '\n'.
        if topic.strip() and len(topic.split(' ')) < 5:
            action_text = ''
            line = employee_handbook.readline()
            # Collect the following lines until a blank or short line is reached.
            while line.strip() and len(line.split(' ')) > 5:
                action_text += line
                line = employee_handbook.readline()
            if action_text != '':
                kind = 'Topic'
                # The lowercased topic is used as the entity's key name.
                topic_key = datastore_client.key(kind, topic.strip().lower())
                entity = datastore.Entity(key=topic_key)
                entity['action_text'] = action_text
                datastore_client.put(entity)
                print('Saved {}: {}'.format(entity.key.name, entity['action_text']))
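Once the topics are saved, reading one back is a key lookup. The sketch below assumes the entities were written with kind 'Topic' and a lowercased, stripped topic as the key name, as in the code above. The helper name `lookup_action_text` is hypothetical. It takes the client as an argument so it can be used with the `datastore_client` created earlier.

```python
def lookup_action_text(client, topic, kind='Topic'):
    """Return the stored action_text for a topic, or None if it is not found.

    Hypothetical helper; `client` is expected to be a
    google.cloud.datastore.Client such as datastore_client.
    """
    # Normalize the topic the same way it was normalized when saved.
    key = client.key(kind, topic.strip().lower())
    # Client.get returns None when no entity exists for the key.
    entity = client.get(key)
    if entity is None:
        return None
    return entity['action_text']

# Usage in the notebook, with a topic name chosen only as an example:
# print(lookup_action_text(datastore_client, 'Vacation Policy'))
```

Because keys are lowercased on write, the lookup is case-insensitive for the caller, but a topic has to match the handbook heading exactly apart from case and surrounding whitespace.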