In this notebook, we'll deploy and use an entity recognition model from the spaCy library.
Note: When running this notebook on SageMaker Studio, you should make sure the 'SageMaker JumpStart PyTorch 1.0' image/kernel is used. When running this notebook on SageMaker Notebook Instance, you should make sure the 'sagemaker-soln' kernel is used.
This solution relies on a config file that records the provisioned AWS resources. Run the cell below to generate that file.
import boto3
import os
import json

client = boto3.client('servicecatalog')

# Infer the provisioned product name from the current working directory.
cwd = os.getcwd().split('/')
i = cwd.index('S3Downloads')
pp_name = cwd[i + 1]

# Look up the outputs of the last successful provisioning record.
pp = client.describe_provisioned_product(Name=pp_name)
record_id = pp['ProvisionedProductDetail']['LastSuccessfulProvisioningRecordId']
record = client.describe_record(Id=record_id)

keys = [x['OutputKey'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
values = [x['OutputValue'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
stack_output = dict(zip(keys, values))

# Write the stack outputs to the config file used by the rest of the solution.
with open(f'/root/S3Downloads/{pp_name}/stack_outputs.json', 'w') as f:
    json.dump(stack_output, f)
We start by importing a variety of packages that will be used throughout the notebook. One of the most important packages is the Amazon SageMaker Python SDK (i.e. `import sagemaker`). We also import modules from our own custom (and editable) package that can be found at `../package`.
import boto3
import sagemaker
from sagemaker.pytorch import PyTorchModel
import sys
sys.path.insert(0, '../package')
from package import config, utils
Up next, we define the current folder and create a SageMaker client (from `boto3`). We can use the SageMaker client to call SageMaker APIs directly, as an alternative to using the Amazon SageMaker Python SDK. We'll use it at the end of the notebook to delete certain resources that are created in this notebook.
current_folder = utils.get_current_folder(globals())
sagemaker_client = boto3.client('sagemaker')
We'll use the unique solution prefix to name the model and endpoint.
model_name = "{}-entity-recognition".format(config.SOLUTION_PREFIX)
Up next, we need to define the Amazon SageMaker Model, which references the source code and specifies which container to use. Our pre-trained model comes from the spaCy library, which doesn't rely on a specific deep learning framework, but for consistency with the other notebooks we'll continue to use the `PyTorchModel` from the Amazon SageMaker Python SDK. Using `PyTorchModel` and setting the `framework_version` argument means that our deployed model will run inside a container that has PyTorch pre-installed. Other requirements can be installed by defining a `requirements.txt` file at the specified `source_dir` location. We use the `entry_point` argument to reference the code (within `source_dir`) that should be run for model inference: functions called `model_fn`, `input_fn`, `predict_fn` and `output_fn` are expected to be defined. And lastly, you can pass `model_data` from a training job, but we are going to load the pre-trained model in the source code running on the endpoint. We still need to provide `model_data`, so we pass an empty archive.
model = PyTorchModel(
name=model_name,
model_data=f'{config.SOURCE_S3_PATH}/models/empty.tar.gz',
entry_point='entry_point.py',
source_dir='../containers/entity_recognition',
role=config.IAM_ROLE,
framework_version='1.5.0',
py_version='py3',
code_location='s3://' + config.S3_BUCKET + '/code'
)
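The actual inference code lives at `../containers/entity_recognition/entry_point.py`. As a rough sketch of the interface SageMaker expects (the four function names are the SageMaker convention; the bodies below, including the `en_core_web_sm` model name and the response keys, are illustrative assumptions rather than the solution's actual code), an entry point might look like this:

# Hypothetical sketch of an inference entry point; the real script is in
# ../containers/entity_recognition/entry_point.py and may differ.
import json
import spacy


def model_fn(model_dir):
    # Load a pre-trained spaCy pipeline ('en_core_web_sm' is an assumption here).
    return spacy.load('en_core_web_sm')


def input_fn(request_body, request_content_type):
    # The endpoint receives JSON objects such as {"text": "..."}.
    assert request_content_type == 'application/json'
    return json.loads(request_body)


def predict_fn(input_data, model):
    # Run the spaCy pipeline and collect named entities and noun chunks.
    doc = model(input_data['text'])
    return {
        'entities': [
            {'text': ent.text, 'start_char': ent.start_char,
             'end_char': ent.end_char, 'label': ent.label_}
            for ent in doc.ents
        ],
        'noun_chunks': [
            {'text': chunk.text, 'start_char': chunk.start_char,
             'end_char': chunk.end_char}
            for chunk in doc.noun_chunks
        ],
    }


def output_fn(prediction, response_content_type):
    # Return the prediction as a JSON string.
    return json.dumps(prediction)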
Using this Amazon SageMaker Model, we can deploy an HTTPS endpoint on a dedicated instance. We choose to deploy the endpoint on a single ml.p3.2xlarge instance (or ml.g4dn.2xlarge if that instance type is unavailable in this region). You can expect this deployment step to take around 5 minutes. After approximately 15 dashes, you should see an exclamation mark, which indicates a successful deployment.
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
predictor = model.deploy(
endpoint_name=model_name,
instance_type=config.HOSTING_INSTANCE_TYPE,
initial_instance_count=1,
serializer=JSONSerializer(),
deserializer=JSONDeserializer()
)
If you are updating the model for development purposes and run into issues because the model, endpoint config, or endpoint already exists, you can delete the existing resources by uncommenting and running the following commands:
# sagemaker_client.delete_endpoint(EndpointName=model_name)
# sagemaker_client.delete_endpoint_config(EndpointConfigName=model_name)
# sagemaker_client.delete_model(ModelName=model_name)
When calling our new endpoint from the notebook, we use an Amazon SageMaker SDK `Predictor`. A `Predictor` is used to send data to an endpoint (as part of a request) and to interpret the response. Our `model.deploy` command returned a `Predictor` but, by default, it would send and receive numpy arrays. Our endpoint expects to receive (and also sends) JSON formatted objects, so we configure the `Predictor` to use JSON instead of the PyTorch endpoint default of numpy arrays. JSON is used here because it is a standard endpoint format and the endpoint response can contain nested data structures.
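If the endpoint is already running (for example, after restarting the notebook kernel), you don't need to call `model.deploy` again. As a minimal sketch, assuming the endpoint created above is still in service, you could attach a JSON-speaking `Predictor` to it directly:

# Sketch: attach a Predictor to an existing endpoint instead of redeploying.
# Assumes the endpoint named `model_name` is already in service.
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = Predictor(
    endpoint_name=model_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)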
With our model successfully deployed and our predictor configured, we can try the entity recognizer out on example inputs. All we need to do is construct a dictionary object with a single key called `text` and provide the input string. We call `predict` on our predictor, and we should get a response from the endpoint that contains our entities.
data = {'text': 'Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly.'}
response = predictor.predict(data=data)
We have the response, and we can print out the named entities and noun chunks that have been extracted from the text above. You will see the verbatim text of each alongside its location in the original text (given by start and end character indexes). Usually a document will contain many more noun chunks than named entities, but named entities have an additional field called `label` that indicates the class of the named entity. Since the spaCy model was trained on the OntoNotes 5 corpus, it uses the following classes:
TYPE | DESCRIPTION |
---|---|
PERSON | People, including fictional. |
NORP | Nationalities or religious or political groups. |
FAC | Buildings, airports, highways, bridges, etc. |
ORG | Companies, agencies, institutions, etc. |
GPE | Countries, cities, states. |
LOC | Non-GPE locations, mountain ranges, bodies of water. |
PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
WORK_OF_ART | Titles of books, songs, etc. |
LAW | Named documents made into laws. |
LANGUAGE | Any named language. |
DATE | Absolute or relative dates or periods. |
TIME | Times smaller than a day. |
PERCENT | Percentage, including “%”. |
MONEY | Monetary values, including unit. |
QUANTITY | Measurements, as of weight or distance. |
ORDINAL | “first”, “second”, etc. |
CARDINAL | Numerals that do not fall under another type. |
print(response['entities'])
print(response['noun_chunks'])
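As a quick illustration of how the character offsets can be used (a minimal sketch that assumes each entity dictionary exposes 'start_char', 'end_char' and 'label' keys; check the printed response above for the exact key names returned by the endpoint), we can slice the original input to verify each entity span:

# Sketch: verify each returned entity span against the original input string.
# Assumes 'start_char', 'end_char' and 'label' keys; adjust to match the
# actual keys shown in the printed response.
text = data['text']
for entity in response['entities']:
    span = text[entity['start_char']:entity['end_char']]
    print(f"{entity['label']:<12} {span}")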
You can try more examples above, but note that this model has been pre-trained on the OntoNotes 5 dataset. You may need to fine-tune this model with your own entity recognition data to obtain better results.
When you've finished with the entity recognition endpoint (and associated endpoint-config), make sure that you delete it to avoid accidental charges.
sagemaker_client.delete_endpoint(EndpointName=model_name)
sagemaker_client.delete_endpoint_config(EndpointConfigName=model_name)
We've just looked at how you can extract named entities and noun chunks from a document. Up next we'll look at a technique that can be used to classify relationships between entities.