In this notebook, we'll deploy and use an entity recognition model from the spaCy library.
Note: When running this notebook on SageMaker Studio, you should make sure the 'SageMaker JumpStart PyTorch 1.0' image/kernel is used. When running this notebook on SageMaker Notebook Instance, you should make sure the 'sagemaker-soln' kernel is used.
This solution relies on a config file that records the provisioned AWS resources. Run the cell below to generate that file.
import boto3
import os
import json

client = boto3.client('servicecatalog')

# Infer the provisioned product name from the current working directory.
cwd = os.getcwd().split('/')
i = cwd.index('S3Downloads')
pp_name = cwd[i + 1]

# Look up the outputs of the last successful provisioning record.
pp = client.describe_provisioned_product(Name=pp_name)
record_id = pp['ProvisionedProductDetail']['LastSuccessfulProvisioningRecordId']
record = client.describe_record(Id=record_id)

keys = [x['OutputKey'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
values = [x['OutputValue'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
stack_output = dict(zip(keys, values))

# Write the stack outputs to the config file used by the rest of the solution.
with open(f'/root/S3Downloads/{pp_name}/stack_outputs.json', 'w') as f:
    json.dump(stack_output, f)
We start by importing a variety of packages that will be used throughout the notebook. One of the most important packages is the Amazon SageMaker Python SDK (i.e. `import sagemaker`). We also import modules from our own custom (and editable) package that can be found at `../package`.
import boto3
import sagemaker
from sagemaker.pytorch import PyTorchModel
import sys
sys.path.insert(0, '../package')
from package import config, utils
Up next, we define the current folder and create a SageMaker client (from `boto3`). We can use the SageMaker client to call SageMaker APIs directly, as an alternative to using the Amazon SageMaker Python SDK. We'll use it at the end of the notebook to delete certain resources that are created in this notebook.
current_folder = utils.get_current_folder(globals())
sagemaker_client = boto3.client('sagemaker')
We'll use the unique solution prefix to name the model and endpoint.
model_name = "{}-entity-recognition".format(config.SOLUTION_PREFIX)
Up next, we need to define the Amazon SageMaker Model, which references the source code and specifies which container to use. Our pre-trained model comes from the spaCy library, which doesn't rely on a specific deep learning framework, but for consistency with the other notebooks we'll continue to use the `PyTorchModel` from the Amazon SageMaker Python SDK. Using `PyTorchModel` and setting the `framework_version` argument means that our deployed model will run inside a container that has PyTorch pre-installed. Other requirements can be installed by defining a `requirements.txt` file at the specified `source_dir` location. We use the `entry_point` argument to reference the code (within `source_dir`) that should be run for model inference: functions called `model_fn`, `input_fn`, `predict_fn` and `output_fn` are expected to be defined. And lastly, you can pass `model_data` from a training job, but we are going to load the pre-trained model in the source code running on the endpoint. We still need to provide `model_data`, so we pass an empty archive.
model = PyTorchModel(
name=model_name,
model_data=f'{config.SOURCE_S3_PATH}/models/empty.tar.gz',
entry_point='entry_point.py',
source_dir='../containers/entity_recognition',
role=config.IAM_ROLE,
framework_version='1.5.0',
py_version='py3',
code_location='s3://' + config.S3_BUCKET + '/code'
)
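The actual inference code lives at `../containers/entity_recognition/entry_point.py`. As a rough sketch of the interface SageMaker expects (the four function names are the SageMaker convention; the bodies below, including the `en_core_web_sm` model name and the response keys, are illustrative assumptions rather than the solution's actual code), an entry point might look like this:

# Hypothetical sketch of an inference entry point; the real script is in
# ../containers/entity_recognition/entry_point.py and may differ.
import json
import spacy


def model_fn(model_dir):
    # Load a pre-trained spaCy pipeline ('en_core_web_sm' is an assumption here).
    return spacy.load('en_core_web_sm')


def input_fn(request_body, request_content_type):
    # The endpoint receives JSON objects such as {"text": "..."}.
    assert request_content_type == 'application/json'
    return json.loads(request_body)


def predict_fn(input_data, model):
    # Run the spaCy pipeline and collect named entities and noun chunks.
    doc = model(input_data['text'])
    return {
        'entities': [
            {'text': ent.text, 'start_char': ent.start_char,
             'end_char': ent.end_char, 'label': ent.label_}
            for ent in doc.ents
        ],
        'noun_chunks': [
            {'text': chunk.text, 'start_char': chunk.start_char,
             'end_char': chunk.end_char}
            for chunk in doc.noun_chunks
        ],
    }


def output_fn(prediction, response_content_type):
    # Return the prediction as a JSON string.
    return json.dumps(prediction)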
Using this Amazon SageMaker Model, we can deploy an HTTPS endpoint on a dedicated instance. We choose to deploy the endpoint on a single ml.p3.2xlarge instance (or ml.g4dn.2xlarge if that instance type is unavailable in this region). You can expect this deployment step to take around 5 minutes. After approximately 15 dashes, you should see an exclamation mark, which indicates a successful deployment.
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
predictor = model.deploy(
endpoint_name=model_name,
instance_type=config.HOSTING_INSTANCE_TYPE,
initial_instance_count=1,
serializer=JSONSerializer(),
deserializer=JSONDeserializer()
)
If you are updating the model for development purposes and run into issues because the model, endpoint config, or endpoint already exists, you can delete the existing resources by uncommenting and running the following commands:
# sagemaker_client.delete_endpoint(EndpointName=model_name)
# sagemaker_client.delete_endpoint_config(EndpointConfigName=model_name)
# sagemaker_client.delete_model(ModelName=model_name)
When calling our new endpoint from the notebook, we use an Amazon SageMaker SDK `Predictor`. A `Predictor` is used to send data to an endpoint (as part of a request) and to interpret the response. Our `model.deploy` command returned a `Predictor` but, by default, it would send and receive numpy arrays. Our endpoint expects to receive (and also sends) JSON formatted objects, so we configure the `Predictor` to use JSON instead of the PyTorch endpoint default of numpy arrays. JSON is used here because it is a standard endpoint format and the endpoint response can contain nested data structures.
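If the endpoint is already running (for example, after restarting the notebook kernel), you don't need to call `model.deploy` again. As a minimal sketch, assuming the endpoint created above is still in service, you could attach a JSON-speaking `Predictor` to it directly:

# Sketch: attach a Predictor to an existing endpoint instead of redeploying.
# Assumes the endpoint named `model_name` is already in service.
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = Predictor(
    endpoint_name=model_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)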
With our model successfully deployed and our predictor configured, we can try the entity recognizer out on example inputs. All we need to do is construct a dictionary object with a single key called `text` and provide the input string. We call `predict` on our predictor, and we should get a response from the endpoint that contains our entities.
data = {'text': 'Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly.'}
response = predictor.predict(data=data)
We have the response, and we can print out the named entities and noun chunks that have been extracted from the text above. You will see the verbatim text of each alongside its location in the original text (given by start and end character indexes). Usually a document will contain many more noun chunks than named entities, but named entities have an additional field called `label` that indicates the class of the named entity. Since the spaCy model was trained on the OntoNotes 5 corpus, it uses the following classes:
TYPE | DESCRIPTION |
---|---|
PERSON | People, including fictional. |
NORP | Nationalities or religious or political groups. |
FAC | Buildings, airports, highways, bridges, etc. |
ORG | Companies, agencies, institutions, etc. |
GPE | Countries, cities, states. |
LOC | Non-GPE locations, mountain ranges, bodies of water. |
PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
WORK_OF_ART | Titles of books, songs, etc. |
LAW | Named documents made into laws. |
LANGUAGE | Any named language. |
DATE | Absolute or relative dates or periods. |
TIME | Times smaller than a day. |
PERCENT | Percentage, including “%”. |
MONEY | Monetary values, including unit. |
QUANTITY | Measurements, as of weight or distance. |
ORDINAL | “first”, “second”, etc. |
CARDINAL | Numerals that do not fall under another type. |
print(response['entities'])
print(response['noun_chunks'])
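As a quick illustration of how the character offsets can be used (a minimal sketch that assumes each entity dictionary exposes 'start_char', 'end_char' and 'label' keys; check the printed response above for the exact key names returned by the endpoint), we can slice the original input to verify each entity span:

# Sketch: verify each returned entity span against the original input string.
# Assumes 'start_char', 'end_char' and 'label' keys; adjust to match the
# actual keys shown in the printed response.
text = data['text']
for entity in response['entities']:
    span = text[entity['start_char']:entity['end_char']]
    print(f"{entity['label']:<12} {span}")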
You can try more examples above, but note that this model has been pre-trained on the OntoNotes 5 dataset. You may need to fine-tune this model with your own entity recognition data to obtain better results.
When you've finished with the entity recognition endpoint (and associated endpoint-config), make sure that you delete it to avoid accidental charges.
sagemaker_client.delete_endpoint(EndpointName=model_name)
sagemaker_client.delete_endpoint_config(EndpointConfigName=model_name)
We've just looked at how you can extract named entities and noun chunks from a document. Up next we'll look at a technique that can be used to classify relationships between entities.