Welcome to our end-to-end binary text-classification example. In this demo, we use the Hugging Face transformers and datasets libraries together with a custom Amazon sagemaker-sdk extension to fine-tune a pre-trained transformer for binary text classification. In particular, the pre-trained model is fine-tuned on the imdb dataset. To get started, we need to set up the environment with a few prerequisite steps for permissions, configurations, and so on. This demo also shows how you can use Spot Instances and continue training from a checkpoint.
NOTE: You can run this demo in SageMaker Studio, on your local machine, or on a SageMaker Notebook Instance.
Note: We only install the required libraries from Hugging Face and AWS. You also need PyTorch or TensorFlow if you don't have it installed yet.
!pip install "sagemaker>=2.48.0" "transformers==4.6.1" "datasets[s3]==1.6.2" --upgrade
Upgrade ipywidgets for the datasets library and restart the kernel; this is only needed when the preprocessing is done in the notebook.
%%capture
import IPython
!conda install -c conda-forge ipywidgets -y
IPython.Application.instance().kernel.do_shutdown(True) # has to restart kernel so changes are used
import sagemaker.huggingface
If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find more about it here.
import sagemaker
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
# set to default bucket if a bucket name is not given
sagemaker_session_bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
We are using the datasets library to download and preprocess the imdb dataset. After preprocessing, the dataset will be uploaded to our sagemaker_session_bucket to be used within our training job. The imdb dataset consists of 25,000 highly polar movie reviews for training and 25,000 for testing.
from datasets import load_dataset
from transformers import AutoTokenizer
# tokenizer used in preprocessing
tokenizer_name = 'distilbert-base-uncased'
# dataset used
dataset_name = 'imdb'
# s3 key prefix for the data
s3_prefix = 'samples/datasets/imdb'
# load dataset
dataset = load_dataset(dataset_name)
# download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
# tokenizer helper function
def tokenize(batch):
return tokenizer(batch['text'], padding='max_length', truncation=True)
# load dataset
train_dataset, test_dataset = load_dataset('imdb', split=['train', 'test'])
test_dataset = test_dataset.shuffle().select(range(10000)) # shrink the test dataset to 10k samples to speed up evaluation
# tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)
# set format for pytorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
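Optionally, we can run a quick sanity check on the preprocessing. The expected shape assumes the distilbert-base-uncased tokenizer, where max_length padding produces sequences of 512 tokens.
# optional sanity check: inspect one tokenized training example
print(train_dataset[0]['input_ids'].shape) # should print torch.Size([512]) due to max_length padding
print(train_dataset[0]['labels'])          # 0 = negative, 1 = positive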
After we have processed the datasets, we use the new FileSystem integration to upload our dataset to the sagemaker_session_bucket on S3.
import botocore
from datasets.filesystems import S3FileSystem
s3 = S3FileSystem()
# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path,fs=s3)
# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path,fs=s3)
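To verify the upload, or to reload the data later outside the training job, the dataset can be read back from S3 with the same FileSystem object. A quick sketch, assuming the pinned datasets version (1.6.2) where load_from_disk accepts an fs argument:
# optional: reload the train dataset from S3 to verify the upload
from datasets import load_from_disk
train_dataset_s3 = load_from_disk(training_input_path, fs=s3)
print(len(train_dataset_s3)) # should print 25000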
In order to create a SageMaker training job we need a HuggingFace Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In the Estimator we define which fine-tuning script should be used as entry_point, which instance_type should be used, which hyperparameters are passed in, and so on.
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(entry_point='train.py',
source_dir='./scripts',
base_job_name='huggingface-sdk-extension',
instance_type='ml.p3.2xlarge',
instance_count=1,
transformers_version='4.4',
pytorch_version='1.6',
py_version='py36',
role=role,
hyperparameters = {'epochs': 1,
'train_batch_size': 32,
'model_name':'distilbert-base-uncased'
})
When we create a SageMaker training job, SageMaker takes care of starting and managing all the required EC2 instances for us with the huggingface container, uploads the provided fine-tuning script train.py, and downloads the data from our sagemaker_session_bucket into the container at /opt/ml/input/data. Then, it starts the training job by running:
/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32
The hyperparameters
you define in the HuggingFace
estimator are passed in as named arguments.
SageMaker provides useful properties about the training environment through various environment variables, including the following (a short sketch of reading them inside the training script follows after this list):
SM_MODEL_DIR
: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting.
SM_NUM_GPUS
: An integer representing the number of GPUs available to the host.
SM_CHANNEL_XXXX
: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the HuggingFace estimator's fit call, named train
and test
, the environment variables SM_CHANNEL_TRAIN
and SM_CHANNEL_TEST
are set.
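Inside the training script these variables can be read directly from os.environ. A minimal sketch; the paths in the comments are the SageMaker defaults:
import os
model_dir = os.environ['SM_MODEL_DIR']        # e.g. /opt/ml/model
n_gpus = int(os.environ['SM_NUM_GPUS'])       # e.g. 1 on an ml.p3.2xlarge
training_dir = os.environ['SM_CHANNEL_TRAIN'] # e.g. /opt/ml/input/data/train
test_dir = os.environ['SM_CHANNEL_TEST']      # e.g. /opt/ml/input/data/test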
To run your training job locally you can define instance_type='local' or instance_type='local_gpu' for GPU usage. Note: this does not work within SageMaker Studio.
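A sketch of what such a local-mode estimator could look like; it reuses the same arguments as above, only the instance_type changes, and it assumes Docker is installed on the machine.
# sketch: run the same training job in local mode (requires docker, not supported in SageMaker Studio)
from sagemaker.huggingface import HuggingFace

huggingface_estimator_local = HuggingFace(entry_point='train.py',
                                          source_dir='./scripts',
                                          instance_type='local_gpu', # or 'local' for CPU-only machines
                                          instance_count=1,
                                          role=role,
                                          transformers_version='4.6',
                                          pytorch_version='1.7',
                                          py_version='py36',
                                          hyperparameters={'epochs': 1,
                                                           'train_batch_size': 32,
                                                           'model_name': 'distilbert-base-uncased',
                                                           'output_dir': '/opt/ml/model'}) # output_dir is required by train.py below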
!pygmentize ./scripts/train.py
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
from transformers.trainer_utils import get_last_checkpoint
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from datasets import load_from_disk
import logging
import sys
import argparse
import os

# Set up logging
logger = logging.getLogger(__name__)

logging.basicConfig(
    level=logging.getLevelName("INFO"),
    handlers=[logging.StreamHandler(sys.stdout)],
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)

if __name__ == "__main__":

    logger.info(sys.argv)

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=64)
    parser.add_argument("--warmup_steps", type=int, default=500)
    parser.add_argument("--model_name", type=str)
    parser.add_argument("--learning_rate", type=str, default=5e-5)
    parser.add_argument("--output_dir", type=str)

    # Data, model, and output directories
    parser.add_argument("--output-data-dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])
    parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])

    args, _ = parser.parse_known_args()

    # load datasets
    train_dataset = load_from_disk(args.training_dir)
    test_dataset = load_from_disk(args.test_dir)

    logger.info(f" loaded train_dataset length is: {len(train_dataset)}")
    logger.info(f" loaded test_dataset length is: {len(test_dataset)}")

    # compute metrics function for binary classification
    def compute_metrics(pred):
        labels = pred.label_ids
        preds = pred.predictions.argmax(-1)
        precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
        acc = accuracy_score(labels, preds)
        return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

    # download model from model hub
    model = AutoModelForSequenceClassification.from_pretrained(args.model_name)

    # define training args
    training_args = TrainingArguments(
        output_dir=args.output_dir,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        per_device_eval_batch_size=args.eval_batch_size,
        warmup_steps=args.warmup_steps,
        evaluation_strategy="epoch",
        logging_dir=f"{args.output_data_dir}/logs",
        learning_rate=float(args.learning_rate),
    )

    # create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
    )

    # train model, resuming from the last checkpoint in output_dir if one exists
    if get_last_checkpoint(args.output_dir) is not None:
        logger.info("***** continue training *****")
        last_checkpoint = get_last_checkpoint(args.output_dir)
        trainer.train(resume_from_checkpoint=last_checkpoint)
    else:
        trainer.train()

    # evaluate model
    eval_result = trainer.evaluate(eval_dataset=test_dataset)

    # write eval results to a file which can be accessed later in the s3 output
    with open(os.path.join(args.output_data_dir, "eval_results.txt"), "w") as writer:
        print("***** Eval results *****")
        for key, value in sorted(eval_result.items()):
            writer.write(f"{key} = {value}\n")

    # save the model to model_dir, which SageMaker uploads to s3
    trainer.save_model(args.model_dir)
from sagemaker.huggingface import HuggingFace
# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1,
'train_batch_size': 32,
'model_name':'distilbert-base-uncased',
'output_dir':'/opt/ml/checkpoints'
}
# s3 uri where our checkpoints will be uploaded during training
job_name = "using-spot"
checkpoint_s3_uri = f's3://{sess.default_bucket()}/{job_name}/checkpoints'
huggingface_estimator = HuggingFace(entry_point='train.py',
source_dir='./scripts',
instance_type='ml.p3.2xlarge',
instance_count=1,
base_job_name=job_name,
checkpoint_s3_uri=checkpoint_s3_uri,
use_spot_instances=True,
max_wait=3600, # This should be equal to or greater than max_run in seconds
max_run=1000, # expected max run in seconds
role=role,
transformers_version='4.6',
pytorch_version='1.7',
py_version='py36',
hyperparameters = hyperparameters)
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})
# Training seconds: 874
# Billable seconds: 262
# Managed Spot Training savings: 70.0%
To deploy our endpoint, we call deploy()
on our HuggingFace estimator object, passing in our desired number of instances and instance type.
predictor = huggingface_estimator.deploy(1,"ml.g4dn.xlarge")
Then, we use the returned predictor object to call the endpoint.
sentiment_input= {"inputs":"I love using the new Inference DLC."}
predictor.predict(sentiment_input)
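If you want to call the endpoint from outside the notebook, for example from another application, the same request can be sent through the low-level SageMaker runtime API. A minimal sketch using boto3:
# sketch: invoke the same endpoint via the SageMaker runtime API
import json
import boto3

runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(EndpointName=predictor.endpoint_name,
                                   ContentType='application/json',
                                   Body=json.dumps(sentiment_input))
print(json.loads(response['Body'].read().decode()))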
Finally, we delete the endpoint again.
predictor.delete_endpoint()
# container image used for training job
print(f"container image used for training job: \n{huggingface_estimator.image_uri}\n")
# s3 uri where the trained model is located
print(f"s3 uri where the trained model is located: \n{huggingface_estimator.model_data}\n")
# latest training job name for this estimator
print(f"latest training job name for this estimator: \n{huggingface_estimator.latest_training_job.name}\n")
container image used for training job:
558105141721.dkr.ecr.us-east-1.amazonaws.com/huggingface-training:pytorch1.6.0-transformers4.2.2-tokenizers0.9.4-datasets1.2.1-py36-gpu-cu110

s3 uri where the trained model is located:
s3://philipps-sagemaker-bucket-us-east-1/huggingface-training-2021-02-04-16-47-39-189/output/model.tar.gz

latest training job name for this estimator:
huggingface-training-2021-02-04-16-47-39-189
# access the logs of the training job
huggingface_estimator.sagemaker_session.logs_for_job(huggingface_estimator.latest_training_job.name)
In SageMaker you can attach an old training job to an estimator to continue training, get results, etc.
from sagemaker.estimator import Estimator
# job which is going to be attached to the estimator
old_training_job_name=''
# attach old training job
huggingface_estimator_loaded = Estimator.attach(old_training_job_name)
# get model output s3 from training job
huggingface_estimator_loaded.model_data
2021-01-15 19:31:50 Starting - Preparing the instances for training
2021-01-15 19:31:50 Downloading - Downloading input data
2021-01-15 19:31:50 Training - Training image download completed. Training in progress.
2021-01-15 19:31:50 Uploading - Uploading generated training model
2021-01-15 19:31:50 Completed - Training job completed
's3://philipps-sagemaker-bucket-eu-central-1/huggingface-sdk-extension-2021-01-15-19-14-13-725/output/model.tar.gz'
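From the model_data S3 URI you can, for example, download the trained model artifact for local inspection. A sketch using the sagemaker S3 helper:
# sketch: download the trained model artifact (model.tar.gz) from S3
from sagemaker.s3 import S3Downloader

S3Downloader.download(s3_uri=huggingface_estimator_loaded.model_data, # s3 uri of the trained model
                      local_path='.',                                 # local path where the artifact is saved
                      sagemaker_session=sess)                         # sagemaker session used for training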