!mkdir -p code/cloudformation
!wget -q --show-progress -O code/cloudformation/immersion_day.yaml https://personalization-at-amazon.s3.amazonaws.com/amazon-personalize/AmazonPersonalizeImmersionDay.yaml
code/cloudformation 100%[===================>] 2.57K --.-KB/s in 0s
!cat code/cloudformation/immersion_day.yaml
---
AWSTemplateFormatVersion: '2010-09-09'
Description: Creates an S3 Bucket, IAM Policies, and SageMaker Notebook to work with Personalize.

Parameters:
  NotebookName:
    Type: String
    Default: AmazonPersonalizeImmersionDay
    Description: Enter the name of the SageMaker notebook instance. Default is PersonalizeImmersionDay.

  VolumeSize:
    Type: Number
    Default: 64
    MinValue: 5
    MaxValue: 16384
    ConstraintDescription: Must be an integer between 5 (GB) and 16384 (16 TB).
    Description: Enter the size of the EBS volume in GB.

  domain:
    Type: String
    Default: Media
    Description: Enter the name of the domain (Retail, Media, or CPG) you would like to use in your Amazon Personalize Immersion Day.

Resources:
  SAMArtifactsBucket:
    Type: AWS::S3::Bucket

  # SageMaker Execution Role
  SageMakerIamRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: sagemaker.amazonaws.com
            Action: sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/IAMFullAccess"
        - "arn:aws:iam::aws:policy/AWSCloudFormationFullAccess"
        - "arn:aws:iam::aws:policy/AmazonS3FullAccess"
        - "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
        - "arn:aws:iam::aws:policy/AWSStepFunctionsFullAccess"
        - "arn:aws:iam::aws:policy/AWSLambda_FullAccess"
        - "arn:aws:iam::aws:policy/AmazonSNSFullAccess"
        - "arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"

  # SageMaker notebook
  NotebookInstance:
    Type: "AWS::SageMaker::NotebookInstance"
    Properties:
      InstanceType: "ml.t2.medium"
      NotebookInstanceName: !Ref NotebookName
      RoleArn: !GetAtt SageMakerIamRole.Arn
      VolumeSizeInGB: !Ref VolumeSize
      LifecycleConfigName: !GetAtt AmazonPersonalizeMLOpsLifecycleConfig.NotebookInstanceLifecycleConfigName

  AmazonPersonalizeMLOpsLifecycleConfig:
    Type: "AWS::SageMaker::NotebookInstanceLifecycleConfig"
    Properties:
      OnStart:
        - Content:
            Fn::Base64: !Sub |
              #!/bin/bash
              sudo -u ec2-user -i <<'EOF'
              cd /home/ec2-user/SageMaker/
              git clone https://github.com/aws-samples/amazon-personalize-immersion-day.git
              cd /home/ec2-user/SageMaker/amazon-personalize-immersion-day/automation/ml_ops/
              nohup sh deploy.sh "${SAMArtifactsBucket}" "${domain}" &
              EOF
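The template can be launched through the CloudFormation console, or programmatically. The sketch below is a hedged example using boto3; the stack name and parameter values are illustrative assumptions, not part of the immersion day material.

import boto3

cfn = boto3.client('cloudformation', region_name='us-east-1')

with open('code/cloudformation/immersion_day.yaml') as f:
    template_body = f.read()

cfn.create_stack(
    StackName='personalize-immersion-day',  # hypothetical stack name
    TemplateBody=template_body,
    Parameters=[
        {'ParameterKey': 'NotebookName', 'ParameterValue': 'AmazonPersonalizeImmersionDay'},
        {'ParameterKey': 'domain', 'ParameterValue': 'Media'},
    ],
    Capabilities=['CAPABILITY_IAM'],  # required because the template creates an IAM role
)
# block until the stack finishes creating
cfn.get_waiter('stack_create_complete').wait(StackName='personalize-immersion-day')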
import time
from time import sleep
import json
from datetime import datetime
import pandas as pd
original_data = pd.read_csv('./data/bronze/ml-latest-small/ratings.csv')
original_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   userId     100836 non-null  int64
 1   movieId    100836 non-null  int64
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
The int64 format is clearly suitable for userId and movieId, but we need to look more closely at the timestamps. Amazon Personalize expects timestamps in Unix epoch format, which is not human-readable. As a quick sanity check, let's pick an arbitrary timestamp and convert it to a human-readable date.
arb_time_stamp = original_data.iloc[50]['timestamp']
print(arb_time_stamp)
print(datetime.utcfromtimestamp(arb_time_stamp).strftime('%Y-%m-%d %H:%M:%S'))
964982681.0
2000-07-30 18:44:41
Since this is an explicit-feedback dataset of movie ratings on a five-star scale, we want to keep only movies that users "liked" and simulate the kind of interaction data a VOD platform would collect. To do that, we filter out the lowest ratings and create two event types, "click" and "watch": every interaction with a rating above 1 becomes a "click" event, and interactions with a rating above 3 additionally become "watch" events.
Note that this is only to match the events we are modeling; with a real dataset you would model actual implicit feedback such as clicks and watches and/or explicit feedback such as ratings, likes, and so on.
watched_df = original_data.copy()
watched_df = watched_df[watched_df['rating'] > 3]
watched_df = watched_df[['userId', 'movieId', 'timestamp']]
watched_df['EVENT_TYPE']='watch'
display(watched_df.head())
clicked_df = original_data.copy()
clicked_df = clicked_df[clicked_df['rating'] > 1]
clicked_df = clicked_df[['userId', 'movieId', 'timestamp']]
clicked_df['EVENT_TYPE']='click'
display(clicked_df.head())
# DataFrame.append was removed in pandas 2.x; pd.concat is the equivalent here
interactions_df = pd.concat([clicked_df, watched_df])
interactions_df.sort_values("timestamp", axis=0, ascending=True,
                            inplace=True, na_position='last')
interactions_df.info()
| | userId | movieId | timestamp | EVENT_TYPE |
|---|---|---|---|---|
| 0 | 1 | 1 | 964982703 | watch |
| 1 | 1 | 3 | 964981247 | watch |
| 2 | 1 | 6 | 964982224 | watch |
| 3 | 1 | 47 | 964983815 | watch |
| 4 | 1 | 50 | 964982931 | watch |

| | userId | movieId | timestamp | EVENT_TYPE |
|---|---|---|---|---|
| 0 | 1 | 1 | 964982703 | click |
| 1 | 1 | 3 | 964981247 | click |
| 2 | 1 | 6 | 964982224 | click |
| 3 | 1 | 47 | 964983815 | click |
| 4 | 1 | 50 | 964982931 | click |
<class 'pandas.core.frame.DataFrame'>
Int64Index: 158371 entries, 66679 to 81092
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   userId      158371 non-null  int64
 1   movieId     158371 non-null  int64
 2   timestamp   158371 non-null  int64
 3   EVENT_TYPE  158371 non-null  object
dtypes: int64(3), object(1)
memory usage: 6.0+ MB
Amazon Personalize has default column names for users, items, and timestamps: USER_ID, ITEM_ID, and TIMESTAMP. The final modification to the dataset is therefore to rename the existing columns to these default headers.
interactions_df.rename(columns = {'userId':'USER_ID', 'movieId':'ITEM_ID',
'timestamp':'TIMESTAMP'}, inplace = True)
interactions_df.head()
| | USER_ID | ITEM_ID | TIMESTAMP | EVENT_TYPE |
|---|---|---|---|---|
| 66679 | 429 | 222 | 828124615 | watch |
| 66681 | 429 | 227 | 828124615 | click |
| 66719 | 429 | 595 | 828124615 | watch |
| 66718 | 429 | 592 | 828124615 | watch |
| 66717 | 429 | 590 | 828124615 | watch |
interactions_df.to_csv('./data/silver/ml-latest-small/interactions.csv', index=False, float_format='%.0f')
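Before moving on to the item metadata, a quick sanity check on the file we just wrote doesn't hurt. A minimal sketch, re-reading the CSV and confirming it has the headers Personalize expects:

# Re-read the silver interactions file and confirm the Personalize column names.
check_df = pd.read_csv('./data/silver/ml-latest-small/interactions.csv')
assert set(check_df.columns) == {'USER_ID', 'ITEM_ID', 'TIMESTAMP', 'EVENT_TYPE'}
print(check_df['EVENT_TYPE'].value_counts())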
original_data = pd.read_csv('./data/bronze/ml-latest-small/movies.csv')
original_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   movieId  9742 non-null   int64
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
original_data['year'] = original_data['title'].str.extract(r'.*\((.*)\).*', expand=False)
original_data = original_data.dropna(axis=0)
itemmetadata_df = original_data.copy()
itemmetadata_df = itemmetadata_df[['movieId', 'genres', 'year']]
itemmetadata_df.head()
| | movieId | genres | year |
|---|---|---|---|
| 0 | 1 | Adventure\|Animation\|Children\|Comedy\|Fantasy | 1995 |
| 1 | 2 | Adventure\|Children\|Fantasy | 1995 |
| 2 | 3 | Comedy\|Romance | 1995 |
| 3 | 4 | Comedy\|Drama\|Romance | 1995 |
| 4 | 5 | Comedy | 1995 |
Next we add a CREATION_TIMESTAMP column for each item. If you don't provide a CREATION_TIMESTAMP, the model infers it from the interactions dataset and uses the timestamp of the item's earliest interaction as its release date. If an item has no interactions, its release date is set to the timestamp of the latest interaction in the training set and the item is treated as a new item. For this dataset we will simply set CREATION_TIMESTAMP to 0.
itemmetadata_df['CREATION_TIMESTAMP'] = 0
itemmetadata_df.rename(columns = {'genres':'GENRE', 'movieId':'ITEM_ID', 'year':'YEAR'}, inplace = True)
itemmetadata_df.to_csv('./data/silver/ml-latest-small/item-meta.csv', index=False, float_format='%.0f')
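We set CREATION_TIMESTAMP to 0 above for simplicity. If you wanted to supply real values instead of letting Personalize infer them, a minimal sketch (assuming the renamed interactions_df from earlier is still in memory) is to take each item's earliest interaction:

# Hedged alternative to the constant 0: per-item creation timestamps derived
# from each item's earliest interaction; items with no interactions fall back
# to the latest timestamp, mirroring how Personalize treats "new" items.
earliest = interactions_df.groupby('ITEM_ID')['TIMESTAMP'].min()
itemmetadata_df['CREATION_TIMESTAMP'] = (
    itemmetadata_df['ITEM_ID']
    .map(earliest)
    .fillna(interactions_df['TIMESTAMP'].max())
    .astype('int64')
)
# If you go this route, re-run the to_csv call above to refresh item-meta.csv.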
!pip install -q boto3
import boto3
import json
import time
!mkdir -p ~/.aws && cp /content/drive/MyDrive/AWS/d01_admin/* ~/.aws
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')
print("We can communicate with Personalize!")
We can communicate with Personalize!
# create the dataset group (the highest level of abstraction)
create_dataset_group_response = personalize.create_dataset_group(
name = "immersion-day-dataset-group-movielens-latest"
)
dataset_group_arn = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))
# wait for it to become active
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
describe_dataset_group_response = personalize.describe_dataset_group(
datasetGroupArn = dataset_group_arn
)
status = describe_dataset_group_response["datasetGroup"]["status"]
print("DatasetGroup: {}".format(status))
if status == "ACTIVE" or status == "CREATE FAILED":
break
time.sleep(60)
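The same poll-and-sleep pattern is repeated below for the dataset import job, so it can be handy to factor it into a small helper. This is a hedged sketch, not part of the original notebook; wait_until_active and its arguments are hypothetical names:

def wait_until_active(describe_fn, describe_kwargs, status_path, max_hours=3, poll_seconds=60):
    """Poll a Personalize describe_* call until the resource is ACTIVE or CREATE FAILED."""
    status = None
    deadline = time.time() + max_hours * 60 * 60
    while time.time() < deadline:
        response = describe_fn(**describe_kwargs)
        status = response
        for key in status_path:  # walk e.g. ["datasetGroup", "status"]
            status = status[key]
        print(status)
        if status in ("ACTIVE", "CREATE FAILED"):
            break
        time.sleep(poll_seconds)
    return status

# Equivalent to the loop above:
# wait_until_active(personalize.describe_dataset_group,
#                   {"datasetGroupArn": dataset_group_arn},
#                   ["datasetGroup", "status"])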
interactions_schema = {
"type": "record",
"name": "Interactions",
"namespace": "com.amazonaws.personalize.schema",
"fields": [
{
"name": "USER_ID",
"type": "string"
},
{
"name": "ITEM_ID",
"type": "string"
},
{
"name": "EVENT_TYPE",
"type": "string"
},
{
"name": "TIMESTAMP",
"type": "long"
}
],
"version": "1.0"
}
create_schema_response = personalize.create_schema(
name = "personalize-poc-movielens-interactions",
schema = json.dumps(interactions_schema)
)
interaction_schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))
dataset_type = "INTERACTIONS"
create_dataset_response = personalize.create_dataset(
name = "personalize-poc-movielens-ints",
datasetType = dataset_type,
datasetGroupArn = dataset_group_arn,
schemaArn = interaction_schema_arn
)
interactions_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))
region = 'us-east-1'
s3 = boto3.client('s3')
account_id = boto3.client('sts').get_caller_identity().get('Account')
bucket_name = account_id + "-" + region + "-" + "personalizepocvod"
print(bucket_name)
if region == "us-east-1":
s3.create_bucket(Bucket=bucket_name)
else:
s3.create_bucket(
Bucket=bucket_name,
CreateBucketConfiguration={'LocationConstraint': region}
)
746888961694-us-east-1-personalizepocvod
interactions_file_path = './data/silver/ml-latest-small/interactions.csv'
interactions_filename = 'interactions.csv'
boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_file_path)
interactions_s3DataPath = "s3://"+bucket_name+"/"+interactions_filename
policy = {
"Version": "2012-10-17",
"Id": "PersonalizeS3BucketAccessPolicy",
"Statement": [
{
"Sid": "PersonalizeS3BucketAccessPolicy",
"Effect": "Allow",
"Principal": {
"Service": "personalize.amazonaws.com"
},
"Action": [
"s3:*Object",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::{}".format(bucket_name),
"arn:aws:s3:::{}/*".format(bucket_name)
]
}
]
}
s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))
iam = boto3.client("iam")
role_name = "PersonalizeRolePOC"
assume_role_policy_document = {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "personalize.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
create_role_response = iam.create_role(
RoleName = role_name,
AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)
# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes "personalize" or "Personalize"
# if you would like to use a bucket with a different name, please consider creating and attaching a new policy
# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"
iam.attach_role_policy(
RoleName = role_name,
PolicyArn = policy_arn
)
# Now add S3 support
iam.attach_role_policy(
PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate
role_arn = create_role_response["Role"]["Arn"]
print(role_arn)
arn:aws:iam::746888961694:role/PersonalizeRolePOC
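As the comments above note, AmazonS3FullAccess is far broader than Personalize needs; for anything beyond a proof of concept you would scope access to the single POC bucket instead. A hedged sketch of an inline policy attached with iam.put_role_policy (the policy name "PersonalizePocBucketAccess" is an illustrative assumption):

# Hedged sketch: a tighter alternative to AmazonS3FullAccess, scoped to the POC bucket.
bucket_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket_name),
                "arn:aws:s3:::{}/*".format(bucket_name)
            ]
        }
    ]
}
iam.put_role_policy(
    RoleName=role_name,
    PolicyName="PersonalizePocBucketAccess",  # hypothetical name
    PolicyDocument=json.dumps(bucket_policy_document)
)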
create_dataset_import_job_response = personalize.create_dataset_import_job(
jobName = "personalize-poc-import1",
datasetArn = interactions_dataset_arn,
dataSource = {
"dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename)
},
roleArn = role_arn
)
dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))
# wait for this import job to become active
max_time = time.time() + 6*60*60 # 6 hours
while time.time() < max_time:
describe_dataset_import_job_response = personalize.describe_dataset_import_job(
datasetImportJobArn = dataset_import_job_arn
)
status = describe_dataset_import_job_response["datasetImportJob"]['status']
print("DatasetImportJob: {}".format(status))
if status == "ACTIVE" or status == "CREATE FAILED":
break
time.sleep(60)
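Once the import job is ACTIVE, you can optionally confirm what the dataset group now contains. A minimal sketch (not part of the original notebook) using the list_datasets API:

# Optional check: list the datasets registered in the dataset group so far.
list_datasets_response = personalize.list_datasets(
    datasetGroupArn = dataset_group_arn
)
for dataset in list_datasets_response['datasets']:
    print(dataset['name'], dataset['datasetType'], dataset['status'])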
import sys
sys.path.insert(0,'./code')
from generic_modules.import_dataset import personalize_dataset
dataset_group_arn = 'arn:aws:personalize:us-east-1:746888961694:dataset-group/immersion-day-dataset-group-movielens-latest'
bucket_name = '746888961694-us-east-1-personalizepocvod'
role_arn = 'arn:aws:iam::746888961694:role/PersonalizeRolePOC'
dataset_type = 'ITEMS'
source_data_path = './data/silver/ml-latest-small/item-meta.csv'
target_file_name = 'item-meta.csv'
personalize_item_meta = personalize_dataset(
dataset_group_arn = dataset_group_arn,
bucket_name = bucket_name,
role_arn = role_arn,
dataset_type = dataset_type,
source_data_path = source_data_path,
target_file_name = target_file_name,
    # dataset_arn is not passed here; it will be set when create_dataset() is called below
)
personalize_item_meta.setup_connection()
SUCCESS | We can communicate with Personalize!
itemmetadata_schema = {
"type": "record",
"name": "Items",
"namespace": "com.amazonaws.personalize.schema",
"fields": [
{
"name": "ITEM_ID",
"type": "string"
},
{
"name": "GENRE",
"type": "string",
"categorical": True
},{
"name": "YEAR",
"type": "int",
},
{
"name": "CREATION_TIMESTAMP",
"type": "long",
}
],
"version": "1.0"
}
personalize_item_meta.create_dataset(schema=itemmetadata_schema,
schema_name='personalize-poc-movielens-item',
dataset_name='personalize-poc-movielens-items')
personalize_item_meta.dataset_arn
'arn:aws:personalize:us-east-1:746888961694:dataset/immersion-day-dataset-group-movielens-latest/ITEMS'
personalize_item_meta.upload_data_to_s3()
personalize_item_meta.import_data_from_s3(import_job_name='personalize-poc-item-import1')
import boto3
import json
import time
class personalize_dataset:
def __init__(self,
dataset_group_arn=None,
schema_arn=None,
dataset_arn=None,
dataset_type='INTERACTIONS',
region='us-east-1',
bucket_name=None,
role_arn=None,
source_data_path=None,
target_file_name=None,
dataset_import_job_arn=None
):
self.personalize = None
self.personalize_runtime = None
self.s3 = None
self.iam = None
self.dataset_group_arn = dataset_group_arn
self.schema_arn = schema_arn
self.dataset_arn = dataset_arn
self.dataset_type = dataset_type
self.region = region
self.bucket_name = bucket_name
self.role_arn = role_arn
self.source_data_path = source_data_path
self.target_file_name = target_file_name
self.dataset_import_job_arn = dataset_import_job_arn
def setup_connection(self):
try:
self.personalize = boto3.client('personalize')
self.personalize_runtime = boto3.client('personalize-runtime')
self.s3 = boto3.client('s3')
self.iam = boto3.client("iam")
print("SUCCESS | We can communicate with Personalize!")
        except Exception as e:
            print("ERROR | Connection can't be established: {}".format(e))
def create_dataset_group(self, dataset_group_name=None):
"""
The highest level of isolation and abstraction with Amazon Personalize
is a dataset group. Information stored within one of these dataset groups
        has no impact on any other dataset group or models created from one; they
are completely isolated. This allows you to run many experiments and is
part of how we keep your models private and fully trained only on your data.
"""
create_dataset_group_response = self.personalize.create_dataset_group(name=dataset_group_name)
self.dataset_group_arn = create_dataset_group_response['datasetGroupArn']
# print(json.dumps(create_dataset_group_response, indent=2))
# Before we can use the dataset group, it must be active.
# This can take a minute or two. Execute the cell below and wait for it
# to show the ACTIVE status. It checks the status of the dataset group
# every minute, up to a maximum of 3 hours.
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
status = self.check_dataset_group_status()
print("DatasetGroup: {}".format(status))
if status == "ACTIVE" or status == "CREATE FAILED":
break
time.sleep(60)
def check_dataset_group_status(self):
"""
Check the status of dataset group
"""
describe_dataset_group_response = self.personalize.describe_dataset_group(
datasetGroupArn = self.dataset_group_arn
)
status = describe_dataset_group_response["datasetGroup"]["status"]
return status
def create_dataset(self, schema=None, schema_name=None, dataset_name=None):
"""
First, define a schema to tell Amazon Personalize what type of dataset
you are uploading. There are several reserved and mandatory keywords
required in the schema, based on the type of dataset. More detailed
information can be found in the documentation.
"""
create_schema_response = self.personalize.create_schema(
name = schema_name,
schema = json.dumps(schema)
)
self.schema_arn = create_schema_response['schemaArn']
"""
With a schema created, you can create a dataset within the dataset group.
Note that this does not load the data yet, it just defines the schema for
the data. The data will be loaded a few steps later.
"""
create_dataset_response = self.personalize.create_dataset(
name = dataset_name,
datasetType = self.dataset_type,
datasetGroupArn = self.dataset_group_arn,
schemaArn = self.schema_arn
)
self.dataset_arn = create_dataset_response['datasetArn']
def create_s3_bucket(self):
        if self.region == "us-east-1":
self.s3.create_bucket(Bucket=self.bucket_name)
else:
self.s3.create_bucket(
Bucket=self.bucket_name,
CreateBucketConfiguration={'LocationConstraint': self.region}
)
def upload_data_to_s3(self):
"""
Now that your Amazon S3 bucket has been created, upload the CSV file of
our user-item-interaction data.
"""
boto3.Session().resource('s3').Bucket(self.bucket_name).Object(self.target_file_name).upload_file(self.source_data_path)
s3DataPath = "s3://"+self.bucket_name+"/"+self.target_file_name
def set_s3_bucket_policy(self, policy=None):
"""
Amazon Personalize needs to be able to read the contents of your S3
bucket. So add a bucket policy which allows that.
"""
if not policy:
policy = {
"Version": "2012-10-17",
"Id": "PersonalizeS3BucketAccessPolicy",
"Statement": [
{
"Sid": "PersonalizeS3BucketAccessPolicy",
"Effect": "Allow",
"Principal": {
"Service": "personalize.amazonaws.com"
},
"Action": [
"s3:*Object",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::{}".format(self.bucket_name),
"arn:aws:s3:::{}/*".format(self.bucket_name)
]
}
]
}
self.s3.put_bucket_policy(Bucket=self.bucket_name, Policy=json.dumps(policy))
def create_iam_role(self, role_name=None):
"""
Amazon Personalize needs the ability to assume roles in AWS in order to
have the permissions to execute certain tasks. Let's create an IAM role
and attach the required policies to it. The code below attaches very permissive
policies; please use more restrictive policies for any production application.
"""
assume_role_policy_document = {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "personalize.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
create_role_response = self.iam.create_role(
RoleName = role_name,
AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)
# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes "personalize" or "Personalize"
# if you would like to use a bucket with a different name, please consider creating and attaching a new policy
# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"
self.iam.attach_role_policy(
RoleName = role_name,
PolicyArn = policy_arn
)
# Now add S3 support
self.iam.attach_role_policy(
PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate
self.role_arn = create_role_response["Role"]["Arn"]
def import_data_from_s3(self, import_job_name=None):
"""
Earlier you created the dataset group and dataset to house your information,
so now you will execute an import job that will load the data from the S3
bucket into the Amazon Personalize dataset.
"""
create_dataset_import_job_response = self.personalize.create_dataset_import_job(
jobName = import_job_name,
datasetArn = self.dataset_arn,
dataSource = {
"dataLocation": "s3://{}/{}".format(self.bucket_name, self.target_file_name)
},
roleArn = self.role_arn
)
self.dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
"""
Before we can use the dataset, the import job must be active. Execute the
cell below and wait for it to show the ACTIVE status. It checks the status
of the import job every minute, up to a maximum of 6 hours.
Importing the data can take some time, depending on the size of the dataset.
In this workshop, the data import job should take around 15 minutes.
"""
max_time = time.time() + 6*60*60 # 6 hours
while time.time() < max_time:
            status = self.check_import_job_status()
print("DatasetImportJob: {}".format(status))
if status == "ACTIVE" or status == "CREATE FAILED":
break
time.sleep(60)
def check_import_job_status(self):
describe_dataset_import_job_response = self.personalize.describe_dataset_import_job(
datasetImportJobArn = self.dataset_import_job_arn
)
status = describe_dataset_import_job_response["datasetImportJob"]['status']
return status
def __getstate__(self):
attributes = self.__dict__.copy()
del attributes['personalize']
del attributes['personalize_runtime']
del attributes['s3']
del attributes['iam']
return attributes
dataset_arn = 'arn:aws:personalize:us-east-1:746888961694:dataset/immersion-day-dataset-group-movielens-latest/ITEMS'
dataset_import_job_arn = 'arn:aws:personalize:us-east-1:746888961694:dataset-import-job/personalize-poc-item-import1'
personalize_item_meta = personalize_dataset(
dataset_group_arn = dataset_group_arn,
bucket_name = bucket_name,
role_arn = role_arn,
dataset_type = dataset_type,
source_data_path = source_data_path,
target_file_name = target_file_name,
dataset_arn = dataset_arn,
dataset_import_job_arn = dataset_import_job_arn
)
personalize_item_meta.setup_connection()
SUCCESS | We can communicate with Personalize!
personalize_item_meta.check_import_job_status()
'ACTIVE'
import pickle
with open('./artifacts/etc/personalize_item_meta.pkl', 'wb') as outp:
pickle.dump(personalize_item_meta, outp, pickle.HIGHEST_PROTOCOL)
personalize_item_meta.__getstate__()
{'bucket_name': '746888961694-us-east-1-personalizepocvod',
 'dataset_arn': 'arn:aws:personalize:us-east-1:746888961694:dataset/immersion-day-dataset-group-movielens-latest/ITEMS',
 'dataset_group_arn': 'arn:aws:personalize:us-east-1:746888961694:dataset-group/immersion-day-dataset-group-movielens-latest',
 'dataset_import_job_arn': 'arn:aws:personalize:us-east-1:746888961694:dataset-import-job/personalize-poc-item-import1',
 'dataset_type': 'ITEMS',
 'region': 'us-east-1',
 'role_arn': 'arn:aws:iam::746888961694:role/PersonalizeRolePOC',
 'schema_arn': None,
 'source_data_path': './data/silver/ml-latest-small/item-meta.csv',
 'target_file_name': 'item-meta.csv'}
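Because __getstate__ strips the boto3 clients before pickling, a freshly unpickled object has no live connections. A minimal sketch of restoring it in a later session (assuming the personalize_dataset class definition is available again, for example by re-running the class cell or importing it):

import pickle

# Reload the pickled helper and re-establish the clients that __getstate__ removed.
with open('./artifacts/etc/personalize_item_meta.pkl', 'rb') as inp:
    restored = pickle.load(inp)

restored.setup_connection()  # recreate the personalize / s3 / iam clients
print(restored.check_import_job_status())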