This notebook uses Workbench to quickly build an AWS® Machine Learning Pipeline with the public Wine dataset. This dataset describes the chemical composition of wine samples from three cultivars.
We're going to set up a full AWS Machine Learning Pipeline from start to finish. Since the Workbench classes encapsulate, organize, and manage sets of AWS® Services, setting up our ML pipeline will be straightforward.
Workbench also provides visibility into AWS services for every step of the process so we know exactly what we've got and how to use it.
Wine Dataset: A classic dataset used in pattern recognition, machine learning, and data mining, the Wine dataset comprises 178 wine samples sourced from three different cultivars in Italy. The dataset features 13 physico-chemical attributes for each wine sample, providing a multi-dimensional feature space ideal for classification tasks. The aim is to correctly classify the wine samples into one of the three cultivars based on these chemical constituents. This dataset is widely employed for testing and benchmarking classification algorithms and is notable for its well-balanced distribution among classes. It serves as a straightforward, real-world example for classification tasks in machine learning.
Main Reference: Forster, P. (1991). Machine Learning of Natural Language and Ontology (Technical Report DAI-TR-261). Department of Artificial Intelligence, University of Edinburgh.
Important Note: We've made a small change to the wine dataset so that it has a string-based target column called 'wine_class', with string labels instead of integers.
Download Data
Workbench is a medium-granularity framework that manages and aggregates AWS® Services into classes and concepts. When you use Workbench you think about DataSources, FeatureSets, Models, and Endpoints. Under the hood those classes handle all the details around updating and
® Amazon Web Services, AWS, the Powered by AWS logo, are trademarks of Amazon.com, Inc. or its affiliates.
# Workbench has verbose log messages so set to warning
import workbench
import logging
logging.getLogger("workbench").setLevel(logging.WARNING)
# Note: If you want to use local data just use a file path
from workbench.api.data_source import DataSource
s3_path = "s3://workbench-public-data/common/wine_dataset.csv"
data_source = DataSource(s3_path, 'wine_data')
Okay, so it was just a few lines of code but Workbench did the following for you:
The new 'DataSource' will show up in AWS and of course the Workbench AWS Dashboard. Anyone can see the data, get information on it, use AWS® Athena to query it, and of course use it as part of their analysis pipelines.
Since Workbench manages a broad range of AWS Services it means that you get visibility into exactly what data you have in AWS. It also means nice perks like hitting the 'Query' link in the Dashboard Web Interface and getting a direct Athena console on your dataset. With AWS Athena you can use typical SQL statements to inspect and investigate your data.
But that's not all!
Workbench also provides an API to query DataSources and FeatureSets directly, so let's do that now.
# Athena queries are easy
data_source.query('SELECT * from wine_data limit 5')
We can see in the dataframe above that our target column has strings in it. You do not need to convert these to integers. Just use the transformation classes, and a LabelEncoder will be used internally for training and prediction/inference.
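To make the label-encoding idea concrete, here is a minimal pure-Python sketch of what a label encoder does. It is a stand-in written for illustration, not Workbench's internal code, and the label strings (`TypeA`, `TypeB`, `TypeC`) are hypothetical; the actual values in 'wine_class' may differ.

```python
# Minimal stand-in for a label encoder (illustration only, NOT Workbench internals)
class SimpleLabelEncoder:
    """Map string class labels to integer codes and back."""

    def fit(self, labels):
        # Sort the unique labels so the label -> code mapping is deterministic
        self.classes_ = sorted(set(labels))
        self._to_code = {label: code for code, label in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        # Strings -> integer codes (what a model trains on)
        return [self._to_code[label] for label in labels]

    def inverse_transform(self, codes):
        # Integer codes -> original strings (what inference returns)
        return [self.classes_[code] for code in codes]


# Hypothetical label values; the real 'wine_class' strings may differ
wine_class = ["TypeB", "TypeA", "TypeC", "TypeA"]
encoder = SimpleLabelEncoder().fit(wine_class)
encoded = encoder.transform(wine_class)
decoded = encoder.inverse_transform(encoded)
print(encoded)
print(decoded)
```

The round trip returns the original strings, which is why the target column can stay as text throughout the pipeline.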
Note: Normally this is where you'd do a deep dive on the data/features, look at data quality metrics, find redundant features, and engineer new features. For the purposes of this notebook we're simply going to take the given 13 physico-chemical attributes for each wine sample.
data_source.column_details()
help(data_source.to_features)
Great question. Between row 'ingestion' and waiting for the offline store to finish populating, it does take a long time. Workbench is simply invoking the AWS Service APIs, and those APIs take a while to do their thing.
The good news is that Workbench can monitor and query the status of the object and let you know when things are ready.
data_source.to_features("wine_features", target_column="wine_class", tags=["wine", "classification", "uci"])
Now we see our new feature set automatically pop up in our dashboard. FeatureSet creation involves the most complex set of AWS Services:
The new 'FeatureSet' will show up in AWS and of course the Workbench AWS Dashboard. Anyone can see the feature set, get information on it, use AWS® Athena to query it, and of course use it as part of their analysis pipelines.
Important: All inputs are stored to track provenance on your data as it goes through the pipeline. We can see the last field in the FeatureSet shows the input DataSource.
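As a toy illustration of the provenance idea, the sketch below walks a lineage map from an artifact back to its original input. The dictionary layout and the `trace_inputs` helper are hypothetical and only show the concept; how Workbench actually stores this information is not shown in this notebook.

```python
# Toy lineage map: artifact name -> the input it was built from (None = root).
# The names come from this notebook; the structure itself is an assumption.
lineage = {
    "wine_features": "wine_data",
    "wine_data": "s3://workbench-public-data/common/wine_dataset.csv",
    "s3://workbench-public-data/common/wine_dataset.csv": None,
}


def trace_inputs(name, lineage):
    """Follow the 'input' links from an artifact back to its root source."""
    chain = [name]
    while lineage.get(chain[-1]) is not None:
        chain.append(lineage[chain[-1]])
    return chain


chain = trace_inputs("wine_features", lineage)
print(" <- ".join(chain))
```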
Note: Normally this is where you'd do a deep dive on the feature set. For the purposes of this notebook we're simply going to take the features given to us and make a reference model that can track our baseline model performance for others to improve upon. :)
from workbench.api.feature_set import FeatureSet
from workbench.api.model import Model, ModelType
fs = FeatureSet("wine_features")
help(fs.to_model)
fs.column_names()
tags = ["wine", "classification", "public"]
fs.to_model(name="wine-classification", model_type=ModelType.CLASSIFIER, target_column="wine_class",
            tags=tags, description="Wine Classification Model")
Okay, now that our model has been published we can deploy an AWS Endpoint to serve inference requests for that model. Deploying an Endpoint allows a large set of services/APIs to use our model in production.
model = Model("wine-classification")
model.to_endpoint("wine-classification-end", tags=["wine", "classification"])
AWS Endpoints will bundle up a model as a service that responds to HTTP requests. The typical way to use an endpoint is to send a POST request with your features in CSV format. Workbench provides a nice DataFrame based interface that takes care of many details for you.
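For a sense of what the DataFrame interface saves you from, here is a rough sketch of building a CSV request body by hand. The column names are a hypothetical subset of the 13 features, and whether a given endpoint expects a header row is an assumption; this sketch omits it. Nothing is sent over the network here.

```python
import csv
import io


def rows_to_csv_payload(rows, feature_columns):
    """Serialize feature rows as header-less CSV text, one sample per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Keep a fixed column order so every sample lines up with the model's features
        writer.writerow([row[col] for col in feature_columns])
    return buffer.getvalue()


# Hypothetical subset of feature names and values (illustration only)
feature_columns = ["alcohol", "malic_acid", "ash"]
rows = [
    {"alcohol": 14.23, "malic_acid": 1.71, "ash": 2.43},
    {"alcohol": 13.2, "malic_acid": 1.78, "ash": 2.14},
]
payload = rows_to_csv_payload(rows, feature_columns)
print(payload)
```

A payload like this would typically be sent as the body of a POST request with a `text/csv` content type. With Workbench you pass a DataFrame instead and skip this step.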
# Get the Endpoint
from workbench.api.endpoint import Endpoint
my_endpoint = Endpoint('wine-classification-end')
We can now look at the model, see what FeatureSet was used to train it, and even better, see exactly which ROWS in that training set were used to create the model. We can make a query that returns the ROWS that were not used for training.
table = fs.view("training").table
test_df = fs.query(f"select * from {table} where training=0")
test_df.head()
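If you want to try the shape of that hold-out query without AWS, the sketch below runs the same kind of SQL against an in-memory SQLite table. SQLite is only a local stand-in for Athena, and the table name, columns, and rows are made up for illustration; the real table name comes from `fs.view("training").table`.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for Athena (assumption)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wine_training (id INTEGER, alcohol REAL, wine_class TEXT, training INTEGER)"
)

# Made-up rows: training=1 means the row was used for training, 0 means held out
sample_rows = [
    (1, 14.23, "TypeA", 1),
    (2, 13.20, "TypeA", 1),
    (3, 12.37, "TypeB", 0),
    (4, 13.16, "TypeC", 1),
    (5, 12.85, "TypeC", 0),
]
conn.executemany("INSERT INTO wine_training VALUES (?, ?, ?, ?)", sample_rows)

table = "wine_training"  # hypothetical name
holdout = conn.execute(f"select * from {table} where training=0 order by id").fetchall()
train_count = conn.execute(f"select count(*) from {table} where training=1").fetchone()[0]
print(holdout)
print(train_count)
conn.close()
```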
# Okay, now use the Workbench Endpoint to make predictions on the TEST data
prediction_df = my_endpoint.predict(test_df)
metrics = my_endpoint.classification_metrics("wine_class", prediction_df)
metrics
Looking at the prediction plot above we can see that many predictions were close to the actual value but about 10 of the predictions were WAY off. So at this point we'd use Workbench to investigate those predictions, map them back to our FeatureSet and DataSource and see if there were irregularities in the training data.
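The follow-up described above (find the misses, then trace them back) can be sketched in plain Python. This is not how `classification_metrics` is implemented; it is a small illustration of per-class precision, recall, and F1, plus a list of misclassified row positions, using made-up labels.

```python
def per_class_metrics(actual, predicted):
    """Compute precision, recall, and F1 for each class label."""
    results = {}
    pairs = list(zip(actual, predicted))
    for label in sorted(set(actual) | set(predicted)):
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = {"precision": precision, "recall": recall, "f1": f1}
    return results


# Made-up labels for illustration; real values come from the prediction DataFrame
actual = ["TypeA", "TypeA", "TypeB", "TypeB", "TypeC", "TypeC"]
predicted = ["TypeA", "TypeB", "TypeB", "TypeB", "TypeC", "TypeA"]

toy_metrics = per_class_metrics(actual, predicted)
# Row positions where the prediction disagrees with the actual label
misclassified = [i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]

for label, scores in toy_metrics.items():
    print(label, scores)
print(misclassified)
```

The misclassified positions are the rows you would map back to the FeatureSet and DataSource when looking for irregularities.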
This notebook used Workbench to quickly build an AWS® Machine Learning Pipeline with the public Wine dataset. We built a full AWS Machine Learning Pipeline from start to finish.
Workbench made it easy:
Using Workbench will minimize the time and manpower needed to incorporate AWS ML into your organization. If your company would like to be a Workbench Alpha Tester, contact us at workbench@supercowpowers.com.
# Plotting defaults
%matplotlib inline
import matplotlib.pyplot as plt
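# Matplotlib 3.6+ renamed the seaborn styles with a 'seaborn-v0_8-' prefix
style = 'seaborn-v0_8-deep' if 'seaborn-v0_8-deep' in plt.style.available else 'seaborn-deep'
plt.style.use(style)
#plt.style.use('seaborn-dark')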
plt.rcParams['font.size'] = 12.0
plt.rcParams['figure.figsize'] = 14.0, 7.0