Green Manufacturing for vehicles
Introduction
Since the first automobile, the Real Wheel Motor Company has stood for important automotive innovations. These include, for example, the passenger safety cell with crumple zone, the airbag and intelligent assistance systems. Real Wheel Motor Company applies for nearly 2000 patents per year, making the brand the European leader among premium car makers. These cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized models of their dreams.
To ensure the safety and reliability of each and every unique car configuration before they hit the road, the engineers have developed a robust testing system. But, optimizing the speed of their testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on their production lines. ClearScape Analytics can help integrate, standardize, use, and reuse all the automotive data — from customers, vehicles, supply chains, production, and R&D — to digitally transform operations.
Business Values
Why Vantage?
Many organizations fail to realize value from their ML and AI investments due to a lack of scale. It is estimated that for broad adoption across many industries, the number of models and model deployments needs to scale 100-1000x larger than their organizations currently support. The unique massively-parallel architecture of Teradata Vantage allows users to prepare data, train, evaluate, and deploy models at unprecedented scale.
In this particular use case, the goal is to work with a dataset representing different permutations of Real Wheel Motor Company car features to predict the time it takes to pass testing. This will contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing the standards of the company cars. Vantage helps maximize efficiency across the entire machine learning lifecycle. ClearScape Analytics’ comprehensive in-database functions and machine learning pipeline seamlessly apply algorithms to large-scale data, with minimal data movement.
This dataset contains an anonymized set of variables, each representing a custom feature in a car. For example, a variable could be 4WD, added air suspension, or a head-up display. The ground truth is labelled ‘y’ and represents the time (in seconds) that the car took to pass testing for each variable.
1. Connect to Vantage
In the section, we import the required libraries and set environment variables and environment paths (if required).
%%capture
# # '%%capture' suppresses the display of installation steps of the following packages
# !pip install xgboost==1.7.3
# !pip install colorlover
Note: The above statements may need to be uncommented if you run the notebooks on a platform other than ClearScape Analytics Experience that does not have the libraries installed. If you uncomment those installs, be sure to restart the kernel after executing those lines to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing zero zero: 0 0
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
import json
import getpass
import pandas as pd
from teradataml.dataframe.dataframe import DataFrame
from teradataml import *
import numpy as np # linear algebra
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
color = sns.color_palette()
import xgboost as xgb
%matplotlib inline
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from collections import defaultdict
import plotly.offline as offline
import colorlover as cl
offline.init_notebook_mode()
display.max_rows=5
We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
%%capture
execute_sql('''SET query_band='DEMO=Green_Manufacturing_PY_SQL.ipynb;' UPDATE FOR SESSION; ''')
2. Getting Data for This Demo
We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.
** Note : Due to the large number of columns the initial table creation and data loading make take more time.
# %run -i ../run_procedure.py "call get_data('DEMO_GreenManufacturing_cloud');"
# Takes about 50 seconds
%run -i ../run_procedure.py "call get_data('DEMO_GreenManufacturing_local');"
# Takes about 3 minutes 30 seconds
Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.
%run -i ../run_procedure.py "call space_report();"
3. Analyze the raw data set
Let us start by creating a "Virtual DataFrame" that points directly to the dataset in Vantage. We then begin our analysis by checking the shape of the DataFrame and examining the data types of all its columns.
** Note : There may be a warning message due to a large number of columns in the dataframe. It's a Warning and not an error. Please ignore the warning
datadf = DataFrame(in_schema('DEMO_GreenManufacturing', 'Manufacturing_Data'))
datadf
The ID column is the ID of the cars, 'y' is the time in seconds which the car took to pass testing for each variable. The variables X0-X8 are categorical variables and the remaining are numerical variables having values of 0 and 1. These are the variables which impact the value of 'y'.
4. Check the impact of Categorical variables on target variable 'y'
We can see that the aggregated data is available to us in teradataml dataframe. Let's visualize this data to better understand the categorical variables. Vantage's Clearscape Analytics can easily integrate with 3rd party visualization tools like Tableau, PowerBI or many python modules available like plotly, seaborn etc. We can do all the calculations and pre-processing on Vantage and pass only the necessary information to visualization tools, this will not only make the calculation faster but also reduce the time due to less data movement between tools.
train_df=datadf.to_pandas().reset_index()
warnings.simplefilter(action='ignore', category=FutureWarning)
var_name = "X0"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
# sns.stripplot(x=var_name, y='y', data=train_df, order=col_order)
sns.countplot(x=var_name, data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
df_plot_x0 = datadf.assign(drop_columns=True,
X0=datadf.X0,
X0_cnt = datadf.X0)
df_plot_x0 = df_plot_x0.groupby('X0').count()
df_plot_x0
warnings.simplefilter(action='ignore', category=FutureWarning)
var_name = "X1"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.stripplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
warnings.simplefilter(action='ignore', category=FutureWarning)
var_name = "X2"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.boxplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
warnings.simplefilter(action='ignore', category=FutureWarning)
var_name = "X3"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.violinplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
warnings.simplefilter(action='ignore', category=FutureWarning)
var_name = "X4"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.violinplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
warnings.simplefilter(action='ignore', category=FutureWarning)
var_name = "X5"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.boxplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
warnings.simplefilter(action='ignore', category=FutureWarning)
var_name = "X6"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.boxplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
warnings.simplefilter(action='ignore', category=FutureWarning)
var_name = "X8"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.barplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
var_name = "ID"
plt.figure(figsize=(12,6))
sns.regplot(x=var_name, y='y', data=train_df, scatter_kws={'alpha':0.5, 's':30})
# sns.barplot(x=var_name, y='y', data=train_df)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
After the initial analysis done on the variables and the value of y based on these variables, let's go ahead and try to predict the value of Y using these variables. Below are some steps that should be done before using any prediction model.
5. Check the importance of various features on target variable 'y'
We are using the python xgboost model to check the feature importance.
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
for f in ["X0", "X1", "X2", "X3", "X4", "X5", "X6", "X8"]:
lbl = preprocessing.LabelEncoder()
lbl.fit(list(train_df[f].values))
train_df[f] = lbl.transform(list(train_df[f].values))
train_y = train_df['y'].values
train_X = train_df.drop(["ID", "y"], axis=1)
def xgb_r2_score(preds, dtrain):
labels = dtrain.get_label()
return 'r2', r2_score(labels, preds)
xgb_params = {
'eta': 0.05,
'max_depth': 6,
'subsample': 0.7,
'colsample_bytree': 0.7,
'objective': 'reg:linear',
'silent': 1
}
dtrain = xgb.DMatrix(train_X, train_y, feature_names=train_X.columns.values)
model = xgb.train(dict(xgb_params, silent=0), dtrain, num_boost_round=100, feval=xgb_r2_score, maximize=True)
# plot the important features #
fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()
6. OrdinalEncoding of the categorical variables
Machine learning models require all input and output variables to be numeric. This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model. The two most popular techniques are an Ordinal Encoding and a One-Hot Encoding.
Ordinal encoding, which turns each label into an integer value and depicts the sequence of labels in the encoded data, is employed when the variables in the data are ordinal. Ordinal encoding converts each label into integer values and the encoded data represents the sequence of labels.
Since the variables X0-X8 are categorical, we will need to convert them into numerical to use them in different models. We are using the OrdinalEncoding for this conversion. The OrdinalEncodingFit() function identifies distinct categorical values from the input data or a user-defined list and generates the distinct categorical values along with the ordinal value for each category.
The Ordinal encoding will be done for both the Train and Test Datasets.
Ordinal_fit_output = OrdinalEncodingFit(target_column=['X0','X1','X2','X3','X4','X5','X6','X8'],
data=datadf)
Ordinal encoding transform is used on the ordinal encoding fit data to get the numerical values for the categorical values.
The OrdinalEncodingTransform() function maps the categorical value to a specified ordinal value using the OrdinalEncodingFit() function output.
ordinal_encoding_transform_out = OrdinalEncodingTransform(data = datadf,
object = Ordinal_fit_output.result,
accumulate=['ID','y']
)
df = ordinal_encoding_transform_out.result
df
7. Preparation of Data
In the below steps we are preparing the data by joining the converted categorical features and some other important features to be used in Model Training, Scoring and Evaluation
We join the converted dataframe and the important numerical features to get the final dataset which will be used for the model.
Get the output of OrdinalTransform into dataframe.
OrdTransdf = df
We join the converted dataframe and the original data to get important numerical features.
warnings.simplefilter(action='ignore', category=DeprecationWarning)
datadf=datadf.drop(columns=["X0", "X1","X2","X3","X4","X5","X6","X8"])
datadf_join = OrdTransdf.join(other = datadf, on = ["ID"], how = "left",lprefix='t1',rprefix='t2')
datadf_join=datadf_join.drop(columns=["t2_ID", "t2_y"])
datadf_join = datadf_join.assign(ID=datadf_join.t1_ID, y=datadf_join.t1_y)
datadf_join = datadf_join.drop(columns=["t1_ID", "t1_y"])
Create a final dataframe with only the required important features along with the ID and the response column.
data_new_df = datadf_join[["ID",
"y",
"X0",
"X1",
"X2",
"X3",
"X4",
"X5",
"X6",
"X8",
"X47",
"X314",
"X118",
"X315",
"X127",
"X29",
"X115",
"X351",
"X151"]]
copy_to_sql(df = data_new_df, table_name = 'final_data',if_exists='replace')
We split the data into Train and Test data.
data_new_df = DataFrame('final_data')
TrainTestSplit_out = TrainTestSplit(data = data_new_df,
id_column="ID",
train_size=0.80,
test_size=0.20,
seed=42)
train_new_df=TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 1].drop(['TD_IsTrainRow'], axis = 1)
test_new_df=TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 0].drop(['TD_IsTrainRow'], axis = 1)
copy_to_sql(df = train_new_df, table_name = 'final_train_data',if_exists='replace')
copy_to_sql(df = test_new_df, table_name = 'final_test_data',if_exists='replace')
8. Decision Forest
The Decision Forest is a powerful method used for predicting outcomes in both classification and regression problems. It's an improvement on the technique of combining (or "bagging") multiple decision trees. Normally, building a decision tree involves assessing the importance of each feature in the data to determine how to divide the information. This method takes a unique approach by only considering a random subset of features at each division point in the tree. This forces each decision tree within the "forest" to be different from one another, which ultimately improves the accuracy of the predictions. The function relies on a training dataset to develop a prediction model. Then, the DecisionForestPredict function uses the model built by the DecisionForest function to make predictions. It supports regression, binary, and multi-class classification tasks.
Typically, constructing a decision tree involves evaluating the value for each input feature in the data to select a split point. The function reduces the features to a random subset (that can be considered at each split point); the algorithm can force each decision tree in the forest to be very different to improve prediction accuracy. The function uses a training dataset to create a predictive model. The DecisionForestPredict function uses the model created by the DecisionForest function for making predictions. The function supports regression, binary, and multi-class classification.
Consider the following points:
train_new_df = DataFrame('final_train_data')
test_new_df = DataFrame('final_test_data')
DecisionForest_out = DecisionForest(data = train_new_df,
input_columns = [ 'ID','X0','X1','X2','X3','X4','X5','X6','X8','X47','X314','X118'
,'X315','X127','X29','X115','X351','X151'],
response_column = 'y',
max_depth = 12,
num_trees = 4,
min_node_size = 1,
mtry = -1,
mtry_seed = 1,
seed = 1,
tree_type = 'REGRESSION')
The DecisionForestPredict() function uses the model generated by the DecisionForest() function to generate predictions on a response variable for a test set of data. The model can be stored in either a teradataml DataFrame or a DecisionForest object.
decision_forest_predict_out = TDDecisionForestPredict(object = DecisionForest_out,
newdata = test_new_df,
id_column = "ID",
detailed = False,
accumulate = ["y"]
)
decision_forest_predict_out.result
df_plot = decision_forest_predict_out.result.head(50)
from teradataml import Figure
figure = Figure(width=1000, height=700, heading="Actual vs Predicted using DecisionForest Classification")
df_plot.plot(x=df_plot.ID, y=[df_plot.y, df_plot.prediction],
style=['dark orange', 'green'], xlabel='Vehicla ID', ylabel='Time in test cycle', grid_color='black',
grid_linewidth=0.5, grid_linestyle="-", legend=['Actual Value','Predicted Value'],figure=figure)
The RegressionEvaluator() function computes metrics to evaluate and compare multiple models and summarizes how close predictions are to their expected values.
RegressionEvaluator_out = RegressionEvaluator(data = decision_forest_predict_out.result,
observation_column = "y",
prediction_column = "prediction",
freedom_degrees = [5, 28],
independent_features_num = 15,
metrics = ['MAE','MSE','RMSE','R2','FSTAT'])
DF_eval = RegressionEvaluator_out.result
DF_eval = DF_eval.assign(model='Decision Forest')
DF_eval
9. XGBoost
The XGBoost function, also known as eXtreme Gradient Boosting, is an implementation of the gradient boosted decision tree designed for speed and performance. It has recently been dominating applied machine learning.
In gradient boosting, each iteration fits a model to the residuals (errors) of the previous iteration to correct the errors made by existing models. The predicted residual is multiplied by this learning rate and then added to the previous prediction. Models are added sequentially until no further improvements can be made. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
Gradient boosting involves three elements:
The loss function used depends on the type of problem being solved. For example, regression may use a squared error and binary classification may use binomial. A benefit of the gradient boosting is that a new boosting algorithm does not have to be derived for each loss function. Instead, it provides a generic enough framework that any differentiable loss function can be used. The XGBoost function supports both regression and classification predictive modelling problems. The model that it creates is used in the XGBoostPredict function for making predictions.
XGBoost_out = XGBoost(data=train_new_df,
input_columns=['ID','X0','X1','X2','X3','X4','X5','X6','X8','X47','X314','X118',
'X315','X127','X29','X115','X351','X151'],
response_column = 'y',
max_depth=2,
lambda1 = 1000.0,
model_type='Regression',
seed=1,
shrinkage_factor=0.95,
iter_num=24,
min_node_size=1,
num_boosted_trees=6,
column_sampling=1.0)
XGBoostPredict performs prediction for test input data using multiple simple trees in the trained model. The test input data should have the same attributes as used during the training phase, which can be up to 2048. These attributes are used to score based on the trees in the model.
The output contains prediction for each data point in the test data based on regression or classification. The prediction probability is computed based on the majority vote from participating trees. A higher probability implies a more confident prediction by the model. Majority of the trees result in the same prediction.
XGBoostPredict_out = XGBoostPredict(newdata=test_new_df,
object=XGBoost_out.result,
id_column='ID',
model_type = 'regression',
object_order_column=['task_index', 'tree_num',
'iter', 'tree_order'],
accumulate = 'y')
xgb_result = XGBoostPredict_out.result
xgb_result
df_xgb_plot = XGBoostPredict_out.result.head(50)
from teradataml import Figure
figure = Figure(width=1000, height=700, heading="Actual vs Predicted using XGBoost Classifier")
df_xgb_plot.plot(x=df_xgb_plot.ID, y=[df_xgb_plot.y, df_xgb_plot.Prediction],
style=['dark orange', 'green'], xlabel='Vehicla ID', ylabel='Time in test cycle', grid_color='black',
grid_linewidth=0.5, grid_linestyle="-", legend=['Actual Value','Predicted Value'],figure=figure)
The RegressionEvaluator function computes metrics to evaluate and compare multiple models and summarizes how close predictions are to their expected values.
Note: Since we have sample data here the predicted values do not seem to be very near to the actual values. In scenarios with real data the parameters can be tweaked to get better predicted values.
RegressionEvaluator_out = RegressionEvaluator(data = xgb_result,
observation_column = "y",
prediction_column = "Prediction",
freedom_degrees = [5, 28],
independent_features_num = 15,
metrics = ['MAE','MSE','RMSE','R2','FSTAT'])
XGB_Eval = RegressionEvaluator_out.result
XGB_Eval = XGB_Eval.assign(model='XGBoost')
XGB_Eval
The Metrics of the regression evaluator has the RMSE, R2 and the F-STAT metrics which are specified in the Metrics.
Thus, here we have used 2 different models to train and predict the data. The Regression evaluator is used to evaluate and compare the models. The Teradata In-Database functions are used for training, prediction and evaluation. In this case since we have sample data the result parameters may not be accurate for these models.
Root mean squared error (RMSE)The most common metric for evaluating linear regression model performance is called root mean squared error, or RMSE. The basic idea is to measure how bad/erroneous the model’s predictions are when compared to actual observed values. So, a high RMSE is “bad” and a low RMSE is “good”.
The coefficient of determination — more commonly known as R² — allows us to measure the strength of the relationship between the response and predictor variables in the model. It’s just the square of the correlation coefficient R, so its values are in the range 0.0–1.0. Higher values of R- Squared is Good.
The metrics specified in the Metrics syntax element are displayed. For FSTAT, the following columns are displayed:
Here we can see the comparison for MAE,MSE,RMSE and R2 for XGBoost and DecisionForest.
transposed_df_eval = DF_eval.concat(XGB_Eval)
transposed_df_eval.select(['model','MAE','MSE','RMSE','R2','F_SCORE','F_CRITICALVALUE','P_VALUE','F_CONCLUSION'])
Conclusion
This, with ClearScape Analytics, we can scale AI/ML quicker and more effectively to solve your most complex challenges, reduce cost and friction, and accelerate time to value throughout your organization. Teradata can help manufacturing clients achieve success by helping them to grow revenues, optimise asset uptime, boost product quality and improve efficiency in the face of growing global disruption and competition. The InDB Analytic functions of Vantage help improve speed, performance, and time-to value by minimizing data movement by handling data where it lies for reduced costs and heightened security.
10. Cleanup
Work Tables
tables = ['final_data','final_train_data','final_test_data']
# Loop through the list of tables and execute the drop table command for each table
for table in tables:
try:
db_drop_table(table_name = table)
except:
pass
Databases and Tables
The following code will clean up tables and databases created above.
%run -i ../run_procedure.py "call remove_data('DEMO_GreenManufacturing');"
#Takes 40 seconds
remove_context()
Let’s look at the elements we have available for reference for this notebook:
Filters:
Related Resources: