Green Manufacturing for Vehicles
Introduction
Since the first automobile, the Real Wheel Motor Company has stood for important automotive innovations, including the passenger safety cell with crumple zone, the airbag, and intelligent assistance systems. Real Wheel Motor Company applies for nearly 2,000 patents per year, making the brand the European leader among premium car makers. With a huge selection of features and options, customers can configure the customized models of their dreams.
To ensure the safety and reliability of each and every unique car configuration before it hits the road, the company's engineers have developed a robust testing system. Optimizing the speed of that testing system for so many possible feature combinations, however, is complex and time-consuming without a powerful algorithmic approach. For one of the world's biggest manufacturers of premium cars, safety and efficiency are paramount on the production lines.
The goal is to work with a dataset representing different permutations of Real Wheel Motor Company car features to predict the time it takes to pass testing. This will contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing the standards of the company cars.
This dataset contains an anonymized set of variables, each representing a custom feature in a car. For example, a variable could be 4WD, added air suspension, or a head-up display. The ground truth is labelled 'y' and represents the time (in seconds) that each car configuration took to pass testing.
1. Start by setting up Jupyter and connecting to the Vantage system.
In this section, we import the required libraries and set environment variables and environment paths (if required).
%%capture
# # '%%capture' suppresses the display of installation steps of the following packages
# !pip install xgboost==1.7.3
# !pip install colorlover
# !pip install --upgrade teradataml
NOTE: The above statements may need to be uncommented if you run the notebook on a platform other than ClearScape Analytics Experience that does not have these libraries installed. If you uncomment those installs, be sure to restart the kernel after executing those lines to bring the installed libraries into memory. The simplest way to restart the kernel is by typing zero zero: 0 0
import json
import getpass
import pandas as pd
from teradataml.dataframe.dataframe import DataFrame
from teradataml import *
import numpy as np # linear algebra
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.metrics import r2_score  # used later by the custom XGBoost evaluation metric
color = sns.color_palette()
import xgboost as xgb
%matplotlib inline
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from collections import defaultdict
import plotly.offline as offline
import colorlover as cl
offline.init_notebook_mode()
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
You will be prompted to provide the password. Enter your password, press the Enter key, then use the down arrow to go to the next cell. Run each step with Shift + Enter.
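If you are not using the bundled ../startup.ipynb helper (for example when running outside ClearScape Analytics Experience), a minimal alternative is to prompt for the password yourself with the getpass module imported above; this is only a sketch, not part of the official demo flow:
# Hypothetical replacement for ../startup.ipynb: prompt for the password manually.
password = getpass.getpass(prompt='Enter the database password: ')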
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
eng.execute('''SET query_band='DEMO=GreenManufacturing.ipynb;' UPDATE FOR SESSION; ''')
2. Getting Data for This Demo
We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield somewhat faster execution but requires available storage. There are two statements in the following cell, one of which is commented out; you may switch modes by changing which statement is commented.
** Note : Due to the large number of columns, the initial table creation and data loading may take more time.
#%run -i ../run_procedure.py "call get_data('DEMO_GreenManufacturing_cloud');"
# Takes about 50 seconds
%run -i ../run_procedure.py "call get_data('DEMO_GreenManufacturing_local');"
# Takes about 3 minutes 30 seconds
Next is an optional step – run it if you want to see the status of the databases/tables created and the space used.
%run -i ../run_procedure.py "call space_report();"
3. Analyze the raw data set
Create a DataFrame to get the data from the table created.
** Note : There may be a warning message due to the large number of columns in the DataFrame. It is a warning, not an error, and can safely be ignored.
datadf = DataFrame(in_schema('DEMO_GreenManufacturing', 'Manufacturing_Data'))
datadf
The ID column identifies each car, and 'y' is the time in seconds that the car took to pass testing. The variables X0-X8 (X7 is not present in this dataset) are categorical, and the remaining variables are binary indicators with values of 0 and 1. These are the variables that influence the value of 'y'.
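As a quick sanity check, the split between categorical and binary columns can be confirmed locally. This is an optional sketch (the column names and the 0/1 convention are taken from the description above), and it pulls the data into pandas the same way the next section does:
# Optional sketch: inspect which columns are categorical and which are 0/1 indicators.
sample_pd = datadf.to_pandas().reset_index()
categorical_cols = [c for c in sample_pd.columns if sample_pd[c].dtype == object]
binary_cols = [c for c in sample_pd.columns
               if c not in categorical_cols + ['ID', 'y'] and sample_pd[c].nunique() <= 2]
print(len(categorical_cols), 'categorical columns:', sorted(categorical_cols))
print(len(binary_cols), 'binary indicator columns')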
4. Check the impact of Categorical variables on target variable 'y'
Convert the data to a pandas DataFrame for local exploration and plotting.
train_df=datadf.to_pandas().reset_index()
var_name = "X0"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
# sns.stripplot(x=var_name, y='y', data=train_df, order=col_order)
sns.countplot(x=var_name, data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title("Frequency of categories in "+var_name, fontsize=15)
plt.show()
var_name = "X1"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.stripplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
var_name = "X2"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.boxplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
var_name = "X3"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.violinplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
var_name = "X4"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.violinplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
var_name = "X5"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.boxplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
var_name = "X6"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.boxplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
var_name = "X8"
col_order = np.sort(train_df[var_name].unique()).tolist()
plt.figure(figsize=(12,6))
sns.barplot(x=var_name, y='y', data=train_df, order=col_order)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
var_name = "ID"
plt.figure(figsize=(12,6))
sns.regplot(x=var_name, y='y', data=train_df, scatter_kws={'alpha':0.5, 's':30})
# sns.barplot(x=var_name, y='y', data=train_df)
plt.xlabel(var_name, fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title("Distribution of y variable with "+var_name, fontsize=15)
plt.show()
After this initial analysis of the variables and how the value of y varies with them, let's try to predict the value of y using these variables. Below are some steps that should be done before using any prediction model.
5. Check the importance of various features on target variable 'y'
We use the Python xgboost package to check feature importance.
warnings.simplefilter(action='ignore', category=UserWarning)
# Label-encode the categorical columns so that xgboost can consume them
for f in ["X0", "X1", "X2", "X3", "X4", "X5", "X6", "X8"]:
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(train_df[f].values))
    train_df[f] = lbl.transform(list(train_df[f].values))
train_y = train_df['y'].values
train_X = train_df.drop(["ID", "y"], axis=1)
def xgb_r2_score(preds, dtrain):
    labels = dtrain.get_label()
    return 'r2', r2_score(labels, preds)
xgb_params = {
    'eta': 0.05,
    'max_depth': 6,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'objective': 'reg:squarederror',  # 'reg:linear' is deprecated in recent xgboost releases
    'verbosity': 0                    # replaces the removed 'silent' parameter
}
dtrain = xgb.DMatrix(train_X, train_y, feature_names=train_X.columns.values)
model = xgb.train(xgb_params, dtrain, num_boost_round=100, feval=xgb_r2_score, maximize=True)
# plot the important features #
fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()
6. OrdinalEncoding of the categorical variables
Machine learning models require all input and output variables to be numeric. This means that if your data contains categorical variables, you must encode them as numbers before you can fit and evaluate a model. The two most popular techniques are Ordinal Encoding and One-Hot Encoding.
Ordinal encoding converts each label into an integer value, and the encoded data preserves the sequence of the labels; it is typically used when the categorical variables are ordinal.
Since the variables X0-X8 are categorical, we will need to convert them into numerical to use them in different models. We are using the OrdinalEncoding for this conversion. The TD_OrdinalEncodingFit function identifies distinct categorical values from the input table or a user defined list and returns the distinct categorical values along with the ordinal value for each category.
The ordinal encoding is fit on the full dataset and applied through the transform function, so the train and test splits created later share the same encoding. A conceptual sketch follows below.
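For intuition only, here is what an ordinal encoding does on a tiny, made-up column (a local pandas sketch; the in-database TD_OrdinalEncodingFit/TD_OrdinalEncodingTransform calls below do the real work):
# Conceptual illustration of ordinal encoding on a hypothetical mini column.
demo = pd.DataFrame({'X0': ['az', 'bc', 'az', 't', 'bc']})
categories = sorted(demo['X0'].unique())                # distinct categories, e.g. ['az', 'bc', 't']
mapping = {cat: i for i, cat in enumerate(categories)}  # each category gets an ordinal value
demo['X0_encoded'] = demo['X0'].map(mapping)
print(demo)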
# Ordinal encoding for Dataset
query = '''Select * from TD_OrdinalEncodingFit (ON
DEMO_GreenManufacturing.Manufacturing_Data AS INPUTTABLE
OUT volatile table outputtable(Ordinal_fit_output)
USING TargetColumn('X0','X1','X2','X3','X4','X5','X6','X8')) as dt;
'''
try:
eng.execute(query)
except:
eng.execute('DROP TABLE Ordinal_fit_output;')
eng.execute(query)
Ordinal encoding transform is used on the ordinal encoding fit data to get the numerical values for the categorical values.
The TD_OrdinalEncodingTransform function maps the categorical value to a specified ordinal value using the TD_OrdinalEncodingFit output.
# Ordinal Transform for Dataset
query = '''Create multiset table Ordinal_transform_output as (SELECT * FROM TD_OrdinalEncodingTransform (
ON DEMO_GreenManufacturing.Manufacturing_Data AS InputTable
ON Ordinal_fit_output AS FitTable DIMENSION
USING
Accumulate ('id','y')
) as dt) with data primary index("ID");
'''
try:
eng.execute(query)
except:
eng.execute('DROP TABLE Ordinal_transform_output;')
eng.execute(query)
pd.read_sql('SELECT * FROM Ordinal_transform_output', eng)
7. Preparation of Data
In the steps below, we prepare the data by joining the converted categorical features with some other important features to be used in model training, scoring, and evaluation.
We join the converted DataFrame with the important numerical features to get the final dataset that will be used by the models.
Get the output of the ordinal transform into a DataFrame.
OrdTransdf = DataFrame(in_schema('demo_user', 'Ordinal_transform_output'))
We join the converted dataframe and the original data to get important numerical features.
datadf=datadf.drop(columns=["X0", "X1","X2","X3","X4","X5","X6","X8"])
datadf_join = OrdTransdf.join(other = datadf, on = ["ID"], how = "left",lsuffix='t1',rsuffix='t2')
datadf_join=datadf_join.drop(columns=["t2_ID", "t2_y"])
datadf_join = datadf_join.assign(ID=datadf_join.t1_ID, y=datadf_join.t1_y)
datadf_join = datadf_join.drop(columns=["t1_ID", "t1_y"])
Create a final dataframe with only the required important features along with the ID and the response column.
data_new_df = datadf_join[["ID",
"y",
"X0",
"X1",
"X2",
"X3",
"X4",
"X5",
"X6",
"X8",
"X47",
"X314",
"X118",
"X315",
"X127",
"X29",
"X115",
"X351",
"X151"]]
copy_to_sql(df = data_new_df, table_name = 'final_data',if_exists='replace')
We split the data into Train and Test data.
query = '''Create table TTS_output as (
SELECT * FROM TD_TrainTestSplit(
ON final_data AS InputTable
USING
IDColumn('Id')
Seed(42)
)AS dt
) with data;
'''
try:
eng.execute(query)
except:
eng.execute('DROP TABLE TTS_output;')
eng.execute(query)
query = '''Create multiset table final_train_data as (
SELECT * FROM TTS_OUTPUT WHERE TD_IsTrainRow = 1
) with data;
'''
try:
eng.execute(query)
except:
eng.execute('DROP TABLE final_train_data;')
eng.execute(query)
query = '''Create multiset table final_test_data as (
SELECT * FROM TTS_OUTPUT WHERE TD_IsTrainRow = 0
) with data;
'''
try:
eng.execute(query)
except:
eng.execute('DROP TABLE final_test_data;')
eng.execute(query)
train_new_df=DataFrame(in_schema('demo_user', 'final_train_data'),index=True,index_label='ID')
test_new_df=DataFrame(in_schema('demo_user', 'final_test_data'),index=True,index_label='ID')
8. Decision Forest
The Decision Forest is a powerful method for predicting outcomes in both classification and regression problems. It improves on the technique of combining (or "bagging") multiple decision trees: instead of evaluating every input feature when selecting a split point, the algorithm considers only a random subset of features at each split. This forces each decision tree in the forest to be different from the others, which ultimately improves prediction accuracy. The TD_DecisionForest function uses a training dataset to create a predictive model, and the TD_DecisionForestPredict function then uses that model to make predictions. The function supports regression, binary, and multi-class classification.
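For intuition only, the same random-feature-subset idea is what scikit-learn exposes through the max_features parameter of its random forest. The short local sketch below assumes train_X and train_y from section 5 are still in memory and is not part of the in-database workflow:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Local sketch: a tiny random forest that considers only a random subset of
# features at each split, mirroring the idea behind TD_DecisionForest.
rf = RandomForestRegressor(n_estimators=4, max_depth=12, max_features='sqrt', random_state=1)
rf.fit(train_X, train_y)
print('In-sample R2 of the local sketch:', round(r2_score(train_y, rf.predict(train_X)), 3))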
query = '''Create table DF_train as (
SELECT * FROM TD_DecisionForest (
ON final_train_data AS INPUTTABLE partition by ANY
USING
ResponseColumn('y')
InputColumns('id','X0','X1','X2','X3','X4','X5','X6','X8','X47','X314','X118','X315','X127','X29','X115','X351','X151')
MaxDepth(12)
MinNodeSize(1)
NumTrees(4)
ModelType('REGRESSION')
Seed(1)
Mtry(-1)
MtrySeed(1)
) AS dt
) with data;
'''
try:
eng.execute(query)
except:
eng.execute('DROP TABLE DF_train;')
eng.execute(query)
The TD_DecisionForestPredict function uses the model output by the TD_DecisionForest function to analyze the input data and make predictions. For classification models, it outputs the probability that each observation belongs to the predicted class. Processing time is driven by the number of trees in the model; when there are more trees than can fit in memory, the trees are cached in local spool space.
query = '''
Create table DF_Predict as (
SELECT * FROM TD_DecisionForestPredict (
ON final_test_data AS InputTable PARTITION BY ANY
ON DF_Train AS ModelTable DIMENSION
USING
IdColumn ('id')
Detailed('false')
Accumulate('y')
) AS dt) with data;'''
try:
eng.execute(query)
except:
eng.execute('DROP TABLE DF_Predict;')
eng.execute(query)
df_result = DataFrame(in_schema('demo_user', 'DF_Predict'))
df_result_pd=df_result.to_pandas().reset_index().sort_values("ID")
df_result_pd
plt.figure(figsize=(20,20))
plt.ylabel('Time in Test Cycle', fontsize = 14)
plt.xlabel('Vehicle ID', fontsize = 14)
plt.plot(df_result_pd['ID'][:50], df_result_pd['y'][:50], color='g', label='Actual Value')
plt.plot(df_result_pd['ID'][:50], df_result_pd['prediction'][:50], color='r', label='Predicted Value')
plt.title('Actual vs Predicted using Decision Forest Regression', fontsize = 18)
plt.legend()
# plt.ylim(0,1)
plt.show()
query = '''
SELECT * FROM TD_RegressionEvaluator(
ON DF_Predict as InputTable
USING
ObservationColumn('y')
PredictionColumn('prediction')
Metrics('MAE','MSE','RMSE','R2','FSTAT')
DegreesOfFreedom(5,28)
NUMOFINDEPENDENTVARIABLES(15)
) as dt;
'''
DF_eval=pd.read_sql(query, eng)
DF_eval['model']='DecisionForest'
DF_eval
9. XGBoost
The TD_XGBoost function implements eXtreme Gradient Boosting, a gradient boosted decision tree algorithm designed for speed and performance that has come to dominate applied machine learning.
In gradient boosting, each iteration fits a model to the residuals (errors) of the previous iteration in order to correct the mistakes of the existing ensemble. The predicted residual is multiplied by a learning rate (shrinkage factor) and then added to the previous prediction. Models are added sequentially until no further improvement can be made. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
Gradient boosting involves three elements: a loss function to be optimized, a weak learner (typically a decision tree) to make predictions, and an additive model that adds weak learners to minimize the loss function.
The loss function used depends on the type of problem being solved; for example, regression may use a squared error and binary classification may use a binomial loss. A benefit of gradient boosting is that a new boosting algorithm does not have to be derived for each loss function; the framework is generic enough that any differentiable loss function can be used. The TD_XGBoost function supports both regression and classification predictive modelling problems. The model that it creates is used by the TD_XGBoostPredict function for making predictions.
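To make the residual-fitting loop concrete, here is a toy sketch using shallow scikit-learn trees on a few of the already-encoded columns from section 5 (it assumes train_X and train_y are still in memory and only illustrates the mechanics, not the TD_XGBoost implementation):
from sklearn.tree import DecisionTreeRegressor

# Toy gradient boosting with squared-error loss: each round fits a shallow tree
# to the current residuals and adds its (shrunken) prediction to the ensemble.
X_toy = train_X[['X0', 'X2', 'X5']].values
y_toy = train_y.astype(float)

learning_rate = 0.5
pred = np.full_like(y_toy, y_toy.mean())            # start from the mean prediction

for _ in range(10):
    residuals = y_toy - pred                        # negative gradient of squared error
    stump = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals)
    pred += learning_rate * stump.predict(X_toy)    # shrink and add to the ensemble

print('Toy RMSE after boosting:', round(float(np.sqrt(np.mean((y_toy - pred) ** 2))), 3))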
query = '''create table xgb_model as (
SELECT * FROM TD_XGBoost (
ON final_train_data partition by ANY
OUT TABLE MetaInformationTable(xgb_out)
USING
ResponseColumn('y')
InputColumns('id','X0','X1','X2','X3','X4','X5','X6','X8','X47','X314','X118','X315','X127','X29','X115','X351','X151')
MaxDepth(6)
MinNodeSize(1)
NumBoostedTrees(20)
ModelType('REGRESSION')
Seed(1)
RegularizationLambda(1000)
ShrinkageFactor(0.5)
IterNum(10)
ColumnSampling(0.7)
) as dt) with data;
'''
try:
eng.execute(query)
except:
eng.execute('DROP TABLE xgb_model;')
eng.execute('DROP TABLE xgb_out;')
eng.execute(query)
TD_XGBoostPredict performs prediction on test input data using the many simple trees in the trained model. The test input data should have the same attributes that were used during training (up to 2048 attributes); these attributes are scored against the trees in the model.
The output contains a prediction for each data point in the test data, for both regression and classification models. For classification, the prediction probability is computed from the majority vote of the participating trees: a higher probability implies a more confident prediction, because more of the trees agree on the same class.
query = '''create table xgb_predict_out as (
SELECT * FROM TD_XGBoostPredict(
ON final_test_data as inputtable partition by ANY
ON xgb_model as modeltable dimension ORDER BY task_index, tree_num, iter, tree_order
USING
IdColumn('id')
ModelType('regression')
accumulate('y')
) as dt) with data;
'''
try:
eng.execute(query)
except:
eng.execute('DROP TABLE xgb_predict_out;')
eng.execute(query)
xgb_result = DataFrame(in_schema('demo_user', 'xgb_predict_out'))
xgb_result_pd=xgb_result.to_pandas().reset_index().sort_values("ID")
xgb_result_pd
plt.figure(figsize=(20,20))
plt.ylabel('Time in Test Cycle', fontsize = 14)
plt.xlabel('Vehicle ID', fontsize = 14)
plt.plot(xgb_result_pd['ID'][:50], xgb_result_pd['y'][:50], color='g', label='Actual Value')
plt.plot(xgb_result_pd['ID'][:50], xgb_result_pd['prediction'][:50], color='r', label='Predicted Value')
plt.title('Actual vs Predicted using XGBoost Regression', fontsize = 18)
plt.legend()
# plt.ylim(0,1)
plt.show()
The TD_RegressionEvaluator function computes metrics to evaluate and compare multiple models and summarizes how close predictions are to their expected values.
query = '''
SELECT * FROM TD_RegressionEvaluator(
ON xgb_predict_out as InputTable
USING
ObservationColumn('y')
PredictionColumn('prediction')
Metrics('MAE','MSE','RMSE','R2','FSTAT')
DegreesOfFreedom(5,48)
NUMOFINDEPENDENTVARIABLES(5)
) as dt;
'''
XGB_Eval=pd.read_sql(query, eng)
XGB_Eval['model']='XGBoost'
XGB_Eval
The output of the regression evaluator contains the MAE, MSE, RMSE, R2, and F-statistic values requested in the Metrics syntax element.
Thus we have used two different models to train on and predict the data, and the regression evaluator is used to evaluate and compare them. Teradata in-database functions are used for training, prediction, and evaluation. Since we are working with sample data, the resulting metrics may not be representative of what these models can achieve.
Root mean squared error (RMSE) is the most common metric for evaluating regression model performance. The basic idea is to measure how far the model's predictions are from the actual observed values, so a high RMSE is "bad" and a low RMSE is "good".
The coefficient of determination, more commonly known as R², measures the strength of the relationship between the response and the predictor variables in the model. It is the square of the correlation coefficient R, so its values fall in the range 0.0 to 1.0, and higher values of R-squared are better.
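As a cross-check, the same two metrics can be recomputed locally from the Decision Forest predictions brought back in section 8 (a sketch assuming df_result_pd is still in memory):
from sklearn.metrics import mean_squared_error, r2_score

# Recompute RMSE and R2 locally from actual vs. predicted values.
actual = df_result_pd['y']
predicted = df_result_pd['prediction']
rmse = np.sqrt(mean_squared_error(actual, predicted))
print('RMSE:', round(float(rmse), 3), ' R2:', round(r2_score(actual, predicted), 3))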
The metrics specified in the Metrics syntax element are displayed; for FSTAT, additional columns related to the F-test are included in the output.
Below we compare MAE, MSE, RMSE, and R2 for XGBoost and Decision Forest.
frames=[DF_eval,XGB_Eval]
result = pd.concat(frames)
result = result.set_index([['Decision Forest','XGBoost']])
result = result.drop(['model'],axis=1)
transposed_df_eval = result.transpose()
# transposed_df_eval.reset_index()
transposed_df_eval
10. Cleanup
Work Tables
eng.execute('DROP TABLE Ordinal_fit_output;')
eng.execute('DROP TABLE Ordinal_transform_output;')
eng.execute('DROP TABLE TTS_output;')
eng.execute('DROP TABLE final_train_data;')
eng.execute('DROP TABLE final_test_data;')
eng.execute('DROP TABLE DF_train;')
eng.execute('DROP TABLE DF_Predict;')
eng.execute('DROP TABLE xgb_model;')
eng.execute('DROP TABLE xgb_out;')
eng.execute('DROP TABLE xgb_predict_out;')
Databases and Tables
The following code will clean up tables and databases created above.
%run -i ../run_procedure.py "call remove_data('DEMO_GreenManufacturing');"
#Takes 40 seconds
remove_context()