#!/usr/bin/env python
# coding: utf-8

# # Tabular Playground Series - Nov 2021
# ![TabularPGSnov2021.jpeg](attachment:TabularPGSnov2021.jpeg)
#
# ## 1. Exploratory data analysis
#
# Load the required libraries and open the datasets downloaded from this link.

# In[1]:

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
from IPython.display import display
from IPython.display import HTML
import warnings

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf

pd.options.display.max_columns = 110
pd.options.display.max_rows = 400

from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import StandardScaler
from tensorflow import keras
from tensorflow.keras.regularizers import l1
from tensorflow.keras.regularizers import l2
from tensorflow.keras.backend import sigmoid
from tensorflow.keras import activations
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import initializers
from tensorflow.keras import layers
from tensorflow.keras import regularizers
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import InputLayer
from tensorflow.keras.constraints import max_norm
from tensorflow.keras.layers import LayerNormalization
from tensorflow.keras.layers.experimental.preprocessing import Normalization
from tensorflow.keras.callbacks import EarlyStopping

warnings.filterwarnings('ignore')

# Open dataset and get general statistical data.
# To calculate the mutual information values for the classifier I used my own function. So as not to clutter the notebook with unnecessary code, I load the summary of the mutual information values and the general statistics from a CSV file.

# In[2]:

get_ipython().run_cell_magic('time', '', 'train = pd.read_csv("data/train.csv")\ntest = pd.read_csv("data/test.csv")\nsubmission = pd.read_csv("data/sample_submission.csv")\ntrain.set_index("id", inplace=True)\n# Check nan values\nprint("The train has {} features with nan values.".format(list(train.isnull().sum().values > 0).count(True)))\nprint("The test has {} features with nan values.".format(list(test.isnull().sum().values > 0).count(True)))\nprint("The sample_submission has {} features with nan values.".format(list(submission.isnull().sum().values > 0).count(True)))\ntrain_mutual_clf = pd.read_csv("train_mutual_clf.csv")\ntrain_mutual_clf\n')

# As seen above, this is a classical binary classification task with continuous numeric features (x values) and a binary target (y values, 1 or 0). Formally, only 70 features have non-zero mutual information, and deleting the features with zero mutual information should increase accuracy. Experimentally, however, I found that deleting them decreases accuracy by 3-6%. The file `train_mutual_clf.csv` with the mutual information values can be downloaded here. Let's count the outliers.
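# A side note on the mutual information scores mentioned above: the function used to compute them is not shown in this notebook. As a minimal sketch only, and assuming scikit-learn's `mutual_info_classif` as a stand-in for that function, a similar table could be produced as follows. The data is synthetic (`make_classification`), and the names `X`, `y`, `features` and `mi_table` are illustrative, not taken from the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for the competition data (assumption: NOT the real train.csv)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=42)
features = [f"f{i}" for i in range(X.shape[1])]

# Estimate the mutual information between each feature and the binary target
mi = mutual_info_classif(X, y, random_state=42)

# Collect the scores in a table, highest score first
mi_table = (pd.DataFrame({"feature": features, "mutual_info": mi})
            .sort_values("mutual_info", ascending=False)
            .reset_index(drop=True))
print(mi_table)
```

# On the real data, `X` and `y` would come from `train`; the scores obtained this way would not necessarily match those stored in `train_mutual_clf.csv`.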

# In[3]:

get_ipython().run_cell_magic('time', '', 'def dfoutlsidx(dataframe):\n    """\n    Find the indexes of rows with outlier values: values less than\n    Q1 - 1.5*IQR or more than Q3 + 1.5*IQR for the continuous features.\n    Parameters\n    ----------\n    dataframe : tested pandas dataframe; the last column is the target.\n    Returns\n    -------\n    list of indexes of rows with outlier values.\n    """\n    df = dataframe.copy()\n    outliers = set()\n    features = list(df.columns)[:-1]\n    for feature in features:\n        quant_25 = df[feature].quantile(0.25)\n        quant_75 = df[feature].quantile(0.75)\n        delta = 1.5*(quant_75 - quant_25)\n        # rows where this feature lies outside the IQR fences\n        df_feature = set(df[(df[feature] < quant_25 - delta)\n                            | (df[feature] > quant_75 + delta)].index)\n        outliers.update(df_feature)\n    return list(outliers)\n\noutls_idx = dfoutlsidx(train)\nprint("Train dataset contains {:,} rows with outlier values out of {:,} rows.\\n\\\nShare of rows with outliers {:.3f}%".format(len(outls_idx), train.shape[0],\n        len(outls_idx)/train.shape[0]*100.0))\n')

# As seen above, almost all rows of the `train` dataset contain outliers.
#
# ## 2. Determine the model.
#
# To be honest, I was surprised that DL, like ML, has no clear criteria for building a model (the number of hidden layers, the total number of neurons and the number of neurons in each layer). You have to create the model empirically, based on rules of thumb and your own experience or imagination. For example:
#
# `I have a few rules of thumb that I use to choose hidden layers. There are many rule-of-thumb methods for determining an acceptable number of neurons to use in the hidden layers, such as the following:
# 1. The number of hidden neurons should be between the size of the input layer and the size of the output layer.
# 2. The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
# 3. The number of hidden neurons should be less than twice the size of the input layer.`
#
# In the first iteration I tried to use xgboost, random forest and SVM from scikit-learn to solve this binary classification problem. When the cross-validation pipeline with these three algorithms on the train dataset had not finished after more than 12 hours, I decided to use TensorFlow. SVM had the highest accuracy, ~0.64 (train:test ratio = 1:4, 3 trials).
#
# In the second iteration I started with TensorFlow and tried to write my own functions to determine the optimal number of neurons, hidden layers, activation functions, etc. I tried to use KerasClassifier, but I could not connect the validation loss metric to it. As a result, I got a bunch of spaghetti code with monstrous time costs. Realizing that I was sinking into an abyss of writing time-consuming functions, I searched the Internet and found AutoKeras.
#
# With the help of AutoKeras and Keras Tuner I created 3 models in about five hours:
# 1. `automl_clf` is a standard binary classifier with default settings from AutoKeras.
# 2. `automl_tuner` is `automl_clf` regularized with l1 and l2 and renamed.
# 3. `automl_regr` is a standard linear regression model with default settings from AutoKeras.
# The code for finding all three models can be found at this link.
#
# After the search finished, these models were exported to `json` or dumped into a text file with the `get_config` method, and then retyped manually. In these cases, batch_size = 1024 - 2048 gives an acceptable calculation speed for defining the models and hyperparameters. Experimentally, I found that only StandardScaler gives the maximum `validation_accuracy` and the minimum `validation_loss`.
#
#
# ## 3. Train models
#
# ### 3.1 Train `automl_clf` model.
#
# Load required functions:

# In[4]:

get_ipython().run_cell_magic('time', '', 'def dfsplit(dataframe, scaler=None):\n    """\n    Split dataframe to x_train, y_train, x_test, y_test on ratio 4:1.\n    Possible scale/transform options for x features:\n    1. None - no scaling or transformation\n    2. "ptbc" - PowerTransformer by Box-Cox\n    3. "ptyj" - PowerTransformer by Yeo-Johnson\n    4. "rb" - RobustScaler\n    5. "ss" - StandardScaler\n    To prevent data leakage, a separate scaler/transformer instance is\n    used for each of the train and test parts.\n    Parameters\n    ----------\n    dataframe : pandas dataframe with numeric values of features.\n    scaler : None or str, optional. The default is None.\n    Returns\n    -------\n    x_train, x_test, y_train, y_test - numpy arrays.\n    """\n    df = dataframe.copy()\n    mms_train = MinMaxScaler(feature_range=(1, 2))\n    mms_test = MinMaxScaler(feature_range=(1, 2))\n    ptbc_train = PowerTransformer(method=\'box-cox\')\n    ptbc_test = PowerTransformer(method=\'box-cox\')\n    ptyj_train = PowerTransformer()\n    ptyj_test = PowerTransformer()\n    rb_train = RobustScaler(unit_variance=True)\n    rb_test = RobustScaler(unit_variance=True)\n    ss_train = StandardScaler()\n    ss_test = StandardScaler()\n    # split dataframe into train and test x and y numpy arrays\n    x_all, y_all = df.iloc[:,:-1].values, np.ravel(df.iloc[:,[-1]].values)\n    x_train, x_test, y_train, y_test = train_test_split(x_all, y_all,\n                                                        test_size=0.2,\n                                                        random_state=42,\n                                                        stratify=y_all)\n    # Transform or scale\n    scalers = [None, "ptbc", "ptyj", "rb", "ss"]\n    if scaler is None:\n        x_train, x_test = x_train[:,:], x_test[:,:]\n\n    elif scaler == "ptbc":\n        # Box-Cox needs strictly positive input, hence MinMaxScaler to (1, 2) first\n        x_train, x_test = \\\n            ptbc_train.fit_transform(mms_train.fit_transform(x_train[:,:])), \\\n            ptbc_test.fit_transform(mms_test.fit_transform(x_test[:,:]))\n\n    elif scaler == "ptyj":\n        x_train, x_test = \\\n            ptyj_train.fit_transform(x_train[:,:]), \\\n            ptyj_test.fit_transform(x_test[:,:])\n    elif scaler == "rb":\n        x_train, x_test = \\\n
rb_train.fit_transform(x_train[:,:]), \\\n            rb_test.fit_transform(x_test[:,:])\n    elif scaler == "ss":\n        x_train, x_test = \\\n            ss_train.fit_transform(x_train[:,:]), \\\n            ss_test.fit_transform(x_test[:,:])\n    if scaler not in scalers:\n        raise ValueError("Value error for \'scaler\'! Enter None, "\n                         "\'ptbc\', \'ptyj\', \'rb\' or \'ss\' value for scaler!")\n    return x_train, x_test, y_train, y_test\n\n\ndef df_transform(dataframe, scaler=None, y=True):\n    """\n    Scale or transform the x features of the whole dataframe (no\n    train/test split).\n    Possible scale/transform options for x features:\n    1. None - no scaling or transformation\n    2. "ptbc" - PowerTransformer by Box-Cox\n    3. "ptyj" - PowerTransformer by Yeo-Johnson\n    4. "rb" - RobustScaler\n    5. "ss" - StandardScaler\n    Parameters\n    ----------\n    dataframe : pandas dataframe with numeric values of features.\n    scaler : None or str, optional. The default is None.\n    y : bool, True if the last column of the dataframe is the target.\n        The default is True.\n    Returns\n    -------\n    If y==True: x_all, y_all - numpy arrays.\n    If y==False: x_all - numpy array.\n    """\n    df = dataframe.copy()\n    mms_all = MinMaxScaler(feature_range=(1, 2))\n    ptbc_all = PowerTransformer(method=\'box-cox\')\n    ptyj_all = PowerTransformer()\n    rb_all = RobustScaler(unit_variance=True)\n    ss_all = StandardScaler()\n\n    # separate x and y numpy arrays\n    if y not in [True, False]:\n        raise ValueError("Y value error! Enter True or False!")\n    if y==True:\n        x_all, y_all = df.iloc[:,:-1].values, np.ravel(df.iloc[:,[-1]].values)\n    elif y==False:\n        x_all = df.iloc[:,:].values\n    # Transform or scale x_all\n    scalers = [None, "ptbc", "ptyj", "rb", "ss"]\n    if scaler is None:\n        x_all = x_all[:,:]\n\n    elif scaler == "ptbc":\n        x_all = ptbc_all.fit_transform(mms_all.fit_transform(x_all[:,:]))\n\n    elif scaler == "ptyj":\n        x_all = ptyj_all.fit_transform(x_all[:,:])\n\n    elif scaler == "rb":\n        x_all = 
rb_all.fit_transform(x_all[:,:])\n\n    elif scaler == "ss":\n        x_all = ss_all.fit_transform(x_all[:,:])\n\n    if scaler not in scalers:\n        raise ValueError("Value error for \'scaler\'! Enter None, "\n                         "\'ptbc\', \'ptyj\', \'rb\' or \'ss\' value for scaler!")\n    if y==True:\n        return x_all, y_all\n    elif y==False:\n        return x_all\n\n\ndef automl_clf(shape_x, learn_rate=0.01):\n    """\n    Model recreated manually from the json model file exported from auto-keras\n    Parameters\n    ----------\n    shape_x : integer, number of features in the dataset.\n    learn_rate : float, value for learning_rate of optimizer.\n        Default value of learn_rate = 0.01.\n    Returns\n    -------\n    model : the keras model\n    """\n    model = Sequential()\n    # 0.Input\n    model.add(InputLayer(input_shape=(shape_x,), dtype=\'float64\', name="input_1"))\n    # Normalize input (analogue of StandardScaler)\n    model.add(Normalization(name=\'normalization\'))\n\n    # Hidden layer 1\n    # 1.1 Dense (linear) part of the first hidden layer\n    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",\n                           bias_initializer=\'zeros\', name="layer_1"))\n    # 1.2 Activation for first hidden layer\n    model.add(layers.Activation(activations.relu, name="relu_1"))\n    model.add(layers.Dropout(.25))\n\n    # Hidden layer 2\n    # 2.1 Dense (linear) part of the second hidden layer\n    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",\n                           bias_initializer=\'zeros\', name="layer_2"))\n    # 2.2 Activation for second hidden layer\n    model.add(layers.Activation(activations.relu, name="relu_2"))\n    model.add(layers.Dropout(.25))\n\n    # Hidden layer 3\n    # 3.1 Dense (linear) part of the third hidden layer\n    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",\n                           bias_initializer=\'zeros\', name="layer_3"))\n    # 3.2 Activation for third hidden layer\n    model.add(layers.Activation(activations.relu, name="relu_3"))\n    model.add(layers.Dropout(.25))\n\n    # 4. 
Final sigmoid\n    model.add(layers.Dense(units=1, kernel_initializer="GlorotUniform",\n                           bias_initializer=\'zeros\', name="layer_4"))\n    model.add(layers.Activation(activations.sigmoid, name="sigmoid_1"))\n\n    model.compile(loss=\'binary_crossentropy\',\n                  optimizer=tf.keras.optimizers.Adam(learning_rate=learn_rate),\n                  metrics=[\'accuracy\', tf.keras.metrics.AUC(name=\'auc\')])\n    return model\n\n\ndef train_model(model, dataframe, batch_sz=16384, stop_no=30, scaler=None,\n                estimator="clf"):\n    """\n    Scale / transform numeric features, then fit and train the model.\n    Parameters\n    ----------\n    model : keras model for fitting data.\n    dataframe : pandas dataframe with numeric values of x and y.\n    batch_sz : integer, size of the batch, optional. The default is 16384.\n    stop_no : integer, patience (number of epochs without improvement)\n        for the EarlyStopping callback, optional. The default is 30.\n    scaler : None or str, available values - None, "ptbc", "ptyj",\n        "rb", "ss", optional. Default is None.\n    estimator : str, "clf" or "regr", selects the metric used to sort\n        the history, optional. Default is "clf".\n    Returns\n    -------\n    model : keras fitted and trained model\n    hist_stat : pandas dataframe with values of metrics and epochs for\n        model.\n    """\n    callbacks = [EarlyStopping(monitor=\'val_loss\', mode=\'min\',\n                               patience=stop_no, restore_best_weights=True)]\n    df = dataframe.copy()\n    # split and scale or transform features\n    x_train, x_test, y_train, y_test = dfsplit(df, scaler=scaler)\n    # Fit and train model\n    history = model.fit(x_train, y_train,\n                        batch_size=batch_sz,\n                        epochs=10000,\n                        validation_data=(x_test, y_test),\n                        callbacks=callbacks,\n                        verbose=0)\n    # Export history to dataframe\n    hist_stat = pd.DataFrame(history.history)\n    hist_stat["epochs"] = np.array(list(hist_stat.index))+1\n    if estimator == "clf":\n        hist_stat.sort_values("val_accuracy", ascending=False, inplace=True)\n    elif estimator == "regr":\n        hist_stat.sort_values("val_mean_squared_error", ascending=True, inplace=True)\n    estimators = ["clf", "regr"]\n    if estimator not in estimators:\n        return "Estimator value error!", "Enter \'clf\' 
or \'regr\'!"\n    hist_stat.reset_index(drop=True, inplace=True)\n    return model, hist_stat\n\n\n# Get model and model history\nautoml_clf_ss, automl_stat_clf_ss = train_model(automl_clf(train.shape[1]-1),\n        train, batch_sz=2048, scaler=\'ss\')\nautoml_stat_clf_ss\n')

# ### 3.2. Train `automl_tuner`.

# In[5]:

get_ipython().run_cell_magic('time', '', 'def automl_tuner(shape_x):\n    """\n    l1/l2 regularized variant of automl_clf.\n    shape_x : integer, number of features in the dataset.\n    """\n    learning_rate = 0.0012589254117941675\n    l1_kernel = 0.0023713737056616554\n    l2_bias = 0.0007943282347242813\n    l1_val = 0.0001258925411794166\n\n    model = Sequential()\n    # 0.Input\n    model.add(InputLayer(input_shape=(shape_x,), dtype=\'float64\', name="input_1"))\n    model.add(Normalization(name=\'normalization\'))\n\n    # Hidden layer 1\n    # 1.1 Dense (linear) part of the first hidden layer\n    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",\n                           bias_initializer=\'zeros\', name="layer_1"))\n    # Additional dense layer with l1 kernel, l2 bias and l1 activity regularization\n    model.add(layers.Dense(\n        units=32, kernel_regularizer=tf.keras.regularizers.l1(l1_kernel),\n        bias_regularizer=tf.keras.regularizers.l2(l2_bias),\n        activity_regularizer=tf.keras.regularizers.l1(l1_val)))\n    # 1.2 Activation for first hidden layer\n    model.add(layers.Activation(activations.relu, name="relu_1"))\n    model.add(layers.Dropout(.25))\n\n    # Hidden layer 2\n    # 2.1 Dense (linear) part of the second hidden layer\n    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",\n                           bias_initializer=\'zeros\', name="layer_2"))\n    # Additional dense layer with l1 kernel, l2 bias and l1 activity regularization\n    model.add(layers.Dense(\n        units=32, kernel_regularizer=tf.keras.regularizers.l1(l1_kernel),\n        bias_regularizer=tf.keras.regularizers.l2(l2_bias),\n        activity_regularizer=tf.keras.regularizers.l1(l1_val)))\n    # 2.2 Activation for second hidden layer\n    model.add(layers.Activation(activations.relu, name="relu_2"))\n    model.add(layers.Dropout(.25))\n\n    # Hidden layer 3\n    # 3.1 Dense (linear) part of the third hidden layer\n    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",\n                           bias_initializer=\'zeros\', 
name="layer_3"))\n    # Additional dense layer with l1 kernel, l2 bias and l1 activity regularization\n    model.add(layers.Dense(\n        units=32, kernel_regularizer=tf.keras.regularizers.l1(l1_kernel),\n        bias_regularizer=tf.keras.regularizers.l2(l2_bias),\n        activity_regularizer=tf.keras.regularizers.l1(l1_val)))\n    # 3.2 Activation for third hidden layer\n    model.add(layers.Activation(activations.relu, name="relu_3"))\n    model.add(layers.Dropout(.25))\n\n    # 4. Final sigmoid\n    model.add(layers.Dense(units=1, kernel_initializer="GlorotUniform",\n                           bias_initializer=\'zeros\', name="layer_4"))\n    model.add(layers.Activation(activations.sigmoid, name="sigmoid_1"))\n\n    model.compile(loss=\'binary_crossentropy\',\n                  optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),\n                  metrics=[\'accuracy\', tf.keras.metrics.AUC(name=\'auc\')])\n    return model\n\nautoml_clf_ss_tuner, automl_stat_clf_ss_tuner = train_model(automl_tuner(train.shape[1]-1),\n        train, batch_sz=2048, scaler=\'ss\')\nautoml_stat_clf_ss_tuner\n')

# As seen above, the regularized classifier does not radically improve validation_accuracy. In this case the difference is within the statistical error, while the number of iterations roughly doubles compared with the non-regularized classifier.
#
# ### 3.3 Train `automl_regr`

# In[6]:

get_ipython().run_cell_magic('time', '', 'def automl_regr(shape_x, learn_rate=0.001):\n    """\n    Regression model recreated manually from the json model file exported from auto-keras\n    Parameters\n    ----------\n    shape_x : integer, number of features in the dataset.\n    learn_rate : float, value for learning_rate of optimizer.
\n        Default value of learn_rate = 0.001.\n    Returns\n    -------\n    model : the keras model\n    """\n    model = Sequential()\n    # 0.Input\n    model.add(InputLayer(input_shape=(shape_x,), dtype=\'float64\', name="input_1"))\n    # Normalization input\n    model.add(Normalization(name=\'normalization\'))\n\n    # Hidden layer 1\n    # 1.1 Dense (linear) part of the first hidden layer\n    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",\n                           bias_initializer=\'zeros\', name="layer_1"))\n    # 1.2 Activation for first hidden layer\n    model.add(layers.Activation(activations.relu, name="relu_1"))\n\n    # Hidden layer 2\n    # 2.1 Dense (linear) part of the second hidden layer\n    model.add(layers.Dense(units=32, kernel_initializer="GlorotUniform",\n                           bias_initializer=\'zeros\', name="layer_2"))\n    # 2.2 Activation for second hidden layer\n    model.add(layers.Activation(activations.relu, name="relu_2"))\n    model.add(layers.Dropout(.25))\n\n    # 3. Final linear\n    model.add(layers.Dense(units=1, kernel_initializer="GlorotUniform",\n                           bias_initializer=\'zeros\', name="layer_3"))\n    model.add(layers.Activation(activations.linear, name="regression_head_1"))\n\n    model.compile(loss=\'mean_squared_error\',\n                  optimizer=tf.keras.optimizers.Adam(learning_rate=learn_rate),\n                  metrics=[\'mean_squared_error\'])\n    return model\n\n\nautoml_regr_ss, automl_regr_ss_stat = train_model(automl_regr(train.shape[1]-1),\n        train, batch_sz=2048, scaler=\'ss\',\n        estimator="regr")\nautoml_regr_ss_stat\n')

# ## 4. 
Define the best estimator

# In[7]:

get_ipython().run_cell_magic('time', '', '# Transform and divide x and y for train dataset\nx_all, y_all = df_transform(train, scaler="ss")\n\n# Predict y_all for all models and select best estimator using accuracy metric\ny_pred_automl_clf = automl_clf_ss.predict(x_all, batch_size=2048, verbose=1)\ny_pred_automl_clf_conv = np.where(y_pred_automl_clf < 0.5, 0, 1)\n\n\ny_pred_automl_clf_tuner = automl_clf_ss_tuner.predict(x_all, batch_size=2048,\n                                                      verbose=1)\ny_pred_automl_clf_tuner_conv = np.where(y_pred_automl_clf_tuner < 0.5, 0, 1)\n\n\ny_pred_automl_regr_ss = automl_regr_ss.predict(x_all, batch_size=2048,\n                                               verbose=1)\ny_pred_automl_regr_ss_conv = np.where(y_pred_automl_regr_ss < 0.5, 0, 1)\n')

# Compare accuracy:

# In[8]:

get_ipython().run_cell_magic('time', '', 'accuracy_automl_clf = accuracy_score(y_all, y_pred_automl_clf_conv)\naccuracy_automl_clf_tuner = accuracy_score(y_all, y_pred_automl_clf_tuner_conv)\naccuracy_automl_regr = accuracy_score(y_all, y_pred_automl_regr_ss_conv)\n')

# In[9]:

print(f"Accuracy for `automl_clf` model: {accuracy_automl_clf:.4f}.")
print(f"Accuracy for `automl_clf_tuner` model: {accuracy_automl_clf_tuner:.4f}.")
print(f"Accuracy for `automl_regr` model: {accuracy_automl_regr:.4f}.")

# As seen above, the `automl_regr` model has the best accuracy.
#
# ## 5. 
Predict target for test dataset using `automl_regr` model

# In[10]:

# Open and read test dataset
test = pd.read_csv("data/test.csv")
test.set_index("id", inplace=True)
# Set id as index for submission
# submission.set_index("id", inplace=True)
# Convert test x values with StandardScaler
test_ss = df_transform(test, scaler="ss", y=False)
# Predict target and convert to binary format
predict_target = automl_regr_ss.predict(test_ss, batch_size=2048, verbose=1)
predict_target_conv = np.where(predict_target < 0.5, 0, 1).ravel()
# Fill `target` column
submission["target"] = predict_target_conv
# Save predictions (without the extra pandas index column)
submission.to_csv("submission_pred.csv", index=False)
submission.head(18)

# ## 6. Conclusions
#
#
# 1. Auto-Keras gives a completely workable model and may in the future eliminate the time-consuming manual work of building an optimal model and selecting hyperparameters.
#
#
# 2. In this case, the l1/l2 regularization of the classifier model did not bring any visible improvement and only increased the execution time. It turned out that a simple regression has accuracy comparable with the l1/l2 regularized classifier. l1/l2 regularization isn't always the silver bullet. A possible reason is the huge number of outliers.
#
#
# 3. I was pleased with the speed of the old GTX 1050 graphics accelerator with 2GB RAM on a data set containing 600K rows and 100 columns (60M cells), which takes several minutes to process and cross-validate. As I wrote earlier, when I tried to choose the optimal algorithm from classic ML among boosting, random forest and SVM, cross-validation took more than 12 hours. ML is dead, long live DL: using ML is justified on small datasets with several thousand rows, and as the amount of data grows ML loses in speed to DL. Advice for those who do not have a modern video card: lower the TF version. For example, the GTX 1050 with 2GB RAM handles this dataset correctly with TF 2.5, while with later TF versions out-of-memory problems begin.
#
#
# 4. 
This dataset itself turned out to be a tough nut to crack. High accuracy was obtained only with StandardScaler. All other transformations from classical ML (MinMaxScaler, RobustScaler, QuantileTransformer, PowerTransformer, KBinsDiscretizer, Normalizer) gave lower accuracy and longer execution times. Also, the outlier removal algorithms from scikit-learn do not work with this dataset.
#
#
# 5. I avoided data leakage everywhere, but if standard scaling is applied to the entire train dataset, accuracy can be increased by roughly 0.005.
#
#
# Created on March 08, 2022
#
# @author: Vadim Maklakov, used some ideas from public Internet resources.
#
# © 3-clause BSD License
#
# Software environment: Debian 11, Python 3.8.12, TensorFlow 2.5.1 for notebook, TensorFlow 2.8 for defining model with Auto-Keras.
#
# See the required installed and imported Python modules in cell No 1.