#!/usr/bin/env python
# coding: utf-8

# ## Predicting Listing Gains in the Indian IPO Market Using TensorFlow
#
# I develop a deep learning classification model to predict listing gains for Initial Public Offerings (IPOs) in the Indian market. This model can be useful for making investment decisions in the IPO market.

# ### Preliminary data exploration
# The dataset is available under the file name `data.csv` (https://github.com/magorshunov/predicting_ipo_gains/blob/main/data.csv). Listing gains are the percentage increase in the share price of a company from its IPO issue price on the day of listing.

# The data consists of the following columns:
# - `Date`: date when the IPO was listed
# - `IPOName`: name of the IPO
# - `Issue_Size`: size of the IPO issue, in INR Crores
# - `Subscription_QIB`: number of times the IPO was subscribed by the QIB (Qualified Institutional Buyer) investor category
# - `Subscription_HNI`: number of times the IPO was subscribed by the HNI (High Networth Individual) investor category
# - `Subscription_RII`: number of times the IPO was subscribed by the RII (Retail Individual Investors) investor category
# - `Subscription_Total`: total number of times the IPO was subscribed overall
# - `Issue_Price`: the price in INR at which the IPO was issued
# - `Listing_Gains_Percent`: percentage gain in the listing price over the issue price

# In[32]:


import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt


# In[33]:


df = pd.read_csv('data.csv')
print(df.shape)
df.head()


# In[34]:


df.describe(include='all')


# In[35]:


df['Listing_Gains_Percent'].describe()


# In[36]:


df.isnull().sum()


# ## Exploring the Data
# The `Listing_Gains_Percent` target variable is continuous.
# Therefore, I need to convert it into a categorical variable before proceeding. Approximately 55% of the IPOs listed at a profit, so the data is fairly balanced. I have also dropped some variables that might not have predictive power.

# In[37]:


# 1 if the IPO listed above its issue price, 0 otherwise
df['Listing_Gains_Profit'] = df['Listing_Gains_Percent'].apply(lambda x: 1 if x > 0 else 0)


# In[38]:


df['Listing_Gains_Profit'].describe()


# In[39]:


df['Listing_Gains_Profit'].value_counts(normalize=True)


# In[40]:


df.drop(['Date ', 'IPOName', 'Listing_Gains_Percent'], axis=1, inplace=True)
df.info()


# ## Data Visualization
# I will check the distribution of the predictors with respect to the target variable, since it could be informative for modeling. To do that, I will:
# - Create a countplot to visualize the distribution of the target variable, and give the plot a proper title.
# - Use plots to check for outliers in each of the continuous variables of the dataset.
# - Use visualizations to check the relationship between the selected predictor variables and the target variable, and check whether segmenting the plots by outcome class provides any meaningful insight.
# - Use visualizations to check whether there are correlations between predictor variables.
#
# Here are some of the findings:
#
# 1. The histograms and the boxplots show that outliers are present in the data and might need outlier treatment.
#
# 2. The boxplot of `Issue_Price` with respect to `Listing_Gains_Profit` shows that there are more outliers for IPOs that listed at a loss than for IPOs that listed at a profit.
#
# 3. The scatterplot shows a correlation between Retail and Total IPO Subscription.
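# The correlation check between predictors can also be done numerically. The following is a minimal sketch on a small synthetic frame: the values are made up for illustration and are not taken from `data.csv`, and it assumes that Pearson correlation (via pandas' `DataFrame.corr`) is an acceptable measure here.

```python
import pandas as pd

# Synthetic stand-in for a few predictor columns (illustrative values only)
toy = pd.DataFrame({
    'Subscription_RII': [0.5, 1.2, 3.4, 7.8, 10.1],
    'Subscription_Total': [1.0, 2.5, 6.0, 15.0, 21.0],
    'Issue_Price': [900, 120, 450, 300, 75],
})

# Pairwise Pearson correlation between the numeric columns
corr = toy.corr(method='pearson')
print(corr.round(2))

# Correlation between retail and total subscription in the toy data
rii_total_corr = corr.loc['Subscription_RII', 'Subscription_Total']
```

# On the real data, the same call on `df` would give the full correlation matrix, which could then be passed to a heatmap.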
# In[41]:


# visualizing the target variable
sns.countplot(x='Listing_Gains_Profit', data=df)
plt.title('Distribution of IPO Listing Profit Category')
plt.xlabel('Listing Profit (No=0, Yes=1)')
plt.ylabel('Frequency')
plt.show()


# In[42]:


plt.figure(figsize=[8,5])
sns.histplot(data=df, x='Issue_Price', bins=50).set(title='Distribution of Issue_Price', ylabel='Count')
plt.show()


# In[43]:


plt.figure(figsize=[8,5])
sns.histplot(data=df, x='Issue_Size', bins=50).set(title='Distribution of Issue_Size', ylabel='Count')
plt.show()


# In[44]:


sns.boxplot(data=df, y='Issue_Size')
plt.title('Boxplot of Issue_Size')
plt.show()


# In[45]:


sns.boxplot(data=df, x='Listing_Gains_Profit', y='Issue_Price')
plt.title('Boxplot of Issue_Price with respect to Listing Gains Type')
plt.xlabel('Listing Profit (No=0, Yes=1)')
plt.show()


# In[46]:


print(df.skew())


# In[47]:


sns.boxplot(data=df, y='Subscription_QIB')
plt.title('Boxplot of Subscription_QIB')
plt.show()


# In[48]:


sns.scatterplot(data=df, x='Subscription_RII', y='Subscription_Total')
plt.title('Scatterplot between Retail and Total IPO Subscription')
plt.show()


# ## Outlier Treatment
# Apart from visual inspection, outliers can also be identified using the skewness or the interquartile range (IQR). There are different approaches to outlier treatment; the one I've used here identifies outliers with the interquartile method. Once I identified the outliers, I clipped the variable values to the lower and upper bounds.
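# The IQR clipping described above can be sketched as a small reusable helper. `clip_iqr` is a hypothetical name introduced for illustration and the toy values are synthetic; the multiplier of 1.5 is the conventional choice.

```python
import pandas as pd

def clip_iqr(series, k=1.5):
    """Clip a numeric Series to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return series.clip(lower, upper)

# Synthetic example: 100.0 lies far above the upper bound and gets clipped
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
clipped = clip_iqr(toy)
print(clipped.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

# Here Q1 = 2 and Q3 = 4, so the IQR is 2 and the bounds are -1 and 7; only the extreme value is changed.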
# In[49]:


# IQR bounds for Issue_Size: values more than 1.5 * IQR beyond the quartiles are treated as outliers
q1 = df['Issue_Size'].quantile(q=0.25)
q3 = df['Issue_Size'].quantile(q=0.75)
iqr = q3 - q1
lower = (q1 - 1.5 * iqr)
upper = (q3 + 1.5 * iqr)
print('IQR = ', iqr, '\nlower = ', lower, '\nupper = ', upper, sep='')


# In[50]:


# Clip the values to the IQR bounds
df['Issue_Size'] = df['Issue_Size'].clip(lower, upper)
df['Issue_Size'].describe()


# In[51]:


q1 = df['Subscription_QIB'].quantile(q=0.25)
q3 = df['Subscription_QIB'].quantile(q=0.75)
iqr = q3 - q1
lower = (q1 - 1.5 * iqr)
upper = (q3 + 1.5 * iqr)
print('IQR = ', iqr, '\nlower = ', lower, '\nupper = ', upper, sep='')


# In[52]:


df['Subscription_QIB'] = df['Subscription_QIB'].clip(lower, upper)
df['Subscription_QIB'].describe()


# In[53]:


q1 = df['Subscription_HNI'].quantile(q=0.25)
q3 = df['Subscription_HNI'].quantile(q=0.75)
iqr = q3 - q1
lower = (q1 - 1.5 * iqr)
upper = (q3 + 1.5 * iqr)
print('IQR = ', iqr, '\nlower = ', lower, '\nupper = ', upper, sep='')


# In[54]:


df['Subscription_HNI'] = df['Subscription_HNI'].clip(lower, upper)
df['Subscription_HNI'].describe()


# In[55]:


q1 = df['Subscription_RII'].quantile(q=0.25)
q3 = df['Subscription_RII'].quantile(q=0.75)
iqr = q3 - q1
lower = (q1 - 1.5 * iqr)
upper = (q3 + 1.5 * iqr)
print('IQR = ', iqr, '\nlower = ', lower, '\nupper = ', upper, sep='')


# In[56]:


df['Subscription_RII'] = df['Subscription_RII'].clip(lower, upper)
df['Subscription_RII'].describe()


# In[57]:


q1 = df['Subscription_Total'].quantile(q=0.25)
q3 = df['Subscription_Total'].quantile(q=0.75)
iqr = q3 - q1
lower = (q1 - 1.5 * iqr)
upper = (q3 + 1.5 * iqr)
print('IQR = ', iqr, '\nlower = ', lower, '\nupper = ', upper, sep='')


# In[58]:


df['Subscription_Total'] = df['Subscription_Total'].clip(lower, upper)
df['Subscription_Total'].describe()


# ## Setting the Target and Predictor Variables
# Before moving on to modeling, I will:
# - Create an array of the target variable (dependent variable).
# - Create an array of the predictor variables (independent variables).
# - Perform normalization on the predictor variables to scale their values to between 0 and 1.

# In[59]:


target_variable = ['Listing_Gains_Profit']
# Keep the DataFrame's column order so the feature order is the same on every run
# (a set difference does not guarantee a stable order)
predictors = [col for col in df.columns if col not in target_variable]
# Divide each predictor by its column maximum so the largest value becomes 1
df[predictors] = df[predictors]/df[predictors].max()
df.describe()


# ## Creating the Holdout Validation Approach
# I will use the holdout validation approach for model evaluation. In this approach, I split the data in a 70:30 ratio, using 70% of the data to train the model and the remaining 30% to test it.

# In[60]:


X = df[predictors].values
y = df[target_variable].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=100)
print(X_train.shape)
print(X_test.shape)


# ## Define the Deep Learning Classification Model
# In this step, I've defined the model by instantiating the Sequential model class in TensorFlow's Keras. The model architecture comprises four hidden layers with `relu` as the activation function. The output layer uses a `sigmoid` activation function, which is a good choice for a binary classification model.

# In[61]:


# define model
tf.random.set_seed(100)
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(32, input_shape=(X_train.shape[1],), activation='relu'))
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(8, activation='relu'))
model.add(tf.keras.layers.Dense(4, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))


# ## Compile and Train the Model
# Once I have defined the model, the next steps are to compile and train it. Compiling a model requires specifying the following:
# - An optimizer
# - A loss function
# - An evaluation metric
#
# After compiling the model, I fitted it on the training set. The accuracy improved over epochs.
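# To make the loss function and the evaluation metric more concrete, here is a minimal NumPy sketch of binary cross-entropy and accuracy on made-up probabilities. The helper names, the clipping epsilon, and the 0.5 decision threshold are assumptions chosen for illustration; this is not Keras' internal implementation.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_sample = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(per_sample))

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))

# Synthetic labels and sigmoid outputs (illustrative values only)
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.6, 0.4]

loss = binary_cross_entropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(f'loss = {loss:.4f}, accuracy = {acc:.2f}')
```

# The last sample is misclassified (probability 0.4 for a positive label), so it contributes the largest share of the loss and lowers the accuracy to 0.75.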
# In[62]:


model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])

# summary() prints the table itself and returns None, so it is not wrapped in print()
model.summary()


# In[63]:


model.fit(X_train, y_train, epochs=250)


# ## Model Evaluation
# The model evaluation output shows the performance of the model on both the training and test data. The accuracy was approximately 75% on the training data and 74% on the test data. The training and test accuracies are close to each other, which indicates that the accuracy doesn't drop much when I test the model on unseen data.

# In[64]:


model.evaluate(X_train, y_train)


# In[65]:


model.evaluate(X_test, y_test)


# ## Conclusion
#
# I have built a deep learning classification model using Keras, the deep learning framework in TensorFlow. I used an IPO dataset and built a classifier to predict whether an IPO will list at a profit or not. I used the Sequential API to build the model, which achieves a decent accuracy of 75% and 74% on the training and test data, respectively. The accuracy is consistent across the training and test datasets, which is a promising sign. This model can be useful for making investment decisions in the IPO market.
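# To put the reported accuracy in context, it can help to compare it with a majority-class baseline, i.e. always predicting the more frequent class. The sketch below uses synthetic labels with the roughly 55:45 class split noted during data exploration; `majority_baseline_accuracy` is a hypothetical helper and is not part of the original analysis.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class."""
    y = np.asarray(y).ravel().astype(int)
    counts = np.bincount(y)
    return counts.max() / y.size

# Synthetic labels: 55 profitable listings, 45 non-profitable ones
toy_labels = np.array([1] * 55 + [0] * 45)
baseline = majority_baseline_accuracy(toy_labels)
print(f'Majority-class baseline accuracy: {baseline:.2f}')
```

# The helper flattens its input first, so column-vector labels shaped like `y_test` can be passed directly.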
#
# ## PS: Alternative Approach via the Functional API

# In[68]:


# Same task with the Functional API: a wider network with dropout after the first two hidden layers
input_layer = tf.keras.Input(shape=(X_train.shape[1],))
hidden_layer1 = tf.keras.layers.Dense(128, activation='relu')(input_layer)
drop1 = tf.keras.layers.Dropout(rate=0.40)(hidden_layer1)
hidden_layer2 = tf.keras.layers.Dense(64, activation='relu')(drop1)
drop2 = tf.keras.layers.Dropout(rate=0.20)(hidden_layer2)
hidden_layer3 = tf.keras.layers.Dense(16, activation='relu')(drop2)
hidden_layer4 = tf.keras.layers.Dense(8, activation='relu')(hidden_layer3)
hidden_layer5 = tf.keras.layers.Dense(4, activation='relu')(hidden_layer4)
output_layer = tf.keras.layers.Dense(1, activation='sigmoid')(hidden_layer5)

model = tf.keras.Model(inputs=input_layer, outputs=output_layer)
model.summary()

optimizer = tf.keras.optimizers.Adam(0.001)
loss = tf.keras.losses.BinaryCrossentropy()
metrics = ['accuracy']

model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
model.fit(X_train, y_train, epochs=250, verbose=0)

# evaluate() returns [loss, accuracy] for the given data
print(model.evaluate(X_train, y_train))
print(model.evaluate(X_test, y_test))