Welcome to PyExplainer Quickstart Guide ¶

PART A - Quick Start¶

1. Prepare data and model¶

Note: we use the default data and model here as an example.

1.1 Import required libraries¶

In [1]:

from pyexplainer import pyexplainer_pyexplainer
from sklearn.ensemble import RandomForestClassifier

1.2 Obtain default dataset and train a global model (Random Forest)¶

In [2]:

default_data_and_model = pyexplainer_pyexplainer.get_dflt()
rf_model = RandomForestClassifier(random_state=0)
rf_model.fit(default_data_and_model['X_train'],
             default_data_and_model['y_train'])
py_explainer = pyexplainer_pyexplainer.PyExplainer(X_train = default_data_and_model['X_train'],
                           y_train = default_data_and_model['y_train'],
                           indep = default_data_and_model['indep'],
                           dep = default_data_and_model['dep'],
                           blackbox_model = rf_model)

🔧2. Create PyExplainer object¶

2.1 Prepare the data to be explained¶

In [3]:

X_explain = default_data_and_model['X_explain']
y_explain = default_data_and_model['y_explain']

2.2 Create rules¶

In [6]:

created_rules = py_explainer.explain(X_explain=X_explain,
                                     y_explain=y_explain,
                                     search_function='crossoverinterpolation',
                                     random_state=0,
                                     reuse_local_model=True)

3. Create interactive visualization¶

Move the sliders to change the feature values and observe how the risk score changes.

In [7]:

py_explainer.visualise(created_rules)

[Interactive widget output: a 'Risk Score' progress bar and one slider per rule, here '#1 The value of AddedLOC is more than 246.0' and '#2 The value of LOC is more than 11.0'.]

In [ ]:

import os
# enable the ipywidgets notebook extension (run this if the sliders above are not displayed)
os.system("jupyter nbextension enable --py widgetsnbextension")

PART B - Full Tutorial¶

1. Prepare sample data and model¶

1.1 For simplicity, we load the sample DataFrame that is already included in the package¶

In [6]:

import pandas as pd
import numpy as np
from pyexplainer import pyexplainer_pyexplainer

df = pyexplainer_pyexplainer.load_sample_data()
df.head(3)

Out[6]:

	File	AvgLineCode	CountLine	MaxCyclomatic	AvgEssential	CountDeclClassVariable	SumCyclomaticStrict	AvgCyclomatic	...	OWN_LINE	OWN_COMMIT	MINOR_LINE	MAJOR_COMMIT	MAJOR_LINE	RealBug	HeuBug
0	activemq-console/src/main/java/org/apache/acti...	10	171	5	2	0	18	2	...	1.00000	1.0	1	1	0	False	False
1	activemq-console/src/main/java/org/apache/acti...	8	123	5	1	1	15	3	...	0.98374	0.5	1	2	1	False	False
2	activemq-console/src/main/java/org/apache/acti...	7	136	5	1	1	16	2	...	1.00000	1.0	1	1	0	False	False

3 rows × 70 columns

1.2 Define index column (OPTIONAL) and drop unwanted columns¶

First, we set the 'File' col as the index col, since it identifies the file that we want to inspect and has nothing to do with the features or the label¶

We use 'RealBug' as the label col, and the cols before 'RealBug' as feature cols¶

Then we drop unnecessary cols (e.g. File, HeuBug, HeuBugCount, RealBugCount)¶

In [7]:

df = df.set_index(df['File'])
df = df.drop(['File', 'HeuBug', 'HeuBugCount', 'RealBugCount'], axis=1)
df.head(3)

Out[7]:

	CountDeclMethodPrivate	AvgLineCode	CountLine	MaxCyclomatic	CountDeclMethodDefault	AvgEssential	CountDeclClassVariable	SumCyclomaticStrict	AvgCyclomatic	AvgLine	...	DDEV	Added_lines	Del_lines	OWN_LINE	OWN_COMMIT	MINOR_COMMIT	MINOR_LINE	MAJOR_COMMIT	MAJOR_LINE	RealBug
File
activemq-console/src/main/java/org/apache/activemq/console/command/AbstractAmqCommand.java	0	10	171	5	0	2	0	18	2	18	...	1	32	18	1.00000	1.0	0	1	1	0	False
activemq-console/src/main/java/org/apache/activemq/console/command/AbstractCommand.java	0	8	123	5	0	1	1	15	3	17	...	2	30	28	0.98374	0.5	0	1	2	1	False
activemq-console/src/main/java/org/apache/activemq/console/command/AbstractJmxCommand.java	0	7	136	5	0	1	1	16	2	13	...	1	8	8	1.00000	1.0	0	1	1	0	False

3 rows × 66 columns
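As an aside, `set_index` also accepts a column name, in which case pandas moves that column into the index and removes it from the columns in one step. The sketch below shows this on a small made-up DataFrame; the rows and values are hypothetical, and only the column names mirror the sample data.

```python
import pandas as pd

# hypothetical miniature of the sample data, for illustration only
toy = pd.DataFrame({
    'File': ['a/Foo.java', 'a/Bar.java'],
    'CountLine': [171, 123],
    'HeuBug': [False, False],
    'RealBug': [False, True],
})

# passing the column name moves 'File' into the index and drops it from the columns
toy = toy.set_index('File')
# drop the remaining unwanted column; 'RealBug' stays as the last (label) column
toy = toy.drop(columns=['HeuBug'])

print(toy.index.name)
print(list(toy.columns))
```

With this form, 'File' no longer needs to appear in the list of columns to drop.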

1.3 Define feature cols (X), and label col (y)¶

The AutoSpearman function is used as a feature selection method to reduce the number of features¶

For more information about the algorithm, please refer to the AutoSpearman paper¶

In [8]:

from pyexplainer.pyexplainer_pyexplainer import AutoSpearman
# select all rows, and all feature cols
# the last col, which is label col, is not selected
X = df.iloc[:, :-1]
total_features = len(X.columns)

# apply feature selection function to our feature DataFrame
X = AutoSpearman(X)
selected = len(X.columns)

# select all rows, and the last label col
y = df.iloc[:, -1]

print(selected, " out of ", total_features, " were selected via AutoSpearman feature selection process")
print('feature cols:', '\n\n', X.head(1), '\n\n')
print('label col:', '\n\n', y.head(1))

(Part 1) Automatically select non-correlated metrics based on a Spearman rank correlation test
> Step 1 comparing between CountDeclMethod and CountDeclFunction
>> CountDeclMethod has the average correlation of 0.433 with other metrics
>> CountDeclFunction has the average correlation of 0.433 with other metrics
>> Exclude CountDeclMethod
> Step 2 comparing between MAJOR_COMMIT and DDEV
>> MAJOR_COMMIT has the average correlation of 0.274 with other metrics
>> DDEV has the average correlation of 0.274 with other metrics
>> Exclude DDEV
> Step 3 comparing between SumCyclomatic and SumCyclomaticModified
>> SumCyclomatic has the average correlation of 0.501 with other metrics
>> SumCyclomaticModified has the average correlation of 0.501 with other metrics
>> Exclude SumCyclomatic
> Step 4 comparing between AvgCyclomatic and AvgCyclomaticModified
>> AvgCyclomatic has the average correlation of 0.387 with other metrics
>> AvgCyclomaticModified has the average correlation of 0.387 with other metrics
>> Exclude AvgCyclomatic
> Step 5 comparing between MaxCyclomatic and MaxCyclomaticModified
>> MaxCyclomatic has the average correlation of 0.476 with other metrics
>> MaxCyclomaticModified has the average correlation of 0.476 with other metrics
>> Exclude MaxCyclomatic
> Step 6 comparing between SumCyclomaticModified and SumCyclomaticStrict
>> SumCyclomaticModified has the average correlation of 0.488 with other metrics
>> SumCyclomaticStrict has the average correlation of 0.489 with other metrics
>> Exclude SumCyclomaticStrict
> Step 7 comparing between CountStmtDecl and CountLineCodeDecl
>> CountStmtDecl has the average correlation of 0.49 with other metrics
>> CountLineCodeDecl has the average correlation of 0.487 with other metrics
>> Exclude CountStmtDecl
> Step 8 comparing between CountLineCode and CountStmt
>> CountLineCode has the average correlation of 0.504 with other metrics
>> CountStmt has the average correlation of 0.501 with other metrics
>> Exclude CountLineCode
> Step 9 comparing between CountSemicolon and CountStmt
>> CountSemicolon has the average correlation of 0.484 with other metrics
>> CountStmt has the average correlation of 0.492 with other metrics
>> Exclude CountStmt
> Step 10 comparing between OWN_COMMIT and MAJOR_COMMIT
>> OWN_COMMIT has the average correlation of 0.238 with other metrics
>> MAJOR_COMMIT has the average correlation of 0.249 with other metrics
>> Exclude MAJOR_COMMIT
> Step 11 comparing between CountPath_Max and MaxCyclomaticModified
>> CountPath_Max has the average correlation of 0.447 with other metrics
>> MaxCyclomaticModified has the average correlation of 0.448 with other metrics
>> Exclude MaxCyclomaticModified
> Step 12 comparing between CountStmtExe and CountLineCodeExe
>> CountStmtExe has the average correlation of 0.473 with other metrics
>> CountLineCodeExe has the average correlation of 0.475 with other metrics
>> Exclude CountLineCodeExe
> Step 13 comparing between SumEssential and CountDeclFunction
>> SumEssential has the average correlation of 0.397 with other metrics
>> CountDeclFunction has the average correlation of 0.379 with other metrics
>> Exclude SumEssential
> Step 14 comparing between CountPath_Max and MaxCyclomaticStrict
>> CountPath_Max has the average correlation of 0.427 with other metrics
>> MaxCyclomaticStrict has the average correlation of 0.428 with other metrics
>> Exclude MaxCyclomaticStrict
> Step 15 comparing between CountPath_Max and CountPath_Mean
>> CountPath_Max has the average correlation of 0.416 with other metrics
>> CountPath_Mean has the average correlation of 0.399 with other metrics
>> Exclude CountPath_Max
> Step 16 comparing between AvgCyclomaticStrict and AvgCyclomaticModified
>> AvgCyclomaticStrict has the average correlation of 0.337 with other metrics
>> AvgCyclomaticModified has the average correlation of 0.33 with other metrics
>> Exclude AvgCyclomaticStrict
> Step 17 comparing between CountDeclFunction and CountDeclInstanceMethod
>> CountDeclFunction has the average correlation of 0.364 with other metrics
>> CountDeclInstanceMethod has the average correlation of 0.342 with other metrics
>> Exclude CountDeclFunction
> Step 18 comparing between CountSemicolon and CountLineCodeDecl
>> CountSemicolon has the average correlation of 0.436 with other metrics
>> CountLineCodeDecl has the average correlation of 0.421 with other metrics
>> Exclude CountSemicolon
> Step 19 comparing between CountLine and CountLineBlank
>> CountLine has the average correlation of 0.413 with other metrics
>> CountLineBlank has the average correlation of 0.372 with other metrics
>> Exclude CountLine
> Step 20 comparing between MaxNesting_Mean and CountPath_Mean
>> MaxNesting_Mean has the average correlation of 0.33 with other metrics
>> CountPath_Mean has the average correlation of 0.365 with other metrics
>> Exclude CountPath_Mean
> Step 21 comparing between MaxNesting_Max and MaxNesting_Mean
>> MaxNesting_Max has the average correlation of 0.337 with other metrics
>> MaxNesting_Mean has the average correlation of 0.316 with other metrics
>> Exclude MaxNesting_Max
> Step 22 comparing between CountOutput_Mean and AvgLineCode
>> CountOutput_Mean has the average correlation of 0.284 with other metrics
>> AvgLineCode has the average correlation of 0.317 with other metrics
>> Exclude AvgLineCode
> Step 23 comparing between CountLineCodeDecl and SumCyclomaticModified
>> CountLineCodeDecl has the average correlation of 0.385 with other metrics
>> SumCyclomaticModified has the average correlation of 0.375 with other metrics
>> Exclude CountLineCodeDecl
> Step 24 comparing between CountPath_Min and MaxNesting_Min
>> CountPath_Min has the average correlation of 0.083 with other metrics
>> MaxNesting_Min has the average correlation of 0.077 with other metrics
>> Exclude CountPath_Min
> Step 25 comparing between CountDeclInstanceMethod and SumCyclomaticModified
>> CountDeclInstanceMethod has the average correlation of 0.304 with other metrics
>> SumCyclomaticModified has the average correlation of 0.371 with other metrics
>> Exclude SumCyclomaticModified
> Step 26 comparing between RatioCommentToCode and CountStmtExe
>> RatioCommentToCode has the average correlation of 0.341 with other metrics
>> CountStmtExe has the average correlation of 0.379 with other metrics
>> Exclude CountStmtExe
> Step 27 comparing between CountInput_Max and CountInput_Mean
>> CountInput_Max has the average correlation of 0.293 with other metrics
>> CountInput_Mean has the average correlation of 0.232 with other metrics
>> Exclude CountInput_Max
> Step 28 comparing between CountOutput_Max and CountOutput_Mean
>> CountOutput_Max has the average correlation of 0.329 with other metrics
>> CountOutput_Mean has the average correlation of 0.259 with other metrics
>> Exclude CountOutput_Max
> Step 29 comparing between MaxNesting_Mean and AvgCyclomaticModified
>> MaxNesting_Mean has the average correlation of 0.257 with other metrics
>> AvgCyclomaticModified has the average correlation of 0.247 with other metrics
>> Exclude MaxNesting_Mean
> Step 30 comparing between Added_lines and Del_lines
>> Added_lines has the average correlation of 0.294 with other metrics
>> Del_lines has the average correlation of 0.291 with other metrics
>> Exclude Added_lines
> Step 31 comparing between CountLineBlank and CountDeclInstanceMethod
>> CountLineBlank has the average correlation of 0.299 with other metrics
>> CountDeclInstanceMethod has the average correlation of 0.258 with other metrics
>> Exclude CountLineBlank
> Step 32 comparing between MINOR_LINE and OWN_LINE
>> MINOR_LINE has the average correlation of 0.08 with other metrics
>> OWN_LINE has the average correlation of 0.078 with other metrics
>> Exclude MINOR_LINE
> Step 33 comparing between CountDeclInstanceMethod and CountDeclMethodPublic
>> CountDeclInstanceMethod has the average correlation of 0.246 with other metrics
>> CountDeclMethodPublic has the average correlation of 0.232 with other metrics
>> Exclude CountDeclInstanceMethod
> Step 34 comparing between AvgLine and CountOutput_Mean
>> AvgLine has the average correlation of 0.234 with other metrics
>> CountOutput_Mean has the average correlation of 0.239 with other metrics
>> Exclude CountOutput_Mean
> Step 35 comparing between CountLineComment and AvgLineComment
>> CountLineComment has the average correlation of 0.149 with other metrics
>> AvgLineComment has the average correlation of 0.112 with other metrics
>> Exclude CountLineComment
> Step 36 comparing between Del_lines and ADEV
>> Del_lines has the average correlation of 0.265 with other metrics
>> ADEV has the average correlation of 0.233 with other metrics
>> Exclude Del_lines
According to Part 1 of AutoSpearman, ['ADEV', 'CountClassCoupled', 'AvgLine', 'OWN_LINE', 'CountDeclMethodProtected', 'CountDeclInstanceVariable', 'PercentLackOfCohesion', 'CountDeclClass', 'MAJOR_LINE', 'AvgLineBlank', 'CountDeclMethodPublic', 'CountInput_Mean', 'MaxNesting_Min', 'CountOutput_Min', 'CountDeclMethodDefault', 'AvgCyclomaticModified', 'CountInput_Min', 'CountDeclClassMethod', 'CountClassDerived', 'AvgLineComment', 'CountDeclClassVariable', 'CountClassBase', 'OWN_COMMIT', 'MaxInheritanceTree', 'CountDeclMethodPrivate', 'MINOR_COMMIT', 'AvgEssential', 'COMM', 'RatioCommentToCode'] are selected.
(Part 2) Automatically select non-correlated metrics based on a Variance Inflation Factor analysis

> Step 1 - exclude ADEV
> Step 2 - exclude AvgLine
Finally, according to Part 2 of AutoSpearman, Index(['CountClassCoupled', 'OWN_LINE', 'CountDeclMethodProtected',
       'CountDeclInstanceVariable', 'PercentLackOfCohesion', 'CountDeclClass',
       'MAJOR_LINE', 'AvgLineBlank', 'CountDeclMethodPublic',
       'CountInput_Mean', 'MaxNesting_Min', 'CountOutput_Min',
       'CountDeclMethodDefault', 'AvgCyclomaticModified', 'CountInput_Min',
       'const', 'CountDeclClassMethod', 'CountClassDerived', 'AvgLineComment',
       'CountDeclClassVariable', 'CountClassBase', 'OWN_COMMIT',
       'MaxInheritanceTree', 'CountDeclMethodPrivate', 'MINOR_COMMIT',
       'AvgEssential', 'COMM', 'RatioCommentToCode'],
      dtype='object') are selected.
27  out of  65  were selected via AutoSpearman feature selection process
feature cols: 

                                                     CountClassCoupled  \
File                                                                    
activemq-console/src/main/java/org/apache/activ...                  2   

                                                    OWN_LINE  \
File                                                           
activemq-console/src/main/java/org/apache/activ...       1.0   

                                                    CountDeclMethodProtected  \
File                                                                           
activemq-console/src/main/java/org/apache/activ...                         7   

                                                    CountDeclInstanceVariable  \
File                                                                            
activemq-console/src/main/java/org/apache/activ...                          3   

                                                    PercentLackOfCohesion  \
File                                                                        
activemq-console/src/main/java/org/apache/activ...                     61   

                                                    CountDeclClass  \
File                                                                 
activemq-console/src/main/java/org/apache/activ...               1   

                                                    MAJOR_LINE  AvgLineBlank  \
File                                                                           
activemq-console/src/main/java/org/apache/activ...           0             1   

                                                    CountDeclMethodPublic  \
File                                                                        
activemq-console/src/main/java/org/apache/activ...                      0   

                                                    CountInput_Mean  ...  \
File                                                                 ...   
activemq-console/src/main/java/org/apache/activ...         2.714286  ...   

                                                    AvgLineComment  \
File                                                                 
activemq-console/src/main/java/org/apache/activ...               6   

                                                    CountDeclClassVariable  \
File                                                                         
activemq-console/src/main/java/org/apache/activ...                       0   

                                                    CountClassBase  \
File                                                                 
activemq-console/src/main/java/org/apache/activ...               1   

                                                    OWN_COMMIT  \
File                                                             
activemq-console/src/main/java/org/apache/activ...         1.0   

                                                    MaxInheritanceTree  \
File                                                                     
activemq-console/src/main/java/org/apache/activ...                   2   

                                                    CountDeclMethodPrivate  \
File                                                                         
activemq-console/src/main/java/org/apache/activ...                       0   

                                                    MINOR_COMMIT  \
File                                                               
activemq-console/src/main/java/org/apache/activ...             0   

                                                    AvgEssential  COMM  \
File                                                                     
activemq-console/src/main/java/org/apache/activ...             2     1   

                                                    RatioCommentToCode  
File                                                                    
activemq-console/src/main/java/org/apache/activ...                 0.7  

[1 rows x 27 columns] 


label col: 

 File
activemq-console/src/main/java/org/apache/activemq/console/command/AbstractAmqCommand.java    False
Name: RealBug, dtype: bool
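The Part 1 log above repeats one step: take the most strongly correlated pair of metrics and exclude the one with the higher average correlation with the other metrics. The sketch below is a simplified illustration of that idea, not PyExplainer's implementation. The function name `drop_correlated`, the 0.7 threshold, and the toy data are assumptions made for this example, and the Variance Inflation Factor step (Part 2) is not covered.

```python
import numpy as np
import pandas as pd

def drop_correlated(X, threshold=0.7):
    """Greedy Spearman filter (simplified sketch, not PyExplainer's implementation)."""
    X = X.copy()
    while X.shape[1] > 1:
        # Spearman correlation equals the Pearson correlation of the ranks
        corr = X.rank().corr().abs()
        # ignore the diagonal (each metric correlates perfectly with itself)
        corr = corr.where(~np.eye(len(corr), dtype=bool), 0.0)
        values = corr.to_numpy()
        i, j = np.unravel_index(np.argmax(values), values.shape)
        if values[i, j] < threshold:
            break  # no strongly correlated pair is left
        a, b = corr.columns[i], corr.columns[j]
        # average correlation of each metric with the other metrics
        mean_corr = corr.sum() / (len(corr) - 1)
        # exclude the metric of the pair that is more correlated with the rest
        X = X.drop(columns=a if mean_corr[a] >= mean_corr[b] else b)
    return X

# hypothetical toy metrics: two near-duplicates and one unrelated metric
rng = np.random.default_rng(0)
lines = rng.integers(10, 500, size=200)
toy = pd.DataFrame({
    'CountLine': lines,
    'CountStmt': lines * 2 + rng.integers(0, 5, size=200),  # near-duplicate of CountLine
    'OWN_LINE': rng.random(200),                            # unrelated metric
})

kept = drop_correlated(toy)
print(list(kept.columns))
```

Only one of the two near-duplicate metrics should survive, while the unrelated metric is kept.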

1.4 Split data into training and testing sets¶

In [9]:

from sklearn.model_selection import train_test_split
# 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
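The split above is random, so the rows that end up in the test set (and every number reported later) can differ between runs. The sketch below, on hypothetical toy data, shows how `random_state` makes the split repeatable and how `stratify` keeps the share of buggy rows the same in both sets. The helper `split` and the toy columns are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy data: 100 rows, 20% of them labelled as buggy
X_toy = pd.DataFrame({'CountLine': np.arange(100)})
y_toy = pd.Series([True] * 20 + [False] * 80, name='RealBug')

def split(seed):
    # random_state fixes the shuffle; stratify keeps the buggy/clean ratio in both sets
    return train_test_split(X_toy, y_toy, test_size=0.3,
                            random_state=seed, stratify=y_toy)

X_tr, X_te, y_tr, y_te = split(seed=0)
X_tr2, X_te2, y_tr2, y_te2 = split(seed=0)

print(len(X_tr), len(X_te))
print(X_te.index.equals(X_te2.index))  # same seed, same test rows
print(y_te.mean())                     # share of buggy rows in the test set
```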

2. Training and Predicting¶

2.1 Train a RandomForest model using sklearn¶

In [10]:

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, random_state=0)
rf_model.fit(X_train, y_train)

Out[10]:

RandomForestClassifier(random_state=0)

2.2 Generate predictions¶

In [11]:

# generate predictions from the model, which returns a NumPy array of predicted labels
y_preds = rf_model.predict(X_test)
# create a DataFrame that only has the predicted label column, indexed like y_test
y_preds = pd.DataFrame(data={'PredictedBug': y_preds}, index=y_test.index)
y_preds.head(3)

Out[11]:

	PredictedBug
File
activemq-core/src/main/java/org/apache/activemq/kaha/MapContainer.java	False
activemq-core/src/main/java/org/apache/activemq/openwire/v3/MessageAckMarshaller.java	False
activemq-core/src/main/java/org/apache/activemq/ConnectionFailedException.java	False

3. Prediction post-processing¶

3.1 Combine feature cols, label col, and the predicted col in testing set¶

In [12]:

combined_testing_data = X_test.join(y_test.to_frame())
combined_testing_data = combined_testing_data.join(y_preds)
combined_testing_data.head(3)
# total num of rows
total_rows = len(combined_testing_data)

3.2 Filter out wrongly predicted rows¶

In [13]:

correctly_predicted_data = combined_testing_data[combined_testing_data['RealBug']==combined_testing_data['PredictedBug']]
correctly_predicted_rows = len(correctly_predicted_data)
print('The model correctly predicted ', round(correctly_predicted_rows / total_rows * 100, 1), '% of testing data')

The model correctly predicted  90.6 % of testing data
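A single accuracy figure can hide how the model does on the buggy files, which are usually the minority. The sketch below computes the same share of correct predictions and then breaks it down by actual and predicted label with `pd.crosstab`. The ten labels and predictions are hypothetical.

```python
import pandas as pd

# hypothetical labels and predictions for ten files
combined = pd.DataFrame({
    'RealBug':      [True, True, False, False, False, False, False, False, False, False],
    'PredictedBug': [True, False, False, False, False, False, False, False, False, True],
})

# share of rows where the prediction matches the label
accuracy = (combined['RealBug'] == combined['PredictedBug']).mean()

# rows: actual label, columns: predicted label
table = pd.crosstab(combined['RealBug'], combined['PredictedBug'])

print(round(accuracy * 100, 1), '% of rows predicted correctly')
print(table)
```

In this toy case, 8 of 10 rows are correct overall, but only one of the two buggy files is found.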

3.3 We focus on buggy files, so we filter out the non-buggy files¶

In [14]:

correctly_predicted_bug = correctly_predicted_data[correctly_predicted_data['RealBug']==True]
correctly_predicted_bug.head(3)

Out[14]:

	CountClassCoupled	OWN_LINE	CountDeclMethodProtected	CountDeclInstanceVariable	PercentLackOfCohesion	CountDeclClass	MAJOR_LINE	AvgLineBlank	CountDeclMethodPublic	CountInput_Mean	...	CountClassBase	OWN_COMMIT	MaxInheritanceTree	CountDeclMethodPrivate	MINOR_COMMIT	AvgEssential	COMM	RatioCommentToCode	RealBug	PredictedBug
File
activemq-core/src/test/java/org/apache/activemq/transport/fanout/FanoutTransportBrokerTest.java	12	0.738916	3	2	77	3	1	1	8	2.181818	...	2	0.800000	6	0	0	1	5	0.27	True	True
activemq-core/src/main/java/org/apache/activemq/ActiveMQMessageConsumer.java	27	0.569082	10	21	89	5	2	0	35	4.807692	...	4	0.500000	1	5	0	1	10	0.42	True	True
activemq-openwire-generator/src/main/java/org/apache/activemq/openwire/tool/SingleSourceGenerator.java	0	0.995781	8	8	88	1	1	0	20	1.428571	...	1	0.666667	2	0	0	1	3	0.15	True	True

3 rows × 29 columns

3.4 Define feature cols and label col using correctly predicted testing data¶

In [15]:

# select all rows and feature cols
feature_cols = correctly_predicted_bug.iloc[:, :-2]
# select all rows and one label col (either RealBug or PredictedBug is fine since they are the same)
label_col = correctly_predicted_bug.iloc[:, -2]

3.5 Select one row of correctly predicted bug to be explained¶

In [16]:

# decide which row to be selected
selected_row = 0
# select that row from feature_cols, which contains all of the feature values
X_explain = feature_cols.iloc[[selected_row]]
# select the corresponding label from the DataFrame that we just created above
y_explain = label_col.iloc[[selected_row]]
print('one row of feature:', '\n\n', X_explain, '\n')
print('one row of label:', '\n\n', y_explain)

one row of feature: 

                                                     CountClassCoupled  \
File                                                                    
activemq-core/src/test/java/org/apache/activemq...                 12   

                                                    OWN_LINE  \
File                                                           
activemq-core/src/test/java/org/apache/activemq...  0.738916   

                                                    CountDeclMethodProtected  \
File                                                                           
activemq-core/src/test/java/org/apache/activemq...                         3   

                                                    CountDeclInstanceVariable  \
File                                                                            
activemq-core/src/test/java/org/apache/activemq...                          2   

                                                    PercentLackOfCohesion  \
File                                                                        
activemq-core/src/test/java/org/apache/activemq...                     77   

                                                    CountDeclClass  \
File                                                                 
activemq-core/src/test/java/org/apache/activemq...               3   

                                                    MAJOR_LINE  AvgLineBlank  \
File                                                                           
activemq-core/src/test/java/org/apache/activemq...           1             1   

                                                    CountDeclMethodPublic  \
File                                                                        
activemq-core/src/test/java/org/apache/activemq...                      8   

                                                    CountInput_Mean  ...  \
File                                                                 ...   
activemq-core/src/test/java/org/apache/activemq...         2.181818  ...   

                                                    AvgLineComment  \
File                                                                 
activemq-core/src/test/java/org/apache/activemq...               2   

                                                    CountDeclClassVariable  \
File                                                                         
activemq-core/src/test/java/org/apache/activemq...                       1   

                                                    CountClassBase  \
File                                                                 
activemq-core/src/test/java/org/apache/activemq...               2   

                                                    OWN_COMMIT  \
File                                                             
activemq-core/src/test/java/org/apache/activemq...         0.8   

                                                    MaxInheritanceTree  \
File                                                                     
activemq-core/src/test/java/org/apache/activemq...                   6   

                                                    CountDeclMethodPrivate  \
File                                                                         
activemq-core/src/test/java/org/apache/activemq...                       0   

                                                    MINOR_COMMIT  \
File                                                               
activemq-core/src/test/java/org/apache/activemq...             0   

                                                    AvgEssential  COMM  \
File                                                                     
activemq-core/src/test/java/org/apache/activemq...             1     5   

                                                    RatioCommentToCode  
File                                                                    
activemq-core/src/test/java/org/apache/activemq...                0.27  

[1 rows x 27 columns] 

one row of label: 

 File
activemq-core/src/test/java/org/apache/activemq/transport/fanout/FanoutTransportBrokerTest.java    True
Name: RealBug, dtype: bool
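Note the double brackets in `iloc[[selected_row]]` above: indexing with a list keeps the result as a one-row DataFrame (or a one-element Series for the label), whereas a bare integer would return a Series of feature values (or a scalar). The sketch below shows the difference on a hypothetical two-row feature table.

```python
import pandas as pd

# hypothetical feature table with two correctly predicted buggy files
features = pd.DataFrame(
    {'COMM': [5, 10], 'OWN_COMMIT': [0.8, 0.5]},
    index=pd.Index(['a/Foo.java', 'a/Bar.java'], name='File'),
)

as_frame = features.iloc[[0]]   # list of positions -> one-row DataFrame
as_series = features.iloc[0]    # single position -> Series of feature values

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```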

4. Create rules (explanations) and visualise them¶

4.1 Initialise a PyExplainer object¶

In [17]:

from pyexplainer import pyexplainer_pyexplainer

py_explainer = pyexplainer_pyexplainer.PyExplainer(X_train = X_train,
                                                   y_train = y_train,
                                                   indep = X_train.columns,
                                                   dep = 'RealBug',
                                                   blackbox_model = rf_model)

4.2 Create rules by calling the explain function of the PyExplainer object¶

Attention: This step can be time-consuming¶

In [18]:

rules = py_explainer.explain(X_explain=X_explain,
                             y_explain=y_explain,
                             search_function='crossoverinterpolation')

The created rules are stored in a dictionary. For more information about what each key contains, please refer to the Appendix¶

In [19]:

rules.keys()

Out[19]:

dict_keys(['synthetic_data', 'synthetic_predictions', 'X_explain', 'y_explain', 'indep', 'dep', 'top_k_positive_rules', 'top_k_negative_rules', 'local_rulefit_model'])

4.3 Call the visualise function of the PyExplainer object to visualise the created rules¶

In [20]:

py_explainer.visualise(rules)

[Interactive widget output: a 'Risk Score' progress bar and one slider per rule, here '#1 The value of CountDeclClassVariable is more th…' and '#2 The value of CountDeclMethodPrivate is more th…'.]

Appendix¶

Details of the variables stored in the rules dictionary¶

Synthetic_data¶

Synthetic_data is the data generated by PyExplainer using one of the following approaches:

Crossover and Interpolation
Random Perturbation

After Synthetic_data is generated, it is stored as a pandas DataFrame object.
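To give a rough intuition for the two approach names, the sketch below generates one synthetic row by interpolating between two neighbouring instances and one by adding random noise to an instance. This is a generic illustration only and is not PyExplainer's actual implementation; the helpers `interpolate` and `perturb`, the `alpha` and `scale` parameters, and the toy values are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# hypothetical neighbours of the instance to be explained
neighbours = pd.DataFrame({'COMM': [5.0, 2.0, 3.0],
                           'OWN_COMMIT': [0.8, 1.0, 0.5]})

def interpolate(parent_a, parent_b, alpha):
    # a point on the line between two existing instances
    return parent_a + alpha * (parent_b - parent_a)

def perturb(instance, scale):
    # an existing instance plus small Gaussian noise on every feature
    return instance + rng.normal(0.0, scale, size=len(instance))

child = interpolate(neighbours.iloc[0], neighbours.iloc[1], alpha=0.5)
noisy = perturb(neighbours.iloc[0], scale=0.1)

# collect the generated rows in a DataFrame, one row per synthetic instance
synthetic = pd.DataFrame([child, noisy]).reset_index(drop=True)
print(synthetic)
```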

In [21]:

print("Type of pyExp_rule_obj['synthetic_data'] - ", type(rules['synthetic_data']), "\n")

print('Example')
display(rules['synthetic_data'].head(2))

Type of pyExp_rule_obj['synthetic_data'] -  <class 'pandas.core.frame.DataFrame'> 

Example

	CountClassCoupled	OWN_LINE	CountDeclMethodProtected	CountDeclInstanceVariable	PercentLackOfCohesion	CountDeclClass	MAJOR_LINE	AvgLineBlank	CountDeclMethodPublic	CountInput_Mean	...	AvgLineComment	CountDeclClassVariable	CountClassBase	OWN_COMMIT	MaxInheritanceTree	CountDeclMethodPrivate	MINOR_COMMIT	AvgEssential	COMM	RatioCommentToCode
0	1.0	0.75	1.0	2.0	42.0	1.0	0.0	3.0	6.0	3.57	...	3.0	1.0	1.0	0.8	3.0	0.0	0.0	1.0	5.0	0.27
1	1.0	1.00	0.0	2.0	68.0	1.0	0.0	0.0	5.0	2.60	...	0.0	3.0	1.0	1.0	6.0	0.0	0.0	1.0	2.0	0.64

2 rows × 27 columns

Synthetic_predictions¶

Synthetic_predictions holds the predictions for Synthetic_data, which are obtained from the global model inside PyExplainer.

In [22]:

print("Type of pyExp_rule_obj['synthetic_predictions'] - ", type(rules['synthetic_predictions']), "\n")
print("Example", "\n\n", rules['synthetic_predictions'])

Type of pyExp_rule_obj['synthetic_predictions'] -  <class 'numpy.ndarray'> 

Example 

 [ True False False ... False False False]

X_explain¶

X_explain is the instance to be explained (in this tutorial, a file that was correctly predicted as buggy)

In [23]:

print("Type of pyExp_rule_obj['X_explain'] - ", type(rules['X_explain']), "\n")

print('Example')
display(rules['X_explain'])

Type of pyExp_rule_obj['X_explain'] -  <class 'pandas.core.frame.DataFrame'> 

Example

	CountClassCoupled	OWN_LINE	CountDeclMethodProtected	CountDeclInstanceVariable	PercentLackOfCohesion	CountDeclClass	MAJOR_LINE	AvgLineBlank	CountDeclMethodPublic	CountInput_Mean	...	AvgLineComment	CountDeclClassVariable	CountClassBase	OWN_COMMIT	MaxInheritanceTree	CountDeclMethodPrivate	MINOR_COMMIT	AvgEssential	COMM	RatioCommentToCode
File
activemq-core/src/test/java/org/apache/activemq/transport/fanout/FanoutTransportBrokerTest.java	12	0.738916	3	2	77	3	1	1	8	2.181818	...	2	1	2	0.8	6	0	0	1	5	0.27

1 rows × 27 columns

y_explain¶

y_explain is the label of X_explain

In [24]:

print("Type of pyExp_rule_obj['y_explain'] - ", type(rules['y_explain']), "\n")
print("Example", "\n\n", rules['y_explain'])

Type of pyExp_rule_obj['y_explain'] -  <class 'pandas.core.series.Series'> 

Example 

 File
activemq-core/src/test/java/org/apache/activemq/transport/fanout/FanoutTransportBrokerTest.java    True
Name: RealBug, dtype: bool

indep¶

indep holds the feature names of X_explain¶

In [25]:

print("Type of pyExp_rule_obj['indep'] - ", type(rules['indep']), "\n")
print("Example", "\n\n", rules['indep'])

Type of pyExp_rule_obj['indep'] -  <class 'pandas.core.indexes.base.Index'> 

Example 

 Index(['CountClassCoupled', 'OWN_LINE', 'CountDeclMethodProtected',
       'CountDeclInstanceVariable', 'PercentLackOfCohesion', 'CountDeclClass',
       'MAJOR_LINE', 'AvgLineBlank', 'CountDeclMethodPublic',
       'CountInput_Mean', 'MaxNesting_Min', 'CountOutput_Min',
       'CountDeclMethodDefault', 'AvgCyclomaticModified', 'CountInput_Min',
       'CountDeclClassMethod', 'CountClassDerived', 'AvgLineComment',
       'CountDeclClassVariable', 'CountClassBase', 'OWN_COMMIT',
       'MaxInheritanceTree', 'CountDeclMethodPrivate', 'MINOR_COMMIT',
       'AvgEssential', 'COMM', 'RatioCommentToCode'],
      dtype='object')

dep¶

dep is the name of the label column¶

In [26]:

print("Type of pyExp_rule_obj['dep'] - ", type(rules['dep']), "\n")
print("Example", "\n\n", rules['dep'])

Type of pyExp_rule_obj['dep'] -  <class 'str'> 

Example 

 RealBug

top_k_positive_rules¶

top_k_positive_rules holds the top-k rules generated by PyExplainer to explain why a commit is predicted as defective.

Here we show the top-3 rules that lead to a defective prediction:

In [27]:

print("Type of pyExp_rule_obj['top_k_positive_rules'] - ", type(rules['top_k_positive_rules']), "\n")
print('Example')
display(rules['top_k_positive_rules'].head(3))

Type of pyExp_rule_obj['top_k_positive_rules'] -  <class 'pandas.core.frame.DataFrame'> 

Example

	index	rule	type	coef	support	importance	is_satisfy_instance
0	161	AvgLineComment > -3.6149998903274536 & CountCl...	rule	3.490109e-23	0.271540	1.552240e-23	True
1	562	OWN_COMMIT <= 0.8349999785423279 & COMM > 2.98...	rule	3.639777e-23	0.208877	1.479593e-23	True
2	820	CountClassBase <= 2.9850000143051147 & OWN_COM...	rule	3.484802e-23	0.232376	1.471797e-23	True
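Each entry in the rule column is a set of conditions joined by '&', and the is_satisfy_instance column appears to record whether the explained instance meets them. The sketch below shows one way to check such a rule against a one-row DataFrame using `DataFrame.eval`. The helper `satisfies` and both rule strings are hypothetical; the rules in the table above are truncated, so they are not reproduced here.

```python
import pandas as pd

# one-row instance; the values are taken from the X_explain output shown earlier
instance = pd.DataFrame({'OWN_COMMIT': [0.8], 'COMM': [5], 'CountClassBase': [2]})

def satisfies(rule, row):
    """Return True if the one-row DataFrame `row` meets every condition of `rule`."""
    conditions = [cond.strip() for cond in rule.split('&')]
    return all(bool(row.eval(cond).iloc[0]) for cond in conditions)

# hypothetical rules written in the same 'feature <op> value & ...' form
positive_rule = 'OWN_COMMIT <= 0.835 & COMM > 2.5'
negative_rule = 'OWN_COMMIT > 0.855'

print(satisfies(positive_rule, instance))
print(satisfies(negative_rule, instance))
```

Splitting on '&' and evaluating each condition separately avoids relying on operator precedence inside the expression.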

top_k_negative_rules¶

top_k_negative_rules holds the top-k negative rules generated by PyExplainer to explain why a commit is predicted as clean.

The default number of generated rules is 3.

In [28]:

print("Type of pyExp_rule_obj['top_k_negative_rules'] - ", type(rules['top_k_negative_rules']), "\n")
print('Example')
display(rules['top_k_negative_rules'])

Type of pyExp_rule_obj['top_k_negative_rules'] -  <class 'pandas.core.frame.DataFrame'> 

Example

	rule	type	coef	support	importance	Class
918	OWN_COMMIT > 0.8550000190734863 & CountDeclCla...	rule	-4.819474e-23	0.678851	2.250298e-23	Clean
1652	OWN_COMMIT > 0.8650000095367432	rule	-4.748976e-23	0.689295	2.197742e-23	Clean
1107	CountDeclMethodPrivate <= 1.8399999737739563 &...	rule	-4.863102e-23	0.725849	2.169360e-23	Clean