(1) X_train (pd.core.frame.DataFrame) - feature columns from training data
(2) y_train (pd.core.series.Series) - label column from training data
(3) indep (pd.core.indexes.base.Index) - names of feature columns; most of the time you can get them from 'X_train.columns'
(4) dep (str) - name of label column
(5) blackbox_model (any supervised classification model from the sklearn library) - the trained model whose predictions will be explained
(1) X_explain (pd.core.frame.DataFrame) - one row of feature data
(2) y_explain (pd.core.series.Series) - one row of predicted data
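The expected shapes listed above can be checked before calling explain. The following is a minimal sketch only: `check_explain_inputs` is a hypothetical helper, not part of PyExplainer, and the requirement that X_explain and y_explain share one index is an assumption drawn from the index discussion in this tutorial.

```python
import pandas as pd

def check_explain_inputs(X_explain, y_explain, indep):
    """Hypothetical helper: raise if the inputs do not match the documented shapes."""
    if not isinstance(X_explain, pd.DataFrame):
        raise TypeError("X_explain should be a pandas DataFrame")
    if not isinstance(y_explain, pd.Series):
        raise TypeError("y_explain should be a pandas Series")
    if len(X_explain) != 1 or len(y_explain) != 1:
        raise ValueError("X_explain and y_explain should each hold exactly one row")
    if list(X_explain.columns) != list(indep):
        raise ValueError("X_explain columns should match indep")
    # Assumption: both objects are expected to carry the same row index.
    if not X_explain.index.equals(y_explain.index):
        raise ValueError("X_explain and y_explain should share the same index")
    return True

# Made-up one-row example, not taken from the tutorial data.
X_demo = pd.DataFrame({"LOC": [11], "AddedLOC": [246]}, index=["a.java"])
y_demo = pd.Series([True], index=["a.java"], name="defect")
inputs_ok = check_explain_inputs(X_demo, y_demo, X_demo.columns)
print(inputs_ok)
```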
In our Full Tutorial (PART B) example, the File column was used as the custom index.
However, it is fine if you don't have a custom index; pandas will generate a default row index starting from 0.
If you do want to use a custom index, make sure to use it consistently whenever you process the data.
Otherwise, some of your data may have the pandas default index while the rest has your custom index,
which will cause errors or misaligned rows whenever you try to combine your DataFrame and Series.
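A small illustration of the index problem, using made-up data rather than the tutorial data: with `DataFrame.join`, a mismatch between a custom index and the default index tends to surface as missing values rather than as an exception, which can be harder to notice.

```python
import pandas as pd

X = pd.DataFrame({"FileName": ["a.java", "b.java"], "LOC": [10, 20]}).set_index("FileName")

# Predictions built without the custom index get the default 0..n-1 index.
preds_default = pd.DataFrame({"Predicted": [True, False]})
# Predictions built with the same index as X line up row by row.
preds_custom = pd.DataFrame({"Predicted": [True, False]}, index=X.index)

misaligned = X.join(preds_default)  # no shared labels, so Predicted is all NaN
aligned = X.join(preds_custom)      # labels match, so values are carried over

print(misaligned)
print(aligned)
```

Passing `index=X.index` (or `index=y_test.index`, as the tutorial does later) when building derived objects is one way to keep the index consistent.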
Note: we use the default data and model here as an example.
from pyexplainer import pyexplainer_pyexplainer
from sklearn.ensemble import RandomForestClassifier
default_data_and_model = pyexplainer_pyexplainer.get_dflt()
rf_model = RandomForestClassifier(random_state=0)
rf_model.fit(default_data_and_model['X_train'],
default_data_and_model['y_train'])
py_explainer = pyexplainer_pyexplainer.PyExplainer(X_train = default_data_and_model['X_train'],
y_train = default_data_and_model['y_train'],
indep = default_data_and_model['indep'],
dep = default_data_and_model['dep'],
blackbox_model = rf_model)
X_explain = default_data_and_model['X_explain']
y_explain = default_data_and_model['y_explain']
created_rules = py_explainer.explain(X_explain=X_explain,
y_explain=y_explain,
search_function='crossoverinterpolation',
random_state=0,
reuse_local_model=True)
You can move the sliders to change feature values and observe how the risk score changes.
py_explainer.visualise(created_rules)
HBox(children=(Label(value='Risk Score: '), FloatProgress(value=0.0, bar_style='info', layout=Layout(width='40…
Output(layout=Layout(border='3px solid black'), outputs=({'output_type': 'display_data', 'data': {'text/plain'…
FloatSlider(value=246.0, continuous_update=False, description='#1 The value of AddedLOC is more than 246.0', l…
FloatSlider(value=11.0, continuous_update=False, description='#2 The value of LOC is more than 11.0', layout=L…
import os
os.system("jupyter nbextension enable --py widgetsnbextension")
import pandas as pd
import numpy as np
from pyexplainer import pyexplainer_pyexplainer
df = pyexplainer_pyexplainer.load_sample_data()
df.head(3)
| | File | CountDeclMethodPrivate | AvgLineCode | CountLine | MaxCyclomatic | CountDeclMethodDefault | AvgEssential | CountDeclClassVariable | SumCyclomaticStrict | AvgCyclomatic | ... | OWN_LINE | OWN_COMMIT | MINOR_COMMIT | MINOR_LINE | MAJOR_COMMIT | MAJOR_LINE | RealBug | HeuBug | HeuBugCount | RealBugCount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | activemq-console/src/main/java/org/apache/acti... | 0 | 10 | 171 | 5 | 0 | 2 | 0 | 18 | 2 | ... | 1.00000 | 1.0 | 0 | 1 | 1 | 0 | False | False | 0 | 0 |
1 | activemq-console/src/main/java/org/apache/acti... | 0 | 8 | 123 | 5 | 0 | 1 | 1 | 15 | 3 | ... | 0.98374 | 0.5 | 0 | 1 | 2 | 1 | False | False | 0 | 0 |
2 | activemq-console/src/main/java/org/apache/acti... | 0 | 7 | 136 | 5 | 0 | 1 | 1 | 16 | 2 | ... | 1.00000 | 1.0 | 0 | 1 | 1 | 0 | False | False | 0 | 0 |
3 rows × 70 columns
df = df.set_index(df['File'])
df = df.drop(['File', 'HeuBug', 'HeuBugCount', 'RealBugCount'], axis=1)
df.head(3)
File | CountDeclMethodPrivate | AvgLineCode | CountLine | MaxCyclomatic | CountDeclMethodDefault | AvgEssential | CountDeclClassVariable | SumCyclomaticStrict | AvgCyclomatic | AvgLine | ... | DDEV | Added_lines | Del_lines | OWN_LINE | OWN_COMMIT | MINOR_COMMIT | MINOR_LINE | MAJOR_COMMIT | MAJOR_LINE | RealBug |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
activemq-console/src/main/java/org/apache/activemq/console/command/AbstractAmqCommand.java | 0 | 10 | 171 | 5 | 0 | 2 | 0 | 18 | 2 | 18 | ... | 1 | 32 | 18 | 1.00000 | 1.0 | 0 | 1 | 1 | 0 | False |
activemq-console/src/main/java/org/apache/activemq/console/command/AbstractCommand.java | 0 | 8 | 123 | 5 | 0 | 1 | 1 | 15 | 3 | 17 | ... | 2 | 30 | 28 | 0.98374 | 0.5 | 0 | 1 | 2 | 1 | False |
activemq-console/src/main/java/org/apache/activemq/console/command/AbstractJmxCommand.java | 0 | 7 | 136 | 5 | 0 | 1 | 1 | 16 | 2 | 13 | ... | 1 | 8 | 8 | 1.00000 | 1.0 | 0 | 1 | 1 | 0 | False |
3 rows × 66 columns
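As a side note on the indexing step above, passing the column name to `set_index` moves the column into the index and drops it in one call. The sketch below uses a tiny made-up frame (not the sample data) to show that both routes give the same result.

```python
import pandas as pd

raw = pd.DataFrame({
    "File": ["A.java", "B.java"],
    "CountLine": [171, 123],
    "RealBug": [False, True],
})

# Two steps, as in the tutorial: copy 'File' into the index, then drop the column.
two_step = raw.set_index(raw["File"]).drop(["File"], axis=1)
# One step: passing the column name moves it into the index and drops it by default.
one_step = raw.set_index("File")

print(two_step.equals(one_step))
```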
from pyexplainer.pyexplainer_pyexplainer import AutoSpearman
# select all rows, and all feature cols
# the last col, which is label col, is not selected
X = df.iloc[:, :-1]
total_features = len(X.columns)
# apply feature selection function to our feature DataFrame
X = AutoSpearman(X)
selected = len(X.columns)
# select all rows, and the last label col
y = df.iloc[:, -1]
print(selected, " out of ", total_features, " were selected via AutoSpearman feature selection process")
print('feature cols:', '\n\n', X.head(1), '\n\n')
print('label col:', '\n\n', y.head(1))
(Part 1) Automatically select non-correlated metrics based on a Spearman rank correlation test
> Step 1 comparing between CountDeclMethod and CountDeclFunction
>> CountDeclMethod has the average correlation of 0.433 with other metrics
>> CountDeclFunction has the average correlation of 0.433 with other metrics
>> Exclude CountDeclMethod
> Step 2 comparing between MAJOR_COMMIT and DDEV
>> MAJOR_COMMIT has the average correlation of 0.274 with other metrics
>> DDEV has the average correlation of 0.274 with other metrics
>> Exclude DDEV
> Step 3 comparing between SumCyclomatic and SumCyclomaticModified
>> SumCyclomatic has the average correlation of 0.501 with other metrics
>> SumCyclomaticModified has the average correlation of 0.501 with other metrics
>> Exclude SumCyclomatic
> Step 4 comparing between AvgCyclomatic and AvgCyclomaticModified
>> AvgCyclomatic has the average correlation of 0.387 with other metrics
>> AvgCyclomaticModified has the average correlation of 0.387 with other metrics
>> Exclude AvgCyclomatic
> Step 5 comparing between MaxCyclomatic and MaxCyclomaticModified
>> MaxCyclomatic has the average correlation of 0.476 with other metrics
>> MaxCyclomaticModified has the average correlation of 0.476 with other metrics
>> Exclude MaxCyclomatic
> Step 6 comparing between SumCyclomaticModified and SumCyclomaticStrict
>> SumCyclomaticModified has the average correlation of 0.488 with other metrics
>> SumCyclomaticStrict has the average correlation of 0.489 with other metrics
>> Exclude SumCyclomaticStrict
> Step 7 comparing between CountStmtDecl and CountLineCodeDecl
>> CountStmtDecl has the average correlation of 0.49 with other metrics
>> CountLineCodeDecl has the average correlation of 0.487 with other metrics
>> Exclude CountStmtDecl
> Step 8 comparing between CountLineCode and CountStmt
>> CountLineCode has the average correlation of 0.504 with other metrics
>> CountStmt has the average correlation of 0.501 with other metrics
>> Exclude CountLineCode
> Step 9 comparing between CountSemicolon and CountStmt
>> CountSemicolon has the average correlation of 0.484 with other metrics
>> CountStmt has the average correlation of 0.492 with other metrics
>> Exclude CountStmt
> Step 10 comparing between OWN_COMMIT and MAJOR_COMMIT
>> OWN_COMMIT has the average correlation of 0.238 with other metrics
>> MAJOR_COMMIT has the average correlation of 0.249 with other metrics
>> Exclude MAJOR_COMMIT
> Step 11 comparing between CountPath_Max and MaxCyclomaticModified
>> CountPath_Max has the average correlation of 0.447 with other metrics
>> MaxCyclomaticModified has the average correlation of 0.448 with other metrics
>> Exclude MaxCyclomaticModified
> Step 12 comparing between CountStmtExe and CountLineCodeExe
>> CountStmtExe has the average correlation of 0.473 with other metrics
>> CountLineCodeExe has the average correlation of 0.475 with other metrics
>> Exclude CountLineCodeExe
> Step 13 comparing between SumEssential and CountDeclFunction
>> SumEssential has the average correlation of 0.397 with other metrics
>> CountDeclFunction has the average correlation of 0.379 with other metrics
>> Exclude SumEssential
> Step 14 comparing between CountPath_Max and MaxCyclomaticStrict
>> CountPath_Max has the average correlation of 0.427 with other metrics
>> MaxCyclomaticStrict has the average correlation of 0.428 with other metrics
>> Exclude MaxCyclomaticStrict
> Step 15 comparing between CountPath_Max and CountPath_Mean
>> CountPath_Max has the average correlation of 0.416 with other metrics
>> CountPath_Mean has the average correlation of 0.399 with other metrics
>> Exclude CountPath_Max
> Step 16 comparing between AvgCyclomaticStrict and AvgCyclomaticModified
>> AvgCyclomaticStrict has the average correlation of 0.337 with other metrics
>> AvgCyclomaticModified has the average correlation of 0.33 with other metrics
>> Exclude AvgCyclomaticStrict
> Step 17 comparing between CountDeclFunction and CountDeclInstanceMethod
>> CountDeclFunction has the average correlation of 0.364 with other metrics
>> CountDeclInstanceMethod has the average correlation of 0.342 with other metrics
>> Exclude CountDeclFunction
> Step 18 comparing between CountSemicolon and CountLineCodeDecl
>> CountSemicolon has the average correlation of 0.436 with other metrics
>> CountLineCodeDecl has the average correlation of 0.421 with other metrics
>> Exclude CountSemicolon
> Step 19 comparing between CountLine and CountLineBlank
>> CountLine has the average correlation of 0.413 with other metrics
>> CountLineBlank has the average correlation of 0.372 with other metrics
>> Exclude CountLine
> Step 20 comparing between MaxNesting_Mean and CountPath_Mean
>> MaxNesting_Mean has the average correlation of 0.33 with other metrics
>> CountPath_Mean has the average correlation of 0.365 with other metrics
>> Exclude CountPath_Mean
> Step 21 comparing between MaxNesting_Max and MaxNesting_Mean
>> MaxNesting_Max has the average correlation of 0.337 with other metrics
>> MaxNesting_Mean has the average correlation of 0.316 with other metrics
>> Exclude MaxNesting_Max
> Step 22 comparing between CountOutput_Mean and AvgLineCode
>> CountOutput_Mean has the average correlation of 0.284 with other metrics
>> AvgLineCode has the average correlation of 0.317 with other metrics
>> Exclude AvgLineCode
> Step 23 comparing between CountLineCodeDecl and SumCyclomaticModified
>> CountLineCodeDecl has the average correlation of 0.385 with other metrics
>> SumCyclomaticModified has the average correlation of 0.375 with other metrics
>> Exclude CountLineCodeDecl
> Step 24 comparing between CountPath_Min and MaxNesting_Min
>> CountPath_Min has the average correlation of 0.083 with other metrics
>> MaxNesting_Min has the average correlation of 0.077 with other metrics
>> Exclude CountPath_Min
> Step 25 comparing between CountDeclInstanceMethod and SumCyclomaticModified
>> CountDeclInstanceMethod has the average correlation of 0.304 with other metrics
>> SumCyclomaticModified has the average correlation of 0.371 with other metrics
>> Exclude SumCyclomaticModified
> Step 26 comparing between RatioCommentToCode and CountStmtExe
>> RatioCommentToCode has the average correlation of 0.341 with other metrics
>> CountStmtExe has the average correlation of 0.379 with other metrics
>> Exclude CountStmtExe
> Step 27 comparing between CountInput_Max and CountInput_Mean
>> CountInput_Max has the average correlation of 0.293 with other metrics
>> CountInput_Mean has the average correlation of 0.232 with other metrics
>> Exclude CountInput_Max
> Step 28 comparing between CountOutput_Max and CountOutput_Mean
>> CountOutput_Max has the average correlation of 0.329 with other metrics
>> CountOutput_Mean has the average correlation of 0.259 with other metrics
>> Exclude CountOutput_Max
> Step 29 comparing between MaxNesting_Mean and AvgCyclomaticModified
>> MaxNesting_Mean has the average correlation of 0.257 with other metrics
>> AvgCyclomaticModified has the average correlation of 0.247 with other metrics
>> Exclude MaxNesting_Mean
> Step 30 comparing between Added_lines and Del_lines
>> Added_lines has the average correlation of 0.294 with other metrics
>> Del_lines has the average correlation of 0.291 with other metrics
>> Exclude Added_lines
> Step 31 comparing between CountLineBlank and CountDeclInstanceMethod
>> CountLineBlank has the average correlation of 0.299 with other metrics
>> CountDeclInstanceMethod has the average correlation of 0.258 with other metrics
>> Exclude CountLineBlank
> Step 32 comparing between MINOR_LINE and OWN_LINE
>> MINOR_LINE has the average correlation of 0.08 with other metrics
>> OWN_LINE has the average correlation of 0.078 with other metrics
>> Exclude MINOR_LINE
> Step 33 comparing between CountDeclInstanceMethod and CountDeclMethodPublic
>> CountDeclInstanceMethod has the average correlation of 0.246 with other metrics
>> CountDeclMethodPublic has the average correlation of 0.232 with other metrics
>> Exclude CountDeclInstanceMethod
> Step 34 comparing between AvgLine and CountOutput_Mean
>> AvgLine has the average correlation of 0.234 with other metrics
>> CountOutput_Mean has the average correlation of 0.239 with other metrics
>> Exclude CountOutput_Mean
> Step 35 comparing between CountLineComment and AvgLineComment
>> CountLineComment has the average correlation of 0.149 with other metrics
>> AvgLineComment has the average correlation of 0.112 with other metrics
>> Exclude CountLineComment
> Step 36 comparing between Del_lines and ADEV
>> Del_lines has the average correlation of 0.265 with other metrics
>> ADEV has the average correlation of 0.233 with other metrics
>> Exclude Del_lines
According to Part 1 of AutoSpearman, ['ADEV', 'CountClassCoupled', 'AvgLine', 'OWN_LINE', 'CountDeclMethodProtected', 'CountDeclInstanceVariable', 'PercentLackOfCohesion', 'CountDeclClass', 'MAJOR_LINE', 'AvgLineBlank', 'CountDeclMethodPublic', 'CountInput_Mean', 'MaxNesting_Min', 'CountOutput_Min', 'CountDeclMethodDefault', 'AvgCyclomaticModified', 'CountInput_Min', 'CountDeclClassMethod', 'CountClassDerived', 'AvgLineComment', 'CountDeclClassVariable', 'CountClassBase', 'OWN_COMMIT', 'MaxInheritanceTree', 'CountDeclMethodPrivate', 'MINOR_COMMIT', 'AvgEssential', 'COMM', 'RatioCommentToCode'] are selected.
(Part 2) Automatically select non-correlated metrics based on a Variance Inflation Factor analysis
C:\Users\micha\miniconda3\lib\site-packages\statsmodels\tsa\tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only
  x = pd.concat(x[::order], 1)
C:\Users\micha\miniconda3\lib\site-packages\statsmodels\stats\outliers_influence.py:193: RuntimeWarning: divide by zero encountered in double_scalars
  vif = 1. / (1. - r_squared_i)
> Step 1 - exclude ADEV
> Step 2 - exclude AvgLine
Finally, according to Part 2 of AutoSpearman, Index(['CountClassCoupled', 'OWN_LINE', 'CountDeclMethodProtected', 'CountDeclInstanceVariable', 'PercentLackOfCohesion', 'CountDeclClass', 'MAJOR_LINE', 'AvgLineBlank', 'CountDeclMethodPublic', 'CountInput_Mean', 'MaxNesting_Min', 'CountOutput_Min', 'CountDeclMethodDefault', 'AvgCyclomaticModified', 'CountInput_Min', 'const', 'CountDeclClassMethod', 'CountClassDerived', 'AvgLineComment', 'CountDeclClassVariable', 'CountClassBase', 'OWN_COMMIT', 'MaxInheritanceTree', 'CountDeclMethodPrivate', 'MINOR_COMMIT', 'AvgEssential', 'COMM', 'RatioCommentToCode'], dtype='object') are selected.
27 out of 65 were selected via AutoSpearman feature selection process
feature cols:

CountClassCoupled \
File
activemq-console/src/main/java/org/apache/activ... 2

OWN_LINE \
File
activemq-console/src/main/java/org/apache/activ... 1.0

CountDeclMethodProtected \
File
activemq-console/src/main/java/org/apache/activ... 7

CountDeclInstanceVariable \
File
activemq-console/src/main/java/org/apache/activ... 3

PercentLackOfCohesion \
File
activemq-console/src/main/java/org/apache/activ... 61

CountDeclClass \
File
activemq-console/src/main/java/org/apache/activ... 1

MAJOR_LINE AvgLineBlank \
File
activemq-console/src/main/java/org/apache/activ... 0 1

CountDeclMethodPublic \
File
activemq-console/src/main/java/org/apache/activ... 0

CountInput_Mean ... \
File ...
activemq-console/src/main/java/org/apache/activ... 2.714286 ...

AvgLineComment \
File
activemq-console/src/main/java/org/apache/activ... 6

CountDeclClassVariable \
File
activemq-console/src/main/java/org/apache/activ... 0

CountClassBase \
File
activemq-console/src/main/java/org/apache/activ... 1

OWN_COMMIT \
File
activemq-console/src/main/java/org/apache/activ... 1.0

MaxInheritanceTree \
File
activemq-console/src/main/java/org/apache/activ... 2

CountDeclMethodPrivate \
File
activemq-console/src/main/java/org/apache/activ... 0

MINOR_COMMIT \
File
activemq-console/src/main/java/org/apache/activ... 0

AvgEssential COMM \
File
activemq-console/src/main/java/org/apache/activ... 2 1

RatioCommentToCode
File
activemq-console/src/main/java/org/apache/activ... 0.7

[1 rows x 27 columns]

label col:

File
activemq-console/src/main/java/org/apache/activemq/console/command/AbstractAmqCommand.java False
Name: RealBug, dtype: bool
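To make Part 1 of the log above more concrete, here is a simplified sketch of correlation-based pruning. It is not the AutoSpearman implementation: `prune_correlated` is a hypothetical helper, the 0.7 threshold is an assumption, and tie handling may differ from the library. It only mirrors the pattern visible in the log, where the member of a highly correlated pair with the higher average correlation is excluded.

```python
import numpy as np
import pandas as pd

def prune_correlated(X, threshold=0.7):
    """Greedy sketch: while any pair reaches the threshold, drop the member of the
    most correlated pair that has the higher mean |rho| with the other features."""
    X = X.copy()
    while X.shape[1] > 1:
        corr = X.corr(method="spearman").abs()
        off_diag = corr.where(~np.eye(len(corr), dtype=bool))  # mask self-correlation
        if off_diag.max().max() < threshold:
            break
        a, b = off_diag.stack().idxmax()   # labels of the most correlated pair
        mean_corr = off_diag.mean()        # average |rho| of each feature with the rest
        X = X.drop(columns=a if mean_corr[a] >= mean_corr[b] else b)
    return X

# Made-up data: two size metrics that move together and one unrelated metric.
rng = np.random.default_rng(0)
loc = rng.integers(10, 500, size=200)
demo = pd.DataFrame({
    "CountLine": loc,
    "CountLineCode": loc * 2 + 1,     # monotone in CountLine, so |rho| = 1
    "OWN_COMMIT": rng.random(200),    # unrelated to the size metrics
})
kept = prune_correlated(demo)
print(list(kept.columns))
```

Only one of the two size metrics should survive, while the unrelated metric is kept.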
from sklearn.model_selection import train_test_split
# 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=2, random_state=0)
rf_model.fit(X_train, y_train)
RandomForestClassifier(random_state=0)
# generate prediction from the model, which will return a list of predicted labels
y_preds = rf_model.predict(X_test)
# create a DataFrame which only has predicted label column
y_preds = pd.DataFrame(data={'PredictedBug': y_preds}, index=y_test.index)
y_preds.head(3)
File | PredictedBug |
---|---|
activemq-core/src/main/java/org/apache/activemq/kaha/MapContainer.java | False |
activemq-core/src/main/java/org/apache/activemq/openwire/v3/MessageAckMarshaller.java | False |
activemq-core/src/main/java/org/apache/activemq/ConnectionFailedException.java | False |
combined_testing_data = X_test.join(y_test.to_frame())
combined_testing_data = combined_testing_data.join(y_preds)
combined_testing_data.head(3)
# total num of rows
total_rows = len(combined_testing_data)
correctly_predicted_data = combined_testing_data[combined_testing_data['RealBug']==combined_testing_data['PredictedBug']]
correctly_predicted_rows = len(correctly_predicted_data)
print('The model correctly predicted', round(correctly_predicted_rows / total_rows * 100, 1), '% of testing data')
The model correctly predicted 90.6 % of testing data
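Defect data is often imbalanced, so a single accuracy figure can hide how the model does on the defective class. The sketch below uses made-up labels (not the tutorial's test set) to show how `pd.crosstab` breaks the same comparison down by class; the column names simply follow the ones used above.

```python
import pandas as pd

# Toy labels for illustration only: 2 defective rows, 8 clean rows.
results = pd.DataFrame({
    "RealBug":      [True, True, False, False, False, False, False, False, False, False],
    "PredictedBug": [True, False, False, False, False, False, False, False, False, False],
})

accuracy = (results["RealBug"] == results["PredictedBug"]).mean()
confusion = pd.crosstab(results["RealBug"], results["PredictedBug"])
# Recall on the defective class: of the truly defective rows, how many were caught.
defective = results[results["RealBug"]]
recall = defective["PredictedBug"].mean()

print(f"accuracy={accuracy:.2f}, recall on defective={recall:.2f}")
print(confusion)
```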
correctly_predicted_bug = correctly_predicted_data[correctly_predicted_data['RealBug']==True]
correctly_predicted_bug.head(3)
File | CountClassCoupled | OWN_LINE | CountDeclMethodProtected | CountDeclInstanceVariable | PercentLackOfCohesion | CountDeclClass | MAJOR_LINE | AvgLineBlank | CountDeclMethodPublic | CountInput_Mean | ... | CountClassBase | OWN_COMMIT | MaxInheritanceTree | CountDeclMethodPrivate | MINOR_COMMIT | AvgEssential | COMM | RatioCommentToCode | RealBug | PredictedBug |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
activemq-core/src/test/java/org/apache/activemq/transport/fanout/FanoutTransportBrokerTest.java | 12 | 0.738916 | 3 | 2 | 77 | 3 | 1 | 1 | 8 | 2.181818 | ... | 2 | 0.800000 | 6 | 0 | 0 | 1 | 5 | 0.27 | True | True |
activemq-core/src/main/java/org/apache/activemq/ActiveMQMessageConsumer.java | 27 | 0.569082 | 10 | 21 | 89 | 5 | 2 | 0 | 35 | 4.807692 | ... | 4 | 0.500000 | 1 | 5 | 0 | 1 | 10 | 0.42 | True | True |
activemq-openwire-generator/src/main/java/org/apache/activemq/openwire/tool/SingleSourceGenerator.java | 0 | 0.995781 | 8 | 8 | 88 | 1 | 1 | 0 | 20 | 1.428571 | ... | 1 | 0.666667 | 2 | 0 | 0 | 1 | 3 | 0.15 | True | True |
3 rows × 29 columns
# select all rows and feature cols
feature_cols = correctly_predicted_bug.iloc[:, :-2]
# select all rows and one label col (either RealBug or PredictedBug is fine since they are the same)
label_col = correctly_predicted_bug.iloc[:, -2]
# decide which row to be selected
selected_row = 0
# select the row in X_test which contains all of the feature values
X_explain = feature_cols.iloc[[selected_row]]
# select the corresponding label from the DataFrame that we just created above
y_explain = label_col.iloc[[selected_row]]
print('one row of feature:', '\n\n', X_explain, '\n')
print('one row of label:', '\n\n', y_explain)
one row of feature:

CountClassCoupled \
File
activemq-core/src/test/java/org/apache/activemq... 12

OWN_LINE \
File
activemq-core/src/test/java/org/apache/activemq... 0.738916

CountDeclMethodProtected \
File
activemq-core/src/test/java/org/apache/activemq... 3

CountDeclInstanceVariable \
File
activemq-core/src/test/java/org/apache/activemq... 2

PercentLackOfCohesion \
File
activemq-core/src/test/java/org/apache/activemq... 77

CountDeclClass \
File
activemq-core/src/test/java/org/apache/activemq... 3

MAJOR_LINE AvgLineBlank \
File
activemq-core/src/test/java/org/apache/activemq... 1 1

CountDeclMethodPublic \
File
activemq-core/src/test/java/org/apache/activemq... 8

CountInput_Mean ... \
File ...
activemq-core/src/test/java/org/apache/activemq... 2.181818 ...

AvgLineComment \
File
activemq-core/src/test/java/org/apache/activemq... 2

CountDeclClassVariable \
File
activemq-core/src/test/java/org/apache/activemq... 1

CountClassBase \
File
activemq-core/src/test/java/org/apache/activemq... 2

OWN_COMMIT \
File
activemq-core/src/test/java/org/apache/activemq... 0.8

MaxInheritanceTree \
File
activemq-core/src/test/java/org/apache/activemq... 6

CountDeclMethodPrivate \
File
activemq-core/src/test/java/org/apache/activemq... 0

MINOR_COMMIT \
File
activemq-core/src/test/java/org/apache/activemq... 0

AvgEssential COMM \
File
activemq-core/src/test/java/org/apache/activemq... 1 5

RatioCommentToCode
File
activemq-core/src/test/java/org/apache/activemq... 0.27

[1 rows x 27 columns]

one row of label:

File
activemq-core/src/test/java/org/apache/activemq/transport/fanout/FanoutTransportBrokerTest.java True
Name: RealBug, dtype: bool
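The double brackets in `iloc[[selected_row]]` matter here: a list of positions keeps the pandas container type, which is what yields the one-row DataFrame and one-element Series shown above. A minimal sketch with made-up values:

```python
import pandas as pd

features = pd.DataFrame(
    {"LOC": [11, 40], "AddedLOC": [246, 3]},
    index=["A.java", "B.java"],
)
labels = pd.Series([True, True], index=features.index, name="RealBug")

row_as_frame = features.iloc[[0]]   # list of positions -> one-row DataFrame
row_as_series = features.iloc[0]    # single position -> Series of that row's values
label_as_series = labels.iloc[[0]]  # list of positions -> one-element Series
label_as_scalar = labels.iloc[0]    # single position -> plain value

print(type(row_as_frame).__name__, row_as_frame.shape)
print(type(row_as_series).__name__, row_as_series.shape)
```

With single brackets the row index (here the file name) is no longer carried as a row label, so the two objects would not line up with the types listed for X_explain and y_explain.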
from pyexplainer import pyexplainer_pyexplainer
py_explainer = pyexplainer_pyexplainer.PyExplainer(X_train = X_train,
y_train = y_train,
indep = X_train.columns,
dep = 'RealBug',
blackbox_model = rf_model)
rules = py_explainer.explain(X_explain=X_explain,
y_explain=y_explain,
search_function='crossoverinterpolation')
rules.keys()
dict_keys(['synthetic_data', 'synthetic_predictions', 'X_explain', 'y_explain', 'indep', 'dep', 'top_k_positive_rules', 'top_k_negative_rules', 'local_rulefit_model'])
py_explainer.visualise(rules)
HBox(children=(Label(value='Risk Score: '), FloatProgress(value=0.0, bar_style='info', layout=Layout(width='40…
Output(layout=Layout(border='3px solid black'))
FloatSlider(value=1.0, continuous_update=False, description='#1 The value of CountDeclClassVariable is more th…
FloatSlider(value=0.0, continuous_update=False, description='#2 The value of CountDeclMethodPrivate is more th…
synthetic_data is data generated by PyExplainer using the approach selected via search_function ('crossoverinterpolation' in this example).
After Synthetic_data is generated, it is stored as a pandas DataFrame object.
print("Type of pyExp_rule_obj['synthetic_data'] - ", type(rules['synthetic_data']), "\n")
print('Example')
display(rules['synthetic_data'].head(2))
Type of pyExp_rule_obj['synthetic_data'] - <class 'pandas.core.frame.DataFrame'> Example
| | CountClassCoupled | OWN_LINE | CountDeclMethodProtected | CountDeclInstanceVariable | PercentLackOfCohesion | CountDeclClass | MAJOR_LINE | AvgLineBlank | CountDeclMethodPublic | CountInput_Mean | ... | AvgLineComment | CountDeclClassVariable | CountClassBase | OWN_COMMIT | MaxInheritanceTree | CountDeclMethodPrivate | MINOR_COMMIT | AvgEssential | COMM | RatioCommentToCode |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.75 | 1.0 | 2.0 | 42.0 | 1.0 | 0.0 | 3.0 | 6.0 | 3.57 | ... | 3.0 | 1.0 | 1.0 | 0.8 | 3.0 | 0.0 | 0.0 | 1.0 | 5.0 | 0.27 |
1 | 1.0 | 1.00 | 0.0 | 2.0 | 68.0 | 1.0 | 0.0 | 0.0 | 5.0 | 2.60 | ... | 0.0 | 3.0 | 1.0 | 1.0 | 6.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.64 |
2 rows × 27 columns
synthetic_predictions holds the predictions for synthetic_data, obtained from the global model inside PyExplainer (the black-box model passed in as blackbox_model).
print("Type of pyExp_rule_obj['synthetic_predictions'] - ", type(rules['synthetic_predictions']), "\n")
print("Example", "\n\n", rules['synthetic_predictions'])
Type of pyExp_rule_obj['synthetic_predictions'] - <class 'numpy.ndarray'> Example [ True False False ... False False False]
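Because synthetic_predictions is a plain boolean NumPy array, its class balance can be inspected directly. The sketch below uses a made-up stand-in array, not the real output of explain.

```python
import numpy as np

# Stand-in for rules['synthetic_predictions']; the values are made up.
synthetic_predictions = np.array([True, False, False, True, False, False])

labels, counts = np.unique(synthetic_predictions, return_counts=True)
# Map each predicted label to its share of the synthetic instances.
share = dict(zip(labels.tolist(), (counts / counts.sum()).tolist()))
print(share)
```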
X_explain is an instance to be explained (which is a defective commit in this context)
print("Type of pyExp_rule_obj['X_explain'] - ", type(rules['X_explain']), "\n")
print('Example')
display(rules['X_explain'])
Type of pyExp_rule_obj['X_explain'] - <class 'pandas.core.frame.DataFrame'> Example
File | CountClassCoupled | OWN_LINE | CountDeclMethodProtected | CountDeclInstanceVariable | PercentLackOfCohesion | CountDeclClass | MAJOR_LINE | AvgLineBlank | CountDeclMethodPublic | CountInput_Mean | ... | AvgLineComment | CountDeclClassVariable | CountClassBase | OWN_COMMIT | MaxInheritanceTree | CountDeclMethodPrivate | MINOR_COMMIT | AvgEssential | COMM | RatioCommentToCode |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
activemq-core/src/test/java/org/apache/activemq/transport/fanout/FanoutTransportBrokerTest.java | 12 | 0.738916 | 3 | 2 | 77 | 3 | 1 | 1 | 8 | 2.181818 | ... | 2 | 1 | 2 | 0.8 | 6 | 0 | 0 | 1 | 5 | 0.27 |
1 rows × 27 columns
y_explain is the label of X_explain
print("Type of pyExp_rule_obj['y_explain'] - ", type(rules['y_explain']), "\n")
print("Example", "\n\n", rules['y_explain'])
Type of pyExp_rule_obj['y_explain'] - <class 'pandas.core.series.Series'> Example File activemq-core/src/test/java/org/apache/activemq/transport/fanout/FanoutTransportBrokerTest.java True Name: RealBug, dtype: bool
print("Type of pyExp_rule_obj['indep'] - ", type(rules['indep']), "\n")
print("Example", "\n\n", rules['indep'])
Type of pyExp_rule_obj['indep'] - <class 'pandas.core.indexes.base.Index'> Example Index(['CountClassCoupled', 'OWN_LINE', 'CountDeclMethodProtected', 'CountDeclInstanceVariable', 'PercentLackOfCohesion', 'CountDeclClass', 'MAJOR_LINE', 'AvgLineBlank', 'CountDeclMethodPublic', 'CountInput_Mean', 'MaxNesting_Min', 'CountOutput_Min', 'CountDeclMethodDefault', 'AvgCyclomaticModified', 'CountInput_Min', 'CountDeclClassMethod', 'CountClassDerived', 'AvgLineComment', 'CountDeclClassVariable', 'CountClassBase', 'OWN_COMMIT', 'MaxInheritanceTree', 'CountDeclMethodPrivate', 'MINOR_COMMIT', 'AvgEssential', 'COMM', 'RatioCommentToCode'], dtype='object')
print("Type of pyExp_rule_obj['dep'] - ", type(rules['dep']), "\n")
print("Example", "\n\n", rules['dep'])
Type of pyExp_rule_obj['dep'] - <class 'str'> Example RealBug
top_k_positive_rules holds the top-k rules generated by PyExplainer to explain why a commit is predicted as defective.
Here we show the top-3 rules that lead to defective commits:
print("Type of pyExp_rule_obj['top_k_positive_rules'] - ", type(rules['top_k_positive_rules']), "\n")
print('Example')
display(rules['top_k_positive_rules'].head(3))
Type of pyExp_rule_obj['top_k_positive_rules'] - <class 'pandas.core.frame.DataFrame'> Example
| | index | rule | type | coef | support | importance | is_satisfy_instance |
---|---|---|---|---|---|---|---|
0 | 161 | AvgLineComment > -3.6149998903274536 & CountCl... | rule | 3.490109e-23 | 0.271540 | 1.552240e-23 | True |
1 | 562 | OWN_COMMIT <= 0.8349999785423279 & COMM > 2.98... | rule | 3.639777e-23 | 0.208877 | 1.479593e-23 | True |
2 | 820 | CountClassBase <= 2.9850000143051147 & OWN_COM... | rule | 3.484802e-23 | 0.232376 | 1.471797e-23 | True |
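The is_satisfy_instance column indicates whether the explained instance meets every condition of a rule. As a rough sketch of what such a check involves, the hypothetical helper below (not part of PyExplainer) parses a rule written in the same textual shape as the rule column, assuming conditions of the form `feature op value` joined by ` & `. The rule text and thresholds are made up for illustration, since the rules in the table are truncated.

```python
import operator

# Comparison operators that may appear in a condition.
OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt, ">": operator.gt}

def satisfies(rule, instance):
    """Return True when every 'feature op value' condition joined by ' & ' holds."""
    for condition in rule.split(" & "):
        feature, op, value = condition.split()
        if not OPS[op](instance[feature], float(value)):
            return False
    return True

# Made-up rule and instance in the same textual shape as the table above.
rule = "OWN_COMMIT <= 0.835 & COMM > 2.98"
instance = {"OWN_COMMIT": 0.8, "COMM": 5}
print(satisfies(rule, instance))
```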
top_k_negative_rules holds the top-k negative rules generated by PyExplainer to explain why a commit is predicted as clean.
The default number of generated rules is 3.
print("Type of pyExp_rule_obj['top_k_negative_rules'] - ", type(rules['top_k_negative_rules']), "\n")
print('Example')
display(rules['top_k_negative_rules'])
Type of pyExp_rule_obj['top_k_negative_rules'] - <class 'pandas.core.frame.DataFrame'> Example
| | rule | type | coef | support | importance | Class |
---|---|---|---|---|---|---|
918 | OWN_COMMIT > 0.8550000190734863 & CountDeclCla... | rule | -4.819474e-23 | 0.678851 | 2.250298e-23 | Clean |
1652 | OWN_COMMIT > 0.8650000095367432 | rule | -4.748976e-23 | 0.689295 | 2.197742e-23 | Clean |
1107 | CountDeclMethodPrivate <= 1.8399999737739563 &... | rule | -4.863102e-23 | 0.725849 | 2.169360e-23 | Clean |
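The importance values in the table appear consistent with the rule importance commonly used with RuleFit, |coef| * sqrt(support * (1 - support)). This relationship is inferred from the numbers shown, not stated in the source, so treat the sketch below as a sanity check under that assumption; `rule_importance` is a hypothetical helper.

```python
import math

def rule_importance(coef, support):
    """RuleFit-style importance: |coef| * sqrt(support * (1 - support))."""
    return abs(coef) * math.sqrt(support * (1.0 - support))

# (coef, support, importance) copied from the rows shown above.
rows = [
    (-4.819474e-23, 0.678851, 2.250298e-23),
    (-4.748976e-23, 0.689295, 2.197742e-23),
    (-4.863102e-23, 0.725849, 2.169360e-23),
]
for coef, support, reported in rows:
    computed = rule_importance(coef, support)
    print(f"{computed:.6e} vs {reported:.6e}")
```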