Experiments

TODO: 24-27 June 2019:

Create pipeline
- with initial unigrams baseline
- accuracy measure (e.g. precision-recall with AUROC)

Set up environment:

import libraries
load csv data

In [1]:

import random
import pandas as pd
from nltk.corpus import treebank
from sklearn.model_selection import train_test_split

description_df = pd.read_csv('./data/description.csv')
installation_df = pd.read_csv('./data/installation.csv')
invocation_df = pd.read_csv('./data/invocation.csv')
citation_df = pd.read_csv('./data/citation.csv')

Data Preview

Confirm that the CSV data was imported successfully.

In [2]:

print("Number of description entries: {}".format(len(description_df)))
description_df.head()

Number of description entries: 281

Out[2]:

	URL	excerpt
0	https://github.com/GoogleChrome/puppeteer	Puppeteer is a Node library which provides a h...
1	https://github.com/JimmySuen/integral-human-pose	The major contributors of this repository incl...
2	https://github.com/JimmySuen/integral-human-pose	Integral Regression is initially described in ...
3	https://github.com/JimmySuen/integral-human-pose	We build a 3D pose estimation system based mai...
4	https://github.com/JimmySuen/integral-human-pose	The Integral Regression is also known as soft-...

In [3]:

print("Number of installation entries: {}".format(len(installation_df)))
installation_df.head()

Number of installation entries: 800

Out[3]:

	URL	excerpt
0	https://github.com/GoogleChrome/puppeteer	Installation
1	https://github.com/GoogleChrome/puppeteer	To use Puppeteer in your project, run:
2	https://github.com/GoogleChrome/puppeteer	npm i puppeteer
3	https://github.com/GoogleChrome/puppeteer	# or "yarn add puppeteer"
4	https://github.com/GoogleChrome/puppeteer	puppeteer-core

In [4]:

print("Number of invocation entries: {}".format(len(invocation_df)))
invocation_df.head()

Number of invocation entries: 1118

Out[4]:

	URL	excerpt
0	https://github.com/JimmySuen/integral-human-pose	Usage
1	https://github.com/JimmySuen/integral-human-pose	We have placed some example config files in ex...
2	https://github.com/JimmySuen/integral-human-pose	Train
3	https://github.com/JimmySuen/integral-human-pose	For Integral Human Pose Regression, cd to pyto...
4	https://github.com/JimmySuen/integral-human-pose	Integral Regression

In [5]:

print("Number of citation entries: {}".format(len(citation_df)))
citation_df.head()

Number of citation entries: 309

Out[5]:

	URL	excerpt
0	https://github.com/JimmySuen/integral-human-pose	If you find Integral Regression useful in your...
1	https://github.com/JimmySuen/integral-human-pose	@article{sun2017integral,
2	https://github.com/JimmySuen/integral-human-pose	title={Integral human pose regression},
3	https://github.com/JimmySuen/integral-human-pose	author={Sun, Xiao and Xiao, Bin and Liang, Shu...
4	https://github.com/JimmySuen/integral-human-pose	journal={arXiv preprint arXiv:1711.08229},

Each data set currently contains only positive samples of its respective category. A classifier also needs negative samples to contrast against the positives. For each category, negative samples are drawn from the other categories and from text unrelated to repository information. For example, for the description classifier, positive samples are excerpts labelled as a description, and negative samples are excerpts labelled as an installation, invocation, or citation, plus unrelated text such as sentences from the Treebank corpus.

Because negative samples far outnumber positive samples, the negatives are selected at random. The target is about 40% positive and 60% negative samples. The negative 60% is split evenly: 15% from each of the three other categories and 15% from unrelated text such as Treebank.
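
The 0.375 multiplier used in the description classifier cell follows from these percentages: 15% per source divided by 40% positives is 0.375. A minimal sketch of the arithmetic, assuming four equally sized negative sources (the helper name `per_source_negatives` is illustrative and not part of the notebook):

```python
def per_source_negatives(n_positive, positive_share=0.40, n_sources=4):
    """Return how many negatives to draw from each source.

    If positives make up `positive_share` of the corpus, the negatives total
    n_positive * (1 - share) / share, split evenly across the sources.
    """
    negative_total = n_positive * (1 - positive_share) / positive_share
    return int(negative_total / n_sources)

n_pos = 281  # number of description entries reported above
per_source = per_source_negatives(n_pos)
total = n_pos + 4 * per_source
print(per_source)               # 105
print(round(n_pos / total, 3))  # 0.401
```

With 281 positives this gives 105 samples per source and 701 rows before any rows with missing values are dropped.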

Question: Treebank sentences are already tokenized (split into words). Does NLTK provide the sentences in untokenized form, or can the existing tokenization be reused by the tokenizer later in the pipeline?
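
One possible partial answer, offered as a sketch rather than a tested recommendation: the token lists can be joined with spaces, but Treebank sentences also carry null-element tokens such as `*-1` that never occur in ordinary README text. NLTK's tagged view of the corpus (`treebank.tagged_sents()`) marks these tokens with the tag `-NONE-`, so they can be filtered out before joining. The example below uses a hand-tagged sentence in place of the real corpus so that it runs without the NLTK data; `join_without_traces` is a hypothetical helper name.

```python
def join_without_traces(tagged_sentence):
    """Join (token, tag) pairs into a string, skipping '-NONE-' null elements."""
    return ' '.join(token for token, tag in tagged_sentence if tag != '-NONE-')

# Hand-tagged stand-in for one item of treebank.tagged_sents() (illustrative tags).
example = [('Terms', 'NNS'), ('were', 'VBD'), ("n't", 'RB'),
           ('disclosed', 'VBN'), ('*-1', '-NONE-'), ('.', '.')]

print(join_without_traces(example))  # Terms were n't disclosed .
```

Whether removing these tokens helps the classifier is an open question; leaving them in may make the Treebank negatives artificially easy to recognise.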

Description Classifier

In [6]:

# Each negative source contributes 0.375x the positive count (~40% positive overall).
neg_quant = int(len(description_df) * .375)
# Treebank sentences are token lists; join each back into one string.
treebank_sents = random.sample(list(treebank.sents()), neg_quant)
treebank_background = pd.DataFrame({"excerpt": [' '.join(s) for s in treebank_sents]}).assign(description=False)
description_corpus = pd.concat([
    description_df.assign(description=True),
    installation_df.sample(neg_quant).assign(description=False),
    invocation_df.sample(neg_quant).assign(description=False),
    citation_df.sample(neg_quant).assign(description=False),
    treebank_background], sort=False)
# Drop the URL column first so Treebank rows (which have no URL) survive dropna.
description_corpus = description_corpus.drop(columns='URL').dropna(axis=0).reset_index(drop=True)
description_corpus

Out[6]:

	excerpt	description
0	Puppeteer is a Node library which provides a h...	True
1	The major contributors of this repository incl...	True
2	Integral Regression is initially described in ...	True
3	We build a 3D pose estimation system based mai...	True
4	The Integral Regression is also known as soft-...	True
5	This is an official implementation for Integra...	True
6	The original implementation is based on our in...	True
7	LibGEOS is a LGPL-licensed package for manipul...	True
8	Among other things, it allows you to parse Wel...	True
9	This repository contains the experiments in th...	True
10	For the results presented in the paper, we did...	True
11	Batch normalization is currently not supported...	True
12	Open-source Ground Penetrating Radar processin...	True
13	Pytorch implementation for high-resolution (e....	True
14	The PVGeo Python package contains VTK powered ...	True
15	A PyVista (and VTK) interface for the Open Min...	True
16	GeoNotebook is an application that provides cl...	True
17	Fiona is OGR's neat and nimble API for Python ...	True
18	Fiona is designed to be simple and dependable....	True
19	Shapely is a BSD-licensed Python package for m...	True
20	Rain streaks can severely degrade the visibili...	True
21	The pytorch branch contains:	True
22	the pytorch implementation of Peak Response Ma...	True
23	the PASCAL-VOC demo (training, inference, and ...	True
24	Lithology and stratigraphic logs for wells and...	True
25	This Python module allows you to:	True
26	Interactively control an instance of ANSYS v14...	True
27	Extract data directly from binary ANSYS v14.5+...	True
28	Rapidly read in binary result (.rst), binary m...	True
29	Official implementation of GANimation. In this...	True
...	...	...
670	A Department of Health and Human Services rule...	False
671	But Mr. Hahn rose swiftly through the ranks , ...	False
672	AT&T FAX :	False
673	And many emerging markets have outpaced more m...	False
674	`` * Remember Pinocchio ? '' says T-1 a fema...	False
675	*-1 Currently a $ 300 million-a-year business ...	False
676	Koito has refused *-1 to grant Mr. Pickens sea...	False
677	The market again showed little interest in fur...	False
678	The idea , of course : * to prove to 125 corpo...	False
679	Because of deteriorating hearing , she told co...	False
680	And construction also was described *-101 as s...	False
681	The restrictions on viewing and dissemination ...	False
682	Whereas conventional securities financings are...	False
683	What T-102 's more , the test and Learning M...	False
684	But Robert R. Murray , a special master appoin...	False
685	Sales in stores open more than one year rose 3...	False
686	`` You 'd see her correcting homework in the s...	False
687	The ban on cross-border movement was imposed *...	False
688	Perhaps none of the unconstitutional condition...	False
689	A steady deposit base .	False
690	Buick approached American Express about a join...	False
691	Kalamazoo , Mich.-based First of America said ...	False
692	Michael R. Bromwich , a member since January 1...	False
693	Terms were n't disclosed *-1 .	False
694	The ultimate goal of any investor is a profit ...	False
695	Mr. Trump withdrew a $ 120-a-share U bid las...	False
696	On Wall Street men and women walk with great p...	False
697	One claims 0 he 's pro-choice .	False
698	Another was Nancy Yeargin , who T-89 came to...	False
699	* Think about what T-1 causes the difference...	False

700 rows × 2 columns

Description Classifier pipeline

Train-test split

In [7]:

X, y = description_corpus.excerpt, description_corpus.description
X_train, X_test, y_train, y_test = train_test_split(X, y)

Count Vectorizer and Logistic Regression in Pipeline

In [8]:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

def display_accuracy_score(y_test, y_pred_class):
    """Print and return the fraction of correct predictions."""
    score = accuracy_score(y_test, y_pred_class)
    print('accuracy score: {:.2%}'.format(score))
    return score

def display_null_accuracy(y_test):
    """Print and return the accuracy of always predicting the majority class."""
    value_counts = y_test.value_counts()
    null_accuracy = value_counts.max() / len(y_test)
    print('null accuracy: {:.2%}'.format(null_accuracy))
    return null_accuracy

def display_accuracy_difference(y_test, y_pred_class):
    """Compare model accuracy with null accuracy and return both."""
    null_accuracy = display_null_accuracy(y_test)
    # Named model_accuracy to avoid shadowing sklearn's accuracy_score.
    model_accuracy = display_accuracy_score(y_test, y_pred_class)
    difference = model_accuracy - null_accuracy
    if difference > 0:
        print('model is {:.2%} more accurate than null accuracy'.format(difference))
    elif difference < 0:
        print('model is {:.2%} less accurate than null accuracy'.format(abs(difference)))
    else:
        print('model is exactly as accurate as null accuracy')
    return null_accuracy, model_accuracy

pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
pipeline.fit(X_train, y_train)
y_pred_class = pipeline.predict(X_test)
y_pred_vals = pipeline.predict_proba(X_test)
#print(y_pred_vals)
#print("X_test: {}, y_pred: {}".format(X_test, y_pred_class))
#results_df = pd.DataFrame({"x_test": X_test, "y_pred": y_pred_vals[:,1], "y_TF_pred": y_pred_class, "y_actual": y_test})
results_df = pd.DataFrame({"x_test": X_test,  "y_TF_pred": y_pred_class, "y_actual": y_test})
print(results_df)
print(confusion_matrix(y_test, y_pred_class))
print('-' * 75 + '\nClassification Report\n')
print(classification_report(y_test, y_pred_class))
display_accuracy_difference(y_test, y_pred_class)

                                                x_test  y_TF_pred  y_actual
488                           tin = _meshfix.PyTMesh()      False     False
597  Lord Chilver , 63-year-old chairman of English...      False     False
686  `` You 'd see her correcting homework in the s...      False     False
417                                             header      False     False
529  title = {{PyVista}: 3D plotting and mesh analy...      False     False
566             @inproceedings{pumarola2018ganimation,      False     False
282                 pip install opencv-python==3.2.0.6      False     False
361                                pip install empymod      False     False
365  A C++ compiler for the Python extension, and C...       True     False
2    Integral Regression is initially described in ...      False      True
561  booktitle = {Proceedings of the International ...      False     False
101  The writing functionality in segyio is largely...       True      True
595  `` You either believe 0 Seymour can do it agai...      False     False
456          Semantic Segmentation with Deeplab-Resnet      False     False
492  Key Laboratory of Machine Perception, Shenzhen...      False     False
193  The goal of Tippecanoe is to enable making a s...       True      True
50   Finally e also provide precompiled Docker imag...       True      True
65   Calculates the complete (diffusion and wave ph...       True      True
430  and as CurveItem objects with associated metad...      False     False
596  If the money manager performing this service i...      False     False
303                              From source at GitHub      False     False
44   New developments in the field of augmented rea...       True      True
347                                         matplotlib      False     False
331                             Install the usual way:      False     False
434                          tensorboard --logdir logs      False     False
320  Installing apsg from the conda-forge channel c...      False     False
616  `` There 's no question that some of those wor...      False     False
496  Yu, (2018). PyGeoPressure: Geopressure Predict...      False     False
559            Fast End-to-End Trainable Guided Filter      False     False
91   Segyio is a small LGPL licensed C library for ...       True      True
..                                                 ...        ...       ...
369  Install python3.6 and pytorch 3. I recommend t...      False     False
626  Net income surged 31 % to 7.63 billion yen fro...      False     False
527                                        year={2018}      False     False
387                                                 ~/      False     False
78   Complete full-space (electric and magnetic sou...       True      True
572                                      Year = {2017}      False     False
156  The file read parameters are based on GSSI's D...       True      True
258  Tilematrix handles geographic web tiles and ti...       True      True
103  Segyio can handle a lot of files that are SEG-...       True      True
489                                         plt.show()      False     False
552                                                  }      False     False
459  The quantitative results of PSNR and SSIM in t...       True     False
105  Declarative: React makes it painless to create...       True      True
21                        The pytorch branch contains:      False      True
557                    title={CU-Net: Coupled U-Nets},      False     False
165  mplleaflet is a Python library that converts a...       True      True
696  On Wall Street men and women walk with great p...       True     False
503  and Andrew Tao and Jan Kautz and Bryan Catanza...      False     False
377  Users who need an older stable version of PySA...      False     False
178       exports to common formats (Mapnik XML, PNG…)      False      True
618  Mr. Driscoll did n't elaborate about who the p...      False     False
486       Pore Pressure Prediction using well log data       True     False
659  In its construction spending report , the Comm...      False     False
83                          Add-ons (empymod.scripts):      False      True
58                                        Introduction       True      True
407       strikes = strike + 10 * np.random.randn(num)      False     False
475     Run python predict_dgf.py -h for more details.      False     False
658                     Terms were n't disclosed *-1 .      False     False
617  But the growing controversy comes as many prac...      False     False
438                     well.params['horizon']["T20"])      False     False

[175 rows x 3 columns]
[[110  12]
 [ 18  35]]
---------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

       False       0.86      0.90      0.88       122
        True       0.74      0.66      0.70        53

    accuracy                           0.83       175
   macro avg       0.80      0.78      0.79       175
weighted avg       0.82      0.83      0.83       175

null accuracy: 69.71%
accuracy score: 82.86%
model is 13.14% more accurate than null accuracy

/home/allen/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

Out[8]:

(0.6971428571428572, 0.8285714285714286)
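
The figures in the classification report can be recomputed by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's convention that rows are actual labels and columns are predicted labels, both ordered (False, True); the variable names are illustrative.

```python
# Confusion matrix printed above: rows are actual labels, columns are
# predicted labels, both ordered (False, True).
tn, fp = 110, 12
fn, tp = 18, 35
total = tn + fp + fn + tp

precision = tp / (tp + fp)   # share of predicted descriptions that are correct
recall = tp / (tp + fn)      # share of actual descriptions that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total
null_accuracy = max(tn + fp, fn + tp) / total  # always predict the majority class

print('precision: {:.2f}'.format(precision))          # 0.74
print('recall: {:.2f}'.format(recall))                # 0.66
print('f1: {:.2f}'.format(f1))                        # 0.70
print('accuracy: {:.2%}'.format(accuracy))            # 82.86%
print('null accuracy: {:.2%}'.format(null_accuracy))  # 69.71%
```

These values match the True row of the report and the two accuracy figures, which confirms that the 13.14% improvement over null accuracy comes mostly from correctly identified non-descriptions, while about a third of actual descriptions are still missed.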

In [9]:

len(description_df)

Out[9]: