import random
import pandas as pd
from nltk.corpus import treebank
from sklearn.model_selection import train_test_split
description_df = pd.read_csv('./data/description.csv')
installation_df = pd.read_csv('./data/installation.csv')
invocation_df = pd.read_csv('./data/invocation.csv')
citation_df = pd.read_csv('./data/citation.csv')
Check that the CSV data has been imported successfully.
print("Number of description entries: {}".format(len(description_df)))
description_df.head()
Number of description entries: 281
| | URL | excerpt |
---|---|---|
0 | https://github.com/GoogleChrome/puppeteer | Puppeteer is a Node library which provides a h... |
1 | https://github.com/JimmySuen/integral-human-pose | The major contributors of this repository incl... |
2 | https://github.com/JimmySuen/integral-human-pose | Integral Regression is initially described in ... |
3 | https://github.com/JimmySuen/integral-human-pose | We build a 3D pose estimation system based mai... |
4 | https://github.com/JimmySuen/integral-human-pose | The Integral Regression is also known as soft-... |
print("Number of installation entries: {}".format(len(installation_df)))
installation_df.head()
Number of installation entries: 800
| | URL | excerpt |
---|---|---|
0 | https://github.com/GoogleChrome/puppeteer | Installation |
1 | https://github.com/GoogleChrome/puppeteer | To use Puppeteer in your project, run: |
2 | https://github.com/GoogleChrome/puppeteer | npm i puppeteer |
3 | https://github.com/GoogleChrome/puppeteer | # or "yarn add puppeteer" |
4 | https://github.com/GoogleChrome/puppeteer | puppeteer-core |
print("Number of invocation entries: {}".format(len(invocation_df)))
invocation_df.head()
Number of invocation entries: 1118
| | URL | excerpt |
---|---|---|
0 | https://github.com/JimmySuen/integral-human-pose | Usage |
1 | https://github.com/JimmySuen/integral-human-pose | We have placed some example config files in ex... |
2 | https://github.com/JimmySuen/integral-human-pose | Train |
3 | https://github.com/JimmySuen/integral-human-pose | For Integral Human Pose Regression, cd to pyto... |
4 | https://github.com/JimmySuen/integral-human-pose | Integral Regression |
print("Number of citation entries: {}".format(len(citation_df)))
citation_df.head()
Number of citation entries: 309
| | URL | excerpt |
---|---|---|
0 | https://github.com/JimmySuen/integral-human-pose | If you find Integral Regression useful in your... |
1 | https://github.com/JimmySuen/integral-human-pose | @article{sun2017integral, |
2 | https://github.com/JimmySuen/integral-human-pose | title={Integral human pose regression}, |
3 | https://github.com/JimmySuen/integral-human-pose | author={Sun, Xiao and Xiao, Bin and Liang, Shu... |
4 | https://github.com/JimmySuen/integral-human-pose | journal={arXiv preprint arXiv:1711.08229}, |
Each data set currently contains only positive samples of its respective category. A classifier also needs negative samples to contrast the positives against. For each category, the negative samples are drawn from the other three categories and from text unrelated to repository documentation. For example, for the description classifier, the positive samples are the excerpts labelled as descriptions, and the negative samples are excerpts labelled as installation, invocation, or citation, plus unrelated text such as sentences from the Treebank corpus.
Because there are many more candidate negative samples than positive samples, the negatives are selected at random. The target is about 40% positive and 60% negative samples, with the negative share split evenly: 15% from each of the three other categories and 15% from unrelated text such as Treebank. This is why neg_quant below is 0.375 times the number of positive samples (0.15 / 0.40 = 0.375).
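The 40/60 target can be checked with a little arithmetic. The sketch below is not part of the original notebook; the helper name class_shares and its parameters are made up for illustration. It applies the same rounding rule as neg_quant to the 281 description entries and reports the resulting shares before any rows are dropped.

```python
def class_shares(n_positive, neg_ratio=0.375, n_negative_sources=4):
    """Hypothetical helper: return (positive share, share per negative source)."""
    # Same rule as neg_quant in the notebook: truncate to an integer sample size.
    neg_quant = int(n_positive * neg_ratio)
    total = n_positive + n_negative_sources * neg_quant
    return n_positive / total, neg_quant / total


# 281 is the number of description entries reported earlier in the notebook.
pos_share, neg_share = class_shares(281)
print(f"positive: {pos_share:.1%}, each negative source: {neg_share:.1%}")
```

With four negative sources (three other categories plus Treebank), each at 0.375 times the positive count, the negatives total 1.5 times the positives, which gives the intended 40/60 split.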
Question: Treebank sentences are already tokenized (split into words). Does NLTK also provide the sentences unsplit, or can the pre-split state be reused when a tokenizer is applied later?
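One possible way to reuse the pre-split state, sketched under assumptions: scikit-learn's CountVectorizer accepts a callable as its analyzer, so a single function can tokenize raw strings and pass token lists through. The helper name mixed_analyzer and the two sample documents are hypothetical, and the notebook itself joins the Treebank tokens back into strings instead.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def mixed_analyzer(doc):
    """Hypothetical helper: accept a raw string or an already tokenized sentence."""
    if isinstance(doc, str):
        return TOKEN_RE.findall(doc.lower())
    # Pre-split input: keep the given token boundaries, lowercase the tokens,
    # and drop any token the default pattern would not accept as a whole.
    return [tok.lower() for tok in doc if TOKEN_RE.fullmatch(tok)]


docs = [
    "Puppeteer is a Node library",               # raw README-style excerpt
    ["Terms", "were", "n't", "disclosed", "."],  # Treebank-style token list
]
vectorizer = CountVectorizer(analyzer=mixed_analyzer)
counts = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))
```

The trade-off is that the two input types are no longer tokenized identically: a hyphenated Treebank token is dropped whole here, whereas the default pattern would split it into its parts.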
neg_quant = int(len(description_df) * .375)
# Join each pre-tokenized Treebank sentence back into a single string.
treebank_background = pd.DataFrame(
    [' '.join(sent) for sent in random.sample(list(treebank.sents()), neg_quant)],
    columns=["excerpt"],
).assign(description=False)
# Positives plus an equal-sized random sample from each negative source.
description_corpus = pd.concat(
    [
        description_df.assign(description=True),
        installation_df.sample(neg_quant).assign(description=False),
        invocation_df.sample(neg_quant).assign(description=False),
        citation_df.sample(neg_quant).assign(description=False),
        treebank_background,
    ],
    sort=False,
)
# Keyword arguments are used because pandas 2.0 no longer accepts a positional axis.
description_corpus.drop(columns='URL', inplace=True)
description_corpus.dropna(axis=0, inplace=True)
description_corpus.reset_index(drop=True, inplace=True)
description_corpus
| | excerpt | description |
---|---|---|
0 | Puppeteer is a Node library which provides a h... | True |
1 | The major contributors of this repository incl... | True |
2 | Integral Regression is initially described in ... | True |
3 | We build a 3D pose estimation system based mai... | True |
4 | The Integral Regression is also known as soft-... | True |
5 | This is an official implementation for Integra... | True |
6 | The original implementation is based on our in... | True |
7 | LibGEOS is a LGPL-licensed package for manipul... | True |
8 | Among other things, it allows you to parse Wel... | True |
9 | This repository contains the experiments in th... | True |
10 | For the results presented in the paper, we did... | True |
11 | Batch normalization is currently not supported... | True |
12 | Open-source Ground Penetrating Radar processin... | True |
13 | Pytorch implementation for high-resolution (e.... | True |
14 | The PVGeo Python package contains VTK powered ... | True |
15 | A PyVista (and VTK) interface for the Open Min... | True |
16 | GeoNotebook is an application that provides cl... | True |
17 | Fiona is OGR's neat and nimble API for Python ... | True |
18 | Fiona is designed to be simple and dependable.... | True |
19 | Shapely is a BSD-licensed Python package for m... | True |
20 | Rain streaks can severely degrade the visibili... | True |
21 | The pytorch branch contains: | True |
22 | the pytorch implementation of Peak Response Ma... | True |
23 | the PASCAL-VOC demo (training, inference, and ... | True |
24 | Lithology and stratigraphic logs for wells and... | True |
25 | This Python module allows you to: | True |
26 | Interactively control an instance of ANSYS v14... | True |
27 | Extract data directly from binary ANSYS v14.5+... | True |
28 | Rapidly read in binary result (.rst), binary m... | True |
29 | Official implementation of GANimation. In this... | True |
... | ... | ... |
670 | A Department of Health and Human Services rule... | False |
671 | But Mr. Hahn rose swiftly through the ranks , ... | False |
672 | AT&T FAX : | False |
673 | And many emerging markets have outpaced more m... | False |
674 | `` * Remember Pinocchio ? '' says *T*-1 a fema... | False |
675 | *-1 Currently a $ 300 million-a-year business ... | False |
676 | Koito has refused *-1 to grant Mr. Pickens sea... | False |
677 | The market again showed little interest in fur... | False |
678 | The idea , of course : * to prove to 125 corpo... | False |
679 | Because of deteriorating hearing , she told co... | False |
680 | And construction also was described *-101 as s... | False |
681 | The restrictions on viewing and dissemination ... | False |
682 | Whereas conventional securities financings are... | False |
683 | What *T*-102 's more , the test and Learning M... | False |
684 | But Robert R. Murray , a special master appoin... | False |
685 | Sales in stores open more than one year rose 3... | False |
686 | `` You 'd see her correcting homework in the s... | False |
687 | The ban on cross-border movement was imposed *... | False |
688 | Perhaps none of the unconstitutional condition... | False |
689 | A steady deposit base . | False |
690 | Buick approached American Express about a join... | False |
691 | Kalamazoo , Mich.-based First of America said ... | False |
692 | Michael R. Bromwich , a member since January 1... | False |
693 | Terms were n't disclosed *-1 . | False |
694 | The ultimate goal of any investor is a profit ... | False |
695 | Mr. Trump withdrew a $ 120-a-share *U* bid las... | False |
696 | On Wall Street men and women walk with great p... | False |
697 | One claims 0 he 's pro-choice . | False |
698 | Another was Nancy Yeargin , who *T*-89 came to... | False |
699 | * Think about what *T*-1 causes the difference... | False |
700 rows × 2 columns
X, y = description_corpus.excerpt, description_corpus.description
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
def display_accuracy_score(y_test, y_pred_class):
    """Print and return the fraction of correct predictions."""
    score = accuracy_score(y_test, y_pred_class)
    print('accuracy score: {:.2%}'.format(score))
    return score

def display_null_accuracy(y_test):
    """Print and return the accuracy of always predicting the majority class."""
    value_counts = pd.Series(y_test).value_counts()
    null_accuracy = value_counts.max() / float(len(y_test))
    print('null accuracy: {:.2%}'.format(null_accuracy))
    return null_accuracy

def display_accuracy_difference(y_test, y_pred_class):
    """Print how the model's accuracy compares with the null accuracy."""
    null_accuracy = display_null_accuracy(y_test)
    # Named model_accuracy so that it does not shadow sklearn's accuracy_score.
    model_accuracy = display_accuracy_score(y_test, y_pred_class)
    difference = model_accuracy - null_accuracy
    if difference > 0:
        print('model is {:.2%} more accurate than null accuracy'.format(difference))
    elif difference < 0:
        print('model is {:.2%} less accurate than null accuracy'.format(abs(difference)))
    else:
        print('model is exactly as accurate as null accuracy')
    return null_accuracy, model_accuracy
pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
pipeline.fit(X_train, y_train)
y_pred_class = pipeline.predict(X_test)
y_pred_vals = pipeline.predict_proba(X_test)
#print(y_pred_vals)
#print("X_test: {}, y_pred: {}".format(X_test, y_pred_class))
#results_df = pd.DataFrame({"x_test": X_test, "y_pred": y_pred_vals[:,1], "y_TF_pred": y_pred_class, "y_actual": y_test})
results_df = pd.DataFrame({"x_test": X_test, "y_TF_pred": y_pred_class, "y_actual": y_test})
print(results_df)
print(confusion_matrix(y_test, y_pred_class))
print('-' * 75 + '\nClassification Report\n')
print(classification_report(y_test, y_pred_class))
display_accuracy_difference(y_test, y_pred_class)
| | x_test | y_TF_pred | y_actual |
---|---|---|---|
488 | tin = _meshfix.PyTMesh() | False | False |
597 | Lord Chilver , 63-year-old chairman of English... | False | False |
686 | `` You 'd see her correcting homework in the s... | False | False |
417 | header | False | False |
529 | title = {{PyVista}: 3D plotting and mesh analy... | False | False |
566 | @inproceedings{pumarola2018ganimation, | False | False |
282 | pip install opencv-python==3.2.0.6 | False | False |
361 | pip install empymod | False | False |
365 | A C++ compiler for the Python extension, and C... | True | False |
2 | Integral Regression is initially described in ... | False | True |
561 | booktitle = {Proceedings of the International ... | False | False |
101 | The writing functionality in segyio is largely... | True | True |
595 | `` You either believe 0 Seymour can do it agai... | False | False |
456 | Semantic Segmentation with Deeplab-Resnet | False | False |
492 | Key Laboratory of Machine Perception, Shenzhen... | False | False |
193 | The goal of Tippecanoe is to enable making a s... | True | True |
50 | Finally e also provide precompiled Docker imag... | True | True |
65 | Calculates the complete (diffusion and wave ph... | True | True |
430 | and as CurveItem objects with associated metad... | False | False |
596 | If the money manager performing this service i... | False | False |
303 | From source at GitHub | False | False |
44 | New developments in the field of augmented rea... | True | True |
347 | matplotlib | False | False |
331 | Install the usual way: | False | False |
434 | tensorboard --logdir logs | False | False |
320 | Installing apsg from the conda-forge channel c... | False | False |
616 | `` There 's no question that some of those wor... | False | False |
496 | Yu, (2018). PyGeoPressure: Geopressure Predict... | False | False |
559 | Fast End-to-End Trainable Guided Filter | False | False |
91 | Segyio is a small LGPL licensed C library for ... | True | True |
.. | ... | ... | ... |
369 | Install python3.6 and pytorch 3. I recommend t... | False | False |
626 | Net income surged 31 % to 7.63 billion yen fro... | False | False |
527 | year={2018} | False | False |
387 | ~/ | False | False |
78 | Complete full-space (electric and magnetic sou... | True | True |
572 | Year = {2017} | False | False |
156 | The file read parameters are based on GSSI's D... | True | True |
258 | Tilematrix handles geographic web tiles and ti... | True | True |
103 | Segyio can handle a lot of files that are SEG-... | True | True |
489 | plt.show() | False | False |
552 | } | False | False |
459 | The quantitative results of PSNR and SSIM in t... | True | False |
105 | Declarative: React makes it painless to create... | True | True |
21 | The pytorch branch contains: | False | True |
557 | title={CU-Net: Coupled U-Nets}, | False | False |
165 | mplleaflet is a Python library that converts a... | True | True |
696 | On Wall Street men and women walk with great p... | True | False |
503 | and Andrew Tao and Jan Kautz and Bryan Catanza... | False | False |
377 | Users who need an older stable version of PySA... | False | False |
178 | exports to common formats (Mapnik XML, PNG…) | False | True |
618 | Mr. Driscoll did n't elaborate about who the p... | False | False |
486 | Pore Pressure Prediction using well log data | True | False |
659 | In its construction spending report , the Comm... | False | False |
83 | Add-ons (empymod.scripts): | False | True |
58 | Introduction | True | True |
407 | strikes = strike + 10 * np.random.randn(num) | False | False |
475 | Run python predict_dgf.py -h for more details. | False | False |
658 | Terms were n't disclosed *-1 . | False | False |
617 | But the growing controversy comes as many prac... | False | False |
438 | well.params['horizon']["T20"]) | False | False |

[175 rows x 3 columns]

[[110 12]
 [ 18 35]]

---------------------------------------------------------------------------
Classification Report

| | precision | recall | f1-score | support |
---|---|---|---|---|
False | 0.86 | 0.90 | 0.88 | 122 |
True | 0.74 | 0.66 | 0.70 | 53 |
accuracy | | | 0.83 | 175 |
macro avg | 0.80 | 0.78 | 0.79 | 175 |
weighted avg | 0.82 | 0.83 | 0.83 | 175 |

null accuracy: 69.71%
accuracy score: 82.86%
model is 13.14% more accurate than null accuracy
/home/allen/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning. FutureWarning)
(0.6971428571428572, 0.8285714285714286)
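The reported figures can be re-derived by hand from the confusion matrix printed above. The sketch below is not part of the original notebook. It assumes scikit-learn's layout, where rows are actual labels and columns are predicted labels, both in the order (False, True), and it recomputes the accuracy, the null accuracy, and the scores for the True class.

```python
# Confusion matrix from the output above: [[110, 12], [18, 35]].
tn, fp = 110, 12   # actual False: predicted False / predicted True
fn, tp = 18, 35    # actual True:  predicted False / predicted True

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
# Null accuracy: the score obtained by always predicting the majority class.
null_accuracy = max(tn + fp, fn + tp) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy {accuracy:.2%}, null accuracy {null_accuracy:.2%}")
print(f"precision {precision:.2f}, recall {recall:.2f}, f1 {f1:.2f}")
```

The results agree with the classification report: 0.74 precision, 0.66 recall, and 0.70 F1 for the True class. The recall shows that about a third of the actual descriptions in this test split were missed.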
len(description_df)
281