Import the libraries:
import numpy as np
import scipy as sp
import scipy.sparse  # load the submodule explicitly so sp.sparse is available
import pandas as pd
pd.set_option("display.max_columns", 1000)
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
%matplotlib inline
import matplotlib.pyplot as plt
The competition is available via the link. Our task is to predict, for each pair of questions, the probability that they ask the same thing (that is, that they are duplicates).
Read the data:
train = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')
train.fillna('', inplace=True)
test.fillna('', inplace=True)
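The `fillna('')` calls matter because an empty question is read from CSV as `NaN` (a float), which `CountVectorizer` would reject later on. A minimal sketch on a made-up two-row frame; the `toy` data is hypothetical and not taken from the competition files:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the second row has a missing question2.
toy = pd.DataFrame({
    'question1': ['How do I learn Python?', 'What is a VPN?'],
    'question2': ['How can I learn Python?', np.nan],
})

n_missing_before = int(toy['question2'].isna().sum())

# Replace missing values with empty strings so every cell is a str.
toy = toy.fillna('')

all_strings = all(isinstance(q, str) for q in toy['question2'])
print(n_missing_before, all_strings)
```

An empty string simply produces an all-zero count vector, so such rows stay in the data instead of raising an error.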
In the data we are interested in the "question1" and "question2" columns, which hold the questions themselves. The train set also has the target column "is_duplicate", and each question has its own ID: "qid1" and "qid2" respectively.
train.head()
| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in share market in india? | What is the step by step guide to invest in share market? | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Diamond? | What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet connection while using a VPN? | How can Internet speed be increased by hacking through DNS? | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve it? | Find the remainder when [math]23^{24}[/math] is divided by 24,23? | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt, methane and carbon di oxide? | Which fish would survive in salt water? | 0 |
test.head()
| | test_id | question1 | question2 |
|---|---|---|---|
| 0 | 0 | How does the Surface Pro himself 4 compare with iPad Pro? | Why did Microsoft choose core m3 and not core i3 home Surface Pro 4? |
| 1 | 1 | Should I have a hair transplant at age 24? How much would it cost? | How much cost does hair transplant require? |
| 2 | 2 | What but is the best way to send money from China to the US? | What you send money to China? |
| 3 | 3 | Which food not emulsifiers? | What foods fibre? |
| 4 | 4 | How "aberystwyth" start reading? | How their can I start reading? |
As features, we take the word counts of each question:
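# Fit one shared vocabulary on all training questions from both columns.
prep = CountVectorizer().fit(list(train['question1']) + list(train['question2']))
# Each row is the word-count vector of question1 followed by that of question2.
X_train = sp.sparse.hstack([prep.transform(train['question1']),
                            prep.transform(train['question2'])])
Y_train = train['is_duplicate'].values  # as_matrix() was removed in pandas 1.0
X_test = sp.sparse.hstack([prep.transform(test['question1']),
                           prep.transform(test['question2'])])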
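To make the feature layout concrete, here is a small sketch on two made-up question pairs (the strings and the names `q1`, `q2`, `vec` are hypothetical). It shows that the stacked matrix has twice as many columns as the vocabulary has words: the left half counts words in the first question, the right half counts words in the second.

```python
import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy question pairs.
q1 = ['how to learn python', 'what is a vpn']
q2 = ['how can i learn python', 'which fish live in salt water']

# One shared vocabulary fitted on both columns.
vec = CountVectorizer().fit(q1 + q2)
vocab_size = len(vec.vocabulary_)

# Each row is [counts for question 1 | counts for question 2].
X = scipy.sparse.hstack([vec.transform(q1), vec.transform(q2)]).tocsr()

# The same word occupies column `col` in the left half
# and column `vocab_size + col` in the right half.
col = vec.vocabulary_['python']
print(X.shape, X[0, col], X[0, vocab_size + col])
```

With this layout a linear model learns separate weights for a word in the first and in the second question; it has no feature that directly says "both questions contain this word".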
Train a logistic regression with L1 regularization:
clf = LogisticRegression(penalty='l1', solver='liblinear')  # the default lbfgs solver in newer scikit-learn does not support L1
clf.fit(X_train, Y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l1', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
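One practical effect of the L1 penalty is that it drives many coefficients to exactly zero, which is useful when there is one feature per vocabulary word. The sketch below illustrates this on synthetic data, not on the competition data; the dataset parameters and `C=0.05` are arbitrary assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 50 features, only 5 of which carry signal.
n_features = 50
X, y = make_classification(n_samples=200, n_features=n_features,
                           n_informative=5, n_redundant=0,
                           random_state=0)

# Same regularization strength, different penalty.
l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.05).fit(X, y)
l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.05).fit(X, y)

# Count the coefficients that are not exactly zero.
n_nonzero_l1 = int(np.count_nonzero(l1.coef_))
n_nonzero_l2 = int(np.count_nonzero(l2.coef_))
print(n_nonzero_l1, n_nonzero_l2)
```

Smaller values of `C` mean stronger regularization and therefore fewer nonzero weights.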
Predict the required probabilities:
preds = clf.predict_proba(X_test)[:, 1]
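The `[:, 1]` index works because `predict_proba` returns one column per class, ordered as in `clf.classes_`, and with labels 0 and 1 the second column is the probability of the positive class. A minimal sketch on made-up one-dimensional data (the arrays below are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 1-D toy data: the label is 1 for larger x.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(penalty='l1', solver='liblinear').fit(X, y)

proba = clf.predict_proba(X)            # shape (n_samples, n_classes)
pos_col = list(clf.classes_).index(1)   # column holding P(label == 1)
p_duplicate = proba[:, pos_col]

print(clf.classes_, pos_col)
```

Looking the column up through `clf.classes_` is a little more defensive than hard-coding `1`, although for a 0/1 target the two are equivalent.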
Write them to a file:
submission = pd.read_csv('./input/sample_submission.csv')
submission['is_duplicate'] = preds
submission.to_csv('submission.csv', index=False)
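Because the predictions are assigned by position, it can be worth checking the result before uploading. The sketch below uses hypothetical stand-ins for `sample_submission.csv` and for the model output, and assumes the submission has the columns `test_id` and `is_duplicate`; it writes to an in-memory buffer instead of a file.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical stand-ins for sample_submission.csv and the model output.
submission = pd.DataFrame({'test_id': [0, 1, 2], 'is_duplicate': [0, 0, 0]})
preds = np.array([0.12, 0.87, 0.40])

# Assignment is positional, so the lengths must match.
assert len(preds) == len(submission)
submission['is_duplicate'] = preds

# Round-trip through CSV (in memory here instead of a file on disk).
buf = io.StringIO()
submission.to_csv(buf, index=False)
buf.seek(0)
check = pd.read_csv(buf)

# Probabilities must lie in [0, 1] and no cell may be missing.
valid = bool(check['is_duplicate'].between(0, 1).all()) and not check.isna().any().any()
print(list(check.columns), valid)
```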