Import the libraries

In [1]:
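import numpy as np
import scipy as sp
import scipy.sparse  # make sure sp.sparse is loaded explicitly
import pandas as pd
pd.set_option("display.max_columns", 1000)
pd.set_option('display.max_colwidth', None)  # None = do not truncate; -1 is deprecated in newer pandas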

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

%matplotlib inline
import matplotlib.pyplot as plt

The competition is available at the link. Our task: given a pair of questions, predict the probability that they ask the same thing (that is, that they are duplicates).

Read the data:

In [2]:
train = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')

train.fillna('', inplace=True)
test.fillna('', inplace=True)

In the data we care about the columns "question1" and "question2", which hold the questions themselves. The train set also has the target column "is_duplicate". In addition, each question has its own ID: "qid1" and "qid2" respectively.

In [3]:
train.head()
Out[3]:
  | id | qid1 | qid2 | question1 | question2 | is_duplicate
0 | 0 | 1 | 2 | What is the step by step guide to invest in share market in india? | What is the step by step guide to invest in share market? | 0
1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Diamond? | What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? | 0
2 | 2 | 5 | 6 | How can I increase the speed of my internet connection while using a VPN? | How can Internet speed be increased by hacking through DNS? | 0
3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve it? | Find the remainder when [math]23^{24}[/math] is divided by 24,23? | 0
4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt, methane and carbon di oxide? | Which fish would survive in salt water? | 0
In [10]:
test.head()
Out[10]:
  | test_id | question1 | question2
0 | 0 | How does the Surface Pro himself 4 compare with iPad Pro? | Why did Microsoft choose core m3 and not core i3 home Surface Pro 4?
1 | 1 | Should I have a hair transplant at age 24? How much would it cost? | How much cost does hair transplant require?
2 | 2 | What but is the best way to send money from China to the US? | What you send money to China?
3 | 3 | Which food not emulsifiers? | What foods fibre?
4 | 4 | How "aberystwyth" start reading? | How their can I start reading?

As features, we take the word counts in each question

In [4]:
prep = CountVectorizer().fit(list(train['question1']) + list(train['question2']))
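
To make the "word counts" feature concrete, here is a minimal sketch on toy questions. The questions and the names `toy_q1`, `toy_q2`, `toy_prep` are made up for illustration and are not part of the competition data. It assumes the default `CountVectorizer` settings, which lowercase the text and keep only tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy questions (hypothetical, not taken from the competition data).
toy_q1 = ["How do I learn Python?", "What is a VPN?"]
toy_q2 = ["How can I learn Python fast?", "Is is a word?"]

# Fit one shared vocabulary on both question columns, as in the cell above.
toy_prep = CountVectorizer().fit(toy_q1 + toy_q2)

# vocabulary_ maps each word to its column index; order the words by column.
vocab = sorted(toy_prep.vocabulary_, key=toy_prep.vocabulary_.get)

# Count the words of a single question.
counts = toy_prep.transform(["Is is a word?"]).toarray()[0]
word_counts = dict(zip(vocab, counts))
print(word_counts)
```

With the default tokenizer, one-letter tokens such as "a" and "I" never enter the vocabulary, and a repeated word is counted as many times as it occurs.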
In [5]:
X_train = sp.sparse.hstack([prep.transform(train['question1']),
                            prep.transform(train['question2'])])

Y_train = train['is_duplicate'].values  # .as_matrix() was removed in pandas 1.0

X_test = sp.sparse.hstack([prep.transform(test['question1']),
                           prep.transform(test['question2'])])
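
The horizontal stacking above can be checked on a small example. This is a sketch on toy data (the questions and the `toy_*` names are hypothetical): each question is mapped to a row of vocabulary-sized counts, and the two blocks are placed side by side, so the model sees twice as many columns as there are words in the vocabulary.

```python
import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy question pairs (hypothetical, not taken from the competition data).
toy_q1 = ["How do I learn Python?", "What is a VPN?"]
toy_q2 = ["How can I learn Python fast?", "Why use a VPN?"]

toy_prep = CountVectorizer().fit(toy_q1 + toy_q2)
A = toy_prep.transform(toy_q1)  # counts for the first question of each pair
B = toy_prep.transform(toy_q2)  # counts for the second question of each pair

# Place the two blocks side by side: columns [0, n_words) come from question1,
# columns [n_words, 2 * n_words) come from question2.
X_toy = scipy.sparse.hstack([A, B]).tocsr()
n_words = len(toy_prep.vocabulary_)
print(A.shape, B.shape, X_toy.shape)
```

One consequence of this layout is that the same word gets two separate weights in the model, one for each position in the pair.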
In [6]:
clf = LogisticRegression(penalty='l1', solver='liblinear')  # liblinear supports the L1 penalty
clf.fit(X_train, Y_train)
Out[6]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Predict the required probabilities

In [7]:
preds = clf.predict_proba(X_test)[:, 1]
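
A short sketch of why column 1 is taken: `predict_proba` returns one column per class, in the order given by `classes_`, so with labels 0 and 1 the second column holds the probability of class 1 (duplicate). The data below is synthetic, and the model uses default settings rather than the L1 configuration above, since only the column order matters here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic data set (hypothetical): one feature, label 1 when it is large.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = LogisticRegression().fit(X_toy, y_toy)

proba = toy_clf.predict_proba(X_toy)        # shape: (n_samples, n_classes)
pos_col = list(toy_clf.classes_).index(1)   # column that belongs to class 1
p_duplicate = proba[:, pos_col]

print(toy_clf.classes_, pos_col)
print(p_duplicate)
```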

Write them to a file

In [8]:
submission = pd.read_csv('./input/sample_submission.csv')
submission['is_duplicate'] = preds
submission.to_csv('submission.csv', index=False)

The prediction file must be uploaded via the link. Your solution will be scored with the Logloss metric.
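
For reference, binary log loss is the negative mean log-likelihood of the true labels under the predicted probabilities. Below is a minimal sketch of that formula. The function name `log_loss_binary` and the clipping constant `eps` are illustrative assumptions; the exact clipping used by the competition scorer is not stated here.

```python
import math

def log_loss_binary(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for 0/1 labels and predicted P(label = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip to avoid log(0); eps is an assumed value, not the official one.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Predicting 0.5 everywhere gives log(2); confident correct answers score lower,
# confident wrong answers score much higher.
print(log_loss_binary([1, 0], [0.5, 0.5]))
print(log_loss_binary([1, 0], [0.9, 0.1]))
print(log_loss_binary([1, 0], [0.1, 0.9]))
```

Because the metric penalizes confident mistakes heavily, it rewards well-calibrated probabilities rather than hard 0/1 answers.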