Import the libraries

In [1]:

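import numpy as np
import scipy as sp
import scipy.sparse  # load the submodule explicitly so that sp.sparse is available
import pandas as pd
pd.set_option("display.max_columns", 1000)
pd.set_option("display.max_colwidth", None)  # None disables truncation; -1 is deprecated in recent pandas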

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

%matplotlib inline
import matplotlib.pyplot as plt

The competition is available via the link. Our task: given a pair of questions, predict the probability that they ask about the same thing (that is, that they are duplicates).

Read the data:

In [2]:

train = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')

train.fillna('', inplace=True)
test.fillna('', inplace=True)

The columns of interest are "question1" and "question2", which hold the questions themselves. The train set also has the target column "is_duplicate", and each question has its own id: "qid1" and "qid2" respectively.

In [3]:

train.head()

Out[3]:

	id	qid1	qid2	question1	question2
0	0	1	2	What is the step by step guide to invest in share market in india?	What is the step by step guide to invest in share market?
1	1	3	4	What is the story of Kohinoor (Koh-i-Noor) Diamond?	What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?
2	2	5	6	How can I increase the speed of my internet connection while using a VPN?	How can Internet speed be increased by hacking through DNS?
3	3	7	8	Why am I mentally very lonely? How can I solve it?	Find the remainder when [math]23^{24}[/math] is divided by 24,23?
4	4	9	10	Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?	Which fish would survive in salt water?

In [10]:

test.head()

Out[10]:

	test_id	question1	question2
0	0	How does the Surface Pro himself 4 compare with iPad Pro?	Why did Microsoft choose core m3 and not core i3 home Surface Pro 4?
1	1	Should I have a hair transplant at age 24? How much would it cost?	How much cost does hair transplant require?
2	2	What but is the best way to send money from China to the US?	What you send money to China?
3	3	Which food not emulsifiers?	What foods fibre?
4	4	How "aberystwyth" start reading?	How their can I start reading?

As features, we take the word counts in the questions

In [4]:

prep = CountVectorizer().fit(list(train['question1']) + list(train['question2']))
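
To make the bag-of-words step concrete, here is a minimal sketch on two made-up questions (they are not from the competition data, and the `toy_*` names are hypothetical). It shows what vocabulary `CountVectorizer` learns with its default settings and what the resulting count matrix looks like.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy questions, invented for illustration only.
toy_questions = [
    "How do I learn Python?",
    "How do I learn Python fast?",
]

toy_vectorizer = CountVectorizer().fit(toy_questions)

# With default settings the text is lowercased and single-character
# tokens such as "I" are dropped by the default token pattern.
toy_vocab = sorted(toy_vectorizer.vocabulary_)
toy_counts = toy_vectorizer.transform(toy_questions).toarray()

print(toy_vocab)   # ['do', 'fast', 'how', 'learn', 'python']
print(toy_counts)  # one row per question, one column per vocabulary word
```

Each row is a question and each column counts how often one vocabulary word occurs in it; the two rows differ only in the "fast" column.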

In [5]:

X_train = sp.sparse.hstack([prep.transform(train['question1']),
                            prep.transform(train['question2'])])

Y_train = train['is_duplicate'].values  # as_matrix() was removed in pandas 1.0

X_test = sp.sparse.hstack([prep.transform(test['question1']),
                           prep.transform(test['question2'])])
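
The horizontal stacking above can be illustrated on two tiny hand-made matrices (the `q1_block` and `q2_block` names and their values are assumptions for this sketch). The point is that the question1 and question2 counts end up side by side, so the model gets a separate weight for each word in each position.

```python
import numpy as np
import scipy.sparse

# Two small made-up count matrices standing in for the
# question1 and question2 blocks (2 pairs, 3-word vocabulary).
q1_block = scipy.sparse.csr_matrix(np.array([[1, 0, 2], [0, 1, 0]]))
q2_block = scipy.sparse.csr_matrix(np.array([[0, 3, 0], [1, 0, 1]]))

# Stack column-wise; tocsr() makes the storage format explicit,
# which is convenient for row slicing later on.
pair_features = scipy.sparse.hstack([q1_block, q2_block]).tocsr()

print(pair_features.shape)      # (2, 6)
print(pair_features.toarray())
```

The number of rows is unchanged, while the number of columns doubles: the first half describes question1 and the second half describes question2.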

Train a logistic regression with L1 regularization.

In [6]:

clf = LogisticRegression(penalty='l1', solver='liblinear')  # recent scikit-learn defaults to lbfgs, which does not support L1
clf.fit(X_train, Y_train)

Out[6]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
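
One common reason to choose an L1 penalty with a large bag-of-words feature space is that it drives many coefficients to exactly zero. The sketch below illustrates this on synthetic data; the dataset, the value `C=0.1`, and all `*_demo` / `l1_*` / `l2_*` names are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 50 features, of which only 5 are informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=50, n_informative=5, random_state=0
)

# Same solver and regularization strength, different penalty.
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1, random_state=0)
l2_model = LogisticRegression(penalty='l2', solver='liblinear', C=0.1, random_state=0)
l1_model.fit(X_demo, y_demo)
l2_model.fit(X_demo, y_demo)

# Count the coefficients that are not exactly zero.
l1_nonzero = int(np.count_nonzero(l1_model.coef_))
l2_nonzero = int(np.count_nonzero(l2_model.coef_))
print("non-zero coefficients with L1:", l1_nonzero)
print("non-zero coefficients with L2:", l2_nonzero)
```

On data like this, the L1 model typically keeps only a fraction of the features, while the L2 model keeps essentially all of them with small weights.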

Predict the required probabilities

In [7]:

preds = clf.predict_proba(X_test)[:, 1]

Write them to a file

In [8]:

submission = pd.read_csv('./input/sample_submission.csv')
submission['is_duplicate'] = preds
submission.to_csv('submission.csv', index=False)

The prediction file should be uploaded via the link. Your solution will be scored with the Logloss metric.
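
For reference, binary log loss is the mean of -[y log(p) + (1 - y) log(1 - p)] over all pairs, where y is the true label and p is the predicted probability of a duplicate. A minimal sketch on made-up labels and probabilities (not competition data) that computes it by hand and compares the result with `sklearn.metrics.log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.1, 0.6, 0.4])

# Binary log loss written out directly from the formula.
manual_logloss = -np.mean(
    y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
)

# The same quantity computed by scikit-learn.
library_logloss = log_loss(y_true, y_prob)

print(manual_logloss, library_logloss)
```

Because the metric penalizes confident wrong answers heavily, it is usually safer to submit calibrated probabilities, as `predict_proba` gives here, rather than hard 0/1 labels.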