In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID";
os.environ["CUDA_VISIBLE_DEVICES"]="0"; 
In [2]:
import ktrain
from ktrain import text
Using TensorFlow backend.
using Keras version: 2.2.4

Building an Arabic Sentiment Analyzer

In this notebook, we will build a fast and accurate Arabic-language text classification model in four simple steps. More specifically, we will build a model that classifies Arabic hotel reviews as either positive or negative.

The dataset can be downloaded from Ashraf Elnagar's GitHub repository (https://github.com/elnagara/HARD-Arabic-Dataset).
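
If you prefer to fetch the data programmatically, a minimal sketch along these lines should work (the archive name and its path within the repository are assumptions here, so adjust them to match the repository's actual layout):

In [ ]:
# hypothetical download helper: the URL path and archive name are assumptions
import os, zipfile, urllib.request
url = 'https://github.com/elnagara/HARD-Arabic-Dataset/raw/master/data/balanced-reviews.zip'
dest = 'data/arabic_hotel_reviews'
os.makedirs(dest, exist_ok=True)
zip_path = os.path.join(dest, 'balanced-reviews.zip')
urllib.request.urlretrieve(url, zip_path)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(dest)  # should yield the tab-delimited, UTF-16 balanced-reviews.txt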

Each entry in the dataset includes a review in Arabic and a rating between 1 and 5. We will convert this to a binary classification dataset by assigning reviews with a rating above 3 a positive label of 1 and reviews with a rating below 3 a negative label of 0.

(Disclaimer: I don't speak Arabic. Please forgive mistakes.)

In [3]:
# convert ratings to a binary format:  1=positive, 0=negative
import pandas as pd
df = pd.read_csv('data/arabic_hotel_reviews/balanced-reviews.txt', delimiter='\t', encoding='utf-16')
df = df[['rating', 'review']] 
df['rating'] = df['rating'].apply(lambda x: 'neg' if x < 3 else 'pos')
df.columns = ['label', 'text']
df = pd.concat([df, df.label.astype('str').str.get_dummies()], axis=1, sort=False)
df = df[['text', 'neg', 'pos']]
df.head()
Out[3]:
text neg pos
0 “ممتاز”. النظافة والطاقم متعاون. 1 0
1 استثنائي. سهولة إنهاء المعاملة في الاستقبال. ل... 0 1
2 استثنائي. انصح بأختيار الاسويت و بالاخص غرفه ر... 0 1
3 “استغرب تقييم الفندق كخمس نجوم”. لا شي. يستحق ... 1 0
4 جيد. المكان جميل وهاديء. كل شي جيد ونظيف بس كا... 0 1

STEP 1: Load and Preprocess the Data

First, we use the texts_from_df function to load and preprocess the data into arrays that can be fed directly into a neural network model.

We set val_pct to 0.1, which automatically samples 10% of the data for validation. We specify preprocess_mode='bert', since we will be fine-tuning a BERT model in this example. If using a different model, you would select preprocess_mode='standard'.

Notice that there is nothing special or extra we need to do here for non-English text. ktrain automatically detects the language and character encoding, prepares the data, and configures the model appropriately.

In [4]:
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_df(df, 
                                                                   'text', # name of column containing review text
                                                                   label_columns=['neg', 'pos'],
                                                                   maxlen=75, 
                                                                   max_features=100000,
                                                                   preprocess_mode='bert',
                                                                   val_pct=0.1)
preprocessing train...
language: ar
done.
preprocessing test...
language: ar
done.
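
The detected language is also recorded on the returned preprocessor object, so you can double-check what was inferred (a quick sanity check; the lang attribute and get_classes method are assumptions about ktrain's text preprocessor interface):

In [ ]:
# sanity-check the preprocessor (attribute names assumed from ktrain's text preprocessors)
print(preproc.get_classes())  # expected: ['neg', 'pos']
print(preproc.lang)           # expected: 'ar'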

STEP 2: Create a Model and Wrap in Learner Object

We will load a pretrained BERT model for fine-tuning, as specified by the 'bert' argument to text_classifier, and wrap it together with the data in a Learner object used to drive training.

In [5]:
model = text.text_classifier('bert', (x_train, y_train) , preproc=preproc)
learner = ktrain.get_learner(model, 
                             train_data=(x_train, y_train), 
                             val_data=(x_test, y_test), 
                             batch_size=32)
Is Multi-Label? False
maxlen is 75
done.

STEP 3: Train the Model

We will use the fit_onecycle method, which employs a 1cycle learning rate policy, and train for a single epoch.

As shown in the cell below, we reach a validation accuracy of 95.53% after just one epoch!
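
The maximum learning rate of 2e-5 used below is a common choice for fine-tuning BERT. If you would rather estimate one empirically, ktrain provides a learning-rate finder; this optional step simulates training over a range of rates, so it takes extra time on a dataset of this size:

In [ ]:
# optional: estimate a good maximum learning rate before training
learner.lr_find()  # simulates training while sweeping the learning rate
learner.lr_plot()  # plots loss vs. learning rate; pick a rate where loss is still falling steeply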

In [6]:
learner.fit_onecycle(2e-5, 1)
begin training using onecycle policy with max lr of 2e-05...
Train on 95128 samples, validate on 10570 samples
Epoch 1/1
95128/95128 [==============================] - 818s 9ms/step - loss: 0.1683 - acc: 0.9322 - val_loss: 0.1225 - val_acc: 0.9553
Out[6]:
<keras.callbacks.History at 0x7f941a4bf9b0>

STEP 4: Make Predictions on New Data

In [9]:
p = ktrain.get_predictor(learner.model, preproc)

Predicting the label for a review that translates to:

"The room was clean, the food excellent, and I loved the view from my room."

In [10]:
p.predict("الغرفة كانت نظيفة ، الطعام ممتاز ، وأنا أحب المنظر من غرفتي.")
Out[10]:
'pos'

Predicting the label for a review that translates to:

"This hotel was too expensive and the staff was rude."

In [11]:
p.predict('كان هذا الفندق باهظ الثمن والموظفين غير مهذبين.')
Out[11]:
'neg'
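
The predictor can also score several reviews at once and return class probabilities instead of hard labels (a short sketch; return_proba and get_classes are assumptions about ktrain's Predictor interface):

In [ ]:
reviews = ["الغرفة كانت نظيفة ، الطعام ممتاز ، وأنا أحب المنظر من غرفتي.",
           "كان هذا الفندق باهظ الثمن والموظفين غير مهذبين."]
p.predict(reviews)                        # expected: ['pos', 'neg']
p.predict(reviews[0], return_proba=True)  # class probabilities, ordered as p.get_classes()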

Save our Predictor for Later Deployment

In [12]:
# save model for later use
p.save('/tmp/arabic_predictor')
In [13]:
# reload from disk
p = ktrain.load_predictor('/tmp/arabic_predictor')
In [14]:
# still works as expected after reloading from disk
p.predict("الغرفة كانت نظيفة ، الطعام ممتاز ، وأنا أحب المنظر من غرفتي.")
Out[14]:
'pos'