Obsei Tutorial 03

This example shows following Obsei workflow

  1. Observe: App Store's app reviews
  2. Pre-process: Clean review text with properly
  3. Analyze: Classify review text within given category list
  4. Inform: Provide all data in Pandas DataFrame
  5. Store: Store data in Google Drive in CSV format

Install Obsei from latest code, perform these steps -

  • Select GPU RunType for faster computation
  • Restart Runtime after installation
In [ ]:
!pip install git+https://github.com/lalitpagaria/obsei.git
Collecting git+https://github.com/lalitpagaria/obsei.git
  Cloning https://github.com/lalitpagaria/obsei.git to /tmp/pip-req-build-wl_1hpon
  Running command git clone -q https://github.com/lalitpagaria/obsei.git /tmp/pip-req-build-wl_1hpon
Requirement already satisfied (use --upgrade to upgrade): obsei==0.0.9 from git+https://github.com/lalitpagaria/obsei.git in /usr/local/lib/python3.7/dist-packages
Requirement already satisfied: app-store-reviews-reader==1.2 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.2)
Requirement already satisfied: atlassian-python-api==3.10.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.10.0)
Requirement already satisfied: beautifulsoup4==4.9.3 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (4.9.3)
Requirement already satisfied: blis==0.7.4 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.7.4)
Requirement already satisfied: cachetools==4.2.2 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (4.2.2)
Requirement already satisfied: catalogue==2.0.4 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.0.4)
Requirement already satisfied: certifi==2021.5.30 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2021.5.30)
Requirement already satisfied: chardet==4.0.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (4.0.0)
Requirement already satisfied: click==7.1.2 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (7.1.2)
Requirement already satisfied: courlan==0.4.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.4.0)
Requirement already satisfied: cssselect==1.1.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.1.0)
Requirement already satisfied: cymem==2.0.5 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.0.5)
Requirement already satisfied: dateparser==1.0.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.0.0)
Requirement already satisfied: deprecated==1.2.12 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.2.12)
Requirement already satisfied: elasticsearch==7.13.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (7.13.1)
Requirement already satisfied: feedparser==6.0.2 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (6.0.2)
Requirement already satisfied: filelock==3.0.12 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.0.12)
Requirement already satisfied: gnews==0.1.3 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.1.3)
Requirement already satisfied: google-api-core==1.30.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.30.0)
Requirement already satisfied: google-api-python-client==2.8.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.8.0)
Requirement already satisfied: google-auth==1.30.2 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.30.2)
Requirement already satisfied: google-auth-httplib2==0.1.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.1.0)
Requirement already satisfied: google-play-scraper==1.0.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.0.0)
Requirement already satisfied: googleapis-common-protos==1.53.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.53.0)
Requirement already satisfied: greenlet==1.1.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.1.0)
Requirement already satisfied: htmldate==0.8.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.8.1)
Requirement already satisfied: httplib2==0.19.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.19.1)
Requirement already satisfied: huggingface-hub==0.0.8 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.0.8)
Requirement already satisfied: idna==2.10 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.10)
Requirement already satisfied: importlib-metadata==4.5.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (4.5.0)
Requirement already satisfied: jinja2==3.0.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.0.1)
Requirement already satisfied: joblib==1.0.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.0.1)
Requirement already satisfied: justext==2.2.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.2.0)
Requirement already satisfied: lxml==4.6.3 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (4.6.3)
Requirement already satisfied: markupsafe==2.0.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.0.1)
Requirement already satisfied: mmh3==3.0.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.0.0)
Requirement already satisfied: murmurhash==1.0.5 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.0.5)
Requirement already satisfied: nltk==3.6.2 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.6.2)
Requirement already satisfied: numpy==1.20.3 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.20.3)
Requirement already satisfied: oauthlib==3.1.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.1.1)
Requirement already satisfied: packaging==20.9 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (20.9)
Requirement already satisfied: pandas==1.2.4 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.2.4)
Requirement already satisfied: pathy==0.5.2 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.5.2)
Requirement already satisfied: praw==7.2.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (7.2.0)
Requirement already satisfied: prawcore==2.1.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.1.0)
Requirement already satisfied: preshed==3.0.5 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.0.5)
Requirement already satisfied: presidio-analyzer==2.2.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.2.1)
Requirement already satisfied: presidio-anonymizer==2.2.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.2.1)
Requirement already satisfied: protobuf==3.17.3 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.17.3)
Requirement already satisfied: pyasn1==0.4.8 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.4.8)
Requirement already satisfied: pyasn1-modules==0.2.8 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.2.8)
Requirement already satisfied: pycryptodome==3.10.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.10.1)
Requirement already satisfied: pydantic==1.7.4 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.7.4)
Requirement already satisfied: pyparsing==2.4.7 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.4.7)
Requirement already satisfied: python-dateutil==2.8.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.8.1)
Requirement already satisfied: python-facebook-api==0.9.2 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.9.2)
Requirement already satisfied: pytz==2021.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2021.1)
Requirement already satisfied: pyyaml==5.4.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (5.4.1)
Requirement already satisfied: readability-lxml==0.8.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.8.1)
Requirement already satisfied: reddit-rss-reader==1.3.2 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.3.2)
Requirement already satisfied: regex==2020.11.13 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2020.11.13)
Requirement already satisfied: requests==2.25.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.25.1)
Requirement already satisfied: requests-file==1.5.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.5.1)
Requirement already satisfied: requests-oauthlib==1.3.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.3.0)
Requirement already satisfied: rsa==4.7.2 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (4.7.2)
Requirement already satisfied: sacremoses==0.0.45 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.0.45)
Requirement already satisfied: searchtweets-v2==1.0.7 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.0.7)
Requirement already satisfied: sentencepiece==0.1.95 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.1.95)
Requirement already satisfied: sgmllib3k==1.0.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.0.0)
Requirement already satisfied: six==1.16.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.16.0)
Requirement already satisfied: slack-sdk==3.6.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.6.0)
Requirement already satisfied: smart-open==3.0.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.0.0)
Requirement already satisfied: soupsieve==2.2.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.2.1)
Requirement already satisfied: spacy==3.0.5 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.0.5)
Requirement already satisfied: spacy-legacy==3.0.5 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.0.5)
Requirement already satisfied: sqlalchemy==1.4.17 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.4.17)
Requirement already satisfied: srsly==2.4.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.4.1)
Requirement already satisfied: thinc==8.0.4 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (8.0.4)
Requirement already satisfied: tld==0.12.6 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.12.6)
Requirement already satisfied: tldextract==3.1.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.1.0)
Requirement already satisfied: tokenizers==0.10.3 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.10.3)
Requirement already satisfied: tqdm==4.61.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (4.61.0)
Requirement already satisfied: trafilatura==0.8.2 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.8.2)
Requirement already satisfied: transformers==4.6.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (4.6.1)
Requirement already satisfied: tweet-preprocessor==0.6.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.6.0)
Requirement already satisfied: typer==0.3.2 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.3.2)
Requirement already satisfied: typing-extensions==3.10.0.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.10.0.0)
Requirement already satisfied: tzlocal==2.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.1)
Requirement already satisfied: update-checker==0.18.0 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.18.0)
Requirement already satisfied: uritemplate==3.0.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.0.1)
Requirement already satisfied: urllib3==1.26.5 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.26.5)
Requirement already satisfied: vadersentiment==3.3.2 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.3.2)
Requirement already satisfied: wasabi==0.8.2 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (0.8.2)
Requirement already satisfied: websocket-client==1.0.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.0.1)
Requirement already satisfied: wrapt==1.12.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.12.1)
Requirement already satisfied: zenpy==2.0.24 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (2.0.24)
Requirement already satisfied: zipp==3.4.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (3.4.1)
Requirement already satisfied: torch==1.8.1 in /usr/local/lib/python3.7/dist-packages (from obsei==0.0.9) (1.8.1)
Requirement already satisfied: setuptools>=40.3.0 in /usr/local/lib/python3.7/dist-packages (from google-api-core==1.30.0->obsei==0.0.9) (57.0.0)
Requirement already satisfied: cattrs<2.0,>=1.1; python_version >= "3.7" and python_version < "4.0" in /usr/local/lib/python3.7/dist-packages (from python-facebook-api==0.9.2->obsei==0.0.9) (1.7.1)
Requirement already satisfied: responses>=0.11 in /usr/local/lib/python3.7/dist-packages (from python-facebook-api==0.9.2->obsei==0.0.9) (0.13.3)
Requirement already satisfied: attrs<21.0.0,>=20.1.0 in /usr/local/lib/python3.7/dist-packages (from python-facebook-api==0.9.2->obsei==0.0.9) (20.3.0)
Building wheels for collected packages: obsei
  Building wheel for obsei (setup.py) ... done
  Created wheel for obsei: filename=obsei-0.0.9-cp37-none-any.whl size=65557 sha256=cce33049986ee20144625a85f90699a6ae020c7a8454bb4f156750446385e03b
  Stored in directory: /tmp/pip-ephem-wheel-cache-4be2m6lr/wheels/49/1a/6e/2fd83c9a275b7096fc615a0edef2d55b1fc33c3751ba45c1ad
Successfully built obsei

Mount your Google Drive to store CSV

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Configure following input -

  • name: Brand name of App
  • category_list: List of categories to perform review text classification
  • identifier: Id of the app, it can be found at the end of the url of app in app store
  • country: Country of reviews
  • lookup_period: How many old reviews to collect (Note: Apple rate limit and provide max 450 reviews only)
  • extra_stop_words: Extra stop words top clean from review text
In [ ]:
name = "zomato"
category_list = ["easy order placement", "realtime order tracking", "easy payment options", "rewards and discounts","user interface", "social media Integration"]
identifier = "434613896"
country = "in"
lookup_period = "365d"
extra_stop_words = ["i", "-", "day", "will", ".", "use", "n", "without", "please", "app", "ha", "ho", "nt", "wa", 
                    "thi", "plz", "pleas", "ff", "ya", "thank", "you", "thanks", "mai"]

Configure columns of Pandas DataFrame

included_cols will only be returned by Pandas Sink and rename_cols_dict will rename selected included_cols columns to desired one

In [ ]:
included_cols = [f"segmented_data_classifier_data_{category}" for category in category_list]
included_cols.append("segmented_data_classifier_data_positive")
included_cols.append("segmented_data_classifier_data_negative")
included_cols.append("processed_text")
included_cols.append("meta_at")
included_cols.append("meta_date")
included_cols.append("meta_published date")
included_cols.append("meta_rating")
# included_cols.append("meta_title")
included_cols.append("meta_publisher_title")

rename_cols_dict = {f"segmented_data_classifier_data_{category}": category for category in category_list}
rename_cols_dict["segmented_data_classifier_data_positive"] = "positive"
rename_cols_dict["segmented_data_classifier_data_negative"] = "negative"
rename_cols_dict["processed_text"] = "text"
rename_cols_dict["meta_at"] = "time"
rename_cols_dict["meta_date"] = "time"
rename_cols_dict["meta_rating"] = "ratings"
rename_cols_dict["meta_published date"] = "time"
# rename_cols_dict["meta_title"] = "title"
rename_cols_dict["meta_publisher_title"] = "news publisher"
rename_cols_dict['Unnamed: 0'] = 'reviews'

Configure App Store Review Observer

In [ ]:
from obsei.source.appstore_scrapper import (
    AppStoreScrapperConfig,
    AppStoreScrapperSource,
)

source_config = AppStoreScrapperConfig(
    countries=[country],
    app_id=identifier,
    lookup_period=lookup_period
)

source = AppStoreScrapperSource()

Configure TextCleaner as Pre-Processor to clean review text

These cleaning function will run serially

In [ ]:
from obsei.preprocessor.text_cleaner import TextCleaner, TextCleanerConfig
from obsei.preprocessor.text_cleaning_function import *

text_cleaner_config = TextCleanerConfig(
    stop_words=extra_stop_words,
    cleaning_functions = [
        ToLowerCase(),
        RemoveWhiteSpaceAndEmptyToken(),
        RemovePunctuation(),
        RemoveSpecialChars(),
        DecodeUnicode(),
        RemoveDateTime(),
        RemoveStopWords(),
        RemoveStopWords(stop_words=extra_stop_words),
        RemoveWhiteSpaceAndEmptyToken(),
   ]
)

text_cleaner = TextCleaner()
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Configure Classification Analyzer

Note: Select model from https://huggingface.co/models?pipeline_tag=zero-shot-classification, if you want to try different one

In [ ]:
from obsei.analyzer.classification_analyzer import ClassificationAnalyzerConfig, ZeroShotClassificationAnalyzer

analyzer_config=ClassificationAnalyzerConfig(
   labels=category_list,
)

text_analyzer = ZeroShotClassificationAnalyzer(
   model_name_or_path="typeform/mobilebert-uncased-mnli",
   device="auto"
)

Configure Pandas DataFrame Informer

In [ ]:
from pandas import DataFrame
from obsei.sink.pandas_sink import PandasSink, PandasSinkConfig

sink_config = PandasSinkConfig(
   dataframe=DataFrame(),
   include_columns_list=included_cols
)
sink = PandasSink()

Fetch app reviews

In [ ]:
source_response_list = source.lookup(source_config)

PreProcess review text to clean it

In [ ]:
cleaner_response_list = text_cleaner.preprocess_input(
    input_list=source_response_list,
    config=text_cleaner_config
)

Analyze reviews to perform classification

Note: This is compute heavy step

In [ ]:
analyzer_response_list = text_analyzer.analyze_input(
    source_response_list=cleaner_response_list,
    analyzer_config=analyzer_config
)
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

Inform review data in form of Pandas DataFrame

In [ ]:
sink_config = PandasSinkConfig(
   dataframe=DataFrame(),
   include_columns_list=included_cols
)

dataframe = sink.send_data(analyzer_response_list, sink_config)
dataframe.rename(rename_cols_dict,axis=1,inplace=True)


dataframe["brand"] = name
dataframe
Out[ ]:
text positive user interface rewards and discounts negative realtime order tracking social media Integration easy order placement easy payment options ratings time brand
0 awesome unmade zomato user switched limited re... 0.72 0.11 0.06 0.02 0.02 0.02 0.01 0.01 5 2021-07-10 12:21:41 zomato
1 best service mast service thii time 0.99 0.26 0.17 0.00 0.16 0.01 0.21 0.29 5 2021-07-10 12:20:34 zomato
2 nice nice 1.00 0.70 0.38 0.00 0.30 0.06 0.44 0.58 5 2021-07-10 12:20:07 zomato
3 listening single cheese burger concern love zo... 0.98 0.81 0.00 0.00 0.05 0.00 0.06 0.07 5 2021-07-10 12:19:20 zomato
4 good good 1.00 0.62 0.42 0.00 0.50 0.05 0.53 0.69 5 2021-07-10 12:15:17 zomato
... ... ... ... ... ... ... ... ... ... ... ... ...
495 nice gud 1.00 0.87 0.30 0.00 0.14 0.08 0.36 0.68 5 2021-07-07 15:54:35 zomato
496 bad experience delivery guy refused take rs no... 0.00 0.29 0.08 1.00 0.02 0.03 0.00 0.00 1 2021-07-07 15:54:24 zomato
497 shikha excellent 1.00 0.94 0.45 0.00 0.48 0.06 0.70 0.91 5 2021-07-07 15:53:40 zomato
498 ordered delivery yet pathetic service 0.00 0.27 0.01 1.00 0.00 0.02 0.00 0.00 1 2021-07-07 15:47:03 zomato
499 super awesome experience 0.99 0.37 0.06 0.00 0.09 0.01 0.01 0.09 5 2021-07-07 15:40:27 zomato

500 rows × 12 columns

Store result in Google Drive as CSV

In [ ]:
dataframe.to_csv(f'/content/drive/MyDrive/appstore_{name}.csv')