!pip install obsei[all]
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
name: Brand name of the app
category_list: List of categories used to classify the review text
identifier: Id of the app; it can be found at the end of the app's URL in the App Store
country: Country of the reviews
lookup_period: How far back to collect reviews (Note: Apple rate-limits requests and provides a maximum of 450 reviews)
extra_stop_words: Extra stop words to remove from the review text
name = "zomato"
category_list = ["easy order placement", "realtime order tracking", "easy payment options", "rewards and discounts","user interface", "social media Integration"]
identifier = "434613896"
country = "in"
lookup_period = "365d"
extra_stop_words = ["i", "-", "day", "will", ".", "use", "n", "without", "please", "app", "ha", "ho", "nt", "wa",
"thi", "plz", "pleas", "ff", "ya", "thank", "you", "thanks", "mai"]
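The identifier is described above as the number at the end of the app's App Store URL. A minimal sketch of pulling it out with a regular expression follows; `extract_app_id` is a hypothetical helper (not part of obsei), and the URL slug is made up for illustration, with only the trailing id taken from this notebook.

```python
import re
from typing import Optional


def extract_app_id(app_store_url: str) -> Optional[str]:
    """Return the digits following '/id' at the end of an App Store URL path."""
    # Accept an optional query string, fragment, or trailing slash after the id.
    match = re.search(r"/id(\d+)(?:[/?#]|$)", app_store_url)
    return match.group(1) if match else None


# Hypothetical URL: the slug "example-app" is an assumption for illustration.
example_url = "https://apps.apple.com/in/app/example-app/id434613896"
print(extract_app_id(example_url))  # prints 434613896
```

The returned value is a string, which matches how `identifier` is defined in the cell above.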
Only the columns listed in included_cols are returned by the Pandas sink; rename_cols_dict then renames the selected included_cols columns to the desired names.
included_cols = [f"segmented_data_classifier_data_{category}" for category in category_list]
included_cols.append("segmented_data_classifier_data_positive")
included_cols.append("segmented_data_classifier_data_negative")
included_cols.append("processed_text")
included_cols.append("meta_at")
included_cols.append("meta_date")
included_cols.append("meta_published date")
included_cols.append("meta_rating")
# included_cols.append("meta_title")
included_cols.append("meta_publisher_title")
rename_cols_dict = {f"segmented_data_classifier_data_{category}": category for category in category_list}
rename_cols_dict["segmented_data_classifier_data_positive"] = "positive"
rename_cols_dict["segmented_data_classifier_data_negative"] = "negative"
rename_cols_dict["processed_text"] = "text"
rename_cols_dict["meta_at"] = "time"
rename_cols_dict["meta_date"] = "time"
rename_cols_dict["meta_rating"] = "ratings"
rename_cols_dict["meta_published date"] = "time"
# rename_cols_dict["meta_title"] = "title"
rename_cols_dict["meta_publisher_title"] = "news publisher"
rename_cols_dict['Unnamed: 0'] = 'reviews'
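To make the effect of the two structures concrete, here is a small sketch that applies the same select-then-rename idea to a plain dictionary. It uses only two categories, and the record (including the `meta_review_id` key) is a made-up stand-in for a flattened row; the real filtering and renaming are done by the Pandas sink and `DataFrame.rename` later in the notebook.

```python
# Two categories only, to keep the illustration short.
sample_categories = ["user interface", "easy payment options"]
prefix = "segmented_data_classifier_data_"

# Same construction pattern as included_cols / rename_cols_dict above.
sample_included = [f"{prefix}{c}" for c in sample_categories] + ["processed_text", "meta_rating"]
sample_renames = {f"{prefix}{c}": c for c in sample_categories}
sample_renames.update({"processed_text": "text", "meta_rating": "ratings"})

# Hypothetical flattened record; "meta_review_id" is an invented extra column.
record = {
    "segmented_data_classifier_data_user interface": 0.7,
    "segmented_data_classifier_data_easy payment options": 0.58,
    "processed_text": "nice nice",
    "meta_rating": 5,
    "meta_review_id": "abc",  # not in sample_included, so it is dropped
}

# Keep only the included columns, then rename them to the short names.
row = {sample_renames.get(key, key): value for key, value in record.items() if key in sample_included}
print(row)
```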
from obsei.source.appstore_scrapper import (
    AppStoreScrapperConfig,
    AppStoreScrapperSource,
)

# Describe which app, which country, and how far back to look.
source_config = AppStoreScrapperConfig(
    countries=[country],
    app_id=identifier,
    lookup_period=lookup_period,
)
source = AppStoreScrapperSource()
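A value such as `"365d"` for `lookup_period` can be read as "reviews from the last 365 days". The sketch below shows one way such a string might be turned into a cutoff timestamp. `cutoff_from_lookup_period` and the supported units are assumptions made for illustration; obsei's own parsing may accept other formats and is not reproduced here.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed unit suffixes for this sketch only.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def cutoff_from_lookup_period(period: str, now: datetime) -> datetime:
    """Return the earliest timestamp covered by a period such as '365d'."""
    match = re.fullmatch(r"(\d+)([mhd])", period)
    if match is None:
        raise ValueError(f"Unsupported lookup period: {period!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return now - timedelta(**{_UNITS[unit]: amount})


reference_time = datetime(2021, 7, 10, tzinfo=timezone.utc)
print(cutoff_from_lookup_period("365d", reference_time))
```

Reviews older than the returned timestamp would fall outside the lookup window under this reading.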
These cleaning functions will run serially, in the order listed.
from obsei.preprocessor.text_cleaner import TextCleaner, TextCleanerConfig
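from obsei.preprocessor.text_cleaning_function import (
    DecodeUnicode,
    RemoveDateTime,
    RemovePunctuation,
    RemoveSpecialChars,
    RemoveStopWords,
    RemoveWhiteSpaceAndEmptyToken,
    ToLowerCase,
)

text_cleaner_config = TextCleanerConfig(
    stop_words=extra_stop_words,
    cleaning_functions=[
        ToLowerCase(),
        RemoveWhiteSpaceAndEmptyToken(),
        RemovePunctuation(),
        RemoveSpecialChars(),
        DecodeUnicode(),
        RemoveDateTime(),
        RemoveStopWords(),  # default stop words
        RemoveStopWords(stop_words=extra_stop_words),  # app-specific stop words
        RemoveWhiteSpaceAndEmptyToken(),  # drop tokens emptied by earlier steps
    ],
)
text_cleaner = TextCleaner()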
[nltk_data] Downloading package stopwords to /root/nltk_data... [nltk_data] Package stopwords is already up-to-date!
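Because the cleaning functions run one after another, their order affects the result. The following is a minimal pure-Python sketch of that idea, not obsei's implementation: each step takes a token list and returns a new one, and all helper names here are hypothetical.

```python
import string
from typing import Callable, List

Tokens = List[str]
CleaningStep = Callable[[Tokens], Tokens]


def to_lower(tokens: Tokens) -> Tokens:
    return [token.lower() for token in tokens]


def remove_punctuation(tokens: Tokens) -> Tokens:
    table = str.maketrans("", "", string.punctuation)
    return [token.translate(table) for token in tokens]


def remove_empty(tokens: Tokens) -> Tokens:
    return [token.strip() for token in tokens if token.strip()]


def make_stop_word_remover(stop_words: List[str]) -> CleaningStep:
    blocked = set(stop_words)
    return lambda tokens: [token for token in tokens if token not in blocked]


def run_serially(tokens: Tokens, steps: List[CleaningStep]) -> Tokens:
    # Feed the output of each step into the next one.
    for step in steps:
        tokens = step(tokens)
    return tokens


remover = make_stop_word_remover(["app", "please"])
raw_tokens = "Please fix the App !".split()

cleaned = run_serially(raw_tokens, [to_lower, remove_punctuation, remove_empty, remover])
reordered = run_serially(raw_tokens, [remover, to_lower, remove_punctuation, remove_empty])
print(cleaned)    # ['fix', 'the']
print(reordered)  # ['please', 'fix', 'the', 'app']
```

In the second run the case-sensitive stop-word filter sees `"Please"` and `"App"` before they are lowercased, so nothing is removed. This is one reason lowercasing comes first in the configuration above.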
Note: To try a different model, select one from https://huggingface.co/models?pipeline_tag=zero-shot-classification
from obsei.analyzer.classification_analyzer import ClassificationAnalyzerConfig, ZeroShotClassificationAnalyzer
analyzer_config = ClassificationAnalyzerConfig(
    labels=category_list,
)
text_analyzer = ZeroShotClassificationAnalyzer(
    model_name_or_path="typeform/mobilebert-uncased-mnli",
    device="auto",
)
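Zero-shot classifiers built on NLI models such as the one above typically treat each label as a hypothesis and score it against the review text. In a multi-label setup each label is scored on its own, by taking a softmax over only the contradiction and entailment logits, so the scores for one review need not sum to 1. The sketch below shows just that arithmetic; the logits are invented, and whether obsei configures the analyzer this way is an assumption rather than something this notebook states.

```python
import math
from typing import Dict, Tuple


def entailment_probability(contradiction_logit: float, entailment_logit: float) -> float:
    """Softmax over (contradiction, entailment); return the entailment share."""
    top = max(contradiction_logit, entailment_logit)  # subtract max for numerical stability
    exp_contradiction = math.exp(contradiction_logit - top)
    exp_entailment = math.exp(entailment_logit - top)
    return exp_entailment / (exp_contradiction + exp_entailment)


# Made-up (contradiction, entailment) logits for one review, one pair per label.
hypothetical_logits: Dict[str, Tuple[float, float]] = {
    "user interface": (-1.0, 1.0),
    "easy payment options": (0.0, 0.0),
    "realtime order tracking": (2.0, -2.0),
}

scores = {
    label: round(entailment_probability(contradiction, entailment), 2)
    for label, (contradiction, entailment) in hypothetical_logits.items()
}
print(scores)
```

Each label gets an independent probability, so a single review can score high on several categories at once or low on all of them.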
from pandas import DataFrame
from obsei.sink.pandas_sink import PandasSink, PandasSinkConfig
sink_config = PandasSinkConfig(
    dataframe=DataFrame(),
    include_columns_list=included_cols,
)
sink = PandasSink()
source_response_list = source.lookup(source_config)
cleaner_response_list = text_cleaner.preprocess_input(
    input_list=source_response_list,
    config=text_cleaner_config,
)
Note: This is a compute-heavy step.
analyzer_response_list = text_analyzer.analyze_input(
    source_response_list=cleaner_response_list,
    analyzer_config=analyzer_config,
)
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
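Since the analysis step is compute heavy, one option is to feed the reviews through in smaller slices so that progress can be monitored and a failure does not lose everything. The helper below is a hypothetical sketch using placeholder strings; whether slicing actually speeds anything up depends on how the analyzer batches internally, which this notebook does not show.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Placeholder reviews standing in for the cleaned responses.
reviews = [f"review {i}" for i in range(10)]
batches = list(chunked(reviews, 4))
print([len(batch) for batch in batches])  # [4, 4, 2]
```

Each batch could then be passed to `text_analyzer.analyze_input(...)` in turn and the returned lists concatenated.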
sink_config = PandasSinkConfig(
    dataframe=DataFrame(),
    include_columns_list=included_cols,
)
dataframe = sink.send_data(analyzer_response_list, sink_config)
dataframe.rename(rename_cols_dict,axis=1,inplace=True)
dataframe["brand"] = name
dataframe
index | text | positive | user interface | rewards and discounts | negative | realtime order tracking | social media Integration | easy order placement | easy payment options | ratings | time | brand |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | awesome unmade zomato user switched limited re... | 0.72 | 0.11 | 0.06 | 0.02 | 0.02 | 0.02 | 0.01 | 0.01 | 5 | 2021-07-10 12:21:41 | zomato |
1 | best service mast service thii time | 0.99 | 0.26 | 0.17 | 0.00 | 0.16 | 0.01 | 0.21 | 0.29 | 5 | 2021-07-10 12:20:34 | zomato |
2 | nice nice | 1.00 | 0.70 | 0.38 | 0.00 | 0.30 | 0.06 | 0.44 | 0.58 | 5 | 2021-07-10 12:20:07 | zomato |
3 | listening single cheese burger concern love zo... | 0.98 | 0.81 | 0.00 | 0.00 | 0.05 | 0.00 | 0.06 | 0.07 | 5 | 2021-07-10 12:19:20 | zomato |
4 | good good | 1.00 | 0.62 | 0.42 | 0.00 | 0.50 | 0.05 | 0.53 | 0.69 | 5 | 2021-07-10 12:15:17 | zomato |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | nice gud | 1.00 | 0.87 | 0.30 | 0.00 | 0.14 | 0.08 | 0.36 | 0.68 | 5 | 2021-07-07 15:54:35 | zomato |
496 | bad experience delivery guy refused take rs no... | 0.00 | 0.29 | 0.08 | 1.00 | 0.02 | 0.03 | 0.00 | 0.00 | 1 | 2021-07-07 15:54:24 | zomato |
497 | shikha excellent | 1.00 | 0.94 | 0.45 | 0.00 | 0.48 | 0.06 | 0.70 | 0.91 | 5 | 2021-07-07 15:53:40 | zomato |
498 | ordered delivery yet pathetic service | 0.00 | 0.27 | 0.01 | 1.00 | 0.00 | 0.02 | 0.00 | 0.00 | 1 | 2021-07-07 15:47:03 | zomato |
499 | super awesome experience | 0.99 | 0.37 | 0.06 | 0.00 | 0.09 | 0.01 | 0.01 | 0.09 | 5 | 2021-07-07 15:40:27 | zomato |
500 rows × 12 columns
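A common next step with a table like this is to pick the strongest category for each review. The sketch below does that for two rows copied from the output above, using plain dictionaries. `top_category` and the 0.5 threshold are assumptions chosen for illustration, not part of the notebook.

```python
from typing import Dict, List, Optional, Union

categories: List[str] = [
    "easy order placement", "realtime order tracking", "easy payment options",
    "rewards and discounts", "user interface", "social media Integration",
]

# Rows 2 and 498 from the table above, category scores only.
rows: List[Dict[str, Union[str, float]]] = [
    {"text": "nice nice", "user interface": 0.70, "rewards and discounts": 0.38,
     "realtime order tracking": 0.30, "social media Integration": 0.06,
     "easy order placement": 0.44, "easy payment options": 0.58},
    {"text": "ordered delivery yet pathetic service", "user interface": 0.27,
     "rewards and discounts": 0.01, "realtime order tracking": 0.00,
     "social media Integration": 0.02, "easy order placement": 0.00,
     "easy payment options": 0.00},
]


def top_category(row: Dict[str, Union[str, float]], threshold: float = 0.5) -> Optional[str]:
    """Return the highest-scoring category, or None if no score reaches the threshold."""
    best = max(categories, key=lambda name: row[name])
    return best if row[best] >= threshold else None


for row in rows:
    print(row["text"], "->", top_category(row))
```

With the assumed threshold, the first review maps to "user interface" and the second to no category, since none of its category scores reaches 0.5.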
dataframe.to_csv(f'/content/drive/MyDrive/appstore_{name}.csv')