Topic: Descriptive Statistics & Hypothesis Testing
In this project, we will explore the TikTok dataset and conduct a hypothesis test.
The purpose of this project is to demonstrate knowledge of preparing, creating, and analyzing hypothesis tests.
The goal is to apply descriptive and inferential statistics, probability distributions, and hypothesis testing in Python.
The project consists of three parts:
Part 1: Imports and data loading
Part 2: Conduct hypothesis testing
Part 3: Communicate insights with stakeholders
Do videos from verified accounts and videos from unverified accounts have different average view counts?
Is there a relationship between the account being verified and the associated videos' view counts?
Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.
# Import packages for data manipulation
import pandas as pd
import numpy as np
# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Import packages for statistical analysis/hypothesis testing
from scipy import stats
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")
Descriptive statistics are useful because they allow us to quickly explore and understand large amounts of data. In this case, they make it easy to compute the mean video_view_count for each group of verified_status in the sample data.
Use descriptive statistics to conduct Exploratory Data Analysis (EDA).
Inspect the first five rows of the dataframe.
# Display first few rows
data.head()
| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
# Generate a table of descriptive statistics about the data
data.describe()
| | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
---|---|---|---|---|---|---|---|---|
count | 19382.000000 | 1.938200e+04 | 19382.000000 | 19084.000000 | 19084.000000 | 19084.000000 | 19084.000000 | 19084.000000 |
mean | 9691.500000 | 5.627454e+09 | 32.421732 | 254708.558688 | 84304.636030 | 16735.248323 | 1049.429627 | 349.312146 |
std | 5595.245794 | 2.536440e+09 | 16.229967 | 322893.280814 | 133420.546814 | 32036.174350 | 2004.299894 | 799.638865 |
min | 1.000000 | 1.234959e+09 | 5.000000 | 20.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 4846.250000 | 3.430417e+09 | 18.000000 | 4942.500000 | 810.750000 | 115.000000 | 7.000000 | 1.000000 |
50% | 9691.500000 | 5.618664e+09 | 32.000000 | 9954.500000 | 3403.500000 | 717.000000 | 46.000000 | 9.000000 |
75% | 14536.750000 | 7.843960e+09 | 47.000000 | 504327.000000 | 125020.000000 | 18222.000000 | 1156.250000 | 292.000000 |
max | 19382.000000 | 9.999873e+09 | 60.000000 | 999817.000000 | 657830.000000 | 256130.000000 | 14994.000000 | 9599.000000 |
Check for and handle missing values.
# Check for missing values
data.isna().sum()
#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64
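Before dropping rows, it can help to confirm how much data would be lost. The sketch below uses a small made-up dataframe (the column names mirror the dataset, but the values are illustrative assumptions) to measure the share of rows that contain at least one missing value and to confirm that none remain after `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "verified_status": ["verified", "not verified", "not verified", "verified"],
    "video_view_count": [100.0, np.nan, 250.0, 75.0],
    "video_like_count": [10.0, np.nan, 30.0, 5.0],
})

# Count rows that contain at least one missing value
rows_with_na = toy.isna().any(axis=1).sum()
share_with_na = rows_with_na / len(toy)

# Drop those rows and confirm that no missing values remain
toy_clean = toy.dropna(axis=0)
remaining_na = toy_clean.isna().sum().sum()

print(f"Rows with missing values: {rows_with_na} ({share_with_na:.1%})")
print(f"Rows kept: {len(toy_clean)}, missing values left: {remaining_na}")
```

In the actual dataset, each affected column is missing 298 of 19,382 values (about 1.5%), so dropping the incomplete rows removes only a small part of the sample.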
# Drop rows with missing values
data = data.dropna(axis=0)
# Display first few rows after handling missing values
data.head()
| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
Let's look into the relationship between verified_status and video_view_count. One approach is to examine the mean values of video_view_count for each group of verified_status in the sample data.
# Compute the mean `video_view_count` for each group in `verified_status`
data.groupby("verified_status")["video_view_count"].mean()
verified_status
not verified    265663.785339
verified         91439.164167
Name: video_view_count, dtype: float64
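A group mean can be pulled upward by a few very large values, and the descriptive statistics above suggest this may matter here: the overall mean of video_view_count is about 254,709 while the median is 9,954.5. As a minimal sketch on made-up data (the values below are assumptions, not taken from the dataset), the same `groupby` can report the group size, mean, and median side by side.

```python
import pandas as pd

# Hypothetical toy data; one large value skews the "not verified" group
toy = pd.DataFrame({
    "verified_status": ["not verified"] * 4 + ["verified"] * 3,
    "video_view_count": [100.0, 200.0, 300.0, 9400.0, 50.0, 150.0, 400.0],
})

# Report count, mean, and median of view counts for each group
summary = (
    toy.groupby("verified_status")["video_view_count"]
    .agg(["count", "mean", "median"])
)
print(summary)
```

When the mean and median of a group differ widely, as they do for the toy "not verified" group, the mean alone gives an incomplete picture of a typical video.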
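Our goal is to conduct a two-sample t-test. Here are the steps:
Step 1: State the null hypothesis and the alternative hypothesis
Step 2: Choose a significance level
Step 3: Find the p-value
Step 4: Reject or fail to reject the null hypothesis
Null hypothesis: There is no difference in the mean video_view_count between videos from verified accounts and videos from unverified accounts.
Alternative hypothesis: There is a difference in the mean video_view_count between videos from verified accounts and videos from unverified accounts.
We use a significance level of 5%.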
# Conduct a two-sample t-test to compare means
# Save each sample in a variable
not_verified = data[data["verified_status"] == "not verified"]["video_view_count"]
verified = data[data["verified_status"] == "verified"]["video_view_count"]
# Implement a t-test using the two samples
stats.ttest_ind(a=not_verified, b=verified, equal_var=False)
Ttest_indResult(statistic=25.499441780633777, pvalue=2.6088823687177823e-120)
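Passing `equal_var=False` to `stats.ttest_ind` requests Welch's t-test, which does not assume that the two groups have equal variances. To show what the reported statistic measures, here is a minimal sketch that computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom with the standard library only. The function name `welch_t` and the sample values are assumptions made for illustration; the sketch does not compute a p-value.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # Sample variance (n - 1 denominator) of each group, divided by its size
    var_a = statistics.variance(sample_a) / n_a
    var_b = statistics.variance(sample_b) / n_b
    # Difference in means divided by its estimated standard error
    t_stat = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Made-up samples for illustration only
group_a = [10.0, 12.0, 14.0, 16.0, 18.0]
group_b = [1.0, 2.0, 3.0, 4.0, 5.0]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.2f}")
```

A large absolute t statistic means the difference in sample means is large relative to its standard error, which is what drives the very small p-value in the output above.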
Result:
Since the p-value is extremely small (far below the 5% significance level), we reject the null hypothesis. We conclude that there is a statistically significant difference in the mean video view count between verified and unverified accounts on TikTok.
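The decision rule can be written out explicitly. The sketch below compares the p-value reported by the t-test with the 5% significance level; the helper name `decide` is hypothetical and exists only for illustration.

```python
# Significance level stated in the text (5%)
alpha = 0.05
# p-value reported in the t-test output
p_value = 2.6088823687177823e-120


def decide(p_value, alpha):
    """Return the test decision for a p-value and significance level."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


decision = decide(p_value, alpha)
print(decision)
```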
Consider the questions in your PACE Strategy Document to reflect on the Execute stage.
Question:
The analysis shows that there is a statistically significant difference in the average view counts between videos from verified accounts and videos from unverified accounts. This suggests there might be fundamental behavioral differences between these two groups of accounts.
It would be interesting to investigate the root cause of this behavioral difference. For example, do unverified accounts tend to post more clickbait-y videos? Or are unverified accounts associated with spam bots that help inflate view counts?
The next step is to build a regression model on verified_status. This is a natural next step because the end goal is to make predictions on claim status, and a regression model for verified_status can help analyze user behavior among verified users. Technical note for preparing the regression model: because verified_status is a binary variable, the data is skewed, and there is a significant difference between account types, a logistic regression model is the appropriate choice.