TikTok Project¶

Topic: Descriptive Statistics & Hypothesis Testing

Data Exploration and Hypothesis Testing¶

In this project, we will explore the TikTok dataset and conduct a hypothesis test.

The purpose of this project is to demonstrate knowledge of preparing, conducting, and analyzing hypothesis tests.

The goal is to apply descriptive and inferential statistics, probability distributions, and hypothesis testing in Python.

The project consists of three parts:

Part 1: Imports and data loading

Part 2: Conduct hypothesis testing

Part 3: Communicate insights with stakeholders

PACE stages: Plan, Analyze, Construct, and Execute¶

PACE: Plan¶

Do videos from verified accounts and videos from unverified accounts have different average view counts?
Is there a relationship between the account being verified and the associated videos' view counts?

Part 1 - Imports and Data Loading¶

Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [2]:

# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for statistical analysis/hypothesis testing
from scipy import stats

In [3]:

# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

PACE: Analyze and Construct¶

Descriptive statistics are useful because they allow us to quickly explore and understand large amounts of data. In this case, descriptive statistics help us quickly compute the mean value of video_view_count for each group of verified_status in the sample data.

Task 2. Data exploration¶

Use descriptive statistics to conduct Exploratory Data Analysis (EDA).

Inspect the first five rows of the dataframe.

In [4]:

# Display first few rows
data.head()

Out[4]:

	#	claim_status	video_id	video_duration_sec	video_transcription_text	verified_status	author_ban_status	video_view_count	video_like_count	video_share_count	video_download_count	video_comment_count
0	1	claim	7017666017	59	someone shared with me that drone deliveries a...	not verified	under review	343296.0	19425.0	241.0	1.0	0.0
1	2	claim	4014381136	32	someone shared with me that there are more mic...	not verified	active	140877.0	77355.0	19034.0	1161.0	684.0
2	3	claim	9859838091	31	someone shared with me that american industria...	not verified	active	902185.0	97690.0	2858.0	833.0	329.0
3	4	claim	1866847991	25	someone shared with me that the metro of st. p...	not verified	active	437506.0	239954.0	34812.0	1234.0	584.0
4	5	claim	7105231098	19	someone shared with me that the number of busi...	not verified	active	56167.0	34987.0	4110.0	547.0	152.0

In [5]:

# Generate a table of descriptive statistics about the data
data.describe()

Out[5]:

	#	video_id	video_duration_sec	video_view_count	video_like_count	video_share_count	video_download_count	video_comment_count
count	19382.000000	1.938200e+04	19382.000000	19084.000000	19084.000000	19084.000000	19084.000000	19084.000000
mean	9691.500000	5.627454e+09	32.421732	254708.558688	84304.636030	16735.248323	1049.429627	349.312146
std	5595.245794	2.536440e+09	16.229967	322893.280814	133420.546814	32036.174350	2004.299894	799.638865
min	1.000000	1.234959e+09	5.000000	20.000000	0.000000	0.000000	0.000000	0.000000
25%	4846.250000	3.430417e+09	18.000000	4942.500000	810.750000	115.000000	7.000000	1.000000
50%	9691.500000	5.618664e+09	32.000000	9954.500000	3403.500000	717.000000	46.000000	9.000000
75%	14536.750000	7.843960e+09	47.000000	504327.000000	125020.000000	18222.000000	1156.250000	292.000000
max	19382.000000	9.999873e+09	60.000000	999817.000000	657830.000000	256130.000000	14994.000000	9599.000000

Check for and handle missing values.

In [6]:

# Check for missing values
data.isna().sum()

Out[6]:

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64
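The identical count of 298 in seven columns suggests, but does not by itself prove, that the missing values sit in the same rows. A minimal sketch of one way to check this follows; the `toy` dataframe and its values are made up for illustration and are not the TikTok data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe standing in for the real dataset
toy = pd.DataFrame({
    "video_id": [1, 2, 3, 4],  # no missing values
    "claim_status": ["claim", None, "opinion", None],
    "video_view_count": [100.0, np.nan, 250.0, np.nan],
    "video_like_count": [10.0, np.nan, 30.0, np.nan],
})

# Keep only the columns that contain at least one missing value
cols_with_na = toy.loc[:, toy.isna().any()]

# Rows where at least one of those columns is missing
rows_with_any_na = int(cols_with_na.isna().any(axis=1).sum())

# Rows where all of those columns are missing at once
rows_with_all_na = int(cols_with_na.isna().all(axis=1).sum())

# Equal counts mean the missing values are confined to the same rows
print(rows_with_any_na, rows_with_all_na)
```

If the two counts match on the real data, dropping rows with missing values removes exactly that many rows and nothing else.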

In [7]:

# Drop rows with missing values
data = data.dropna(axis=0)

In [8]:

# Display first few rows after handling missing values
data.head()

Out[8]:

	#	claim_status	video_id	video_duration_sec	video_transcription_text	verified_status	author_ban_status	video_view_count	video_like_count	video_share_count	video_download_count	video_comment_count
0	1	claim	7017666017	59	someone shared with me that drone deliveries a...	not verified	under review	343296.0	19425.0	241.0	1.0	0.0
1	2	claim	4014381136	32	someone shared with me that there are more mic...	not verified	active	140877.0	77355.0	19034.0	1161.0	684.0
2	3	claim	9859838091	31	someone shared with me that american industria...	not verified	active	902185.0	97690.0	2858.0	833.0	329.0
3	4	claim	1866847991	25	someone shared with me that the metro of st. p...	not verified	active	437506.0	239954.0	34812.0	1234.0	584.0
4	5	claim	7105231098	19	someone shared with me that the number of busi...	not verified	active	56167.0	34987.0	4110.0	547.0	152.0

Let's look into the relationship between verified_status and video_view_count. One approach is to examine the mean values of video_view_count for each group of verified_status in the sample data.

In [9]:

# Compute the mean `video_view_count` for each group in `verified_status`
data.groupby("verified_status")["video_view_count"].mean()

Out[9]:

verified_status
not verified    265663.785339
verified         91439.164167
Name: video_view_count, dtype: float64
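The describe() output above shows a mean view count (about 254,709) far above the median (9,954.5), so the distribution is right-skewed and group means alone can be misleading. A small sketch of reporting the group size and median next to the mean follows; the `toy` dataframe is an illustrative assumption, not the real data.

```python
import pandas as pd

# Hypothetical toy data with the same two column names as the dataset
toy = pd.DataFrame({
    "verified_status": ["verified", "verified",
                        "not verified", "not verified", "not verified"],
    "video_view_count": [100.0, 300.0, 50.0, 150.0, 1000.0],
})

# Report group size, mean, and median side by side for each group
summary = (
    toy.groupby("verified_status")["video_view_count"]
    .agg(["count", "mean", "median"])
)
print(summary)
```

On the real dataframe, the same `agg` call could be applied to `data` in place of `toy`.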

Task 3. Hypothesis testing¶

Null hypothesis: There is no difference in the mean number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified accounts (any observed difference in the sample data is due to chance or sampling variability).
Alternative hypothesis: There is a difference in the mean number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified accounts (any observed difference in the sample data is due to an actual difference in the corresponding population means).

Our goal is to conduct a two-sample t-test. Here are the steps:

State the null hypothesis and the alternative hypothesis
Choose a significance level (5% in this project)
Find the p-value
Reject or fail to reject the null hypothesis
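The four steps above can be sketched end to end on synthetic data. The samples `group_a` and `group_b`, their parameters, and the seed are assumptions chosen only for illustration; the significance level of 0.05 matches the one used in this project.

```python
import numpy as np
from scipy import stats

# Step 1: H0 says the two population means are equal; H1 says they differ.

# Hypothetical samples drawn from normal distributions with different means
rng = np.random.default_rng(0)
group_a = rng.normal(loc=100.0, scale=10.0, size=200)
group_b = rng.normal(loc=90.0, scale=20.0, size=150)

# Step 2: choose the significance level
alpha = 0.05

# Step 3: find the p-value with a two-sample t-test
# (equal_var=False requests Welch's t-test, which does not assume equal variances)
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

# Step 4: reject H0 if the p-value falls below alpha, otherwise fail to reject
reject_null = result.pvalue < alpha
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3g}, reject H0: {reject_null}")
```

Fixing `alpha` before looking at the p-value keeps the decision rule independent of the result.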

In [10]:

# Conduct a two-sample t-test to compare means

# Save each sample in a variable
not_verified = data[data["verified_status"] == "not verified"]["video_view_count"]
verified = data[data["verified_status"] == "verified"]["video_view_count"]

# Implement a t-test using the two samples
stats.ttest_ind(a=not_verified, b=verified, equal_var=False)

Out[10]:

Ttest_indResult(statistic=25.499441780633777, pvalue=2.6088823687177823e-120)

Result:

Since the p-value is extremely small (far below the 5% significance level), we reject the null hypothesis and conclude that there is a statistically significant difference in the mean video view count between verified and unverified accounts on TikTok.
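To make the reported statistic less of a black box, the sketch below recomputes Welch's t-statistic and its two-sided p-value by hand and compares them with `stats.ttest_ind(..., equal_var=False)`. The arrays `a` and `b` are small made-up samples, not the view counts from the dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical small samples used only to illustrate the calculation
a = np.array([12.0, 15.0, 11.0, 18.0, 14.0, 16.0])
b = np.array([9.0, 10.0, 13.0, 8.0, 11.0])

n_a, n_b = len(a), len(b)
var_a, var_b = a.var(ddof=1), b.var(ddof=1)  # sample variances

# Welch's t-statistic: difference in means over its estimated standard error
se = np.sqrt(var_a / n_a + var_b / n_b)
t_manual = (a.mean() - b.mean()) / se

# Welch-Satterthwaite approximation for the degrees of freedom
dof = (var_a / n_a + var_b / n_b) ** 2 / (
    (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# Reference values from SciPy's implementation
result = stats.ttest_ind(a=a, b=b, equal_var=False)
print(np.isclose(t_manual, result.statistic), np.isclose(p_manual, result.pvalue))
```

With samples as large as those in this project, the standard error becomes very small relative to the difference in means, which is why the p-value reported above is so close to zero.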

PACE: Execute¶

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

Task 4. Communicate insights with stakeholders¶

Question:

What business insight(s) can you draw from the result of your hypothesis test?

The analysis shows that there is a statistically significant difference in the average view counts between videos from verified accounts and videos from unverified accounts. In this sample, videos from unverified accounts average about 265,664 views, compared with about 91,439 for verified accounts. This suggests there might be fundamental behavioral differences between these two groups of accounts.

It would be interesting to investigate the root cause of this behavioral difference. For example, do unverified accounts tend to post more clickbait-y videos? Or are unverified accounts associated with spam bots that help inflate view counts?

The next step will be to build a regression model on verified_status. A regression model is the natural next step because the end goal is to make predictions on claim status, and a model for verified_status can help analyze user behavior among verified users. Technical note for preparing the regression model: because verified_status is a binary outcome, the data is skewed, and there is a significant difference between account types, it will be key to build a logistic regression model.