While I'm not a fan nor a hater of the Kardashians and Jenners, the polarizing family intrigues me. Why? Their marketing prowess. Say what you will about them and what they stand for, they are great at the hype game. Everything they touch turns to content. In this Project, you will explore the data underneath the hype in the form of search interest data from Google Trends. You'll recreate the Google Trends plot to visualize their ups and downs over time, then make a few custom plots of your own. And you'll answer the big question - "is Kim even the most famous sister anymore?" This is the Result of Project "Up and Down With the Kardashians", via datacamp.
While I'm not a fan nor a hater of the Kardashians and Jenners, the polarizing family intrigues me. Why? Their marketing prowess. Say what you will about them and what they stand for, they are great at the hype game. Everything they touch turns to content.
The sisters in particular over the past decade have been especially productive in this regard. Let's get some facts straight. I consider the "sisters" to be the following daughters of Kris Jenner. Three from her first marriage to lawyer Robert Kardashian:
And two from her second marriage to Olympic gold medal-winning decathlete, Caitlyn Jenner (formerly Bruce):
This family tree can be confusing, but we aren't here to explain it. We're here to explore the data underneath the hype, and we'll do it using search interest data from Google Trends. We'll recreate the Google Trends plot to visualize their ups and downs over time, then make a few custom plots of our own. And we'll answer the big question: is Kim even the most famous sister anymore?
First, let's load and inspect our Google Trends data, which was downloaded in CSV form. The query parameters: each of the sisters, worldwide search data, 2007 to present day. (2007 was the year Kim became "active" according to Wikipedia.)
import matplotlib.pyplot as plt
import pandas as pd
plt.rcParams['figure.figsize'] = (10, 8)
# Read in dataset
trends = pd.read_csv('./dataset/trends_kj_sisters.csv')
# Inspect data
trends.head()
Month | Kim Kardashian: (Worldwide) | Khloé Kardashian: (Worldwide) | Kourtney Kardashian: (Worldwide) | Kendall Jenner: (Worldwide) | Kylie Jenner: (Worldwide) | |
---|---|---|---|---|---|---|
0 | 2007-01 | 2 | <1 | <1 | <1 | 2 |
1 | 2007-02 | 12 | <1 | <1 | <1 | 2 |
2 | 2007-03 | 9 | <1 | <1 | <1 | 1 |
3 | 2007-04 | 6 | <1 | <1 | <1 | 1 |
4 | 2007-05 | 6 | <1 | <1 | <1 | 2 |
So we have a column for each month since January 2007 and a column for the worldwide search interest for each of the sisters each month. By the way, Google defines the values of search interest as:
Numbers represent search interest relative to the highest point on the chart for the given region and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. A score of 0 means there was not enough data for this term.
Okay, that's great Google, but you are not making this data easily analyzable for us. I see a few things. Let's do the column names first. A column named "Kim Kardashian: (Worldwide)" is not the most usable for coding purposes. Let's shorten those so we can access their values better. Might as well standardize all column formats, too. I like lowercase, short column names.
# Make column names easier to work with
trends.columns = ['month', 'kim', 'khloe', 'kourtney', 'kendall', 'kylie']
# Inspect data
trends.head()
month | kim | khloe | kourtney | kendall | kylie | |
---|---|---|---|---|---|---|
0 | 2007-01 | 2 | <1 | <1 | <1 | 2 |
1 | 2007-02 | 12 | <1 | <1 | <1 | 2 |
2 | 2007-03 | 9 | <1 | <1 | <1 | 1 |
3 | 2007-04 | 6 | <1 | <1 | <1 | 1 |
4 | 2007-05 | 6 | <1 | <1 | <1 | 2 |
That's better. We don't need to scroll our eyes across the table to read the values anymore since it is much less wide. And seeing five columns that all start with the letter "k" … the aesthetics … we should call them "kolumns" now! (Bad joke.)
The next thing I see that is going to be an issue is that "<" sign. If "a score of 0 means there was not enough data for this term," "<1" must mean it is between 0 and 1 and Google does not want to give us the fraction from google.trends.com for whatever reason. That's fine, but this "<" sign means we won't be able to analyze or visualize our data right away because those column values aren't going to be represented as numbers in our data structure. Let's confirm that by inspecting our data types.
# Inspect data types
trends.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 147 entries, 0 to 146 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 month 147 non-null object 1 kim 147 non-null int64 2 khloe 147 non-null object 3 kourtney 147 non-null object 4 kendall 147 non-null object 5 kylie 147 non-null int64 dtypes: int64(2), object(4) memory usage: 7.0+ KB
Yes, okay, the khloe
, kourtney
, and kendall
columns aren't integers like the kim
and kylie
columns are. Again, because of the "<" sign that indicates a search interest value between zero and one. Is this an early hint at the hierarchy of sister popularity? We'll see shortly. Before that, we'll need to remove that pesky "<" sign. Then we can change the type of those columns to integer.
# Loop through columns
for column in trends.columns:
# Only modify columns that have the "<" sign
if "<" in trends[column].to_string():
# Remove "<" and convert dtype to integer
trends[column] = trends[column].str.replace("<", "")
trends[column] = pd.to_numeric(trends[column])
# Inspect data types and data
trends.info()
trends.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 147 entries, 0 to 146 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 month 147 non-null object 1 kim 147 non-null int64 2 khloe 147 non-null int64 3 kourtney 147 non-null int64 4 kendall 147 non-null int64 5 kylie 147 non-null int64 dtypes: int64(5), object(1) memory usage: 7.0+ KB
month | kim | khloe | kourtney | kendall | kylie | |
---|---|---|---|---|---|---|
0 | 2007-01 | 2 | 1 | 1 | 1 | 2 |
1 | 2007-02 | 12 | 1 | 1 | 1 | 2 |
2 | 2007-03 | 9 | 1 | 1 | 1 | 1 |
3 | 2007-04 | 6 | 1 | 1 | 1 | 1 |
4 | 2007-05 | 6 | 1 | 1 | 1 | 2 |
Okay, great, no more "<" signs. All the sister columns are of integer type.
Now let's convert our month
column from type object to datetime to make our date data more accessible.
# Convert month to type datetime
trends['month'] = pd.to_datetime(trends['month'])
# Inspect data types and data
trends.info()
trends.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 147 entries, 0 to 146 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 month 147 non-null datetime64[ns] 1 kim 147 non-null int64 2 khloe 147 non-null int64 3 kourtney 147 non-null int64 4 kendall 147 non-null int64 5 kylie 147 non-null int64 dtypes: datetime64[ns](1), int64(5) memory usage: 7.0 KB
month | kim | khloe | kourtney | kendall | kylie | |
---|---|---|---|---|---|---|
0 | 2007-01-01 | 2 | 1 | 1 | 1 | 2 |
1 | 2007-02-01 | 12 | 1 | 1 | 1 | 2 |
2 | 2007-03-01 | 9 | 1 | 1 | 1 | 1 |
3 | 2007-04-01 | 6 | 1 | 1 | 1 | 1 |
4 | 2007-05-01 | 6 | 1 | 1 | 1 | 2 |
And finally, let's set the month
column as our index to wrap our data cleaning. Having month
as index rather than the zero-based row numbers will allow us to write shorter lines of code to create plots, where month
will represent our x-axis.
# Set month as DataFrame index
trends = trends.set_index('month')
# Inspect the data
trends.head()
kim | khloe | kourtney | kendall | kylie | |
---|---|---|---|---|---|
month | |||||
2007-01-01 | 2 | 1 | 1 | 1 | 2 |
2007-02-01 | 12 | 1 | 1 | 1 | 2 |
2007-03-01 | 9 | 1 | 1 | 1 | 1 |
2007-04-01 | 6 | 1 | 1 | 1 | 1 |
2007-05-01 | 6 | 1 | 1 | 1 | 2 |
Okay! So our data is ready to plot. Because we cleaned our data, we only need one line of code (and just thirteen characters!) to remake the Google Trends chart, plus another line to make the plot show up in our notebook.
# Plot search interest vs. month
%matplotlib inline
fig, ax = plt.subplots(figsize=(10, 8))
trends.plot(ax=ax);
Oh my! There is so much to make sense of here. Kim's sharp rise in 2007, with the beginning of Keeping Up with the Kardashians, among other things. There was no significant search interest for the other four sisters until mid-2009 when Kourtney and Khloé launched the reality television series, Kourtney and Khloé Take Miami. Then there was Kim's rise from famous to literally more famous than God in 2011. This Cosmopolitan article covers the timeline that includes the launch of music videos, fragrances, iPhone and Android games, another television series, joining Instagram, and more. Then there was Kim's ridiculous spike in December 2014: posing naked on the cover of Paper Magazine in a bid to break the internet will do that for you.
A curious thing starts to happen after that bid as well. Let's zoom in…
# Zoom in from January 2014
fig, ax = plt.subplots(figsize=(10, 8))
trends.loc['2014-01':'2019-03'].plot(ax=ax);
It looks like my suspicion may be true: Kim is not always the most searched Kardashian or Jenner sister. Since late-2016, at various months, Kylie overtakes Kim. Two big spikes where she smashed Kim's search interest: in September 2017 when it was reported that Kylie was expecting her first child with rapper Travis Scott and in February 2018 when she gave birth to her daughter, Stormi Webster. The continued success of Kylie Cosmetics has kept her in the news, not to mention making her the "The Youngest Self-Made Billionaire Ever" according to Forbes.
These fluctuations are descriptive but do not really help us answer our question: is Kim even the most famous sister anymore? We can use rolling means to smooth out short-term fluctuations in time series data and highlight long-term trends. Let's make the window twelve months a.k.a. one year.
# Smooth the data with rolling means
fig, ax = plt.subplots(figsize=(10, 8))
trends.rolling(window=12).mean().plot(ax=ax);
Whoa, okay! So by this metric, Kim is still the most famous sister despite Kylie being close and nearly taking her crown. Honestly, the biggest takeaway from this whole exercise might be Kendall not showing up that much. It makes sense, though, despite her wildly successful modeling career. Some have called her "the only normal one in her family" as she tends to shy away from the more dramatic and controversial parts of the media limelight that generate oh so many clicks.
Let's end this analysis with one last plot. In it, we will plot (pun!) the Kardashian sisters against the Jenner sisters to see which family line is more popular now. We will use average search interest to make things fair, i.e., total search interest divided by the number of sisters in the family line.
The answer? Since 2015, it has been a toss-up. And in the future? With this family and their penchant for big events, who knows?
# Average search interest for each family line
trends['kardashian'] = trends[['kim', 'khloe', 'kourtney']].sum(axis=1) / 3
trends['jenner'] = trends[['kendall', 'kylie']].sum(axis=1) / 2
# Plot average family line search interest vs. month
fig, ax = plt.subplots(figsize=(10, 8))
trends[['kardashian', 'jenner']].plot(ax=ax);