Exploring data for the attention index

The idea of the attention index is to provide a score that indicates the impact of an article, and can easily be aggregated by subject, publisher or other axis.

The index comprises two parts:

  • promotion: how important the article was to the publisher, based on the extent to which they chose to editorially promote it
  • response: how readers reacted to the article, based on social engagements

The index will be a number between 0 and 100. 50% is driven by the promotion, and 50% by response:

Attention Index

Promotion Score

The promotion score should take into account:

  • whether the publisher chose to make the article a lead article on their primary front (40%)
  • how long the publisher chose to retain the article on their front (30%)
  • whether they chose to push the article on their Facebook brand page (30%)

It should be scaled based on the value of that promotion, so a popular, well-visited site should score higher than one on the fringes. And similarly a powerful, well-followed brand page should score higher than one less followed.

Response Score

The response score takes into account the number of engagements on Facebook.

The rest of this notebook explores how those numbers could work, starting with the response score because that is easier, I think.

Setup

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
In [2]:
data = pd.read_csv("articles_2017-05-01_2017-05-31.csv", index_col="id", \
                   parse_dates=["published", "discovered"])
data.head()
Out[2]:
url headline discovered published fb_engagements fb_max_engagements_per_min fb_comments fb_reactions fb_shares publisher_name publisher_id mins_as_lead mins_on_front num_articles_on_front fb_brand_page fb_brand_page_likes alexa_rank
id
0d2c10289ac562afa42b458bf238defd5b7d86d8 https://www.nytimes.com/2017/04/30/nyregion/me... April Fools’ With the Epitome of Cool 2017-05-01 00:05:00.022 2017-05-01 00:00:00 5 0.083449 NaN NaN NaN New York Times nytimes_com 0 0 NaN False NaN 120
0d35eb2f3c53ad8218866a0927082b0a2a53576d http://www.dailymail.co.uk/news/article-446135... Young mother loses three teeth and is left nee... 2017-05-01 00:05:23.416 2017-05-01 00:00:21 8 0.083449 NaN NaN NaN Daily Mail dailymail_co_uk 0 584 603.0 False NaN 158
20a659b9ee677b718b5834566795b6e231541a9b http://www.dailymail.co.uk/tvshowbiz/article-4... Holly Hagan slams 'jealous' trolls in angry Tw... 2017-05-01 00:05:19.176 2017-05-01 00:00:32 1 0.000000 NaN NaN NaN Daily Mail dailymail_co_uk 0 2352 603.0 False NaN 158
7572bc3039f8f0ad83fea23d2692d5bd8a4cadd9 http://www.dailymail.co.uk/news/article-446135... Primary school headteachers could strike over ... 2017-05-01 00:05:19.083 2017-05-01 00:00:32 10 0.083449 NaN NaN NaN Daily Mail dailymail_co_uk 0 1012 603.0 False NaN 158
a8e2c63260f908b2f091862fff1b884da0eb5fe9 http://www.dailymail.co.uk/news/article-446138... Man films student driver who 'fled from hit an... 2017-05-01 00:05:22.695 2017-05-01 00:00:48 12 0.083449 NaN NaN NaN Daily Mail dailymail_co_uk 0 0 NaN False NaN 158

Response Score

The response score is a number between 0 and 50 that indicates the level of response to an article.

Perhaps in the future we may choose to include other factors, but for now we just include engagements on Facebook. The maximum score of 50 should be achieved by an article that does really well compared with others.

In [3]:
pd.options.display.float_format = '{:.2f}'.format
data.fb_engagements.describe([0.5, 0.75, 0.9, 0.95, 0.99, 0.995, 0.999])
Out[3]:
count    128531.00
mean       1219.29
std        6211.63
min           0.00
50%          36.00
75%         362.00
90%        2256.00
95%        5690.00
99%       22447.10
99.5%     33332.15
99.9%     65258.53
max     1114468.00
Name: fb_engagements, dtype: float64

There's one article there with more than 1 million engagements. Let's just double check that.

In [4]:
data[data.fb_engagements > 1000000]
Out[4]:
url headline discovered published fb_engagements fb_max_engagements_per_min fb_comments fb_reactions fb_shares publisher_name publisher_id mins_as_lead mins_on_front num_articles_on_front fb_brand_page fb_brand_page_likes alexa_rank
id
c55509930e92f71ee534edf7d570b70ed81a2267 https://www.indy100.com/article/divorced-fathe... This dad's brilliant post about his ex-wife is... 2017-05-15 11:09:18.339 2017-05-14 08:36:41 1114468 8.34 58501.00 914126.00 141841.00 indy100 indy100_com 0 9937 100.00 False nan 5014

Yup. The Facebook Open Graph debugger agrees with that, although it looks to me like Facebook has got confused. But Kaleida is correctly reflecting what Facebook is saying.

In [5]:
data.fb_engagements.mode()
Out[5]:
0    0
dtype: int64

Going back to the engagement counts, we see the mean is 1,219, the mode is zero, the median is 36, the 90th percentile is 2,256, the 99th percentile is 22,447 and the 99.5th percentile is 33,332. The standard deviation is 6,212, significantly higher than the mean, so this is not a normal distribution.

We want to provide a sensible way of allocating this to the 50 buckets we have available. Let's just try 50 equal-width buckets first:

In [6]:
mean = data.fb_engagements.mean()
median = data.fb_engagements.median()

plt.figure(figsize=(12,4.5))
plt.hist(data.fb_engagements, bins=50)
plt.axvline(mean, linestyle=':', label=f'Mean ({mean:,.0f})', color='green')
plt.axvline(median, label=f'Median ({median:,.0f})', color='red')
leg = plt.legend()

Well that's not very useful. Almost everything will land in the lowest bucket if we just do that, which isn't a useful metric.
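A quick back-of-the-envelope check, using only the summary statistics printed in Out[3], suggests why. With 50 equal-width bins, each bin spans roughly 22,000 engagements, so the 95th percentile sits in the first bin and the 99th percentile only just reaches the second. The names below are illustrative and not part of the notebook.

```python
# Figures copied from the describe() output in Out[3].
max_engagements = 1114468
p95 = 5690.00
p99 = 22447.10

# Width of each of the 50 equal-width bins used by plt.hist(..., bins=50).
bin_width = max_engagements / 50


def equal_width_bucket(engagements):
    """Zero-based index of the equal-width bin an engagement count falls in."""
    return min(int(engagements // bin_width), 49)


print(f"bin width: {bin_width:,.0f}")
print("95th percentile bucket:", equal_width_bucket(p95))
print("99th percentile bucket:", equal_width_bucket(p99))
```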

Let's start by excluding zeros.

In [7]:
non_zero_fb_enagagements = data.fb_engagements[data.fb_engagements > 0]

plt.figure(figsize=(12,4.5))
plt.hist(non_zero_fb_enagagements, bins=50)
plt.axvline(mean, linestyle=':', label=f'Mean ({mean:,.0f})', color='green')
plt.axvline(median, label=f'Median ({median:,.0f})', color='red')
leg = plt.legend()

That's still a big number at the bottom, and so not a useful score.

Next, we exclude the outliers: cap at the 99.9th percentile (i.e. 65,258.53), so that 0.1% of articles should receive the maximum score.

In [8]:
non_zero_fb_enagagements_without_outliers = non_zero_fb_enagagements.clip(upper=65258.53)

plt.figure(figsize=(12,4.5))
plt.hist(non_zero_fb_enagagements_without_outliers, bins=50)
plt.axvline(mean, linestyle=':', label=f'Mean ({mean:,.0f})', color='green')
plt.axvline(median, label=f'Median ({median:,.0f})', color='red')
leg = plt.legend()

That's a bit better, but still way too clustered at the low end. Let's look at the distribution of the log of the engagements instead.

In [9]:
mean = data.fb_engagements.mean()
median = data.fb_engagements.median()
ninety = data.fb_engagements.quantile(.90)
ninetyfive = data.fb_engagements.quantile(.95)
ninetynine = data.fb_engagements.quantile(.99)

plt.figure(figsize=(12,4.5))
plt.hist(np.log(non_zero_fb_enagagements + median), bins=50)
plt.axvline(np.log(mean), linestyle=':', label=f'Mean ({mean:,.0f})', color='green')
plt.axvline(np.log(median), label=f'Median ({median:,.0f})', color='green')
plt.axvline(np.log(ninety), linestyle='--', label=f'90% percentile ({ninety:,.0f})', color='red')
plt.axvline(np.log(ninetyfive), linestyle='-.', label=f'95% percentile ({ninetyfive:,.0f})', color='red')
plt.axvline(np.log(ninetynine), linestyle=':', label=f'99% percentile ({ninetynine:,.0f})', color='red')
leg = plt.legend()

That's looking a bit more interesting.

After some exploration, to avoid too much emphasis on the lower end of the scale, we move the numbers to the right a bit by adding on the median.

In [10]:
log_engagements = (non_zero_fb_enagagements
                   .clip(upper=data.fb_engagements.quantile(.999))
                   .apply(lambda x: np.log(x + median))
                  )
log_engagements.describe()
Out[10]:
count   109192.00
mean         5.29
std          1.73
min          3.61
25%          3.83
50%          4.66
75%          6.35
max         11.09
Name: fb_engagements, dtype: float64

Use standard (min-max) feature scaling, rounded up, to bring that to a 0 to 50 range. Only the smallest value (articles with a single engagement) maps to 0.

In [11]:
def scale_log_engagements(engagements_logged):
    return np.ceil(
        50 * (engagements_logged - log_engagements.min()) / (log_engagements.max() - log_engagements.min())
    )

def scale_engagements(engagements):
    return scale_log_engagements(np.log(engagements + median))

scaled_non_zero_engagements = scale_log_engagements(log_engagements)
scaled_non_zero_engagements.describe()
Out[11]:
count   109192.00
mean        11.71
std         11.56
min          0.00
25%          2.00
50%          8.00
75%         19.00
max         50.00
Name: fb_engagements, dtype: float64
In [12]:
# add in the zeros, as zero
scaled_engagements = pd.concat([scaled_non_zero_engagements, data.fb_engagements[data.fb_engagements == 0]])
In [13]:
proposed = pd.DataFrame({"fb_engagements": data.fb_engagements, "response_score": scaled_engagements})
proposed.response_score.plot.hist(bins=50)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d1cd3c8>

Now look at how engagement counts (labelled "shares" in the plot) map to the score:

In [14]:
plt.figure(figsize=(15,8))

shares = np.arange(1, 60000)
plt.plot(shares, scale_engagements(shares))
plt.xlabel("shares")
plt.ylabel("score")
plt.axhline(scale_engagements(mean), linestyle=':', label=f'Mean ({mean:,.0f})', color='green')
plt.axhline(scale_engagements(median), label=f'Median ({median:,.0f})', color='green')
plt.axhline(scale_engagements(ninety), linestyle='--', label=f'90% percentile ({ninety:,.0f})', color='red')
plt.axhline(scale_engagements(ninetyfive), linestyle='-.', label=f'95% percentile ({ninetyfive:,.0f})', color='red')
plt.axhline(scale_engagements(ninetynine), linestyle=':', label=f'99% percentile ({ninetynine:,.0f})', color='red')

plt.legend(frameon=True, shadow=True)
Out[14]:
<matplotlib.legend.Legend at 0x10a648160>
In [15]:
proposed.groupby("response_score").fb_engagements.agg([np.size, np.min, np.max])
Out[15]:
size amin amax
response_score
0.00 26315 0 1
1.00 14791 2 6
2.00 9188 7 13
3.00 6501 14 21
4.00 5544 22 31
5.00 4380 32 42
6.00 3603 43 54
7.00 3591 55 69
8.00 3145 70 86
9.00 2944 87 106
10.00 2740 107 129
11.00 2533 130 155
12.00 2364 156 186
13.00 2383 187 222
14.00 2338 223 264
15.00 2108 265 312
16.00 2145 313 368
17.00 2010 369 433
18.00 1904 434 509
19.00 1966 510 597
20.00 1794 598 699
21.00 1751 700 818
22.00 1664 819 956
23.00 1549 957 1116
24.00 1554 1117 1302
25.00 1523 1303 1518
26.00 1325 1519 1768
27.00 1276 1769 2060
28.00 1265 2061 2398
29.00 1168 2399 2790
30.00 1148 2791 3246
31.00 1057 3247 3775
32.00 972 3777 4390
33.00 934 4394 5101
34.00 873 5108 5932
35.00 815 5934 6896
36.00 761 6897 8014
37.00 609 8016 9308
38.00 618 9314 10818
39.00 540 10823 12568
40.00 498 12571 14604
41.00 401 14609 16943
42.00 385 16989 19704
43.00 311 19707 22862
44.00 276 22916 26583
45.00 224 26596 30879
46.00 188 31006 35813
47.00 149 35889 41618
48.00 115 41659 48277
49.00 93 48524 55987
50.00 202 56212 1114468

Looks good to me, let's save that.

In [16]:
data["response_score"] = proposed.response_score

Proposal

The maximum of 50 points is awarded when the engagements reach the 99.9th percentile, calculated on a rolling basis over the last month.

i.e. where $limit$ is the 99.9th percentile of engagements calculated over the previous month, the response score for article $a$ is:

\begin{align} basicScore_a & = \begin{cases} 0 & \text{if } engagements_a = 0 \\ \log(\min(engagements_a,limit) + median(engagements)) & \text{if } engagements_a > 0 \end{cases} \\ responseScore_a & = \begin{cases} 0 & \text{if } engagements_a = 0 \\ 50 \cdot \frac{basicScore_a - \min(basicScore)}{\max(basicScore) - \min(basicScore)} & \text{if } engagements_a > 0 \end{cases} \\ \\ \text{The latter equation can be expanded to:} \\ responseScore_a & = \begin{cases} 0 & \text{if } engagements_a = 0 \\ 50 \cdot \frac{\log(\min(engagements_a,limit) + median(engagements)) - \log(1 + median(engagements))} {\log(limit + median(engagements)) - \log(1 + median(engagements))} & \text{if } engagements_a > 0 \end{cases} \\ \end{align}
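As a rough sketch, the expanded formula can be written as a small standalone function. The function and constant names are illustrative; the limit and median are the May 2017 values from the outputs above. The formula itself is unrounded, whereas the notebook cells round up with np.ceil, so the example applies math.ceil when printing.

```python
import math


def response_score(engagements, limit, median):
    """Unrounded response score (0 to 50) following the expanded formula."""
    if engagements <= 0:
        return 0.0
    lowest = math.log(1 + median)        # basic score of an article with 1 engagement
    highest = math.log(limit + median)   # basic score of an article at the cap
    basic = math.log(min(engagements, limit) + median)
    return 50 * (basic - lowest) / (highest - lowest)


# Values taken from the notebook outputs for May 2017.
LIMIT = 65258.53   # 99.9th percentile of fb_engagements
MEDIAN = 36        # median of fb_engagements

for n in (0, 1, 100, 1000, 1114468):
    # Round up, as the notebook does, to get the bucketed score.
    print(n, math.ceil(response_score(n, LIMIT, MEDIAN)))
```

Rounded up, 100 and 1,000 engagements fall in the same buckets as in the groupby table above.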

Promotion Score

The aim of the promotion score is to indicate how important the article was to the publisher, by tracking where they chose to promote it. This is a number between 0 and 50 made up of:

  • 20 points based on whether the article was promoted as the "lead" story on the publisher's home page
  • 15 points based on how long the article was promoted anywhere on the publisher's home page
  • 15 points based on whether the article was promoted on the publisher's main Facebook brand page

The first two should be scaled by the popularity/reach of the home page, for which we use the Alexa rank as a proxy.

The last should be scaled by the popularity/reach of the brand page, for which we use the number of likes the brand page has.
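To make the composition concrete, here is a minimal sketch of how the three promotion components and the response score could be combined. The function names are hypothetical, and clamping each component to its maximum is an assumption rather than something the notebook does.

```python
def promotion_score(lead_score, front_score, facebook_score):
    """Total promotion score: 20 + 15 + 15 points, clamped per component (assumption)."""
    lead = max(0, min(lead_score, 20))
    front = max(0, min(front_score, 15))
    facebook = max(0, min(facebook_score, 15))
    return lead + front + facebook


def attention_index(promotion, response):
    """Attention index: promotion (0 to 50) plus response (0 to 50)."""
    return max(0, min(promotion, 50)) + max(0, min(response, 50))


# Example: a full-score lead story that spent some time on the front,
# was not posted to Facebook, and drew a mid-range response.
print(attention_index(promotion_score(20, 8, 0), 23))
```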

Lead story (20 points)

In [17]:
data.mins_as_lead.describe([0.5, 0.75, 0.9, 0.95, 0.99, 0.995, 0.999])
Out[17]:
count   128531.00
mean         8.31
std         95.44
min          0.00
50%          0.00
75%          0.00
90%          0.00
95%          0.00
99%        251.00
99.5%      520.70
99.9%     1143.70
max      14584.00
Name: mins_as_lead, dtype: float64

As expected, the vast majority of articles don't make it as lead. Let's explore how long publishers typically keep an article as lead.

In [18]:
lead_articles = data[data.mins_as_lead > 0]
In [19]:
lead_articles.mins_as_lead.describe([0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.995, 0.999])
Out[19]:
count    3121.00
mean      342.34
std       510.76
min         3.00
25%        89.00
50%       192.00
75%       440.00
90%       796.00
95%      1070.00
99%      1703.40
99.5%    2518.80
99.9%    5880.00
max     14584.00
Name: mins_as_lead, dtype: float64
In [20]:
lead_articles.mins_as_lead.plot.hist(bins=50)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a351208>

For lead, it's a significant thing for an article to be lead at all, so although we want to penalise articles that were lead for a very short time, mostly we want to score the maximum even if it wasn't lead for ages. So we'll give maximum points when something has been lead for an hour.

In [21]:
lead_articles.mins_as_lead.clip(upper=60).plot.hist(bins=50)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d1cdc88>

We also want to scale this by the Alexa rank, such that the maximum score of 20 points is for an article that was lead for an hour on the most popular site.

So let's explore the Alexa numbers.

In [22]:
alexa_ranks = data.groupby(by="publisher_id").alexa_rank.mean().sort_values()
alexa_ranks
Out[22]:
publisher_id
bbc_co_uk                               96
cnn_com                                105
nytimes_com                            120
theguardian_com                        142
buzzfeed_com                           147
dailymail_co_uk                        158
washingtonpost_com                     191
huffingtonpost_com                     215
foxnews_com                            285
rt_com                                 365
telegraph_co_uk                        370
independent_co_uk                      386
reuters_com                            497
npr_org                                594
nbcnews_com                            826
breitbart_com                          994
ft_com                                1596
economist_com                         1825
indy100_com                           5014
thetimes_co_uk                        6435
thecanary_co                         15686
propublica_org                       16066
yournewswire_com                     22568
order-order_com                      32515
anotherangryvoice_blogspot_co_uk     77827
westmonster_com                      97775
evolvepolitics_com                  119412
skwawkbox_org                       152475
libdemvoice_org                     344992
brexitcentral_com                   469149
Name: alexa_rank, dtype: int64
In [23]:
alexa_ranks.plot.bar(figsize=[10,5])
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x108987908>

Let's try the simple option first: just divide the number of minutes as lead by the Alexa rank. What's the scale of numbers we get then?

In [24]:
lead_proposal_1 = lead_articles.mins_as_lead.clip(upper=60) / lead_articles.alexa_rank
lead_proposal_1.plot.hist()
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a167828>

Looks like there's too much of a cluster around 0. Have we massively over-penalised the publishers with a large Alexa rank number (i.e. the less popular sites)?

In [25]:
lead_proposal_1.groupby(data.publisher_id).mean().plot.bar(figsize=[10,5])
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b9e4a58>

Yes. Let's try taking the log of the alexa rank and see if that looks better.
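As a rough illustration of why the log helps, compare the best-ranked and worst-ranked publishers in the table above for an article that was lead for the full 60 minutes. Dividing by the raw rank leaves them nearly 5,000 times apart, while dividing by the log of the rank leaves them less than 3 times apart. The variable names are illustrative only.

```python
import math

# Alexa ranks copied from the table above.
ranks = {"bbc_co_uk": 96, "brexitcentral_com": 469149}
mins_as_lead = 60  # an article that reaches the 60-minute cap

# Proposal 1: divide by the raw rank. Proposal 2: divide by log(rank).
by_rank = {p: mins_as_lead / r for p, r in ranks.items()}
by_log_rank = {p: mins_as_lead / math.log(r) for p, r in ranks.items()}

ratio_rank = by_rank["bbc_co_uk"] / by_rank["brexitcentral_com"]
ratio_log_rank = by_log_rank["bbc_co_uk"] / by_log_rank["brexitcentral_com"]

print(f"raw rank: best/worst ratio = {ratio_rank:,.0f}")
print(f"log rank: best/worst ratio = {ratio_log_rank:.2f}")
```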

In [26]:
lead_proposal_2 = (lead_articles.mins_as_lead.clip(upper=60) / np.log(lead_articles.alexa_rank))
lead_proposal_2.plot.hist()
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x10cf420b8>
In [27]:
lead_proposal_2.groupby(data.publisher_id).describe()
Out[27]:
count mean std min 25% 50% 75% max
publisher_id
anotherangryvoice_blogspot_co_uk 33.00 5.19 0.57 2.57 5.33 5.33 5.33 5.33
bbc_co_uk 96.00 12.68 1.96 0.88 13.15 13.15 13.15 13.15
breitbart_com 206.00 8.55 0.86 2.03 8.69 8.69 8.69 8.69
brexitcentral_com 27.00 4.43 0.64 1.45 4.59 4.59 4.59 4.59
buzzfeed_com 105.00 11.86 1.12 1.80 12.02 12.02 12.02 12.02
cnn_com 181.00 12.53 1.44 4.08 12.89 12.89 12.89 12.89
dailymail_co_uk 159.00 11.46 1.42 3.75 11.85 11.85 11.85 11.85
economist_com 44.00 7.55 1.68 0.67 7.99 7.99 7.99 7.99
evolvepolitics_com 26.00 5.02 0.59 2.14 5.13 5.13 5.13 5.13
foxnews_com 214.00 10.54 0.50 5.66 10.61 10.61 10.61 10.61
ft_com 76.00 7.75 1.54 0.54 8.14 8.14 8.14 8.14
huffingtonpost_com 166.00 10.72 1.81 0.56 11.17 11.17 11.17 11.17
independent_co_uk 160.00 9.40 1.93 0.50 10.07 10.07 10.07 10.07
indy100_com 109.00 5.65 1.96 1.06 4.11 7.04 7.04 7.04
libdemvoice_org 10.00 4.58 0.37 3.53 4.71 4.71 4.71 4.71
nbcnews_com 111.00 8.79 1.07 0.60 8.93 8.93 8.93 8.93
npr_org 164.00 8.84 1.82 0.47 9.39 9.39 9.39 9.39
nytimes_com 67.00 12.10 2.02 1.46 12.53 12.53 12.53 12.53
order-order_com 178.00 4.13 1.73 0.39 2.41 4.72 5.78 5.78
propublica_org 24.00 6.20 0.00 6.20 6.20 6.20 6.20 6.20
reuters_com 78.00 9.23 1.52 0.48 9.66 9.66 9.66 9.66
rt_com 78.00 9.71 1.61 0.85 10.17 10.17 10.17 10.17
skwawkbox_org 5.00 3.77 1.78 1.26 2.51 5.03 5.03 5.03
telegraph_co_uk 98.00 9.55 1.94 0.85 10.15 10.15 10.15 10.15
thecanary_co 205.00 4.92 1.76 0.93 3.52 6.11 6.21 6.21
theguardian_com 132.00 11.35 2.33 0.61 12.11 12.11 12.11 12.11
thetimes_co_uk 74.00 6.60 0.90 1.71 6.84 6.84 6.84 6.84
washingtonpost_com 83.00 11.01 2.02 0.57 11.42 11.42 11.42 11.42
westmonster_com 43.00 4.96 1.01 0.78 5.22 5.22 5.22 5.22
yournewswire_com 169.00 3.45 2.29 0.40 1.00 3.39 5.99 5.99
In [28]:
lead_proposal_2.groupby(data.publisher_id).min().plot.bar(figsize=[10,5])
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a67fdd8>

That looks about right, as long as the smaller publishers were closer to zero. So let's apply feature scaling to this, to give a number between 1 and 20. (Anything not as lead will pass through as zero.)

In [29]:
def rescale(series):
    return (series - series.min()) / (series.max() - series.min())

lead_proposal_3 = np.ceil(20 * rescale(lead_proposal_2))
In [30]:
lead_proposal_2.min(), lead_proposal_2.max()
Out[30]:
(0.38500569152790032, 13.145359968846892)
In [31]:
lead_proposal_3.plot.hist()
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f0ae550>
In [32]:
lead_proposal_3.groupby(data.publisher_id).median().plot.bar(figsize=[10,5])
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x110306b38>
In [33]:
data["lead_score"] = pd.concat([lead_proposal_3, data.mins_as_lead[data.mins_as_lead==0]])
In [34]:
data.lead_score.value_counts().sort_index()
Out[34]:
0.00     125411
1.00         81
2.00         57
3.00         44
4.00         66
5.00         45
6.00         43
7.00         68
8.00        145
9.00        185
10.00       139
11.00       140
12.00        52
13.00        88
14.00       313
15.00       228
16.00       298
17.00       363
18.00       226
19.00       217
20.00       322
Name: lead_score, dtype: int64
In [35]:
data.lead_score.groupby(data.publisher_id).max()
Out[35]:
publisher_id
anotherangryvoice_blogspot_co_uk    8.00
bbc_co_uk                          20.00
breitbart_com                      14.00
brexitcentral_com                   7.00
buzzfeed_com                       19.00
cnn_com                            20.00
dailymail_co_uk                    18.00
economist_com                      12.00
evolvepolitics_com                  8.00
foxnews_com                        17.00
ft_com                             13.00
huffingtonpost_com                 17.00
independent_co_uk                  16.00
indy100_com                        11.00
libdemvoice_org                     7.00
nbcnews_com                        14.00
npr_org                            15.00
nytimes_com                        20.00
order-order_com                     9.00
propublica_org                     10.00
reuters_com                        15.00
rt_com                             16.00
skwawkbox_org                       8.00
telegraph_co_uk                    16.00
thecanary_co                       10.00
theguardian_com                    19.00
thetimes_co_uk                     11.00
washingtonpost_com                 18.00
westmonster_com                     8.00
yournewswire_com                    9.00
Name: lead_score, dtype: float64

In summary then, the lead score for article $a$ is:

$$ unscaledLeadScore_a = \frac{\min(minsAsLead_a, 60)}{\log(alexaRank_a)}\\ leadScore_a = 19 \cdot \frac{unscaledLeadScore_a - \min(unscaledLeadScore)} {\max(unscaledLeadScore) - \min(unscaledLeadScore)} + 1 $$

Since the minimum value of $minsAsLead$ is 1, $\min(unscaledLeadScore)$ is pretty insignificant. So we can simplify this to:

$$ leadScore_a = 20 \cdot \frac{unscaledLeadScore_a } {\max(unscaledLeadScore)} $$

or:

$$ leadScore_a = 20 \cdot \frac{\frac{\min(minsAsLead_a, 60)}{\log(alexaRank_a)} } {\frac{60}{\log(\min(alexaRank))}} $$

$$ leadScore_a = 20 \cdot \frac{\min(minsAsLead_a, 60)}{\log(alexaRank_a)} \cdot \frac{\log(\min(alexaRank))}{60} $$
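A minimal sketch of the simplified lead score as a standalone function follows. The normalising constant is $\max(unscaledLeadScore)$, which is reached by an article that was lead for 60 minutes on the best-ranked site, i.e. the one with the smallest Alexa rank (96 for bbc_co_uk in this data, giving the 13.145 seen in Out[30]). The function name and the unrounded return value are choices made for illustration; the notebook cells round up with np.ceil.

```python
import math


def lead_score(mins_as_lead, alexa_rank, best_alexa_rank):
    """Unrounded lead score (0 to 20); assumes every Alexa rank is greater than 1."""
    if mins_as_lead <= 0:
        return 0.0
    unscaled = min(mins_as_lead, 60) / math.log(alexa_rank)
    # Largest possible unscaled score: 60 minutes on the best-ranked site.
    max_unscaled = 60 / math.log(best_alexa_rank)
    return 20 * unscaled / max_unscaled


BEST_RANK = 96  # bbc_co_uk, the smallest Alexa rank in the data

print(lead_score(60, 96, BEST_RANK))       # best-ranked site, full hour as lead
print(lead_score(60, 469149, BEST_RANK))   # worst-ranked site, full hour as lead
print(lead_score(30, 96, BEST_RANK))       # best-ranked site, half an hour as lead
```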

Time on front score (15 points)

This is similar to time as lead, so let's try doing the same calculation, except we also want to factor in the number of slots on the front:

$$frontScore_a = 15 \left(\frac{\min(minsOnFront_a, 1440)}{alexaRank_a \cdot numArticlesOnFront_a}\right) \left( \frac{\min(alexaRank \cdot numArticlesOnFront)}{1440} \right)$$
In [36]:
(data.alexa_rank * data.num_articles_on_front).min() / 1440
Out[36]:
2.4500000000000002
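With that constant in hand (2.45 × 1440 = 3528 is the smallest value of rank times slots), the front score formula can be sketched as a standalone function. The function name and the handling of missing slot counts are assumptions made for illustration; the result is unrounded, whereas the cells that follow round up with np.ceil.

```python
import math


def front_score(mins_on_front, alexa_rank, num_articles_on_front, min_rank_times_slots):
    """Unrounded time-on-front score (0 to 15) for a single article."""
    # Articles that never appeared on the front have no slot count (NaN in the data).
    if num_articles_on_front is None or math.isnan(num_articles_on_front):
        return 0.0
    if mins_on_front <= 0:
        return 0.0
    share = min(mins_on_front, 1440) / (alexa_rank * num_articles_on_front)
    return 15 * share * (min_rank_times_slots / 1440)


MIN_RANK_TIMES_SLOTS = 3528  # 2.45 * 1440, from the cell above

# Daily Mail article from the head of the data: 584 minutes on a front with
# 603 slots, Alexa rank 158.
example = front_score(584, 158, 603.0, MIN_RANK_TIMES_SLOTS)
print(example, math.ceil(example))
```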
In [37]:
time_on_front_proposal_1 = np.ceil(data.mins_on_front.clip(upper=1440) / (data.alexa_rank * data.num_articles_on_front) * (2.45) * 15)
In [38]:
time_on_front_proposal_1.plot.hist(figsize=(15, 7), bins=15)
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x1103c19e8>
In [39]:
time_on_front_proposal_1.value_counts().sort_index()
Out[39]:
1.00     57630
2.00      7582
3.00      5342
4.00      4237
5.00      1161
6.00       680
7.00       578
8.00       498
9.00       731
10.00      271
11.00      123
12.00      321
13.00       66
14.00       49
15.00      317
dtype: int64
In [40]:
time_on_front_proposal_1.groupby(data.publisher_id).sum()
Out[40]:
publisher_id
anotherangryvoice_blogspot_co_uk      60.00
bbc_co_uk                          14530.00
breitbart_com                       2592.00
brexitcentral_com                     40.00
buzzfeed_com                       11209.00
cnn_com                            12716.00
dailymail_co_uk                    14314.00
economist_com                        328.00
evolvepolitics_com                    47.00
foxnews_com                         9078.00
ft_com                              3300.00
huffingtonpost_com                  8148.00
independent_co_uk                   4942.00
indy100_com                          415.00
libdemvoice_org                      104.00
nbcnews_com                         2022.00
npr_org                             2733.00
nytimes_com                         9710.00
order-order_com                      226.00
propublica_org                        40.00
reuters_com                         7027.00
rt_com                              2200.00
skwawkbox_org                         99.00
telegraph_co_uk                     6292.00
thecanary_co                         314.00
theguardian_com                    13124.00
thetimes_co_uk                      8661.00
washingtonpost_com                  9822.00
westmonster_com                      137.00
yournewswire_com                     246.00
dtype: float64

That looks good to me.

In [41]:
data["front_score"] = np.ceil(data.mins_on_front.clip(upper=1440) / (data.alexa_rank * data.num_articles_on_front) * (2.45) * 15).fillna(0)
In [42]:
data.front_score 
Out[42]:
id
0d2c10289ac562afa42b458bf238defd5b7d86d8    0.00
0d35eb2f3c53ad8218866a0927082b0a2a53576d    1.00
20a659b9ee677b718b5834566795b6e231541a9b    1.00
7572bc3039f8f0ad83fea23d2692d5bd8a4cadd9    1.00
a8e2c63260f908b2f091862fff1b884da0eb5fe9    0.00
73a0112050c4cc9ed84283e554320c252503624c    1.00
9599a718d0004a66b97fbf3679e7fc3efc2172ef    0.00
1d3c90bfd083053501ee225c3a744d8b29cd51a8    1.00
a5235e1a73a409d927fd3d2ee1b82f9684b131c9    1.00
fc2a46c86e4f57b3e6f6ec9f3cb67a0f4bf1687f    3.00
71677691696f159fe2c72c8a518d786ac058c29c    1.00
165bc8d9292928cd1c95517f368e6e01912fa93c    8.00
c07415e598aa79d92d02cf453da42b04ed6346af    0.00
9ac7ebf2b4ea4db4b588dd917c36020ab66a5c94   12.00
aa3d9df1b0ed9dd9a6af0e7c320961b955694313    0.00
a6e840df7019548c0f027b3bb25b76d232fda5c0    0.00
49b056d35946eac75f3487f48bd9d98d2f01cf6d    8.00
92582d085e7a346254e10df2dc5e1a107fd41429    1.00
3ba688954a3a02b733b9008ea53ba53d09bac620    1.00
290c5016d53b641a16b46442acc17c82c096e2f7    1.00
77f30ff0be3bc69e52f34d33c4894e42f0dd79dc    1.00
163643dc4945102021dcae63b0ed3f3f2a1f5243    0.00
0515e198f8b1efcc948c2345268864ea6189c79a    3.00
07db392fcd8a65d386cede6b1327c818851092dc    3.00
5604f485243c99ff0c1229c80a1e2b5b6e38b8bd    1.00
af4a800d6274643d1fcf9aa235af296bb52e3799    3.00
52b2fe8f29ad72d3f91013ecfc6fc17e35f2191a    3.00
4394b6d6f7e859071bab0308ed4102bb278e34f9    1.00
5c297fb8a152b865dcc61e86b47fff1ac89ac288    3.00
f2a81eb2eb304cf22ff4e38da4b80d37766ddd75    0.00
                                            ... 
94f6090fd02f980f585c9f1ffa076dc2458b6b6b    1.00
4c9a933fbb59607f3fe5930308fb93a1d9ca096d    1.00
99b09fefa1efeae0f99ee25f5476aaf546bca2a0    0.00
ccaab4f30a464e7fa7bc7631cced4ebc4a8dc7aa    0.00
f18c588fcc1f36ed720a3b5d4ed326b507729970    0.00
c65d609f465035dffaa381716c39637d95c753de    2.00
612574bad40af6199ca7845191ccedbccb18ac27    0.00
52ef152f2e267c1afddda5703930d7ea80071236    7.00
bf43aa49a95f3e9fe6c3d0b8b724ae8922dd8fcd    1.00
8203c207beeb177e1fb29a69238f89e84dac95c0    0.00
0f30506e8edf349a8f991ce993f2317be6d0baf4    2.00
4a9c9cc092eaa49b9c303f49edadc534a6c3624b    1.00
038534d000e19644d30a296b9cc9439beb32ecb7    0.00
459dd178f9e79278695823f8b53018bf3aa69899    0.00
e3123b4c8a6167f72bc79b33471a67b9f470009a    1.00
7fd85e5f7dbb505c35a8500956e9eabced4aef2b    1.00
628c0c659b59f4de8b63b31f11030d7549fe42fa    0.00
155ceec1f40a2f27b47102667bd784f815e0168c    0.00
6731dd99a08ee5f15d8449e18ee8c5864743ef6d    1.00
50b596ee52f94ddcc2f40714bb2d414e121c4944    3.00
7303371ea0cc73e8412cd85f687def3d01746bd5    7.00
b5d2b89adef17fb4937fc7eda8c0140ab219d602    5.00
61bc48f75a81fffd62ea9fde47d69a9ecfb11f0a    0.00
e528905ca29f637e921e4a04fef409667b2a40af    2.00
37c0721db62a3b783627d8deb54f8143ca510990    1.00
ec8b14060bf1cffa51678dc6dd13321563c266c7    1.00
cc59fd3a31ae413de8a8eea951e6bda63cc36b66    0.00
34cb49caea1c0008f83f2f383466295c49cd1511    0.00
ab3787dcb90e62a8ac6a4fb6177e33cff5870251    1.00
accfcbdbbd052320c78ad65eafc403d6109657f3    1.00
Name: front_score, Length: 128531, dtype: float64

Facebook brand page promotion (15 points)

One way a publisher has of promoting content is to post to their brand page. The significance of doing so is stronger when the brand page has more followers (likes).

$$ facebookPromotionProposed1_a = 15 \left( \frac {brandPageLikes_a} {\max(brandPageLikes)} \right) $$

Now let's explore the data to see if that makes sense. tl;dr: the formula above is incorrect.

In [43]:
data.fb_brand_page_likes.max()
Out[43]:
41834404.0
In [44]:
facebook_promotion_proposed_1 = np.ceil((15 * (data.fb_brand_page_likes / data.fb_brand_page_likes.max())).fillna(0))
In [45]:
facebook_promotion_proposed_1.value_counts().sort_index().plot.bar()
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x1103069b0>
In [46]:
facebook_promotion_proposed_1.groupby(data.publisher_id).describe()
Out[46]:
count mean std min 25% 50% 75% max
publisher_id
anotherangryvoice_blogspot_co_uk 60.00 0.45 0.50 0.00 0.00 0.00 1.00 1.00
bbc_co_uk 3682.00 2.16 5.27 0.00 0.00 0.00 0.00 15.00
breitbart_com 2722.00 0.62 0.93 0.00 0.00 0.00 2.00 2.00
brexitcentral_com 47.00 0.60 0.50 0.00 0.00 1.00 1.00 1.00
buzzfeed_com 1962.00 0.28 0.45 0.00 0.00 0.00 1.00 1.00
cnn_com 3923.00 2.45 4.30 0.00 0.00 0.00 0.00 10.00
dailymail_co_uk 25993.00 0.46 1.33 0.00 0.00 0.00 0.00 5.00
economist_com 576.00 2.02 1.41 0.00 0.00 3.00 3.00 3.00
evolvepolitics_com 54.00 0.72 0.45 0.00 0.00 1.00 1.00 1.00
foxnews_com 6776.00 0.58 1.77 0.00 0.00 0.00 0.00 6.00
ft_com 3526.00 0.59 0.91 0.00 0.00 0.00 2.00 2.00
huffingtonpost_com 4758.00 1.10 1.78 0.00 0.00 0.00 4.00 4.00
independent_co_uk 6672.00 0.59 1.19 0.00 0.00 0.00 0.00 3.00
indy100_com 424.00 0.14 0.35 0.00 0.00 0.00 0.00 1.00
libdemvoice_org 181.00 0.56 0.50 0.00 0.00 1.00 1.00 1.00
nbcnews_com 2253.00 2.01 2.00 0.00 0.00 4.00 4.00 4.00
npr_org 2086.00 1.42 1.50 0.00 0.00 0.00 3.00 3.00
nytimes_com 5434.00 1.41 2.39 0.00 0.00 0.00 5.00 6.00
order-order_com 339.00 0.47 0.50 0.00 0.00 0.00 1.00 1.00
propublica_org 42.00 0.90 0.30 0.00 1.00 1.00 1.00 1.00
reuters_com 5760.00 0.34 0.75 0.00 0.00 0.00 0.00 2.00
rt_com 2407.00 0.51 0.87 0.00 0.00 0.00 2.00 2.00
skwawkbox_org 158.00 0.70 0.46 0.00 0.00 1.00 1.00 1.00
telegraph_co_uk 7886.00 0.50 0.86 0.00 0.00 0.00 0.00 2.00
thecanary_co 328.00 0.87 0.34 0.00 1.00 1.00 1.00 1.00
theguardian_com 9008.00 0.52 1.14 0.00 0.00 0.00 0.00 3.00
thetimes_co_uk 8681.00 0.06 0.24 0.00 0.00 0.00 0.00 1.00
washingtonpost_com 22174.00 0.19 0.73 0.00 0.00 0.00 0.00 3.00
westmonster_com 249.00 0.24 0.43 0.00 0.00 0.00 0.00 1.00
yournewswire_com 370.00 0.04 0.20 0.00 0.00 0.00 0.00 1.00

That's too much variation: sites like the Guardian, which has a respectable 7.5m likes, should not be scoring a 3. Let's try applying a log to it, and then standard feature scaling again.
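The arithmetic behind that observation can be checked by hand. Linearly, the Guardian's page has under 18% of the likes of the largest page (the BBC's), which rounds up to 3 of the 15 points. On a log scale the two are much closer, at about 90%. The log ratio here is only an illustration of the effect, not the final formula, and the variable names are not part of the notebook.

```python
import math

max_likes = 41834404       # bbc_co_uk, the largest brand page in the data
guardian_likes = 7466172   # theguardian_com

# Linear proposal: 15 * likes / max(likes), rounded up as in the notebook.
linear_score = 15 * guardian_likes / max_likes

# How close the two pages are once likes are put on a log scale.
log_ratio = math.log(guardian_likes) / math.log(max_likes)

print(f"linear: {linear_score:.2f} -> {math.ceil(linear_score)}")
print(f"log ratio: {log_ratio:.2f}")
```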

In [47]:
data.fb_brand_page_likes.groupby(data.publisher_id).max()
Out[47]:
publisher_id
anotherangryvoice_blogspot_co_uk     300227.00
bbc_co_uk                          41834404.00
breitbart_com                       3422611.00
brexitcentral_com                      6484.00
buzzfeed_com                        2526061.00
cnn_com                            27506850.00
dailymail_co_uk                    11298065.00
economist_com                       8088734.00
evolvepolitics_com                    79072.00
foxnews_com                        15383159.00
ft_com                              3547359.00
huffingtonpost_com                  9494261.00
independent_co_uk                   6908633.00
indy100_com                          183154.00
libdemvoice_org                        8169.00
nbcnews_com                         8979704.00
npr_org                             6008762.00
nytimes_com                        14025241.00
order-order_com                       41270.00
propublica_org                       331091.00
reuters_com                         3733774.00
rt_com                              4455067.00
skwawkbox_org                          2149.00
telegraph_co_uk                     4161563.00
thecanary_co                         134798.00
theguardian_com                     7466172.00
thetimes_co_uk                       645611.00
washingtonpost_com                  5847964.00
westmonster_com                       10794.00
yournewswire_com                      24040.00
Name: fb_brand_page_likes, dtype: float64
In [48]:
np.log(2149)
Out[48]:
7.6727578966425103
In [49]:
np.log(data.fb_brand_page_likes.groupby(data.publisher_id).max())
Out[49]:
publisher_id
anotherangryvoice_blogspot_co_uk   12.61
bbc_co_uk                          17.55
breitbart_com                      15.05
brexitcentral_com                   8.78
buzzfeed_com                       14.74
cnn_com                            17.13
dailymail_co_uk                    16.24
economist_com                      15.91
evolvepolitics_com                 11.28
foxnews_com                        16.55
ft_com                             15.08
huffingtonpost_com                 16.07
independent_co_uk                  15.75
indy100_com                        12.12
libdemvoice_org                     9.01
nbcnews_com                        16.01
npr_org                            15.61
nytimes_com                        16.46
order-order_com                    10.63
propublica_org                     12.71
reuters_com                        15.13
rt_com                             15.31
skwawkbox_org                       7.67
telegraph_co_uk                    15.24
thecanary_co                       11.81
theguardian_com                    15.83
thetimes_co_uk                     13.38
washingtonpost_com                 15.58
westmonster_com                     9.29
yournewswire_com                   10.09
Name: fb_brand_page_likes, dtype: float64

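That's more like it, but the lower numbers should be smaller: a page with 2,149 likes still gets 7.67 against 17.55 for the largest. Dividing the likes by 1,000 before taking the log subtracts the same constant (about 6.9) from every value, which pulls the small pages towards zero.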

In [50]:
np.log(data.fb_brand_page_likes.groupby(data.publisher_id).max() / 1000)
Out[50]:
publisher_id
anotherangryvoice_blogspot_co_uk    5.70
bbc_co_uk                          10.64
breitbart_com                       8.14
brexitcentral_com                   1.87
buzzfeed_com                        7.83
cnn_com                            10.22
dailymail_co_uk                     9.33
economist_com                       9.00
evolvepolitics_com                  4.37
foxnews_com                         9.64
ft_com                              8.17
huffingtonpost_com                  9.16
independent_co_uk                   8.84
indy100_com                         5.21
libdemvoice_org                     2.10
nbcnews_com                         9.10
npr_org                             8.70
nytimes_com                         9.55
order-order_com                     3.72
propublica_org                      5.80
reuters_com                         8.23
rt_com                              8.40
skwawkbox_org                       0.77
telegraph_co_uk                     8.33
thecanary_co                        4.90
theguardian_com                     8.92
thetimes_co_uk                      6.47
washingtonpost_com                  8.67
westmonster_com                     2.38
yournewswire_com                    3.18
Name: fb_brand_page_likes, dtype: float64
In [51]:
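# Express likes in thousands so that small pages land near zero after the log
scaled_fb_brand_page_likes = data.fb_brand_page_likes / 1000
# Scale the log-likes against the largest page, round up, and score missing likes as 0
facebook_promotion_proposed_2 = np.ceil(
    15 * (np.log(scaled_fb_brand_page_likes) / np.log(scaled_fb_brand_page_likes.max()))
).fillna(0)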
In [52]:
facebook_promotion_proposed_2.groupby(data.publisher_id).max()
Out[52]:
publisher_id
anotherangryvoice_blogspot_co_uk    9.00
bbc_co_uk                          15.00
breitbart_com                      12.00
brexitcentral_com                   3.00
buzzfeed_com                       12.00
cnn_com                            15.00
dailymail_co_uk                    14.00
economist_com                      13.00
evolvepolitics_com                  7.00
foxnews_com                        14.00
ft_com                             12.00
huffingtonpost_com                 13.00
independent_co_uk                  13.00
indy100_com                         8.00
libdemvoice_org                     3.00
nbcnews_com                        13.00
npr_org                            13.00
nytimes_com                        14.00
order-order_com                     6.00
propublica_org                      9.00
reuters_com                        12.00
rt_com                             12.00
skwawkbox_org                       2.00
telegraph_co_uk                    12.00
thecanary_co                        7.00
theguardian_com                    13.00
thetimes_co_uk                     10.00
washingtonpost_com                 13.00
westmonster_com                     4.00
yournewswire_com                    5.00
Name: fb_brand_page_likes, dtype: float64

LGTM. So the equation, before rounding up to a whole number, is

$$ facebookPromotion_a = 15 \left( \frac {\log\left(\frac {brandPageLikes_a}{1000}\right)} {\log\left(\frac {\max(brandPageLikes)}{1000}\right)} \right) $$
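
As a quick check of the formula, the sketch below plugs in three of the per-publisher maxima listed in Out[47]. The helper name `facebook_promotion` is made up for this illustration and is not part of the notebook; like the cell above, it rounds up with `np.ceil`.

```python
import numpy as np

def facebook_promotion(likes, max_likes, weight=15):
    """Hypothetical helper: log-scaled brand page score, rounded up."""
    # Likes are expressed in thousands before taking the log, as in cell 51
    return np.ceil(weight * (np.log(likes / 1000) / np.log(max_likes / 1000)))

max_likes = 41834404  # bbc_co_uk, the largest page in Out[47]

print(facebook_promotion(41834404, max_likes))  # BBC       -> 15.0
print(facebook_promotion(7466172, max_likes))   # Guardian  -> 13.0
print(facebook_promotion(2149, max_likes))      # skwawkbox -> 2.0
```

These agree with the corresponding rows of Out[52].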

Now, let's try applying the standard feature scaling approach to this, rather than using a magic number of 1,000. That equation would be:

\begin{align}
unscaledFacebookPromotion_a &= \log(brandPageLikes_a) \\
facebookPromotion_a &= 15 \cdot \frac{unscaledFacebookPromotion_a - \min(unscaledFacebookPromotion)}{\max(unscaledFacebookPromotion) - \min(unscaledFacebookPromotion)} \\ \\
\text{The scaling can be simplified to:} \\
facebookPromotion_a &= 15 \cdot \frac{unscaledFacebookPromotion_a - \log(\min(brandPageLikes))}{\log(\max(brandPageLikes)) - \log(\min(brandPageLikes))} \\ \\
\text{Meaning the overall equation becomes:} \\
facebookPromotion_a &= 15 \cdot \frac{\log(brandPageLikes_a) - \log(\min(brandPageLikes))}{\log(\max(brandPageLikes)) - \log(\min(brandPageLikes))}
\end{align}

Note that the implementation in cell 53 multiplies by 14 and adds 1 before rounding up, so scores run from 1 for the smallest brand page to 15 for the largest instead of starting at 0.
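
The simplification works because the log is monotonic, so the minimum of the logs is the log of the minimum (and likewise for the maximum). A minimal sketch of the unrounded 0 to 15 version of the equation, using three of the maxima from Out[47]: the helper name `log_min_max_scale` is an assumption for illustration, and the minimum and maximum here come from this three-value sample rather than from the full dataset.

```python
import numpy as np
import pandas as pd

def log_min_max_scale(values, weight=15):
    """Hypothetical helper: min-max scale log(values) into the range 0..weight."""
    logged = np.log(values)
    return weight * ((logged - logged.min()) / (logged.max() - logged.min()))

# Three per-publisher maxima taken from Out[47]
likes = pd.Series(
    [2149.0, 7466172.0, 41834404.0],
    index=["skwawkbox_org", "theguardian_com", "bbc_co_uk"],
)

# The smallest page maps to 0, the largest to 15, everything else in between
print(log_min_max_scale(likes).round(2))
```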
In [53]:
# Min-max scale the log of the brand page likes into the range 1..15, rounding up
facebook_promotion_proposed_3 = np.ceil(
    14 * (
        (np.log(data.fb_brand_page_likes) - np.log(data.fb_brand_page_likes.min())) /
        (np.log(data.fb_brand_page_likes.max()) - np.log(data.fb_brand_page_likes.min()))
    ) + 1
)
In [54]:
facebook_promotion_proposed_3.groupby(data.publisher_id).max()
Out[54]:
publisher_id
anotherangryvoice_blogspot_co_uk    9.00
bbc_co_uk                          15.00
breitbart_com                      12.00
brexitcentral_com                   3.00
buzzfeed_com                       12.00
cnn_com                            15.00
dailymail_co_uk                    14.00
economist_com                      13.00
evolvepolitics_com                  7.00
foxnews_com                        14.00
ft_com                             12.00
huffingtonpost_com                 13.00
independent_co_uk                  13.00
indy100_com                         8.00
libdemvoice_org                     4.00
nbcnews_com                        13.00
npr_org                            13.00
nytimes_com                        14.00
order-order_com                     6.00
propublica_org                      9.00
reuters_com                        12.00
rt_com                             12.00
skwawkbox_org                       2.00
telegraph_co_uk                    12.00
thecanary_co                        8.00
theguardian_com                    13.00
thetimes_co_uk                     10.00
washingtonpost_com                 13.00
westmonster_com                     4.00
yournewswire_com                    5.00
Name: fb_brand_page_likes, dtype: float64
In [55]:
data["facebook_promotion_score"] = facebook_promotion_proposed_3.fillna(0.0)

Review

In [56]:
data["promotion_score"] = (data.lead_score + data.front_score + data.facebook_promotion_score)
data["attention_index"] = (data.promotion_score + data.response_score)
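
To make the composition concrete, here is a small sketch that rebuilds the two sums on a toy frame. The component values are copied from the two top-ranked rows (BuzzFeed and CNN) of the ranking in Out[60]; the frame `toy` and its index labels are made up for this illustration.

```python
import pandas as pd

# Component scores copied from the two top-ranked rows of Out[60]
toy = pd.DataFrame(
    {
        "lead_score": [19.0, 20.0],
        "front_score": [15.0, 9.0],
        "facebook_promotion_score": [12.0, 15.0],
        "response_score": [50.0, 50.0],
    },
    index=["buzzfeed_example", "cnn_example"],
)

# Promotion is the sum of its three components; the index adds the response score
toy["promotion_score"] = toy.lead_score + toy.front_score + toy.facebook_promotion_score
toy["attention_index"] = toy.promotion_score + toy.response_score

print(toy[["promotion_score", "attention_index"]])
```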
In [57]:
data.promotion_score.plot.hist(bins=np.arange(50), figsize=(15,6))
Out[57]:
[Figure: histogram of promotion_score, integer bins from 0 to 49]
In [58]:
data.attention_index.plot.hist(bins=np.arange(100), figsize=(15,6))
Out[58]:
[Figure: histogram of attention_index, integer bins from 0 to 99]
In [59]:
data.attention_index.value_counts().sort_index()
Out[59]:
0.00     16623
1.00     16081
2.00     11230
3.00      7516
4.00      5775
5.00      4667
6.00      3814
7.00      3321
8.00      3030
9.00      2741
10.00     2492
11.00     2345
12.00     2071
13.00     2061
14.00     2038
15.00     1861
16.00     1869
17.00     1747
18.00     1626
19.00     1562
20.00     1516
21.00     1406
22.00     1375
23.00     1229
24.00     1228
25.00     1230
26.00     1152
27.00     1051
28.00     1053
29.00     1017
         ...  
65.00      142
66.00      138
67.00       99
68.00       95
69.00       87
70.00       67
71.00       64
72.00       53
73.00       70
74.00       48
75.00       35
76.00       41
77.00       40
78.00       24
79.00       29
80.00       22
81.00       25
82.00       18
83.00       19
84.00       21
85.00       20
86.00       10
87.00       14
88.00       13
89.00        8
90.00        5
91.00        2
92.00        4
94.00        1
96.00        1
Name: attention_index, Length: 95, dtype: int64
In [60]:
# and let's see the articles with the biggest attention index
data.sort_values("attention_index", ascending=False)
Out[60]:
url headline discovered published fb_engagements fb_max_engagements_per_min fb_comments fb_reactions fb_shares publisher_name ... num_articles_on_front fb_brand_page fb_brand_page_likes alexa_rank response_score lead_score front_score facebook_promotion_score promotion_score attention_index
id
2b068c2fa3bdb2732465fe524b599b97ba981fa2 https://www.buzzfeed.com/juliareinstein/these-... These Are The Victims Of The Portland Train St... 2017-05-27 21:48:12.860 2017-05-27 21:31:28.000 67262 47.40 7535.00 52484.00 7243.00 Buzzfeed News ... 25.00 True 2522745.00 147 50.00 19.00 15.00 12.00 46.00 96.00
14b0e444e8db057a9aa6df60b8172cf700f280c5 http://money.cnn.com/2017/05/02/news/economy/o... The House just passed a bill that affects over... 2017-05-03 01:28:04.624 2017-05-03 00:11:11.000 85276 258.69 nan nan nan CNN ... 62.00 True 27386540.00 105 50.00 20.00 9.00 15.00 44.00 94.00
fab8d9250617795d93b7383e0e1cf86352338181 http://www.bbc.co.uk/news/live/world-europe-39... French election: Macron and Le Pen in crucial ... 2017-05-07 14:24:01.542 2017-05-07 14:17:24.671 28022 43.08 nan nan nan BBC ... 47.00 True 41287024.00 96 45.00 20.00 12.00 15.00 47.00 92.00
42294296116b94aff50a439a4fca04dd396aac0c https://www.buzzfeed.com/salvadorhernandez/hey... Hey Guys, What Do You Think About A Dwayne Joh... 2017-05-21 04:48:18.390 2017-05-21 04:03:49.000 32996 77.86 2486.00 28269.00 2241.00 Buzzfeed News ... 25.00 True 2505771.00 147 46.00 19.00 15.00 12.00 46.00 92.00
f1ba80ccf9b7979418c5326503bc897b98eafd92 http://www.bbc.co.uk/news/world-europe-39839349 French election: Macron 'defeats Le Pen to bec... 2017-05-07 18:12:02.316 2017-05-07 18:07:12.000 64768 68.76 nan nan nan BBC ... 47.00 True 41277414.00 96 50.00 20.00 7.00 15.00 42.00 92.00
8cb28169a5402b65b3cd89f1a0b0ce3f4ee70156 https://www.buzzfeed.com/emaoconnor/students-w... Students Walked Out Of Notre Dame's Graduation... 2017-05-21 18:27:14.894 2017-05-21 18:26:18.000 31643 61.50 2826.00 27299.00 1518.00 Buzzfeed News ... 25.00 True 2506360.00 147 46.00 19.00 15.00 12.00 46.00 92.00
7f10cd0e6c28956ab1fdc0ca572b45fcfeaa8788 https://www.buzzfeed.com/laurasilver/these-are... These Are The Victims Of The Manchester Terror... 2017-05-23 13:00:16.480 2017-05-23 12:57:53.000 27291 23.27 1368.00 21764.00 4159.00 Buzzfeed News ... 25.00 True 2511571.00 147 45.00 19.00 15.00 12.00 46.00 91.00
5d0e52fea29802318a75911f3371624df4595ae4 http://www.cnn.com/2017/05/24/politics/jeff-se... First on CNN: AG Sessions did not disclose mee... 2017-05-24 22:05:16.913 2017-05-24 22:00:15.000 49692 51.92 9803.00 34058.00 5831.00 CNN ... 62.00 True 27444874.00 105 49.00 20.00 7.00 15.00 42.00 91.00
b09f91cf0b955b39e0e58b4ce530cd0942e4a92f http://www.bbc.co.uk/news/entertainment-arts-4... Blue Peter presenter John Noakes dies 2017-05-29 09:55:02.826 2017-05-29 09:49:34.000 45987 49.74 6860.00 35256.00 3871.00 BBC ... 47.00 True 41787920.00 96 48.00 20.00 7.00 15.00 42.00 90.00
2f34e56453c871882bb30b655e234d8aba32dfdc http://www.huffingtonpost.com/entry/portland-s... The Portland Heroes Who Stood Up To Hate 2017-05-28 22:27:10.745 2017-05-28 08:11:53.000 57925 43.39 2182.00 50049.00 5694.00 HuffPost ... 26.00 True 9485928.00 215 50.00 17.00 10.00 13.00 40.00 90.00
4accbfb7fefcf1eb39df7fac49bd18bc4527f539 http://www.cnn.com/2017/05/01/politics/investi... The FBI translator who went rogue and married ... 2017-05-01 19:47:57.988 2017-05-01 19:39:14.000 32070 36.45 nan nan nan CNN ... 62.00 True 27377584.00 105 46.00 20.00 9.00 15.00 44.00 90.00
afeaaba2df87ace5da55c7e5aad38d7aca9546f8 http://www.cnn.com/interactive/2017/05/politic... Sheriff David Clarke plagiarized portions of h... 2017-05-20 23:45:04.481 2017-05-20 23:36:34.000 36361 56.08 13115.00 18114.00 5132.00 CNN ... 62.00 True 27393308.00 105 47.00 20.00 8.00 15.00 43.00 90.00
09a3a8400c4395f2e18ba818b42912357e8e58ae http://www.bbc.co.uk/news/election-2017-40053427 General election 2017: Corbyn links terror thr... 2017-05-25 21:10:03.897 2017-05-25 21:03:46.000 28189 34.00 11358.00 14371.00 2460.00 BBC ... 47.00 True 41702575.00 96 45.00 20.00 10.00 15.00 45.00 90.00
7638bc116fc4499d08b4070e417ad266fb3e23e9 http://www.cnn.com/2017/05/09/politics/james-c... FBI director James Comey fired 2017-05-09 21:55:19.328 2017-05-09 21:53:47.000 39822 48.57 nan nan nan CNN ... 62.00 True 27317096.00 105 47.00 20.00 7.00 15.00 42.00 89.00
c8c4364097e3139baf5b82eda04df00d350e8343 http://www.cnn.com/2017/05/26/us/portland-trai... Man shouting anti-Muslim slurs kills 2 on Port... 2017-05-27 03:27:15.372 2017-05-27 03:23:54.000 27952 24.50 7626.00 16105.00 4221.00 CNN ... 62.00 True 27475760.00 105 45.00 20.00 9.00 15.00 44.00 89.00
ddcf39b5f71753ab726057de7647b1922cb10d32 http://www.bbc.co.uk/news/entertainment-arts-4... Manchester attacks: Stars to join Ariana Grand... 2017-05-30 15:20:02.705 2017-05-30 15:16:43.000 41965 37.72 3029.00 35012.00 3924.00 BBC ... 47.00 True 41807079.00 96 48.00 20.00 6.00 15.00 41.00 89.00
8d86ba1ca7e4dda9f0c9f4eafbf7108c6a332c81 http://www.bbc.co.uk/news/uk-39802636 Duke of Edinburgh to retire from royal duties 2017-05-04 09:08:01.799 2017-05-04 09:06:23.000 23879 31.17 nan nan nan BBC ... 47.00 True 41300131.00 96 44.00 20.00 10.00 15.00 45.00 89.00
f52875ade537d88acc4d0070faa6f66107b28f6f http://www.bbc.co.uk/news/election-2017-40105324 Jeremy Corbyn to take part in seven-way TV debate 2017-05-31 11:30:11.567 2017-05-31 11:24:24.000 16409 34.05 4954.00 9252.00 2203.00 BBC ... 45.00 True 41822932.00 96 41.00 20.00 13.00 15.00 48.00 89.00
2b382725ca5dc2e921e92f060dbb8920b9dc1fb1 https://www.buzzfeed.com/claudiakoerner/jeremy... The White Supremacist Accused Of The Portland ... 2017-05-30 22:30:14.879 2017-05-30 22:29:38.000 19853 59.50 4263.00 13851.00 1739.00 Buzzfeed News ... 25.00 True 2525437.00 147 43.00 19.00 15.00 12.00 46.00 89.00
2c19bf826aa073d0675b97f04b199cdf8de521f2 http://www.cnn.com/2017/05/15/politics/trump-r... Washington Post: Trump leaked classified info ... 2017-05-15 22:15:13.349 2017-05-15 22:10:37.000 39566 47.98 19346.00 15974.00 4246.00 CNN ... 62.00 True 27352796.00 105 47.00 20.00 7.00 15.00 42.00 89.00
332d434b07b350510549833225fe392117243c0c http://www.cnn.com/2017/05/04/politics/health-... Countdown is on to nail-biter Obamacare repeal... 2017-05-04 10:05:24.746 2017-05-04 10:00:18.000 27425 39.17 nan nan nan CNN ... 62.00 True 27396532.00 105 45.00 20.00 9.00 15.00 44.00 89.00
15ca9ed39bb09eba36bf9b04989d120fdf2bf5c1 https://www.nytimes.com/2017/05/12/us/politics... Trump Threatens Retaliation Against Comey, War... 2017-05-12 13:05:04.079 2017-05-12 13:03:08.000 63535 62.00 16657.00 39053.00 7825.00 New York Times ... 121.00 True 13902139.00 120 50.00 20.00 4.00 14.00 38.00 88.00
a6512f6b09240bd0ad24afe567bc301979abceca http://www.bbc.co.uk/news/election-2017-39956541 Conservative manifesto: Firms to pay more to h... 2017-05-17 21:15:08.119 2017-05-17 21:14:28.000 15755 14.98 6394.00 7365.00 1996.00 BBC ... 47.00 True 41465378.00 96 41.00 20.00 12.00 15.00 47.00 88.00
e96d445a030fc2b1cbb0b318c131d96430a1c8e9 https://www.nytimes.com/2017/05/17/us/politics... Robert Mueller, Former F.B.I. Director, Named ... 2017-05-17 22:03:00.403 2017-05-17 22:01:08.000 71918 76.27 13759.00 54141.00 4018.00 New York Times ... 121.00 True 13931550.00 120 50.00 20.00 4.00 14.00 38.00 88.00
ee25e1faebe21977e7e24f0ac56354e41e9f1246 http://www.huffingtonpost.com/entry/portland-a... 2 Men Killed On Portland Train After Trying To... 2017-05-27 08:35:05.764 2017-05-27 08:27:31.000 42112 54.08 9485.00 28201.00 4426.00 HuffPost ... 26.00 True 9478319.00 215 48.00 17.00 10.00 13.00 40.00 88.00
f8830bbc4e15dab7593e0751b9851c048c575257 https://www.nytimes.com/2017/05/16/us/politics... Comey Memo Says Trump Asked Him to End Flynn I... 2017-05-16 21:24:00.845 2017-05-16 21:22:12.000 83951 49.67 30914.00 46208.00 6829.00 New York Times ... 121.00 True 13926019.00 120 50.00 20.00 4.00 14.00 38.00 88.00
83fe5a6430080e734127806ddea0514ed9975292 http://www.cnn.com/2017/05/01/politics/trump-m... Trump ending Michelle Obama's girls education ... 2017-05-01 17:15:20.675 2017-05-01 17:11:32.000 73117 76.36 nan nan nan CNN ... 62.00 True 27376924.00 105 50.00 20.00 3.00 15.00 38.00 88.00
deb51f871b757906c47f036142fb61cf6fe10a58 http://www.cnn.com/2017/05/12/politics/michell... Michelle Obama slams Trump on school meals 2017-05-12 22:25:18.698 2017-05-12 22:19:54.000 26300 46.17 4436.00 18902.00 2962.00 CNN ... 62.00 True 27341319.00 105 44.00 20.00 9.00 15.00 44.00 88.00
7cb9f49a9bd8be5a0c54a8fac7c5c829eaaa3255 https://www.nytimes.com/2017/05/19/us/politics... Trump Told Russians That Firing ‘Nut Job’ Come... 2017-05-19 18:57:03.616 2017-05-19 18:54:34.000 69396 54.49 18795.00 43237.00 7364.00 New York Times ... 121.00 True 13944205.00 120 50.00 20.00 4.00 14.00 38.00 88.00
102ab96438ef79f8b34f6b10e2aa09098af012e4 http://www.cnn.com/2017/05/12/politics/donald-... Trump threatens Comey in new tweet 2017-05-12 12:55:16.670 2017-05-12 12:47:08.000 23738 48.40 8114.00 12745.00 2879.00 CNN ... 62.00 True 27338171.00 105 44.00 20.00 9.00 15.00 44.00 88.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
80449f003315234c3f6f30cdec4278321b4c1fc1 http://www.dailymail.co.uk/sport/cricket/artic... Bairstow hasn't played into Champions Trophy t... 2017-05-08 17:05:26.615 2017-05-08 16:59:03.000 1 0.08 nan nan nan Daily Mail ... nan False nan 158 0.00 0.00 0.00 0.00 0.00 0.00
98c815a3be205bf97a20740d6a1435f8974b5650 https://www.washingtonpost.com/sports/colleges... Iowa State losing front court player Ray Kason... 2017-05-16 17:33:12.965 2017-05-16 17:30:31.000 0 0.00 0.00 0.00 0.00 The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
7c439ca28aa3385604265f55cb44ddcce0f8b814 http://www.dailymail.co.uk/sport/rugbyunion/ar... Joe Launchbury and Jimmy Gopperth short-listed... 2017-05-16 17:33:25.592 2017-05-16 17:30:28.000 1 0.08 0.00 0.00 1.00 Daily Mail ... nan False nan 158 0.00 0.00 0.00 0.00 0.00 0.00
acf1e0cf4830f40fe6ec36b8545c667d012a65d0 https://www.washingtonpost.com/entertainment/b... Tait tackles millennial friendship in ‘Fake Pl... 2017-05-08 17:10:14.074 2017-05-08 17:02:15.000 0 0.00 nan nan nan The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
95703cf06c6e7297ff5b5b02467066c2cc57a53b http://www.telegraph.co.uk/racing/2017/05/16/m... Marlborough racing tips for Wednesday, May 17 2017-05-16 17:33:01.179 2017-05-16 17:30:00.000 1 0.00 0.00 0.00 1.00 The Telegraph ... nan False nan 370 0.00 0.00 0.00 0.00 0.00 0.00
f98961b8ad120aba0f960d9a5b312f716a941c95 https://www.washingtonpost.com/national/christ... Christie signs bill inspired by ‘Snooki’ to ca... 2017-05-08 17:10:09.918 2017-05-08 17:06:13.000 0 0.00 nan nan nan The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
d0c0564a2aee55110d0a2ffcff033fd07801b8ba https://www.washingtonpost.com/sports/wizards/... Raptors PG Lowry will opt out of final year of... 2017-05-08 17:10:09.224 2017-05-08 17:06:18.000 0 0.00 nan nan nan The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
1433808a624391c2e49054b14162b0de6b1436d4 http://www.dailymail.co.uk/sport/othersports/a... Tom Dumoulin takes Giro d'Italia lead after ti... 2017-05-16 17:30:31.801 2017-05-16 17:28:29.000 1 0.08 0.00 0.00 1.00 Daily Mail ... nan False nan 158 0.00 0.00 0.00 0.00 0.00 0.00
b112785a000a557abf5fdaa99410d03753e0879c https://www.washingtonpost.com/sports/saratoga... Saratoga to rename major race for late trainer... 2017-05-16 17:33:13.191 2017-05-16 17:27:15.000 0 0.00 0.00 0.00 0.00 The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
61ff94a3f768868adc7aa624bb04bbe15a540981 https://www.washingtonpost.com/national/south-... South Dakota jury to get case of man charged i... 2017-05-24 14:00:31.445 2017-05-24 13:58:13.000 0 0.00 0.00 0.00 0.00 The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
09d731b52e6fb05732e8c519eef830a01363a29c http://www.washingtonpost.com/video/entertainm... A who's-who guide to famous Chrises 2017-05-08 17:25:13.871 2017-05-08 16:53:46.000 0 0.00 nan nan nan The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
112e8c790964fdb1bc39f4f2b92b203c1e67b2a0 https://www.washingtonpost.com/sports/ferrari-... Ferrari driver Sebastian Vettel wins Formula O... 2017-05-28 14:00:08.380 2017-05-28 13:50:44.000 0 0.00 0.00 0.00 0.00 The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
7a943c1d63edd5013f4582981831527ba812d67d https://www.washingtonpost.com/national/texas-... Texas woman dies while snorkeling in the Flori... 2017-05-24 14:00:14.929 2017-05-24 13:50:13.000 0 0.00 0.00 0.00 0.00 The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
5ced929e6f8432fca0c19b484cde443533a37589 https://www.washingtonpost.com/national/nc-gov... NC governor vows executive order to expand LGB... 2017-05-16 17:45:32.854 2017-05-16 17:34:17.000 0 0.00 0.00 0.00 0.00 The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
d6f997fab8de17bf6875e53d2151da58aaa8a833 http://www.dailymail.co.uk/news/article-453741... Lisa Wilkinson denies co-star is paid more tha... 2017-05-24 13:54:20.898 2017-05-24 13:50:26.000 0 0.00 0.00 0.00 0.00 Daily Mail ... nan False nan 158 0.00 0.00 0.00 0.00 0.00 0.00
cb74124b7a7bb495948313cf30cd25666cec8827 https://www.washingtonpost.com/national/new-je... New Jersey nightclub shooting kills 1, injures 5 2017-05-28 14:12:08.244 2017-05-28 14:02:16.000 0 0.00 0.00 0.00 0.00 The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
92641649fb260f8c72db0078f7d284f9a93110bd https://www.washingtonpost.com/sports/redskins... Upstaged: Predators anthem singer replaced by ... 2017-05-16 17:45:16.358 2017-05-16 17:39:27.000 0 0.00 0.00 0.00 0.00 The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
01270e09e56dae2f9d089f2d231f1672907d2403 https://www.washingtonpost.com/local/winning-n... Winning numbers drawn in ‘Pick 4 Midday’ game 2017-05-08 16:55:14.913 2017-05-08 16:43:11.000 0 0.00 nan nan nan The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
112d5df1c5e46348b7ae146dabdb237270a7135d https://www.washingtonpost.com/local/winning-n... Winning numbers drawn in ‘Pick 3 Midday’ game 2017-05-08 16:55:11.897 2017-05-08 16:43:18.000 0 0.00 nan nan nan The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
9ab54f9d81d8cfe6c9df4880b95ff895a500f56b https://www.washingtonpost.com/world/middle_ea... Father of alleged Manchester bomber says son i... 2017-05-24 14:03:08.383 2017-05-24 13:51:16.000 0 0.00 0.00 0.00 0.00 The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
ca8dd321adbaa5a8348cde264ef99ba432af3f49 https://www.washingtonpost.com/local/md-lotter... MD Lottery 2017-05-08 16:55:08.122 2017-05-08 16:43:25.000 0 0.00 nan nan nan The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
be716a25981bc5b2cb462aadd7221f9039670851 https://www.washingtonpost.com/national/slain-... Slain Arkansas sheriff’s lieutenant is laid to... 2017-05-16 17:45:31.576 2017-05-16 17:39:20.000 0 0.00 0.00 0.00 0.00 The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
2cccfa0a1fd9def6c2a3131aae3a395b40a0ed62 https://www.washingtonpost.com/national/higher... Patients getting checks soon in gynecologist r... 2017-05-24 14:00:08.192 2017-05-24 13:52:13.000 0 0.00 0.00 0.00 0.00 The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
eaad80cc454df50c3226604e74cbd510d796a8f6 http://www.nbcnews.com/card/forget-appropriate... nbcnews:card_text 2017-05-16 17:39:06.609 2017-05-16 17:36:51.000 0 0.00 0.00 0.00 0.00 NBC News ... nan False nan 826 0.00 0.00 0.00 0.00 0.00 0.00
8b42a5d98a340d82b0893fa0213e211bcb0bb63d http://www.telegraph.co.uk/travel/destinations... The Wordsworth Hotel & Spa 2017-05-08 16:55:03.705 2017-05-08 16:45:13.000 0 0.00 nan nan nan The Telegraph ... nan False nan 370 0.00 0.00 0.00 0.00 0.00 0.00
d170b61ae117b326f993447d6a2278d84bf895c7 https://www.washingtonpost.com/national/man-ar... Man arrested in alleged racing crash that kill... 2017-05-08 16:50:10.899 2017-05-08 16:46:18.000 0 0.00 nan nan nan The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
fb86e77e8927140141f607b1f2eaa8e35e1f4091 https://www.washingtonpost.com/national/us-cal... US calls on Kosovo to ratify border deal with ... 2017-05-08 16:50:10.031 2017-05-08 16:46:25.000 0 0.00 nan nan nan The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
d976785cad88db062e345cb7a46b1602295b383b https://www.washingtonpost.com/world/middle_ea... Egypt says kills 8 Brotherhood members before ... 2017-05-08 16:50:11.888 2017-05-08 16:46:32.000 1 0.08 nan nan nan The Washington Post ... nan False nan 191 0.00 0.00 0.00 0.00 0.00 0.00
f3e21b3975b23fced3a0044a153dc2e01a5dbd74 http://www.economist.com/news/21722600-wake-ma... Babbage 2017-05-24 20:21:24.511 2017-05-24 13:52:59.000 0 0.00 0.00 0.00 0.00 The Economist ... nan False nan 1825 0.00 0.00 0.00 0.00 0.00 0.00
417329307c5f0c208d027e026b8c0fdcea64c446 http://www.independent.co.uk/news/obituaries/g... Glyn Tegai Hughes, obituary: esteemed literary... 2017-05-22 12:12:19.316 2017-05-22 09:53:35.000 1 0.00 0.00 0.00 1.00 The Independent ... nan False nan 386 0.00 0.00 0.00 0.00 0.00 0.00

128531 rows × 23 columns

In [61]:
data["score_diff"] = data.promotion_score - data.response_score
In [62]:
# promoted but low response
data.sort_values("score_diff", ascending=False).head(25)
Out[62]:
url headline discovered published fb_engagements fb_max_engagements_per_min fb_comments fb_reactions fb_shares publisher_name ... fb_brand_page fb_brand_page_likes alexa_rank response_score lead_score front_score facebook_promotion_score promotion_score attention_index score_diff
id
276f94591dbcc1637e7b17ef8c1f756f55875256 https://www.buzzfeed.com/josephbernstein/the-p... The Public Square Belongs to 4Chan 2017-05-18 19:45:20.292 2017-05-17 16:56:19 0 0.00 0.00 0.00 0.00 Buzzfeed News ... True 2498318.00 147 0.00 19.00 15.00 12.00 46.00 46.00 46.00
2e86caccaa0f4d2487fbfe7e66c7d9a1bd7d9700 https://www.buzzfeed.com/jtes/california-non-b... This 17-Year-Old Wanted A Name Change. They Ma... 2017-05-25 18:35:06.550 2017-05-24 21:17:42 1 0.02 0.00 0.00 1.00 Buzzfeed News ... True 2522329.00 147 0.00 19.00 15.00 12.00 46.00 46.00 46.00
b719c5a9926a584c03998c5cb3b966189d9f2355 https://www.buzzfeed.com/jimdalrympleii/the-fa... An Epic Legal Battle Is Brewing As Trump Sizes... 2017-05-26 22:15:06.656 2017-05-25 23:53:39 35 0.27 4.00 7.00 24.00 Buzzfeed News ... True 2521837.00 147 5.00 19.00 15.00 12.00 46.00 51.00 41.00
86e29ec1cdb2bfbaca4c7d5c7dced916b514bba8 https://www.buzzfeed.com/bimadewunmi/flint-isn... Flint Isn’t Ready To Trust Anyone Yet 2017-05-24 22:15:04.778 2017-05-22 22:45:04 43 0.08 2.00 18.00 23.00 Buzzfeed News ... True 2523526.00 147 6.00 19.00 15.00 12.00 46.00 52.00 40.00
08c2282049e8764a1cd56b1d7b548fa4bf4989c2 https://www.buzzfeed.com/priya/uber-pool-burn-... Leaked Internal Documents Show UberPool Was A ... 2017-05-31 13:42:26.876 2017-05-26 23:56:21 2 0.08 0.00 0.00 2.00 Buzzfeed News ... True 2526007.00 147 1.00 19.00 7.00 12.00 38.00 39.00 37.00
2386c9b079c7fe36ce6d834c70740d4944b87004 http://www.huffingtonpost.com/entry/trump-kill... While You Weren't Looking, Trump Basically Kil... 2017-05-22 11:21:11.792 2017-05-16 10:01:54 1 0.08 0.00 0.00 1.00 HuffPost ... True 9456258.00 215 0.00 17.00 6.00 13.00 36.00 36.00 36.00
d8e02170a5b45a4438d1eb145395cc016eb3899b http://www.huffingtonpost.com/entry/portland-a... One Official Tried To Warn Us About Attacks Li... 2017-05-31 19:21:15.995 2017-05-30 07:41:50 2 0.17 0.00 0.00 2.00 HuffPost ... True 9494109.00 215 1.00 17.00 6.00 13.00 36.00 37.00 35.00
fe77503974cdd9ed96e6269a1c663703f9c1836d https://www.buzzfeed.com/dougbockclark/america... Why Is The US Trying To Remake The World’s Pri... 2017-05-28 15:03:13.191 2017-05-18 11:39:30 0 0.00 0.00 0.00 0.00 Buzzfeed News ... False nan 147 0.00 19.00 15.00 0.00 34.00 34.00 34.00
23856d48f503288938e10d5705d94e06399b2232 https://www.buzzfeed.com/alisonwillmore/the-ne... The New Fall Shows Are Opting For Comfort Over... 2017-05-21 17:10:15.705 2017-05-19 22:37:04 1 0.00 0.00 0.00 1.00 Buzzfeed News ... False nan 147 0.00 19.00 15.00 0.00 34.00 34.00 34.00
55519241e5fcfb028a7378cda01fe05471e12497 https://www.buzzfeed.com/priya/a-federal-judge... A Federal Judge Ordered Uber To Return Documen... 2017-05-15 15:04:24.715 2017-05-15 14:52:24 29 0.58 0.00 8.00 21.00 Buzzfeed News ... False nan 147 4.00 19.00 15.00 0.00 34.00 38.00 30.00
d64c64f6678104353fffca22cbcc35d443008cd6 http://www.bbc.co.uk/news/live/election-2017-4... General election: Latest updates 2017-05-30 09:10:15.842 2017-05-30 09:09:45 421 0.77 146.00 115.00 160.00 BBC ... True 41829626.00 96 17.00 20.00 12.00 15.00 47.00 64.00 30.00
06ca61dd2f4e36da6778a41713a1bea8a2b80489 https://www.buzzfeed.com/emmaloop/some-senator... Some Senators Worry Special Counsel Will Hurt ... 2017-05-18 23:35:07.335 2017-05-18 23:26:28 24 0.17 3.00 6.00 15.00 Buzzfeed News ... False nan 147 4.00 19.00 13.00 0.00 32.00 36.00 28.00
d5d04b8bf3bba2309df65df92e328095aeb41dcd https://www.buzzfeed.com/hannahalothman/these-... These Pictures Show Salman Abedi On The Night ... 2017-05-27 20:48:17.614 2017-05-27 19:48:33 489 2.34 146.00 308.00 35.00 Buzzfeed News ... True 2522625.00 147 18.00 19.00 15.00 12.00 46.00 64.00 28.00
eec485815c62fe8f7d9f92cfc1a9e3b4505f9da1 https://www.buzzfeed.com/johnhudson/turkeys-fo... Turkey’s Foreign Minister Accuses European Int... 2017-05-27 14:54:12.773 2017-05-27 14:41:30 52 0.29 2.00 24.00 26.00 Buzzfeed News ... False nan 147 6.00 19.00 15.00 0.00 34.00 40.00 28.00
cdc9da260c2f69ac44612222743507f59fc1ea85 https://www.buzzfeed.com/aishagani/these-are-t... These Are The Children Left Behind In Calais 2017-05-29 08:12:11.176 2017-05-25 14:23:45 6 0.50 0.00 4.00 2.00 Buzzfeed News ... False nan 147 1.00 19.00 10.00 0.00 29.00 30.00 28.00
ad95623506020d40832341a492a716c089041357 https://www.buzzfeed.com/buzzfeednews/live-upd... Live Updates: President Trump Delivers Remarks... 2017-05-29 15:30:21.840 2017-05-29 15:28:35 47 0.17 21.00 3.00 23.00 Buzzfeed News ... False nan 147 6.00 19.00 15.00 0.00 34.00 40.00 28.00
46dca9508b04bf096559dae6de4c380aa43b5a1a https://www.buzzfeed.com/borzoudaragahi/the-fa... The Father Of The Manchester Suicide Bomber Pr... 2017-05-25 17:39:11.882 2017-05-25 16:38:13 444 1.11 73.00 314.00 57.00 Buzzfeed News ... True 2517728.00 147 18.00 19.00 15.00 12.00 46.00 64.00 28.00
a68cb2203f5f4adc936598ff48f48e88f90df114 https://www.buzzfeed.com/gracewyler/north-kore... North Korea Reportedly Launched Another Projec... 2017-05-13 23:15:15.684 2017-05-13 23:05:54 44 0.17 0.00 18.00 26.00 Buzzfeed News ... False nan 147 6.00 19.00 15.00 0.00 34.00 40.00 28.00
a753c5ecbc0ba12d832bd9a6f2c835c2ed48ba15 https://www.buzzfeed.com/hannahallam/what-happ... No, The Trump Administration's Attacks On Musl... 2017-05-17 18:57:16.139 2017-05-17 18:55:43 26 0.08 1.00 12.00 13.00 Buzzfeed News ... False nan 147 4.00 19.00 13.00 0.00 32.00 36.00 28.00
17a00d45c745161632b8bc4e925a98946ee4495b https://www.buzzfeed.com/adolfoflores/undocume... The Underground Network To Ferry Undocumented ... 2017-05-20 15:33:18.644 2017-05-19 19:44:49 66 0.44 1.00 29.00 36.00 Buzzfeed News ... False nan 147 7.00 19.00 15.00 0.00 34.00 41.00 27.00
a6f209e9944ba0d610b6434c94698fcbc6efc4c3 https://www.buzzfeed.com/albertonardelli/g7-be... Here's What's Going On Behind The Scenes Durin... 2017-05-26 20:12:22.410 2017-05-26 20:10:27 27 0.16 1.00 4.00 22.00 Buzzfeed News ... False nan 147 4.00 19.00 12.00 0.00 31.00 35.00 27.00
7a54ed2dd7aaa6b1570f8f4921d3ee48f5c94d05 http://www.npr.org/2017/05/31/530929894/jury-s... Jury Selection Begins In South Dakota 'Pink Sl... 2017-05-31 21:00:33.572 2017-05-31 20:00:00 14 0.03 0.00 3.00 11.00 NPR ... True 6008762.00 594 3.00 15.00 2.00 13.00 30.00 33.00 27.00
9ca581ae49a1d1c9378930a5692e5b277b39345a https://www.buzzfeed.com/danvergano/how-lidar-... Everything You Need To Know About The Technolo... 2017-05-15 18:12:03.981 2017-05-12 21:14:37 63 0.50 0.00 13.00 50.00 Buzzfeed News ... False nan 147 7.00 19.00 15.00 0.00 34.00 41.00 27.00
4ade00f8a8b024605911ff619d371603c1acad6f https://www.buzzfeed.com/zoetillman/federal-ap... Federal Appeals Court Upholds The Nationwide I... 2017-05-25 18:09:19.095 2017-05-25 18:09:05 518 4.01 57.00 342.00 119.00 Buzzfeed News ... True 2518012.00 147 19.00 19.00 15.00 12.00 46.00 65.00 27.00
35084b054f4015c12db62c965f792f634b09ceac http://www.nbcnews.com/storyline/trumps-first-... On Saudi Trip, Trump Tasked With Mending Relat... 2017-05-19 19:27:06.021 2017-05-19 18:19:00 3 0.02 0.00 0.00 3.00 NBC News ... True 8922889.00 826 1.00 14.00 1.00 13.00 28.00 29.00 27.00

25 rows × 24 columns

In [63]:
# high response but not promoted
data.sort_values("score_diff", ascending=True).head(25)
Out[63]:
url headline discovered published fb_engagements fb_max_engagements_per_min fb_comments fb_reactions fb_shares publisher_name ... fb_brand_page fb_brand_page_likes alexa_rank response_score lead_score front_score facebook_promotion_score promotion_score attention_index score_diff
id
bc6d2a659ec806138ffbff7973f3849c2d5170fd http://yournewswire.com/world-gets-behind-puti... World Gets Behind Putin's Vow To Destroy New W... 2017-05-17 10:51:54.096 2017-05-06 22:00:12.000 166706 0.00 24418.00 114380.00 27908.00 Your News Wire ... False nan 22568 50.00 0.00 0.00 0.00 0.00 50.00 -50.00
f9e0e40d7a2ae8a3ee8cea21e7a4972358b94f15 https://www.washingtonpost.com/news/parenting/... The important role of aunts and uncles in chil... 2017-05-26 10:06:13.945 2017-05-26 10:00:00.000 84117 83.45 8077.00 62427.00 13613.00 The Washington Post ... False nan 191 50.00 0.00 0.00 0.00 0.00 50.00 -50.00
6f582dfe1ce09b7896bc16ff135c62d61eb46c44 http://www.huffingtonpost.com/entry/chris-corn... Chris Cornell: When Suicide Doesn't Make Sense 2017-05-19 13:30:19.144 2017-05-18 14:47:50.000 73548 58.41 11786.00 50465.00 11297.00 HuffPost ... False nan 215 50.00 0.00 0.00 0.00 0.00 50.00 -50.00
a31e662de58eeb12d8dcb5f1d40f6302a7cb42a5 https://www.nytimes.com/2017/05/19/business/li... Lisa Su on the Art of Setting Ambitious Goals 2017-05-19 11:03:27.185 2017-05-19 11:00:01.000 149799 72.18 65.00 149665.00 69.00 New York Times ... False nan 120 50.00 0.00 0.00 0.00 0.00 50.00 -50.00
3fb57d744f1b05e05f6745bd5244130eeb77aa69 http://www.independent.co.uk/jeff-sessions-wom... Woman found guilty and faces year in jail for ... 2017-05-03 17:50:18.534 2017-05-03 17:48:28.000 60406 88.37 nan nan nan The Independent ... False nan 386 50.00 0.00 0.00 0.00 0.00 50.00 -50.00
971f61642849a7db6f16046b241cc16819f87a33 http://www.cnn.com/2013/05/28/travel/100-best-... World's 100 best beaches 2017-05-11 11:15:18.220 2017-05-11 11:06:29.000 82513 0.00 nan nan nan CNN ... False nan 105 50.00 0.00 0.00 0.00 0.00 50.00 -50.00
238e488a22925512652eef984fb16fd547ac9512 https://www.washingtonpost.com/local/obituarie... Representative: Rocker Chris Cornell has died ... 2017-05-18 07:45:22.278 2017-05-18 07:36:12.000 57145 267.37 11497.00 40224.00 5424.00 The Washington Post ... False nan 191 50.00 0.00 1.00 0.00 1.00 51.00 -49.00
c55509930e92f71ee534edf7d570b70ed81a2267 https://www.indy100.com/article/divorced-fathe... This dad's brilliant post about his ex-wife is... 2017-05-15 11:09:18.339 2017-05-14 08:36:41.000 1114468 8.34 58501.00 914126.00 141841.00 indy100 ... False nan 5014 50.00 0.00 1.00 0.00 1.00 51.00 -49.00
d864ee5c2c85f8cb99380b309268422f81d48165 https://www.indy100.com/article/luxembourg-gay... Luxembourg's openly gay 'first husband' joined... 2017-05-26 12:05:12.044 2017-05-26 11:57:01.000 156503 108.48 5846.00 141474.00 9183.00 indy100 ... False nan 5014 50.00 0.00 1.00 0.00 1.00 51.00 -49.00
5811a42e5a65ce6dca6b1b85fb72b585914d495e http://www.independent.co.uk/news/world/americ... US citizens with cancer could pay up to $140,0... 2017-05-04 11:08:25.992 2017-05-04 10:18:47.000 75268 108.33 nan nan nan The Independent ... False nan 386 50.00 0.00 1.00 0.00 1.00 51.00 -49.00
7258bf39dbb64771f0a084202fb7aa15138274f3 http://www.huffingtonpost.com/entry/black-farm... Black Farmer Calls Out Liberal Racism In Power... 2017-05-24 19:21:13.921 2017-05-24 07:00:55.000 52881 18.61 5498.00 38879.00 8504.00 HuffPost ... False nan 215 49.00 0.00 0.00 0.00 0.00 49.00 -49.00
7e4d3e7c555faa571afb985b7283f0a707d0662b https://www.washingtonpost.com/national/repres... Representative: Rocker Chris Cornell has died ... 2017-05-18 07:39:09.658 2017-05-18 07:28:13.000 49937 62.50 9600.00 35506.00 4831.00 The Washington Post ... False nan 191 49.00 0.00 0.00 0.00 0.00 49.00 -49.00
6a9c8aa43c71cb24777862596739902fb4b06ff8 https://www.indy100.com/article/luxembourg-gay... Luxembourg's openly gay 'first husband' joined... 2017-05-26 09:35:12.615 2017-05-26 09:30:24.000 131750 145.08 4788.00 120893.00 6069.00 indy100 ... False nan 5014 50.00 0.00 1.00 0.00 1.00 51.00 -49.00
6c1158c4d5575103f2d3ea9666ea9fd8a94576c8 https://www.indy100.com/article/white-privileg... This student's message about white privilege i... 2017-05-15 11:09:03.877 2017-05-13 15:05:34.000 221132 191.67 27010.00 143505.00 50617.00 indy100 ... False nan 5014 50.00 0.00 1.00 0.00 1.00 51.00 -49.00
e029967f1b95be119e29603ec4b975a8ccedcfbe http://www.huffingtonpost.com/entry/pharrell-n... Pharrell To College Graduates: 'We Need To Lif... 2017-05-18 19:12:14.557 2017-05-18 09:13:21.000 50390 36.58 821.00 45904.00 3665.00 HuffPost ... False nan 215 49.00 0.00 0.00 0.00 0.00 49.00 -49.00
0765a68170be1d1bc8d466cd312b4198b1bed52a https://www.washingtonpost.com/news/morning-mi... Kid genius brothers, 11 and 14, graduate high ... 2017-05-12 08:12:05.453 2017-05-12 08:06:00.000 68072 58.41 2142.00 56282.00 9648.00 The Washington Post ... False nan 191 50.00 0.00 2.00 0.00 2.00 52.00 -48.00
24c4629edddc8db232fe20107e1eeed658d04980 http://evolvepolitics.com/one-britains-influen... One of Britain's most influential rappers just... 2017-05-16 10:35:22.163 2017-05-09 16:52:45.000 55710 1.61 7810.00 36702.00 11198.00 EvolvePolitics.com ... False nan 119412 49.00 0.00 1.00 0.00 1.00 50.00 -48.00
7465f0aa8747b4b86761fe8c64da207a048347e7 https://www.buzzfeed.com/aishamirza/until-whit... Until White Women Ruined It 2017-05-23 20:25:14.205 2017-05-23 17:25:19.000 50599 17.11 17118.00 28783.00 4698.00 Buzzfeed News ... False nan 147 49.00 0.00 1.00 0.00 1.00 50.00 -48.00
8b2f321660974d49d7bbff5eb7bbacff15c034b9 http://www.huffingtonpost.com/entry/mickey-row... Finally, An Actor With Autism Is Starring In '... 2017-05-10 16:20:24.388 2017-05-10 04:43:41.000 43300 31.04 2726.00 36456.00 4118.00 HuffPost ... False nan 215 48.00 0.00 0.00 0.00 0.00 48.00 -48.00
ec37e52e5822f54de86ca03fdccb331e63efce9f http://yournewswire.com/melania-trump-bans-mon... Melania Trump Bans Monsanto Products From The ... 2017-05-30 18:20:12.797 2017-05-30 18:14:49.000 64190 43.00 10865.00 45276.00 8049.00 Your News Wire ... False nan 22568 50.00 2.00 1.00 0.00 3.00 53.00 -47.00
1f5b4b137dd3d3590710c2b94575fe1b54dbda02 https://www.washingtonpost.com/news/early-lead... Torrey Smith just paid the adoption fee for 46... 2017-05-07 20:36:02.478 2017-05-07 20:27:00.000 70570 49.83 nan nan nan The Washington Post ... False nan 191 50.00 0.00 3.00 0.00 3.00 53.00 -47.00
f370c7baac0ce9a5d9a74e3a022790b8536ef400 http://www.washingtonpost.com/politics/2017/li... Democrats are demanding an investigation of Se... 2017-05-17 17:00:09.780 2017-05-17 17:00:09.780 42073 92.58 2922.00 37359.00 1792.00 The Washington Post ... False nan 191 48.00 0.00 1.00 0.00 1.00 49.00 -47.00
2ae7ddbb5ead761510ae0b05f24f3d9cc096c157 https://www.washingtonpost.com/news/energy-env... Scientists just published an entire study refu... 2017-05-24 17:57:12.090 2017-05-24 17:46:23.000 47462 41.72 2589.00 37804.00 7069.00 The Washington Post ... False nan 191 48.00 0.00 1.00 0.00 1.00 49.00 -47.00
dffa1bb577da46da34e2d0753a48ad497fde3e25 https://www.indy100.com/article/justin-trudeau... Two of the most popular world leaders went for... 2017-05-26 11:25:16.769 2017-05-26 11:11:53.000 42286 48.39 4105.00 35844.00 2337.00 indy100 ... False nan 5014 48.00 0.00 1.00 0.00 1.00 49.00 -47.00
0ae249e0a6761c5c59710c905a88de0341bcef4c https://www.washingtonpost.com/national/health... Journalist arrested during US health secretary... 2017-05-10 11:30:11.366 2017-05-10 11:22:13.000 41446 300.40 nan nan nan The Washington Post ... False nan 191 47.00 0.00 0.00 0.00 0.00 47.00 -47.00

25 rows × 24 columns

Write that data to a file. Note that the scores here are provisional for two reasons:

  1. they should use a rolling month, based on the article publication date, to calculate the medians, minimums, maximums and so on, whereas in this notebook we are just using values for the month of May
  2. for analysis, we've rounded the numbers; we don't expect to do that for the actual scores
In [64]:
data.to_csv("articles_with_provisional_scores_2017-05-01_2017-05-31.csv")
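
The first caveat above could be addressed along these lines. This is only a minimal sketch: it assumes a trailing 30-day window ending at each article's publication time (the notebook does not fix the exact window), and it uses a small hypothetical series rather than the real `data` frame.

```python
import pandas as pd

# Hypothetical toy data standing in for the "published" and
# "fb_engagements" columns of the real `data` frame.
toy = pd.DataFrame({
    "published": pd.to_datetime(
        ["2017-04-01", "2017-04-20", "2017-05-05", "2017-05-25"]),
    "fb_engagements": [10, 30, 50, 70],
})

# Time-based rolling windows need a sorted datetime index.
engagements = (toy.sort_values("published")
                  .set_index("published")["fb_engagements"])

# Median of engagements over the 30 days up to and including each article.
rolling_median = engagements.rolling("30D").median()
print(rolling_median)
```

The same pattern with `.min()` or `.max()` in place of `.median()` would give the other rolling statistics the scores depend on.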

Summary

The attention index of an article comprises four components:

  • lead score (max 20 points) based on how long an article was the lead story on the publisher's home page, scaled by the traffic to that publisher
  • front score (max 15 points) based on how long an article was present on the publisher's home page, scaled by traffic to that publisher
  • Facebook promotion score (max 15 points) based on whether the article was promoted to the publisher's Facebook brand page, scaled by the reach of that brand page
  • response score (max 50 points) based on the number of Facebook engagements the article received, relative to other articles

Or, in other words:

\begin{align}
attentionIndex_a &= leadScore_a + frontScore_a + facebookPromotionScore_a + responseScore_a \\
leadScore_a &= 20 \cdot \left(\frac{\min(minsAsLead_a, 60)}{alexaRank_a}\right) \cdot \left(\frac{\min(alexaRank)}{60}\right) \\
frontScore_a &= 15 \cdot \left(\frac{\min(minsOnFront_a, 1440)}{alexaRank_a \cdot numArticlesOnFront_a}\right) \cdot \left(\frac{\min(alexaRank \cdot numArticlesOnFront)}{1440}\right) \\
facebookPromotionScore_a &= \begin{cases}
0 & \text{if not shared on brand page} \\
15 \cdot \frac{\log(brandPageLikes_a) - \log(\min(brandPageLikes))}{\log(\max(brandPageLikes)) - \log(\min(brandPageLikes))} & \text{otherwise}
\end{cases} \\
responseScore_a &= \begin{cases}
0 & \text{if } engagements_a = 0 \\
50 \cdot \frac{\log(\min(engagements_a, limit) + \operatorname{median}(engagements)) - \log(1 + \operatorname{median}(engagements))}{\log(limit + \operatorname{median}(engagements)) - \log(1 + \operatorname{median}(engagements))} & \text{if } engagements_a > 0
\end{cases}
\end{align}
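
As a cross-check, the formulas can be transcribed into pandas roughly as follows. This is a sketch rather than the notebook's own implementation: the helper name `attention_scores` is hypothetical, `limit` is passed in because its value is not stated in this summary, and treating missing values as a score of 0 is an assumption. The column names are those of the `data` frame loaded at the top of the notebook.

```python
import numpy as np
import pandas as pd


def attention_scores(df, limit):
    """Hypothetical helper: add the summary's score columns to a copy of df.

    `limit` is the engagement cap in the response score formula.
    """
    out = df.copy()

    # Lead score: minutes as lead (capped at 60) over Alexa rank, scaled so
    # a full hour on the best-ranked site gives 20 points.
    lead = out["mins_as_lead"].clip(upper=60) / out["alexa_rank"]
    out["lead_score"] = 20 * lead * (out["alexa_rank"].min() / 60)

    # Front score: minutes on front (capped at 1440) over rank times the
    # number of articles on the front, scaled to a maximum of 15 points.
    weight = out["alexa_rank"] * out["num_articles_on_front"]
    front = out["mins_on_front"].clip(upper=1440) / weight
    # Assumption: articles with no front-page data score 0.
    out["front_score"] = (15 * front * (weight.min() / 1440)).fillna(0)

    # Facebook promotion score: log-scaled brand page likes, 0 if not shared.
    log_likes = np.log(out["fb_brand_page_likes"])
    scaled = 15 * (log_likes - log_likes.min()) / (log_likes.max() - log_likes.min())
    shared = out["fb_brand_page"].astype(bool)
    out["facebook_promotion_score"] = scaled.where(shared, 0).fillna(0)

    # Response score: log-scaled engagements relative to the median,
    # capped at `limit`, 0 when there were no engagements.
    eng = out["fb_engagements"]
    med = eng.median()
    num = np.log(eng.clip(upper=limit) + med) - np.log(1 + med)
    den = np.log(limit + med) - np.log(1 + med)
    out["response_score"] = (50 * num / den).where(eng > 0, 0)

    out["attention_index"] = (out["lead_score"] + out["front_score"]
                              + out["facebook_promotion_score"]
                              + out["response_score"])
    return out


# Hypothetical three-article example; the limit of 1000 is illustrative only.
toy = pd.DataFrame({
    "mins_as_lead": [60, 0, 30],
    "mins_on_front": [1440, 0, 720],
    "num_articles_on_front": [10, np.nan, 10],
    "fb_brand_page": [True, False, True],
    "fb_brand_page_likes": [1_000_000, np.nan, 10_000],
    "alexa_rank": [100, 200, 100],
    "fb_engagements": [1000, 0, 10],
})
scores = attention_scores(toy, limit=1000)
print(scores[["lead_score", "front_score", "facebook_promotion_score",
              "response_score", "attention_index"]])
```

In this toy frame the first article hits every cap and so reaches the maximum index, while the second, with no promotion and no engagements, scores zero throughout.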
