Accessing the History of News Headlines

I recently spent some time investigating the options for accessing the history of news articles via an API. I was mainly interested in APIs that can be accessed free of charge.

Here is the list of the most useful providers:

  • Guardian
    • Easy API
    • Acceptable Rate Limits
    • Access to over 1,900,000 pieces of content
    • Free for non-commercial usage
  • New York Times
    • Provides one API for searching and a separate API for downloading monthly data
    • Rate Limits are quickly reached in the search API
    • Provides data since 1851
    • Free for non-commercial usage
  • RSS
    • Many Free Sources
    • Very limited History

In the end, I settled on an architecture that

  • replicates the data sources into a Local Search engine (Solr)
  • provides some Utility classes to simplify different scenarios

In this document I provide a quick overview of the possibilities to access news headlines using functionality which is available in the JVM. The examples are implemented in Scala using Jupyter with the BeakerX kernel.

Setup

I install the necessary dependencies via Maven. The news-digest project uses

  • Jersey (REST client)
  • SolrJ (Search Engine API)
  • Rome (RSS API)
  • Yahoo Finance (to look up the company name by ticker symbol)
  • OpenNLP (to support named entity recognition)
  • deeplearning4j-nlp-uima (Sentiment Analysis)
In [1]:
%classpath config resolver maven-public1 http://nuc.local:8081/repository/maven-public/
%%classpath add mvn 
ch.pschatzmann:news-digest:LATEST
Added new repo: maven-public1

All relevant functionality is in the package: ch.pschatzmann.news

In [6]:
import ch.pschatzmann.news._
Out[6]:
import ch.pschatzmann.news._

Searching the "Guardian" with Jersey

The documentation for the API can be found at https://open-platform.theguardian.com/documentation/

The limits are currently set to

  • 5000 requests per day.
  • Up to 12 calls per second

The Guardian search API can be easily accessed with the help of Jersey. The only challenge is the paging logic, which needs to iterate over all pages. We provide the GuardianPagedQuery class, which returns a Java Stream of JsonObjects.

We can convert the Java stream to a Scala iterator with .stream().iterator().asScala:

In [7]:
import scala.collection.JavaConverters._
import javax.ws.rs.client.ClientBuilder;

val apiKey = Utils.property("guardianAPIKey")
var client = ClientBuilder.newClient();
var target = client.target("https://content.guardianapis.com/search")
    .queryParam("q", "BP")
    .queryParam("order-by", "oldest")
    .queryParam("api-key", apiKey)
    .queryParam("page-size", "200")

var result = new GuardianPagedQuery(target)
    .stream().iterator().asScala
    .map(json => (json.getString("webPublicationDate"), json.getString("webTitle", null)))
    .toList

result.size
Out[7]:
10129
In [8]:
result(0)
Out[8]:
(1997-01-12T15:16:22Z,Blair and the Brains)
In [9]:
result.last
Out[9]:
(2018-12-04T14:00:35Z,How to make a carbon tax popular? Give the proceeds to the people)
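
The paging logic mentioned above can be illustrated without any HTTP client. The following is only a minimal sketch and not the actual GuardianPagedQuery implementation: Page, pagedIterator and fakeFetch are hypothetical names, and fakeFetch stands in for a single REST call that returns the items of one page together with the total number of pages.

```scala
// Hypothetical result of a single API call: the items on one page plus
// the total number of pages reported by the server.
case class Page[A](items: Seq[A], totalPages: Int)

// Lazily walks over all pages, starting with page 1. Pages after the first
// are only fetched when the iterator reaches them, so stopping early saves requests.
def pagedIterator[A](fetchPage: Int => Page[A]): Iterator[A] = {
  val first = fetchPage(1)
  val remaining = Iterator.range(2, first.totalPages + 1).flatMap(n => fetchPage(n).items)
  first.items.iterator ++ remaining
}

// Stand-in for the REST call: 25 fake headlines served in pages of 10.
val headlines = (1 to 25).map(i => s"headline $i")
def fakeFetch(page: Int): Page[String] =
  Page(headlines.slice((page - 1) * 10, page * 10), totalPages = 3)

val all = pagedIterator(fakeFetch).toList
```

Because the iterator is lazy, something like pagedIterator(fakeFetch).take(5) would only trigger the first request.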

Searching the "New York Times" with Jersey

The documentation for the API can be found at https://developer.nytimes.com/

The limits are currently set to

  • 1000 requests per day.
  • Up to 1 call per second

A search request returns only 10 entries, so a single query that matches 10,000 entries already uses up the whole daily limit!
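
The arithmetic behind this statement can be written down as a small sketch. The helper requestsNeeded is a hypothetical name; the numbers are the limits and page sizes quoted in this document.

```scala
// Number of requests needed to page through a result set (ceiling division).
def requestsNeeded(totalEntries: Long, entriesPerRequest: Int): Long =
  (totalEntries + entriesPerRequest - 1) / entriesPerRequest

// NYT: 10 entries per request, so 10,000 entries consume the daily quota of 1000 requests.
val nytRequests = requestsNeeded(10000, 10)

// At 1 call per second this takes 1000 seconds, roughly 16.7 minutes.
val nytMinutes = nytRequests / 60.0

// Guardian, for comparison: the "BP" query returned 10129 entries with page-size 200.
val guardianRequests = requestsNeeded(10129, 200)
```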

The NYT search API can be easily accessed with the help of Jersey. The only challenge is the paging logic, which needs to iterate over all pages. We provide the NYTPagedQuery class, which returns a Java Stream of JsonObjects:

In [10]:
import scala.collection.JavaConverters._
import javax.ws.rs.client.ClientBuilder;

val apiKey = Utils.property("nytAPIKey")
var client = ClientBuilder.newClient();
var target = client.target("http://api.nytimes.com/svc/search/v2/articlesearch.json")
    .queryParam("q", "BP")
    .queryParam("begin_date", "20170101")
    .queryParam("end_date", "20181231")
    .queryParam("api-key", apiKey)

var result = new NYTPagedQuery(target)
    .stream().iterator().asScala
    .map(json => (json.getString("pub_date"), json.getString("snippet", null)))
    .toList

result.size
Out[10]:
152
In [11]:
result(0)
Out[11]:
(2017-04-17T19:28:17+0000,A damaged well, which had been venting methane vapors, was repaired without injuries or harm to wildlife, officials said.)
In [12]:
result.last
Out[12]:
(2018-08-01T04:47:33+0000,We knew everything we needed to know, and nothing stood in our way. Nothing, that is, except ourselves. A tragedy in two acts.)
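
To stay below a calls-per-second limit like the one listed above, the requests need to be spaced out. The following is a minimal sketch of one way to do this; the Throttle class is a hypothetical helper and I make no claim that NYTPagedQuery works this way.

```scala
import java.util.concurrent.TimeUnit

// Lets at most one call start per `minIntervalMs`. Calls are serialized:
// the lock is held while the call runs, which is fine for a single client.
class Throttle(minIntervalMs: Long) {
  private var nextAllowed = System.nanoTime()

  def apply[A](call: => A): A = synchronized {
    val waitNanos = nextAllowed - System.nanoTime()
    if (waitNanos > 0) TimeUnit.NANOSECONDS.sleep(waitNanos)
    nextAllowed = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(minIntervalMs)
    call
  }
}

// Example with a short interval; for a limit of 1 call per second the value would be 1000.
val throttle = new Throttle(50)
val started = System.nanoTime()
val pages = (1 to 3).map(page => throttle { s"page $page" })
val elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - started)
```

The first call runs immediately; every later call waits until the interval since the previous call has passed.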

Saving Guardian and NYT Data to Solr

With our news-digest functionality, we can easily save the Guardian and New York Times headlines to a local Solr search engine instance.

Alternatively, we can schedule the saving by calling scheduleSave(periodMs).

In [13]:
val store = new SolrDocumentStore()
new HistoryDataGuardian(store).save()
//new HistoryDataNYT(store).save()
Out[13]:
null
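
For readers who want to see what periodic saving could look like on the JVM, here is a minimal sketch based on java.util.concurrent. It is an assumption about one possible approach, not the actual scheduleSave implementation; the schedule function is a hypothetical helper.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Runs `task` immediately and then again every `periodMs` milliseconds
// on a single background thread. Returns a function that stops the schedule.
def schedule(periodMs: Long)(task: () => Unit): () => Unit = {
  val executor = Executors.newSingleThreadScheduledExecutor()
  val runnable = new Runnable {
    // An uncaught exception would suppress all future runs, so log it and continue.
    def run(): Unit =
      try task() catch { case e: Exception => println(s"scheduled task failed: $e") }
  }
  executor.scheduleAtFixedRate(runnable, 0, periodMs, TimeUnit.MILLISECONDS)
  () => { executor.shutdownNow(); () }
}

// Possible usage with the classes from this document (once per day):
// val stop = schedule(24L * 60 * 60 * 1000)(() => new HistoryDataGuardian(store).save())
```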

Now we can run our queries against Solr:

Searching in the Local Solr

We provide a search API which returns a Stream of Document objects. With this architecture we are able to efficiently process hundreds of thousands of documents.

In [14]:
import scala.collection.JavaConverters._

val store = new SolrDocumentStore()
var result = store.stream("publisher_t:guardian")
    .iterator().asScala
    .map(doc => (doc.date, doc.content))
    .toList

result.size
Out[14]:
125520
In [15]:
result(0)
Out[15]:
(Sat Jan 01 02:53:59 CET 2000,In brief)
In [16]:
result.last
Out[16]:
(Fri Dec 07 09:41:44 CET 2018,Markets rebound after Huawei arrest sparked biggest sell-off since Brexit vote – business live)
In [17]:
var result = store.stream("publisher_t:nytimes")
    .iterator().asScala
    .map(doc => (doc.date, doc.content))
    .toList

result.size
Out[17]:
116014
In [18]:
result(0)
Out[18]:
(Fri Nov 30 01:02:05 CET 2018,Floyd Mayweather and DJ Khaled Are Fined in I.C.O. Crackdown)
In [19]:
result.last
Out[19]:
(Sun Jun 18 00:00:00 CEST 2000,The death not long ago of Hayward Cirker, the founder of Dover Publications, seemed to mark the end of one era in publishing, but it may have as much to tell us about the one we're entering. As much as any other single publisher, Cirker mastered and thrived on the industry's last great technical revolution: the paperback. The shelves of American bookstores, not to mention those of millions of book lovers, would look very different if not for Cirker's influence. Starting in the late 1940's, Dover built specialties in every topic under the sun: physics, dollhouses, military history, classical music, fairy tales, 19th-century novels, cinema, ancient Egypt, mathematics, architecture, boomerangs, tales of exploration and philosophical memoirs, to name but a few. Cirker was by all accounts a quiet, modest man of vast, immodest interests, and the reprint house he founded and personally ran until he died in March at 82 offers one of the best lessons about how a vast technical revolution that everyone thinks will cheapen the book business can in fact elevate and enhance it. When Cirker and his wife, Blanche, got into the trade back in 1941, the so-called paperback revolution, which allowed mass printings of books for relatively little money, was taking off. It was helped by the government-sponsored ''armed services'' editions, reprints of novels, stories, poems and essays to divert and inspire the troops. But publishers themselves thought of the form as a vehicle for ''cheap'' books, and after the war, when they were no longer under the watchful eyes of Army and Navy censors, they returned their paperback lists to the standard fare of lurid detective stories and romances. Paperback meant pulp, sized for the pocket, and not-quite-real books.)
In [20]:
import scala.collection.JavaConverters._

val store = new SolrDocumentStore()
var result = store.stream("publisher_t:(guardian,nytimes)")
    .iterator().asScala
    .filter(doc => doc.content!=null)
    .map(doc => (doc.date, doc.content))
    .toList

result.size
Out[20]:
241159
In [21]:
result(0)
Out[21]:
(Sat Jan 01 02:53:59 CET 2000,In brief)
In [22]:
result.last
Out[22]:
(Sun Jun 18 00:00:00 CEST 2000,The death not long ago of Hayward Cirker, the founder of Dover Publications, seemed to mark the end of one era in publishing, but it may have as much to tell us about the one we're entering. As much as any other single publisher, Cirker mastered and thrived on the industry's last great technical revolution: the paperback. The shelves of American bookstores, not to mention those of millions of book lovers, would look very different if not for Cirker's influence. Starting in the late 1940's, Dover built specialties in every topic under the sun: physics, dollhouses, military history, classical music, fairy tales, 19th-century novels, cinema, ancient Egypt, mathematics, architecture, boomerangs, tales of exploration and philosophical memoirs, to name but a few. Cirker was by all accounts a quiet, modest man of vast, immodest interests, and the reprint house he founded and personally ran until he died in March at 82 offers one of the best lessons about how a vast technical revolution that everyone thinks will cheapen the book business can in fact elevate and enhance it. When Cirker and his wife, Blanche, got into the trade back in 1941, the so-called paperback revolution, which allowed mass printings of books for relatively little money, was taking off. It was helped by the government-sponsored ''armed services'' editions, reprints of novels, stories, poems and essays to divert and inspire the troops. But publishers themselves thought of the form as a vehicle for ''cheap'' books, and after the war, when they were no longer under the watchful eyes of Army and Navy censors, they returned their paperback lists to the standard fare of lurid detective stories and romances. Paperback meant pulp, sized for the pocket, and not-quite-real books.)
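
The examples above call toList, which loads all documents into memory. The benefit of a stream shows when we aggregate in a single pass instead. The following is a minimal sketch: fakeDocs is a hypothetical stand-in for store.stream(query).iterator().asScala and produces made-up (year, headline) pairs.

```scala
// Stand-in for the Solr result iterator: elements are produced lazily,
// so the full result set is never held in memory at once.
def fakeDocs(): Iterator[(Int, String)] =
  Iterator.range(0, 100000).map(i => (2000 + i % 20, s"headline $i"))

// Count headlines per year in a single pass; only the small map is kept in memory.
val perYear: Map[Int, Int] =
  fakeDocs().foldLeft(Map.empty[Int, Int]) { case (counts, (year, _)) =>
    counts.updated(year, counts.getOrElse(year, 0) + 1)
  }
```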

Named Entities Analysis

We can determine the named entities:

In [23]:
result.slice(0,200).map(r => Utils.namedEntities(r._2))
Out[23]:
[[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [National], [], [], [], [], [KLM], [], [], [], [], [], [], [DTI], [], [], [], [], [], [], [], [], [], [Bell], [], [Rock], [], [], [], [], [], [Deutsche], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [EU], [], [], [], [], [], [BMW], [NTL], [Media], [Lloyds], [], [Kyte], [], [], [WTO], [ITV], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [Media], [], [], [], [], [], [], [], [], [], [QXL], [], [], [], [], [Express], [], [], [], [], [], [], [], [], [Time, AOL], [], [], [], [Blue], [], [], [], [], [], [], [], [], [], [], [WTO], [], [], [], [AOL], [], [], [OFT], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [Intel], [], [], [US], [], [], [], [], [], [House], [], [Bank], [], [Siemens], [], [], [], [], [], [], [], [], [GUS], [], [], [], [Europe], [], [], [Partners], [], [EU], [], [], [], [], []]]
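
Most headlines yield no entity at all, so a frequency table is easier to read than the raw lists. The following is a minimal sketch that assumes the entities are available as one list of strings per headline; the sample values are a handful of entries picked from the output above, not the full result.

```scala
// One list of entities per headline (small sample taken from the output above).
val entities: List[List[String]] = List(
  List(), List("KLM"), List("EU"), List("Time", "AOL"), List("WTO"),
  List("AOL"), List("WTO"), List("EU"), List("Media"), List("Media"))

// Flatten, count each name, and sort by descending count, then by name.
val frequency: List[(String, Int)] =
  entities.flatten
    .groupBy(identity)
    .map { case (name, hits) => (name, hits.size) }
    .toList
    .sortBy { case (name, count) => (-count, name) }
```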

Sentiment Analysis

We can determine the sentiment using SWN3 (SentiWordNet 3.0):

In [24]:
result.slice(0,20).map(r => Utils.sentiment(r._2))
Out[24]:
[[neutral, neutral, neutral, weak_negative, weak_positive, neutral, strong_negative, weak_negative, weak_negative, neutral, neutral, weak_negative, negative, neutral, weak_negative, neutral, weak_negative, weak_negative, neutral, neutral]]
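
To aggregate such labels, for example into a daily average, they can be mapped to numbers. The following is a minimal sketch: the scale from -3 to 3 is my own assumption and not part of news-digest, and it assumes the labels are available as strings. The label names are the ones that appear in the outputs of this notebook.

```scala
// Assumed numeric scale for the sentiment labels.
val score: Map[String, Int] = Map(
  "strong_negative" -> -3, "negative" -> -2, "weak_negative" -> -1,
  "neutral" -> 0,
  "weak_positive" -> 1, "positive" -> 2, "strong_positive" -> 3)

// The 20 labels from the output above.
val labels = List(
  "neutral", "neutral", "neutral", "weak_negative", "weak_positive",
  "neutral", "strong_negative", "weak_negative", "weak_negative", "neutral",
  "neutral", "weak_negative", "negative", "neutral", "weak_negative",
  "neutral", "weak_negative", "weak_negative", "neutral", "neutral")

val total = labels.map(score).sum
val mean = total.toDouble / labels.size
```

For this sample the mean is slightly negative, which matches the impression from the labels.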

Search with Ticker Symbol

We can search by ticker symbol and the related company name:

In [25]:
val search = Utils.companyNameByTickerSearch("AAPL")
In [26]:
import scala.collection.JavaConverters._

val store = new SolrDocumentStore()
var result = store.stream(search)
    .iterator().asScala
    .map(doc => (Utils.sentiment(doc.content), doc.dateFormatted("yyyy-MM-dd"), doc.content))
    .toList

result.slice(0, 100).foreach(println(_))
(negative,2000-01-07,Jobs gets the job: Apple founder becomes a stayer)
(weak_negative,2000-05-28,How UK Inc should spend its windfall)
(weak_negative,2000-07-19,Buoyant Apple boosted by iMac)
(strong_negative,2000-09-30,$9bn off shares halves Apple)
(weak_negative,2000-10-03,Pret a Manger takes a bite at the Big Apple)
(positive,2001-01-18,Apple and IBM results show different views of US economy)
(strong_negative,2001-09-23,USA Inc braced for more pain)
(strong_positive,2001-10-11,Big Apple extends big welcome)
(neutral,2001-11-22,EC fines 'Vitamin Inc' cartel)
(strong_positive,2002-02-01,Enron: not the only bad apple)
(neutral,2002-02-03,The enemy within USA Inc)
(weak_positive,2002-03-31,A Mini adventure in the Big Apple)
(neutral,2002-03-31,End of the affair with USA Inc)
(negative,2002-04-06,Five-star steel behind the apple pie image)
(positive,2002-06-01,Throg street: welcome to Black Hole Inc)
(neutral,2002-06-07,Cleaners move in at America Inc)
(neutral,2002-06-30,Monsters Inc brought to account)
(weak_negative,2002-06-30,City assault on USA Inc)
(neutral,2003-04-12,Apple 'in talks' with Universal Music)
(weak_negative,2003-04-16,Setback for USA inc)
(neutral,2003-05-03,Apple tunes in, but music still has a problem)
(weak_negative,2003-06-14,Why apple pie and sauerkraut make poor fare)
(negative,2003-08-09,Big Apple turns sour on Mike Bloomberg)
(neutral,2003-08-14,Microsoft beats Apple to the music)
(neutral,2003-09-11,Big Apple awash with Snapple)
(neutral,2004-06-19,John Naughton: Apple holds keys to music kingdom)
(negative,2004-07-03,Apple error delays iMac launch)
(positive,2004-07-16,Notebook: Apple is failing to deliver - again)
(neutral,2004-07-31,USA Inc pays cash for access)
(weak_negative,2004-09-01,Apple unveils the computer without a computer)
(neutral,2004-10-30,Shareholder vote looms for USA Inc)
(neutral,2004-11-07,Payback time for USA Inc)
(neutral,2005-01-16,Big Apple?)
(neutral,2005-02-13,How Apple saved the music biz)
(negative,2005-04-15,Wall Street losses continue as GM and Apple disappoint)
(positive,2005-06-12,John Naughton: Is Apple right to cosy up to the enemy?)
(neutral,2005-06-25,Frank Kane: China throws down gauntlet to USA Inc)
(neutral,2006-04-01,The appeal of Apple)
(neutral,2006-05-08,Beatles label loses apple logo case to iTunes)
(negative,2006-05-09,Beatles to appeal after losing trademark battle with Apple Computer)
(negative,2006-05-17,Lawsuit to halt US iPod sales hurts Apple shares)
(neutral,2006-06-30,Apple 'may have mishandled options')
(weak_negative,2006-06-30,Apple admits it may have mishandled stock options)
(neutral,2006-07-30,Will Jobs's departure cut Apple to the core?)
(weak_negative,2006-08-04,Apple may restate profits amid accounting scandal)
(weak_negative,2006-08-04,Apple may restate profits amid accounting scandal)
(neutral,2006-08-25,Apple recalls laptop batteries)
(weak_negative,2006-08-25,Apple iPod battle settled with £53m payout to Singapore rival)
(neutral,2006-10-19,Apple Mac sales hit new record)
(negative,2006-10-22,India Inc: new moguls making billions in steel, software and retailing)
(neutral,2006-10-23,Apple attacked over stock scandal)
(positive,2006-12-29,Apple trembles at report that board minutes were falsified)
(positive,2006-12-30,Apple admits board minutes were falsified over Jobs options)
(weak_negative,2007-01-07,Kremlin Inc ready to take on the West)
(weak_negative,2007-01-10,Apple proclaims its revolution: a camera, an iPod ... oh, and a phone)
(positive,2007-01-12,Cisco to sue Apple over ownership of iPhone name)
(weak_negative,2007-01-13,Apple shares slump as options scandal threatens Jobs)
(neutral,2007-01-14,Britain's bosses woo India Inc)
(weak_negative,2007-02-22,Apple and Cisco resolve iPhone trademark battle)
(positive,2007-04-25,Apple chief Steve Jobs 'was warned on stock options')
(weak_negative,2007-04-26,Apple board backs Jobs in stock options furore as shares soar)
(strong_negative,2007-09-02,John Naughton: The sleek Apple iPhone comes with a bad connection)
(neutral,2007-09-07,Apple apologises for iPhone price cut)
(neutral,2007-09-13,What's in the Big Apple?)
(neutral,2007-09-17,AOL swaps the suburbs for the Big Apple)
(neutral,2007-09-18,AOL reckons Big Apple is where the ad action is)
(neutral,2007-09-20,Jobs subpoenaed over Apple stock scandal)
(weak_negative,2007-12-30,Apple and Google ruled a year to note in your Facebook)
(neutral,2008-01-09,Apple to cut iTunes charges)
(weak_negative,2008-03-27,Wolfson suffers as Apple drops its technology)
(weak_positive,2008-04-24,IPhone helps Apple ring up 36% profit rise)
(neutral,2008-07-21,US economy: Apple shares drop in unofficial trading)
(negative,2008-07-31,Telecoms: 02's exclusive deal for Apple iPhone hangs in the balance)
(weak_positive,2008-09-10,Apple chief executive Steve Jobs bids to quell rumours about his health)
(weak_negative,2008-10-07,Vodafone challenges Apple with offer of fresh BlackBerrys for Christmas)
(weak_negative,2008-12-18,Apple buys shareholding in Imagination Technologies)
(neutral,2009-02-26,O2 reports 10% rise after selling 1m Apple iPhones)
(strong_positive,2009-05-02,Sound familiar? Apple launches a revolution - and then gets overtaken)
(negative,2009-06-26,Imagination Technologies up as Apple stake hits 9.5%)
(neutral,2009-07-14,T-Mobile believed to be in talks with Apple to snatch iPhone from 02)
(weak_positive,2009-07-22,Apple profit turns focus on UK suppliers)
(negative,2009-08-19,Messaging International soars but denies imminent Apple deal)
(positive,2009-08-19,Market forces: Hints of deal with Apple lift software firm)
(weak_negative,2009-08-29,Apple iPhone faces Android threat)
(weak_positive,2009-10-03,Why the iPhone has become the apple of Orange chief's eye)
(weak_negative,2009-10-05,Wolfson Microelectronics slumps after Apple iPhone loss)
(neutral,2009-10-20,Apple rings up record results for iPhone)
(strong_negative,2009-10-22,More sour grapes than bad Apple)
(weak_positive,2009-10-22,Nokia sues Apple over alleged breach of patent)
(negative,2009-11-11,Apple 'plans iPhone that will work anywhere in world')
(neutral,2010-01-14,Vodafone joins Apple iPhone outlets with 50,000 starter sales)
(neutral,2010-01-19,Apple looks for iSlate mobile partner)
(strong_positive,2010-04-21,Wall Street upbeat after Apple and Morgan Stanley results)
(positive,2010-04-22,ARM boss pours cold water on Apple bid rumours after shares soar)
(weak_negative,2010-06-10,Apple takeover talk lifts chip designer Arm Holdings)
(positive,2010-07-21,FTSE makes bright start on SSL takeover and Apple results)
(neutral,2010-07-21,Apple results and bid hopes lift FTSE for first time in six days)
(neutral,2010-08-07,Apple opens biggest store to date in Covent Garden)
(neutral,2010-10-19,FTSE falters ahead of spending review as China rate rise hits miners and Apple disappoints)
(weak_negative,2010-10-19,Technology shares hit by Apple figures but Autonomy jumps on deal hopes)
Out[26]:
null