#!/usr/bin/env python
# coding: utf-8

# # API Calls with Python

# APIs (application programming interfaces) are hosted on web servers. When you type www.google.com in your browser's address bar, your computer asks the www.google.com server for a webpage, which the server then returns to your browser. APIs work much the same way, except that instead of your web browser asking for a webpage, your program asks for data. This data is usually returned in JSON format. To retrieve data, we make a request to a web server, and the server replies with our data. In Python, we'll use the `requests` library to do this.

# ### Python Setup

# In[1]:


# interacting with websites and web APIs
import requests  # easy way to interact with web sites and services

# data manipulation
import pandas as pd  # easy data manipulation


# ## How does the requests package work?

# We first need to understand what information can be accessed from the API. We use the **PatentsView API** (www.patentsview.org) as an example: we make an API call and check the information we get back.

# ### About PatentsView API

# The PatentsView platform is built on data derived from the US Patent and Trademark Office (USPTO) bulk data to link inventors, their organizations, locations, and overall patenting activity. The PatentsView API provides programmatic access to longitudinal data and metadata on patents, inventors, companies, and geographic locations since 1976.
#
# To access the API, we use the `get` function of the `requests` library. In order to tell Python what to access, we need to specify the URL of the API endpoint.
#
# PatentsView has several API endpoints. An endpoint is a server route that is used to retrieve a particular kind of data from the API. Examples of PatentsView API endpoints: http://www.patentsview.org/api/doc.html
#
# Currently no key is necessary to access the PatentsView API.

# ### Making a Request

# When you ping a website or portal for information, this is called making a request.
# That is exactly what the `requests` library is designed to do.

# Let's build our first query URL.
#
# **Query String Format**
#
# The query string is always a single JSON object: **{`<field>`:`<value>`}**, where `<field>` is the name of a database field and `<value>` is the value the field will be compared to for equality (each API endpoint section contains a list of the data fields that can be selected for inclusion in output datasets).
#
# We use the following base URL for the Patents Endpoint:
#
# **Base URL**: `https://api.patentsview.org/patents/query?q={criteria}`
#
#
# ## Task example: Pull patents for Stanford University
#
# Let's go to the Patents Endpoint (http://www.patentsview.org/api/patent.html) and find the appropriate field for the organization's name.
#
# The field that we need is called `"assignee_organization"` (organization name, if the assignee is an organization).
#
# > _Note_: **Assignee**: the name of the entity (company, foundation, partnership, holding company, or individual) that owns the patent. In this example we are looking at universities (organization level).

# ### Step 1. Build the URL query
#
# Let's build our first URL query by combining the base URL with one criterion (the name of the `assignee_organization`):
#
# base URL: `https://api.patentsview.org/patents/query?q=` + criterion: `{"assignee_organization":"stanford university"}`

# In[2]:


# Save the URL as a variable.
url = 'https://api.patentsview.org/patents/query?q={"assignee_organization":"stanford university"}'


# ### Step 2. Get the response

# Now let's get the response from the URL defined above, using the `requests` library.

# In[3]:


# Get the response from the URL.
r = requests.get(url)


# ### Step 3. Check the Response Code
#
# Before doing anything with a response in Python, it's a good idea to check its status code.
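# A program can also act on the status code instead of only displaying it. The sketch below is one possible way to do that for the three codes PatentsView documents (`200`, `400`, `500`). The function name `check_status` and the choice of exception types are assumptions made for illustration; they are not part of `requests` or of the PatentsView API.

```python
def check_status(status_code, reason=""):
    """Return True for 200; raise an exception for any other status code.

    Hypothetical helper: `reason` is the "status reason" text sent by the server.
    """
    if status_code == 200:
        return True  # query parameters were valid, results are in the body
    if status_code == 400:
        # invalid query: malformed JSON, or an unknown field or value
        raise ValueError(f"Invalid query (400): {reason}")
    if status_code == 500:
        # the server failed while processing the query
        raise RuntimeError(f"Server error (500): {reason}")
    raise RuntimeError(f"Unexpected status code {status_code}: {reason}")


print(check_status(200))
```

# After a request made with `requests.get`, the helper could be called as `check_status(r.status_code, r.reason)`, since a `requests` response exposes both attributes.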
#
# The following are the response codes for the PatentsView API:
#
# `200` - the query parameters are all valid; the results will be in the body of the response
#
# `400` - the query parameters are not valid, typically either because they are not in valid JSON format, or a specified field or value is not valid; the “status reason” in the header will contain the error message
#
# `500` - there is an internal error with the processing of the query; the “status reason” in the header will contain the error message

# Let's check the status of our response.

# In[4]:


r.status_code  # Check the status code


# If the status code is `200`, we are good to go. Now let's get the content.

# ### Step 4. Get the Content

# After a web server returns a response, you can collect the content you need by parsing its JSON body into Python objects.

# JSON is a way to encode data structures like lists and dictionaries as strings so that they are easy for machines to read. JSON is the primary format in which data is passed back and forth to APIs, and most API servers will send their responses in JSON format.

# In[5]:


json = r.json()  # Parse the JSON response body into a Python dictionary


# By default, we get information on `patent_id`, `patent_number`, and `patent_title`. At the end of the JSON you will see how many results are returned (variable `count`) and the total number of patents found (variable `total_patent_count`).

# In[6]:


json  # View JSON


# There are 143 patents for Stanford University, with 25 of the 143 results returned (we will discuss how to change the number of returned results later in the notebook).

# ### Step 5. Convert JSON to a pandas dataframe

# Now let's convert the JSON into a pandas dataframe.
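# Before converting, it can help to pull the list of records and the two counters out of the parsed response explicitly. The sketch below works on a hand-written dictionary that only imitates the top-level keys described above (`patents`, `count`, `total_patent_count`); the patent entries are placeholders, not real PatentsView records, and the helper name `summarize` is an assumption made for illustration.

```python
# Placeholder payload with the same top-level keys as the response described above.
sample_payload = {
    "patents": [
        {"patent_id": "0000001", "patent_number": "0000001", "patent_title": "Placeholder title A"},
        {"patent_id": "0000002", "patent_number": "0000002", "patent_title": "Placeholder title B"},
    ],
    "count": 2,
    "total_patent_count": 2,
}


def summarize(payload):
    """Return (rows, returned, total) from a parsed response dictionary."""
    rows = payload.get("patents") or []  # treat a missing or null list as empty
    returned = payload.get("count", len(rows))
    total = payload.get("total_patent_count", returned)
    return rows, returned, total


rows, returned, total = summarize(sample_payload)
print(f"{returned} of {total} patents returned")
```

# The `rows` list has the same shape as the list passed to `pd.DataFrame` in the next cell: one dictionary per patent, whose keys become the column names.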
# In[72]:


df = pd.DataFrame(json['patents'])  # Convert to pandas dataframe
df


# ### Checkpoint 1: Pull patent data for another university

# Now try pulling patent data for Georgetown University:
# - build a query URL;
# - make a request;
# - get the response in JSON format;
# - note the total number of patents;
# - convert the JSON to a pandas dataframe.

# In[ ]:





# ## Adding other fields of interest to the query

# Above we were able to pull data with the default information on the patents (`patent_id`, `patent_number`, `patent_title`).
#
# What if we want to know the patent title and patent year?
#
# Let's look for those fields in the API Endpoint (http://www.patentsview.org/api/patent.html), and add them to our query.

# To the URL created above, we will add the fields parameter: `&f=["patent_title","patent_year"]`

# In[ ]:


url = 'https://api.patentsview.org/patents/query?q={"assignee_organization":"stanford university"}&f=["patent_title","patent_year"]'


# In[187]:


r = requests.get(url)  # Get response from the URL
r.status_code  # Check the status code


# In[188]:


json = r.json()  # Parse the JSON response body into a Python dictionary


# In[189]:


json  # View JSON


# ### Checkpoint 2: Add other fields

# Try adding other fields of interest. Go to the Patents Endpoint (http://www.patentsview.org/api/patent.html), pick two other fields of interest to add to the query, and get the results.

# In[ ]:





# ## Customize the number of results

# As you have noticed, by default, only 25 results are returned.
# To change the number of results returned (for example, to 50 results), add the options parameter to the query URL: `&o={"per_page":50}`
#

# In[100]:


url = 'https://api.patentsview.org/patents/query?q={"assignee_organization":"stanford university"}&f=["patent_title","patent_year"]&o={"per_page":50}'


# In[101]:


r = requests.get(url)  # Get response from the new URL


# In[102]:


json = r.json()  # Parse the JSON response body into a Python dictionary


# Now the JSON shows 50 results (as noted in the variable `count` at the bottom of the JSON).

# In[103]:


json


# ### Checkpoint 3: Customize the number of results

# Try customizing the number of returned results using the options parameter.
#
# **Note**: limit the number of results to no more than 100 during the in-class session, to avoid heavy simultaneous use of the API (so the queries can run faster).

# In[ ]:





# ## Optional
#
# Please feel free to explore and practice all available options in the API Query Language section of the PatentsView website (http://www.patentsview.org/api/query-language.html).
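
# The three URL parameters used in this notebook (`q` for criteria, `f` for fields, `o` for options) are all JSON values, so the query URL can also be assembled from ordinary Python objects instead of being typed by hand. The sketch below is one way to do that with the standard library only. The helper name `build_query_url` is hypothetical, and the base URL is simply the one used in the cells above; nothing here contacts the server.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.patentsview.org/patents/query"  # base URL used in this notebook


def build_query_url(criteria, fields=None, options=None, base_url=BASE_URL):
    """Assemble a query URL from Python objects (hypothetical helper)."""
    params = {"q": json.dumps(criteria)}  # criteria dictionary -> JSON object
    if fields is not None:
        params["f"] = json.dumps(fields)  # list of field names -> JSON array
    if options is not None:
        params["o"] = json.dumps(options)  # options dictionary -> JSON object
    # urlencode percent-encodes braces, quotes, and spaces so the URL is valid
    return base_url + "?" + urlencode(params)


example_url = build_query_url(
    {"assignee_organization": "stanford university"},
    fields=["patent_title", "patent_year"],
    options={"per_page": 50},
)
print(example_url)
```

# The resulting `example_url` can be passed to `requests.get` in the same way as the hand-written URLs above. Building the parameters with `json.dumps` avoids mistakes such as a missing quotation mark in the criteria.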