Notebook

Capability Correlations¶

1. Database Matrix of Occurences¶

In order to establish a base for understading the basis of the work, we wish to understand the occurence of severall feedstocks, pricessing technologies and outputs in our database.

For example, how many assets (patents, papers, etc) contain the mix of processing technology X for output Y?

To understand this in a more general way, the AMICA database will be transformed in an co-occurence matrix. This means, in the above described example that the number of assets that contain that mix will be an entry in a matrix, whyle the corresponfing technology and output will be columns/lines of the matrix.

1.1. Getting the data from Neo4j¶

The first step in the analysis is to get the data from the AMICA database. This data is stored in a graph database of the type Neo4j. To get it, cypher queries must be designed.

We start by importing necessary libraries.

In [130]:

from py2neo import Graph
import numpy as np 
from pandas import DataFrame
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
import json
import math
import pandas as pd

After that, we use py2neo to establish a connection to our notebook.

In [2]:

local_connection_url = "http://localhost:7474/db/data"
connection_to_graph = Graph(local_connection_url)

1.2. Designing the queries¶

Two queries to the database were designed, the first, only related to non interescting data, this means, looking for co-occurences of outputs-technologies, technologies-feedstocks and feedstocks-outputs.

In [3]:

query_no_interestions = """     MATCH (a:Asset)-[:CONTAINS]->(fs:Feedstock)
                                MATCH (a:Asset)-[:CONTAINS]->(out:Output)
                                MATCH (a:Asset)-[:CONTAINS]->(pt:ProcessingTech)
                                RETURN fs.term, pt.term, out.term, count(a)
                        """

The second query is designed to also fetch the intersections between different outputs, different processing technologies and different feedstocks.

In [4]:

# issue: this query needs to be divided by two when building the matrix -> NON OPTIMIZED
process_variables = ['Feedstock', 'Output', 'ProcessingTech']
query_intersections = """       MATCH (a:Asset)-[:CONTAINS]->(fs:{})
                                MATCH (a:Asset)-[:CONTAINS]->(t:{})
                                WHERE fs<>t
                                RETURN fs.term, t.term, count(a)
                      """

1.3. Building the Matrix: No Intersections¶

In order to get the axis of our co-occurence matrix we make all the necessary queries and gather all the names that appear in a matrix_axis_names list.

In [5]:

# Return dataframe with no interesections
data_no_intersections = DataFrame(connection_to_graph.data(query_no_interestions)).as_matrix()

# Get axis names from columns and add them in a list
feedstock_names = set(list(data_no_intersections[:, 1]))
processing_technology_names = set(list(data_no_intersections[:, 2]))
output_names = set(list(data_no_intersections[:, 3]))
matrix_axis_names = list(feedstock_names) + list(processing_technology_names) + list(output_names)

# Extra names that only appear in non-intersection query
for category in process_variables:
    data_no_intersections = DataFrame(connection_to_graph.data(query_intersections.format(category, category))).as_matrix()
    for column_number in range(1,3):
        column = data_no_intersections[:, column_number]
        for name in column:
            if name not in matrix_axis_names:
                matrix_axis_names.append(name)

The co-occurence matrix will be called matrix. In order to get things started we need two things:

An empty matrix with the shape of the previously built axis:

In [6]:

matrix = np.zeros([len(matrix_axis_names), len(matrix_axis_names)])

A function that given a certain name, returns the position of that name in a list:

In [7]:

def find_index(something, in_list):
    return in_list.index(something)

Finally, we can start building the first part of the matrix. To do so, we iterated over all of the rows of the data retreived in the data_no_intersections table.

In [8]:

# for every row in original response
for row in data_no_intersections:
    # the last column is the frequency (count)
    frequency = row[0]
    indexes = [find_index(element, matrix_axis_names) for element in row[1::]]
    # add frequency value to matrix position
    for pair in itertools.combinations(indexes, 2):
        matrix[pair[0], pair[1]] += frequency
        matrix[pair[1], pair[0]] += frequency

1.4. Building the Matrix: With Intersections¶

The second part of the matrix that should be built related to the second query. Intersecting categories.

In [9]:

for category in process_variables:
    print 'Processing ', category
    process_data = DataFrame(connection_to_graph.data(query_intersections.format(category, category))).as_matrix()
    for row in process_data:
        frequency = row[0]
        indexes = [find_index(element, matrix_axis_names) for element in row[1::]]
        #     add frequency value to matrix position
        for pair in itertools.combinations(indexes, 2):
            matrix[pair[0], pair[1]] += frequency / 2 # Divided by two because query not optimized
            matrix[pair[1], pair[0]] += frequency / 2 # Divided by two because query not optimized

Processing  Feedstock
Processing  Output
Processing  ProcessingTech

1.5. Testing the co-occurence matrix¶

Let's start by getting some basic stats:

In [10]:

print 'Rows:', matrix.shape[0]
print 'Columns:', matrix.shape[1]
print 'Mean: ', np.mean(matrix)
print 'Standart Deviation', np.std(matrix)
print 'Max: ', np.amax(matrix)
print 'Min: ', np.amin(matrix)

Rows: 342
Columns: 342
Mean:  1.3187989466844499
Standart Deviation 15.649910770705663
Max:  1626.0
Min:  0.0

Normalize it:

In [11]:

normalized_matrix = (matrix - np.mean(matrix)) / np.std(matrix)

Let's check if the normalized and non-normalized matrixes are symetrical (they should).

In [12]:

def check_symmetric(a, tol):
    return np.allclose(a, a.T, atol=tol)

print 'The non normalized matrix is symmetrical: {}'.format(check_symmetric(matrix, 1e-8))
print 'The normalized matrix is symmetrical: {}'.format(check_symmetric(normalized_matrix, 1e-8))

The non normalized matrix is symmetrical: True
The normalized matrix is symmetrical: True

The diagonal of the non-normalized matrix is all zeros:

In [13]:

for column_number in range(matrix.shape[0]):
    if matrix[column_number, column_number] != 0:
        print 'Non-zero entry found in entry {}'.format(column_number)

1.6. Printing the matrix¶

In [14]:

plt.subplots(2,1,figsize=(15,15))

plt.subplot(121)
sns.heatmap(matrix, cmap='BuPu', cbar=None, square=True, xticklabels=False, yticklabels=False)
plt.title('Co-occurence matrix heatmap: Non-normalized')

plt.subplot(122)
sns.heatmap(normalized_matrix, cmap='BuPu', cbar=None, square=True, xticklabels=False, yticklabels=False)
plt.title('Co-occurence matrix heatmap: Normalized')
plt.show()

2. Co-occurence by country¶

In this part of the analysis, severall co-occurence matrixes will be produced. Ideally, one for every country in the database.

2.1. A co-occurence matrix given a country¶

We start by getting a list of all of the countries in the neo4j database.

In [15]:

country_query = """ MATCH (n:Country) 
                    WITH n.name AS Country 
                    RETURN Country;
                    """

country_names = list(set(DataFrame(connection_to_graph.data(country_query)).as_matrix()[:, 0]))
country_names.sort()

print 'The country list has {} countries.'.format(len(country_names))
print country_names

The country list has 145 countries.
[u'Albania', u'Algeria', u'Argentina', u'Armenia', u'Asia', u'Australia', u'Austria', u'Azerbaijan', u'Bangladesh', u'Belarus', u'Belgium', u'Benin', u'Bolivia', u'Bosnia and Herzegovina', u'Botswana', u'Brazil', u'Brunei', u'Bulgaria', u'Burkina Faso', u'Burundi', u'Cambodia', u'Cameroon', u'Canada', u'Cape Verde', u'Central African Republic', u'Chile', u'Colombia', u'Costa Rica', u'Croatia', u'Cuba', u'Cyprus', u'Czech Republic', u'Denmark', u'Dominican Republic', u'Dominion of Pakistan', u'Ecuador', u'Egypt', u'El Salvador', u'Estonia', u'Ethiopia', u'European Patent Office', u'Falkland Islands', u'Faroe Islands', u'Fiji', u'Finland', u'France', u'Georgia', u'Germany', u'Ghana', u'Greece', u'Guatemala', u'Guinea', u'Honduras', u'Hong Kong', u'Hungary', u'Iceland', u'India', u'Indonesia', u'Iran', u'Iraq', u'Ireland', u'Israel', u'Italy', u'Jamaica', u'Japan', u'Jordan', u'Kazakhstan', u'Kenya', u'Korea', u'Kuwait', u'Latvia', u'Lebanon', u'Libya', u'Liechtenstein', u'Lithuania', u'Luxembourg', u'Macedonia', u'Madagascar', u'Malawi', u'Malaysia', u'Mali', u'Malta', u'Mauritius', u'Mexico', u'Moldova', u'Montenegro', u'Morocco', u'Mozambique', u'Namibia', u'Nepal', u'Netherlands', u'New Zealand', u'Nicaragua', u'Nigeria', u'North Korea', u'Norway', u'Null', u'Oman', u'Pakistan', u'Panama', u'Paraguay', u"People's Republic of China", u'Peru', u'Philippines', u'Poland', u'Portugal', u'Qatar', u'Republic of Macedonia', u'Romania', u'Russia', u'Saint Vincent and the Grenadines', u'Saudi Arabia', u'Scotland', u'Senegal', u'Serbia', u'Singapore', u'Slovakia', u'Slovenia', u'South Africa', u'South Korea', u'Spain', u'Sri Lanka', u'Sudan', u'Swaziland', u'Sweden', u'Switzerland', u'Taiwan', u'Tanzania', u'Thailand', u'Togo', u'Trinidad and Tobago', u'Tunisia', u'Turkey', u'Uganda', u'Ukraine', u'United Arab Emirates', u'United Kingdom', u'United States of America', u'Uruguay', u'Venezuela', u'Vietnam', u'WO', u'Yemen', u'Zambia', u'Zimbabwe']

After doing this, we prepare a function that given a certain country, will retrieve the co-occurence matrix.

In [16]:

def get_country_matrix(country, normalization=True):
    
    # define queries
    country_no_interestions = """   MATCH (a:Asset)-[:CONTAINS]->(fs:Feedstock)
                                    MATCH (a:Asset)-[:CONTAINS]->(out:Output)
                                    MATCH (a:Asset)-[:CONTAINS]->(pt:ProcessingTech)
                                    WHERE a.country = "{}"
                                    RETURN fs.term, pt.term, out.term, count(a)
                                    """.format(country)
    
    process_variables = ['Feedstock', 'Output', 'ProcessingTech']
    
    country_intersections = """     MATCH (a:Asset)-[:CONTAINS]->(fs:{})
                                    MATCH (a:Asset)-[:CONTAINS]->(t:{})
                                    WHERE fs<>t AND a.country = "{}"
                                    RETURN fs.term, t.term, count(a)
                                    """
    # get data
    data_no_intersections = DataFrame(connection_to_graph.data(country_no_interestions)).as_matrix()
    
    # create matrix
    country_matrix = np.zeros([len(matrix_axis_names), len(matrix_axis_names)])
    
    # for no intersections data
    for row in data_no_intersections:
        # the last column is the frequency (count)
        frequency = row[0]
        indexes = [find_index(element, matrix_axis_names) for element in row[1::]]
        # add frequency value to matrix position
        for pair in itertools.combinations(indexes, 2):
            country_matrix[pair[0], pair[1]] += frequency
            country_matrix[pair[1], pair[0]] += frequency
    
    # for intersecting data
    for category in process_variables:
        process_data = DataFrame(connection_to_graph.data(country_intersections.format(category, category, country))).as_matrix()
        for row in process_data:
            frequency = row[0]
            indexes = [find_index(element, matrix_axis_names) for element in row[1::]]
            # add frequency value to matrix position
            for pair in itertools.combinations(indexes, 2):
                country_matrix[pair[0], pair[1]] += frequency / 2 # Divided by two because query not optimized
                country_matrix[pair[1], pair[0]] += frequency / 2 # Divided by two because query not optimized
    
    # normalize
    normalized_country_matrix = (country_matrix - np.mean(country_matrix)) / np.std(country_matrix)
    
    # dynamic return 
    if normalization == True:
        return normalized_country_matrix
    else: 
        return country_matrix

Let's create a function that returns basic stats given a matrix.

In [17]:

def basic_stats(a_matrix):
    print 'Rows:', a_matrix.shape[0]
    print 'Columns:', a_matrix.shape[1]
    print 'Mean: ', np.mean(a_matrix)
    print 'Standart Deviation', np.std(a_matrix)
    print 'Max: ', np.amax(a_matrix)
    print 'Min: ', np.amin(a_matrix)
    print 'Symmetry: ', check_symmetric(matrix, 1e-8)
    print ''

Let's test a couple of countries.

In [18]:

print 'Denmark co-occurence matrix stats:'
basic_stats(get_country_matrix('Denmark', normalization=True))
print 'Sweden co-occurence matrix stats:'
basic_stats(get_country_matrix('Sweden', normalization=True))

Denmark co-occurence matrix stats:
Rows: 342
Columns: 342
Mean:  2.2477070913378225e-17
Standart Deviation 1.0
Max:  98.71619845088439
Min:  -0.043307401534452276
Symmetry:  True

Sweden co-occurence matrix stats:
Rows: 342
Columns: 342
Mean:  7.046865475545606e-18
Standart Deviation 1.0
Max:  111.13602822340711
Min:  -0.06205868957115324
Symmetry:  True

2.1. Transforming the co-occurence matrix into a list¶

One of the goals of the analysis is to understand how each relates to another. To do this, we will need to transform the matrix of a given country into an array. After doing this we will be able to compare the array of each one of the countries, by computing their difference for example.

Let's start by creating a function that given a symetric matrix, as the ones shown above, returns a list. This list will have an entry by position in the matrix.

This means that for a matrix of dimensions 342x342 the list will have a total of 58 482 entries.

In [19]:

def get_list_from(matrix):
    only_valuable = []
    extension = 1
    for row_number in range(matrix.shape[0]):
        only_valuable.append(matrix[row_number, extension:matrix.shape[0]].tolist()) # numpy functions keep 0s so I hard coded it. 
        extension += 1 
    return [element for column in only_valuable for element in column ]

2.2. Visualizing correlations¶

In [20]:

country_1_ = 'United States of America'
country_2_ = '''People's Republic of China'''

country_1_list = get_list_from(get_country_matrix(country_1_, normalization=True))
country_2_list = get_list_from(get_country_matrix(country_2_, normalization=True))

corelation = np.vstack((country_1_list, country_2_list))

plt.subplots(1,1,figsize=(20, 5))
plt.subplot(111)
sns.heatmap(corelation, cmap='flag', cbar=None, square=False, yticklabels=[country_1_, country_2_], xticklabels=False)
plt.title('Country Correlation Spectrum'.format(country_1_, country_2_))
plt.show()

We can see that the spectrum differs significantly in some areas. Please note that all of the measure were normalized prior to the plotting

2.3. Creating a correlation matrix¶

In this part of the analysis we will start correlating countries in relation to their capabilities.

The correlation matrix follows the following principle:

Entry (i, j) is the correlation index between country i and country j
The correlation index between country i and j is the absolute average difference between the capability list of country i and j.

Now, taking the list of countries previously established, we can iterate through it and fill the matrix.

To improve efficieency, we first create a dictionnary where each key is a country, and each value, the capability list. Taking into consideration that if a country has no records, we will discard it.

In [21]:

# create dictionnary
country_capability_dict = {}
counter = 0

# iterate through countries
for country in country_names:
    counter += 1
    country_matrix = get_country_matrix(country, normalization=True)
    
    if np.all(np.isnan(country_matrix)):
        continue
    else: 
        country_capability_dict[country] = get_list_from(country_matrix)

/Users/duarteocarmo/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:46: RuntimeWarning: invalid value encountered in divide

We can now fill the matrix

In [22]:

country_names = country_capability_dict.keys()
country_names.sort()
number_of_countries = len(country_names)
country_correlation = np.zeros([number_of_countries, number_of_countries])

for row in range(number_of_countries):
    print 'Processing country {} / {} \r'.format(row, number_of_countries),
    country_1 = country_names[row]
    country_1_list = np.asarray(country_capability_dict[country_1])
    for column in range(number_of_countries):
        country_2 = country_names[column]
        country_2_list = np.asarray(country_capability_dict[country_2])
        
        avg_dif = np.mean(country_1_list - country_2_list)
        abs_avg_dif = abs(avg_dif)

        country_correlation[row, column] = abs_avg_dif

Processing country 75 / 76

In [28]:

colors = ['RdPu'] # , 'BuPu' for other viz

for color in colors: 
    plt.subplots(1,1,figsize=(15, 15))
    plt.subplot(111)
    sns.heatmap(country_correlation, cmap='{}_r'.format(color), cbar=None, square=True, yticklabels=country_names, xticklabels=country_names)
    plt.title('Country Correlation Matrix')
    plt.show()

In this result, the darker the color in the heatmap, the more correlated the countries are.

2.4. Zooming into a particular country¶

Let's use a histogram to visualize the correlation of a country with the other countries. This can be thought of as 'Slicing' the matrix above.

In [124]:

countries = ['Denmark', 'Sweden']

for country in countries:
    first_country_in_focus = country
    first_country_in_focus_index = find_index(first_country_in_focus, country_names)
    first_histogram_data = country_correlation[:, first_country_in_focus_index]
    
    plt.subplots(1,1,figsize=(15,7))
    sns.barplot(np.arange(len(first_histogram_data)), first_histogram_data)
    plt.xticks(np.arange(len(first_histogram_data)), country_names, rotation=90, fontsize=11)
    plt.title('Country Profile: {}'.format(first_country_in_focus))
    plt.ylabel('Difference Quantity')
    plt.show()

In this result, the taller the bar, the more different the capabilities are.

3. Contextual Relations¶

We will now study the relationship between the previously found correlations and other characteristics.

Let's start with the GDP, in a database from the IMF

We start by finding the countries in our databse that do not have GDP data available.

In [172]:

data = pd.read_csv('gdp.csv', delimiter=';', header=None).as_matrix()
for country in country_names:
    if country not in data[:, 0]:
        print country

Asia
European Patent Office
Null
Scotland
South Korea
WO

/Users/duarteocarmo/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  app.launch_new_instance()

After that, we delete those entries from the country_correlation matrix and create an adapted_country_correlation.

In [173]:

countries_not_available = ['Asia', 'European Patent Office', 'Korea', 'Null', 'Scotland', 'South Korea', 'WO']
index_countries_not_available = [find_index(country, country_names) for country in countries_not_available]

adapted_country_correlation = np.delete(country_correlation, index_countries_not_available, 0)
adapted_country_correlation = np.delete(adapted_country_correlation, index_countries_not_available, 1)

We create a matrix that correlates gdps

In [175]:

gdps = np.zeros([adapted_country_correlation.shape[0], adapted_country_correlation.shape[0]])
countries_available = [country for country in country_names if country not in countries_not_available]
countries_available.sort()

for row in range(len(countries_available)):
    country_1 = countries_available[row]
    country_1_gdp = float(data[find_index(country_1, data[:, 0].tolist()), 1])
    for column in range(len(countries_available)):
        country_2 = countries_available[column]
        country_2_gdp = float(data[find_index(country_2, data[:, 0].tolist()), 1])
    
        gdps[row, column] = abs(country_1_gdp - country_2_gdp)

gdps_norm = (gdps - np.mean(gdps)) / np.std(gdps)

/Users/duarteocarmo/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:2: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  from ipykernel import kernelapp as app

We plot the GDP correlation matrix

In [191]:

plt.subplots(2,1,figsize=(30,30))
plt.subplot(121)
sns.heatmap(gdps, cmap='RdPu_r', cbar=None, square=True, yticklabels=countries_available, xticklabels=countries_available)
plt.title('Country GDP Matrix')

plt.subplot(122)
sns.heatmap(adapted_country_correlation, cmap='RdPu_r', cbar=None, square=True, xticklabels=countries_available, yticklabels=False)
plt.title('Capability Correlation')
plt.show()

In [ ]: