In order to establish a base for understading the basis of the work, we wish to understand the occurence of severall feedstocks, pricessing technologies and outputs in our database.
For example, how many assets (patents, papers, etc) contain the mix of processing technology X for output Y?
To understand this in a more general way, the AMICA database will be transformed in an co-occurence matrix. This means, in the above described example that the number of assets that contain that mix will be an entry in a matrix, whyle the corresponfing technology and output will be columns/lines of the matrix.
from py2neo import Graph
import numpy as np
from pandas import DataFrame
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
import json
import math
import pandas as pd
After that, we use py2neo to establish a connection to our notebook.
local_connection_url = "http://localhost:7474/db/data"
connection_to_graph = Graph(local_connection_url)
Two queries to the database were designed, the first, only related to non interescting data, this means, looking for co-occurences of outputs-technologies, technologies-feedstocks and feedstocks-outputs.
query_no_interestions = """ MATCH (a:Asset)-[:CONTAINS]->(fs:Feedstock)
MATCH (a:Asset)-[:CONTAINS]->(out:Output)
MATCH (a:Asset)-[:CONTAINS]->(pt:ProcessingTech)
RETURN fs.term, pt.term, out.term, count(a)
"""
The second query is designed to also fetch the intersections between different outputs, different processing technologies and different feedstocks.
# issue: this query needs to be divided by two when building the matrix -> NON OPTIMIZED
process_variables = ['Feedstock', 'Output', 'ProcessingTech']
query_intersections = """ MATCH (a:Asset)-[:CONTAINS]->(fs:{})
MATCH (a:Asset)-[:CONTAINS]->(t:{})
WHERE fs<>t
RETURN fs.term, t.term, count(a)
"""
In order to get the axis of our co-occurence matrix we make all the necessary queries and gather all the names that appear in a matrix_axis_names
list.
# Return dataframe with no interesections
data_no_intersections = DataFrame(connection_to_graph.data(query_no_interestions)).as_matrix()
# Get axis names from columns and add them in a list
feedstock_names = set(list(data_no_intersections[:, 1]))
processing_technology_names = set(list(data_no_intersections[:, 2]))
output_names = set(list(data_no_intersections[:, 3]))
matrix_axis_names = list(feedstock_names) + list(processing_technology_names) + list(output_names)
# Extra names that only appear in non-intersection query
for category in process_variables:
data_no_intersections = DataFrame(connection_to_graph.data(query_intersections.format(category, category))).as_matrix()
for column_number in range(1,3):
column = data_no_intersections[:, column_number]
for name in column:
if name not in matrix_axis_names:
matrix_axis_names.append(name)
The co-occurence matrix will be called matrix
. In order to get things started we need two things:
matrix = np.zeros([len(matrix_axis_names), len(matrix_axis_names)])
def find_index(something, in_list):
return in_list.index(something)
Finally, we can start building the first part of the matrix. To do so, we iterated over all of the rows of the data retreived in the data_no_intersections
table.
# for every row in original response
for row in data_no_intersections:
# the last column is the frequency (count)
frequency = row[0]
indexes = [find_index(element, matrix_axis_names) for element in row[1::]]
# add frequency value to matrix position
for pair in itertools.combinations(indexes, 2):
matrix[pair[0], pair[1]] += frequency
matrix[pair[1], pair[0]] += frequency
The second part of the matrix that should be built related to the second query. Intersecting categories.
for category in process_variables:
print 'Processing ', category
process_data = DataFrame(connection_to_graph.data(query_intersections.format(category, category))).as_matrix()
for row in process_data:
frequency = row[0]
indexes = [find_index(element, matrix_axis_names) for element in row[1::]]
# add frequency value to matrix position
for pair in itertools.combinations(indexes, 2):
matrix[pair[0], pair[1]] += frequency / 2 # Divided by two because query not optimized
matrix[pair[1], pair[0]] += frequency / 2 # Divided by two because query not optimized
Processing Feedstock Processing Output Processing ProcessingTech
Let's start by getting some basic stats:
print 'Rows:', matrix.shape[0]
print 'Columns:', matrix.shape[1]
print 'Mean: ', np.mean(matrix)
print 'Standart Deviation', np.std(matrix)
print 'Max: ', np.amax(matrix)
print 'Min: ', np.amin(matrix)
Rows: 342 Columns: 342 Mean: 1.3187989466844499 Standart Deviation 15.649910770705663 Max: 1626.0 Min: 0.0
Normalize it:
normalized_matrix = (matrix - np.mean(matrix)) / np.std(matrix)
Let's check if the normalized and non-normalized matrixes are symetrical (they should).
def check_symmetric(a, tol):
return np.allclose(a, a.T, atol=tol)
print 'The non normalized matrix is symmetrical: {}'.format(check_symmetric(matrix, 1e-8))
print 'The normalized matrix is symmetrical: {}'.format(check_symmetric(normalized_matrix, 1e-8))
The non normalized matrix is symmetrical: True The normalized matrix is symmetrical: True
The diagonal of the non-normalized matrix is all zeros:
for column_number in range(matrix.shape[0]):
if matrix[column_number, column_number] != 0:
print 'Non-zero entry found in entry {}'.format(column_number)
plt.subplots(2,1,figsize=(15,15))
plt.subplot(121)
sns.heatmap(matrix, cmap='BuPu', cbar=None, square=True, xticklabels=False, yticklabels=False)
plt.title('Co-occurence matrix heatmap: Non-normalized')
plt.subplot(122)
sns.heatmap(normalized_matrix, cmap='BuPu', cbar=None, square=True, xticklabels=False, yticklabels=False)
plt.title('Co-occurence matrix heatmap: Normalized')
plt.show()
In this part of the analysis, severall co-occurence matrixes will be produced. Ideally, one for every country in the database.
We start by getting a list of all of the countries in the neo4j database.
country_query = """ MATCH (n:Country)
WITH n.name AS Country
RETURN Country;
"""
country_names = list(set(DataFrame(connection_to_graph.data(country_query)).as_matrix()[:, 0]))
country_names.sort()
print 'The country list has {} countries.'.format(len(country_names))
print country_names
The country list has 145 countries. [u'Albania', u'Algeria', u'Argentina', u'Armenia', u'Asia', u'Australia', u'Austria', u'Azerbaijan', u'Bangladesh', u'Belarus', u'Belgium', u'Benin', u'Bolivia', u'Bosnia and Herzegovina', u'Botswana', u'Brazil', u'Brunei', u'Bulgaria', u'Burkina Faso', u'Burundi', u'Cambodia', u'Cameroon', u'Canada', u'Cape Verde', u'Central African Republic', u'Chile', u'Colombia', u'Costa Rica', u'Croatia', u'Cuba', u'Cyprus', u'Czech Republic', u'Denmark', u'Dominican Republic', u'Dominion of Pakistan', u'Ecuador', u'Egypt', u'El Salvador', u'Estonia', u'Ethiopia', u'European Patent Office', u'Falkland Islands', u'Faroe Islands', u'Fiji', u'Finland', u'France', u'Georgia', u'Germany', u'Ghana', u'Greece', u'Guatemala', u'Guinea', u'Honduras', u'Hong Kong', u'Hungary', u'Iceland', u'India', u'Indonesia', u'Iran', u'Iraq', u'Ireland', u'Israel', u'Italy', u'Jamaica', u'Japan', u'Jordan', u'Kazakhstan', u'Kenya', u'Korea', u'Kuwait', u'Latvia', u'Lebanon', u'Libya', u'Liechtenstein', u'Lithuania', u'Luxembourg', u'Macedonia', u'Madagascar', u'Malawi', u'Malaysia', u'Mali', u'Malta', u'Mauritius', u'Mexico', u'Moldova', u'Montenegro', u'Morocco', u'Mozambique', u'Namibia', u'Nepal', u'Netherlands', u'New Zealand', u'Nicaragua', u'Nigeria', u'North Korea', u'Norway', u'Null', u'Oman', u'Pakistan', u'Panama', u'Paraguay', u"People's Republic of China", u'Peru', u'Philippines', u'Poland', u'Portugal', u'Qatar', u'Republic of Macedonia', u'Romania', u'Russia', u'Saint Vincent and the Grenadines', u'Saudi Arabia', u'Scotland', u'Senegal', u'Serbia', u'Singapore', u'Slovakia', u'Slovenia', u'South Africa', u'South Korea', u'Spain', u'Sri Lanka', u'Sudan', u'Swaziland', u'Sweden', u'Switzerland', u'Taiwan', u'Tanzania', u'Thailand', u'Togo', u'Trinidad and Tobago', u'Tunisia', u'Turkey', u'Uganda', u'Ukraine', u'United Arab Emirates', u'United Kingdom', u'United States of America', u'Uruguay', u'Venezuela', u'Vietnam', u'WO', u'Yemen', u'Zambia', u'Zimbabwe']
After doing this, we prepare a function that given a certain country, will retrieve the co-occurence matrix.
def get_country_matrix(country, normalization=True):
# define queries
country_no_interestions = """ MATCH (a:Asset)-[:CONTAINS]->(fs:Feedstock)
MATCH (a:Asset)-[:CONTAINS]->(out:Output)
MATCH (a:Asset)-[:CONTAINS]->(pt:ProcessingTech)
WHERE a.country = "{}"
RETURN fs.term, pt.term, out.term, count(a)
""".format(country)
process_variables = ['Feedstock', 'Output', 'ProcessingTech']
country_intersections = """ MATCH (a:Asset)-[:CONTAINS]->(fs:{})
MATCH (a:Asset)-[:CONTAINS]->(t:{})
WHERE fs<>t AND a.country = "{}"
RETURN fs.term, t.term, count(a)
"""
# get data
data_no_intersections = DataFrame(connection_to_graph.data(country_no_interestions)).as_matrix()
# create matrix
country_matrix = np.zeros([len(matrix_axis_names), len(matrix_axis_names)])
# for no intersections data
for row in data_no_intersections:
# the last column is the frequency (count)
frequency = row[0]
indexes = [find_index(element, matrix_axis_names) for element in row[1::]]
# add frequency value to matrix position
for pair in itertools.combinations(indexes, 2):
country_matrix[pair[0], pair[1]] += frequency
country_matrix[pair[1], pair[0]] += frequency
# for intersecting data
for category in process_variables:
process_data = DataFrame(connection_to_graph.data(country_intersections.format(category, category, country))).as_matrix()
for row in process_data:
frequency = row[0]
indexes = [find_index(element, matrix_axis_names) for element in row[1::]]
# add frequency value to matrix position
for pair in itertools.combinations(indexes, 2):
country_matrix[pair[0], pair[1]] += frequency / 2 # Divided by two because query not optimized
country_matrix[pair[1], pair[0]] += frequency / 2 # Divided by two because query not optimized
# normalize
normalized_country_matrix = (country_matrix - np.mean(country_matrix)) / np.std(country_matrix)
# dynamic return
if normalization == True:
return normalized_country_matrix
else:
return country_matrix
Let's create a function that returns basic stats given a matrix.
def basic_stats(a_matrix):
print 'Rows:', a_matrix.shape[0]
print 'Columns:', a_matrix.shape[1]
print 'Mean: ', np.mean(a_matrix)
print 'Standart Deviation', np.std(a_matrix)
print 'Max: ', np.amax(a_matrix)
print 'Min: ', np.amin(a_matrix)
print 'Symmetry: ', check_symmetric(matrix, 1e-8)
print ''
Let's test a couple of countries.
print 'Denmark co-occurence matrix stats:'
basic_stats(get_country_matrix('Denmark', normalization=True))
print 'Sweden co-occurence matrix stats:'
basic_stats(get_country_matrix('Sweden', normalization=True))
Denmark co-occurence matrix stats: Rows: 342 Columns: 342 Mean: 2.2477070913378225e-17 Standart Deviation 1.0 Max: 98.71619845088439 Min: -0.043307401534452276 Symmetry: True Sweden co-occurence matrix stats: Rows: 342 Columns: 342 Mean: 7.046865475545606e-18 Standart Deviation 1.0 Max: 111.13602822340711 Min: -0.06205868957115324 Symmetry: True
One of the goals of the analysis is to understand how each relates to another. To do this, we will need to transform the matrix of a given country into an array. After doing this we will be able to compare the array of each one of the countries, by computing their difference for example.
Let's start by creating a function that given a symetric matrix, as the ones shown above, returns a list. This list will have an entry by position in the matrix.
This means that for a matrix of dimensions 342x342 the list will have a total of 58 482 entries.
def get_list_from(matrix):
only_valuable = []
extension = 1
for row_number in range(matrix.shape[0]):
only_valuable.append(matrix[row_number, extension:matrix.shape[0]].tolist()) # numpy functions keep 0s so I hard coded it.
extension += 1
return [element for column in only_valuable for element in column ]
country_1_ = 'United States of America'
country_2_ = '''People's Republic of China'''
country_1_list = get_list_from(get_country_matrix(country_1_, normalization=True))
country_2_list = get_list_from(get_country_matrix(country_2_, normalization=True))
corelation = np.vstack((country_1_list, country_2_list))
plt.subplots(1,1,figsize=(20, 5))
plt.subplot(111)
sns.heatmap(corelation, cmap='flag', cbar=None, square=False, yticklabels=[country_1_, country_2_], xticklabels=False)
plt.title('Country Correlation Spectrum'.format(country_1_, country_2_))
plt.show()
We can see that the spectrum differs significantly in some areas. Please note that all of the measure were normalized prior to the plotting
In this part of the analysis we will start correlating countries in relation to their capabilities.
The correlation matrix follows the following principle:
Now, taking the list of countries previously established, we can iterate through it and fill the matrix.
To improve efficieency, we first create a dictionnary where each key is a country, and each value, the capability list. Taking into consideration that if a country has no records, we will discard it.
# create dictionnary
country_capability_dict = {}
counter = 0
# iterate through countries
for country in country_names:
counter += 1
country_matrix = get_country_matrix(country, normalization=True)
if np.all(np.isnan(country_matrix)):
continue
else:
country_capability_dict[country] = get_list_from(country_matrix)
/Users/duarteocarmo/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:46: RuntimeWarning: invalid value encountered in divide
We can now fill the matrix
country_names = country_capability_dict.keys()
country_names.sort()
number_of_countries = len(country_names)
country_correlation = np.zeros([number_of_countries, number_of_countries])
for row in range(number_of_countries):
print 'Processing country {} / {} \r'.format(row, number_of_countries),
country_1 = country_names[row]
country_1_list = np.asarray(country_capability_dict[country_1])
for column in range(number_of_countries):
country_2 = country_names[column]
country_2_list = np.asarray(country_capability_dict[country_2])
avg_dif = np.mean(country_1_list - country_2_list)
abs_avg_dif = abs(avg_dif)
country_correlation[row, column] = abs_avg_dif
Processing country 75 / 76
colors = ['RdPu'] # , 'BuPu' for other viz
for color in colors:
plt.subplots(1,1,figsize=(15, 15))
plt.subplot(111)
sns.heatmap(country_correlation, cmap='{}_r'.format(color), cbar=None, square=True, yticklabels=country_names, xticklabels=country_names)
plt.title('Country Correlation Matrix')
plt.show()
In this result, the darker the color in the heatmap, the more correlated the countries are.
Let's use a histogram to visualize the correlation of a country with the other countries. This can be thought of as 'Slicing' the matrix above.
countries = ['Denmark', 'Sweden']
for country in countries:
first_country_in_focus = country
first_country_in_focus_index = find_index(first_country_in_focus, country_names)
first_histogram_data = country_correlation[:, first_country_in_focus_index]
plt.subplots(1,1,figsize=(15,7))
sns.barplot(np.arange(len(first_histogram_data)), first_histogram_data)
plt.xticks(np.arange(len(first_histogram_data)), country_names, rotation=90, fontsize=11)
plt.title('Country Profile: {}'.format(first_country_in_focus))
plt.ylabel('Difference Quantity')
plt.show()
In this result, the taller the bar, the more different the capabilities are.
We will now study the relationship between the previously found correlations and other characteristics.
Let's start with the GDP, in a database from the IMF
We start by finding the countries in our databse that do not have GDP data available.
data = pd.read_csv('gdp.csv', delimiter=';', header=None).as_matrix()
for country in country_names:
if country not in data[:, 0]:
print country
Asia European Patent Office Null Scotland South Korea WO
/Users/duarteocarmo/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal app.launch_new_instance()
After that, we delete those entries from the country_correlation
matrix and create an adapted_country_correlation
.
countries_not_available = ['Asia', 'European Patent Office', 'Korea', 'Null', 'Scotland', 'South Korea', 'WO']
index_countries_not_available = [find_index(country, country_names) for country in countries_not_available]
adapted_country_correlation = np.delete(country_correlation, index_countries_not_available, 0)
adapted_country_correlation = np.delete(adapted_country_correlation, index_countries_not_available, 1)
We create a matrix that correlates gdps
gdps = np.zeros([adapted_country_correlation.shape[0], adapted_country_correlation.shape[0]])
countries_available = [country for country in country_names if country not in countries_not_available]
countries_available.sort()
for row in range(len(countries_available)):
country_1 = countries_available[row]
country_1_gdp = float(data[find_index(country_1, data[:, 0].tolist()), 1])
for column in range(len(countries_available)):
country_2 = countries_available[column]
country_2_gdp = float(data[find_index(country_2, data[:, 0].tolist()), 1])
gdps[row, column] = abs(country_1_gdp - country_2_gdp)
gdps_norm = (gdps - np.mean(gdps)) / np.std(gdps)
/Users/duarteocarmo/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:2: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal from ipykernel import kernelapp as app
We plot the GDP correlation matrix
plt.subplots(2,1,figsize=(30,30))
plt.subplot(121)
sns.heatmap(gdps, cmap='RdPu_r', cbar=None, square=True, yticklabels=countries_available, xticklabels=countries_available)
plt.title('Country GDP Matrix')
plt.subplot(122)
sns.heatmap(adapted_country_correlation, cmap='RdPu_r', cbar=None, square=True, xticklabels=countries_available, yticklabels=False)
plt.title('Capability Correlation')
plt.show()