Getting Started¶

The base containers include:

a PostgreSQL database
a Jupyter notebook server containing a range of scientific python packages;

The notebook container is lacking some libraries and packages we need to make life easier... The -qq and --quiet flags make the install a quiet one...

In [ ]:

!sudo apt-get update -qq  && sudo apt-get install -qq -y libpq-dev python-dev
!pip install --quiet ipython-sql psycopg2

Jupyter notebooks support SQLMagic. This lets us run SQL commands in a code cell.

In [1]:

%load_ext sql
#This is how we connect to a sql database
#Monolithic VM addressing style
%sql postgresql://postgres:PGPass@postgres:5432/postgres

Out[1]:

'Connected: postgres@postgres'

We can also run SQL commands on the database using the pandas python package.

In [2]:

from sqlalchemy import create_engine
engine = create_engine("postgresql://postgres:PGPass@postgres:5432/postgres")

Importing the base orientation data¶

Loading CSV data directly into a PostgreSQL database often required the CSV to be well behaved; pandas is a bit more forgiving on the import and can be used to load the data from the original CSV file and then insert it into the PostgreSQL database. As pandas dataframes are stored in memory we need to load the data in from the source file in chunks of several thouasnds of lines at a time.

In [ ]:

fn="companies-with-controlling-entities v6.csv"

In [ ]:

!head "$fn"

The PostgreSQL database table will be defined based on the dataframe column names properties - we may need to force column datatypes when we read in the datafile.

In [3]:

#Import the data from the CSV file
import pandas as pd

In [ ]:

#We'll import the base data into a table called: sigcontrol
#Drop the table if it already exists so we start from a blank state
%sql DROP TABLE IF EXISTS sigcontrolchunks=pd.read_csv(fn,chunksize=10000,dtype=str)

#Note that we can also read in from a zip file - as long as it isn't password protected:
#pd.read_csv(fn,chunksize=10000,dtype=str, compression='gzip')
for chunk in chunks:
    #Pop the data into the 
    chunk.to_sql('sigcontrol',engine,index=False, if_exists='append')

In [ ]:

%sql SELECT * FROM sigcontrol LIMIT 3

Orientation Data Exercises¶

To get you started with the data, here are some opening questions, with queries to solve them:

In [29]:

#Who controls G4S?
%sql SELECT * FROM sigcontrol WHERE company_name LIKE '%G4S%' LIMIT 3

3 rows affected.

Out[29]:

company_number	company_name	jurisdiction_code	Controlling Entity Name	opencorporates_url	address_care_of	address_country	address_locality	address_postal_code	address_region	address_street	controlling_company_type	country_of_residence	dob_month	dob_year	family_name	given_name	middle_name	nationality	po_box	title	uid	ultimate_entity_company_number	controlling_entity_company_number
00928471	G4S 084 (UK) LIMITED	gb	G4S PLC	https://opencorporates.com/companies/gb/00928471	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	04992207	04992207
01956974	G4S HOLDINGS 38 (UK) LIMITED	gb	G4S HOLDINGS 3 (UK) LIMITED	https://opencorporates.com/companies/gb/01956974	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	02380914	02380914
02626613	G4S MONITORING TECHNOLOGIES LIMITED	gb	G4S CARE AND JUSTICE SERVICES (UK) LIMITED	https://opencorporates.com/companies/gb/02626613	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	None	00390328	00390328

In [34]:

# Who controls Whitbread?
df=%sql SELECT * FROM sigcontrol WHERE company_name LIKE '%WHITBREAD%'

38 rows affected.

We can create a graph that allows us to see how these companies are related to each other. The networkx library is a handy tool for this.

In [41]:

import networkx as nx

def companyInterestsGrapher(row,DG):
    DG.add_node(row['company_number'], Label=row['company_name'])
    DG.add_node(row['controlling_entity_company_number'], Label=row['Controlling Entity Name'])
    DG.add_edge(row['company_number'],row['controlling_entity_company_number'])
    return

In [42]:

DG=nx.DiGraph()

In [44]:

tmp=df.DataFrame().apply(lambda x: companyInterestsGrapher(x,DG), axis=1)

In [46]:

DG.number_of_nodes()

Out[46]:

In [65]:

nx.draw(DG,with_labels=True,labels=node_labels)

In [63]:

Out[63]:

{'00224163': <matplotlib.text.Text at 0x7f5287f29080>,
 '01044129': <matplotlib.text.Text at 0x7f5287f295c0>,
 None: <matplotlib.text.Text at 0x7f5287f29b00>,
 '01508577': <matplotlib.text.Text at 0x7f5287f30080>,
 '04120344': <matplotlib.text.Text at 0x7f5287f305c0>,
 '05386807': <matplotlib.text.Text at 0x7f5287f30b00>,
 '00504918': <matplotlib.text.Text at 0x7f5287f37080>,
 '00029423': <matplotlib.text.Text at 0x7f5287f375c0>,
 '03082773': <matplotlib.text.Text at 0x7f5287f37b00>,
 '29423': <matplotlib.text.Text at 0x7f5287f3d080>,
 '06319724': <matplotlib.text.Text at 0x7f5287f3d5c0>,
 '00629992': <matplotlib.text.Text at 0x7f5287f3db00>,
 '03416224': <matplotlib.text.Text at 0x7f5287f41080>,
 '06682973': <matplotlib.text.Text at 0x7f5287f415c0>,
 '00562422': <matplotlib.text.Text at 0x7f5287f41b00>,
 '02370894': <matplotlib.text.Text at 0x7f52876b0080>,
 '00017030': <matplotlib.text.Text at 0x7f52876b05c0>,
 '10317272': <matplotlib.text.Text at 0x7f52876b0b00>,
 '00370005': <matplotlib.text.Text at 0x7f5287688080>,
 '02289106': <matplotlib.text.Text at 0x7f52876885c0>,
 '06354009': <matplotlib.text.Text at 0x7f5287688b00>,
 '07707656': <matplotlib.text.Text at 0x7f528768f080>}

In [60]:

node_labels = nx.get_node_attributes(DG,'Label')
node_labels

Out[60]:

{'00224163': 'WHITBREAD HOTEL COMPANY LIMITED',
 '01044129': 'WHITBREAD DEVELOPMENTS LIMITED',
 None: 'Mr Gary Whitbread',
 '01508577': 'WHITBREAD SHARE OWNERSHIP TRUSTEES LIMITED',
 '04120344': 'WHITBREAD PLC',
 '05386807': 'WHITBREAD GUARANTEE COMPANY TWO LIMITED',
 '00504918': 'WHITBREAD GOLF AND COUNTRY CLUB LIMITED',
 '00029423': 'WHITBREAD GROUP PLC',
 '03082773': 'WHITBREAD SUNDERLAND (1995) LIMITED',
 '29423': 'Whitbread Group Plc',
 '06319724': 'MILTON 2 LIMITED',
 '00629992': 'WHITBREAD PENSION TRUSTEES',
 '03416224': 'WHITBREAD QUEST TRUSTEE LIMITED',
 '06682973': 'WHITBREAD MEDICAL LIMITED',
 '00562422': 'WHITBREAD INVESTMENT COMPANY LIMITED',
 '02370894': 'WHITBREAD RESTAURANTS (AUSTRALIA) LIMITED',
 '00017030': 'WHITBREAD EAST PENNINES LIMITED',
 '10317272': 'WHITBREAD CONSULTING LIMITED',
 '00370005': 'WHITBREAD INTERNATIONAL TRADING LIMITED',
 '02289106': 'WHITBREAD NOMINEES LIMITED',
 '06354009': 'WHITBREAD FINANCE PUBLIC LIMITED COMPANY',
 '07707656': 'WHITBREADS TRADING LTD'}

In [ ]:

nx.write_gexf(DG, "companytree.gexf")

In [ ]:

#Show the most popular named entity nodes
for node in sorted(DG.in_degree_iter(),key=itemgetter(1),reverse=True)[:20]:
    print(node)

In [ ]:

How many different nationalities are beneficial owners?
A simple count of nationalities reveals 1,725 but with a lot of duplicates (i.e. there are 23 different ways of spelling/representing the nationality "Zimbabwean". We might need to remove text after a "/" to remove duel nationalities in some cleaning. We then might need to match to a complete list of nationalities to understand.
How many royalty control companies?
What is the age of the youngest person who controls a company?
df = pd.read_csv("companies-with-controlling-entities-v5.csv")
df.dob_year.min()
df[df.dob_year == 2016]
There are 1097 Registered in this year.
What is the age of the oldest person who controls a company?
df = pd.read_csv("companies-with-controlling-entities-v5.csv")
df.dob_year.min()
df[df.dob_year == 1916]
There are three persons born in 1916
There is someone born in the future; in the year 9998!
Which individual controls the most number of companies?
df.apply(lambda x: ' '.join([str(x.given_name),str(x.family_name)]), axis=1).value_counts()
The individual Gabrielle Southern  owns 761 companies ( ! )

In [ ]:

Spend Network Data¶

If you try to unzip the data, you may get an error depending on the format of the zip fule used:

unzip message need PK compat. v5.1 (can do v4.6) : install the p7zip-full package and use p7zip to unzip.

Install p7zip:

In [ ]:

!sudo apt-get install p7zip-full

Launch a terminal from the notebooks homepage (New -> Terminal) and run the following command (when prompted, the password is *****):

7z e PATH_TO_FILE/FILENAME.csv.zip

For example: 7z e Spend\ Network/central_gov_slim.csv.zip

Load the data in (this may take some time).

In [22]:

fn='central_gov_slim.csv'
%sql DROP TABLE IF EXISTS central_gov_slim

chunks=pd.read_csv(fn,chunksize=1000,dtype=str,encoding='utf-8-sig')

err=0

#Note that we can also read in from a zip file - as long as it isn't password protected:
#pd.read_csv(fn,chunksize=10000,dtype=str, compression='gzip')
for chunk in chunks:
    #Pop the data into the 
    try:
        chunk.to_sql('central_gov_slim',engine,index=False, if_exists='append')
    except:
        err=err+1
print(err)

Done.

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
/opt/conda/lib/python3.5/encodings/utf_8_sig.py in decode(input, errors)
     22         prefix = 3
---> 23     (output, consumed) = codecs.utf_8_decode(input, errors, True)
     24     return (output, consumed+prefix)

KeyboardInterrupt: 

The above exception was the direct cause of the following exception:

KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-22-42926f452e42> in <module>()
      7 #Note that we can also read in from a zip file - as long as it isn't password protected:
      8 #pd.read_csv(fn,chunksize=10000,dtype=str, compression='gzip')
----> 9 for chunk in chunks:
     10     #Pop the data into the
     11     try:

/opt/conda/lib/python3.5/site-packages/pandas/io/parsers.py in __next__(self)
    913     def __next__(self):
    914         try:
--> 915             return self.get_chunk()
    916         except StopIteration:
    917             self.close()

/opt/conda/lib/python3.5/site-packages/pandas/io/parsers.py in get_chunk(self, size)
    969         if size is None:
    970             size = self.chunksize
--> 971         return self.read(nrows=size)
    972 
    973 

/opt/conda/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
    936                 raise ValueError('skipfooter not supported for iteration')
    937 
--> 938         ret = self._engine.read(nrows)
    939 
    940         if self.options.get('as_recarray'):

/opt/conda/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows)
   1503     def read(self, nrows=None):
   1504         try:
-> 1505             data = self._reader.read(nrows)
   1506         except StopIteration:
   1507             if self._first_chunk:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:9884)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10347)()

pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:11161)()

pandas/parser.pyx in pandas.parser.TextReader._convert_column_data (pandas/parser.c:12536)()

pandas/parser.pyx in pandas.parser.TextReader._convert_tokens (pandas/parser.c:13203)()

pandas/parser.pyx in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:14853)()

pandas/parser.pyx in pandas.parser.TextReader._string_convert (pandas/parser.c:15538)()

pandas/parser.pyx in pandas.parser._string_box_decode (pandas/parser.c:21480)()

KeyboardInterrupt: decoding with 'utf-8-sig' codec failed (KeyboardInterrupt: )

In [24]:

%sql SELECT COUNT(*) FROM central_gov_slim

1 rows affected.

Out[24]:

count
1170000

In [ ]: