The base containers include:
The notebook container is lacking some libraries and packages we need to make life easier... The -qq
and --quiet
flags make the install a quiet one...
!sudo apt-get update -qq && sudo apt-get install -qq -y libpq-dev python-dev
!pip install --quiet ipython-sql psycopg2
Jupyter notebooks support SQLMagic. This lets us run SQL commands in a code cell.
%load_ext sql
#This is how we connect to a sql database
#Monolithic VM addressing style
%sql postgresql://postgres:PGPass@postgres:5432/postgres
'Connected: postgres@postgres'
We can also run SQL commands on the database using the pandas
python package.
from sqlalchemy import create_engine
engine = create_engine("postgresql://postgres:PGPass@postgres:5432/postgres")
Loading CSV data directly into a PostgreSQL database often required the CSV to be well behaved; pandas is a bit more forgiving on the import and can be used to load the data from the original CSV file and then insert it into the PostgreSQL database. As pandas dataframes are stored in memory we need to load the data in from the source file in chunks of several thouasnds of lines at a time.
fn="companies-with-controlling-entities v6.csv"
!head "$fn"
The PostgreSQL database table will be defined based on the dataframe column names properties - we may need to force column datatypes when we read in the datafile.
#Import the data from the CSV file
import pandas as pd
#We'll import the base data into a table called: sigcontrol
#Drop the table if it already exists so we start from a blank state
%sql DROP TABLE IF EXISTS sigcontrolchunks=pd.read_csv(fn,chunksize=10000,dtype=str)
#Note that we can also read in from a zip file - as long as it isn't password protected:
#pd.read_csv(fn,chunksize=10000,dtype=str, compression='gzip')
for chunk in chunks:
#Pop the data into the
chunk.to_sql('sigcontrol',engine,index=False, if_exists='append')
%sql SELECT * FROM sigcontrol LIMIT 3
To get you started with the data, here are some opening questions, with queries to solve them:
#Who controls G4S?
%sql SELECT * FROM sigcontrol WHERE company_name LIKE '%G4S%' LIMIT 3
3 rows affected.
company_number | company_name | jurisdiction_code | Controlling Entity SKey | Controlling Entity Name | opencorporates_url | address_care_of | address_country | address_locality | address_postal_code | address_region | address_street | controlling_company_type | country_of_residence | dob_month | dob_year | family_name | given_name | middle_name | nationality | po_box | title | uid | ultimate_entity_company_number | controlling_entity_company_number |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
00928471 | G4S 084 (UK) LIMITED | gb | 0 | G4S PLC | https://opencorporates.com/companies/gb/00928471 | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | 04992207 | 04992207 |
01956974 | G4S HOLDINGS 38 (UK) LIMITED | gb | 0 | G4S HOLDINGS 3 (UK) LIMITED | https://opencorporates.com/companies/gb/01956974 | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | 02380914 | 02380914 |
02626613 | G4S MONITORING TECHNOLOGIES LIMITED | gb | 0 | G4S CARE AND JUSTICE SERVICES (UK) LIMITED | https://opencorporates.com/companies/gb/02626613 | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | None | 00390328 | 00390328 |
# Who controls Whitbread?
df=%sql SELECT * FROM sigcontrol WHERE company_name LIKE '%WHITBREAD%'
38 rows affected.
We can create a graph that allows us to see how these companies are related to each other. The networkx
library is a handy tool for this.
import networkx as nx
def companyInterestsGrapher(row,DG):
DG.add_node(row['company_number'], Label=row['company_name'])
DG.add_node(row['controlling_entity_company_number'], Label=row['Controlling Entity Name'])
DG.add_edge(row['company_number'],row['controlling_entity_company_number'])
return
DG=nx.DiGraph()
tmp=df.DataFrame().apply(lambda x: companyInterestsGrapher(x,DG), axis=1)
DG.number_of_nodes()
22
nx.draw(DG,with_labels=True,labels=node_labels)
{'00224163': <matplotlib.text.Text at 0x7f5287f29080>, '01044129': <matplotlib.text.Text at 0x7f5287f295c0>, None: <matplotlib.text.Text at 0x7f5287f29b00>, '01508577': <matplotlib.text.Text at 0x7f5287f30080>, '04120344': <matplotlib.text.Text at 0x7f5287f305c0>, '05386807': <matplotlib.text.Text at 0x7f5287f30b00>, '00504918': <matplotlib.text.Text at 0x7f5287f37080>, '00029423': <matplotlib.text.Text at 0x7f5287f375c0>, '03082773': <matplotlib.text.Text at 0x7f5287f37b00>, '29423': <matplotlib.text.Text at 0x7f5287f3d080>, '06319724': <matplotlib.text.Text at 0x7f5287f3d5c0>, '00629992': <matplotlib.text.Text at 0x7f5287f3db00>, '03416224': <matplotlib.text.Text at 0x7f5287f41080>, '06682973': <matplotlib.text.Text at 0x7f5287f415c0>, '00562422': <matplotlib.text.Text at 0x7f5287f41b00>, '02370894': <matplotlib.text.Text at 0x7f52876b0080>, '00017030': <matplotlib.text.Text at 0x7f52876b05c0>, '10317272': <matplotlib.text.Text at 0x7f52876b0b00>, '00370005': <matplotlib.text.Text at 0x7f5287688080>, '02289106': <matplotlib.text.Text at 0x7f52876885c0>, '06354009': <matplotlib.text.Text at 0x7f5287688b00>, '07707656': <matplotlib.text.Text at 0x7f528768f080>}
node_labels = nx.get_node_attributes(DG,'Label')
node_labels
{'00224163': 'WHITBREAD HOTEL COMPANY LIMITED', '01044129': 'WHITBREAD DEVELOPMENTS LIMITED', None: 'Mr Gary Whitbread', '01508577': 'WHITBREAD SHARE OWNERSHIP TRUSTEES LIMITED', '04120344': 'WHITBREAD PLC', '05386807': 'WHITBREAD GUARANTEE COMPANY TWO LIMITED', '00504918': 'WHITBREAD GOLF AND COUNTRY CLUB LIMITED', '00029423': 'WHITBREAD GROUP PLC', '03082773': 'WHITBREAD SUNDERLAND (1995) LIMITED', '29423': 'Whitbread Group Plc', '06319724': 'MILTON 2 LIMITED', '00629992': 'WHITBREAD PENSION TRUSTEES', '03416224': 'WHITBREAD QUEST TRUSTEE LIMITED', '06682973': 'WHITBREAD MEDICAL LIMITED', '00562422': 'WHITBREAD INVESTMENT COMPANY LIMITED', '02370894': 'WHITBREAD RESTAURANTS (AUSTRALIA) LIMITED', '00017030': 'WHITBREAD EAST PENNINES LIMITED', '10317272': 'WHITBREAD CONSULTING LIMITED', '00370005': 'WHITBREAD INTERNATIONAL TRADING LIMITED', '02289106': 'WHITBREAD NOMINEES LIMITED', '06354009': 'WHITBREAD FINANCE PUBLIC LIMITED COMPANY', '07707656': 'WHITBREADS TRADING LTD'}
nx.write_gexf(DG, "companytree.gexf")
#Show the most popular named entity nodes
for node in sorted(DG.in_degree_iter(),key=itemgetter(1),reverse=True)[:20]:
print(node)
How many different nationalities are beneficial owners?
A simple count of nationalities reveals 1,725 but with a lot of duplicates (i.e. there are 23 different ways of spelling/representing the nationality "Zimbabwean". We might need to remove text after a "/" to remove duel nationalities in some cleaning. We then might need to match to a complete list of nationalities to understand.
How many royalty control companies?
What is the age of the youngest person who controls a company?
df = pd.read_csv("companies-with-controlling-entities-v5.csv")
df.dob_year.min()
df[df.dob_year == 2016]
There are 1097 Registered in this year.
What is the age of the oldest person who controls a company?
df = pd.read_csv("companies-with-controlling-entities-v5.csv")
df.dob_year.min()
df[df.dob_year == 1916]
There are three persons born in 1916
There is someone born in the future; in the year 9998!
Which individual controls the most number of companies?
df.apply(lambda x: ' '.join([str(x.given_name),str(x.family_name)]), axis=1).value_counts()
The individual Gabrielle Southern owns 761 companies ( ! )
If you try to unzip the data, you may get an error depending on the format of the zip fule used:
p7zip-full
package and use p7zip
to unzip.Install p7zip
:
!sudo apt-get install p7zip-full
Launch a terminal from the notebooks homepage (New -> Terminal) and run the following command (when prompted, the password is *****):
7z e PATH_TO_FILE/FILENAME.csv.zip
For example:
7z e Spend\ Network/central_gov_slim.csv.zip
Load the data in (this may take some time).
fn='central_gov_slim.csv'
%sql DROP TABLE IF EXISTS central_gov_slim
chunks=pd.read_csv(fn,chunksize=1000,dtype=str,encoding='utf-8-sig')
err=0
#Note that we can also read in from a zip file - as long as it isn't password protected:
#pd.read_csv(fn,chunksize=10000,dtype=str, compression='gzip')
for chunk in chunks:
#Pop the data into the
try:
chunk.to_sql('central_gov_slim',engine,index=False, if_exists='append')
except:
err=err+1
print(err)
Done.
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) /opt/conda/lib/python3.5/encodings/utf_8_sig.py in decode(input, errors) 22 prefix = 3 ---> 23 (output, consumed) = codecs.utf_8_decode(input, errors, True) 24 return (output, consumed+prefix) KeyboardInterrupt: The above exception was the direct cause of the following exception: KeyboardInterrupt Traceback (most recent call last) <ipython-input-22-42926f452e42> in <module>() 7 #Note that we can also read in from a zip file - as long as it isn't password protected: 8 #pd.read_csv(fn,chunksize=10000,dtype=str, compression='gzip') ----> 9 for chunk in chunks: 10 #Pop the data into the 11 try: /opt/conda/lib/python3.5/site-packages/pandas/io/parsers.py in __next__(self) 913 def __next__(self): 914 try: --> 915 return self.get_chunk() 916 except StopIteration: 917 self.close() /opt/conda/lib/python3.5/site-packages/pandas/io/parsers.py in get_chunk(self, size) 969 if size is None: 970 size = self.chunksize --> 971 return self.read(nrows=size) 972 973 /opt/conda/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows) 936 raise ValueError('skipfooter not supported for iteration') 937 --> 938 ret = self._engine.read(nrows) 939 940 if self.options.get('as_recarray'): /opt/conda/lib/python3.5/site-packages/pandas/io/parsers.py in read(self, nrows) 1503 def read(self, nrows=None): 1504 try: -> 1505 data = self._reader.read(nrows) 1506 except StopIteration: 1507 if self._first_chunk: pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:9884)() pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10347)() pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:11161)() pandas/parser.pyx in pandas.parser.TextReader._convert_column_data (pandas/parser.c:12536)() pandas/parser.pyx in pandas.parser.TextReader._convert_tokens (pandas/parser.c:13203)() pandas/parser.pyx in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:14853)() pandas/parser.pyx in pandas.parser.TextReader._string_convert (pandas/parser.c:15538)() pandas/parser.pyx in pandas.parser._string_box_decode (pandas/parser.c:21480)() KeyboardInterrupt: decoding with 'utf-8-sig' codec failed (KeyboardInterrupt: )
%sql SELECT COUNT(*) FROM central_gov_slim
1 rows affected.
count |
---|
1170000 |