Note that the data ingest stage of this activity may take an hour or more to run on your computer, so make sure you take this into account as you work through the activity.
In the UK, Companies House maintains a public register of registered companies. Details of "live" companies can be downloaded as open data from Companies House: http://download.companieshouse.gov.uk/en_output.html.
The data is currently bundled as a set of five zip files containing, in total, something over three and a half million records (the lower numbered zip files contain 850,000 rows each; the final zip file contains the balance).
Whilst pandas can cope with quite large dataframes, keeping such a large number of items in memory represents a significant overhead, and having to load the whole dataset into memory each time a new Python session is started is also time consuming.
Instead, it's much easier to store these records in a persistent database such as PostgreSQL, generate efficient indexes over the columns we are likely to want to query, and then pull out just the records we need as we need them.
The first step requires downloading the data from the Companies House website. Let's create a directory in which to put the downloaded zip files and download the files to it.
!mkdir companydata
#You may need to check the URLs
#http://download.companieshouse.gov.uk/en_output.html
#Download the data files and place them in the companydata directory
!wget http://download.companieshouse.gov.uk/BasicCompanyData-2014-06-01-part1_5.zip -P companydata
!wget http://download.companieshouse.gov.uk/BasicCompanyData-2014-06-01-part2_5.zip -P companydata
!wget http://download.companieshouse.gov.uk/BasicCompanyData-2014-06-01-part3_5.zip -P companydata
!wget http://download.companieshouse.gov.uk/BasicCompanyData-2014-06-01-part4_5.zip -P companydata
!wget http://download.companieshouse.gov.uk/BasicCompanyData-2014-06-01-part5_5.zip -P companydata
#Check the files are there...
!ls companydata
#Unzip the files into the same directory
!unzip "companydata/*.zip" -d companydata
Archive:  companydata/BasicCompanyData-2014-06-01-part1_5.zip
  inflating: companydata/BasicCompanyData-2014-06-01-part1_5.csv
Archive:  companydata/BasicCompanyData-2014-06-01-part2_5.zip
  inflating: companydata/BasicCompanyData-2014-06-01-part2_5.csv
Archive:  companydata/BasicCompanyData-2014-06-01-part3_5.zip
  inflating: companydata/BasicCompanyData-2014-06-01-part3_5.csv
Archive:  companydata/BasicCompanyData-2014-06-01-part4_5.zip
  inflating: companydata/BasicCompanyData-2014-06-01-part4_5.csv
Archive:  companydata/BasicCompanyData-2014-06-01-part5_5.zip
  inflating: companydata/BasicCompanyData-2014-06-01-part5_5.csv
5 archives were successfully processed.
#Should we tidy up further and delete the zip files too?
#Or maybe do this after we know everything has been loaded...
#!rm companydata/*.zip
#How many rows are there in each file?
!wc -l companydata/*.csv
#Let's have a quick peek at the head of one of the files
!head -n 3 companydata/BasicCompanyData-2014-06-01-part1_5.csv
CompanyName, CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1, RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,RegAddress.PostCode,CompanyCategory,CompanyStatus,CountryOfOrigin,DissolutionDate,IncorporationDate,Accounts.AccountRefDay,Accounts.AccountRefMonth,Accounts.NextDueDate,Accounts.LastMadeUpDate,Accounts.AccountCategory,Returns.NextDueDate,Returns.LastMadeUpDate,Mortgages.NumMortCharges,Mortgages.NumMortOutstanding,Mortgages.NumMortPartSatisfied,Mortgages.NumMortSatisfied,SICCode.SicText_1,SICCode.SicText_2,SICCode.SicText_3,SICCode.SicText_4,LimitedPartnerships.NumGenPartners,LimitedPartnerships.NumLimPartners,URI,PreviousName_1.CONDATE, PreviousName_1.CompanyName, PreviousName_2.CONDATE, PreviousName_2.CompanyName,PreviousName_3.CONDATE, PreviousName_3.CompanyName,PreviousName_4.CONDATE, PreviousName_4.CompanyName,PreviousName_5.CONDATE, PreviousName_5.CompanyName,PreviousName_6.CONDATE, PreviousName_6.CompanyName,PreviousName_7.CONDATE, PreviousName_7.CompanyName,PreviousName_8.CONDATE, PreviousName_8.CompanyName,PreviousName_9.CONDATE, PreviousName_9.CompanyName,PreviousName_10.CONDATE, PreviousName_10.CompanyName
"! LTD","08209948","","","METROHOUSE 57 PEPPER ROAD","HUNSLET","LEEDS","YORKSHIRE","UNITED KINGDOM","LS10 2RU","Private Limited Company","Active","United Kingdom","","11/09/2012","30","9","30/06/2015","30/09/2013","DORMANT","09/10/2014","11/09/2013","0","0","0","0","99999 - Dormant Company","","","","0","0","http://business.data.gov.uk/id/company/08209948","","","","","","","","","","","","","","","","","","","",""
"!BIG IMPACT GRAPHICS LIMITED","07382019","","","335 ROSDEN HOUSE","372 OLD STREET","LONDON","","","EC1V 9AV","Private Limited Company","Active","United Kingdom","","21/09/2010","30","9","30/06/2014","30/09/2012","DORMANT","19/10/2014","21/09/2013","0","0","0","0","59112 - Video production activities","59113 - Television programme production activities","74100 - specialised design activities","74202 - Other specialist photography","0","0","http://business.data.gov.uk/id/company/07382019","","","","","","","","","","","","","","","","","","","",""
#Let's see how pandas sees a file
import pandas as pd
tmp=pd.read_csv('companydata/BasicCompanyData-2014-06-01-part1_5.csv',nrows=3)
tmp.head(3)
  | CompanyName | CompanyNumber | RegAddress.CareOf | RegAddress.POBox | RegAddress.AddressLine1 | RegAddress.AddressLine2 | RegAddress.PostTown | RegAddress.County | RegAddress.Country | RegAddress.PostCode | ... | PreviousName_6.CONDATE | PreviousName_6.CompanyName | PreviousName_7.CONDATE | PreviousName_7.CompanyName | PreviousName_8.CONDATE | PreviousName_8.CompanyName | PreviousName_9.CONDATE | PreviousName_9.CompanyName | PreviousName_10.CONDATE | PreviousName_10.CompanyName |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ! LTD | 8209948 | NaN | NaN | METROHOUSE 57 PEPPER ROAD | HUNSLET | LEEDS | YORKSHIRE | UNITED KINGDOM | LS10 2RU | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | !BIG IMPACT GRAPHICS LIMITED | 7382019 | NaN | NaN | 335 ROSDEN HOUSE | 372 OLD STREET | LONDON | NaN | NaN | EC1V 9AV | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | !K7 TOURING LIMITED | 8937297 | NaN | NaN | 5TH FLOOR, DURKAN HOUSE | 155 EAST BARNET ROAD | NEW BARNET | HERTS | UNITED KINGDOM | EN4 8QZ | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 rows × 53 columns
#Let's look at the column names
tmp.columns.values
array(['CompanyName', ' CompanyNumber', 'RegAddress.CareOf', 'RegAddress.POBox', 'RegAddress.AddressLine1', ' RegAddress.AddressLine2', 'RegAddress.PostTown', 'RegAddress.County', 'RegAddress.Country', 'RegAddress.PostCode', 'CompanyCategory', 'CompanyStatus', 'CountryOfOrigin', 'DissolutionDate', 'IncorporationDate', 'Accounts.AccountRefDay', 'Accounts.AccountRefMonth', 'Accounts.NextDueDate', 'Accounts.LastMadeUpDate', 'Accounts.AccountCategory', 'Returns.NextDueDate', 'Returns.LastMadeUpDate', 'Mortgages.NumMortCharges', 'Mortgages.NumMortOutstanding', 'Mortgages.NumMortPartSatisfied', 'Mortgages.NumMortSatisfied', 'SICCode.SicText_1', 'SICCode.SicText_2', 'SICCode.SicText_3', 'SICCode.SicText_4', 'LimitedPartnerships.NumGenPartners', 'LimitedPartnerships.NumLimPartners', 'URI', 'PreviousName_1.CONDATE', ' PreviousName_1.CompanyName', ' PreviousName_2.CONDATE', ' PreviousName_2.CompanyName', 'PreviousName_3.CONDATE', ' PreviousName_3.CompanyName', 'PreviousName_4.CONDATE', ' PreviousName_4.CompanyName', 'PreviousName_5.CONDATE', ' PreviousName_5.CompanyName', 'PreviousName_6.CONDATE', ' PreviousName_6.CompanyName', 'PreviousName_7.CONDATE', ' PreviousName_7.CompanyName', 'PreviousName_8.CONDATE', ' PreviousName_8.CompanyName', 'PreviousName_9.CONDATE', ' PreviousName_9.CompanyName', 'PreviousName_10.CONDATE', ' PreviousName_10.CompanyName'], dtype=object)
Looks like there may be extraneous white space in some of the column names, which we'll need to take care of...
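As a quick sketch of the fix - the same rename trick is reused in the ingest routines below:
#Strip any extraneous whitespace from the column names
tmp.rename(columns=lambda x: x.strip(), inplace=True)
tmp.columns.values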
We're going to be using PostgreSQL so we need to load in some drivers and handlers.
#deprecate this in favour of psqlx?
#import psycopg2 as pg
#If connecting to the default port, you can omit the port parameter
#conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test' , port=5432)
#cur = conn.cursor()
import pandas as pd
pd.__version__
'0.14.0'
import sys
sys.version
'3.4.0 (default, Apr 11 2014, 13:05:11) \n[GCC 4.8.2]'
#psql lets us return tables as dataframes
from pandas import read_sql_query as psql
#psqlx is a convenience function for executing raw queries where we aren't so bothered about the response...
#Errors should still be thrown...
from pandas.io.sql import execute as psqlx
import pandas as pd
from sqlalchemy import create_engine
Note that the psql wrapper requires the LIKE % wildcard character to be escaped as %% - the underlying database driver uses %-style parameter placeholders, so a literal % has to be doubled.
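For example, once the companydata table has been created and populated below, a prefix match on company names would need to be written along these lines (MILTON is just an arbitrary search string):
#A literal LIKE 'MILTON%' pattern has to be written as 'MILTON%%'
#psql("SELECT companyname FROM companydata WHERE companyname LIKE 'MILTON%%' LIMIT 5;", engine)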
#Create the database connection for pandas dataframe operations
engine = create_engine('postgresql://test:test@localhost:5432/tm351test')
#cur.execute("DROP TABLE companydata;")
#psqlx("DROP TABLE companydata;",engine)
#cur.execute("DROP TABLE company_postcode_area;")
#psqlx("DROP TABLE company_postcode_area;",engine)
#cur.execute("DROP INDEX company_postcode_area_idx;")
#psqlx("DROP INDEX company_postcode_area_idx;",engine)
#To define the table, let's check the longest strings in each column
#Note that this routine may take a few minutes to run
cleanColNames=["CompanyName","RegAddress.CareOf","RegAddress.POBox",
               "RegAddress.AddressLine1","RegAddress.AddressLine2","RegAddress.PostTown","RegAddress.County",
               "RegAddress.Country","RegAddress.PostCode","CompanyCategory","CompanyStatus",
               "CountryOfOrigin"]
def colMaxLen(df,maxlens):
    '''Update maxlens with the length of the longest string seen so far in each column of interest'''
    #Normalise the column names by stripping any extraneous whitespace
    df.rename(columns=lambda x: x.strip(), inplace=True)
    for col in cleanColNames:
        cml=df[col].astype(str).map(len).max()
        if cml > maxlens[col]: maxlens[col]=cml
import os
maxlens={}
for name in cleanColNames: maxlens[name]=0
for fname in os.listdir('companydata/'):
    #if a file is a CSV file, process it
    if fname.endswith('csv'):
        fname="companydata/"+fname
        #Read in 10,000 rows at a time
        chunks=pd.read_csv(fname,chunksize=10000)
        for chunk in chunks:
            colMaxLen(chunk,maxlens)
maxlens
#I'm going to create a table over some of the columns - essentially, the company name, number, address and incorporation date
#We can use the maxlens to ensure we create large enough data fields
#I wonder if we should just try to create the table from the dataframe?
#Perhaps make sure we cast the right type of each column in the dataframe first?
#It's easier using psycopg2 to just run the query (will sqlalchemy let us run a "raw" query? Trying psqlx over cur.execute)
#cur.execute
psqlx("CREATE TABLE companydata ( CompanyName VARCHAR (200), CompanyNumber VARCHAR (10) PRIMARY KEY, \
RegAddress_CareOf VARCHAR (100), RegAddress_POBox VARCHAR (15), RegAddress_AddressLine1 VARCHAR (120), \
RegAddress_AddressLine2 VARCHAR (100), RegAddress_PostTown VARCHAR (60), \
RegAddress_County VARCHAR (60), RegAddress_Country VARCHAR (60), RegAddress_PostCode VARCHAR (20), \
CompanyCategory VARCHAR (100), CompanyStatus VARCHAR (60), CountryOfOrigin VARCHAR (40), \
IncorporationDate date);",engine)
#Even though I used mixed case column names, I didn't quote them - so PostgreSQL forces them to lower case
#The lack of quotes also meant I had to replace the . in the column names with something more acceptable... (an underscore)
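As a quick illustration of the case folding (left commented out - the quoted form would fail with a column "CompanyName" does not exist error):
#Unquoted identifiers are folded to lower case, so selecting companyname works...
#psql("SELECT companyname FROM companydata LIMIT 1;", engine)
#...whereas quoting the original mixed case name does not
#psql('SELECT "CompanyName" FROM companydata LIMIT 1;', engine)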
psql("SELECT * FROM companydata LIMIT 1;", engine)
What other tables could we create?
SIC codes are sometimes interesting - they give a crude indication of the line of business a company may be involved in. Maybe create a table with company number and SIC code - note that each company number may have several SIC codes, so there may be more than one row per company number in a two-column table of (companynumber, sicCode); a sketch follows below.
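A sketch of how we might build such a table using the same chunked ingest pattern as the main load below - the company_sic table name and the melt-based reshaping are illustrative choices, and we don't actually run this here:
#Reshape the four SICCode columns into one (companynumber, siccode) pair per row
#def addSicChunk(chunk):
#    chunk.rename(columns=lambda x: x.strip(), inplace=True)
#    sic=pd.melt(chunk, id_vars=['CompanyNumber'],
#                value_vars=['SICCode.SicText_1','SICCode.SicText_2','SICCode.SicText_3','SICCode.SicText_4'],
#                value_name='siccode')
#    #Drop the rows corresponding to unused SIC code slots
#    sic.dropna(subset=['siccode'], inplace=True)
#    sic.rename(columns={'CompanyNumber':'companynumber'}, inplace=True)
#    sic[['companynumber','siccode']].to_sql('company_sic', engine, index=False, if_exists='append')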
Having created a crude database table, we're now going to load in some data - company records from the downloaded files for companies that have a well-specified incorporation date.
Note that it may take some time to load the data into the database and index it, possibly an hour or more depending on the speed of your computer. So if you have some reading to do, now could be a good time to do it. Alternatively, go and catch up with what's happening in the course forums.
companydata_cols=['companyname', 'companynumber', 'regaddress_careof',
'regaddress_pobox', 'regaddress_addressline1',
'regaddress_addressline2', 'regaddress_posttown',
'regaddress_county', 'regaddress_country', 'regaddress_postcode',
'companycategory', 'companystatus', 'countryoforigin',
'incorporationdate']
def addchunk(chunk):
    '''Clean a chunk of companydata read from a CSV file and append it to the database table'''
    #NOTE IF WE HAVE MORE TABLES BEST TO PROCESS THEM ALL FROM A SINGLE FILE READ?
    #To conform to the column names in the table I defined, we need to make some changes to the column names
    #Firstly, replace the points with underscores
    chunk.rename(columns=lambda x: x.replace('.','_'), inplace=True)
    #Make sure we strip off any whitespace left and right
    chunk.rename(columns=lambda x: x.strip(), inplace=True)
    #Force the column names to lower case
    chunk.rename(columns=lambda x: x.lower(), inplace=True)
    #Strip whitespace from the column values too - this adds to the load time but it's often worth doing
    for col in companydata_cols:
        chunk[col] = chunk[col].str.strip()
    #We could have parsed the dates on load, but if we do it here we can keep all the cleaning steps together
    #coerce=True turns unparseable dates into NaT (later pandas versions spell this errors='coerce')
    chunk['incorporationdate']=pd.to_datetime(chunk['incorporationdate'], format='%d/%m/%Y',coerce=True)
    #I'm actually only interested in companies with an incorporation date
    chunk.dropna(subset=['incorporationdate'],inplace=True)
    #NB PostgreSQL may also throw a wobbly if it doesn't see a date when it expects one!
    #We're not saving all the columns - just name, number, address and incorporation date, as per the table definition
    #If the table exists, which it does, append data to it
    chunk[companydata_cols].to_sql('companydata', engine, index=False, if_exists='append')
#Here's our ingest loop
#THIS COULD TAKE SOME TIME TO RUN - MAYBE EVEN AN HOUR OR MORE...
#Maybe this is an argument in favour of having several passes for different tables?
#Get the names of the files in the companydata directory
for fname in os.listdir('companydata/'):
    #if a file is a CSV file, process it
    if fname.endswith('csv'):
        fname="companydata/"+fname
        #Read in 10,000 rows at a time
        chunks=pd.read_csv(fname,chunksize=10000,dtype=str)
        for chunk in chunks:
            #Call the function that cleans the data and adds it to the database
            addchunk(chunk)
        #Do some housekeeping - remove the CSV
        os.remove(fname)
#How many rows did we load in the end?
psql("SELECT count(*) FROM companydata;", engine)
psql("SELECT * FROM companydata LIMIT 5;", engine)
A web search for /Milton Keynes postcode areas/ turns up a Wikipedia page - MK Postcode Area - that suggests the postcode areas covering the city are MK1, MK2, MK3, MK4, MK5, MK6, MK7, MK8, MK9, MK10, MK11, MK12, MK13, MK14 and MK15.
If we are going to regularly search by postcode area, it might make sense to create a second table that includes the postcode area as an indexed term and the company number as a foreign key into the original company data table.
#cur.execute("CREATE TABLE company_postcode_area AS \
# SELECT companynumber, split_part(regaddress_postcode,' ',1) as postcode_area FROM companydata;")
psqlx("CREATE TABLE company_postcode_area AS \
SELECT companynumber, split_part(regaddress_postcode,' ',1) as postcode_area FROM companydata;",engine)
---------------------------------------------------------------------------
ProgrammingError                          Traceback (most recent call last)
...
ProgrammingError: (ProgrammingError) relation "company_postcode_area" already exists
 "CREATE TABLE company_postcode_area AS SELECT companynumber, split_part(regaddress_postcode,' ',1) as postcode_area FROM companydata;" {}
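The error simply means the table survives from a previous run of the notebook. A minimal fix, assuming we're happy to rebuild it, is to drop the old copy and then re-run the CREATE TABLE statement above:
#Drop the table left over from a previous run - this also drops any indexes built over it
#psqlx("DROP TABLE IF EXISTS company_postcode_area;", engine)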
#cur.execute("CREATE INDEX company_postcode_area_idx ON company_postcode_area (postcode_area);")
psqlx("CREATE INDEX company_postcode_area_idx ON company_postcode_area (postcode_area);",engine)
mkCos=psql("SELECT * FROM company_postcode_area \
WHERE postcode_area IN ('MK1','MK2','MK3','MK4','MK5','MK6','MK7','MK8','MK9','MK10','MK11','MK12','MK13','MK14','MK15');",engine)
mkCos.groupby('postcode_area').companynumber.agg(len).order(ascending=False)
psql("SELECT postcode_area, COUNT(*) FROM company_postcode_area \
WHERE postcode_area IN ('MK1','MK2','MK3','MK4','MK5','MK6','MK7','MK8','MK9','MK10','MK11','MK12','MK13','MK14','MK15') \
GROUP BY postcode_area;",engine)
psql("SELECT regaddress_postcode, COUNT(*) FROM companydata \
GROUP BY regaddress_postcode ORDER BY count DESC LIMIT 20;",engine)
  | regaddress_postcode | count |
---|---|---|
0 | NaN | 36940 |
1 | EC1V 4PW | 28359 |
2 | N12 0DR | 9458 |
3 | B18 6EW | 9032 |
4 | WA1 1RG | 8932 |
5 | SE1 7HR | 7056 |
6 | SW1Y 5EA | 6961 |
7 | W1B 3HH | 4980 |
8 | BN3 6HA | 4582 |
9 | N14 5BP | 3592 |
10 | SK7 2DH | 3568 |
11 | SW6 3JH | 3509 |
12 | EH7 5JA | 3446 |
13 | BH12 1JY | 3258 |
14 | EC1V 4PY | 2872 |
15 | B2 4BG | 2734 |
16 | W1G 9QR | 2734 |
17 | EH3 9WJ | 2591 |
18 | BS9 3BY | 2470 |
19 | RG7 8NN | 2456 |
psql("SELECT regaddress_postcode FROM companydata WHERE regaddress_postcode LIKE 'BS9%%' LIMIT 20;",engine)
  | regaddress_postcode |
---|---|
0 | BS9 2JT |
1 | BS9 3BY |
2 | BS9 2DR |
3 | BS9 3BY |
4 | BS9 3BY |
5 | BS9 3TQ |
6 | BS9 3BH |
7 | BS9 3BY |
8 | BS9 3BY |
9 | BS9 1BE |
10 | BS9 3HQ |
11 | BS9 3BY |
12 | BS9 3BY |
13 | BS9 3DZ |
14 | BS9 1LE |
15 | BS9 3BY |
16 | BS9 3BY |
17 | BS9 3BY |
18 | BS9 3BY |
19 | BS9 2BB |