Note that the data ingest stage of this activity may take an hour or more to run on your computer, so make sure you take this into account as you work through the activity.
In the UK, Companies House maintains a public register of registered companies. Details of "live" companies can be downloaded as open data from Companies House: http://download.companieshouse.gov.uk/en_output.html.
The data is currently bundled as a set of five zip files containing, in total, something over three and a half million records (the lower numbered zip files contain 850,000 rows each; the final zip file contains the balance).
Whilst pandas can cope with quite large dataframes, keeping such a large number of items in memory represents a significant overhead, and having to load the whole dataset into memory each time a new Python session is started is also time consuming.
Instead, it's much easier to store these records in a persistent database such as PostgreSQL, generate efficient indexes over the columns we are likely to want to query, and then pull out just the records we need as we need them.
The first step requires downloading the data from the Companies House website. Let's create a directory in which to put the downloaded zip files and download the files to it.
!mkdir companydata
#You may need to check the URLs
#http://download.companieshouse.gov.uk/en_output.html
#Download the data files and place them in the companydata directory
!wget http://download.companieshouse.gov.uk/BasicCompanyData-2014-06-01-part1_5.zip -P companydata
!wget http://download.companieshouse.gov.uk/BasicCompanyData-2014-06-01-part2_5.zip -P companydata
!wget http://download.companieshouse.gov.uk/BasicCompanyData-2014-06-01-part3_5.zip -P companydata
!wget http://download.companieshouse.gov.uk/BasicCompanyData-2014-06-01-part4_5.zip -P companydata
!wget http://download.companieshouse.gov.uk/BasicCompanyData-2014-06-01-part5_5.zip -P companydata
#Check the files are there...
!ls companydata
#Unzip the files into the same directory
!unzip "companydata/*.zip" -d companydata
Archive:  companydata/BasicCompanyData-2014-06-01-part1_5.zip
  inflating: companydata/BasicCompanyData-2014-06-01-part1_5.csv
Archive:  companydata/BasicCompanyData-2014-06-01-part2_5.zip
  inflating: companydata/BasicCompanyData-2014-06-01-part2_5.csv
Archive:  companydata/BasicCompanyData-2014-06-01-part3_5.zip
  inflating: companydata/BasicCompanyData-2014-06-01-part3_5.csv
Archive:  companydata/BasicCompanyData-2014-06-01-part4_5.zip
  inflating: companydata/BasicCompanyData-2014-06-01-part4_5.csv
Archive:  companydata/BasicCompanyData-2014-06-01-part5_5.zip
  inflating: companydata/BasicCompanyData-2014-06-01-part5_5.csv
5 archives were successfully processed.
#Should we tidy up further and delete the zip files too?
#Or maybe do this after we know everything has been loaded...
#!rm companydata/*.zip
#How many rows are there in each file?
!wc -l companydata/*.csv
#Let's have a quick peek at the head of one of the files
!head -n 3 companydata/BasicCompanyData-2014-06-01-part1_5.csv
CompanyName, CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1, RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,RegAddress.PostCode,CompanyCategory,CompanyStatus,CountryOfOrigin,DissolutionDate,IncorporationDate,Accounts.AccountRefDay,Accounts.AccountRefMonth,Accounts.NextDueDate,Accounts.LastMadeUpDate,Accounts.AccountCategory,Returns.NextDueDate,Returns.LastMadeUpDate,Mortgages.NumMortCharges,Mortgages.NumMortOutstanding,Mortgages.NumMortPartSatisfied,Mortgages.NumMortSatisfied,SICCode.SicText_1,SICCode.SicText_2,SICCode.SicText_3,SICCode.SicText_4,LimitedPartnerships.NumGenPartners,LimitedPartnerships.NumLimPartners,URI,PreviousName_1.CONDATE, PreviousName_1.CompanyName, PreviousName_2.CONDATE, PreviousName_2.CompanyName,PreviousName_3.CONDATE, PreviousName_3.CompanyName,PreviousName_4.CONDATE, PreviousName_4.CompanyName,PreviousName_5.CONDATE, PreviousName_5.CompanyName,PreviousName_6.CONDATE, PreviousName_6.CompanyName,PreviousName_7.CONDATE, PreviousName_7.CompanyName,PreviousName_8.CONDATE, PreviousName_8.CompanyName,PreviousName_9.CONDATE, PreviousName_9.CompanyName,PreviousName_10.CONDATE, PreviousName_10.CompanyName
"! LTD","08209948","","","METROHOUSE 57 PEPPER ROAD","HUNSLET","LEEDS","YORKSHIRE","UNITED KINGDOM","LS10 2RU","Private Limited Company","Active","United Kingdom","","11/09/2012","30","9","30/06/2015","30/09/2013","DORMANT","09/10/2014","11/09/2013","0","0","0","0","99999 - Dormant Company","","","","0","0","http://business.data.gov.uk/id/company/08209948","","","","","","","","","","","","","","","","","","","",""
"!BIG IMPACT GRAPHICS LIMITED","07382019","","","335 ROSDEN HOUSE","372 OLD STREET","LONDON","","","EC1V 9AV","Private Limited Company","Active","United Kingdom","","21/09/2010","30","9","30/06/2014","30/09/2012","DORMANT","19/10/2014","21/09/2013","0","0","0","0","59112 - Video production activities","59113 - Television programme production activities","74100 - specialised design activities","74202 - Other specialist photography","0","0","http://business.data.gov.uk/id/company/07382019","","","","","","","","","","","","","","","","","","","",""
#Let's see how pandas sees a file
import pandas as pd
tmp=pd.read_csv('companydata/BasicCompanyData-2014-06-01-part1_5.csv',nrows=3)
tmp.head(3)
  | CompanyName | CompanyNumber | RegAddress.CareOf | RegAddress.POBox | RegAddress.AddressLine1 | RegAddress.AddressLine2 | RegAddress.PostTown | RegAddress.County | RegAddress.Country | RegAddress.PostCode | ... | PreviousName_6.CONDATE | PreviousName_6.CompanyName | PreviousName_7.CONDATE | PreviousName_7.CompanyName | PreviousName_8.CONDATE | PreviousName_8.CompanyName | PreviousName_9.CONDATE | PreviousName_9.CompanyName | PreviousName_10.CONDATE | PreviousName_10.CompanyName |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ! LTD | 8209948 | NaN | NaN | METROHOUSE 57 PEPPER ROAD | HUNSLET | LEEDS | YORKSHIRE | UNITED KINGDOM | LS10 2RU | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | !BIG IMPACT GRAPHICS LIMITED | 7382019 | NaN | NaN | 335 ROSDEN HOUSE | 372 OLD STREET | LONDON | NaN | NaN | EC1V 9AV | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | !K7 TOURING LIMITED | 8937297 | NaN | NaN | 5TH FLOOR, DURKAN HOUSE | 155 EAST BARNET ROAD | NEW BARNET | HERTS | UNITED KINGDOM | EN4 8QZ | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 rows × 53 columns
#Let's look at the column names
tmp.columns.values
array(['CompanyName', ' CompanyNumber', 'RegAddress.CareOf', 'RegAddress.POBox', 'RegAddress.AddressLine1', ' RegAddress.AddressLine2', 'RegAddress.PostTown', 'RegAddress.County', 'RegAddress.Country', 'RegAddress.PostCode', 'CompanyCategory', 'CompanyStatus', 'CountryOfOrigin', 'DissolutionDate', 'IncorporationDate', 'Accounts.AccountRefDay', 'Accounts.AccountRefMonth', 'Accounts.NextDueDate', 'Accounts.LastMadeUpDate', 'Accounts.AccountCategory', 'Returns.NextDueDate', 'Returns.LastMadeUpDate', 'Mortgages.NumMortCharges', 'Mortgages.NumMortOutstanding', 'Mortgages.NumMortPartSatisfied', 'Mortgages.NumMortSatisfied', 'SICCode.SicText_1', 'SICCode.SicText_2', 'SICCode.SicText_3', 'SICCode.SicText_4', 'LimitedPartnerships.NumGenPartners', 'LimitedPartnerships.NumLimPartners', 'URI', 'PreviousName_1.CONDATE', ' PreviousName_1.CompanyName', ' PreviousName_2.CONDATE', ' PreviousName_2.CompanyName', 'PreviousName_3.CONDATE', ' PreviousName_3.CompanyName', 'PreviousName_4.CONDATE', ' PreviousName_4.CompanyName', 'PreviousName_5.CONDATE', ' PreviousName_5.CompanyName', 'PreviousName_6.CONDATE', ' PreviousName_6.CompanyName', 'PreviousName_7.CONDATE', ' PreviousName_7.CompanyName', 'PreviousName_8.CONDATE', ' PreviousName_8.CompanyName', 'PreviousName_9.CONDATE', ' PreviousName_9.CompanyName', 'PreviousName_10.CONDATE', ' PreviousName_10.CompanyName'], dtype=object)
Looks like there may be extraneous white space in some of the column names, which we'll need to take care of...
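As a quick sketch of the fix - the same rename trick is reused in the ingest routines below:
#Strip any extraneous whitespace from the column names
tmp.rename(columns=lambda x: x.strip(), inplace=True)
tmp.columns.values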
We're going to be using PostgreSQL so we need to load in some drivers and handlers.
#deprecate this in favour of psqlx?
#import psycopg2 as pg
#If connecting to the default port, you can omit the port parameter
#conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test' , port=5432)
#cur = conn.cursor()
import pandas as pd
pd.__version__
'0.14.0'
import sys
sys.version
'3.4.0 (default, Apr 11 2014, 13:05:11) \n[GCC 4.8.2]'
#psql lets us return tables as dataframes
from pandas import read_sql_query as psql
#psqlx is a convenience function for executing raw queries where we aren't so bothered about the response...
#Errors should still be thrown...
from pandas.io.sql import execute as psqlx
import pandas as pd
from sqlalchemy import create_engine
Note that the psql wrapper requires the LIKE % wildcard character to be escaped as %% - the underlying database driver uses %-style parameter placeholders, so a literal % has to be doubled.
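For example, once the companydata table has been created and populated below, a prefix match on company names would need to be written along these lines (MILTON is just an arbitrary search string):
#A literal LIKE 'MILTON%' pattern has to be written as 'MILTON%%'
#psql("SELECT companyname FROM companydata WHERE companyname LIKE 'MILTON%%' LIMIT 5;", engine)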
#Create the database connection for pandas dataframe operations
engine = create_engine('postgresql://test:test@localhost:5432/tm351test')
#cur.execute("DROP TABLE companydata;")
#psqlx("DROP TABLE companydata;",engine)
#cur.execute("DROP TABLE company_postcode_area;")
#psqlx("DROP TABLE company_postcode_area;",engine)
#cur.execute("DROP INDEX company_postcode_area_idx;")
#psqlx("DROP INDEX company_postcode_area_idx;",engine)
#To define the table, let's check the longest strings in each column
#Note that this routine may take a few minutes to run
cleanColNames=["CompanyName","RegAddress.CareOf","RegAddress.POBox",
               "RegAddress.AddressLine1","RegAddress.AddressLine2","RegAddress.PostTown","RegAddress.County",
               "RegAddress.Country","RegAddress.PostCode","CompanyCategory","CompanyStatus",
               "CountryOfOrigin"]
def colMaxLen(df,maxlens):
    '''Update maxlens with the length of the longest string seen so far in each column of interest'''
    #Normalise the column names by stripping any extraneous whitespace
    df.rename(columns=lambda x: x.strip(), inplace=True)
    for col in cleanColNames:
        cml=df[col].astype(str).map(len).max()
        if cml > maxlens[col]: maxlens[col]=cml
import os
maxlens={}
for name in cleanColNames: maxlens[name]=0
for fname in os.listdir('companydata/'):
    #if a file is a CSV file, process it
    if fname.endswith('csv'):
        fname="companydata/"+fname
        #Read in 10,000 rows at a time
        chunks=pd.read_csv(fname,chunksize=10000)
        for chunk in chunks:
            colMaxLen(chunk,maxlens)
maxlens
#I'm going to create a table over some of the columns - essentially, the company name, number, address and incorporation date
#We can use the maxlens to ensure we create large enough data fields
#I wonder if we should just try to create the table from the dataframe?
#Perhaps make sure we cast the right type of each column in the dataframe first?
#It's easier using psycopg2 to just run the query (will sqlalchemy let us run a "raw" query? Trying psqlx over cur.execute)
#cur.execute
psqlx("CREATE TABLE companydata ( CompanyName VARCHAR (200), CompanyNumber VARCHAR (10) PRIMARY KEY, \
RegAddress_CareOf VARCHAR (100), RegAddress_POBox VARCHAR (15), RegAddress_AddressLine1 VARCHAR (120), \
RegAddress_AddressLine2 VARCHAR (100), RegAddress_PostTown VARCHAR (60), \
RegAddress_County VARCHAR (60), RegAddress_Country VARCHAR (60), RegAddress_PostCode VARCHAR (20), \
CompanyCategory VARCHAR (100), CompanyStatus VARCHAR (60), CountryOfOrigin VARCHAR (40), \
IncorporationDate date);",engine)
#Even though I used mixed case column names, I didn't quote them - so PostgreSQL forces them to lower case
#The lack of quotes also meant I had to replace the . in the column names with something more acceptable... (an underscore)
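As a quick illustration of the case folding (left commented out - the quoted form would fail with a column "CompanyName" does not exist error):
#Unquoted identifiers are folded to lower case, so selecting companyname works...
#psql("SELECT companyname FROM companydata LIMIT 1;", engine)
#...whereas quoting the original mixed case name does not
#psql('SELECT "CompanyName" FROM companydata LIMIT 1;', engine)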
psql("SELECT * FROM companydata LIMIT 1;", engine)
What other tables could we create?
SIC codes are sometimes interesting - they give a crude indication of the line of business a company may be involved in. Maybe create a table with company number and SIC code - note that each company number may have several SIC codes, so there may be more than one row per company number in a two-column table of (companynumber, sicCode); a sketch follows below.
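A sketch of how we might build such a table using the same chunked ingest pattern as the main load below - the company_sic table name and the melt-based reshaping are illustrative choices, and we don't actually run this here:
#Reshape the four SICCode columns into one (companynumber, siccode) pair per row
#def addSicChunk(chunk):
#    chunk.rename(columns=lambda x: x.strip(), inplace=True)
#    sic=pd.melt(chunk, id_vars=['CompanyNumber'],
#                value_vars=['SICCode.SicText_1','SICCode.SicText_2','SICCode.SicText_3','SICCode.SicText_4'],
#                value_name='siccode')
#    #Drop the rows corresponding to unused SIC code slots
#    sic.dropna(subset=['siccode'], inplace=True)
#    sic.rename(columns={'CompanyNumber':'companynumber'}, inplace=True)
#    sic[['companynumber','siccode']].to_sql('company_sic', engine, index=False, if_exists='append')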
Having created a crude database table, we're now going to load in some data - company records from the downloaded files for companies that have a well-specified incorporation date.
Note that it may take some time to load the data into the database and index it, possibly an hour or more depending on the speed of your computer. So if you have some reading to do, now could be a good time to do it. Alternatively, go and catch up with what's happening in the course forums.
companydata_cols=['companyname', 'companynumber', 'regaddress_careof',
'regaddress_pobox', 'regaddress_addressline1',
'regaddress_addressline2', 'regaddress_posttown',
'regaddress_county', 'regaddress_country', 'regaddress_postcode',
'companycategory', 'companystatus', 'countryoforigin',
'incorporationdate']
def addchunk(chunk):
    '''Clean a chunk of companydata read from a CSV file and append it to the database table'''
    #NOTE IF WE HAVE MORE TABLES BEST TO PROCESS THEM ALL FROM A SINGLE FILE READ?
    #To conform to the column names in the table I defined, we need to make some changes to the column names
    #Firstly, replace the points with underscores
    chunk.rename(columns=lambda x: x.replace('.','_'), inplace=True)
    #Make sure we strip off any whitespace left and right
    chunk.rename(columns=lambda x: x.strip(), inplace=True)
    #Force the column names to lower case
    chunk.rename(columns=lambda x: x.lower(), inplace=True)
    #Strip whitespace from the column values too - this adds to the load time but it's often worth doing
    for col in companydata_cols:
        chunk[col] = chunk[col].str.strip()
    #We could have parsed the dates on load, but if we do it here we can keep all the cleaning steps together
    #coerce=True turns unparseable dates into NaT (later pandas versions spell this errors='coerce')
    chunk['incorporationdate']=pd.to_datetime(chunk['incorporationdate'], format='%d/%m/%Y',coerce=True)
    #I'm actually only interested in companies with an incorporation date
    chunk.dropna(subset=['incorporationdate'],inplace=True)
    #NB PostgreSQL may also throw a wobbly if it doesn't see a date when it expects one!
    #We're not saving all the columns - just name, number, address and incorporation date, as per the table definition
    #If the table exists, which it does, append data to it
    chunk[companydata_cols].to_sql('companydata', engine, index=False, if_exists='append')
#Here's our ingest loop
#THIS COULD TAKE SOME TIME TO RUN - MAYBE EVEN AN HOUR OR MORE...
#Maybe this is an argument in favour of having several passes for different tables?
#Get the names of the files in the companydata directory
for fname in os.listdir('companydata/'):
    #if a file is a CSV file, process it
    if fname.endswith('csv'):
        fname="companydata/"+fname
        #Read in 10,000 rows at a time
        chunks=pd.read_csv(fname,chunksize=10000,dtype=str)
        for chunk in chunks:
            #Call the function that cleans the data and adds it to the database
            addchunk(chunk)
        #Do some housekeeping - remove the CSV
        os.remove(fname)
#How many rows did we load in the end?
psql("SELECT count(*) FROM companydata;", engine)
psql("SELECT * FROM companydata LIMIT 5;", engine)
A web search for /Milton Keynes postcode areas/ turns up a Wikipedia page - MK Postcode Area - that suggests the postcode areas covering the city are MK1, MK2, MK3, MK4, MK5, MK6, MK7, MK8, MK9, MK10, MK11, MK12, MK13, MK14 and MK15.
If we are going to regularly search by postcode area, it might make sense to create a second table that includes the postcode area as an indexed term and the company number as a foreign key into the original company data table.
#cur.execute("CREATE TABLE company_postcode_area AS \
# SELECT companynumber, split_part(regaddress_postcode,' ',1) as postcode_area FROM companydata;")
psqlx("CREATE TABLE company_postcode_area AS \
SELECT companynumber, split_part(regaddress_postcode,' ',1) as postcode_area FROM companydata;",engine)
---------------------------------------------------------------------------
ProgrammingError                          Traceback (most recent call last)
...
ProgrammingError: (ProgrammingError) relation "company_postcode_area" already exists
 "CREATE TABLE company_postcode_area AS SELECT companynumber, split_part(regaddress_postcode,' ',1) as postcode_area FROM companydata;" {}
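The error simply means the table survives from a previous run of the notebook. A minimal fix, assuming we're happy to rebuild it, is to drop the old copy and then re-run the CREATE TABLE statement above:
#Drop the table left over from a previous run - this also drops any indexes built over it
#psqlx("DROP TABLE IF EXISTS company_postcode_area;", engine)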
#cur.execute("CREATE INDEX company_postcode_area_idx ON company_postcode_area (postcode_area);")
psqlx("CREATE INDEX company_postcode_area_idx ON company_postcode_area (postcode_area);",engine)
mkCos=psql("SELECT * FROM company_postcode_area \
WHERE postcode_area IN ('MK1','MK2','MK3','MK4','MK5','MK6','MK7','MK8','MK9','MK10','MK11','MK12','MK13','MK14','MK15');",engine)
mkCos.groupby('postcode_area').companynumber.agg(len).order(ascending=False)
psql("SELECT postcode_area, COUNT(*) FROM company_postcode_area \
WHERE postcode_area IN ('MK1','MK2','MK3','MK4','MK5','MK6','MK7','MK8','MK9','MK10','MK11','MK12','MK13','MK14','MK15') \
GROUP BY postcode_area;",engine)
psql("SELECT regaddress_postcode, COUNT(*) FROM companydata \
GROUP BY regaddress_postcode ORDER BY count DESC LIMIT 20;",engine)
  | regaddress_postcode | count |
---|---|---|
0 | NaN | 36940 |
1 | EC1V 4PW | 28359 |
2 | N12 0DR | 9458 |
3 | B18 6EW | 9032 |
4 | WA1 1RG | 8932 |
5 | SE1 7HR | 7056 |
6 | SW1Y 5EA | 6961 |
7 | W1B 3HH | 4980 |
8 | BN3 6HA | 4582 |
9 | N14 5BP | 3592 |
10 | SK7 2DH | 3568 |
11 | SW6 3JH | 3509 |
12 | EH7 5JA | 3446 |
13 | BH12 1JY | 3258 |
14 | EC1V 4PY | 2872 |
15 | B2 4BG | 2734 |
16 | W1G 9QR | 2734 |
17 | EH3 9WJ | 2591 |
18 | BS9 3BY | 2470 |
19 | RG7 8NN | 2456 |
psql("SELECT regaddress_postcode FROM companydata WHERE regaddress_postcode LIKE 'BS9%%' LIMIT 20;",engine)
  | regaddress_postcode |
---|---|
0 | BS9 2JT |
1 | BS9 3BY |
2 | BS9 2DR |
3 | BS9 3BY |
4 | BS9 3BY |
5 | BS9 3TQ |
6 | BS9 3BH |
7 | BS9 3BY |
8 | BS9 3BY |
9 | BS9 1BE |
10 | BS9 3HQ |
11 | BS9 3BY |
12 | BS9 3BY |
13 | BS9 3DZ |
14 | BS9 1LE |
15 | BS9 3BY |
16 | BS9 3BY |
17 | BS9 3BY |
18 | BS9 3BY |
19 | BS9 2BB |