To get familiar with the data, let's create a helper function that reads a specified number of rows from a given CSV file using the right encoding and displays them as a table.
import pandas as pd
import matplotlib.pyplot as plt
csvfile = 'crunchbase-investments.csv'
def pdread_csv(filename, numrows, enc='ISO-8859-1'):
    rows = pd.read_csv(filename, nrows=numrows, encoding=enc)
    return rows
Read a few rows from the CSV file:
pdread_csv(csvfile, 5)
company_permalink | company_name | company_category_code | company_country_code | company_state_code | company_region | company_city | investor_permalink | investor_name | investor_category_code | investor_country_code | investor_state_code | investor_region | investor_city | funding_round_type | funded_at | funded_month | funded_quarter | funded_year | raised_amount_usd | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | /company/advercar | AdverCar | advertising | USA | CA | SF Bay | San Francisco | /company/1-800-flowers-com | 1-800-FLOWERS.COM | NaN | USA | NY | New York | New York | series-a | 2012-10-30 | 2012-10 | 2012-Q4 | 2012 | 2000000 |
1 | /company/launchgram | LaunchGram | news | USA | CA | SF Bay | Mountain View | /company/10xelerator | 10Xelerator | finance | USA | OH | Columbus | Columbus | other | 2012-01-23 | 2012-01 | 2012-Q1 | 2012 | 20000 |
2 | /company/utap | uTaP | messaging | USA | NaN | United States - Other | NaN | /company/10xelerator | 10Xelerator | finance | USA | OH | Columbus | Columbus | other | 2012-01-01 | 2012-01 | 2012-Q1 | 2012 | 20000 |
3 | /company/zoopshop | ZoopShop | software | USA | OH | Columbus | columbus | /company/10xelerator | 10Xelerator | finance | USA | OH | Columbus | Columbus | angel | 2012-02-15 | 2012-02 | 2012-Q1 | 2012 | 20000 |
4 | /company/efuneral | eFuneral | web | USA | OH | Cleveland | Cleveland | /company/10xelerator | 10Xelerator | finance | USA | OH | Columbus | Columbus | other | 2011-09-08 | 2011-09 | 2011-Q3 | 2011 | 20000 |
Total memory usage (in MB) for 5000 rows
pdread_csv(csvfile, 5000).memory_usage(deep=True).sum()/1024**2
5.579195022583008
Since 5,000 rows consume far less than 10 MB, let's read the data into dataframes in 5,000-row chunks so that each chunk stays well under 10 MB of memory. We will then calculate the total memory footprint of all the chunks combined.
Let's define a function for this. It will consume a chunk iterator and return the total memory footprint of the full dataset, in MB.
def all_chunks_mem(chunk_iter):
    total_memory_footprint = 0
    for chunk in chunk_iter:
        total_memory_footprint += chunk.memory_usage(deep=True).sum()/1024**2
    return total_memory_footprint
crunch_chunks = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
all_chunks_mem(crunch_chunks)
56.988484382629395
This dataset is believed to contain over 50,000 rows. Let's find out how many rows and columns there are:
crunch_chunks = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
total_rows = 0
for chunk in crunch_chunks:
    total_rows += chunk.shape[0]
print(total_rows)
print(chunk.shape[1])
52870 20
Next, let's calculate each column's memory footprint:
- We will exclude the memory usage of the DataFrame's index.
- We will chunk the dataset as before, 5,000 rows at a time.
- We will build a list of column names to iterate over in each chunk.
- We will use a dictionary to track each column's memory footprint: keys are column names, values the cumulative memory usage per column.
- The sum of all values, in MB, should match the dataset's total memory footprint (all chunks combined) calculated earlier.
crunch_chunks = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
col_mem = {}
for chunk in crunch_chunks:
    cols = list(chunk.columns)
    for col in cols:
        if col in col_mem:
            col_mem[col] += chunk[col].memory_usage(index=False, deep=True)
        else:
            col_mem[col] = chunk[col].memory_usage(index=False, deep=True)
total_mem = 0
for key in col_mem:
    print(key, col_mem[key])
    total_mem += col_mem[key]
print('\n', 'Total memory Usage of All Columns(in MB):', round(total_mem/1024**2, 3))
investor_category_code 622424
funded_month 3383584
company_state_code 3106051
company_category_code 3421104
funded_year 422960
investor_permalink 4980548
company_name 3591326
company_city 3505926
funding_round_type 3410707
investor_state_code 2476607
funded_at 3542185
investor_region 3396281
company_permalink 4057788
company_region 3411585
company_country_code 3172176
investor_city 2885083
funded_quarter 3383584
investor_name 3915666
investor_country_code 2647292
raised_amount_usd 422960

Total memory Usage of All Columns(in MB): 56.988
Next, let's count the missing values in each column across all chunks:
crunch_chunks = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
missing = []
for chunk in crunch_chunks:
    missing.append(pd.isnull(chunk).sum())
combined = pd.concat(missing)
final = combined.groupby(combined.index).sum()
final.sort_values()
company_country_code 1
company_name 1
company_permalink 1
company_region 1
investor_region 2
investor_permalink 2
investor_name 2
funded_quarter 3
funded_at 3
funded_month 3
funded_year 3
funding_round_type 3
company_state_code 492
company_city 533
company_category_code 643
raised_amount_usd 3599
investor_country_code 12001
investor_city 12480
investor_state_code 16809
investor_category_code 50427
dtype: int64
Looking at the per-column counts, investor_category_code has far too many missing values (50,427 out of 52,870 rows) to be useful for analysis.
Links to websites are also not useful for analysis: company_permalink and investor_permalink. We will use df.drop to drop these columns, leaving 17 columns in the dataframe.
crunch_chunks = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
for chunk in crunch_chunks:
    dropped_chunk = chunk.drop(['investor_category_code', 'company_permalink', 'investor_permalink'], axis=1)
    cols = list(dropped_chunk.columns)
print(dropped_chunk.shape[1])
print(cols)
dropped_chunk
17 ['company_name', 'company_category_code', 'company_country_code', 'company_state_code', 'company_region', 'company_city', 'investor_name', 'investor_country_code', 'investor_state_code', 'investor_region', 'investor_city', 'funding_round_type', 'funded_at', 'funded_month', 'funded_quarter', 'funded_year', 'raised_amount_usd']
company_name | company_category_code | company_country_code | company_state_code | company_region | company_city | investor_name | investor_country_code | investor_state_code | investor_region | investor_city | funding_round_type | funded_at | funded_month | funded_quarter | funded_year | raised_amount_usd | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50000 | NuORDER | fashion | USA | CA | Los Angeles | West Hollywood | Mortimer Singer | NaN | NaN | unknown | NaN | series-a | 2012-10-01 | 2012-10 | 2012-Q4 | 2012 | 3060000.0 |
50001 | ChaCha | advertising | USA | IN | Indianapolis | Carmel | Morton Meyerson | NaN | NaN | unknown | NaN | series-b | 2007-10-01 | 2007-10 | 2007-Q4 | 2007 | 12000000.0 |
50002 | Binfire | software | USA | FL | Bocat Raton | Bocat Raton | Moshe Ariel | NaN | NaN | unknown | NaN | angel | 2008-04-18 | 2008-04 | 2008-Q2 | 2008 | 500000.0 |
50003 | Binfire | software | USA | FL | Bocat Raton | Bocat Raton | Moshe Ariel | NaN | NaN | unknown | NaN | angel | 2010-01-01 | 2010-01 | 2010-Q1 | 2010 | 750000.0 |
50004 | Unified Color | software | USA | CA | SF Bay | South San Frnacisco | Mr. Andrew Oung | NaN | NaN | unknown | NaN | angel | 2010-01-01 | 2010-01 | 2010-Q1 | 2010 | NaN |
50005 | HItviews | advertising | USA | NY | New York | New York City | multiple parties | NaN | NaN | unknown | NaN | angel | 2007-11-29 | 2007-11 | 2007-Q4 | 2007 | 485000.0 |
50006 | LockerDome | social | USA | MO | Saint Louis | St. Louis | multiple parties | NaN | NaN | unknown | NaN | angel | 2012-04-17 | 2012-04 | 2012-Q2 | 2012 | 300000.0 |
50007 | ThirdLove | ecommerce | USA | CA | SF Bay | San Francisco | Munjal Shah | NaN | NaN | unknown | NaN | series-a | 2012-12-01 | 2012-12 | 2012-Q4 | 2012 | 5600000.0 |
50008 | Hakia | search | USA | NaN | TBD | NaN | Murat Vargi | NaN | NaN | unknown | NaN | series-a | 2006-11-01 | 2006-11 | 2006-Q4 | 2006 | 16000000.0 |
50009 | bookacoach | sports | USA | IN | Indianapolis | Indianapolis | Myles Grote | NaN | NaN | unknown | NaN | angel | 2012-11-01 | 2012-11 | 2012-Q4 | 2012 | NaN |
50010 | LocalCircles | social | USA | CA | SF Bay | Santa Clara | Nadir Godrej | NaN | NaN | unknown | NaN | angel | 2012-09-01 | 2012-09 | 2012-Q3 | 2012 | NaN |
50011 | Graphdive | analytics | USA | CA | SF Bay | Menlo Park | Naguib Sawiris | NaN | NaN | unknown | NaN | angel | 2012-10-04 | 2012-10 | 2012-Q4 | 2012 | 1000000.0 |
50012 | Ribbon | ecommerce | USA | CA | SF Bay | San Francisco | Naguib Sawiris | NaN | NaN | unknown | NaN | series-a | 2013-02-05 | 2013-02 | 2013-Q1 | 2013 | 1630000.0 |
50013 | Dokkankom.com | ecommerce | USA | NY | New York | new york | Namek Zu'bi | NaN | NaN | unknown | NaN | angel | 2011-10-10 | 2011-10 | 2011-Q4 | 2011 | 30000.0 |
50014 | Lookery | web | USA | CA | SF Bay | San Francisco | Nana Shin | NaN | NaN | unknown | NaN | angel | 2008-02-07 | 2008-02 | 2008-Q1 | 2008 | 900000.0 |
50015 | TrustDegrees | web | USA | NY | Kenmore | Kenmore | Nancy Barrett | NaN | NaN | unknown | NaN | angel | 2011-06-09 | 2011-06 | 2011-Q2 | 2011 | 8000.0 |
50016 | Altavoz | games_video | USA | DC | Washington DC | Washington | Nancy Jacobsen | NaN | NaN | unknown | NaN | angel | 2012-09-11 | 2012-09 | 2012-Q3 | 2012 | 150000.0 |
50017 | EdSurge | education | USA | CA | SF Bay | Burlingame | Nancy Peretsman | NaN | NaN | unknown | NaN | angel | 2012-08-29 | 2012-08 | 2012-Q3 | 2012 | 400000.0 |
50018 | FullContact | enterprise | USA | CO | Denver | Denver | Nancy Pierce | NaN | NaN | unknown | NaN | series-b | 2012-07-09 | 2012-07 | 2012-Q3 | 2012 | 7000000.0 |
50019 | Rapt Media | enterprise | USA | CO | Denver | Boulder | Nancy Pierce | NaN | NaN | unknown | NaN | series-a | 2013-01-23 | 2013-01 | 2013-Q1 | 2013 | 2288803.0 |
50020 | Humanoid | software | USA | CA | SF Bay | San Francisco | Nat Friedman | NaN | NaN | unknown | NaN | angel | 2010-12-01 | 2010-12 | 2010-Q4 | 2010 | 1100000.0 |
50021 | Runscope | web | USA | CA | SF Bay | San Francisco | Nat Friedman | NaN | NaN | unknown | NaN | angel | 2013-05-22 | 2013-05 | 2013-Q2 | 2013 | 1100000.0 |
50022 | Adzerk | advertising | USA | NC | Raleigh-Durham | Durham | Nat Turner | NaN | NaN | unknown | NaN | angel | 2011-07-12 | 2011-07 | 2011-Q3 | 2011 | 650000.0 |
50023 | Adaptly | advertising | USA | NY | New York | New York | Nat Turner | NaN | NaN | unknown | NaN | series-a | 2011-04-18 | 2011-04 | 2011-Q2 | 2011 | 2000000.0 |
50024 | Lore | education | USA | NY | New York | New York | Nat Turner | NaN | NaN | unknown | NaN | angel | 2011-06-27 | 2011-06 | 2011-Q2 | 2011 | 1000000.0 |
50025 | Tasted Menu | hospitality | USA | MA | Boston | Boston | Nat Turner | NaN | NaN | unknown | NaN | angel | 2011-05-01 | 2011-05 | 2011-Q2 | 2011 | NaN |
50026 | Lua Technologies | mobile | USA | NY | New York | New York | Nat Turner | NaN | NaN | unknown | NaN | series-a | 2012-08-01 | 2012-08 | 2012-Q3 | 2012 | 2500000.0 |
50027 | Blue Apron | hospitality | USA | NY | New York | Brooklyn | Nat Turner | NaN | NaN | unknown | NaN | series-a | 2013-02-19 | 2013-02 | 2013-Q1 | 2013 | 3000000.0 |
50028 | ChatID | mobile | USA | NY | New York | New York | Nat Turner | NaN | NaN | unknown | NaN | angel | 2012-01-01 | 2012-01 | 2012-Q1 | 2012 | NaN |
50029 | Breakthrough Behavioral | health | USA | CA | SF Bay | Redwood City | Nat Turner | NaN | NaN | unknown | NaN | angel | 2012-08-13 | 2012-08 | 2012-Q3 | 2012 | 900000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
52840 | Meddik | health | USA | NY | New York | New York | Zach Weinberg | NaN | NaN | unknown | NaN | angel | 2012-05-24 | 2012-05 | 2012-Q2 | 2012 | 750000.0 |
52841 | Blue Apron | hospitality | USA | NY | New York | Brooklyn | Zach Weinberg | NaN | NaN | unknown | NaN | series-a | 2013-02-19 | 2013-02 | 2013-Q1 | 2013 | 3000000.0 |
52842 | ChatID | mobile | USA | NY | New York | New York | Zach Weinberg | NaN | NaN | unknown | NaN | angel | 2012-01-01 | 2012-01 | 2012-Q1 | 2012 | NaN |
52843 | Breakthrough Behavioral | health | USA | CA | SF Bay | Redwood City | Zach Weinberg | NaN | NaN | unknown | NaN | angel | 2012-08-13 | 2012-08 | 2012-Q3 | 2012 | 900000.0 |
52844 | Plaid | software | USA | CA | SF Bay | San Francisco | Zach Weinberg | NaN | NaN | unknown | NaN | series-a | 2013-09-19 | 2013-09 | 2013-Q3 | 2013 | 2800000.0 |
52845 | PokitDok | mobile | USA | CA | SF Bay | Menlo Park | Zach Zeitlin | NaN | NaN | unknown | NaN | angel | 2012-07-12 | 2012-07 | 2012-Q3 | 2012 | 1300000.0 |
52846 | Fitocracy | web | USA | NY | New York | New York | Zachary Aarons | NaN | NaN | unknown | NaN | angel | 2011-09-01 | 2011-09 | 2011-Q3 | 2011 | 250000.0 |
52847 | Square | mobile | USA | CA | SF Bay | San Francisco | Zachary Bogue | NaN | NaN | unknown | NaN | series-a | 2009-11-01 | 2009-11 | 2009-Q4 | 2009 | 10000000.0 |
52848 | MixRank | advertising | USA | CA | SF Bay | San Francisco | Zachary Bogue | NaN | NaN | unknown | NaN | series-a | 2011-11-18 | 2011-11 | 2011-Q4 | 2011 | 1500000.0 |
52849 | Socialcam | mobile | USA | CA | Santa Clara County | Santa Clara County | Zachary Bogue | NaN | NaN | unknown | NaN | angel | 2012-04-30 | 2012-04 | 2012-Q2 | 2012 | NaN |
52850 | Nuzzel | news | USA | CA | SF Bay | San Francisco | Zachary Bogue | NaN | NaN | unknown | NaN | venture | 2012-11-15 | 2012-11 | 2012-Q4 | 2012 | 1700000.0 |
52851 | ThirdLove | ecommerce | USA | CA | SF Bay | San Francisco | Zachary Bogue | NaN | NaN | unknown | NaN | series-a | 2012-12-01 | 2012-12 | 2012-Q4 | 2012 | 5600000.0 |
52852 | MXD3D | web | USA | CA | SF Bay | San Francisco | Zaid Ayoub | NaN | NaN | unknown | NaN | angel | 2012-01-01 | 2012-01 | 2012-Q1 | 2012 | 300000.0 |
52853 | MXD3D | web | USA | CA | SF Bay | San Francisco | Zaid Ayoub | NaN | NaN | unknown | NaN | angel | 2011-01-01 | 2011-01 | 2011-Q1 | 2011 | 300000.0 |
52854 | Verious | mobile | USA | CA | SF Bay | San Carlos | Zain Khan | NaN | NaN | unknown | NaN | angel | 2011-05-30 | 2011-05 | 2011-Q2 | 2011 | 800000.0 |
52855 | Identified | analytics | USA | CA | SF Bay | San Francisco | Zao Yang | NaN | NaN | unknown | NaN | series-b | 2012-06-05 | 2012-06 | 2012-Q2 | 2012 | 21000000.0 |
52856 | HaulerDeals | fashion | USA | CA | Los Angeles | Los Angeles | Zaw Thet | NaN | NaN | unknown | NaN | angel | 2012-10-31 | 2012-10 | 2012-Q4 | 2012 | 1250000.0 |
52857 | When You Wish | nonprofit | USA | CA | Los Angeles | Marina Del Rey | Zelda Marzec | NaN | NaN | unknown | NaN | series-a | 2011-02-01 | 2011-02 | 2011-Q1 | 2011 | 1500000.0 |
52858 | Farmeron | analytics | USA | CA | SF Bay | Mountain View | Zeljko Mataija | NaN | NaN | unknown | NaN | angel | 2010-10-01 | 2010-10 | 2010-Q4 | 2010 | 15000.0 |
52859 | Theraclone Sciences | biotech | USA | WA | Seattle | Seattle | Zenyaku Kogyo | NaN | NaN | unknown | NaN | series-b | 2013-03-25 | 2013-03 | 2013-Q1 | 2013 | 8000000.0 |
52860 | SimpleGeo | advertising | USA | CA | SF Bay | San Francisco | Ziv Navoth | NaN | NaN | unknown | NaN | other | 2009-11-10 | 2009-11 | 2009-Q4 | 2009 | 195000.0 |
52861 | Open Me | ecommerce | USA | CA | Los Angeles | Los Angeles | Ziver Birg | NaN | NaN | unknown | NaN | angel | 2013-08-01 | 2013-08 | 2013-Q3 | 2013 | NaN |
52862 | Comprehend Systems | enterprise | USA | CA | SF Bay | Palo Alto | Zod Nazem | NaN | NaN | unknown | NaN | series-a | 2013-07-11 | 2013-07 | 2013-Q3 | 2013 | 8400000.0 |
52863 | Payoneer | other | USA | NY | New York | New York | Zohar Gilon | NaN | NaN | unknown | NaN | series-a | 2005-01-01 | 2005-01 | 2005-Q1 | 2005 | 2000000.0 |
52864 | Outbrain | web | USA | NY | New York | New York City | Zohar Gilon | NaN | NaN | unknown | NaN | series-b | 2009-02-11 | 2009-02 | 2009-Q1 | 2009 | 12000000.0 |
52865 | Garantia Data | enterprise | USA | CA | SF Bay | Santa Clara | Zohar Gilon | NaN | NaN | unknown | NaN | series-a | 2012-08-08 | 2012-08 | 2012-Q3 | 2012 | 3800000.0 |
52866 | DudaMobile | mobile | USA | CA | SF Bay | Palo Alto | Zohar Gilon | NaN | NaN | unknown | NaN | series-c+ | 2013-04-08 | 2013-04 | 2013-Q2 | 2013 | 10300000.0 |
52867 | SiteBrains | software | USA | CA | SF Bay | San Francisco | zohar israel | NaN | NaN | unknown | NaN | angel | 2010-08-01 | 2010-08 | 2010-Q3 | 2010 | 350000.0 |
52868 | Comprehend Systems | enterprise | USA | CA | SF Bay | Palo Alto | Zorba Lieberman | NaN | NaN | unknown | NaN | series-a | 2013-07-11 | 2013-07 | 2013-Q3 | 2013 | 8400000.0 |
52869 | SmartThings | mobile | USA | DC | unknown | Minneapolis | Zorik Gordon | NaN | NaN | unknown | NaN | series-a | 2012-12-04 | 2012-12 | 2012-Q4 | 2012 | 3000000.0 |
2870 rows × 17 columns
dropped_chunk.dtypes
company_name object
company_category_code object
company_country_code object
company_state_code object
company_region object
company_city object
investor_name object
investor_country_code float64
investor_state_code float64
investor_region object
investor_city float64
funding_round_type object
funded_at object
funded_month object
funded_quarter object
funded_year int64
raised_amount_usd float64
dtype: object
Now, let's apply a different approach:
- For each chunk, and for each column within it, build a dictionary with column names as keys and a list of that column's observed dtypes as values. This lets us find all data types in case some columns have mixed types across chunks.
- Remove duplicate data types from each key's value list.
- Compare the de-duplicated values against the previous result to see whether this approach reveals any differences.
Let's write a function that removes duplicate values from a dictionary whose values are lists. It returns the dictionary with only unique values in each key's list.
def dedup_dictionary(dict_name):
    for key in dict_name:
        dict_name[key] = list(dict.fromkeys(dict_name[key]))
    return dict_name
crunch_chunks = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
col_types = {}
for chunk in crunch_chunks:
    dropped_chunk = chunk.drop(['investor_category_code', 'company_permalink', 'investor_permalink'], axis=1)
    cols = list(dropped_chunk.columns)
    for col in cols:
        if col in col_types:
            col_types[col].append(str(dropped_chunk.dtypes[col]))
        else:
            col_types[col] = [str(dropped_chunk.dtypes[col])]
dedup_dictionary(col_types)
{'company_category_code': ['object'],
 'company_city': ['object'],
 'company_country_code': ['object'],
 'company_name': ['object'],
 'company_region': ['object'],
 'company_state_code': ['object'],
 'funded_at': ['object'],
 'funded_month': ['object'],
 'funded_quarter': ['object'],
 'funded_year': ['int64', 'float64'],
 'funding_round_type': ['object'],
 'investor_city': ['object', 'float64'],
 'investor_country_code': ['object', 'float64'],
 'investor_name': ['object'],
 'investor_region': ['object'],
 'investor_state_code': ['object', 'float64'],
 'raised_amount_usd': ['float64']}
Let's identify the numeric columns that we could represent with more space-efficient types:
import numpy as np
list(dropped_chunk.select_dtypes(include=[np.number]).columns)
['investor_country_code', 'investor_state_code', 'investor_city', 'funded_year', 'raised_amount_usd']
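One way to shrink these numeric columns is `pd.to_numeric` with its `downcast` parameter, which picks the smallest dtype that can hold the values. A minimal sketch on a toy frame (the column names match the dataset, but the values here are invented for illustration):

```python
import pandas as pd

# Toy frame standing in for two of the numeric columns identified above;
# the values are illustrative, not taken from the dataset.
df = pd.DataFrame({
    'funded_year': [2012, 2011, 2013],
    'raised_amount_usd': [2000000.0, 20000.0, None],
})

before = df.memory_usage(deep=True).sum()

# downcast='integer'/'float' selects the smallest dtype that fits the values
df['funded_year'] = pd.to_numeric(df['funded_year'], downcast='integer')
df['raised_amount_usd'] = pd.to_numeric(df['raised_amount_usd'], downcast='float')

after = df.memory_usage(deep=True).sum()
print(df.dtypes)   # funded_year -> int16, raised_amount_usd -> float32
```

Note that a column containing NaN (like raised_amount_usd) cannot be downcast to an integer type, so float32 is as small as it gets.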
For the text columns:
- Analyze the unique value counts across all of the chunks to see if we can convert them to a numeric type.
- Once grouped by their corresponding indexes, use the Series.sum() method to return the final unique value counts as another dictionary with columns as keys.
- Display the columns and value counts from this dictionary.
crunch_chunks = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
vc_dict = {}
obj_cols = list(dropped_chunk.select_dtypes(include=['object']).columns)
print(obj_cols)
for chunk in crunch_chunks:
    text_chunk = chunk.select_dtypes(include=['object'])
    for col in obj_cols:
        tchunk_vc = text_chunk[col].value_counts()
        if col in vc_dict:
            vc_dict[col].append(tchunk_vc)
        else:
            vc_dict[col] = [tchunk_vc]
final_dict = {}
for col in vc_dict:
    combined_vc = pd.concat(vc_dict[col])
    final_vc = combined_vc.groupby(combined_vc.index).sum()
    final_dict[col] = final_vc
    print(col, len(final_vc))
['company_name', 'company_category_code', 'company_country_code', 'company_state_code', 'company_region', 'company_city', 'investor_name', 'investor_region', 'funding_round_type', 'funded_at', 'funded_month', 'funded_quarter']
company_region 546
funded_month 192
company_name 11573
funding_round_type 9
funded_quarter 72
company_city 1229
investor_name 10465
company_category_code 43
investor_region 585
company_country_code 2
company_state_code 50
funded_at 2808
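Text columns with few unique values relative to the row count (funding_round_type has 9, company_country_code just 2) are good candidates for the pandas category dtype, which stores small integer codes plus one copy of each unique string. A minimal sketch on a synthetic column (the values are assumed for illustration):

```python
import pandas as pd

# Synthetic low-cardinality column mimicking funding_round_type:
# a handful of unique strings repeated over many rows.
s = pd.Series(['angel', 'series-a', 'other', 'angel'] * 5000)

before = s.memory_usage(deep=True)
cat = s.astype('category')   # integer codes + one copy of each unique string
after = cat.memory_usage(deep=True)

print(before, after)  # the categorical column is several times smaller
```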
Next, let's load the dataset into a SQLite database in chunks, then use the query PRAGMA table_info(table_name) to return information on the table's column data types.
import sqlite3
conn = sqlite3.connect('crunchbase.db')
crunch_iter = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
for chunk in crunch_iter:
    chunk.to_sql('ventures', conn, if_exists='append', index=False)
results_df = pd.read_sql('PRAGMA table_info("ventures")', conn)
print(results_df)
    cid  name                    type     notnull  dflt_value  pk
0   0    company_permalink       TEXT     0        None        0
1   1    company_name            TEXT     0        None        0
2   2    company_category_code   TEXT     0        None        0
3   3    company_country_code    TEXT     0        None        0
4   4    company_state_code      TEXT     0        None        0
5   5    company_region          TEXT     0        None        0
6   6    company_city            TEXT     0        None        0
7   7    investor_permalink      TEXT     0        None        0
8   8    investor_name           TEXT     0        None        0
9   9    investor_category_code  TEXT     0        None        0
10  10   investor_country_code   TEXT     0        None        0
11  11   investor_state_code     TEXT     0        None        0
12  12   investor_region         TEXT     0        None        0
13  13   investor_city           TEXT     0        None        0
14  14   funding_round_type      TEXT     0        None        0
15  15   funded_at               TEXT     0        None        0
16  16   funded_month            TEXT     0        None        0
17  17   funded_quarter          TEXT     0        None        0
18  18   funded_year             INTEGER  0        None        0
19  19   raised_amount_usd       REAL     0        None        0
The SQLite column types match the pandas dtypes from the previous steps:
- funded_year was int64 and becomes INTEGER
- raised_amount_usd was float64 and becomes REAL
- columns typed object (or mixed object/float64) become TEXT
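With the data in SQLite, aggregations can be pushed into SQL so that only a small result set ever reaches pandas. A sketch against an in-memory stand-in for the ventures table (the column names match the schema, but the rows here are invented for illustration):

```python
import sqlite3
import pandas as pd

# In-memory stand-in for crunchbase.db; column names match the real table,
# but these rows are invented for illustration.
conn = sqlite3.connect(':memory:')
pd.DataFrame({
    'funding_round_type': ['angel', 'series-a', 'angel'],
    'raised_amount_usd': [20000.0, 2000000.0, 500000.0],
}).to_sql('ventures', conn, index=False)

# The GROUP BY runs inside SQLite; pandas only receives the aggregate result.
query = '''
    SELECT funding_round_type,
           SUM(raised_amount_usd) AS total_raised
    FROM ventures
    GROUP BY funding_round_type
    ORDER BY total_raised DESC
'''
result = pd.read_sql(query, conn)
print(result)
```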
Finally, let's find the size of the crunchbase.db database file:
!ls -l crunchbase.db
-rw-r--r-- 1 dq root 32329728 Jun 24 21:19 crunchbase.db
!wc -c crunchbase.db
32329728 crunchbase.db
The database file is about 32 MB on disk.