To get familiar with the data, let's create a helper function that reads a specified number of rows from a given CSV file using the right encoding and displays them as a table.
import pandas as pd
import matplotlib.pyplot as plt
csvfile = 'crunchbase-investments.csv'
def pdread_csv(filename, numrows, enc='ISO-8859-1'):
    rows = pd.read_csv(filename, nrows=numrows, encoding=enc)
    return rows
Read a few rows from the CSV file:
pdread_csv(csvfile, 5)
company_permalink | company_name | company_category_code | company_country_code | company_state_code | company_region | company_city | investor_permalink | investor_name | investor_category_code | investor_country_code | investor_state_code | investor_region | investor_city | funding_round_type | funded_at | funded_month | funded_quarter | funded_year | raised_amount_usd | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | /company/advercar | AdverCar | advertising | USA | CA | SF Bay | San Francisco | /company/1-800-flowers-com | 1-800-FLOWERS.COM | NaN | USA | NY | New York | New York | series-a | 2012-10-30 | 2012-10 | 2012-Q4 | 2012 | 2000000 |
1 | /company/launchgram | LaunchGram | news | USA | CA | SF Bay | Mountain View | /company/10xelerator | 10Xelerator | finance | USA | OH | Columbus | Columbus | other | 2012-01-23 | 2012-01 | 2012-Q1 | 2012 | 20000 |
2 | /company/utap | uTaP | messaging | USA | NaN | United States - Other | NaN | /company/10xelerator | 10Xelerator | finance | USA | OH | Columbus | Columbus | other | 2012-01-01 | 2012-01 | 2012-Q1 | 2012 | 20000 |
3 | /company/zoopshop | ZoopShop | software | USA | OH | Columbus | columbus | /company/10xelerator | 10Xelerator | finance | USA | OH | Columbus | Columbus | angel | 2012-02-15 | 2012-02 | 2012-Q1 | 2012 | 20000 |
4 | /company/efuneral | eFuneral | web | USA | OH | Cleveland | Cleveland | /company/10xelerator | 10Xelerator | finance | USA | OH | Columbus | Columbus | other | 2011-09-08 | 2011-09 | 2011-Q3 | 2011 | 20000 |
Total memory usage (in MB) for 5000 rows
pdread_csv(csvfile, 5000).memory_usage(deep=True).sum()/1024**2
5.579195022583008
Since 5,000 rows consume far less than 10 MB, let's read the data into dataframes in 5,000-row chunks so that each chunk stays well under 10 MB of memory. We will then calculate the total memory footprint of all the chunks combined.
Let's define a function for this. It will consume a chunk iterator and return the total memory footprint of the full dataset, in MB.
def all_chunks_mem(chunk_iter):
    total_memory_footprint = 0
    for chunk in chunk_iter:
        total_memory_footprint += chunk.memory_usage(deep=True).sum()/1024**2
    return total_memory_footprint
crunch_chunks = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
all_chunks_mem(crunch_chunks)
56.988484382629395
This dataset is believed to contain over 50,000 rows. Let's find out how many rows and columns there are:
crunch_chunks = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
total_rows = 0
for chunk in crunch_chunks:
    total_rows += chunk.shape[0]
print(total_rows)
print(chunk.shape[1])
52870 20
Next, let's calculate each column's memory footprint:
- We will exclude the memory usage of the DataFrame's index.
- We will chunk the dataset as before, 5,000 rows at a time.
- We will build a list of column names to iterate over in each chunk.
- We will use a dictionary to track each column's memory footprint: keys are column names, values the cumulative memory usage per column.
- The sum of all values, in MB, should match the dataset's total memory footprint (all chunks combined) calculated earlier.
crunch_chunks = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
col_mem = {}
for chunk in crunch_chunks:
    cols = list(chunk.columns)
    for col in cols:
        if col in col_mem:
            col_mem[col] += chunk[col].memory_usage(index=False, deep=True)
        else:
            col_mem[col] = chunk[col].memory_usage(index=False, deep=True)
total_mem = 0
for key in col_mem:
    print(key, col_mem[key])
    total_mem += col_mem[key]
print('\n', 'Total memory Usage of All Columns(in MB):', round(total_mem/1024**2, 3))
investor_category_code 622424
funded_month 3383584
company_state_code 3106051
company_category_code 3421104
funded_year 422960
investor_permalink 4980548
company_name 3591326
company_city 3505926
funding_round_type 3410707
investor_state_code 2476607
funded_at 3542185
investor_region 3396281
company_permalink 4057788
company_region 3411585
company_country_code 3172176
investor_city 2885083
funded_quarter 3383584
investor_name 3915666
investor_country_code 2647292
raised_amount_usd 422960

Total memory Usage of All Columns(in MB): 56.988
Next, let's count the missing values in each column across all chunks:
crunch_chunks = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
missing = []
for chunk in crunch_chunks:
    missing.append(pd.isnull(chunk).sum())
combined = pd.concat(missing)
final = combined.groupby(combined.index).sum()
final.sort_values()
company_country_code 1
company_name 1
company_permalink 1
company_region 1
investor_region 2
investor_permalink 2
investor_name 2
funded_quarter 3
funded_at 3
funded_month 3
funded_year 3
funding_round_type 3
company_state_code 492
company_city 533
company_category_code 643
raised_amount_usd 3599
investor_country_code 12001
investor_city 12480
investor_state_code 16809
investor_category_code 50427
dtype: int64
Looking at the per-column counts, investor_category_code has far too many missing values (50,427 out of 52,870 rows) to be useful for analysis.
Links to websites are also not useful for analysis: company_permalink and investor_permalink. We will use df.drop to drop these columns, leaving 17 columns in the dataframe.
crunch_chunks = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
for chunk in crunch_chunks:
    dropped_chunk = chunk.drop(['investor_category_code', 'company_permalink', 'investor_permalink'], axis=1)
    cols = list(dropped_chunk.columns)
print(dropped_chunk.shape[1])
print(cols)
dropped_chunk
17 ['company_name', 'company_category_code', 'company_country_code', 'company_state_code', 'company_region', 'company_city', 'investor_name', 'investor_country_code', 'investor_state_code', 'investor_region', 'investor_city', 'funding_round_type', 'funded_at', 'funded_month', 'funded_quarter', 'funded_year', 'raised_amount_usd']
company_name | company_category_code | company_country_code | company_state_code | company_region | company_city | investor_name | investor_country_code | investor_state_code | investor_region | investor_city | funding_round_type | funded_at | funded_month | funded_quarter | funded_year | raised_amount_usd | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
50000 | NuORDER | fashion | USA | CA | Los Angeles | West Hollywood | Mortimer Singer | NaN | NaN | unknown | NaN | series-a | 2012-10-01 | 2012-10 | 2012-Q4 | 2012 | 3060000.0 |
50001 | ChaCha | advertising | USA | IN | Indianapolis | Carmel | Morton Meyerson | NaN | NaN | unknown | NaN | series-b | 2007-10-01 | 2007-10 | 2007-Q4 | 2007 | 12000000.0 |
50002 | Binfire | software | USA | FL | Bocat Raton | Bocat Raton | Moshe Ariel | NaN | NaN | unknown | NaN | angel | 2008-04-18 | 2008-04 | 2008-Q2 | 2008 | 500000.0 |
50003 | Binfire | software | USA | FL | Bocat Raton | Bocat Raton | Moshe Ariel | NaN | NaN | unknown | NaN | angel | 2010-01-01 | 2010-01 | 2010-Q1 | 2010 | 750000.0 |
50004 | Unified Color | software | USA | CA | SF Bay | South San Frnacisco | Mr. Andrew Oung | NaN | NaN | unknown | NaN | angel | 2010-01-01 | 2010-01 | 2010-Q1 | 2010 | NaN |
50005 | HItviews | advertising | USA | NY | New York | New York City | multiple parties | NaN | NaN | unknown | NaN | angel | 2007-11-29 | 2007-11 | 2007-Q4 | 2007 | 485000.0 |
50006 | LockerDome | social | USA | MO | Saint Louis | St. Louis | multiple parties | NaN | NaN | unknown | NaN | angel | 2012-04-17 | 2012-04 | 2012-Q2 | 2012 | 300000.0 |
50007 | ThirdLove | ecommerce | USA | CA | SF Bay | San Francisco | Munjal Shah | NaN | NaN | unknown | NaN | series-a | 2012-12-01 | 2012-12 | 2012-Q4 | 2012 | 5600000.0 |
50008 | Hakia | search | USA | NaN | TBD | NaN | Murat Vargi | NaN | NaN | unknown | NaN | series-a | 2006-11-01 | 2006-11 | 2006-Q4 | 2006 | 16000000.0 |
50009 | bookacoach | sports | USA | IN | Indianapolis | Indianapolis | Myles Grote | NaN | NaN | unknown | NaN | angel | 2012-11-01 | 2012-11 | 2012-Q4 | 2012 | NaN |
50010 | LocalCircles | social | USA | CA | SF Bay | Santa Clara | Nadir Godrej | NaN | NaN | unknown | NaN | angel | 2012-09-01 | 2012-09 | 2012-Q3 | 2012 | NaN |
50011 | Graphdive | analytics | USA | CA | SF Bay | Menlo Park | Naguib Sawiris | NaN | NaN | unknown | NaN | angel | 2012-10-04 | 2012-10 | 2012-Q4 | 2012 | 1000000.0 |
50012 | Ribbon | ecommerce | USA | CA | SF Bay | San Francisco | Naguib Sawiris | NaN | NaN | unknown | NaN | series-a | 2013-02-05 | 2013-02 | 2013-Q1 | 2013 | 1630000.0 |
50013 | Dokkankom.com | ecommerce | USA | NY | New York | new york | Namek Zu'bi | NaN | NaN | unknown | NaN | angel | 2011-10-10 | 2011-10 | 2011-Q4 | 2011 | 30000.0 |
50014 | Lookery | web | USA | CA | SF Bay | San Francisco | Nana Shin | NaN | NaN | unknown | NaN | angel | 2008-02-07 | 2008-02 | 2008-Q1 | 2008 | 900000.0 |
50015 | TrustDegrees | web | USA | NY | Kenmore | Kenmore | Nancy Barrett | NaN | NaN | unknown | NaN | angel | 2011-06-09 | 2011-06 | 2011-Q2 | 2011 | 8000.0 |
50016 | Altavoz | games_video | USA | DC | Washington DC | Washington | Nancy Jacobsen | NaN | NaN | unknown | NaN | angel | 2012-09-11 | 2012-09 | 2012-Q3 | 2012 | 150000.0 |
50017 | EdSurge | education | USA | CA | SF Bay | Burlingame | Nancy Peretsman | NaN | NaN | unknown | NaN | angel | 2012-08-29 | 2012-08 | 2012-Q3 | 2012 | 400000.0 |
50018 | FullContact | enterprise | USA | CO | Denver | Denver | Nancy Pierce | NaN | NaN | unknown | NaN | series-b | 2012-07-09 | 2012-07 | 2012-Q3 | 2012 | 7000000.0 |
50019 | Rapt Media | enterprise | USA | CO | Denver | Boulder | Nancy Pierce | NaN | NaN | unknown | NaN | series-a | 2013-01-23 | 2013-01 | 2013-Q1 | 2013 | 2288803.0 |
50020 | Humanoid | software | USA | CA | SF Bay | San Francisco | Nat Friedman | NaN | NaN | unknown | NaN | angel | 2010-12-01 | 2010-12 | 2010-Q4 | 2010 | 1100000.0 |
50021 | Runscope | web | USA | CA | SF Bay | San Francisco | Nat Friedman | NaN | NaN | unknown | NaN | angel | 2013-05-22 | 2013-05 | 2013-Q2 | 2013 | 1100000.0 |
50022 | Adzerk | advertising | USA | NC | Raleigh-Durham | Durham | Nat Turner | NaN | NaN | unknown | NaN | angel | 2011-07-12 | 2011-07 | 2011-Q3 | 2011 | 650000.0 |
50023 | Adaptly | advertising | USA | NY | New York | New York | Nat Turner | NaN | NaN | unknown | NaN | series-a | 2011-04-18 | 2011-04 | 2011-Q2 | 2011 | 2000000.0 |
50024 | Lore | education | USA | NY | New York | New York | Nat Turner | NaN | NaN | unknown | NaN | angel | 2011-06-27 | 2011-06 | 2011-Q2 | 2011 | 1000000.0 |
50025 | Tasted Menu | hospitality | USA | MA | Boston | Boston | Nat Turner | NaN | NaN | unknown | NaN | angel | 2011-05-01 | 2011-05 | 2011-Q2 | 2011 | NaN |
50026 | Lua Technologies | mobile | USA | NY | New York | New York | Nat Turner | NaN | NaN | unknown | NaN | series-a | 2012-08-01 | 2012-08 | 2012-Q3 | 2012 | 2500000.0 |
50027 | Blue Apron | hospitality | USA | NY | New York | Brooklyn | Nat Turner | NaN | NaN | unknown | NaN | series-a | 2013-02-19 | 2013-02 | 2013-Q1 | 2013 | 3000000.0 |
50028 | ChatID | mobile | USA | NY | New York | New York | Nat Turner | NaN | NaN | unknown | NaN | angel | 2012-01-01 | 2012-01 | 2012-Q1 | 2012 | NaN |
50029 | Breakthrough Behavioral | health | USA | CA | SF Bay | Redwood City | Nat Turner | NaN | NaN | unknown | NaN | angel | 2012-08-13 | 2012-08 | 2012-Q3 | 2012 | 900000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
52840 | Meddik | health | USA | NY | New York | New York | Zach Weinberg | NaN | NaN | unknown | NaN | angel | 2012-05-24 | 2012-05 | 2012-Q2 | 2012 | 750000.0 |
52841 | Blue Apron | hospitality | USA | NY | New York | Brooklyn | Zach Weinberg | NaN | NaN | unknown | NaN | series-a | 2013-02-19 | 2013-02 | 2013-Q1 | 2013 | 3000000.0 |
52842 | ChatID | mobile | USA | NY | New York | New York | Zach Weinberg | NaN | NaN | unknown | NaN | angel | 2012-01-01 | 2012-01 | 2012-Q1 | 2012 | NaN |
52843 | Breakthrough Behavioral | health | USA | CA | SF Bay | Redwood City | Zach Weinberg | NaN | NaN | unknown | NaN | angel | 2012-08-13 | 2012-08 | 2012-Q3 | 2012 | 900000.0 |
52844 | Plaid | software | USA | CA | SF Bay | San Francisco | Zach Weinberg | NaN | NaN | unknown | NaN | series-a | 2013-09-19 | 2013-09 | 2013-Q3 | 2013 | 2800000.0 |
52845 | PokitDok | mobile | USA | CA | SF Bay | Menlo Park | Zach Zeitlin | NaN | NaN | unknown | NaN | angel | 2012-07-12 | 2012-07 | 2012-Q3 | 2012 | 1300000.0 |
52846 | Fitocracy | web | USA | NY | New York | New York | Zachary Aarons | NaN | NaN | unknown | NaN | angel | 2011-09-01 | 2011-09 | 2011-Q3 | 2011 | 250000.0 |
52847 | Square | mobile | USA | CA | SF Bay | San Francisco | Zachary Bogue | NaN | NaN | unknown | NaN | series-a | 2009-11-01 | 2009-11 | 2009-Q4 | 2009 | 10000000.0 |
52848 | MixRank | advertising | USA | CA | SF Bay | San Francisco | Zachary Bogue | NaN | NaN | unknown | NaN | series-a | 2011-11-18 | 2011-11 | 2011-Q4 | 2011 | 1500000.0 |
52849 | Socialcam | mobile | USA | CA | Santa Clara County | Santa Clara County | Zachary Bogue | NaN | NaN | unknown | NaN | angel | 2012-04-30 | 2012-04 | 2012-Q2 | 2012 | NaN |
52850 | Nuzzel | news | USA | CA | SF Bay | San Francisco | Zachary Bogue | NaN | NaN | unknown | NaN | venture | 2012-11-15 | 2012-11 | 2012-Q4 | 2012 | 1700000.0 |
52851 | ThirdLove | ecommerce | USA | CA | SF Bay | San Francisco | Zachary Bogue | NaN | NaN | unknown | NaN | series-a | 2012-12-01 | 2012-12 | 2012-Q4 | 2012 | 5600000.0 |
52852 | MXD3D | web | USA | CA | SF Bay | San Francisco | Zaid Ayoub | NaN | NaN | unknown | NaN | angel | 2012-01-01 | 2012-01 | 2012-Q1 | 2012 | 300000.0 |
52853 | MXD3D | web | USA | CA | SF Bay | San Francisco | Zaid Ayoub | NaN | NaN | unknown | NaN | angel | 2011-01-01 | 2011-01 | 2011-Q1 | 2011 | 300000.0 |
52854 | Verious | mobile | USA | CA | SF Bay | San Carlos | Zain Khan | NaN | NaN | unknown | NaN | angel | 2011-05-30 | 2011-05 | 2011-Q2 | 2011 | 800000.0 |
52855 | Identified | analytics | USA | CA | SF Bay | San Francisco | Zao Yang | NaN | NaN | unknown | NaN | series-b | 2012-06-05 | 2012-06 | 2012-Q2 | 2012 | 21000000.0 |
52856 | HaulerDeals | fashion | USA | CA | Los Angeles | Los Angeles | Zaw Thet | NaN | NaN | unknown | NaN | angel | 2012-10-31 | 2012-10 | 2012-Q4 | 2012 | 1250000.0 |
52857 | When You Wish | nonprofit | USA | CA | Los Angeles | Marina Del Rey | Zelda Marzec | NaN | NaN | unknown | NaN | series-a | 2011-02-01 | 2011-02 | 2011-Q1 | 2011 | 1500000.0 |
52858 | Farmeron | analytics | USA | CA | SF Bay | Mountain View | Zeljko Mataija | NaN | NaN | unknown | NaN | angel | 2010-10-01 | 2010-10 | 2010-Q4 | 2010 | 15000.0 |
52859 | Theraclone Sciences | biotech | USA | WA | Seattle | Seattle | Zenyaku Kogyo | NaN | NaN | unknown | NaN | series-b | 2013-03-25 | 2013-03 | 2013-Q1 | 2013 | 8000000.0 |
52860 | SimpleGeo | advertising | USA | CA | SF Bay | San Francisco | Ziv Navoth | NaN | NaN | unknown | NaN | other | 2009-11-10 | 2009-11 | 2009-Q4 | 2009 | 195000.0 |
52861 | Open Me | ecommerce | USA | CA | Los Angeles | Los Angeles | Ziver Birg | NaN | NaN | unknown | NaN | angel | 2013-08-01 | 2013-08 | 2013-Q3 | 2013 | NaN |
52862 | Comprehend Systems | enterprise | USA | CA | SF Bay | Palo Alto | Zod Nazem | NaN | NaN | unknown | NaN | series-a | 2013-07-11 | 2013-07 | 2013-Q3 | 2013 | 8400000.0 |
52863 | Payoneer | other | USA | NY | New York | New York | Zohar Gilon | NaN | NaN | unknown | NaN | series-a | 2005-01-01 | 2005-01 | 2005-Q1 | 2005 | 2000000.0 |
52864 | Outbrain | web | USA | NY | New York | New York City | Zohar Gilon | NaN | NaN | unknown | NaN | series-b | 2009-02-11 | 2009-02 | 2009-Q1 | 2009 | 12000000.0 |
52865 | Garantia Data | enterprise | USA | CA | SF Bay | Santa Clara | Zohar Gilon | NaN | NaN | unknown | NaN | series-a | 2012-08-08 | 2012-08 | 2012-Q3 | 2012 | 3800000.0 |
52866 | DudaMobile | mobile | USA | CA | SF Bay | Palo Alto | Zohar Gilon | NaN | NaN | unknown | NaN | series-c+ | 2013-04-08 | 2013-04 | 2013-Q2 | 2013 | 10300000.0 |
52867 | SiteBrains | software | USA | CA | SF Bay | San Francisco | zohar israel | NaN | NaN | unknown | NaN | angel | 2010-08-01 | 2010-08 | 2010-Q3 | 2010 | 350000.0 |
52868 | Comprehend Systems | enterprise | USA | CA | SF Bay | Palo Alto | Zorba Lieberman | NaN | NaN | unknown | NaN | series-a | 2013-07-11 | 2013-07 | 2013-Q3 | 2013 | 8400000.0 |
52869 | SmartThings | mobile | USA | DC | unknown | Minneapolis | Zorik Gordon | NaN | NaN | unknown | NaN | series-a | 2012-12-04 | 2012-12 | 2012-Q4 | 2012 | 3000000.0 |
2870 rows × 17 columns
dropped_chunk.dtypes
company_name object
company_category_code object
company_country_code object
company_state_code object
company_region object
company_city object
investor_name object
investor_country_code float64
investor_state_code float64
investor_region object
investor_city float64
funding_round_type object
funded_at object
funded_month object
funded_quarter object
funded_year int64
raised_amount_usd float64
dtype: object
Now, let's apply a different approach:
- For each chunk, and for each column within it, build a dictionary with column names as keys and a list of that column's observed dtypes as values. This lets us find all data types in case some columns have mixed types across chunks.
- Remove duplicate data types from each key's value list.
- Compare the de-duplicated values against the previous result to see whether this approach reveals any differences.
Let's write a function that removes duplicate values from a dictionary whose values are lists. It returns the dictionary with only unique values in each key's list.
def dedup_dictionary(dict_name):
    for key in dict_name:
        dict_name[key] = list(dict.fromkeys(dict_name[key]))
    return dict_name
crunch_chunks = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
col_types = {}
for chunk in crunch_chunks:
    dropped_chunk = chunk.drop(['investor_category_code', 'company_permalink', 'investor_permalink'], axis=1)
    cols = list(dropped_chunk.columns)
    for col in cols:
        if col in col_types:
            col_types[col].append(str(dropped_chunk.dtypes[col]))
        else:
            col_types[col] = [str(dropped_chunk.dtypes[col])]
dedup_dictionary(col_types)
{'company_category_code': ['object'],
 'company_city': ['object'],
 'company_country_code': ['object'],
 'company_name': ['object'],
 'company_region': ['object'],
 'company_state_code': ['object'],
 'funded_at': ['object'],
 'funded_month': ['object'],
 'funded_quarter': ['object'],
 'funded_year': ['int64', 'float64'],
 'funding_round_type': ['object'],
 'investor_city': ['object', 'float64'],
 'investor_country_code': ['object', 'float64'],
 'investor_name': ['object'],
 'investor_region': ['object'],
 'investor_state_code': ['object', 'float64'],
 'raised_amount_usd': ['float64']}
Let's identify the numeric columns that we could represent with more space-efficient types:
import numpy as np
list(dropped_chunk.select_dtypes(include=[np.number]).columns)
['investor_country_code', 'investor_state_code', 'investor_city', 'funded_year', 'raised_amount_usd']
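One way to shrink these numeric columns is `pd.to_numeric` with its `downcast` parameter, which picks the smallest dtype that can hold the values. A minimal sketch on a toy frame (the column names match the dataset, but the values here are invented for illustration):

```python
import pandas as pd

# Toy frame standing in for two of the numeric columns identified above;
# the values are illustrative, not taken from the dataset.
df = pd.DataFrame({
    'funded_year': [2012, 2011, 2013],
    'raised_amount_usd': [2000000.0, 20000.0, None],
})

before = df.memory_usage(deep=True).sum()

# downcast='integer'/'float' selects the smallest dtype that fits the values
df['funded_year'] = pd.to_numeric(df['funded_year'], downcast='integer')
df['raised_amount_usd'] = pd.to_numeric(df['raised_amount_usd'], downcast='float')

after = df.memory_usage(deep=True).sum()
print(df.dtypes)   # funded_year -> int16, raised_amount_usd -> float32
```

Note that a column containing NaN (like raised_amount_usd) cannot be downcast to an integer type, so float32 is as small as it gets.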
For the text columns:
- Analyze the unique value counts across all of the chunks to see if we can convert them to a numeric type.
- Once grouped by their corresponding indexes, use the Series.sum() method to return the final unique value counts as another dictionary with columns as keys.
- Display the columns and value counts from this dictionary.
crunch_chunks = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
vc_dict = {}
obj_cols = list(dropped_chunk.select_dtypes(include=['object']).columns)
print(obj_cols)
for chunk in crunch_chunks:
    text_chunk = chunk.select_dtypes(include=['object'])
    for col in obj_cols:
        tchunk_vc = text_chunk[col].value_counts()
        if col in vc_dict:
            vc_dict[col].append(tchunk_vc)
        else:
            vc_dict[col] = [tchunk_vc]
final_dict = {}
for col in vc_dict:
    combined_vc = pd.concat(vc_dict[col])
    final_vc = combined_vc.groupby(combined_vc.index).sum()
    final_dict[col] = final_vc
    print(col, len(final_vc))
['company_name', 'company_category_code', 'company_country_code', 'company_state_code', 'company_region', 'company_city', 'investor_name', 'investor_region', 'funding_round_type', 'funded_at', 'funded_month', 'funded_quarter']
company_region 546
funded_month 192
company_name 11573
funding_round_type 9
funded_quarter 72
company_city 1229
investor_name 10465
company_category_code 43
investor_region 585
company_country_code 2
company_state_code 50
funded_at 2808
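Text columns with few unique values relative to the row count (funding_round_type has 9, company_country_code just 2) are good candidates for the pandas category dtype, which stores small integer codes plus one copy of each unique string. A minimal sketch on a synthetic column (the values are assumed for illustration):

```python
import pandas as pd

# Synthetic low-cardinality column mimicking funding_round_type:
# a handful of unique strings repeated over many rows.
s = pd.Series(['angel', 'series-a', 'other', 'angel'] * 5000)

before = s.memory_usage(deep=True)
cat = s.astype('category')   # integer codes + one copy of each unique string
after = cat.memory_usage(deep=True)

print(before, after)  # the categorical column is several times smaller
```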
Next, let's load the dataset into a SQLite database in chunks, then use the query PRAGMA table_info(table_name) to return information on the table's column data types.
import sqlite3
conn = sqlite3.connect('crunchbase.db')
crunch_iter = pd.read_csv(csvfile, encoding='ISO-8859-1', chunksize=5000)
for chunk in crunch_iter:
    chunk.to_sql('ventures', conn, if_exists='append', index=False)
results_df = pd.read_sql('PRAGMA table_info("ventures")', conn)
print(results_df)
    cid  name                    type     notnull  dflt_value  pk
0   0    company_permalink       TEXT     0        None        0
1   1    company_name            TEXT     0        None        0
2   2    company_category_code   TEXT     0        None        0
3   3    company_country_code    TEXT     0        None        0
4   4    company_state_code      TEXT     0        None        0
5   5    company_region          TEXT     0        None        0
6   6    company_city            TEXT     0        None        0
7   7    investor_permalink      TEXT     0        None        0
8   8    investor_name           TEXT     0        None        0
9   9    investor_category_code  TEXT     0        None        0
10  10   investor_country_code   TEXT     0        None        0
11  11   investor_state_code     TEXT     0        None        0
12  12   investor_region         TEXT     0        None        0
13  13   investor_city           TEXT     0        None        0
14  14   funding_round_type      TEXT     0        None        0
15  15   funded_at               TEXT     0        None        0
16  16   funded_month            TEXT     0        None        0
17  17   funded_quarter          TEXT     0        None        0
18  18   funded_year             INTEGER  0        None        0
19  19   raised_amount_usd       REAL     0        None        0
The SQLite column types match the pandas dtypes from the previous steps:
- funded_year was int64 and becomes INTEGER
- raised_amount_usd was float64 and becomes REAL
- columns typed object (or mixed object/float64) become TEXT
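With the data in SQLite, aggregations can be pushed into SQL so that only a small result set ever reaches pandas. A sketch against an in-memory stand-in for the ventures table (the column names match the schema, but the rows here are invented for illustration):

```python
import sqlite3
import pandas as pd

# In-memory stand-in for crunchbase.db; column names match the real table,
# but these rows are invented for illustration.
conn = sqlite3.connect(':memory:')
pd.DataFrame({
    'funding_round_type': ['angel', 'series-a', 'angel'],
    'raised_amount_usd': [20000.0, 2000000.0, 500000.0],
}).to_sql('ventures', conn, index=False)

# The GROUP BY runs inside SQLite; pandas only receives the aggregate result.
query = '''
    SELECT funding_round_type,
           SUM(raised_amount_usd) AS total_raised
    FROM ventures
    GROUP BY funding_round_type
    ORDER BY total_raised DESC
'''
result = pd.read_sql(query, conn)
print(result)
```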
Finally, let's find the size of the crunchbase.db database file:
!ls -l crunchbase.db
-rw-r--r-- 1 dq root 32329728 Jun 24 21:19 crunchbase.db
!wc -c crunchbase.db
32329728 crunchbase.db
The database file is about 32 MB on disk.