import numpy as np
import pandas as pd
import nbconvert
import warnings
warnings.filterwarnings("ignore")
Situation: Let's say you have an Age column with a minimum value of 1 and a maximum value of 150, in a dataframe with 10 million rows.
Task: Reduce the memory usage of the Age column given the above constraints.
Action: Change the dtype from int32 to uint8.
Result: Memory usage drops from 38.1 MB to 9.5 MB, i.e. a 75% reduction.
## Initializing minimum and maximum value of age
min_age_value, max_age_value = 1, 150
## Number of rows in dataframe
nrows = int(np.power(10,7))
## creation of Age dataframe
df_age = pd.DataFrame({'Age':np.random.randint(low=min_age_value,high=max_age_value+1,size=nrows)})
## check memory usage before action
df_age.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 1 columns):
Age    int32
dtypes: int32(1)
memory usage: 38.1 MB
## Range of "uint8"; satisfies range constraint of Age column
np.iinfo('uint8')
iinfo(min=0, max=255, dtype=uint8)
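As a general pattern, np.iinfo can drive the choice automatically: walk the unsigned dtypes from narrowest to widest and take the first whose range covers the data. A minimal sketch (the helper name is my own, not part of the notebook):

```python
import numpy as np

def smallest_uint_dtype(min_value, max_value):
    """Pick the smallest unsigned integer dtype covering [min_value, max_value]."""
    if min_value < 0:
        raise ValueError("unsigned dtypes cannot hold negative values")
    for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
        if max_value <= np.iinfo(dtype).max:
            return dtype
    raise ValueError("range exceeds uint64")

# Age in [1, 150] fits in uint8 (0..255)
print(smallest_uint_dtype(1, 150))
```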
## Action: conversion of dtype from "int32" to "uint8"
converted_df_age = df_age.astype(np.uint8)
## check memory usage after action
converted_df_age.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 1 columns):
Age    uint8
dtypes: uint8(1)
memory usage: 9.5 MB
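pandas can also pick the compact dtype for you: pd.to_numeric with downcast='unsigned' inspects the values and returns the smallest unsigned dtype that fits. A short sketch (this built-in is an alternative to the manual astype above, not what the notebook uses):

```python
import numpy as np
import pandas as pd

ages = pd.Series(np.random.randint(1, 151, size=10_000), name="Age")
# downcast='unsigned' picks the smallest unsigned dtype that fits the values
compact = pd.to_numeric(ages, downcast="unsigned")
print(ages.dtype, "->", compact.dtype)
```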
Situation: Let's say you have 50,000 search queries and 5,000 documents, and you have computed the cosine similarity of every search query against every document, i.e. a 50,000 x 5,000 matrix. All similarity values are between 0 and 1 and need at least 2 decimal places of precision.
Task: Reduce the memory usage of the cosine similarity dataframe given the above constraints.
Action: Change the dtype from float64 to float16.
Result: Memory usage drops from 1.9 GB to 476.8 MB (0.46 GB), i.e. a 75% reduction.
## no. of documents
ncols = int(5*np.power(10,3))
## no. of search queries
nrows = int(5*np.power(10,4))
## creation of cosine similarity dataframe
df_query_doc = pd.DataFrame(np.random.rand(nrows, ncols))
print("No. of search queries: {} and No. of documents: {}".format(df_query_doc.shape[0],df_query_doc.shape[1]))
No. of search queries: 50000 and No. of documents: 5000
## check memory usage before action
df_query_doc.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 5000 entries, 0 to 4999
dtypes: float64(5000)
memory usage: 1.9 GB
## Action: conversion of dtype from "float64" to "float16"
converted_df_query_doc = df_query_doc.astype('float16')
## check memory usage after action
converted_df_query_doc.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 5000 entries, 0 to 4999
dtypes: float16(5000)
memory usage: 476.8 MB
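Why is float16 safe here? It carries roughly 3 decimal digits of precision, and for values in [0, 1) the gap between adjacent representable float16 numbers is at most 2**-11, so rounding perturbs a similarity score by well under 0.005 and the 2-decimal constraint survives. A quick empirical check on random data:

```python
import numpy as np

# for values in [0, 1) the float16 spacing is at most 2**-11 (~0.00049),
# so the worst-case rounding error is below half of that spacing
scores = np.random.rand(100_000)
converted = scores.astype(np.float16)
max_error = np.abs(scores - converted.astype(np.float64)).max()
print(max_error)
```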
Situation: Let's say you have a Day of Week column with 7 unique values, in a dataframe with 49 million rows.
Task: Reduce the memory usage of the Day of Week column given that only 7 unique values exist.
Action: Change the dtype from object to category, since the ratio of unique values to number of rows is almost zero.
Result: Memory usage drops from 2.9 GB to 46.7 MB (0.045 GB), i.e. a 98% reduction.
## unique values of "days of week"
day_of_week = ["monday","tuesday","wednesday","thursday","friday","saturday","sunday"]
## Number of times day_of_week repeats
repeat_times = 7*np.power(10,6)
## creation of days of week dataframe
df_day_of_week = pd.DataFrame({'day_of_week':np.repeat(a=day_of_week,repeats = repeat_times)})
print("No of rows in days of week dataframe {}".format(df_day_of_week.shape[0]))
No of rows in days of week dataframe 49000000
## check memory usage before action
df_day_of_week.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49000000 entries, 0 to 48999999
Data columns (total 1 columns):
day_of_week    object
dtypes: object(1)
memory usage: 2.9 GB
## Action: conversion of dtype from "object" to "category"
converted_df_day_of_week = df_day_of_week.astype('category')
## check memory usage after action
converted_df_day_of_week.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49000000 entries, 0 to 48999999
Data columns (total 1 columns):
day_of_week    category
dtypes: category(1)
memory usage: 46.7 MB
## check first two rows of dataframe
converted_df_day_of_week.head(2)
|   | day_of_week |
|---|---|
| 0 | monday |
| 1 | monday |
## check how mapping of day_of_week is created in category dtype
converted_df_day_of_week.head(2)['day_of_week'].cat.codes
0    1
1    1
dtype: int8
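The mapping lives in .cat.categories: codes are positions into that index, and with astype('category') the categories are sorted lexicographically, which is why "monday" gets code 1 (it comes after "friday"). If you want the codes to follow calendar order instead, you can declare the order with CategoricalDtype; a small sketch:

```python
import pandas as pd

# explicit calendar order instead of the default lexicographic order
ordered_days = pd.CategoricalDtype(
    categories=["monday", "tuesday", "wednesday", "thursday",
                "friday", "saturday", "sunday"],
    ordered=True,
)
s = pd.Series(["monday", "sunday", "monday"]).astype(ordered_days)
# codes now follow the declared order
print(list(s.cat.codes))  # [0, 6, 0]
```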
Situation: Let's say you have a dataframe in which a large share of the values (66%) are zero or missing, as often happens in NLP tasks like Count/TF-IDF encoding and in Recommender Systems [2].
Task: Reduce the memory usage of the dataframe.
Action: Change the DataFrame type to SparseDataFrame, since the percentage of non-zero, non-NaN values is very small.
Result: Memory usage drops from 228.9 MB to 152.6 MB, i.e. a 33% reduction.
## number of rows in dataframe
nrows = np.power(10,7)
## creation of dataframe
df_dense = pd.DataFrame([[0, 0.23, np.nan]]*nrows)
## check memory usage before action
df_dense.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 3 columns):
0    int64
1    float64
2    float64
dtypes: float64(2), int64(1)
memory usage: 228.9 MB
## Percentage of non-zero, non-NaN values in dataframe
## (np.count_nonzero treats NaN as non-zero, so subtract the NaN count)
non_zero_non_nan = np.count_nonzero(df_dense) - df_dense.isnull().sum().sum()
non_zero_non_nan_percentage = round((non_zero_non_nan/df_dense.size)*100,2)
print("Percentage of Non-Zero Non-NaN values in dataframe {} %".format(non_zero_non_nan_percentage))
Percentage of Non-Zero Non-NaN values in dataframe 33.33 %
## Action: Change of DataFrame type to SparseDataFrame
df_sparse = df_dense.to_sparse()
## check memory usage after action
df_sparse.info(memory_usage='deep')
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 3 columns):
0    Sparse[int64, nan]
1    Sparse[float64, nan]
2    Sparse[float64, nan]
dtypes: Sparse[float64, nan](2), Sparse[int64, nan](1)
memory usage: 152.6 MB
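Note that to_sparse and SparseDataFrame were removed in pandas 1.0; in current pandas the same idea is expressed per column with SparseDtype, which stores only the entries that differ from a chosen fill_value. A minimal sketch under those assumptions (smaller row count and synthetic ~90%-zero data, not the notebook's exact frame):

```python
import numpy as np
import pandas as pd

# mostly-zero column: roughly 90% of the entries are set to 0.0
rng = np.random.default_rng(0)
values = rng.random(100_000)
values[rng.random(100_000) < 0.9] = 0.0

dense = pd.Series(values)
# keep only the entries that differ from fill_value (0.0 here)
sparse = dense.astype(pd.SparseDtype(np.float64, fill_value=0.0))

print("density:", round(sparse.sparse.density, 3))
print("dense bytes:", dense.memory_usage(deep=True))
print("sparse bytes:", sparse.memory_usage(deep=True))
```

The sparser the data, the bigger the saving; at 33% non-fill values (as in the example above) the gain is modest, while at 10% it is substantial.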