import numpy as np
import pandas as pd
import nbconvert
import warnings
warnings.filterwarnings("ignore")
Situation: Let's say you have an Age column with a minimum value of 1 and a maximum value of 150, in a dataframe with 10 million rows.
Task: Reduce the memory usage of the Age column given the above constraints.
Action: Change the dtype from int32 to uint8.
Result: Memory usage drops from 38.1 MB to 9.5 MB, i.e. a 75% reduction.
## Initializing minimum and maximum value of age
min_age_value, max_age_value = 1, 150
## Number of rows in dataframe
nrows = int(np.power(10,7))
## creation of Age dataframe
df_age = pd.DataFrame({'Age':np.random.randint(low=min_age_value,high=max_age_value+1,size=nrows)})
## check memory usage before action
df_age.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 1 columns):
Age    int32
dtypes: int32(1)
memory usage: 38.1 MB
## Range of "uint8"; satisfies range constraint of Age column
np.iinfo('uint8')
iinfo(min=0, max=255, dtype=uint8)
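As a general pattern, np.iinfo can drive the choice automatically: walk the unsigned dtypes from narrowest to widest and take the first whose range covers the data. A minimal sketch (the helper name is my own, not part of the notebook):

```python
import numpy as np

def smallest_uint_dtype(min_value, max_value):
    """Pick the smallest unsigned integer dtype covering [min_value, max_value]."""
    if min_value < 0:
        raise ValueError("unsigned dtypes cannot hold negative values")
    for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
        if max_value <= np.iinfo(dtype).max:
            return dtype
    raise ValueError("range exceeds uint64")

# Age in [1, 150] fits in uint8 (0..255)
print(smallest_uint_dtype(1, 150))
```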
## Action: conversion of dtype from "int32" to "uint8"
converted_df_age = df_age.astype(np.uint8)
## check memory usage after action
converted_df_age.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 1 columns):
Age    uint8
dtypes: uint8(1)
memory usage: 9.5 MB
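pandas can also pick the compact dtype for you: pd.to_numeric with downcast='unsigned' inspects the values and returns the smallest unsigned dtype that fits. A short sketch (this built-in is an alternative to the manual astype above, not what the notebook uses):

```python
import numpy as np
import pandas as pd

ages = pd.Series(np.random.randint(1, 151, size=10_000), name="Age")
# downcast='unsigned' picks the smallest unsigned dtype that fits the values
compact = pd.to_numeric(ages, downcast="unsigned")
print(ages.dtype, "->", compact.dtype)
```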
Situation: Let's say you have 50,000 search queries and 5,000 documents, and you have computed the cosine similarity of every search query against every document, i.e. a 50,000 x 5,000 matrix. All similarity values are between 0 and 1 and need at least 2 decimal places of precision.
Task: Reduce the memory usage of the cosine similarity dataframe given the above constraints.
Action: Change the dtype from float64 to float16.
Result: Memory usage drops from 1.9 GB to 476.8 MB (0.46 GB), i.e. a 75% reduction.
## no. of documents
ncols = int(5*np.power(10,3))
## no. of search queries
nrows = int(5*np.power(10,4))
## creation of cosine similarity dataframe
df_query_doc = pd.DataFrame(np.random.rand(nrows, ncols))
print("No. of search queries: {} and No. of documents: {}".format(df_query_doc.shape[0],df_query_doc.shape[1]))
No. of search queries: 50000 and No. of documents: 5000
## check memory usage before action
df_query_doc.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 5000 entries, 0 to 4999
dtypes: float64(5000)
memory usage: 1.9 GB
## Action: conversion of dtype from "float64" to "float16"
converted_df_query_doc = df_query_doc.astype('float16')
## check memory usage after action
converted_df_query_doc.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 5000 entries, 0 to 4999
dtypes: float16(5000)
memory usage: 476.8 MB
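Why is float16 safe here? It carries roughly 3 decimal digits of precision, and for values in [0, 1) the gap between adjacent representable float16 numbers is at most 2**-11, so rounding perturbs a similarity score by well under 0.005 and the 2-decimal constraint survives. A quick empirical check on random data:

```python
import numpy as np

# for values in [0, 1) the float16 spacing is at most 2**-11 (~0.00049),
# so the worst-case rounding error is below half of that spacing
scores = np.random.rand(100_000)
converted = scores.astype(np.float16)
max_error = np.abs(scores - converted.astype(np.float64)).max()
print(max_error)
```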
Situation: Let's say you have a Day of Week column with 7 unique values, in a dataframe with 49 million rows.
Task: Reduce the memory usage of the Day of Week column given that only 7 unique values exist.
Action: Change the dtype from object to category, since the ratio of unique values to number of rows is almost zero.
Result: Memory usage drops from 2.9 GB to 46.7 MB (0.045 GB), i.e. a 98% reduction.
## unique values of "days of week"
day_of_week = ["monday","tuesday","wednesday","thursday","friday","saturday","sunday"]
## Number of times day_of_week repeats
repeat_times = 7*np.power(10,6)
## creation of days of week dataframe
df_day_of_week = pd.DataFrame({'day_of_week':np.repeat(a=day_of_week,repeats = repeat_times)})
print("No of rows in days of week dataframe {}".format(df_day_of_week.shape[0]))
No of rows in days of week dataframe 49000000
## check memory usage before action
df_day_of_week.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49000000 entries, 0 to 48999999
Data columns (total 1 columns):
day_of_week    object
dtypes: object(1)
memory usage: 2.9 GB
## Action: conversion of dtype from "object" to "category"
converted_df_day_of_week = df_day_of_week.astype('category')
## check memory usage after action
converted_df_day_of_week.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49000000 entries, 0 to 48999999
Data columns (total 1 columns):
day_of_week    category
dtypes: category(1)
memory usage: 46.7 MB
## check first two rows of dataframe
converted_df_day_of_week.head(2)
|   | day_of_week |
|---|---|
| 0 | monday |
| 1 | monday |
## check how mapping of day_of_week is created in category dtype
converted_df_day_of_week.head(2)['day_of_week'].cat.codes
0    1
1    1
dtype: int8
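The mapping lives in .cat.categories: codes are positions into that index, and with astype('category') the categories are sorted lexicographically, which is why "monday" gets code 1 (it comes after "friday"). If you want the codes to follow calendar order instead, you can declare the order with CategoricalDtype; a small sketch:

```python
import pandas as pd

# explicit calendar order instead of the default lexicographic order
ordered_days = pd.CategoricalDtype(
    categories=["monday", "tuesday", "wednesday", "thursday",
                "friday", "saturday", "sunday"],
    ordered=True,
)
s = pd.Series(["monday", "sunday", "monday"]).astype(ordered_days)
# codes now follow the declared order
print(list(s.cat.codes))  # [0, 6, 0]
```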
Situation: Let's say you have a dataframe in which a large share of the values (66%) are zero or missing, as often happens in NLP tasks like Count/TF-IDF encoding and in Recommender Systems [2].
Task: Reduce the memory usage of the dataframe.
Action: Change the DataFrame type to SparseDataFrame, since the percentage of non-zero, non-NaN values is very small.
Result: Memory usage drops from 228.9 MB to 152.6 MB, i.e. a 33% reduction.
## number of rows in dataframe
nrows = np.power(10,7)
## creation of dataframe
df_dense = pd.DataFrame([[0, 0.23, np.nan]]*nrows)
## check memory usage before action
df_dense.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 3 columns):
0    int64
1    float64
2    float64
dtypes: float64(2), int64(1)
memory usage: 228.9 MB
## Percentage of non-zero, non-NaN values in dataframe
## (np.count_nonzero treats NaN as non-zero, so subtract the NaN count)
non_zero_non_nan = np.count_nonzero(df_dense) - df_dense.isnull().sum().sum()
non_zero_non_nan_percentage = round((non_zero_non_nan/df_dense.size)*100,2)
print("Percentage of Non-Zero Non-NaN values in dataframe {} %".format(non_zero_non_nan_percentage))
Percentage of Non-Zero Non-NaN values in dataframe 33.33 %
## Action: Change of DataFrame type to SparseDataFrame
df_sparse = df_dense.to_sparse()
## check memory usage after action
df_sparse.info(memory_usage='deep')
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 3 columns):
0    Sparse[int64, nan]
1    Sparse[float64, nan]
2    Sparse[float64, nan]
dtypes: Sparse[float64, nan](2), Sparse[int64, nan](1)
memory usage: 152.6 MB
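Note that to_sparse and SparseDataFrame were removed in pandas 1.0; in current pandas the same idea is expressed per column with SparseDtype, which stores only the entries that differ from a chosen fill_value. A minimal sketch under those assumptions (smaller row count and synthetic ~90%-zero data, not the notebook's exact frame):

```python
import numpy as np
import pandas as pd

# mostly-zero column: roughly 90% of the entries are set to 0.0
rng = np.random.default_rng(0)
values = rng.random(100_000)
values[rng.random(100_000) < 0.9] = 0.0

dense = pd.Series(values)
# keep only the entries that differ from fill_value (0.0 here)
sparse = dense.astype(pd.SparseDtype(np.float64, fill_value=0.0))

print("density:", round(sparse.sparse.density, 3))
print("dense bytes:", dense.memory_usage(deep=True))
print("sparse bytes:", sparse.memory_usage(deep=True))
```

The sparser the data, the bigger the saving; at 33% non-fill values (as in the example above) the gain is modest, while at 10% it is substantial.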