
Scale your pandas workflows by changing one line of code

Getting Started

To install the most recent stable release of Modin, run the following command. The leading `!` lets you run it from a notebook cell; omit it when typing directly on the command line:

In [ ]:

!pip install "modin[all]"

For further instructions on how to install Modin with conda or for specific platforms or engines, see our detailed installation guide.

Modin acts as a drop-in replacement for pandas, so to speed up your pandas workflows you only need to replace the pandas import with the Modin import, as follows. Plain pandas is also imported here, under its full name, so that the two libraries can be compared side by side.

In [1]:

import modin.pandas as pd
import pandas
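If the same notebook also has to run on a machine where Modin is not installed, one possible pattern is to fall back to plain pandas. This is only a sketch, not something this quickstart prescribes, and `backend_name` is a hypothetical variable used purely for illustration.

```python
# Prefer Modin when it is installed; otherwise fall back to plain pandas.
# `backend_name` is an illustrative variable, not part of either library.
try:
    import modin.pandas as pd
    backend_name = "modin"
except ImportError:
    import pandas as pd
    backend_name = "pandas"

print("Using {} as the dataframe backend".format(backend_name))
```

Because both modules expose the same names, the rest of the code can keep calling `pd.read_csv`, `pd.concat`, and so on.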

In [2]:

# Helpers used only for the timing comparisons in this notebook
import time

import ray
from IPython.display import Markdown, display

ray.init()  # start a local Ray runtime (see the log line below)


def printmd(string):
    """Render a Markdown string as rich output in the notebook."""
    display(Markdown(string))

2022-01-07 07:29:30,173	INFO services.py:1250 -- View the Ray dashboard at http://127.0.0.1:8265
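The timing cells in this notebook all repeat the same start/stop pattern. As a minimal sketch, that pattern could be wrapped in a small context manager. `timed` and `speedup` are hypothetical helper names, not part of Modin or pandas, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Time the enclosed block and store the duration (seconds) in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is, rounded to 2 decimals."""
    return round(baseline_seconds / candidate_seconds, 2)


durations = {}
with timed("sleep", durations):
    time.sleep(0.01)  # stand-in for the operation being measured

print("Measured {} seconds".format(round(durations["sleep"], 3)))

# Durations reported by the read_csv cells in this notebook
print(speedup(2.744, 1.35))  # 2.03
```

The ratio is the same `pandas_duration / modin_duration` expression that the notebook passes to `printmd`.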

Dataset: NYC taxi trip data

Link to raw dataset: https://modin-datasets.s3.amazonaws.com/testing/yellow_tripdata_2015-01.csv (Size: ~200MB)

In [3]:

# This may take a few minutes to download
import urllib.request
s3_path = "https://modin-datasets.s3.amazonaws.com/testing/yellow_tripdata_2015-01.csv"
urllib.request.urlretrieve(s3_path, "taxi.csv")  

Out[3]:

('taxi.csv', <http.client.HTTPMessage at 0x1307faf70>)
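Re-running the cell above downloads the file again. A minimal sketch that skips the download when the file is already present follows; `download_once` is a hypothetical helper, not part of Modin or the standard library.

```python
import os
import urllib.request


def download_once(url, dest):
    """Download url to dest unless dest already exists; return dest."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest
```

In this notebook it would be called as `download_once(s3_path, "taxi.csv")`. Note that the check only looks at whether the file exists, so a partially downloaded file would have to be deleted by hand.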

Faster Data Loading with Modin's `read_csv`

In [4]:

start = time.time()

# quoting=3 is csv.QUOTE_NONE: quote characters are treated as ordinary text
pandas_df = pandas.read_csv("taxi.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)

end = time.time()
pandas_duration = end - start
print("Time to read with pandas: {} seconds".format(round(pandas_duration, 3)))

DtypeWarning: Columns (6) have mixed types.Specify dtype option on import or set low_memory=False.

Time to read with pandas: 2.744 seconds

In [5]:

start = time.time()

modin_df = pd.read_csv("taxi.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)

end = time.time()
modin_duration = end - start
print("Time to read with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("## Modin is {}x faster than pandas at `read_csv`!".format(round(pandas_duration / modin_duration, 2)))

Time to read with Modin: 1.35 seconds

UserWarning: `read_*` implementation has mismatches with pandas:
Data types of partitions are different! Please refer to the troubleshooting section of the Modin documentation to fix this issue.

Modin is 2.03x faster than pandas at `read_csv`!

You can quickly check that pandas and Modin produce the same result by displaying both dataframes.

In [6]:

pandas_df

Out[6]:

	VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	RatecodeID	store_and_fwd_flag	PULocationID	DOLocationID	payment_type	fare_amount	extra	mta_tax	tip_amount	tolls_amount	improvement_surcharge	total_amount	congestion_surcharge
0	1.0	2021-01-01 00:30:10	2021-01-01 00:36:12	1.0	2.10	1.0	N	142	43	2.0	8.00	3.00	0.5	0.00	0.0	0.3	11.80	2.5
1	1.0	2021-01-01 00:51:20	2021-01-01 00:52:19	1.0	0.20	1.0	N	238	151	2.0	3.00	0.50	0.5	0.00	0.0	0.3	4.30	0.0
2	1.0	2021-01-01 00:43:30	2021-01-01 01:11:06	1.0	14.70	1.0	N	132	165	1.0	42.00	0.50	0.5	8.65	0.0	0.3	51.95	0.0
3	1.0	2021-01-01 00:15:48	2021-01-01 00:31:01	0.0	10.60	1.0	N	138	132	1.0	29.00	0.50	0.5	6.05	0.0	0.3	36.35	0.0
4	2.0	2021-01-01 00:31:49	2021-01-01 00:48:21	1.0	4.94	1.0	N	68	33	1.0	16.50	0.50	0.5	4.06	0.0	0.3	24.36	2.5
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1369760	NaN	2021-01-25 08:32:04	2021-01-25 08:49:32	NaN	8.80	NaN	NaN	135	82	NaN	21.84	2.75	0.5	0.00	0.0	0.3	25.39	0.0
1369761	NaN	2021-01-25 08:34:00	2021-01-25 09:04:00	NaN	5.86	NaN	NaN	42	161	NaN	26.67	2.75	0.5	0.00	0.0	0.3	30.22	0.0
1369762	NaN	2021-01-25 08:37:00	2021-01-25 08:53:00	NaN	4.45	NaN	NaN	14	106	NaN	25.29	2.75	0.5	0.00	0.0	0.3	28.84	0.0
1369763	NaN	2021-01-25 08:28:00	2021-01-25 08:50:00	NaN	10.04	NaN	NaN	175	216	NaN	28.24	2.75	0.5	0.00	0.0	0.3	31.79	0.0
1369764	NaN	2021-01-25 08:38:00	2021-01-25 08:50:00	NaN	4.93	NaN	NaN	248	168	NaN	20.76	2.75	0.5	0.00	0.0	0.3	24.31	0.0

1369765 rows × 18 columns

In [7]:

modin_df

Out[7]:

	VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	RatecodeID	store_and_fwd_flag	PULocationID	DOLocationID	payment_type	fare_amount	extra	mta_tax	tip_amount	tolls_amount	improvement_surcharge	total_amount	congestion_surcharge
0	1.0	2021-01-01 00:30:10	2021-01-01 00:36:12	1.0	2.10	1.0	N	142	43	2.0	8.00	3.00	0.5	0.00	0.0	0.3	11.80	2.5
1	1.0	2021-01-01 00:51:20	2021-01-01 00:52:19	1.0	0.20	1.0	N	238	151	2.0	3.00	0.50	0.5	0.00	0.0	0.3	4.30	0.0
2	1.0	2021-01-01 00:43:30	2021-01-01 01:11:06	1.0	14.70	1.0	N	132	165	1.0	42.00	0.50	0.5	8.65	0.0	0.3	51.95	0.0
3	1.0	2021-01-01 00:15:48	2021-01-01 00:31:01	0.0	10.60	1.0	N	138	132	1.0	29.00	0.50	0.5	6.05	0.0	0.3	36.35	0.0
4	2.0	2021-01-01 00:31:49	2021-01-01 00:48:21	1.0	4.94	1.0	N	68	33	1.0	16.50	0.50	0.5	4.06	0.0	0.3	24.36	2.5
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1369760	NaN	2021-01-25 08:32:04	2021-01-25 08:49:32	NaN	8.80	NaN	NaN	135	82	NaN	21.84	2.75	0.5	0.00	0.0	0.3	25.39	0.0
1369761	NaN	2021-01-25 08:34:00	2021-01-25 09:04:00	NaN	5.86	NaN	NaN	42	161	NaN	26.67	2.75	0.5	0.00	0.0	0.3	30.22	0.0
1369762	NaN	2021-01-25 08:37:00	2021-01-25 08:53:00	NaN	4.45	NaN	NaN	14	106	NaN	25.29	2.75	0.5	0.00	0.0	0.3	28.84	0.0
1369763	NaN	2021-01-25 08:28:00	2021-01-25 08:50:00	NaN	10.04	NaN	NaN	175	216	NaN	28.24	2.75	0.5	0.00	0.0	0.3	31.79	0.0
1369764	NaN	2021-01-25 08:38:00	2021-01-25 08:50:00	NaN	4.93	NaN	NaN	248	168	NaN	20.76	2.75	0.5	0.00	0.0	0.3	24.31	0.0

1369765 rows x 18 columns
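Comparing the two displays by eye only covers the first and last rows. A programmatic check is one option. The sketch below uses `pandas.testing.assert_frame_equal` on small stand-in frames. Applying it to `modin_df` would first require converting that object to a pandas dataframe; Modin has a private `_to_pandas()` method for this, but treat that as an assumption to verify against your Modin version. `frames_match` is a hypothetical helper name.

```python
import pandas


def frames_match(left, right):
    """Return True if two pandas dataframes have equal values, dtypes and labels."""
    try:
        pandas.testing.assert_frame_equal(left, right)
    except AssertionError:
        return False
    return True


# Small stand-in frames; the values are the first trips shown above.
a = pandas.DataFrame({"trip_distance": [2.10, 0.20, 14.70]})
b = pandas.DataFrame({"trip_distance": [2.10, 0.20, 14.70]})
c = pandas.DataFrame({"trip_distance": [2.10, 0.20, 99.00]})

print(frames_match(a, b))  # True
print(frames_match(a, c))  # False
```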

Faster Append with Modin's `concat`

Our previous `read_csv` example operated on a relatively small dataframe. In the following example, we concatenate 100 copies of the same taxi dataset into a single dataframe.

Please note that this quickstart notebook assumes a machine with enough memory to perform the operations with both pandas and Modin in a single pipeline, which at least doubles the amount of memory required. If your machine does not have enough resources to execute every cell and you hit an out-of-memory (OOM) error, you most likely need to reduce `N_copies` in the cell below.

In [8]:

N_copies = 100
start = time.time()

big_pandas_df = pandas.concat([pandas_df for _ in range(N_copies)])

end = time.time()
pandas_duration = end - start
print("Time to concat with pandas: {} seconds".format(round(pandas_duration, 3)))

Time to concat with pandas: 34.144 seconds

In [9]:

start = time.time()

big_modin_df = pd.concat([modin_df for _ in range(N_copies)])

end = time.time()
modin_duration = end - start
print("Time to concat with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("### Modin is {}x faster than pandas at `concat`!".format(round(pandas_duration / modin_duration, 2)))

Time to concat with Modin: 0.564 seconds

Modin is 60.57x faster than pandas at `concat`!

The resulting dataset is around 19 GB in size.

In [10]:

big_modin_df.info()

(apply_list_of_funcs pid=73415) 
(apply_list_of_funcs pid=73416) 
<class 'modin.pandas.dataframe.DataFrame'>
Int64Index: 136976500 entries, 0 to 1369764
Data columns (total 18 columns):
 #   Column                 Non-Null Count      Dtype         
---  ---------------------  ------------------  -----         
 0   VendorID               127141300 non-null  float64
 1   tpep_pickup_datetime   136976500 non-null  datetime64[ns]
 2   tpep_dropoff_datetime  136976500 non-null  datetime64[ns]
 3   passenger_count        127141300 non-null  float64
 4   trip_distance          136976500 non-null  float64
 5   RatecodeID             127141300 non-null  float64
 6   store_and_fwd_flag     127141300 non-null  object
 7   PULocationID           136976500 non-null  int64
 8   DOLocationID           136976500 non-null  int64
 9   payment_type           127141300 non-null  float64
 10  fare_amount            136976500 non-null  float64
 11  extra                  136976500 non-null  float64
 12  mta_tax                136976500 non-null  float64
 13  tip_amount             136976500 non-null  float64
 14  tolls_amount           136976500 non-null  float64
 15  improvement_surcharge  136976500 non-null  float64
 16  total_amount           136976500 non-null  float64
 17  congestion_surcharge   136976500 non-null  float64
dtypes: float64(13), datetime64[ns](2), int64(2), object(1)
memory usage: 19.4 GB

UserWarning: Distributing <class 'int'> object. This may take some time.
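The `info()` output above gives a way to size `N_copies` for a smaller machine: 100 copies occupy 19.4 GB, so each copy is roughly 0.194 GB, and the notebook holds both a pandas and a Modin version of the concatenated data at the same time. The sketch below turns that into arithmetic. `max_copies` and the 0.5 safety factor are illustrative assumptions, not Modin recommendations, and the estimate ignores temporary allocations made during `concat` and `apply`.

```python
def max_copies(available_gb, gb_per_copy=19.4 / 100, frameworks=2, safety=0.5):
    """Estimate the largest N_copies that fits in the given memory budget.

    gb_per_copy comes from the info() output (19.4 GB for 100 copies);
    frameworks=2 because pandas and Modin each hold a full copy;
    safety is an assumed fraction of memory to spend on the dataframes.
    """
    budget_gb = available_gb * safety
    return max(1, int(budget_gb // (gb_per_copy * frameworks)))


for ram_gb in (16, 32, 128):
    print(ram_gb, "GB RAM ->", max_copies(ram_gb), "copies")
```

Under these assumptions a 16 GB machine would get `N_copies = 20`, since 8 GB of budget divided by 0.388 GB per duplicated copy is about 20.6.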

Faster `apply` over a single column

The performance benefits of Modin become apparent when we operate on large, gigabyte-scale datasets. For example, let's say that we want to round the values in a single column to the nearest integer using the `apply` operation.

In [11]:

start = time.time()
rounded_trip_distance_pandas = big_pandas_df["trip_distance"].apply(round)

end = time.time()
pandas_duration = end - start
print("Time to apply with pandas: {} seconds".format(round(pandas_duration, 3)))

Time to apply with pandas: 43.969 seconds

In [12]:

start = time.time()

rounded_trip_distance_modin = big_modin_df["trip_distance"].apply(round)

end = time.time()
modin_duration = end - start
print("Time to apply with Modin: {} seconds".format(round(modin_duration, 3)))

printmd("### Modin is {}x faster than pandas at `apply` on one column!".format(round(pandas_duration / modin_duration, 2)))

Time to apply with Modin: 1.225 seconds

Modin is 35.88x faster than pandas at `apply` on one column!
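Note that the built-in `round` passed to `apply` rounds to the nearest integer, with ties going to the even neighbour; it does not round upward. The short sketch below contrasts it with `math.ceil` on the first five `trip_distance` values shown earlier. Using `math.ceil` is only an illustration of what rounding up would look like, not something this notebook does.

```python
import math

trip_distance = [2.10, 0.20, 14.70, 10.60, 4.94]  # first rows of the taxi data

nearest = [round(d) for d in trip_distance]
upward = [math.ceil(d) for d in trip_distance]

print(nearest)  # [2, 0, 15, 11, 5]
print(upward)   # [3, 1, 15, 11, 5]
print(round(2.5), round(3.5))  # 2 4 (ties go to the even integer)
```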

Summary

Hopefully, this tutorial demonstrated how Modin delivers significant speedups on pandas operations without any extra effort. Throughout the example, we moved from working with a few hundred megabytes of data to roughly 20 GB, all without having to change anything or manually optimize our code to achieve the level of scalable performance that Modin provides.

Note that in this quickstart example we have shown only `read_csv`, `concat`, and `apply`, but these are not the only pandas operations that Modin optimizes. In fact, Modin covers more than 90% of the pandas API, yielding considerable speedups for many common operations.
