To install the most recent stable release of Modin, run the following in a notebook cell (drop the leading `!` to run it on your command line):
!pip install "modin[all]"
For further instructions on how to install Modin with conda or for specific platforms or engines, see our detailed installation guide.
Modin acts as a drop-in replacement for pandas, so you can speed up your pandas workflows by changing a single import line. To use Modin, replace the import of pandas with the import of Modin, as follows.
import modin.pandas as pd
import pandas
#############################################
### For the purpose of timing comparisons ###
#############################################
import time
import ray
ray.init()
from IPython.display import Markdown, display
def printmd(string):
    # Render a Markdown string as rich output in the notebook
    display(Markdown(string))
2022-01-07 07:29:30,173 INFO services.py:1250 -- View the Ray dashboard at http://127.0.0.1:8265
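The cell above starts Ray explicitly with `ray.init()`. As a minimal sketch of the alternative described in Modin's documentation, the engine can also be chosen through the `MODIN_ENGINE` environment variable, set before Modin is imported. The value `"ray"` matches the engine used in this notebook; nothing here imports Modin itself.

```python
import os

# Set the engine before the first `import modin.pandas` so Modin picks it up.
# "ray" mirrors this notebook; "dask" is another value listed in Modin's docs.
os.environ["MODIN_ENGINE"] = "ray"

selected_engine = os.environ["MODIN_ENGINE"]
print("Requested Modin engine:", selected_engine)
```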
Link to raw dataset: https://modin-datasets.s3.amazonaws.com/testing/yellow_tripdata_2015-01.csv (Size: ~200MB)
# This may take a few minutes to download
import urllib.request
s3_path = "https://modin-datasets.s3.amazonaws.com/testing/yellow_tripdata_2015-01.csv"
urllib.request.urlretrieve(s3_path, "taxi.csv")
('taxi.csv', <http.client.HTTPMessage at 0x1307faf70>)
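Because the download takes a few minutes, it can help to skip it when the file is already present. The following is a small sketch, not part of the original notebook: `fetch_once` is a hypothetical helper name, and its `retrieve` parameter exists only so the download function can be swapped out.

```python
import os
import urllib.request


def fetch_once(url, path, retrieve=urllib.request.urlretrieve):
    """Download `url` to `path` unless `path` already exists.

    Returns True if a download was performed, False if it was skipped.
    `fetch_once` is a hypothetical helper, not a Modin or pandas function.
    """
    if os.path.exists(path):
        return False
    retrieve(url, path)
    return True
```

In this notebook it would be called as `fetch_once(s3_path, "taxi.csv")`. Note that the check only looks at whether the file exists, so a partially downloaded file would have to be deleted by hand.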
read_csv
start = time.time()
pandas_df = pandas.read_csv("taxi.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)
end = time.time()
pandas_duration = end - start
print("Time to read with pandas: {} seconds".format(round(pandas_duration, 3)))
DtypeWarning: Columns (6) have mixed types.Specify dtype option on import or set low_memory=False.
Time to read with pandas: 2.744 seconds
start = time.time()
modin_df = pd.read_csv("taxi.csv", parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"], quoting=3)
end = time.time()
modin_duration = end - start
print("Time to read with Modin: {} seconds".format(round(modin_duration, 3)))
printmd("## Modin is {}x faster than pandas at `read_csv`!".format(round(pandas_duration / modin_duration, 2)))
Time to read with Modin: 1.35 seconds
UserWarning: `read_*` implementation has mismatches with pandas: Data types of partitions are different! Please refer to the troubleshooting section of the Modin documentation to fix this issue.
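The start/end timing pattern in these cells repeats for every operation. One possible way to factor it out, shown here only as a sketch, is a small context manager; `timed` and `durations` are hypothetical names, and a trivial built-in computation stands in for the dataframe call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


durations = {}
with timed("sum", durations):
    # Stand-in workload; in the notebook this would be a read_csv call.
    total = sum(range(100_000))

print("Time to sum: {} seconds".format(round(durations["sum"], 3)))
```

With both a pandas and a Modin timing stored in `durations`, the speedup is the ratio of the two entries, as in the `printmd` call above.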
You can quickly check that the results from pandas and Modin are exactly the same.
pandas_df
| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 2021-01-01 00:30:10 | 2021-01-01 00:36:12 | 1.0 | 2.10 | 1.0 | N | 142 | 43 | 2.0 | 8.00 | 3.00 | 0.5 | 0.00 | 0.0 | 0.3 | 11.80 | 2.5 |
1 | 1.0 | 2021-01-01 00:51:20 | 2021-01-01 00:52:19 | 1.0 | 0.20 | 1.0 | N | 238 | 151 | 2.0 | 3.00 | 0.50 | 0.5 | 0.00 | 0.0 | 0.3 | 4.30 | 0.0 |
2 | 1.0 | 2021-01-01 00:43:30 | 2021-01-01 01:11:06 | 1.0 | 14.70 | 1.0 | N | 132 | 165 | 1.0 | 42.00 | 0.50 | 0.5 | 8.65 | 0.0 | 0.3 | 51.95 | 0.0 |
3 | 1.0 | 2021-01-01 00:15:48 | 2021-01-01 00:31:01 | 0.0 | 10.60 | 1.0 | N | 138 | 132 | 1.0 | 29.00 | 0.50 | 0.5 | 6.05 | 0.0 | 0.3 | 36.35 | 0.0 |
4 | 2.0 | 2021-01-01 00:31:49 | 2021-01-01 00:48:21 | 1.0 | 4.94 | 1.0 | N | 68 | 33 | 1.0 | 16.50 | 0.50 | 0.5 | 4.06 | 0.0 | 0.3 | 24.36 | 2.5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1369760 | NaN | 2021-01-25 08:32:04 | 2021-01-25 08:49:32 | NaN | 8.80 | NaN | NaN | 135 | 82 | NaN | 21.84 | 2.75 | 0.5 | 0.00 | 0.0 | 0.3 | 25.39 | 0.0 |
1369761 | NaN | 2021-01-25 08:34:00 | 2021-01-25 09:04:00 | NaN | 5.86 | NaN | NaN | 42 | 161 | NaN | 26.67 | 2.75 | 0.5 | 0.00 | 0.0 | 0.3 | 30.22 | 0.0 |
1369762 | NaN | 2021-01-25 08:37:00 | 2021-01-25 08:53:00 | NaN | 4.45 | NaN | NaN | 14 | 106 | NaN | 25.29 | 2.75 | 0.5 | 0.00 | 0.0 | 0.3 | 28.84 | 0.0 |
1369763 | NaN | 2021-01-25 08:28:00 | 2021-01-25 08:50:00 | NaN | 10.04 | NaN | NaN | 175 | 216 | NaN | 28.24 | 2.75 | 0.5 | 0.00 | 0.0 | 0.3 | 31.79 | 0.0 |
1369764 | NaN | 2021-01-25 08:38:00 | 2021-01-25 08:50:00 | NaN | 4.93 | NaN | NaN | 248 | 168 | NaN | 20.76 | 2.75 | 0.5 | 0.00 | 0.0 | 0.3 | 24.31 | 0.0 |
1369765 rows × 18 columns
modin_df
| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 2021-01-01 00:30:10 | 2021-01-01 00:36:12 | 1.0 | 2.10 | 1.0 | N | 142 | 43 | 2.0 | 8.00 | 3.00 | 0.5 | 0.00 | 0.0 | 0.3 | 11.80 | 2.5 |
1 | 1.0 | 2021-01-01 00:51:20 | 2021-01-01 00:52:19 | 1.0 | 0.20 | 1.0 | N | 238 | 151 | 2.0 | 3.00 | 0.50 | 0.5 | 0.00 | 0.0 | 0.3 | 4.30 | 0.0 |
2 | 1.0 | 2021-01-01 00:43:30 | 2021-01-01 01:11:06 | 1.0 | 14.70 | 1.0 | N | 132 | 165 | 1.0 | 42.00 | 0.50 | 0.5 | 8.65 | 0.0 | 0.3 | 51.95 | 0.0 |
3 | 1.0 | 2021-01-01 00:15:48 | 2021-01-01 00:31:01 | 0.0 | 10.60 | 1.0 | N | 138 | 132 | 1.0 | 29.00 | 0.50 | 0.5 | 6.05 | 0.0 | 0.3 | 36.35 | 0.0 |
4 | 2.0 | 2021-01-01 00:31:49 | 2021-01-01 00:48:21 | 1.0 | 4.94 | 1.0 | N | 68 | 33 | 1.0 | 16.50 | 0.50 | 0.5 | 4.06 | 0.0 | 0.3 | 24.36 | 2.5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1369760 | NaN | 2021-01-25 08:32:04 | 2021-01-25 08:49:32 | NaN | 8.80 | NaN | NaN | 135 | 82 | NaN | 21.84 | 2.75 | 0.5 | 0.00 | 0.0 | 0.3 | 25.39 | 0.0 |
1369761 | NaN | 2021-01-25 08:34:00 | 2021-01-25 09:04:00 | NaN | 5.86 | NaN | NaN | 42 | 161 | NaN | 26.67 | 2.75 | 0.5 | 0.00 | 0.0 | 0.3 | 30.22 | 0.0 |
1369762 | NaN | 2021-01-25 08:37:00 | 2021-01-25 08:53:00 | NaN | 4.45 | NaN | NaN | 14 | 106 | NaN | 25.29 | 2.75 | 0.5 | 0.00 | 0.0 | 0.3 | 28.84 | 0.0 |
1369763 | NaN | 2021-01-25 08:28:00 | 2021-01-25 08:50:00 | NaN | 10.04 | NaN | NaN | 175 | 216 | NaN | 28.24 | 2.75 | 0.5 | 0.00 | 0.0 | 0.3 | 31.79 | 0.0 |
1369764 | NaN | 2021-01-25 08:38:00 | 2021-01-25 08:50:00 | NaN | 4.93 | NaN | NaN | 248 | 168 | NaN | 20.76 | 2.75 | 0.5 | 0.00 | 0.0 | 0.3 | 24.31 | 0.0 |
1369765 rows x 18 columns
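Comparing the two displays by eye only covers the rows that are shown. A programmatic check is sketched below using `pandas.testing.assert_frame_equal` on two small stand-in pandas frames. This is an illustration, not the notebook's own method: `frames_match` is a hypothetical helper, and a Modin dataframe would first have to be converted to pandas (Modin's documentation mentions `_to_pandas()` for this, which is treated as an assumption here and not exercised).

```python
import pandas
from pandas.testing import assert_frame_equal


def frames_match(left, right):
    """Return True if two pandas DataFrames agree in values, dtypes, and labels."""
    try:
        assert_frame_equal(left, right)
    except AssertionError:
        return False
    return True


# Small stand-in frames using two column names from the taxi dataset.
reference = pandas.DataFrame({"trip_distance": [2.10, 0.20], "PULocationID": [142, 238]})
identical = reference.copy()
altered = reference.assign(trip_distance=[2.10, 0.25])

same = frames_match(reference, identical)
different = frames_match(reference, altered)
print(same, different)
```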
concat
Our previous read_csv example operated on a relatively small dataframe. In the following example, we duplicate the same taxi dataset 100 times and then concatenate the copies together.
Please note that this quickstart notebook assumes a machine with enough memory to perform the operations with both pandas and Modin in a single pipeline, which at least doubles the amount of required memory. If your machine doesn't have enough resources to execute every cell of the notebook and you see an out-of-memory (OOM) error, you most likely need to reduce N_copies in the cell below.
N_copies = 100
start = time.time()
big_pandas_df = pandas.concat([pandas_df for _ in range(N_copies)])
end = time.time()
pandas_duration = end - start
print("Time to concat with pandas: {} seconds".format(round(pandas_duration, 3)))
Time to concat with pandas: 34.144 seconds
start = time.time()
big_modin_df = pd.concat([modin_df for _ in range(N_copies)])
end = time.time()
modin_duration = end - start
print("Time to concat with Modin: {} seconds".format(round(modin_duration, 3)))
printmd("### Modin is {}x faster than pandas at `concat`!".format(round(pandas_duration / modin_duration, 2)))
Time to concat with Modin: 0.564 seconds
The resulting dataset is around 19GB in size.
big_modin_df.info()
(apply_list_of_funcs pid=73415)
(apply_list_of_funcs pid=73416)
<class 'modin.pandas.dataframe.DataFrame'>
Int64Index: 136976500 entries, 0 to 1369764
Data columns (total 18 columns):
 #   Column                 Non-Null Count      Dtype
---  ---------------------  ------------------  -----
 0   VendorID               127141300 non-null  float64
 1   tpep_pickup_datetime   136976500 non-null  datetime64[ns]
 2   tpep_dropoff_datetime  136976500 non-null  datetime64[ns]
 3   passenger_count        127141300 non-null  float64
 4   trip_distance          136976500 non-null  float64
 5   RatecodeID             127141300 non-null  float64
 6   store_and_fwd_flag     127141300 non-null  object
 7   PULocationID           136976500 non-null  int64
 8   DOLocationID           136976500 non-null  int64
 9   payment_type           127141300 non-null  float64
 10  fare_amount            136976500 non-null  float64
 11  extra                  136976500 non-null  float64
 12  mta_tax                136976500 non-null  float64
 13  tip_amount             136976500 non-null  float64
 14  tolls_amount           136976500 non-null  float64
 15  improvement_surcharge  136976500 non-null  float64
 16  total_amount           136976500 non-null  float64
 17  congestion_surcharge   136976500 non-null  float64
dtypes: float64(13), datetime64[ns](2), int64(2), object(1)
memory usage: 19.4 GB
UserWarning: Distributing <class 'int'> object. This may take some time.
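The reported memory usage can be roughly reproduced with back-of-the-envelope arithmetic, which is also a quick way to judge whether a given N_copies will fit in memory. The sketch below assumes 8 bytes per value for each of the 18 columns plus 8 bytes per row for the index, and that the reported figure counts object columns by pointer size only; `shallow_frame_bytes` is a hypothetical helper.

```python
def shallow_frame_bytes(n_rows, n_columns, bytes_per_value=8, index_bytes=8):
    """Estimate dataframe size assuming fixed-width values and an 8-byte index."""
    return n_rows * (n_columns * bytes_per_value + index_bytes)


rows_per_copy = 1_369_765  # rows in the original taxi dataframe
n_copies = 100             # same value as N_copies in the notebook

total_bytes = shallow_frame_bytes(rows_per_copy * n_copies, n_columns=18)
total_gib = total_bytes / 2**30
print("Estimated size: {:.1f} GiB".format(total_gib))
```

Under these assumptions the estimate comes out at about 19.4 GiB, in line with the `info()` output above. Since the notebook holds both a pandas and a Modin copy, the peak requirement is at least twice that.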
apply over a single column
The performance benefits of Modin become apparent when we operate on large, gigabyte-scale datasets. For example, let's say that we want to round the values in a single column to the nearest integer via the apply operation.
start = time.time()
rounded_trip_distance_pandas = big_pandas_df["trip_distance"].apply(round)
end = time.time()
pandas_duration = end - start
print("Time to apply with pandas: {} seconds".format(round(pandas_duration, 3)))
Time to apply with pandas: 43.969 seconds
start = time.time()
rounded_trip_distance_modin = big_modin_df["trip_distance"].apply(round)
end = time.time()
modin_duration = end - start
print("Time to apply with Modin: {} seconds".format(round(modin_duration, 3)))
printmd("### Modin is {}x faster than pandas at `apply` on one column!".format(round(pandas_duration / modin_duration, 2)))
Time to apply with Modin: 1.225 seconds
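The cells above pass Python's built-in `round` to `apply`. As a small aside on what that function does per value: `round` returns the nearest integer and resolves exact ties toward the even neighbour, so it does not always round up, whereas `math.ceil` does. The sketch below uses a plain list as a stand-in for the column.

```python
import math

# Stand-in values; the first three are trip distances from the table above.
values = [2.10, 0.20, 14.70, 0.5, 2.5]

nearest = [round(v) for v in values]         # ties go to the even integer
rounded_up = [math.ceil(v) for v in values]  # always rounds toward +infinity

print(nearest)     # [2, 0, 15, 0, 2]
print(rounded_up)  # [3, 1, 15, 1, 3]
```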
Hopefully, this tutorial demonstrated how Modin delivers significant speedups on pandas operations without any extra effort. Throughout this example, we moved from working with hundreds of megabytes of data to about 20GB of data, all without having to change our code or manually optimize it to achieve the level of scalable performance that Modin provides.
Note that in this quickstart example, we've only shown read_csv, concat, and apply, but these are not the only pandas operations that Modin optimizes. In fact, Modin covers more than 90% of the pandas API, yielding considerable speedups for many common operations.