In Google Colab you can easily run Optimus. If you are not there yet, you can open this notebook in Colab: https://colab.research.google.com/github/ironmussa/Optimus/blob/master/examples/10_min_from_spark_to_pandas_with_optimus.ipynb
Install Optimus and all its dependencies:
import sys
if 'google.colab' in sys.modules:
    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
    !wget -q https://archive.apache.org/dist/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz
    !tar xf spark-2.4.1-bin-hadoop2.7.tgz
    !pip install optimuspyspark
Before you continue, please go to the 'Runtime' menu above and select 'Restart runtime' (Ctrl + M + .).
if 'google.colab' in sys.modules:
    import os
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
    os.environ["SPARK_HOME"] = "/content/spark-2.4.1-bin-hadoop2.7"
To hack on Optimus we recommend cloning the repo and changing repo_path so that it points to the repo relative to this notebook.
repo_path=".."
# This will reload the change you make to Optimus in real time
%load_ext autoreload
%autoreload 2
import sys
sys.path.append(repo_path)
From the command line:
pip install optimuspyspark
From a notebook you can use:
!pip install optimuspyspark
from optimus import Optimus
op = Optimus(master="local")
Create a dataframe by passing a list of column names and a list of tuples with the row values. Unlike pandas, you need to specify the column names.
df = op.create.df(
[
"names",
"height(ft)",
"function",
"rank",
"weight(t)",
"japanese name",
"last position",
"attributes"
],
[
("Optim'us", 28.0, "Leader", 10, 4.3, ["Inochi", "Convoy"], "19.442735,-99.201111", [8.5344, 4300.0]),
("bumbl#ebéé ", 17.5, "Espionage", 7, 2.0, ["Bumble", "Goldback"], "10.642707,-71.612534", [5.334, 2000.0]),
("ironhide&", 26.0, "Security", 7, 4.0, ["Roadbuster"], "37.789563,-122.400356", [7.9248, 4000.0]),
("Jazz", 13.0, "First Lieutenant", 8, 1.8, ["Meister"], "33.670666,-117.841553", [3.9624, 1800.0]),
("Megatron", None, "None", None, 5.7, ["Megatron"], None, [None, 5700.0]),
("Metroplex_)^$", 300.0, "Battle Station", 8, None, ["Metroflex"], None, [91.44, None]),
]).h_repartition(1)
df.table()
names (string, nullable) | height(ft) (float, nullable) | function (string, nullable) | rank (int, nullable) | weight(t) (float, nullable) | japanese name (array<string>, nullable) | last position (string, nullable) | attributes (array<float>, nullable) |
---|---|---|---|---|---|---|---|
Optim'us | 28.0 | Leader | 10 | 4.300000190734863 | ['Inochi', 'Convoy'] | 19.442735,-99.201111 | [8.53439998626709, 4300.0] |
bumbl#ebéé⸱⸱ | 17.5 | Espionage | 7 | 2.0 | ['Bumble', 'Goldback'] | 10.642707,-71.612534 | [5.334000110626221, 2000.0] |
ironhide& | 26.0 | Security | 7 | 4.0 | ['Roadbuster'] | 37.789563,-122.400356 | [7.924799919128418, 4000.0] |
Jazz | 13.0 | First Lieutenant | 8 | 1.7999999523162842 | ['Meister'] | 33.670666,-117.841553 | [3.962399959564209, 1800.0] |
Megatron | None | None | None | 5.699999809265137 | ['Megatron'] | None | [None, 5700.0] |
Metroplex_)^$ | 300.0 | Battle Station | 8 | None | ['Metroflex'] | None | [91.44000244140625, None] |
Creating a dataframe by passing a list of tuples specifying each column's data type. The data type can be a string or a Spark DataType: https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/sql/types/package-summary.html
You can also use some Optimus predefined types:
df = op.create.df(
[
("names", "str"),
("height", "float"),
("function", "str"),
("rank", "int"),
],
[
("bumbl#ebéé ", 17.5, "Espionage", 7),
("Optim'us", 28.0, "Leader", 10),
("ironhide&", 26.0, "Security", 7),
("Jazz", 13.0, "First Lieutenant", 8),
("Megatron", None, "None", None),
])
df.table()
names (string, nullable) | height (float, nullable) | function (string, nullable) | rank (int, nullable) |
---|---|---|---|
bumbl#ebéé⸱⸱ | 17.5 | Espionage | 7 |
Optim'us | 28.0 | Leader | 10 |
ironhide& | 26.0 | Security | 7 |
Jazz | 13.0 | First Lieutenant | 8 |
Megatron | None | None | None |
Creating a dataframe, specifying for each column whether it accepts null values:
df = op.create.df(
[
("names", "str", True),
("height", "float", True),
("function", "str", True),
("rank", "int", True),
],
[
("bumbl#ebéé ", 17.5, "Espionage", 7),
("Optim'us", 28.0, "Leader", 10),
("ironhide&", 26.0, "Security", 7),
("Jazz", 13.0, "First Lieutenant", 8),
("Megatron", None, "None", None),
])
df.table()
names (string, nullable) | height (float, nullable) | function (string, nullable) | rank (int, nullable) |
---|---|---|---|
bumbl#ebéé⸱⸱ | 17.5 | Espionage | 7 |
Optim'us | 28.0 | Leader | 10 |
ironhide& | 26.0 | Security | 7 |
Jazz | 13.0 | First Lieutenant | 8 |
Megatron | None | None | None |
Creating a dataframe from a pandas dataframe:
import pandas as pd
data = [("bumbl#ebéé ", 17.5, "Espionage", 7),
("Optim'us", 28.0, "Leader", 10),
("ironhide&", 26.0, "Security", 7)]
labels = ["names", "height", "function", "rank"]
# Create pandas dataframe
pdf = pd.DataFrame.from_records(data, columns=labels)
df = op.create.df(pdf=pdf)
df.table()
names (string, nullable) | height (double, nullable) | function (string, nullable) | rank (bigint, nullable) |
---|---|---|---|
bumbl#ebéé⸱⸱ | 17.5 | Espionage | 7 |
Optim'us | 28.0 | Leader | 10 |
ironhide& | 26.0 | Security | 7 |
Here is how to view the first 10 rows of a dataframe:
df.table(10)
names (string, nullable) | height (double, nullable) | function (string, nullable) | rank (bigint, nullable) |
---|---|---|---|
bumbl#ebéé⸱⸱ | 17.5 | Espionage | 7 |
Optim'us | 28.0 | Leader | 10 |
ironhide& | 26.0 | Security | 7 |
Spark and Optimus work differently than pandas or R. If you are not familiar with Spark, we recommend taking a look at the links below.
Partitions are the way Spark divides the data across your local computer or cluster so that it can be processed in parallel. Partitioning can greatly impact Spark performance.
Take 5 minutes to read this article: https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297
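As a quick, hedged illustration with the plain Spark API (standard DataFrame/RDD calls, not Optimus-specific; df_repartitioned is just an illustrative name):
# Inspect how many partitions Spark is currently using for this dataframe
print(df.rdd.getNumPartitions())
# Redistribute the rows across 8 partitions; this returns a new dataframe
df_repartitioned = df.repartition(8)
print(df_repartitioned.rdd.getNumPartitions())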
Lazy evaluation in Spark means that the execution will not start until an action is triggered.
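A minimal sketch of what this means in practice (standard Spark calls; tall_bots is an illustrative name):
# filter() is a transformation: this line only builds a query plan,
# no data is read or processed yet
tall_bots = df.filter(df["height"] > 20)
# count() is an action: only now does Spark execute the plan
print(tall_bots.count())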
Immutability rules out a big set of potential problems due to updates from multiple threads at once. Immutable data is definitely safe to share across processes.
https://www.quora.com/Why-is-RDD-immutable-in-Spark
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-architecture.html
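To see immutability in practice (standard Spark calls; df2 is an illustrative name):
# withColumn() returns a brand-new dataframe; the original is never mutated
df2 = df.withColumn("rank_plus_one", df["rank"] + 1)
print("rank_plus_one" in df.columns)   # False: df is unchanged
print("rank_plus_one" in df2.columns)  # True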
Optimus organizes operations into columns and rows. This is a little different from pandas, where all operations revolve around the DataFrame class. We think this approach makes it easier to access and transform data. For a deep dive into this design decision, please read:
Sort by column names:
df.cols.sort().table()
function (string, nullable) | height (double, nullable) | names (string, nullable) | rank (bigint, nullable) |
---|---|---|---|
Espionage | 17.5 | bumbl#ebéé⸱⸱ | 7 |
Leader | 28.0 | Optim'us | 10 |
Security | 26.0 | ironhide& | 7 |
Sort rows by rank value:
df.rows.sort("rank").table()
names (string, nullable) | height (double, nullable) | function (string, nullable) | rank (bigint, nullable) |
---|---|---|---|
Optim'us | 28.0 | Leader | 10 |
bumbl#ebéé⸱⸱ | 17.5 | Espionage | 7 |
ironhide& | 26.0 | Security | 7 |
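For comparison, a sketch of the equivalent sort with the plain Spark API:
# orderBy() with a descending expression mirrors df.rows.sort("rank")
df.orderBy(df["rank"].desc()).show()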
df.describe().table()
summary (string, nullable) | names (string, nullable) | height (string, nullable) | function (string, nullable) | rank (string, nullable) |
---|---|---|---|---|
count | 3 | 3 | 3 | 3 |
mean | None | 23.833333333333332 | None | 8.0 |
stddev | None | 5.575242894559244 | None | 1.7320508075688772 |
min | Optim'us | 17.5 | Espionage | 7 |
max | ironhide& | 28.0 | Security | 10 |
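If you only need specific statistics, you can compute them directly with standard Spark aggregations, as in this sketch:
from pyspark.sql import functions as F

# Compute selected summary statistics for a single column
df.agg(F.mean("height"), F.stddev("height"),
       F.min("height"), F.max("height")).show()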
Unlike pandas, Spark DataFrames don't support random row access, so methods like pandas' loc are not available. Spark also doesn't maintain an index, so methods like iloc are not available either.
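When you really need row-level access, the usual workaround is to filter on a condition or to materialize an explicit id column; a sketch with the standard Spark API (row_id is an illustrative name):
from pyspark.sql import functions as F

# Instead of positional access, filter on a condition...
df.filter(df["rank"] > 7).show()

# ...or add an explicit id column and filter on it
df_with_id = df.withColumn("row_id", F.monotonically_increasing_id())
df_with_id.filter(df_with_id["row_id"] == 0).show()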
Select and show a specific column:
df.cols.select("names").table()
names (string, nullable) |
---|
bumbl#ebéé⸱⸱ |
Optim'us |
ironhide& |
Select rows from a dataframe where a condition is met:
df.rows.select(df["rank"] > 7).table()
names (string, nullable) | height (double, nullable) | function (string, nullable) | rank (bigint, nullable) |
---|---|---|---|
Optim'us | 28.0 | Leader | 10 |
Select rows that contain specific values:
df.rows.is_in("rank", [7, 10]).table()
names (string, nullable) | height (double, nullable) | function (string, nullable) | rank (bigint, nullable) |
---|---|---|---|
bumbl#ebéé⸱⸱ | 17.5 | Espionage | 7 |
Optim'us | 28.0 | Leader | 10 |
ironhide& | 26.0 | Security | 7 |
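For comparison, the same membership filter expressed with the standard Spark API:
from pyspark.sql import functions as F

# isin() keeps only the rows whose rank is 7 or 10
df.filter(F.col("rank").isin(7, 10)).show()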
Create a unique id for every row:
df.rows.create_id().table()
Create new columns:
df.cols.append("Affiliation", "Autobot").table()
names (string, nullable) | height (double, nullable) | function (string, nullable) | rank (bigint, nullable) | Affiliation (string) |
---|---|---|---|---|
bumbl#ebéé⸱⸱ | 17.5 | Espionage | 7 | Autobot |
Optim'us | 28.0 | Leader | 10 | Autobot |
ironhide& | 26.0 | Security | 7 | Autobot |
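A plain-Spark sketch that produces the same result, using a literal column:
from pyspark.sql import functions as F

# Add a constant (literal) column to every row
df.withColumn("Affiliation", F.lit("Autobot")).show()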
Dropping rows with missing data:
df.rows.drop_na("*", how='any').table()
names (string, nullable) | height (double, nullable) | function (string, nullable) | rank (bigint, nullable) |
---|---|---|---|
bumbl#ebéé⸱⸱ | 17.5 | Espionage | 7 |
Optim'us | 28.0 | Leader | 10 |
ironhide& | 26.0 | Security | 7 |
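The same operation with the standard Spark API:
# Drop every row that contains at least one null value
df.dropna(how="any").show()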
Filling missing data.
df.cols.fill_na("*", "N//A").table()
names (string, nullable) | height (string, nullable) | function (string, nullable) | rank (string, nullable) |
---|---|---|---|
bumbl#ebéé⸱⸱ | 17.5 | Espionage | 7 |
Optim'us | 28.0 | Leader | 10 |
ironhide& | 26.0 | Security | 7 |
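Note that plain Spark's fillna() is type-aware: in the sketch below a string fill value only touches string columns, whereas Optimus casts the columns (see how height became a string above) so that every null can be filled:
# Only string columns would receive "N//A" here; numeric nulls stay null
df.fillna("N//A").show()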
To get a boolean mask of where values are NaN:
df.cols.is_na("*").table()
names (string, nullable) | height (boolean) | function (string, nullable) | rank (boolean) |
---|---|---|---|
bumbl#ebéé⸱⸱ | False | Espionage | False |
Optim'us | False | Leader | False |
ironhide& | False | Security | False |
df.cols.mean("height")
23.833333333333332
df.cols.mean("*")
{'rank': {'mean': 8.0}, 'height': {'mean': 23.833333333333332}}
Apply a user-defined function to a column, casting the result to float:
def func(value, args):
    # Add one to every value in the column
    return value + 1

df.cols.apply("height", func, "float").table()
names (string, nullable) | height (float, nullable) | function (string, nullable) | rank (bigint, nullable) |
---|---|---|---|
bumbl#ebéé⸱⸱ | 18.5 | Espionage | 7 |
Optim'us | 29.0 | Leader | 10 |
ironhide& | 27.0 | Security | 7 |
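A hedged sketch of the plain-Spark equivalent, which wraps the function in a UDF (add_one is an illustrative name):
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

# Wrap the Python function in a UDF and overwrite the column with it
add_one = F.udf(lambda v: v + 1 if v is not None else None, FloatType())
df.withColumn("height", add_one(df["height"])).show()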
df.cols.count_uniques("*")
{'names': {'approx_count_distinct': 3}, 'height': {'approx_count_distinct': 3}, 'function': {'approx_count_distinct': 3}, 'rank': {'approx_count_distinct': 2}}
df \
.cols.lower("names") \
.cols.upper("function").table()
names (string, nullable) | height (double, nullable) | function (string, nullable) | rank (bigint, nullable) |
---|---|---|---|
bumbl#ebéé⸱⸱ | 17.5 | ESPIONAGE | 7 |
optim'us | 28.0 | LEADER | 10 |
ironhide& | 26.0 | SECURITY | 7 |
Optimus provides an intuitive way to concatenate dataframes by columns or rows.
df_new = op.create.df(
    [
        "class"
    ],
    [
        # Note the trailing commas: each row must be a tuple,
        # and ("Autobot") without a comma is just a string
        ("Autobot",),
        ("Autobot",),
        ("Autobot",),
        ("Autobot",),
        ("Decepticons",),
    ]).h_repartition(1)
op.append([df, df_new], "columns").table()
df_new = op.create.df(
[
"names",
"height",
"function",
"rank",
],
[
("Grimlock", 22.9, "Dinobot Commander", 9),
]).h_repartition(1)
op.append([df, df_new], "rows").table()
names (string, nullable) | height (string, nullable) | function (string, nullable) | rank (string, nullable) |
---|---|---|---|
bumbl#ebéé⸱⸱ | 17.5 | Espionage | 7 |
Optim'us | 28.0 | Leader | 10 |
ironhide& | 26.0 | Security | 7 |
Grimlock | 22.9 | Dinobot⸱Commander | 9 |
# Operations like `join` and `group` are handled using Spark directly
# melt() unpivots the dataframe: id_vars stay as identifiers, and each
# value_vars column becomes a (variable, value) pair per row
df_melt = df.melt(id_vars=["names"], value_vars=["height", "function", "rank"])
df.table()
names (string, nullable) | height (double, nullable) | function (string, nullable) | rank (bigint, nullable) |
---|---|---|---|
bumbl#ebéé⸱⸱ | 17.5 | Espionage | 7 |
Optim'us | 28.0 | Leader | 10 |
ironhide& | 26.0 | Security | 7 |
df_melt.pivot("names", "variable", "value").table()
names (string, nullable) | function (string, nullable) | height (string, nullable) | rank (string, nullable) |
---|---|---|---|
bumbl#ebéé⸱⸱ | Espionage | 17.5 | 7 |
ironhide& | Security | 26.0 | 7 |
Optim'us | Leader | 28.0 | 10 |
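With the standard Spark API the same pivot can be expressed as a grouped aggregation, as in this sketch:
from pyspark.sql import functions as F

# Group by the id column, turn the 'variable' values back into headers,
# and take the first value found for each cell
df_melt.groupBy("names").pivot("variable").agg(F.first("value")).show()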
Plot a histogram of the height column using 10 buckets:
df.plot.hist("height", 10)
Plot frequency charts for every column:
df.plot.frequency("*", 10)
df.cols.names()
['names', 'height', 'function', 'rank']
df.to_json()
df.schema
StructType(List(StructField(names,StringType,true),StructField(height,DoubleType,true),StructField(function,StringType,true),StructField(rank,LongType,true)))
df.table()
names
1 (string)
nullable
|
height
2 (double)
nullable
|
function
3 (string)
nullable
|
rank
4 (bigint)
nullable
|
---|---|---|---|
bumbl#ebéé⸱⸱ | 17.5 | Espionage | 7 |
Optim'us | 28.0 | Leader | 10 |
ironhide& | 26.0 | Security | 7 |
Run the Optimus profiler over the height column:
op.profiler.run(df, "height", infer=True)
Dataset info
Number of columns | 4 |
Number of rows | 3 |
Total Missing (%) | 0.0% |
Total size in memory | 81.7 MB |
Column types
String | 0 |
Numeric | 1 |
Date | 0 |
Bool | 0 |
Array | 0 |
Not available | 0 |
Unique | 3 |
Unique (%) | 100.0 |
Missing | 0.0 |
Missing (%) | 0 |
String | 0 |
Integer | 0 |
Float | 0 |
Bool | 0 |
Date | 0 |
Missing | 0 |
Null | 0 |
Mean | 23.833333333333332 |
Minimum | 17.5 |
Maximum | 28.0 |
Zeros(%) | 0 |
Value | Count | Frequency (%) |
---|---|---|
28.0 | 1 | 33.333% |
26.0 | 1 | 33.333% |
17.5 | 1 | 33.333% |
"Missing" | 0 | 0.0% |
Minimum | 17.5 |
5-th percentile | 17.5 |
Q1 | 17.5 |
Median | 17.5 |
Q3 | 17.5 |
95-th percentile | 17.5 |
Maximum | 28.0 |
Range | 10.5 |
Interquartile range | 0.0 |
Standard deviation | 5.575242894559244 |
Coef of variation | 0.23393 |
Kurtosis | -1.5000000000000004 |
Mean | 23.833333333333332 |
MAD | 0.0 |
Skewness | 0 |
Sum | 71.5 |
Variance | 31.083333333333336 |
names (string, nullable) | height (double, nullable) | function (string, nullable) | rank (bigint, nullable) |
---|---|---|---|
bumbl#ebéé⸱⸱ | 17.5 | Espionage | 7 |
Optim'us | 28.0 | Leader | 10 |
ironhide& | 26.0 | Security | 7 |
Loading a dataframe from a CSV file over HTTP; Optimus downloads the file and then creates the dataframe:
df_csv = op.load.csv("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.csv").limit(5)
df_csv.table()
Downloading foo.csv from https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.csv
Downloaded 967 bytes
Creating DataFrame for foo.csv. Please wait...
Successfully created DataFrame for 'foo.csv'
id (int, nullable) | firstName (string, nullable) | lastName (string, nullable) | billingId (int, nullable) | product (string, nullable) | price (int, nullable) | birth (string, nullable) | dummyCol (string, nullable) |
---|---|---|---|---|---|---|---|
1 | Luis | Alvarez$$%! | 123 | Cake | 10 | 1980/07/07 | never |
2 | André | Ampère | 423 | piza | 8 | 1950/07/08 | gonna |
3 | NiELS | Böhr//((%% | 551 | pizza | 8 | 1990/07/09 | give |
4 | PAUL | dirac$ | 521 | pizza | 8 | 1954/07/10 | you |
5 | Albert | Einstein | 634 | pizza | 8 | 1990/07/11 | up |
Loading a dataframe from a JSON file works the same way:
df_json = op.load.json("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.json").limit(5)
df_json.table()
Downloading foo.json from https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.json
Downloaded 2596 bytes
Creating DataFrame for foo.json. Please wait...
Successfully created DataFrame for 'foo.json'
billingId (bigint, nullable) | birth (string, nullable) | dummyCol (string, nullable) | firstName (string, nullable) | id (bigint, nullable) | lastName (string, nullable) | price (bigint, nullable) | product (string, nullable) |
---|---|---|---|---|---|---|---|
123 | 1980/07/07 | never | Luis | 1 | Alvarez$$%! | 10 | Cake |
423 | 1950/07/08 | gonna | André | 2 | Ampère | 8 | piza |
551 | 1990/07/09 | give | NiELS | 3 | Böhr//((%% | 8 | pizza |
521 | 1954/07/10 | you | PAUL | 4 | dirac$ | 8 | pizza |
634 | 1990/07/11 | up | Albert | 5 | Einstein | 8 | pizza |
Save the dataframe to a CSV file:
df_csv.save.csv("test.csv")
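If you prefer the standard Spark writer (which produces a directory of part files rather than a single file; the output path here is illustrative), a minimal sketch:
# coalesce(1) forces a single part file inside the 'test_spark.csv' folder
df_csv.coalesce(1).write.mode("overwrite").option("header", "true").csv("test_spark.csv")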
df.table()
names (string, nullable) | height (double, nullable) | function (string, nullable) | rank (bigint, nullable) |
---|---|---|---|
bumbl#ebéé⸱⸱ | 17.5 | Espionage | 7 |
Optim'us | 28.0 | Leader | 10 |
ironhide& | 26.0 | Security | 7 |
Load the whole JSON file, this time without the limit used earlier:
df = op.load.json("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.json")
df.table()
billingId (bigint, nullable) | birth (string, nullable) | dummyCol (string, nullable) | firstName (string, nullable) | id (bigint, nullable) | lastName (string, nullable) | price (bigint, nullable) | product (string, nullable) |
---|---|---|---|---|---|---|---|
123 | 1980/07/07 | never | Luis | 1 | Alvarez$$%! | 10 | Cake |
423 | 1950/07/08 | gonna | André | 2 | Ampère | 8 | piza |
551 | 1990/07/09 | give | NiELS | 3 | Böhr//((%% | 8 | pizza |
521 | 1954/07/10 | you | PAUL | 4 | dirac$ | 8 | pizza |
634 | 1990/07/11 | up | Albert | 5 | Einstein | 8 | pizza |
672 | 1930/08/12 | never | Galileo | 6 | ⸱⸱⸱⸱⸱⸱⸱⸱⸱⸱⸱⸱⸱GALiLEI | 5 | arepa |
323 | 1970/07/13 | gonna | CaRL | 7 | Ga%%%uss | 3 | taco |
624 | 1950/07/14 | let | David | 8 | H$$$ilbert | 3 | taaaccoo |
735 | 1920/04/22 | you | Johannes | 9 | KEPLER | 3 | taco |
875 | 1923/03/12 | down | JaMES | 10 | M$$ax%%well | 3 | taco |
import requests

def func_request(params):
    # You can use here whatever header or auth info you need to send.
    # For more information see the requests library
    url = "https://jsonplaceholder.typicode.com/todos/" + str(params["id"])
    return requests.get(url)

def func_response(response):
    # Here you can parse the response
    return response["title"]
Optimus can enrich a dataframe with data fetched from an external API, using MongoDB to store intermediate results:
e = op.enrich(host="localhost", port=27017, db_name="jazz")
e.flush()
df_result = e.run(df, func_request, func_response, calls=60, period=60, max_tries=8)
df_result.table()
billingId (bigint, nullable) | birth (string, nullable) | dummyCol (string, nullable) | firstName (string, nullable) | id (bigint, nullable) | lastName (string, nullable) | price (bigint, nullable) | product (string, nullable) | jazz_results (string, nullable) |
---|---|---|---|---|---|---|---|---|
123 | 1980/07/07 | never | Luis | 1 | Alvarez$$%! | 10 | Cake | delectus aut autem |
423 | 1950/07/08 | gonna | André | 2 | Ampère | 8 | piza | quis ut nam facilis et officia qui |
551 | 1990/07/09 | give | NiELS | 3 | Böhr//((%% | 8 | pizza | fugiat veniam minus |
521 | 1954/07/10 | you | PAUL | 4 | dirac$ | 8 | pizza | et porro tempora |
634 | 1990/07/11 | up | Albert | 5 | Einstein | 8 | pizza | laboriosam mollitia et enim quasi adipisci quia provident illum |
672 | 1930/08/12 | never | Galileo | 6 | ⸱⸱⸱⸱⸱⸱⸱⸱⸱⸱⸱⸱⸱GALiLEI | 5 | arepa | qui ullam ratione quibusdam voluptatem quia omnis |
323 | 1970/07/13 | gonna | CaRL | 7 | Ga%%%uss | 3 | taco | illo expedita consequatur quia in |
624 | 1950/07/14 | let | David | 8 | H$$$ilbert | 3 | taaaccoo | quo adipisci enim quam ut ab |
735 | 1920/04/22 | you | Johannes | 9 | KEPLER | 3 | taco | molestiae perspiciatis ipsa |
875 | 1923/03/12 | down | JaMES | 10 | M$$ax%%well | 3 | taco | illo est ratione doloremque quia maiores aut |