!python --version
Python 3.7.6
import nltk
nltk.__version__
'3.4.5'
To launch a live notebook server to test Optimus using Binder or Colab, click on one of the following badges:

PyOptimus is the missing framework to profile, clean, process and plot small and big data. PyOptimus runs on pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex, so you can use any of these libraries without changing your code.
In your terminal just type pip install pyoptimus
For the engines
Why so many engines? Every engine has its own strengths and trade-offs.
You can go to the 10 minutes to Optimus notebook, where you can find the basics to start working.

You can also go to the examples folder to find specific notebooks about data cleaning, data wrangling, profiling, and how to create ML models.

Also check out the Cheat Sheet.

Feedback is what drives Optimus' future, so please take a couple of minutes to help shape the Optimus Roadmap: http://bit.ly/optimus_survey

Also, if you have a suggestion or feature request, use https://github.com/ironmussa/optimus/issues
%load_ext autoreload
%autoreload 2
import sys
sys.path.append("..")
from optimus import Optimus
op = Optimus(verbose=True)
C:\Users\argenisleon\Anaconda3\lib\site-packages\socks.py:58: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working from collections import Callable You are using PySparkling of version 2.4.10, but your PySpark is of version 2.3.1. Please make sure Spark and PySparkling versions are compatible. INFO:optimus:Operative System:Windows INFO:optimus:Just check that Spark and all necessary environments vars are present... INFO:optimus:----- INFO:optimus:SPARK_HOME=C:\opt\spark\spark-2.3.1-bin-hadoop2.7 INFO:optimus:HADOOP_HOME=C:\opt\hadoop-2.7.7 INFO:optimus:PYSPARK_PYTHON=C:\Users\argenisleon\Anaconda3\python.exe INFO:optimus:PYSPARK_DRIVER_PYTHON=jupyter INFO:optimus:PYSPARK_SUBMIT_ARGS=--jars "file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar" --driver-class-path "C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar" --conf "spark.sql.catalogImplementation=hive" pyspark-shell INFO:optimus:JAVA_HOME=C:\java INFO:optimus:Pyarrow Installed INFO:optimus:----- INFO:optimus:Starting or getting SparkSession and SparkContext... INFO:optimus:Spark Version:2.3.1 INFO:optimus: ____ __ _ / __ \____ / /_(_)___ ___ __ _______ / / / / __ \/ __/ / __ `__ \/ / / / ___/ / /_/ / /_/ / /_/ / / / / / / /_/ (__ ) \____/ .___/\__/_/_/ /_/ /_/\__,_/____/ /_/ INFO:optimus:Transform and Roll out... INFO:optimus:Optimus successfully imported. Have fun :). INFO:optimus:Config.ini not found
You can also use an already created Spark session:
from pyspark.sql import SparkSession
from optimus import Optimus
spark = SparkSession.builder.appName('optimus').getOrCreate()
op = Optimus(spark)
INFO:optimus: ____ __ _ / __ \____ / /_(_)___ ___ __ _______ / / / / __ \/ __/ / __ `__ \/ / / / ___/ / /_/ / /_/ / /_/ / / / / / / /_/ (__ ) \____/ .___/\__/_/_/ /_/ /_/\__,_/____/ /_/ INFO:optimus:Transform and Roll out... INFO:optimus:Optimus successfully imported. Have fun :). INFO:optimus:Config.ini not found
Optimus can load data in CSV, JSON, Parquet, Avro and Excel formats, from a local file or a URL.
#csv
df = op.load.csv("../examples/data/foo.csv")
#json
# Use a local file
df = op.load.json("../examples/data/foo.json")
# Use a url
df = op.load.json("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.json")
# parquet
df = op.load.parquet("../examples/data/foo.parquet")
# avro
# df = op.load.avro("../examples/data/foo.avro").table(5)
# excel
df = op.load.excel("../examples/data/titanic3.xls")
INFO:optimus:Downloading foo.json from https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.json INFO:optimus:Downloaded 2596 bytes INFO:optimus:Creating DataFrame for foo.json. Please wait... INFO:optimus:Successfully created DataFrame for 'foo.json'
You can also load data from Oracle, Redshift, MySQL and Postgres. See *Database connection*.
#csv
df.save.csv("data/foo.csv")
# json
df.save.json("data/foo.json")
# parquet
df.save.parquet("data/foo.parquet")
# avro
#df.save.avro("examples/data/foo.avro")
INFO:optimus:`pclass`,`survived`,`name`,`sex`,`age`,`sibsp`,`parch`,`ticket`,`fare`,`cabin`,`embarked`,`boat`,`body`,`home_dest` column(s) was not processed because is/are not date,array,vector,binary,null INFO:optimus:`pclass`,`survived`,`name`,`sex`,`age`,`sibsp`,`parch`,`ticket`,`fare`,`cabin`,`embarked`,`boat`,`body`,`home_dest` column(s) was not processed because is/are not null
You can also save data to Oracle, Redshift, MySQL and Postgres. See *Database connection*.
With Optimus it is easy to load jars, packages and repos. You can init Optimus/Spark like this:
op = Optimus(repositories="myrepo", packages="org.apache.spark:spark-avro_2.12:2.4.3", jars="my.jar", driver_class_path="this_is_a_jar_class_path.jar", verbose=True)
INFO:optimus:Operative System:Windows INFO:optimus:Just check that Spark and all necessary environments vars are present... INFO:optimus:----- INFO:optimus:SPARK_HOME=C:\opt\spark\spark-2.3.1-bin-hadoop2.7 INFO:optimus:HADOOP_HOME=C:\opt\hadoop-2.7.7 INFO:optimus:PYSPARK_PYTHON=C:\Users\argenisleon\Anaconda3\python.exe INFO:optimus:PYSPARK_DRIVER_PYTHON=jupyter INFO:optimus:PYSPARK_SUBMIT_ARGS=--repositories myrepo --packages org.apache.spark:spark-avro_2.12:2.4.3 --jars "my.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar" --driver-class-path "this_is_a_jar_class_path.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar" --conf "spark.sql.catalogImplementation=hive" pyspark-shell INFO:optimus:JAVA_HOME=C:\java INFO:optimus:Pyarrow Installed INFO:optimus:----- INFO:optimus:Starting or getting SparkSession and SparkContext... INFO:optimus:Spark Version:2.3.1 INFO:optimus: ____ __ _ / __ \____ / /_(_)___ ___ __ _______ / / / / __ \/ __/ / __ `__ \/ / / / ___/ / /_/ / /_/ / /_/ / / / / / / /_/ (__ ) \____/ .___/\__/_/_/ /_/ /_/\__,_/____/ /_/ INFO:optimus:Transform and Roll out... INFO:optimus:Optimus successfully imported. Have fun :). INFO:optimus:Config.ini not found
You can also create a dataframe from scratch:
from pyspark.sql.types import *
from datetime import date, datetime
df = op.create.df(
[
("names", "str", True),
("height(ft)","int", True),
("function", "str", True),
("rank", "int", True),
("age","int",True),
("weight(t)","float",True),
("japanese name", ArrayType(StringType()), True),
("last position seen", "str", True),
("date arrival", "str", True),
("last date seen", "str", True),
("attributes", ArrayType(FloatType()), True),
("DateType"),
("Tiemstamp"),
("Cybertronian", "bool", True),
("NullType", "null", True),
],
[
("Optim'us", 28, "Leader", 10, 5000000, 4.3, ["Inochi", "Convoy"], "19.442735,-99.201111", "1980/04/10",
"2016/09/10", [8.5344, 4300.0], date(2016, 9, 10), datetime(2014, 6, 24), True,
None),
("bumbl#ebéé ", 17, "Espionage", 7, 5000000, 2.0, ["Bumble", "Goldback"], "10.642707,-71.612534", "1980/04/10",
"2015/08/10", [5.334, 2000.0], date(2015, 8, 10), datetime(2014, 6, 24), True,
None),
("ironhide&", 26, "Security", 7, 5000000, 4.0, ["Roadbuster"], "37.789563,-122.400356", "1980/04/10",
"2014/07/10", [7.9248, 4000.0], date(2014, 6, 24), datetime(2014, 6, 24), True,
None),
("Jazz", 13, "First Lieutenant", 8, 5000000, 1.80, ["Meister"], "33.670666,-117.841553", "1980/04/10",
"2013/06/10", [3.9624, 1800.0], date(2013, 6, 24), datetime(2014, 6, 24), True, None),
("Megatron", None, "None", 10, 5000000, 5.70, ["Megatron"], None, "1980/04/10", "2012/05/10", [None, 5700.0],
date(2012, 5, 10), datetime(2014, 6, 24), True, None),
("Metroplex_)^$", 300, "Battle Station", 8, 5000000, None, ["Metroflex"], None, "1980/04/10", "2011/04/10",
[91.44, None], date(2011, 4, 10), datetime(2014, 6, 24), True, None),
], infer_schema = True).ext.h_repartition(1)
With .table() you have a beautiful way to show your data, with extra information like the column number, the column data type and marked white spaces. (In this README, table_image() is used instead so the output can be embedded as an image; in a live notebook you can simply call .table().)
df.table_image("images/table.png")
Loading page (1/2) Rendering (2/2) Done
You can also create a dataframe from a pandas dataframe:
import pandas as pd
pdf = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c',3:'d'},
'B': {0: 1, 1: 3, 2: 5,3:7},
'C': {0: 2, 1: 4, 2: 6,3:None},
'D': {0:'1980/04/10',1:'1980/04/10',2:'1980/04/10',3:'1980/04/10'},
})
s_pdf = op.create.df(pdf=pdf)
s_pdf.table_image("images/pandas.png")
Loading page (1/2) Rendering (2/2) Done
Optimus V2 was created to make data cleaning a breeze. The API was designed to be super easy for newcomers and very familiar for people coming from pandas. Optimus extends the Spark DataFrame functionality, adding .rows and .cols attributes.

For example, you can load data from a URL, transform it, and apply some predefined cleaning functions:
# This is a custom function
def func(value, arg):
return "this was a number"
new_df = df\
.rows.sort("rank","desc")\
.withColumn('new_age', df.age)\
.cols.lower(["names","function"])\
.cols.date_transform("date arrival", "yyyy/MM/dd", "dd-MM-YYYY")\
.cols.years_between("date arrival", "dd-MM-YYYY", output_cols = "from arrival")\
.cols.remove_accents("names")\
.cols.remove_special_chars("names")\
.rows.drop(df["rank"]>8)\
.cols.rename(str.lower)\
.cols.trim("*")\
.cols.unnest("japanese name", output_cols="other names")\
.cols.unnest("last position seen",separator=",", output_cols="pos")\
.cols.drop(["last position seen", "japanese name","date arrival", "cybertronian", "nulltype"])
INFO:optimus:Using 'column_exp' to process column 'names' with function _lower INFO:optimus:Using 'column_exp' to process column 'function' with function _lower INFO:optimus:Using 'column_exp' to process column 'date arrival' with function _date_transform INFO:optimus:Using 'column_exp' to process column 'date arrival' with function _years_between INFO:optimus:Using 'column_exp' to process column 'from arrival' with function _cast_to INFO:optimus:Using 'pandas_udf' to process column 'names' with function _remove_accents INFO:optimus:Using 'pandas_udf' to process column 'names' with function multiple_replace INFO:optimus:`japanese name`,`attributes`,`tiemstamp`,`nulltype` column(s) was not processed because is/are not byte,short,big,int,double,float,string,date,bool INFO:optimus:Using 'column_exp' to process column 'names' with function _trim INFO:optimus:Using 'column_exp' to process column 'height(ft)' with function _trim INFO:optimus:Using 'column_exp' to process column 'function' with function _trim INFO:optimus:Using 'column_exp' to process column 'rank' with function _trim INFO:optimus:Using 'column_exp' to process column 'age' with function _trim INFO:optimus:Using 'column_exp' to process column 'weight(t)' with function _trim INFO:optimus:Using 'column_exp' to process column 'last position seen' with function _trim INFO:optimus:Using 'column_exp' to process column 'date arrival' with function _trim INFO:optimus:Using 'column_exp' to process column 'last date seen' with function _trim INFO:optimus:Using 'column_exp' to process column 'datetype' with function _trim INFO:optimus:Using 'column_exp' to process column 'cybertronian' with function _trim INFO:optimus:Using 'column_exp' to process column 'new_age' with function _trim INFO:optimus:Using 'column_exp' to process column 'from arrival' with function _trim
You transform this
df.table_image("images/table1.png")
Loading page (1/2) Rendering (2/2) Done
Into this
new_df.table_image("images/table2.png")
Loading page (1/2) Rendering (2/2) Done
Note that you can use Optimus functions, Spark functions (.withColumn()) and all the functions available on a Spark DataFrame at the same time. To learn about all the Optimus functionality, please go to these notebooks.
With Optimus you can control how the output column from a transformation is handled.
from pyspark.sql import functions as F
def func(col_name, attr):
return F.upper(F.col(col_name))
If a string is passed to input_cols and output_cols is not defined, the result of the operation is saved in the same input column.
output_df = df.cols.apply(input_cols="names", output_cols=None,func=func)
output_df.table_image("images/column_output_1.png")
INFO:optimus:Using 'column_exp' to process column 'names' with function func
Loading page (1/2) Rendering (2/2) Done
If a string is passed to input_cols and a string is passed to output_cols, the output is saved in the output column.
output_df = df.cols.apply(input_cols="names", output_cols="names_up",func=func)
output_df.table_image("images/column_output_2.png")
INFO:optimus:Using 'column_exp' to process column 'names' with function func
Loading page (1/2) Rendering (2/2) Done
If a list is passed to input_cols and a string is passed to output_cols, Optimus concatenates the string with every element of the list to create the new output column names.
output_df = df.cols.apply(input_cols=["names","function"], output_cols="_up",func=func)
output_df.table_image("images/column_output_3.png")
INFO:optimus:Using 'column_exp' to process column 'names' with function func INFO:optimus:Using 'column_exp' to process column 'function' with function func
Loading page (1/2) Rendering (2/2) Done
If a list is passed to input_cols and a list is passed to output_cols, Optimus writes every input column to the respective output column.
output_df = df.cols.apply(input_cols=["names","function"], output_cols=["names_up","function_up"],func=func)
output_df.table_image("images/column_output_4.png")
INFO:optimus:Using 'column_exp' to process column 'names' with function func INFO:optimus:Using 'column_exp' to process column 'function' with function func
Loading page (1/2) Rendering (2/2) Done
Spark has multiple ways to transform your data: RDDs, column expressions, UDFs and pandas UDFs. In Optimus we created apply() and apply_expr(), which handle all the implementation complexity.

Here you apply a function to the "height(ft)" column that adds 1 and 2 to the current column value. All powered by pandas UDFs.
def func(value, args):
return value + args[0] + args[1]
df.cols.apply("height(ft)",func,"int", [1,2]).table_image("images/table3.png")
INFO:optimus:Using 'pandas_udf' to process column 'height(ft)' with function func
Loading page (1/2) Rendering (2/2) Done
If you want to apply a column expression, use apply_expr() like this. In this case we pass the argument 20 to divide the current column value:
from pyspark.sql import functions as F
def func(col_name, args):
    # use the passed argument (20 here) as the divisor
    return F.col(col_name) / args
df.cols.apply("height(ft)", func=func, args=20).table_image("images/table4.png")
INFO:optimus:Using 'column_exp' to process column 'height(ft)' with function func
Loading page (1/2) Rendering (2/2) Done
You can change the table output back to ascii if you wish
op.output("ascii")
To return to HTML just:
op.output("html")
Optimus comes with a powerful and unique data profiler. Besides basic and advanced stats like min, max, kurtosis and MAD, it also lets you know what type of data every column has. For example, if a string column mixes strings, integers, floats, bools and dates, Optimus can give you a unique overview of your data.
Just run df.profile("*")
to profile all the columns. For more info about the profiler please go to this notebook.
Let's load a "big" dataset
df = op.load.csv("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/Meteorite_Landings.csv").ext.h_repartition()
INFO:optimus:Downloading Meteorite_Landings.csv from https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/Meteorite_Landings.csv INFO:optimus:Downloaded 4978151 bytes INFO:optimus:Creating DataFrame for Meteorite_Landings.csv. Please wait... INFO:optimus:Successfully created DataFrame for 'Meteorite_Landings.csv'
op.profiler.run(df, "mass (g)", infer=False)
INFO:optimus:Processing column 'mass (g)'... INFO:optimus:_count_data_types() executed in 2.31 sec INFO:optimus:count_data_types() executed in 2.31 sec INFO:optimus:cast_columns() executed in 0.0 sec INFO:optimus:agg_exprs() executed in 3.29 sec INFO:optimus:general_stats() executed in 3.3 sec INFO:optimus:------------------------------ INFO:optimus:Processing column 'mass (g)'... INFO:optimus:frequency() executed in 4.77 sec INFO:optimus:stats_by_column() executed in 0.0 sec INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _cast_to INFO:optimus:percentile() executed in 0.39 sec INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _cast_to INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _cast_to INFO:optimus:extra_numeric_stats() executed in 0.9 sec INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _bucketizer INFO:optimus:bucketizer() executed in 0.47 sec INFO:optimus:hist() executed in 2.54 sec
Including 'nan' as Null in processing 'name' Including 'nan' as Null in processing 'nametype' Including 'nan' as Null in processing 'recclass' Including 'nan' as Null in processing 'fall' Including 'nan' as Null in processing 'year' Including 'nan' as Null in processing 'GeoLocation'
INFO:optimus:dataset_info() executed in 2.28 sec
| Dataset info | |
|---|---|
| Number of columns | 10 |
| Number of rows | 45716 |
| Total Missing (%) | 0.49% |
| Total size in memory | 88.2 MB |

| Column types | |
|---|---|
| String | 0 |
| Numeric | 1 |
| Date | 0 |
| Bool | 0 |
| Array | 0 |
| Not available | 0 |

| mass (g) | |
|---|---|
| Unique | 12497 |
| Unique (%) | 27.336 |
| Missing | 0.0 |
| Missing (%) | 0 |

| Datatypes | |
|---|---|
| String | 0 |
| Integer | 0 |
| Float | 0 |
| Bool | 0 |
| Date | 0 |
| Missing | 0 |
| Null | 131 |

| Summary | |
|---|---|
| Mean | 13278.07855 |
| Minimum | 0.0 |
| Maximum | 60000000.0 |
| Zeros (%) | |

| Value | Count | Frequency (%) |
|---|---|---|
| 1.3 | 171 | 0.374% |
| 1.2 | 140 | 0.306% |
| 1.4 | 138 | 0.302% |
| None | 131 | 0.287% |
| 2.1 | 130 | 0.284% |
| 2.4 | 126 | 0.276% |
| 1.6 | 120 | 0.262% |
| 0.5 | 119 | 0.26% |
| 1.1 | 116 | 0.254% |
| 3.8 | 114 | 0.249% |
| "Missing" | 0 | 0.0% |

| Quantile and descriptive statistics | |
|---|---|
| Minimum | 0.0 |
| 5-th percentile | |
| Q1 | |
| Median | |
| Q3 | |
| 95-th percentile | |
| Maximum | 60000000.0 |
| Range | 60000000.0 |
| Interquartile range | 0.0 |
| Standard deviation | 574988.87641 |
| Coef of variation | 43.30362 |
| Kurtosis | 6796.17061 |
| Mean | 13278.07855 |
| MAD | 0.0 |
| Skewness | |
| Sum | 605281210.638 |
| Variance | 330612207995.7785 |
| name (string) | id (int) | nametype (string) | recclass (string) | mass (g) (double) | fall (string) | year (string) | reclat (double) | reclong (double) | GeoLocation (string) |
|---|---|---|---|---|---|---|---|---|---|
| Acfer 232 | 240 | Valid | H5 | 725.0 | Found | 01/01/1991 12:00:00 AM | 27.73944 | 4.32833 | (27.739440, 4.328330) |
| Asuka 87197 | 2554 | Valid | H4 | 124.99 | Found | 01/01/1987 12:00:00 AM | -72.0 | 26.0 | (-72.000000, 26.000000) |
| Gladstone (iron) | 10920 | Valid | Iron, IAB-MG | 736600.0 | Found | 01/01/1915 12:00:00 AM | -23.9 | 151.3 | (-23.900000, 151.300000) |
| Nullarbor 015 | 17955 | Valid | L6 | 3986.0 | Found | 01/01/1980 12:00:00 AM | None | None | None |
| Lewis Cliff 86533 | 13461 | Valid | H5 | 15.7 | Found | 01/01/1986 12:00:00 AM | -84.26756 | 161.3631 | (-84.267560, 161.363100) |
| Grove Mountains 053589 | 48447 | Valid | L5 | 1.4 | Found | 01/01/2006 12:00:00 AM | -72.7825 | 75.300278 | (-72.782500, 75.300278) |
| Sayh al Uhaymir 108 | 23300 | Valid | H6 | 16.0 | Found | 01/01/2001 12:00:00 AM | 21.06667 | 57.31667 | (21.066670, 57.316670) |
| Northwest Africa 3088 | 31218 | Valid | L6 | 171.0 | Found | 01/01/2003 12:00:00 AM | None | None | None |
| Reckling Peak 92423 | 22432 | Valid | L6 | 3.8 | Found | 01/01/1992 12:00:00 AM | -76.22029 | 158.37967 | (-76.220290, 158.379670) |
| Sweetwater | 23770 | Valid | H5 | 1760.0 | Found | 01/01/1961 12:00:00 AM | 32.55 | -100.41667 | (32.550000, -100.416670) |
INFO:optimus:run() executed in 19.68 sec
op.profiler.to_image(output_path="images/profiler_numeric.png")
Loading page (1/2) Warning: Failed to load file:///C:/Users/ARGENI~1/AppData/Local/Temp/optimus/styles/styles.css (ignore) Rendering (2/2) Done
op.profiler.run(df, "name", infer=False)
INFO:optimus:Processing column 'name'...
Including 'nan' as Null in processing 'name'
INFO:optimus:_count_data_types() executed in 1.43 sec INFO:optimus:count_data_types() executed in 1.43 sec INFO:optimus:Using 'column_exp' to process column 'name' with function _cast_to INFO:optimus:cast_columns() executed in 0.02 sec INFO:optimus:agg_exprs() executed in 1.94 sec INFO:optimus:general_stats() executed in 1.95 sec INFO:optimus:------------------------------ INFO:optimus:Processing column 'name'... INFO:optimus:frequency() executed in 4.21 sec INFO:optimus:stats_by_column() executed in 0.0 sec INFO:optimus:Using 'column_exp' to process column 'name_len' with function func_col_exp INFO:optimus:Using 'column_exp' to process column 'name_len' with function _bucketizer INFO:optimus:bucketizer() executed in 0.35 sec INFO:optimus:hist() executed in 3.02 sec INFO:optimus:hist_string() executed in 5.39 sec
Including 'nan' as Null in processing 'name' Including 'nan' as Null in processing 'nametype' Including 'nan' as Null in processing 'recclass' Including 'nan' as Null in processing 'fall' Including 'nan' as Null in processing 'year' Including 'nan' as Null in processing 'GeoLocation'
INFO:optimus:dataset_info() executed in 1.78 sec
| Dataset info | |
|---|---|
| Number of columns | 10 |
| Number of rows | 45716 |
| Total Missing (%) | 0.49% |
| Total size in memory | 92.1 MB |

| Column types | |
|---|---|
| String | 1 |
| Numeric | 0 |
| Date | 0 |
| Bool | 0 |
| Array | 0 |
| Not available | 0 |

| name | |
|---|---|
| Unique | 45515 |
| Unique (%) | 99.56 |
| Missing | 0.0 |
| Missing (%) | 0 |

| Datatypes | |
|---|---|
| String | 45716 |
| Integer | 0 |
| Float | 0 |
| Bool | 0 |
| Date | 0 |
| Missing | 0 |
| Null | 0 |

| Value | Count | Frequency (%) |
|---|---|---|
| Święcany | 1 | 0.002% |
| Łowicz | 1 | 0.002% |
| Österplana 064 | 1 | 0.002% |
| Österplana 063 | 1 | 0.002% |
| Österplana 062 | 1 | 0.002% |
| Österplana 061 | 1 | 0.002% |
| Österplana 060 | 1 | 0.002% |
| Österplana 059 | 1 | 0.002% |
| Österplana 058 | 1 | 0.002% |
| Österplana 057 | 1 | 0.002% |
| "Missing" | 0 | 0.0% |
(The output ends with the same 10-row data preview shown above.)
INFO:optimus:run() executed in 17.73 sec
op.profiler.to_image(output_path="images/profiler.png")
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-63-1304007aee00> in <module> ----> 1 op.profiler.to_image(output_path="images/profiler.png") ~\Documents\Optimus\optimus\profiler\profiler.py in to_image(self, output_path) 267 """ 268 css = absolute_path("/css/styles.css") --> 269 imgkit.from_string(self.html, output_path, css=css) 270 271 print_html("<img src='" + output_path + "'>") ~\Anaconda3\lib\site-packages\imgkit\api.py in from_string(string, output_path, options, toc, cover, css, config, cover_first) 87 """ 88 rtn = IMGKit(string, 'string', options=options, toc=toc, cover=cover, css=css, ---> 89 config=config, cover_first=cover_first) 90 return rtn.to_img(output_path) 91 ~\Anaconda3\lib\site-packages\imgkit\imgkit.py in __init__(self, url_or_file, source_type, options, toc, cover, css, config, cover_first) 42 self.options = {} 43 if self.source.isString(): ---> 44 self.options.update(self._find_options_in_meta(url_or_file)) 45 46 if options: ~\Anaconda3\lib\site-packages\imgkit\imgkit.py in _find_options_in_meta(self, content) 199 found = {} 200 --> 201 for x in re.findall('<meta [^>]*>', content): 202 if re.search('name=["\']%s' % self.config.meta_tag_prefix, x): 203 name = re.findall('name=["\']%s([^"\']*)' % ~\Anaconda3\lib\re.py in findall(pattern, string, flags) 221 222 Empty matches are included in the result.""" --> 223 return _compile(pattern, flags).findall(string) 224 225 def finditer(pattern, string, flags=0): TypeError: expected string or bytes-like object
For date data types, Optimus can give you extra information:
op.profiler.run(df, "year", infer=True)
INFO:optimus:Processing column 'year'... INFO:optimus:_count_data_types() executed in 23.75 sec INFO:optimus:count_data_types() executed in 23.75 sec INFO:optimus:cast_columns() executed in 0.0 sec INFO:optimus:agg_exprs() executed in 1.59 sec INFO:optimus:general_stats() executed in 1.6 sec INFO:optimus:------------------------------ INFO:optimus:Processing column 'year'... INFO:optimus:frequency() executed in 3.26 sec INFO:optimus:stats_by_column() executed in 0.0 sec INFO:optimus:Using 'pandas_udf' to process column 'year' with function func_infer_date INFO:optimus:Using 'column_exp' to process column 'year_0' with function _bucketizer INFO:optimus:bucketizer() executed in 0.81 sec INFO:optimus:hist() executed in 3.44 sec INFO:optimus:Using 'column_exp' to process column 'year_1' with function _bucketizer INFO:optimus:bucketizer() executed in 0.12 sec INFO:optimus:hist() executed in 1.61 sec INFO:optimus:Using 'column_exp' to process column 'year_2' with function _bucketizer INFO:optimus:bucketizer() executed in 0.1 sec INFO:optimus:hist() executed in 1.76 sec INFO:optimus:Using 'column_exp' to process column 'year_3' with function _bucketizer INFO:optimus:bucketizer() executed in 0.22 sec INFO:optimus:hist() executed in 1.7 sec INFO:optimus:Using 'column_exp' to process column 'year_4' with function _bucketizer INFO:optimus:bucketizer() executed in 0.52 sec INFO:optimus:hist() executed in 1.97 sec INFO:optimus:hist_date() executed in 62.45 sec
Including 'nan' as Null in processing 'name' Including 'nan' as Null in processing 'nametype' Including 'nan' as Null in processing 'recclass' Including 'nan' as Null in processing 'fall' Including 'nan' as Null in processing 'year' Including 'nan' as Null in processing 'GeoLocation'
INFO:optimus:dataset_info() executed in 1.79 sec
| Dataset info | |
|---|---|
| Number of columns | 10 |
| Number of rows | 45716 |
| Total Missing (%) | 0.49% |
| Total size in memory | 97.6 MB |

| Column types | |
|---|---|
| String | 0 |
| Numeric | 0 |
| Date | 1 |
| Bool | 0 |
| Array | 0 |
| Not available | 0 |

| year | |
|---|---|
| Unique | 265 |
| Unique (%) | 0.58 |
| Missing | 0.0 |
| Missing (%) | 0 |

| Datatypes | |
|---|---|
| String | 0 |
| Integer | 0 |
| Float | 0 |
| Bool | 0 |
| Date | 45428 |
| Missing | 0 |
| Null | 288 |

| Value | Count | Frequency (%) |
|---|---|---|
| 01/01/2003 12:00:00 AM | 3323 | 7.269% |
| 01/01/1979 12:00:00 AM | 3046 | 6.663% |
| 01/01/1998 12:00:00 AM | 2697 | 5.899% |
| 01/01/2006 12:00:00 AM | 2456 | 5.372% |
| 01/01/1988 12:00:00 AM | 2296 | 5.022% |
| 01/01/2002 12:00:00 AM | 2078 | 4.545% |
| 01/01/2004 12:00:00 AM | 1940 | 4.244% |
| 01/01/2000 12:00:00 AM | 1792 | 3.92% |
| 01/01/1997 12:00:00 AM | 1696 | 3.71% |
| 01/01/1999 12:00:00 AM | 1691 | 3.699% |
| "Missing" | 0 | 0.0% |
(The output ends with the same 10-row data preview shown above.)
INFO:optimus:run() executed in 96.23 sec
op.profiler.to_image(output_path="images/profiler1.png")
Loading page (1/2) Warning: Failed to load file:///C:/Users/ARGENI~1/AppData/Local/Temp/optimus/styles/styles.css (ignore) Rendering (2/2) Done
With the relative_error and approx_count params you can control how some operations are calculated, so you can speed up the profiling if needed.

relative_error: relative error for the quantile discretizer calculation. 1 is faster, 0 is slower.

approx_count: use approx_count_distinct or countDistinct. approx_count_distinct is faster.
op.profiler.run(df, "mass (g)", infer=False, relative_error =1, approx_count=True)
INFO:optimus:Processing column 'mass (g)'... INFO:optimus:_count_data_types() executed in 1.33 sec INFO:optimus:count_data_types() executed in 1.33 sec INFO:optimus:cast_columns() executed in 0.0 sec INFO:optimus:agg_exprs() executed in 1.64 sec INFO:optimus:general_stats() executed in 1.65 sec INFO:optimus:------------------------------ INFO:optimus:Processing column 'mass (g)'... INFO:optimus:frequency() executed in 3.56 sec INFO:optimus:stats_by_column() executed in 0.0 sec INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _cast_to INFO:optimus:percentile() executed in 0.23 sec INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _cast_to INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _cast_to INFO:optimus:extra_numeric_stats() executed in 0.58 sec INFO:optimus:Using 'column_exp' to process column 'mass (g)' with function _bucketizer INFO:optimus:bucketizer() executed in 0.3 sec INFO:optimus:hist() executed in 2.04 sec
Including 'nan' as Null in processing 'name' Including 'nan' as Null in processing 'nametype' Including 'nan' as Null in processing 'recclass' Including 'nan' as Null in processing 'fall' Including 'nan' as Null in processing 'year' Including 'nan' as Null in processing 'GeoLocation'
INFO:optimus:dataset_info() executed in 1.75 sec
| Dataset info | |
|---|---|
| Number of columns | 10 |
| Number of rows | 45716 |
| Total Missing (%) | 0.49% |
| Total size in memory | 98.2 MB |

| Column types | |
|---|---|
| String | 0 |
| Numeric | 1 |
| Date | 0 |
| Bool | 0 |
| Array | 0 |
| Not available | 0 |

| mass (g) | |
|---|---|
| Unique | 12497 |
| Unique (%) | 27.336 |
| Missing | 0.0 |
| Missing (%) | 0 |

| Datatypes | |
|---|---|
| String | 0 |
| Integer | 0 |
| Float | 0 |
| Bool | 0 |
| Date | 0 |
| Missing | 0 |
| Null | 131 |

| Summary | |
|---|---|
| Mean | 13278.07855 |
| Minimum | 0.0 |
| Maximum | 60000000.0 |
| Zeros (%) | |

| Value | Count | Frequency (%) |
|---|---|---|
| 1.3 | 171 | 0.374% |
| 1.2 | 140 | 0.306% |
| 1.4 | 138 | 0.302% |
| None | 131 | 0.287% |
| 2.1 | 130 | 0.284% |
| 2.4 | 126 | 0.276% |
| 1.6 | 120 | 0.262% |
| 0.5 | 119 | 0.26% |
| 1.1 | 116 | 0.254% |
| 3.8 | 114 | 0.249% |
| "Missing" | 0 | 0.0% |

| Quantile and descriptive statistics | |
|---|---|
| Minimum | 0.0 |
| 5-th percentile | |
| Q1 | |
| Median | |
| Q3 | |
| 95-th percentile | |
| Maximum | 60000000.0 |
| Range | 60000000.0 |
| Interquartile range | 0.0 |
| Standard deviation | 574988.87641 |
| Coef of variation | 43.30362 |
| Kurtosis | 6796.17061 |
| Mean | 13278.07855 |
| MAD | 0.0 |
| Skewness | |
| Sum | 605281210.638 |
| Variance | 330612207995.7785 |
(The output ends with the same 10-row data preview shown above.)
INFO:optimus:run() executed in 13.74 sec
Besides histograms and frequency plots you also have scatter plots and box plots, all powered by Apache Spark through PySpark.
df = op.load.excel("../examples/data/titanic3.xls")
df = df.rows.drop_na(["age","fare"])
You can output to the notebook or as an image:
# Output as an image
df.plot.hist("fare", output_format="image", output_path="images/hist.png")
INFO:optimus:Using 'column_exp' to process column 'fare' with function _bucketizer INFO:optimus:bucketizer() executed in 0.11 sec INFO:optimus:hist() executed in 5.16 sec INFO:optimus:hist() executed in 9.98 sec
df.plot.frequency("age")
df.plot.frequency("age", output_format="image", output_path="images/frequency.png")
df.plot.scatter(["fare", "age"], buckets=30)
df.plot.scatter(["fare", "age"], buckets=30, output_format="image", output_path="images/scatter.png")
INFO:optimus:Using 'column_exp' to process column 'fare' with function _bucketizer INFO:optimus:bucketizer() executed in 0.29 sec INFO:optimus:Using 'column_exp' to process column 'age' with function _bucketizer INFO:optimus:bucketizer() executed in 0.38 sec INFO:optimus:Using 'column_exp' to process column 'fare' with function _bucketizer INFO:optimus:bucketizer() executed in 0.29 sec INFO:optimus:Using 'column_exp' to process column 'age' with function _bucketizer INFO:optimus:bucketizer() executed in 0.3 sec
df.plot.box("age")
df.plot.box("age", output_format="image", output_path="images/box.png")
INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to INFO:optimus:percentile() executed in 6.89 sec INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to INFO:optimus:percentile() executed in 3.92 sec
df.plot.correlation("*")
df.plot.correlation("*", output_format="image", output_path="images/correlation.png")
INFO:optimus:`name`,`sex`,`ticket`,`cabin`,`embarked`,`boat`,`home_dest` column(s) was not processed because is/are not byte,short,big,int,double,float INFO:optimus:Using 'column_exp' to process column 'pclass' with function _cast_to INFO:optimus:Casting pclass to float... INFO:optimus:Using 'column_exp' to process column 'survived' with function _cast_to INFO:optimus:Casting survived to float... INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to INFO:optimus:Casting age to float... INFO:optimus:Using 'column_exp' to process column 'sibsp' with function _cast_to INFO:optimus:Casting sibsp to float... INFO:optimus:Using 'column_exp' to process column 'parch' with function _cast_to INFO:optimus:Casting parch to float... INFO:optimus:Using 'column_exp' to process column 'fare' with function _cast_to INFO:optimus:Casting fare to float... INFO:optimus:Using 'column_exp' to process column 'body' with function _cast_to INFO:optimus:Casting body to float... object of type <class 'float'> cannot be safely interpreted as an integer. INFO:optimus:`name`,`sex`,`ticket`,`cabin`,`embarked`,`boat`,`home_dest` column(s) was not processed because is/are not byte,short,big,int,double,float INFO:optimus:Using 'column_exp' to process column 'pclass' with function _cast_to INFO:optimus:Casting pclass to float... INFO:optimus:Using 'column_exp' to process column 'survived' with function _cast_to INFO:optimus:Casting survived to float... INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to INFO:optimus:Casting age to float... INFO:optimus:Using 'column_exp' to process column 'sibsp' with function _cast_to INFO:optimus:Casting sibsp to float... INFO:optimus:Using 'column_exp' to process column 'parch' with function _cast_to INFO:optimus:Casting parch to float... INFO:optimus:Using 'column_exp' to process column 'fare' with function _cast_to INFO:optimus:Casting fare to float... INFO:optimus:Using 'column_exp' to process column 'body' with function _cast_to INFO:optimus:Casting body to float...
Optimus has a tiny API, so you can use any plotting library. For example, you can use df.cols.scatter(), df.cols.frequency(), df.cols.boxplot() or df.cols.hist() to output JSON that you can process to adapt the data to any plotting library.
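For instance, here is a minimal sketch that feeds the histogram JSON into matplotlib. The exact cols.hist() signature and the shape of the returned JSON vary by Optimus version, so inspect the output first; the 'lower', 'upper' and 'count' keys below are assumptions:

import matplotlib.pyplot as plt

hist_json = df.cols.hist("fare", buckets=20)
# Assumed bucket shape: [{'lower': 0.0, 'upper': 25.6, 'count': 339}, ...]
buckets = hist_json["hist"] if isinstance(hist_json, dict) else hist_json
mids = [(b["lower"] + b["upper"]) / 2 for b in buckets]
counts = [b["count"] for b in buckets]
plt.bar(mids, counts, width=mids[1] - mids[0] if len(mids) > 1 else 1.0)
plt.xlabel("fare")
plt.ylabel("count")
plt.show()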
df.outliers.tukey("age").select().table_image("images/table5.png")
INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to INFO:optimus:percentile() executed in 4.02 sec
Loading page (1/2) Rendering (2/2) Done
df.outliers.tukey("age").drop().table_image("images/table6.png")
INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to INFO:optimus:percentile() executed in 4.1 sec
Loading page (1/2) Rendering (2/2) Done
df.outliers.tukey("age").info()
INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to INFO:optimus:percentile() executed in 3.94 sec INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to INFO:optimus:percentile() executed in 4.09 sec
{'count_outliers': 1045, 'count_non_outliers': 1036, 'lower_bound': -6.0, 'upper_bound': 66.0, 'iqr1': 21.0, 'iqr3': 39.0}
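These bounds are the standard Tukey fences, Q1 - 1.5*IQR and Q3 + 1.5*IQR, which you can verify from the dict above:

# Recompute the Tukey fences from the quartiles reported by info()
q1, q3 = 21.0, 39.0        # 'iqr1' and 'iqr3' above
iqr = q3 - q1              # 18.0
print(q1 - 1.5 * iqr)      # -6.0  -> 'lower_bound'
print(q3 + 1.5 * iqr)      # 66.0  -> 'upper_bound'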
df.outliers.z_score("age", threshold=2).drop()
df.outliers.modified_z_score("age", threshold = 2).drop()
df.outliers.mad("age", threshold = 2).drop()
INFO:optimus:Using 'column_exp' to process column 'age' with function _z_score INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to INFO:optimus:Using 'column_exp' to process column 'age' with function _cast_to
DataFrame[pclass: bigint, survived: bigint, name: string, sex: string, age: double, sibsp: bigint, parch: bigint, ticket: string, fare: double, cabin: string, embarked: string, boat: string, body: double, home_dest: string]
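For reference, here is a plain-Spark sketch of the z-score rule used above with threshold=2 (modified_z_score and mad are median-based variants). This is an illustration, not Optimus' exact implementation:

from pyspark.sql import functions as F

# A value is an outlier when |age - mean| / stddev exceeds the threshold (2)
stats = df.select(F.mean("age").alias("mu"), F.stddev("age").alias("sigma")).first()
z = (F.col("age") - stats["mu"]) / stats["sigma"]
non_outliers = df.where(F.abs(z) <= 2)  # roughly what z_score(...).drop() keeps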
Optimus has handy tools to connect to databases and extract information. Optimus can handle Redshift, Postgres, Oracle and MySQL.
import sys
sys.path.append("..")
from optimus import Optimus
op = Optimus(verbose=True)
INFO:optimus:Operative System:Windows INFO:optimus:Just check that Spark and all necessary environments vars are present... INFO:optimus:----- INFO:optimus:SPARK_HOME=C:\opt\spark\spark-2.3.1-bin-hadoop2.7 INFO:optimus:HADOOP_HOME=C:\opt\hadoop-2.7.7 INFO:optimus:PYSPARK_PYTHON=C:\Users\argenisleon\Anaconda3\python.exe INFO:optimus:PYSPARK_DRIVER_PYTHON=jupyter INFO:optimus:PYSPARK_SUBMIT_ARGS=--jars "file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar,file:///C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar" --driver-class-path "C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/RedshiftJDBC42-1.2.16.1027.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/mysql-connector-java-8.0.16.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/ojdbc8.jar;C:/Users/argenisleon/Documents/Optimus/optimus/jars/postgresql-42.2.5.jar" --conf "spark.sql.catalogImplementation=hive" pyspark-shell INFO:optimus:JAVA_HOME=C:\java INFO:optimus:Pyarrow Installed INFO:optimus:----- INFO:optimus:Starting or getting SparkSession and SparkContext... INFO:optimus:Spark Version:2.3.1 INFO:optimus: ____ __ _ / __ \____ / /_(_)___ ___ __ _______ / / / / __ \/ __/ / __ `__ \/ / / / ___/ / /_/ / /_/ / /_/ / / / / / / /_/ (__ ) \____/ .___/\__/_/_/ /_/ /_/\__,_/____/ /_/ INFO:optimus:Transform and Roll out... INFO:optimus:Optimus successfully imported. Have fun :). INFO:optimus:Config.ini not found
# This import is only to hide the credentials
from credentials import *
# db_type accepts 'oracle', 'mysql', 'redshift' and 'postgres'
db = op.connect(
db_type=DB_TYPE,
host=HOST,
database= DATABASE,
user= USER,
password = PASSWORD,
port=PORT)
# Show all table names
db.tables(limit="all")
INFO:optimus:jdbc:redshift://yopter.ccyy9l6xansm.us-east-1.redshift.amazonaws.com:5439/yopterdwh?currentSchema=public INFO:optimus:( SELECT relname as table_name,cast (reltuples as integer) AS count FROM pg_class C LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) WHERE nspname IN ('public') AND relkind='r' ORDER BY reltuples DESC) AS t INFO:optimus:jdbc:redshift://yopter.ccyy9l6xansm.us-east-1.redshift.amazonaws.com:5439/yopterdwh?currentSchema=public
--------------------------------------------------------------------------- Py4JJavaError Traceback (most recent call last) <ipython-input-42-f35140062f08> in <module> 13 14 # Show all tables names ---> 15 db.tables(limit="all") ~\Documents\Optimus\optimus\io\jdbc.py in tables(self, schema, database, limit) 123 FROM user_tables ORDER BY table_name""" 124 --> 125 df = self.execute(query, limit) 126 return df.table(limit) 127 ~\Documents\Optimus\optimus\io\jdbc.py in execute(self, query, limit) 230 conf.option("driver", self.driver_option) 231 --> 232 return conf.load() 233 234 def df_to_table(self, df, table, mode="overwrite"): ~\Anaconda3\lib\site-packages\pyspark\sql\readwriter.py in load(self, path, format, schema, **options) 170 return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path))) 171 else: --> 172 return self._df(self._jreader.load()) 173 174 @since(1.4) ~\Anaconda3\lib\site-packages\py4j\java_gateway.py in __call__(self, *args) 1255 answer = self.gateway_client.send_command(command) 1256 return_value = get_return_value( -> 1257 answer, self.gateway_client, self.target_id, self.name) 1258 1259 for temp_arg in temp_args: ~\Anaconda3\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw) 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() ~\Anaconda3\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name) 326 raise Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". --> 328 format(target_id, ".", name), value) 329 else: 330 raise Py4JError( Py4JJavaError: An error occurred while calling o6140.load. : java.sql.SQLException: No suitable driver at java.sql.DriverManager.getDriver(DriverManager.java:315) at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85) at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:84) at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:35) at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:34) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748)
# Show a summary of every table
db.table.show("*", 20)
# Get a table as a dataframe
df_ = db.table_to_df("places_interest").table()
# Create a new table in the database
db.df_to_table(df, "new_table")
You can connect to any external API to enrich your data using Optimus. Optimus uses MongoDB to store the downloaded data and then merges it with the Spark DataFrame. You need to install MongoDB.

Let's load a tiny dataset we can enrich:
df = op.load.json("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.json")
import requests
def func_request(params):
# You can use here whatever header or auth info you need to send.
# For more information see the requests library
url= "https://jsonplaceholder.typicode.com/todos/" + str(params["id"])
return requests.get(url)
def func_response(response):
# Here you can parse the response
return response["title"]
e = op.enrich(host="localhost", port=27017, db_name="jazz")
df_result = e.run(df, func_request, func_response, calls=60, period=60, max_tries=8)
df_result.table("all")
df_result.table_image("images/table7.png")
Optimus implements some functions to cluster strings. We took great inspiration from OpenRefine.

Here is a quote from its site:
"In OpenRefine, clustering refers to the operation of "finding groups of different values that might be alternative representations of the same thing". For example, the two strings "New York" and "new york" are very likely to refer to the same concept and just have capitalization differences. Likewise, "Gödel" and "Godel" probably refer to the same person."
For more information see: https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
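For intuition, here is a minimal plain-Python sketch of the fingerprint key that key collision clustering builds, mirroring the steps the Optimus log below reports (trim, lowercase, remove punctuation, remove accents, then split/sort/deduplicate/join). Optimus' exact implementation may differ:

import re
import unicodedata

def fingerprint(value):
    value = value.strip().lower()                    # trim + lowercase
    value = re.sub(r"[^\w\s]", "", value)            # drop punctuation
    value = (unicodedata.normalize("NFKD", value)    # strip accents
             .encode("ascii", "ignore").decode("ascii"))
    return " ".join(sorted(set(value.split())))      # split/sort/dedupe/join

fingerprint("México D.F.") == fingerprint("Mexico D.F.")  # True -> same cluster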
df = op.read.csv("../examples/data/random.csv",header=True, sep=";")
from optimus.ml import keycollision as keyCol
df_kc = keyCol.fingerprint_cluster(df, 'STATE')
df_kc.table()
df_kc.table_image("images/table8.png")
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join
| STATE_CLUSTER_SIZE (int) | STATE_CLUSTER (array&lt;string&gt;) | STATE_COUNT (bigint) | STATE_RECOMMENDED (string) |
|---|---|---|---|
| 1 | ['Estado de México'] | 810 | Estado de México |
| 2 | ['México D.F.', 'Mexico D.F.'] | 2495 | Mexico D.F. |
| 1 | ['D.F.'] | 66 | D.F. |
| 1 | ['Distriro Federal'] | 259 | Distriro Federal |
| 3 | ['Distrito Federal', 'DISTRITO FEDERAL', 'distrito federal'] | 11930 | Distrito Federal |
Loading page (1/2) Rendering (2/2) Done
keyCol.fingerprint_cluster(df, "STATE").to_json()
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join
[{'STATE_CLUSTER_SIZE': 1, 'STATE_CLUSTER': ['Estado de México'], 'STATE_COUNT': 810, 'STATE_RECOMMENDED': 'Estado de México'}, {'STATE_CLUSTER_SIZE': 2, 'STATE_CLUSTER': ['México D.F.', 'Mexico D.F.'], 'STATE_COUNT': 2495, 'STATE_RECOMMENDED': 'Mexico D.F.'}, {'STATE_CLUSTER_SIZE': 1, 'STATE_CLUSTER': ['D.F.'], 'STATE_COUNT': 66, 'STATE_RECOMMENDED': 'D.F.'}, {'STATE_CLUSTER_SIZE': 1, 'STATE_CLUSTER': ['Distriro Federal'], 'STATE_COUNT': 259, 'STATE_RECOMMENDED': 'Distriro Federal'}, {'STATE_CLUSTER_SIZE': 3, 'STATE_CLUSTER': ['Distrito Federal', 'DISTRITO FEDERAL', 'distrito federal'], 'STATE_COUNT': 11930, 'STATE_RECOMMENDED': 'Distrito Federal'}]
df_kc = keyCol.n_gram_fingerprint_cluster(df, "STATE" , 2)
df_kc.table()
df_kc.table_image("images/table9.png")
INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function _lower INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function _remove_white_spaces INFO:optimus:Using 'pandas_udf' to process column 'STATE_NGRAM' with function multiple_replace INFO:optimus:Using 'pandas_udf' to process column 'STATE_NGRAM' with function _remove_accents INFO:optimus:Using 'column_exp' to process column 'count' with function _cast_to INFO:optimus:Using 'column_exp' to process column 'STATE' with function _cast_to INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function _cast_to INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function func_col_exp INFO:optimus:Using 'pandas_udf' to process column 'STATE_NGRAM_FINGERPRINT' with function remote_white_spaces_remove_sort_join
| STATE_CLUSTER_SIZE (int) | STATE_CLUSTER (array&lt;string&gt;) | STATE_COUNT (double) | STATE_RECOMMENDED (string) |
|---|---|---|---|
| 8 | ['Distrito Federal', 'México D.F.', 'DISTRITO FEDERAL', 'Mexico D.F.', 'Distr... | 15560.0 | Mexico D.F. |
Loading page (1/2) Rendering (2/2) Done
keyCol.n_gram_fingerprint_cluster(df, "STATE" , 2).to_json()
INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function _lower INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function _remove_white_spaces INFO:optimus:Using 'pandas_udf' to process column 'STATE_NGRAM' with function multiple_replace INFO:optimus:Using 'pandas_udf' to process column 'STATE_NGRAM' with function _remove_accents INFO:optimus:Using 'column_exp' to process column 'count' with function _cast_to INFO:optimus:Using 'column_exp' to process column 'STATE' with function _cast_to INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function _cast_to INFO:optimus:Using 'column_exp' to process column 'STATE_NGRAM' with function func_col_exp INFO:optimus:Using 'pandas_udf' to process column 'STATE_NGRAM_FINGERPRINT' with function remote_white_spaces_remove_sort_join
[{'STATE_CLUSTER_SIZE': 8, 'STATE_CLUSTER': ['Distrito Federal', 'México D.F.', 'DISTRITO FEDERAL', 'Mexico D.F.', 'Distriro Federal', 'D.F.', 'Estado de México', 'distrito federal'], 'STATE_COUNT': 15560.0, 'STATE_RECOMMENDED': 'Mexico D.F.'}]
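The n-gram variant builds the key from character n-grams (here n=2) instead of whole tokens, so it can merge values that differ inside a word. Again, a plain-Python sketch of the OpenRefine-style method, not necessarily Optimus' exact code:

import re
import unicodedata

def ngram_fingerprint(value, n=2):
    value = re.sub(r"[\W_]", "", value.strip().lower())  # drop punctuation and spaces
    value = (unicodedata.normalize("NFKD", value)        # strip accents
             .encode("ascii", "ignore").decode("ascii"))
    grams = sorted(set(value[i:i + n] for i in range(len(value) - n + 1)))
    return "".join(grams)

ngram_fingerprint("México D.F.") == ngram_fingerprint("Mexico D.F.")  # True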
from optimus.ml import distancecluster as dc
df_dc = dc.levenshtein_matrix(df,"STATE")
df_dc.table_image("images/table10.png")
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join
Loading page (1/2) Rendering (2/2) Done
df_dc=dc.levenshtein_filter(df,"STATE")
df_dc.table()
df_dc.table_image("images/table11.png")
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join
| STATE_FROM (string) | STATE_LEVENSHTEIN_DISTANCE (int) | STATE_TO (string) |
|---|---|---|
| estadodemexico | 10 | mexicodf |
| df | 6 | mexicodf |
| distrirofederal | 1 | distritofederal |
| distritofederal | 1 | distrirofederal |
| mexicodf | 6 | df |
Loading page (1/2) Rendering (2/2) Done
df_dc = dc.levenshtein_cluster(df,"STATE")
df_dc.table()
df_dc.table_image("images/table12.png")
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join
| STATE_CLUSTER (array&lt;string&gt;) | STATE_CLUSTER_SIZE (int) | STATE_RECOMMENDED (string) | STATE_COUNT (bigint) |
|---|---|---|---|
| ['Estado de México'] | 1 | Estado de México | 810 |
| ['D.F.'] | 1 | D.F. | 66 |
| ['Distriro Federal'] | 1 | Distriro Federal | 259 |
| ['Distrito Federal', 'DISTRITO FEDERAL', 'distrito federal'] | 3 | Distrito Federal | 11930 |
| ['Mexico D.F.', 'México D.F.'] | 2 | Mexico D.F. | 2495 |
Loading page (1/2) Rendering (2/2) Done
dc.to_json(df, "STATE")
INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _trim INFO:optimus:Using 'column_exp' to process column 'STATE_FINGERPRINT' with function _lower INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function multiple_replace INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _remove_accents INFO:optimus:Using 'pandas_udf' to process column 'STATE_FINGERPRINT' with function _split_sort_remove_join
[{'STATE_CLUSTER': ['Estado de México'], 'STATE_CLUSTER_SIZE': 1, 'STATE_RECOMMENDED': 'Estado de México', 'STATE_COUNT': 810}, {'STATE_CLUSTER': ['D.F.'], 'STATE_CLUSTER_SIZE': 1, 'STATE_RECOMMENDED': 'D.F.', 'STATE_COUNT': 66}, {'STATE_CLUSTER': ['Distriro Federal'], 'STATE_CLUSTER_SIZE': 1, 'STATE_RECOMMENDED': 'Distriro Federal', 'STATE_COUNT': 259}, {'STATE_CLUSTER': ['Distrito Federal', 'DISTRITO FEDERAL', 'distrito federal'], 'STATE_CLUSTER_SIZE': 3, 'STATE_RECOMMENDED': 'Distrito Federal', 'STATE_COUNT': 11930}, {'STATE_CLUSTER': ['Mexico D.F.', 'México D.F.'], 'STATE_CLUSTER_SIZE': 2, 'STATE_RECOMMENDED': 'Mexico D.F.', 'STATE_COUNT': 2495}]
Machine learning is one of the last steps, and the goal of most data science workflows.

Apache Spark created a library called MLlib where they coded great algorithms for machine learning. Now with the ML library we can take advantage of the DataFrame API and its optimizations to create machine learning pipelines easily.

Even though this task is not extremely hard, it is not easy. The way most machine learning models work on Spark is not straightforward, and they need lots of feature engineering to work. That's why we created the feature engineering section inside Optimus.

One of the best "tree" models for machine learning is Random Forest. What about creating an RF model with just one line? With Optimus it's really easy.
df_cancer = op.load.csv("https://raw.githubusercontent.com/ironmussa/Optimus/master/tests/data_cancer.csv")
INFO:optimus:Downloading data_cancer.csv from https://raw.githubusercontent.com/ironmussa/Optimus/master/tests/data_cancer.csv INFO:optimus:Downloaded 125205 bytes INFO:optimus:Creating DataFrame for data_cancer.csv. Please wait... INFO:optimus:Successfully created DataFrame for 'data_cancer.csv'
columns = ['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean',
'fractal_dimension_mean']
df_predict, rf_model = op.ml.random_forest(df_cancer, columns, "diagnosis")
This will create a DataFrame with the predictions of the Random Forest model.
So let's see the prediction compared with the actual label:
df_predict.cols.select(["label","prediction"]).table_image("images/table13.png")
Loading page (1/2) Rendering (2/2) Done
The rf_model variable contains the Random Forest model for analysis.
Contributions go far beyond pull requests and commits. We are very happy to receive any kind of contributions, including:

[Become a backer] and get your image on our README on GitHub with a link to your site.

[Become a sponsor] and get your image on our README on GitHub with a link to your site.
Argenis Leon and Luis Aguirre
Apache 2.0 © Iron
This will recreate README.md.

The script below processes the readme_.md that this notebook outputs: it removes the jupytext header, a few helper lines and the post-process section itself, replaces table_image()/to_image() calls with table() plus image links, and writes README.md.

To make the table_image() function work, be sure to install imgkit: pip install imgkit

Also install wkhtmltopdf from https://wkhtmltopdf.org/downloads.html. It is responsible for generating the Optimus tables as images.
from shutil import copyfile
output_file = "../README.md"
copyfile("readme_.md", output_file)
import sys
import fileinput
import re
pattern = r'"([A-Za-z0-9_\./\\-]*)"'
jupytext_header = False
flag_remove = False
remove = ["load_ext", "autoreload","import sys","sys.path.append"]
buffer = None
for i, line in enumerate(fileinput.input(output_file, inplace=1)):
done= False
try:
# Remove some helper lines
for r in remove:
if re.search(r, line):
done= True
#Remove the post process code
if re.search("Post-process", line):
flag_remove = True
if flag_remove is True:
done = True
# Remove jupytext header
if jupytext_header is True:
done = True
if "---\n" == line:
jupytext_header = not jupytext_header
elif done is False:
# Replace .table_image(...) by table()
chars_table=re.search(".table_image", line)
chars_image=re.search(".to_image", line)
chars_plot = True if len(re.findall('(.plot.|output_path=)', line))==2 else False
path = "readme/"
if chars_table:
print(line[0:int(chars_table.start())]+".table()")
m = re.search(r'table_image\("(.*?)"\)', line).group(1)
if m:
buffer = "![]("+ path + m + ")"
elif chars_image:
m = re.search(r'to_image\(output_path="(.*?)"\)', line).group(1)
if m:
buffer = "![]("+ path + m + ")"
elif chars_plot:
m = re.search('output_path="(.*?)"', line).group(1)
if m:
buffer = "![]("+ path + m + ")"
else:
sys.stdout.write(line)
if "```\n"==line and buffer:
print(buffer)
buffer = None
except Exception as e:
print(e)
fileinput.close()
# Remove empty python cells
flag = False
for i, line in enumerate(fileinput.input(output_file, inplace=1)):
if re.search("```python", line):
flag = True
elif re.search("```", line) and flag is True:
flag=False
elif flag is True:
flag = False
print("```python")
print(line,end="")
else:
print(line, end="")
fileinput.close()
line = 'op.profiler.to_image(output_path="images/profiler.png")'
m = re.search(r'to_image\(output_path="(.*?)"\)', line).group(1)
print(m)
images/profiler.png