This environment has everything set up for you: you can run Fugue on native Python, Spark, and Dask, with Fugue SQL support. To set up your own environment, pip install the package:
pip install "fugue[sql,spark,dask]"
To run the tutorials environment on your own machine, the simplest way (if you have Docker installed) is:
docker run -p 8888:8888 fugueproject/tutorials:latest
This tutorial is not only meant to help you ramp up on Fugue; more importantly, it helps you better understand the basic concepts of distributed computing from a higher level. The philosophy of Fugue is to adapt to you: as long as you understand the basic concepts, you can use Fugue to express or glue together your logic. Most of your code will stay in native Python.
The topics follow the architecture from the bottom up, from the most basic building blocks to the most abstract layers.
How to quickly start playing with Fugue.
Fugue data types and schema are strictly based on Apache Arrow. DataFrame is an abstract concept with several built-in implementations that adapt to different underlying dataframes. In this tutorial, we go through the basic APIs and focus on the most common use cases.
This tutorial focuses on explaining the basic ideas of data partitioning; it is less specific to Fugue. A good understanding of partitioning is the key to writing high-performance code.
The heart of Fugue. It is the layer that unifies many of the core concepts of distributed computing and separates the underlying computing frameworks from users' higher-level logic. Normally you don't operate on execution engines directly, but it's good to understand the basics.
The most important tutorial you should read. It focuses on how to use the APIs to construct your own workflow, and it also includes some basic examples of extensions; for more examples, please read the next topic.
The second most important tutorial you should read. It covers all types of extensions you can customize.
The most useful extension, widely used in the real world. PLEASE READ.
Transformation on multiple dataframes partitioned in the same way
Creators of dataframes for a DAG to use
Taking in one or multiple dataframes and producing a single dataframe
Taking in one or multiple dataframes to perform final jobs such as saving and printing
The most fun part of Fugue. You can use SQL instead of Python to represent the backbone of your workflow, keeping a SQL mindset most of the time with the help of Python extensions. The SQL mindset is great for distributed computing: staying within it can make your logic more scale-agnostic. In this tutorial, we cover all the syntax of Fugue SQL.
Every framework has a hello world, and Fugue is no different. But you must understand that distributed computing is not easy; being able to modify the hello world code for some simple tasks doesn't mean you have mastered it. Please don't be misled by the hello world examples of any distributed framework. There is much more to understand.
from fugue import FugueWorkflow

# create a dataframe and print it
with FugueWorkflow() as dag:
    dag.df([[0, "hello"], [1, "world"]], "x:int,b:str").show()
Fugue uses a DAG (Directed Acyclic Graph) to express a workflow. You always construct a DAG before executing it. Currently, Fugue does not support execution during construction (that would be interactive mode). But the experience of using a DAG should be similar to using native Spark in a notebook: both are lazy.
The with statement tells the system to execute the DAG when exiting the block. You don't have to use with all the time. For example, submitting to Spark may be slow, so it's totally fine to construct the DAG first and then start Spark to run it; this can catch many errors much more quickly.
from fugue import FugueWorkflow
from fugue_spark import SparkExecutionEngine
from fugue_dask import DaskExecutionEngine
dag = FugueWorkflow()
dag.df([[0,"hello"],[1,"world"]],"x:int,b:str").show()
# construction is finished; now run the same dag on different execution engines
dag.run() # native python
dag.run(SparkExecutionEngine) # spark
dag.run(DaskExecutionEngine) # dask
You can see that the results are the same, but they are different types of dataframes. Each execution engine uses its own dataframe implementation, and they can convert to each other. Fugue tries to make the concept of DataFrame as abstract as possible; in most cases users only need to care about the data inside a dataframe.
Here we show the simple way to run the same DAG on different execution engines, which is good for initial prototyping. But in real use cases, you should configure your execution engines properly and then pass them into the DAG to run. Again, hello world != the real way to use Fugue.