This environment has everything set up for you: you can run Fugue on native Python, Spark, and Dask, with Fugue SQL support. To set up your own environment, pip install the package:
pip install "fugue[sql,spark,dask]"
To run the tutorials environment on your own machine, the simplest way (if you have Docker installed) is:
docker run -p 8888:8888 fugueproject/tutorials:latest
This tutorial is not only meant to help you ramp up on Fugue; more importantly, it helps you better understand the basic concepts of distributed computing from a higher level. The philosophy of Fugue is to adapt to you: as long as you understand the basic concepts, you can use Fugue to express or glue together your logic. Most of your code will stay in native Python.
The topics follow the architecture from the bottom up, from the most basic building blocks to the most abstract layers.
How to quickly start playing with Fugue.
Fugue data types and schema are strictly based on Apache Arrow. DataFrame is an abstract concept with several built-in implementations that adapt to different underlying dataframes. In this tutorial, we go through the basic APIs and focus on the most common use cases.
This tutorial focuses on explaining the basic ideas of data partitioning; it is less specific to Fugue. A good understanding of partitioning is the key to writing high-performance code.
The heart of Fugue. It is the layer that unifies many of the core concepts of distributed computing and separates the underlying computing frameworks from users' higher-level logic. Normally you don't operate on execution engines directly, but it's good to understand the basics.
The most important tutorial you should read. It focuses on how to use the APIs to construct your own workflow, and it also includes some basic examples of extensions; for more examples, please read the next topic.
The second most important tutorial you should read. It covers all types of extensions you can customize.
The most useful extension, widely used in the real world. PLEASE READ.
Transformation on multiple dataframes partitioned in the same way
Creators of dataframes for a DAG to use
Taking in one or multiple dataframes and producing a single dataframe
Taking in one or multiple dataframes to perform final jobs such as saving and printing
The most fun part of Fugue. You can use SQL instead of Python to represent the backbone of your workflow, keeping a SQL mindset most of the time with the help of Python extensions. The SQL mindset is great for distributed computing: staying within it can make your logic more scale-agnostic. In this tutorial, we cover all the syntax of Fugue SQL.
Every framework has a hello world, and Fugue is no different. But you must understand that distributed computing is not easy; being able to modify the hello world code for some simple tasks doesn't mean you have mastered it. Please don't be misled by the hello world examples of any distributed framework. There is much more to understand.
from fugue import FugueWorkflow

# create a dataframe and print it
with FugueWorkflow() as dag:
    dag.df([[0, "hello"], [1, "world"]], "x:int,b:str").show()
Fugue uses a DAG (Directed Acyclic Graph) to express a workflow. You always construct a DAG before executing it. Currently, Fugue does not support execution during construction (that would be interactive mode). But the experience of using a DAG should be similar to using native Spark in a notebook: both are lazy.
The with statement tells the system to execute the DAG when exiting the block. You don't have to use with all the time. For example, submitting to Spark may be slow, so it's totally fine to construct the DAG first and then start Spark to run it; this can catch many errors much more quickly.
from fugue import FugueWorkflow
from fugue_spark import SparkExecutionEngine
from fugue_dask import DaskExecutionEngine
dag = FugueWorkflow()
dag.df([[0,"hello"],[1,"world"]],"x:int,b:str").show()
# construction is finished; now run the same dag on different execution engines
dag.run() # native python
dag.run(SparkExecutionEngine) # spark
dag.run(DaskExecutionEngine) # dask
You can see that the results are the same, but they are different types of dataframes. Each execution engine uses its own dataframe implementation, and they can convert to each other. Fugue tries to make the concept of DataFrame as abstract as possible; in most cases users only need to care about the data inside a dataframe.
Here we show the simple way to run the same DAG on different execution engines, which is good for initial prototyping. But in real use cases, you should configure your execution engines properly and then pass them into the DAG to run. Again, hello world != the real way to use Fugue.