It is strongly recommended to quickly go through the COVID19 example to get a sense of what Fugue SQL can do, and how it works. Here we are going through details of different Fugue SQL features.
Fugue SQL is an alternative to Fugue programming interface. Both are used to describe your end-to-end workflow logic. The SQL semantic is platform and scale agnostic, so if you write logic in SQL, it's very high level and abstract, and the underlying computing frameworks will try to excute them in the optimal way.
The syntax of Fugue SQL is between standard SQL, json and python. The goals behind this design are:
SELECT
statementTo use Fugue SQL, you need to make sure you have installed the SQL extra
pip install fugue[sql]
To make writing SQL easier we will use cell magic that was introduced in COVID19 Data Exploration section of the tutorial. We will take the libraries and functions menteioned above and import it using the following imports:
from fugue_notebook import setup
import pandas as pd
setup ()
df = pd.DataFrame([[0,"hello"],[1,"world"]],columns = ['a','b'])
print(df)
a b 0 0 hello 1 1 world
the SQL will translate to a sequence of operations in programming interface.
%%fsql
SELECT
*
FROM df
WHERE a=0 -- we can use df directly defined outside of this cell
PRINT
a | b | |
---|---|---|
0 | 0 | hello |
In Fugue SQL, a very important simplification is anonymity, it is optional, but it can significantly simplify your code.
For a statement that only needs to consume the previous dataframe, you can use anonymity to chain commands. PRINT
is the best example. This is good for chaining commands.
%%fsql
a=CREATE [[0,"hello"],[1,"world"]] SCHEMA a:int,b:str
PRINT -- If the PRINT is not specify, it means it will print
-- the last dataframe output of the previous statements
PRINT -- I can use anonymity again because PRINT doesn't generate output, so it still means PRINT a
a | b | |
---|---|---|
0 | 0 | hello |
1 | 1 | world |
a | b | |
---|---|---|
0 | 0 | hello |
1 | 1 | world |
For statements that don't generate output, you can't assign it to any variable. For statements that generates single output, you can also use anonymity and don't assign to a variable. The following statements will have to use anonymity if they need to consume this output.
%%fsql
a=CREATE [[0,"hello"]] SCHEMA a:int,b:str
CREATE [[1,"world"]] SCHEMA a:int,b:str
PRINT -- print the second
PRINT a -- print the first, because it is explicit
PRINT -- print the second
a | b | |
---|---|---|
0 | 1 | world |
a | b | |
---|---|---|
0 | 0 | hello |
a | b | |
---|---|---|
0 | 1 | world |
In the same manner, SELECT
statement also follows this rule.
%%fsql
CREATE [[0,"hello"], [1,"world"]] SCHEMA a:int,b:str
SELECT * WHERE a=1 -- The FROM is not needed and it will grab last output of the previous statements
-- This is good for chaining commands
PRINT
a | b | |
---|---|---|
0 | 1 | world |
Inline statements is a very powerful tool for data wrangling and general analysis. It is easy to use and has an instinctive feel to it.
Since we can easily do variable assignment in Fugue, it may not be necessary to write your code using anonymity. It's all up to you.
%%fsql
SELECT *
FROM (CREATE [[0,"hello"], [1,"world"]] SCHEMA a:int,b:str)
WHERE a=1
PRINT
PRINT ( SELECT * FROM (CREATE [[0,"hello"], [1,"world"]] SCHEMA a:int,b:str)
WHERE a=1)
a | b | |
---|---|---|
0 | 1 | world |
a | b | |
---|---|---|
0 | 1 | world |
For more information please checkout the Full SQL tutorial.