To start using ibis
, you need a Python environment with ibis
installed.
If you're running through this tutorial on your own machine (rather than binder) please follow the [installation instructions](https://ibis-project.org/install/ to setup an environment with the SQLite
backend.
You'll also need access to the geography.db
database hosted here.
Every notebook in the tutorial starts with the following code to download the database if it doesn't already exist:
from tutorial_utils import setup
setup()
You should now have ibis
and the tutorial data all setup.
We're ready to get started. First lets import ibis
.
import ibis
To make things easier, we will be using ibis
's interactive mode in order to see the results of an operation immediately.
This is the recommended mode to use when doing interactive/iterative work with ibis
.
When deploying production code you'll typically run in non-interactive/lazy mode. More details on ibis
non-interactive mode are covered in a later notebook.
To enable interactive mode, run:
ibis.options.interactive = True
Next thing we need is to create a connection object.
The connection defines where the data is stored and where the computations will be performed.
This is not the same as in pandas
when we import the data from an external source (e.g. pandas.read_sql
). In this case pandas
loads data into memory and performs the computations itself. ibis
will not load the data and perform any computation, but instead will leave the data in the backend defined in the connection, and will ask the backend to perform the computations.
In this tutorial we will be using a SQLite
connection for its simplicity (no installation is needed). But ibis
can work with many different backends, including big data systems, or GPU-accelerated analytical databases. As well as most common relational databases (PostgreSQL
, MySQL
, ...).
connection = ibis.sqlite.connect("geography.db")
To list the tables in the connection
object, we can use the .list_tables()
method. If you are using Jupyter, you can see all the methods and attributes of the connection
object by writing connection.
and pressing the <TAB>
key.
connection.list_tables()
These three tables include world countries data, their GDP by year and their independence information.
independence
table has been obtained from Wikipedia and will be used in one of the following tutorials.Next, we want to access a specific table in the database. We can create a handler to the countries
table with:
countries = connection.table("countries")
To list the columns of the countries
table, we can use the columns
attribute.
Again, Jupyter users can see all the methods and attributes of the countries
object by typing countries.
and pressing <TAB>
.
countries.columns
We can now access a sample of the data. Let's focus on the name
, continent
and population
columns to start with. We can visualize the values of the columns with:
countries["name", "continent", "population"]
The table is too big for all the results to be displayed, and we probably don't want to see all of them at once anyway. For this reason, just the first 10 rows of the results are displayed.
To check how many rows a table has, we can use the .count()
method:
countries.count()
To fetch just a subset of the rows, we can use the .limit(n)
method, where n
is the number of samples we want. In this case we will fetch the first 3
countries from the table:
countries["name", "continent", "population"].limit(3)
Now that we've got an intuition of the data available in the table countries
, we will extract some information from it by applying filters and sorting the data.
Let's focus on a single continent. We can see a list of unique continents in the table using the .distinct()
method:
countries[["continent"]].distinct()
We will focus on Asia (AS
). We can identify which rows belong to Asian countries using the standard Python ==
operator:
countries["continent"] == "AS"
The result has a value True
for rows where the condition is true, and the value False
when it's not.
We can provide this expression to the method .filter()
, and save the result in the variable asian_countries
for future use.
asian_countries = countries["name", "continent", "population"].filter(
countries["continent"] == "AS"
)
asian_countries
We can check how many countries exist in Asia (based on the information in the database) by using the .count()
method we've already seen:
asian_countries.count()
Next, we want to find the most populated countries in Asia. We are going to sort the countries by the column population
and fetch the first 10. We can use the .order_by()
method to sort by a column:
asian_countries.order_by("population").limit(10)
Because the default for .order_by
is ascending order (ascending order like in 1, 2, 3, 4
) the operation will return the least populated countries. This behavior is consistent with SQL ORDER BY
.
To order in descending order we can use ibis.desc()
:
asian_countries.order_by(ibis.desc("population")).limit(10)
This is the list of the 10 most populated countries based on the data from GeoNames.
To learn more about Ibis, continue to our next tutorial: Aggregating and joining data.