To start using ibis, you need a Python environment with ibis installed.
If you're running through this tutorial on your own machine (rather than binder) please follow the [installation instructions](https://ibis-project.org/install/ to setup an environment with the SQLite backend.
You'll also need access to the geography.db database hosted here.
Every notebook in the tutorial starts with the following code to download the database if it doesn't already exist:
from tutorial_utils import setup
setup()
You should now have ibis and the tutorial data all setup.
We're ready to get started. First lets import ibis.
import ibis
To make things easier, we will be using ibis's interactive mode in order to see the results of an operation immediately.
This is the recommended mode to use when doing interactive/iterative work with ibis.
When deploying production code you'll typically run in non-interactive/lazy mode. More details on ibis non-interactive mode are covered in a later notebook.
To enable interactive mode, run:
ibis.options.interactive = True
Next thing we need is to create a connection object.
The connection defines where the data is stored and where the computations will be performed.
This is not the same as in pandas when we import the data from an external source (e.g. pandas.read_sql). In this case pandas loads data into memory and performs the computations itself. ibis will not load the data and perform any computation, but instead will leave the data in the backend defined in the connection, and will ask the backend to perform the computations.
In this tutorial we will be using a SQLite connection for its simplicity (no installation is needed). But ibis can work with many different backends, including big data systems, or GPU-accelerated analytical databases. As well as most common relational databases (PostgreSQL, MySQL, ...).
connection = ibis.sqlite.connect("geography.db")
To list the tables in the connection object, we can use the .list_tables() method. If you are using Jupyter, you can see all the methods and attributes of the connection object by writing connection. and pressing the <TAB> key.
connection.list_tables()
These three tables include world countries data, their GDP by year and their independence information.
independence table has been obtained from Wikipedia and will be used in one of the following tutorials.Next, we want to access a specific table in the database. We can create a handler to the countries table with:
countries = connection.table("countries")
To list the columns of the countries table, we can use the columns attribute.
Again, Jupyter users can see all the methods and attributes of the countries object by typing countries. and pressing <TAB>.
countries.columns
We can now access a sample of the data. Let's focus on the name, continent and population columns to start with. We can visualize the values of the columns with:
countries["name", "continent", "population"]
The table is too big for all the results to be displayed, and we probably don't want to see all of them at once anyway. For this reason, just the first 10 rows of the results are displayed.
To check how many rows a table has, we can use the .count() method:
countries.count()
To fetch just a subset of the rows, we can use the .limit(n) method, where n is the number of samples we want. In this case we will fetch the first 3 countries from the table:
countries["name", "continent", "population"].limit(3)
Now that we've got an intuition of the data available in the table countries, we will extract some information from it by applying filters and sorting the data.
Let's focus on a single continent. We can see a list of unique continents in the table using the .distinct() method:
countries[["continent"]].distinct()
We will focus on Asia (AS). We can identify which rows belong to Asian countries using the standard Python == operator:
countries["continent"] == "AS"
The result has a value True for rows where the condition is true, and the value False when it's not.
We can provide this expression to the method .filter(), and save the result in the variable asian_countries for future use.
asian_countries = countries["name", "continent", "population"].filter(
countries["continent"] == "AS"
)
asian_countries
We can check how many countries exist in Asia (based on the information in the database) by using the .count() method we've already seen:
asian_countries.count()
Next, we want to find the most populated countries in Asia. We are going to sort the countries by the column population and fetch the first 10. We can use the .order_by() method to sort by a column:
asian_countries.order_by("population").limit(10)
Because the default for .order_by is ascending order (ascending order like in 1, 2, 3, 4) the operation will return the least populated countries. This behavior is consistent with SQL ORDER BY.
To order in descending order we can use ibis.desc():
asian_countries.order_by(ibis.desc("population")).limit(10)
This is the list of the 10 most populated countries based on the data from GeoNames.
To learn more about Ibis, continue to our next tutorial: Aggregating and joining data.