Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, and Jonathan Morgan.
Regardless of how you connect, most interactions with relational database management systems (RDBMS) are carried out via Structured Query Language (SQL). While most programming languages are more similar than different, SQL is genuinely different, both conceptually and syntactically.
To make learning SQL easier, this notebook lists a number of database clients you can use to connect to a PostgreSQL database and run SQL queries, so you can try them out and find one you prefer (we recommend pgAdmin if you are new to databases).
We will proceed in the following sequence:
This notebook documents different database clients you can use to run SQL queries against the PostgreSQL database used for this class. PostgreSQL is an open source relational database management system (RDBMS) developed by a worldwide team of volunteers.
Learning objectives:
We cover the following database clients in this notebook:
Graphical User Interface (GUI) application 'pgAdmin'
Using SQL in Python with:
psycopg2
SQLAlchemy
pandas
You can use any of these clients to run SQL against the database. Some are easier to use or better suited to certain situations than others; each client's section notes its strengths and weaknesses.
If you are here to learn SQL, once you've looked over your options, pick one and proceed to the notebook "Intro to SQL" to learn more about the SQL language.
All of the database clients listed below must first be told how to connect to the database you want to query. A common set of connection properties is used to specify how to connect to a broad range of database servers:
Not all setups will need all of these parameters to be specified to successfully connect to the database. For our class database in the ADRF, for example, we only need to specify:
The database server listens on the default PostgreSQL port (5432), so no port is needed, and it authenticates users based on whether they have a Linux account on the database server itself, rather than requiring a username and password (though access to schemas and tables inside is controlled by a more stringent set of per-user access privileges stored within the database).
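As a minimal sketch, these connection properties are conventionally bundled into a libpq-style DSN ("data source name") string, which clients such as psycopg2 accept directly. The commented-out entries are properties our class database does not require:

```python
# Build a libpq-style DSN string from connection properties.
# Only host and dbname are needed for the class database; port and user
# are shown commented out because the server uses the default port and
# authenticates by Linux account.
params = {
    "host": "10.10.2.10",
    "dbname": "appliedda",
    # "port": 5432,
    # "user": "your_username",
}
dsn = " ".join( key + "=" + str( value ) for key, value in params.items() )
print( dsn )
```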
The first database clients we will cover are Graphical User Interface (GUI) clients. These clients are designed to be used with mouse and keyboard, and to simplify submitting queries to a database and interacting with the results.
We will briefly cover connecting to a database and running a query in the GUI database client pgAdmin, a PostgreSQL-specific database client.
pgAdmin is a PostgreSQL client written and supported by the PostgreSQL community. It isn't the most beautiful program, but it is full-featured and available on many platforms. It only connects to PostgreSQL databases, not to any other kind.
1. Running pgAdmin
Double-click the "pgAdmin III" icon on the Desktop in the ADRF workspace.
2. Creating a connection to the class database
In pgAdmin:
Go to the File menu, then click the "Add Connection to Server" option at the top left.
In the "New Server Registration" window that opens, set:
3. Connecting to the class database 'appliedda'
4. Running a Query
Once you are connected to the "appliedda" database, you can start running queries using this GUI. Click the button that looks like a magnifying glass with "SQL" inside it, at the top center of the window, then enter your SQL query in the "SQL Editor" at the top left.
Let us count the number of rows in the dataset il_des_kcmo.il_qcew_employers:
SELECT COUNT(*)
FROM il_des_kcmo.il_qcew_employers;
Now press the green triangle "play" button to run the query. The results appear in the "Data Output" tab at the bottom left.
Other queries you can run:
Counting number of unique employers in the data:
SELECT COUNT(distinct ein)
FROM il_des_kcmo.il_qcew_employers;
Counting the number of records for each NAICS industry code:
SELECT naics, COUNT(*) AS cnt
FROM il_des_kcmo.il_qcew_employers
GROUP BY naics;
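A small variation on the query above: adding an ORDER BY clause sorts the grouped counts so the most common industry codes appear first.

SELECT naics, COUNT(*) AS cnt
FROM il_des_kcmo.il_qcew_employers
GROUP BY naics
ORDER BY cnt DESC;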
Apart from GUI clients, we can also access PostgreSQL from programming languages like Python. We do this using 'packages': libraries of code that extend core Python.
The commands work similarly: almost any SQL you can execute in an interactive client you can also execute from a programming language, and the results come back in a format you can keep working with after the SQL statements finish.
(Python lets you interact with databases using SQL just as you would in any SQL GUI or terminal: SELECTs, CREATEs, INSERTs, UPDATEs, DELETEs, and any other SQL.)
Below are three ways one can interact with PostgreSQL using Python:
- psycopg2 implements Python's DBAPI, a mostly-standardized API for database interaction, to allow for querying PostgreSQL. It is the closest you can get in Python to a direct database connection.
- SQLAlchemy can be used to map Python objects to database tables, but it also contains a wrapper around DBAPI that allows query code to be more consistently re-used across databases.
- pandas is an analysis package that can use a database connection (with either SQLAlchemy or psycopg2) to read the results of SQL queries directly into a pandas DataFrame, allowing you to further analyze the data in Python.

psycopg2

The psycopg2 package is the most popular PostgreSQL adapter for the Python programming language. This package implements the standard DBAPI Python interface for interacting with a relational database. It is the closest you can get to connecting directly to the database in Python - there are no objects creating in-memory tables and no layers of abstraction between you and the data. Your Python code sends SQL directly to the database and then deals row-by-row with the results.
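Because DBAPI is a standard, the connect/cursor/execute/fetch pattern you'll see below looks the same with any DBAPI driver. As a runnable illustration, here is the same pattern using Python's built-in sqlite3 module and a throwaway in-memory database (the table and values are made up for the example; with psycopg2, only the connect() call differs):

```python
import sqlite3

# sqlite3 implements the same DBAPI interface as psycopg2, so the
# cursor/fetch pattern below carries over to PostgreSQL unchanged.
connection = sqlite3.connect( ":memory:" )
cursor = connection.cursor()

# create and populate a tiny example table
cursor.execute( "CREATE TABLE employers ( ein TEXT, naics TEXT )" )
cursor.execute( "INSERT INTO employers VALUES ( '12-3456789', '5415' )" )

# run a query, then fetch the single result row
cursor.execute( "SELECT COUNT( * ) AS row_count FROM employers" )
first_row = cursor.fetchone()
print( "row_count = " + str( first_row[ 0 ] ) )

cursor.close()
connection.close()
```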
Pros:
Cons:
Mixed:
# importing datetime and psycopg2 package
import datetime
import psycopg2
import psycopg2.extras
print( "psycopg2 imports completed at " + str( datetime.datetime.now() ) )
# set up connection properties
db_host = "10.10.2.10"
db_database = "appliedda"
# and connect.
pgsql_connection = psycopg2.connect( host = db_host, database = db_database )
print( "psycopg2 connection to host: " + db_host + ", database: " + db_database
+ " completed at " + str( datetime.datetime.now() ) )
# default cursor: each row comes back as a tuple of column values:
pgsql_cursor = pgsql_connection.cursor()
# results come back as a dictionary where values are mapped to column names (preferred)
pgsql_cursor = pgsql_connection.cursor( cursor_factory = psycopg2.extras.DictCursor )
print( "psycopg2 cursor created at " + str( datetime.datetime.now() ) )
# SQL
sql_string = "SELECT COUNT( * ) AS row_count FROM il_des_kcmo.il_qcew_employers;"
# execute it.
pgsql_cursor.execute( sql_string )
# fetch first (and only) row, then output the count
first_row = pgsql_cursor.fetchone()
print( "row_count = " + str( first_row[ "row_count" ] ) )
# SQL
sql_string = "SELECT * FROM il_des_kcmo.il_qcew_employers LIMIT 1000;"
# execute it.
pgsql_cursor.execute( sql_string )
# ==> fetch rows to loop over:
# all rows.
#result_list = pgsql_cursor.fetchall()
# first 10 rows.
result_list = pgsql_cursor.fetchmany( size = 10 )
# loop
result_counter = 0
for result_row in result_list:
    result_counter += 1
    print( "- row " + str( result_counter ) + ": " + str( result_row ) )
#-- END loop over 10 rows --#
# ==> loop over the rest one at a time.
result_counter = 0
result_row = pgsql_cursor.fetchone()
while result_row is not None:
    # increment counter
    result_counter += 1
    # get next row
    result_row = pgsql_cursor.fetchone()
#-- END loop over rows, one at a time. --#
print( "fetchone() row_count = " + str( result_counter ) )
# Close Connection and cursor
pgsql_cursor.close()
pgsql_connection.close()
print( "psycopg2 cursor and connection closed at " + str( datetime.datetime.now() ) )
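One idiomatic refinement: DBAPI connections can be used as context managers, which saves you from forgetting the cleanup calls above. A minimal sketch, again using the built-in sqlite3 module so it runs without the class database (with psycopg2, the "with connection:" block likewise commits or rolls back the transaction, though it does not close the connection):

```python
import sqlite3

connection = sqlite3.connect( ":memory:" )

# "with connection:" wraps the statements in a transaction that is
# committed on success and rolled back if an exception is raised.
with connection:
    connection.execute( "CREATE TABLE t ( n INTEGER )" )
    connection.execute( "INSERT INTO t VALUES ( 1 )" )

row_count = connection.execute( "SELECT COUNT( * ) FROM t" ).fetchone()[ 0 ]
connection.close()
print( "row_count = " + str( row_count ) )
```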
SQLAlchemy

SQLAlchemy is a higher-level Python database library that, among many other things, contains a wrapper around DBAPI that makes a subset of the DBAPI API work the same for any database SQLAlchemy supports (though it doesn't work exactly like DBAPI... nothing's perfect). You can use this wrapper to write Python code that can be re-used with different databases (though you'll have to make sure the SQL is also portable). SQLAlchemy also includes advanced features like connection pooling in its implementation of DBAPI that help it perform better than a direct database connection.

Just be aware that the farther you move from a direct connection, the more potential there is for things to go wrong. Under the hood, SQLAlchemy uses psycopg2 for its PostgreSQL database access, so now you have two relatively complex packages working in tandem. If you hit a particularly vexing bug running SQL with SQLAlchemy, in particular complex SQL or statements that update or alter the database, try that SQL with a pure DBAPI client or in the command-line client to see whether the problem lies with SQLAlchemy rather than with your SQL or database.

SQLAlchemy's database connection is called an engine. To connect a SQLAlchemy engine to a database, you will build a SQLAlchemy connection string for your database. A full connection URL for SQLAlchemy looks like this:

dialect+driver://username:password@host:port/database

If you recall our connection properties, we only need to specify the host name and database. In SQLAlchemy, any elements of the URL that are not needed can be omitted. So for our database, the connection URL is:
postgresql://10.10.2.10/appliedda
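To see how the unused URL elements drop out, here is a tiny helper that assembles such a URL from its parts. (make_url here is our own illustration for this notebook, not a function provided by SQLAlchemy.)

```python
# Hypothetical helper (not part of SQLAlchemy) that assembles a connection
# URL, omitting any elements that are not provided.
def make_url( dialect, host, database, username = None, password = None, port = None ):
    auth = ""
    if username is not None:
        auth = username
        if password is not None:
            auth += ":" + password
        auth += "@"
    location = host
    if port is not None:
        location += ":" + str( port )
    return dialect + "://" + auth + location + "/" + database

# only host and database are needed for the class database:
print( make_url( "postgresql", "10.10.2.10", "appliedda" ) )
```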
# imports
import sqlalchemy
import datetime
# Connect
connection_string = 'postgresql://10.10.2.10/appliedda'
pgsql_engine = sqlalchemy.create_engine( connection_string )
print( "SQLAlchemy engine connected to " + connection_string + " at " + str( datetime.datetime.now() ) )
# Single row query - with the streaming option so it does not return results until we "fetch" them:
sql_string = "SELECT COUNT( * ) AS row_count FROM il_des_kcmo.il_qcew_employers;"
query_result = pgsql_engine.execution_options( stream_results = True ).execute( sql_string )
# output results - you can also check which columns "query_result" contains
# by calling its keys() method, which returns the column names. Like so:
print( query_result.keys() )
# print an empty string to separate out our two more useful print statements
print('')
# fetch first (and only) row, then output the count
first_row = query_result.fetchone()
print("row_count = " + str( first_row[ "row_count" ] ) )
# run query with the streaming option so it does not return results until we "fetch" them:
# SQL
sql_string = "SELECT * FROM il_des_kcmo.il_qcew_employers LIMIT 1000;"
# execute it.
query_result = pgsql_engine.execution_options( stream_results = True ).execute( sql_string )
# ==> fetch rows to loop over:
# all rows.
#result_list = query_result.fetchall()
# first 10 rows.
result_list = query_result.fetchmany( size = 10 )
# loop
result_counter = 0
for result_row in result_list:
    result_counter += 1
    print( "- row " + str( result_counter ) + ": " + str( result_row ) )
#-- END loop over 10 rows --#
# ==> loop over the rest one at a time.
result_counter = 0
result_row = query_result.fetchone()
while result_row is not None:
    # increment counter
    result_counter += 1
    # get next row
    result_row = query_result.fetchone()
#-- END loop over rows, one at a time. --#
print( "fetchone() row_count = " + str( result_counter ) )
# Clean up:
pgsql_engine.dispose()
print( "SQLAlchemy engine dispose() called at " + str( datetime.datetime.now() ) )
pandas

Next we'll use the pandas package to populate pandas DataFrames from the results of SQL queries. pandas uses a SQLAlchemy database engine to connect to databases and run queries. It reads the data returned by a given SQL query and stores it in a tabular data structure called a "DataFrame" (a term that will be familiar to those with R or Stata experience).
DataFrames allow for easy statistical analysis, and can be directly used for machine learning. They also load your entire result set into memory by default, and so are not suitable for really large data sets.
And, as discussed in the SQLAlchemy section, this is yet another layer on top of other relatively complex database packages, multiplying the potential for a peculiarity in one layer to cause obscure, difficult-to-troubleshoot problems in another. This won't happen often, but if you run into weird or inexplicable problems when turning SQL into DataFrames, try running the SQL through the lower layers to isolate the problem.
In the code cell below, we'll use SQLAlchemy to connect to the database, then give this engine to pandas and let it retrieve and process the data.
Note: in addition to processing SQL queries, pandas has a range of Input/Output tools that let it read from and write to a large variety of tabular data formats, including CSV and Excel files, databases via SQL, JSON files, and even SAS and Stata data files. In the example below, we'll use the pandas.read_sql() function to read the results of an SQL query into a DataFrame.
# imports
import datetime
import pandas
import sqlalchemy
# Connect - create SQLAlchemy engine for pandas to use.
connection_string = 'postgresql://10.10.2.10/appliedda'
pgsql_engine = sqlalchemy.create_engine( connection_string )
print( "SQLAlchemy engine connected to " + connection_string + " at " + str( datetime.datetime.now() ) )
# Single row query
sql_string = "SELECT COUNT( * ) AS row_count FROM il_des_kcmo.il_qcew_employers;"
df = pandas.read_sql( sql_string, con = pgsql_engine )
# get row_count - first get first row
first_row = df.iloc[ 0 ]
# then grab value.
row_count = first_row[ "row_count" ]
print("row_count = " + str( row_count ) )
# and call head().
df.head()
# SQL
sql_string = "SELECT * FROM il_des_kcmo.il_qcew_employers LIMIT 2000;"
# execute it.
df = pandas.read_sql( sql_string, con = pgsql_engine )
# unlike previous Python examples, rows are already fetched and in a dataframe:
# you can loop over them...
row_count = 0
for result_row in df.iterrows():
    row_count += 1
#-- END loop over rows. --#
print( "loop row_count = " + str( row_count ) )
# Print out the first X using head()
output_count = 10
df.head( output_count )
# etc.
# Close Connection - no explicit close needed: pandas returns the connection to the SQLAlchemy engine when it finishes.
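As noted earlier, read_sql() loads the whole result set into memory. For large results, pandas also accepts a chunksize argument, which turns the call into an iterator of smaller DataFrames, so only one chunk is in memory at a time. A runnable sketch (using an in-memory SQLite database, which pandas.read_sql() also accepts, so it works outside the ADRF):

```python
import sqlite3
import pandas

# stand-in for the class database: 25 made-up rows in an in-memory table
connection = sqlite3.connect( ":memory:" )
connection.execute( "CREATE TABLE employers ( ein TEXT )" )
connection.executemany( "INSERT INTO employers VALUES ( ? )",
                        [ ( str( i ), ) for i in range( 25 ) ] )

# chunksize = 10 yields DataFrames of at most 10 rows each
total_rows = 0
for chunk in pandas.read_sql( "SELECT * FROM employers", con = connection, chunksize = 10 ):
    total_rows += len( chunk )

connection.close()
print( "total_rows = " + str( total_rows ) )
```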