import pyodbc
import sys
import pandas as pd
import numpy as np
from sqlalchemy import create_engine, MetaData, Table, select
from IPython.display import Image
Return the Python and panda version being used since the database library drivers need to be properly aligned.
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
Python version 3.5.2 |Anaconda 4.1.1 (64-bit)| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)] Pandas version 0.18.1
The pd.read_csv() method accepts a range of file format types. In the example below, a URL is given as the input source. In the SAS example that follows, the FILENAME URL access method is utilized to the same effect.
fips = pd.read_csv('http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt',
header=None,names=['state','state_fips','county_fips','county','fips_class_code'])
fips.shape
/******************************************************/
/* c11_read_csv_url_method.sas */
/******************************************************/
44 filename ctyfips url "http://www2.census.gov/geo/docs/reference/codes/files/national_c
44 ! ounty.txt";
45
46 data cnty_fips;
47 length county_nm $ 40
48 st_name $ 32 ;
49
50 infile ctyfips dlm=',';
51 input st_abrev $
52 st_fips $
53 cnty_fips $
54 county_nm $
55 ;
56
57 st_name = fipnamel(st_fips);
The clever folks who brought you pandas also figured out they can avoid re-inventing so they utilize SQLAlchemy as an abstraction layer to the usual databases out there. This approach reduces the amount of database-dependent code pandas need internally.
Using SQLAlchemy you can pass SQLAlchemy Expression language constructs which are database-agnostic. This is analogous to PROC SQL's behavior of general SQL constructs being translated to specific a database without you having to know the specific SQL dialect.
Two parts needed. You need the SQLAlchemy package. Im my case, it was a part of the Anaconda 4.1 distribution for Windows.
The second is the driver library for the target database. In my case, I am using MS SQL/Server, 2016. The target database used below is AdventureWorksDW2012 which Microsoft provides to illustrate their Analytical Services.
The examples are analogous to SAS' SQL pass-thru approach where PROC SQL is a wrapper to pass-thru the SQL dialect specific to the database.
To connect to the database with SQLAlchemy, you construct a connection string to the database (uri) using the create_engine() method. The SQLAlchemy doc for the various databases are here . This only needs to be executed once for the connection shown as:
ServerName = "homeoffice2"
Database = "AdventureWorksDW2012?driver=SQL+Server+Native+Client+11.0"
TableName = "DimCustomer"
engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database, legacy_schema_aliasing=False)
The string:
?driver=SQL+Server+Native+Client+11.0
was needed to inform SQLAlchemy the name of the specific database client library. This is found on the Driver's tab of the ODBC Data Source Adminstrator.
Image(filename='Anaconda3\\output\\odbc.JPG')
Next, create a tuple of parameter values to pass to the create_engine= assignment to identify the Python package used on the client-side as well as the database server name, target database, and target table.
ServerName = "homeoffice2"
Database = "AdventureWorksDW2012?driver=SQL+Server+Native+Client+11.0"
TableName = "DimCustomer"
engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database, legacy_schema_aliasing=False)
conn = engine.connect()
metadata = MetaData(conn)
The read_sql_table() method reads a database table and optionally a subset of columns. The syntax is documented here .
For clarity, we build a list of columns we want to have returned along with a table name that are used as arguments in the next set of examples.
col_names = ['FirstName', 'LastName', 'BirthDate', 'Gender', 'YearlyIncome', 'CustomerKey']
tbl = 'DimCustomer'
t0 = pd.read_sql_table(tbl, engine, columns=col_names)
t0.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 18484 entries, 0 to 18483 Data columns (total 6 columns): FirstName 18484 non-null object LastName 18484 non-null object BirthDate 18484 non-null datetime64[ns] Gender 18484 non-null object YearlyIncome 18484 non-null float64 CustomerKey 18484 non-null int64 dtypes: datetime64[ns](1), float64(1), int64(1), object(3) memory usage: 866.5+ KB
Designate an incoming column to be used as the DataFrame index with the index_col= argument.
t1 = pd.read_sql_table(tbl, engine, columns=col_names, index_col='CustomerKey')
t1.index
Int64Index([11000, 11001, 11002, 11003, 11004, 11005, 11006, 11007, 11008, 11009, ... 29474, 29475, 29476, 29477, 29478, 29479, 29480, 29481, 29482, 29483], dtype='int64', name='CustomerKey', length=18484)
Coerce date strings into date objects. The example below is redundant since the query engine was smart enough to determine t0['BirthDate'] is a date column. This is shown by examining output from t0.info() call showing the column as type datetime64.
t2 = pd.read_sql_table(tbl, engine, columns=col_names, index_col='CustomerKey', \
parse_dates={'BirthDate': {'format': '%Y-%m-%d'}})
Review the incoming columns and their attributes.
t2.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 18484 entries, 11000 to 29483 Data columns (total 5 columns): FirstName 18484 non-null object LastName 18484 non-null object BirthDate 18484 non-null datetime64[ns] Gender 18484 non-null object YearlyIncome 18484 non-null float64 dtypes: datetime64[ns](1), float64(1), object(3) memory usage: 866.4+ KB
Review the customer index key values.
t2.index
Int64Index([11000, 11001, 11002, 11003, 11004, 11005, 11006, 11007, 11008, 11009, ... 29474, 29475, 29476, 29477, 29478, 29479, 29480, 29481, 29482, 29483], dtype='int64', name='CustomerKey', length=18484)
Display the first 5 rows for the the 'BirthDate' column. Also notice above, from the .info() method how the t2['BirthDate'] column was read as datetime timestamps.
t2["BirthDate"].head(5)
CustomerKey 11000 1966-04-08 11001 1965-05-14 11002 1965-08-12 11003 1968-02-15 11004 1968-08-08 Name: BirthDate, dtype: datetime64[ns]
The SAS program below reads the same input table from SQL/Server called DimCustomer. Under the covers, SAS constructs the appropriate TSQL statements to retrieve rows and columns forming the SAS data set WORK.CUSTOMERS.
In this case, SAS reads the 'BirthDate' variable as a string. The SAS/Access product line provide an array of SAS Data Set options to map data types between SAS and SQL/Server described here. SAS/Access products are outside the scope of these examples.
/******************************************************/
/* c11_pd.read_sql_table.sas */
/******************************************************/
4 proc pwencode in=XXXXXXXXX;
{sas001}UG9yc2hjZSMx
5
6 libname sqlsrvr odbc
7 uid=randy
8 pwd=XXXXXXXXXXXXXXXXXXXX
9 datasrc=mssql
10 bulkload=yes;
11
12 proc sql;
13 create table customers as
14 select FirstName,
15 LastName,
16 BirthDate,
17 Gender,
18 YearlyIncome,
19 CustomerKey
20 from sqlsrvr.DimCustomer;
NOTE: Data set "WORK.customers" has 18484 observation(s) and 6 variable(s)
21 quit;
22
23 data _null_;
24 set customers(obs=5);
25
26 bdate = input(birthdate, yymmdd10.);
27 bd_f = put(bdate, yydddd10.);
28
29 put bd_f;
1966-04-08
1965-05-14
1965-08-12
1968-02-15
1968-08-08
The read_sql_query() method returns a DataFrame based on the evaluation of a query string. It is similiar to the read_sql_table() method described above.
You must use the appropriate SQL dialect for the target database. This is analogous to SAS' SQL pass-thru.
q1 = pd.read_sql_query('SELECT FirstName, LastName, Gender, BirthDate, YearlyIncome '
'FROM dbo.DimCustomer '
'WHERE YearlyIncome > 50000;'
, engine)
q1.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9858 entries, 0 to 9857 Data columns (total 5 columns): FirstName 9858 non-null object LastName 9858 non-null object Gender 9858 non-null object BirthDate 9858 non-null object YearlyIncome 9858 non-null float64 dtypes: float64(1), object(4) memory usage: 385.2+ KB
The SAS analog example for pd.read_sql_query in cell #12 above is shown below. Similiar to the create_engine() method, the connect to ODBC statement uses a series of arguments to used to make authenicated calls into the database. All of the SQL statements inside the parethensized expression is not scanned by the SAS tokenizer/parser and is passed to the database for execution. Returning rows and columns are returned by the ODBC api.
15 LIBNAME in_sql ODBC DSN="mssql";
16
17 proc sql;
18 connect to odbc(datasrc=mssql uid=randy password=XXXXXXXX);
NOTE: Successfully connected to database ODBC as alias ODBC.
19 create table q1 as
20 select * from connection to odbc
21 (select FirstName,
22 LastName,
23 Gender,
24 BirthDate,
25 YearlyIncome
26 from dbo.DimCustomer
27 where YearlyIncome > 50000);
NOTE: Data set "WORK.q1" has 9858 observation(s) and 5 variable(s)
...or pass a DocStrings to the pd.read_sql_query() method.
sql_str = '''SELECT
FirstName,
LastName,
Gender,
CustomerKey,
BirthDate
FROM DimCustomer
WHERE YearlyIncome > 120000'''
q2 = pd.read_sql_query(sql_str, engine, index_col='CustomerKey')
q2.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 821 entries, 11079 to 29261 Data columns (total 4 columns): FirstName 821 non-null object LastName 821 non-null object Gender 821 non-null object BirthDate 821 non-null object dtypes: object(4) memory usage: 32.1+ KB
You can execute a query without returning a dataframe with execute(). This is useful for queries that don’t return values, such as a DROP TABLE statement. You must use the SQL dialect appropriate for the target database.
from pandas.io import sql
sql.execute('DROP TABLE dbo.df2', engine)
<sqlalchemy.engine.result.ResultProxy at 0x1f5e04d1710>
Construct the DataFrame to load into SQL/Server.
df = pd.DataFrame([['cold','slow', np.nan, 2., 6., 3.],
['warm', 'medium', 4, 5, 7, 9],
['hot', 'fast', 9, 4, np.nan, 6],
['cool', None, np.nan, np.nan, 17, 89],
['cool', 'medium', 16, 44, 21, 13],
['cold', 'slow', np.nan, 29, 33, 17]],
columns=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
index=(list('abcdef')))
The .DataFrame.to_sql method loads a target DBMS table with the rows and columns from a DataFrame. The syntax is documented here . The method has the chunksize= parameter where the default None writes all rows at once. The doc does not specify how the writes take place, either through SQL INSERT statements or uses a bulk copy interface.
df.to_sql('dbo.DF2', engine, if_exists='replace',index=False)
The analog SAS program to copy a SAS data set to the target DBMS is below. Note the PROC PWENCODE used to encode strings to prevent passwords from being stored in clear text. Also notice the LIBNAME option bulkload=yes. This causes the load operation to go through the RDBMS' bulk load interface.
/******************************************************/
/* c11_DataFrame.to_sql.sas */
/******************************************************/
56 proc pwencode in=XXXXXXXXX;
{sas001}UG9yc2hjZSMx
57
58 libname sqlsrvr odbc
59 uid=randy
60 pwd=XXXXXXXXXXXXXXXXXXXX
61 datasrc=mssql
62 bulkload=yes;
63
64 data sqlsrvr.tickets;
65 set tickets;
NOTE: 72 observations were read from "WORK.tickets"
NOTE: Data set "SQLSRVR.tickets" has an unknown number of observation(s) and 5 variable(s)
The pd.read_sas() method maps either a SAS data set (.sas7bdat file) or a SAS transport data set (.xpt file) directly into a DataFrame. The syntax details are located here. This is a SAS data set created by Base SAS software and not WPS.
df4 = pd.read_sas("c:\\Data\\accident.sas7bdat", format='SAS7BDAT')
df4.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 32166 entries, 0 to 32165 Data columns (total 52 columns): STATE 32166 non-null float64 ST_CASE 32166 non-null float64 VE_TOTAL 32166 non-null float64 VE_FORMS 32166 non-null float64 PVH_INVL 32166 non-null float64 PEDS 32166 non-null float64 PERNOTMVIT 32166 non-null float64 PERMVIT 32166 non-null float64 PERSONS 32166 non-null float64 COUNTY 32166 non-null float64 CITY 32166 non-null float64 DAY 32166 non-null float64 MONTH 32166 non-null float64 YEAR 32166 non-null float64 DAY_WEEK 32166 non-null float64 HOUR 32166 non-null float64 MINUTE 32166 non-null float64 NHS 32166 non-null float64 RUR_URB 32166 non-null float64 FUNC_SYS 32166 non-null float64 RD_OWNER 32166 non-null float64 ROUTE 32166 non-null float64 TWAY_ID 32166 non-null object TWAY_ID2 8400 non-null object MILEPT 32166 non-null float64 LATITUDE 32166 non-null float64 LONGITUD 32166 non-null float64 SP_JUR 32166 non-null float64 HARM_EV 32166 non-null float64 MAN_COLL 32166 non-null float64 RELJCT1 32166 non-null float64 RELJCT2 32166 non-null float64 TYP_INT 32166 non-null float64 WRK_ZONE 32166 non-null float64 REL_ROAD 32166 non-null float64 LGT_COND 32166 non-null float64 WEATHER1 32166 non-null float64 WEATHER2 32166 non-null float64 WEATHER 32166 non-null float64 SCH_BUS 32166 non-null float64 RAIL 32166 non-null object NOT_HOUR 32166 non-null float64 NOT_MIN 32166 non-null float64 ARR_HOUR 32166 non-null float64 ARR_MIN 32166 non-null float64 HOSP_HR 32166 non-null float64 HOSP_MN 32166 non-null float64 CF1 32166 non-null float64 CF2 32166 non-null float64 CF3 32166 non-null float64 FATALS 32166 non-null float64 DRUNK_DR 32166 non-null float64 dtypes: float64(49), object(3) memory usage: 12.8+ MB
The following SAS program was used to create both a SAS data and a SAS transport SAS data set. Both were created with Version: 3.2.2 of the WPS Workbench for Windows. The SAS data DATA.COMPANY failed to be read with the pd.read_sas() method. The SAS transport file was read correctly below.
/******************************************************/
/* c11_write_export_file.sas */
/******************************************************/
93 libname data 'c:\data';
94 libname xptout xport "c:\data\company.xpt";
95
96 data data.company;
97 input Name $ 1-25 Age 27-28 Gender $ 30;
98 datalines;
99
100 Vincent, Martina 34 F
101 Phillipon, Marie-Odile 28 F
102 Gunter, Thomas 27 M
103 Harbinger, Nicholas 36 M
104 Benito, Gisela 32 F
105 Rudelich, Herbert 39 M
106 Sirignano, Emily 12 F
107 Morrison, Michael 32 M
108 ;;;;
109
110 proc copy in = data
111 out = xptout;
112 select company;
NOTE: Member "DATA.COMPANY.DATA" (memtype=DATA) copied
NOTE: 1 member copied
Using Version: 3.2.2 of the WPS Workbench for Windows to create its .wpd format as input to the pd.read_sas() method fails. That is because the binary file format chosen by WPS does not match those from SAS.
df5 = pd.read_sas("c:\\Data\\company.wpd", format='SAS7BDAT')
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-7-860acca0c6b3> in <module>() ----> 1 df5 = pd.read_sas("c:\\Data\\company.wpd", format='SAS7BDAT') C:\Users\randy\Anaconda3\lib\site-packages\pandas\io\sas\sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator) 52 reader = SAS7BDATReader(filepath_or_buffer, index=index, 53 encoding=encoding, ---> 54 chunksize=chunksize) 55 else: 56 raise ValueError('unknown SAS format') C:\Users\randy\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py in __init__(self, path_or_buf, index, convert_dates, blank_missing, chunksize, encoding, convert_text, convert_header_text) 94 self._path_or_buf = open(self._path_or_buf, 'rb') 95 ---> 96 self._get_properties() 97 self._parse_metadata() 98 C:\Users\randy\Anaconda3\lib\site-packages\pandas\io\sas\sas7bdat.py in _get_properties(self) 103 self._cached_page = self._path_or_buf.read(288) 104 if self._cached_page[0:len(const.magic)] != const.magic: --> 105 raise ValueError("magic number mismatch (not a SAS file?)") 106 107 # Get alignment information ValueError: magic number mismatch (not a SAS file?)
Using Version: 3.2.2 of the WPS Workbench for Windows to create a SAS transport file (.xpt) as input to the pd.read_sas() method works.
df6 = pd.read_sas("c:\\Data\\company.xpt", format='xport')
df6.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9 entries, 0 to 8 Data columns (total 3 columns): NAME 9 non-null object AGE 8 non-null float64 GENDER 9 non-null object dtypes: float64(1), object(2) memory usage: 296.0+ bytes
SQL Queries , from the pandas.org doc.
How to pull data from a microsoft sql database
defining the create_engine() method from the SQLAlchemy doc.