In this example we show how to explore data in Hive and build reports.
The example uses a dataset of real estate sales in the Sacramento area, which you can download from here.
Create a dataset with a name of your choosing, for example RawData
and upload the CSV file. Make sure the dataset is empty, you will need to delete the auto-generated README.md
First we need to setup the IPython magic and the connection information by running the following cell:
%reload_ext sql
from hops import hive
hive.setup_hive_connection()
Let's first create a Hive external and let's point the table to the just created dataset. In the query below you should replace [Projectname]
with your project name.
The %%sql
magic allows you to write HiveQL and execute the query against the HiveServer.
%%sql
CREATE EXTERNAL TABLE sacramento_properties_ext(
street string,
city string,
zip int,
state string,
beds int,
baths int,
sq__ft float,
sales_type string,
sale_date string,
price float,
latitude float,
longitude float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/Projects/[Projectname]/RawData'
To make sure the table was created successfully, we can list all the tables in the project's database and run a simple query against the external table.
%%sql
show tables
Done.
tab_name |
---|
sacramento_properties_ext |
%%sql
select * from sacramento_properties_ext limit 10
Done.
street | city | zip | state | beds | baths | sq__ft | sales_type | sale_date | price | latitude | longitude |
---|---|---|---|---|---|---|---|---|---|---|---|
3526 HIGH ST | SACRAMENTO | 95838 | CA | 2 | 1 | 836.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 59222.0 | 38.631912 | -121.434875 |
51 OMAHA CT | SACRAMENTO | 95823 | CA | 3 | 1 | 1167.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 68212.0 | 38.4789 | -121.43103 |
2796 BRANCH ST | SACRAMENTO | 95815 | CA | 2 | 1 | 796.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 68880.0 | 38.618305 | -121.44384 |
2805 JANETTE WAY | SACRAMENTO | 95815 | CA | 2 | 1 | 852.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 69307.0 | 38.616837 | -121.43915 |
6001 MCMAHON DR | SACRAMENTO | 95824 | CA | 2 | 1 | 797.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 81900.0 | 38.51947 | -121.43577 |
5828 PEPPERMILL CT | SACRAMENTO | 95841 | CA | 3 | 1 | 1122.0 | Condo | Wed May 21 00:00:00 EDT 2008 | 89921.0 | 38.662594 | -121.32781 |
6048 OGDEN NASH WAY | SACRAMENTO | 95842 | CA | 3 | 2 | 1104.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 90895.0 | 38.68166 | -121.35171 |
2561 19TH AVE | SACRAMENTO | 95820 | CA | 3 | 1 | 1177.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 91002.0 | 38.53509 | -121.48137 |
11150 TRINITY RIVER DR Unit 114 | RANCHO CORDOVA | 95670 | CA | 2 | 2 | 941.0 | Condo | Wed May 21 00:00:00 EDT 2008 | 94905.0 | 38.62119 | -121.27055 |
7325 10TH ST | RIO LINDA | 95673 | CA | 3 | 2 | 1146.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 98937.0 | 38.70091 | -121.44298 |
In this section we are going to load the data in a managed table. The table is going to be partitioned by ZIP and stored in ORC (a columnar storage format).
%%sql
CREATE TABLE sacramento_properties(
street string,
city string,
state string,
beds int,
baths int,
sq__ft float,
sales_type string,
sale_date string,
price float,
latitude float,
longitude float)
PARTITIONED by (zip int)
STORED AS ORC
%%sql
show tables
Done.
tab_name |
---|
sacramento_properties |
sacramento_properties_ext |
By default we don't allow dynamic inserts. This means that we should specify, for each row, which partition it belongs to. This is meant to avoid having a wrong insert query create a huge number of directories.
In this case, however, the dataset is small and we know what we are doing. In the next cell we configure the hive session to allow nonstrict dymanic insert.
%%sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
Now we load the data into the managed table. Partition column(s) should be listed last in the SELECT
statement.
%%sql
INSERT OVERWRITE TABLE sacramento_properties PARTITION (zip)
SELECT street, city, state, beds, baths, sq__ft, sales_type, sale_date, price, latitude, longitude, zip
FROM sacramento_properties_ext
%%sql
select *
FROM sacramento_properties
WHERE zip=95608
LIMIT 10
Done.
street | city | state | beds | baths | sq__ft | sales_type | sale_date | price | latitude | longitude | zip |
---|---|---|---|---|---|---|---|---|---|---|---|
5332 SANDSTONE ST | CARMICHAEL | CA | 3 | 1 | 1152.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 181872.0 | 38.662106 | -121.31394 | 95608 |
5907 ELLERSLEE DR | CARMICHAEL | CA | 3 | 1 | 936.0 | Residential | Wed May 21 00:00:00 EDT 2008 | 200000.0 | 38.664467 | -121.32683 | 95608 |
4010 ALEX LN | CARMICHAEL | CA | 2 | 2 | 1326.0 | Condo | Wed May 21 00:00:00 EDT 2008 | 250134.0 | 38.637028 | -121.312965 | 95608 |
5925 MALEVILLE AVE | CARMICHAEL | CA | 4 | 2 | 1120.0 | Residential | Tue May 20 00:00:00 EDT 2008 | 189000.0 | 38.666565 | -121.325714 | 95608 |
2109 HAMLET PL | CARMICHAEL | CA | 2 | 2 | 1598.0 | Residential | Tue May 20 00:00:00 EDT 2008 | 484000.0 | 38.602753 | -121.32932 | 95608 |
5709 RIVER OAK WAY | CARMICHAEL | CA | 4 | 2 | 2222.0 | Residential | Tue May 20 00:00:00 EDT 2008 | 582000.0 | 38.602463 | -121.33098 | 95608 |
7032 FAIR OAKS BLVD | CARMICHAEL | CA | 3 | 2 | 1245.0 | Condo | Mon May 19 00:00:00 EDT 2008 | 139500.0 | 38.628563 | -121.3283 | 95608 |
7110 STELLA LN Unit 15 | CARMICHAEL | CA | 2 | 2 | 1000.0 | Condo | Mon May 19 00:00:00 EDT 2008 | 182000.0 | 38.637398 | -121.30006 | 95608 |
5847 DEL CAMPO LN | CARMICHAEL | CA | 3 | 1 | 1713.0 | Residential | Mon May 19 00:00:00 EDT 2008 | 266000.0 | 38.671993 | -121.32434 | 95608 |
4622 MEYER WAY | CARMICHAEL | CA | 4 | 2 | 1559.0 | Residential | Mon May 19 00:00:00 EDT 2008 | 285000.0 | 38.64913 | -121.31067 | 95608 |
%%sql
select sales_type, avg(price) as avg_price
FROM sacramento_properties
WHERE zip=95608
GROUP BY sales_type
LIMIT 10
Done.
sales_type | avg_price |
---|---|
Condo | 190544.66666666666 |
Residential | 314238.8823529412 |
The query result can also be stored in a Python dictionary. This enables us to choose from a plethora of Python libraries available online, to visualize data and build BI reports
In the example below we use the folium library to mark where the Condos in our dataset are located in Sacramento. For the next section to work, if you haven't already, please install the folium package in your project environment and restart the kernel. (Also, please re-run the first cell)
(Internet access is required to visualize the map)
condos = %sql select * from sacramento_properties where `sales_type` = 'Condo'
Done.
import folium
m = folium.Map(
location=[38.5815700, -121.4944000],
zoom_start=10
)
for condo in condos:
folium.Marker([float(condo[9]), float(condo[10])], popup=str(condo[8])).add_to(m)
m