%load_ext autoreload
%autoreload 2
In this notebook, we will analyze the public Google Analytics sample dataset hosted on BigQuery.
Objectives:
- Load one week of Google Analytics data from BigQuery
- Extract nested fields into flat columns
- Compute page-level bounce rates, exit rates, and an e-commerce action funnel
- Package the steps as a pipeline that can be rerun for any week
Privacy of your data is of prime importance. This library has been specifically designed to NOT share any part of your data with the Genie APIs. Only metadata about your data, such as the names and types of the columns of a pandas dataframe, is shared, which helps in generating high-quality results.
Don't have an access token yet? Sign up for free!
Store your access token in a .env file with the same format as shown in the .env.example file.
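For example, assuming .env.example follows the standard KEY=value dotenv format, your .env would contain a single line with the token variable (CODE_GENIE_TOKEN is the variable name used below):
CODE_GENIE_TOKEN=your-access-token-here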
from dotenv import load_dotenv
load_dotenv(".env") # replace with path to your env file
True
We're using the dotenv library here; alternatively, you can set the environment variable directly:
import os
os.environ["CODE_GENIE_TOKEN"] = "your-access-token-here"
Doing so is not recommended, though, since notebooks are typically shared with others and you should keep your access token private.
By default, the package caches genie invocations in a temp file. This means any genies created will be lost once you restart the kernel or rerun the notebook at a later time. To keep the cached genies, you can set a custom path where they will be stored, so that you don't lose them when you rerun the notebook.
You can modify the cached code if you'd like, and those modifications will be loaded the next time you run the code.
CACHE_DIR = "./_cache_google_analytics"
We will use the Google Analytics dataset and create a pipeline which can be run once every week to generate stats for the given week. We will create three tables containing the following information: bounce rate by landing page, exit rate by page, and an e-commerce funnel with hit counts by action type.
Let's load one week of data from BigQuery, July 1 to July 7, 2017.
start_date = "20170701"
end_date = "20170707"
from google.cloud import bigquery
client = bigquery.Client()
query = f"""
select * from `bigquery-public-data.google_analytics_sample.ga_sessions_*`
where _TABLE_SUFFIX between '{start_date}' and '{end_date}'
""".format(start_date=start_date, end_date=end_date)
df = client.query(query).result().to_dataframe()
df.shape
(15920, 16)
df.head(2)
| | visitorId | visitNumber | visitId | visitStartTime | date | totals | trafficSource | device | geoNetwork | customDimensions | hits | fullVisitorId | userId | clientId | channelGrouping | socialEngagementType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <NA> | 1 | 1499117293 | 1499117293 | 20170703 | {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... | {'referralPath': None, 'campaign': '(not set)'... | {'browser': 'Safari', 'browserVersion': 'not a... | {'continent': 'Americas', 'subContinent': 'Cen... | [{'index': 4, 'value': 'Central America'}] | [{'hitNumber': 1, 'time': 0, 'hour': 14, 'minu... | 1018154947831642966 | None | None | Organic Search | Not Socially Engaged |
| 1 | <NA> | 1 | 1499132474 | 1499132474 | 20170703 | {'visits': 1, 'hits': 16, 'pageviews': 10, 'ti... | {'referralPath': None, 'campaign': '(not set)'... | {'browser': 'Chrome', 'browserVersion': 'not a... | {'continent': 'Americas', 'subContinent': 'Nor... | [{'index': 4, 'value': 'North America'}] | [{'hitNumber': 1, 'time': 0, 'hour': 18, 'minu... | 1770622829105680991 | None | None | Organic Search | Not Socially Engaged |
To perform this analysis, we need to extract the following nested fields into their own columns: page_path, hit_type, hit_number, action_type, and bounces.
from code_genie import Genie
genie = Genie(data=df, cache_dir=CACHE_DIR)
The hits column contains multiple values in a single row; we need to convert those into separate rows.
Note that update_base_input=True means that the data held by the genie is updated, and subsequent genie invocations will use this data as input.
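For intuition, the expansion requested below is roughly equivalent to this plain-pandas sketch (the genie generates its own version of this logic):
# one row per element of the hits list; totals is repeated for each element
hits_expanded = df[["hits", "totals"]].explode("hits")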
gr_hits_expanded = genie.plz("""
each value in hits column contains a list. create a new dataframe where a separate row exists for each value.
copy the totals columns to the new df, keep the same value for each item in the list. drop all other columns except hits and totals.
""", update_base_input=True)
Loading cached genie id: separate_hits_70651, set override = True to rerun
gr_hits_expanded.result.head(2)
| | hits | totals |
|---|---|---|
| 0 | {'hitNumber': 1, 'time': 0, 'hour': 14, 'minut... | {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... |
| 0 | {'hitNumber': 2, 'time': 56142, 'hour': 14, 'm... | {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... |
gr_page_path = genie.plz("""
each value of hits is a dictionary with page as a key. the value of page is also a dictionary with pagePath as a key.
extract pagePath into a new column called page_path
""", update_base_input=True)
Loading cached genie id: extract_page_path_78866, set override = True to rerun
gr_page_path.result.head(2)
| | hits | totals | page_path |
|---|---|---|---|
| 0 | {'hitNumber': 1, 'time': 0, 'hour': 14, 'minut... | {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... | /home |
| 0 | {'hitNumber': 2, 'time': 56142, 'hour': 14, 'm... | {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... | /home |
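For reference, the generated extraction presumably amounts to a one-liner like this pandas sketch, where df stands for the expanded dataframe (not necessarily the genie's exact code):
df["page_path"] = df["hits"].apply(lambda h: h["page"]["pagePath"])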
gr_hit_type = genie.custom("""
def run(df):
    # each value of hits is a dict; pull out its "type" key as a new column
    df["hit_type"] = df["hits"].apply(lambda x: x["type"])
    return df
""", update_base_input=True)
Genie cached with id: run_51794
gr_hit_number = genie.plz("""
each value of hits is a dictionary with hitNumber as a key. extract hitNumber into a new column called hit_number
""", update_base_input=True)
Loading cached genie id: extract_hit_number_77980, set override = True to rerun
Let's see another example of using a custom genie
gr_action_type = genie.custom("""
def run(df):
    # eCommerceAction is a nested dict inside hits; its action_type comes back
    # as a string, so cast it to int
    df["action_type"] = df["hits"].apply(lambda x: int(x["eCommerceAction"]["action_type"]))
    return df
""", update_base_input=True)
Genie cached with id: run_20093
gr_bounce = genie.plz("""
each value of totals is a dictionary. create a new boolean column called bounces which is True if the key bounces in totals has a value of 1
""", update_base_input=True)
Loading cached genie id: add_bounces_column_93508, set override = True to rerun
gr_bounce.result.head(2)
| | hits | totals | page_path | hit_type | hit_number | action_type | bounces |
|---|---|---|---|---|---|---|---|
| 0 | {'hitNumber': 1, 'time': 0, 'hour': 14, 'minut... | {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... | /home | PAGE | 1 | 0 | False |
| 0 | {'hitNumber': 2, 'time': 56142, 'hour': 14, 'm... | {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... | /home | PAGE | 2 | 0 | False |
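A plain-pandas sketch of the bounce flag described above (we assume the bounces key can be absent from totals for non-bounce sessions, hence the .get):
df["bounces"] = df["totals"].apply(lambda t: t.get("bounces") == 1)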
gr_landing_page = genie.plz([
"filter data for hit_number as 1 and hit_type as PAGE",
"""group data by page_path, create 2 aggregate columns:
views: total number of rows in the group
bounce_rate: fraction of rows in the group where bounces column is True
""",
"sort the data by views descending"])
Loading cached genie id: get_page_bounce_rate_81411, set override = True to rerun
gr_landing_page.result.head(5)
| page_path | views | bounce_rate |
|---|---|---|
| /home | 7698 | 0.489738 |
| /google+redesign/shop+by+brand/youtube | 3911 | 0.629507 |
| /google+redesign/shop+by+brand/waze+baby+on+board+window+decal.axd | 681 | 0.654919 |
| /google+redesign/apparel/mens/mens+t+shirts | 585 | 0.482051 |
| /signin.html | 225 | 0.342222 |
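For reference, the aggregation described above corresponds roughly to the following pandas sketch; the mean of a boolean column gives the fraction of True rows (not necessarily the genie's exact code):
landing = df[(df["hit_number"] == 1) & (df["hit_type"] == "PAGE")]
landing.groupby("page_path").agg(
    views=("bounces", "size"),
    bounce_rate=("bounces", "mean"),
).sort_values("views", ascending=False)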
gr_exit_rate = genie.plz([
"filter data for hit_type as PAGE",
"""group data by page_path, create 2 aggregate columns:
views: total number of rows in the group
exit_rate: fraction of rows in the group where bounces column is True
""",
"sort the data by views descending"])
Loading cached genie id: aggregate_data_92888, set override = True to rerun
gr_exit_rate.result.head(5)
| page_path | views | exit_rate |
|---|---|---|
| /home | 13174 | 0.287308 |
| /google+redesign/shop+by+brand/youtube | 4988 | 0.493585 |
| /basket.html | 2426 | 0.027205 |
| /signin.html | 1512 | 0.050926 |
| /google+redesign/apparel/mens/mens+t+shirts | 1468 | 0.192098 |
gr_funnel = genie.plz([
"group data by action_type and count number of rows in each group",
"remove action_type other than 1, 2, 5, 6",
"""replace the action_type values as:
1: Click on product list page
2: Product details page
5: Checkout
6: Purchase Complete
"""])
Loading cached genie id: group_action_type_27490, set override = True to rerun
gr_funnel.result
| | action_type | count |
|---|---|---|
| 1 | Click on product list page | 6887 |
| 2 | Product details page | 5386 |
| 5 | Checkout | 1115 |
| 6 | Purchase Complete | 292 |
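The equivalent plain-pandas logic would look roughly like this sketch (not the generated code):
funnel = df.groupby("action_type").size().rename("count").reset_index()
funnel = funnel[funnel["action_type"].isin([1, 2, 5, 6])]
funnel["action_type"] = funnel["action_type"].map({
    1: "Click on product list page",
    2: "Product details page",
    5: "Checkout",
    6: "Purchase Complete",
})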
The first step in a pipeline is to define a source of external data which will feed the pipeline. We need to use one of the predefined data sources; let's use the BigQuery source here.
from code_genie.io import BigQueryToDataframeSource, StringArg
# difference from above: start_date and end_date now come from pipeline inputs,
# so the query string keeps the {start_date} and {end_date} placeholders unformatted
query = """
select * from `bigquery-public-data.google_analytics_sample.ga_sessions_*`
where _TABLE_SUFFIX between '{start_date}' and '{end_date}'
"""
source = BigQueryToDataframeSource(
query=query,
query_args={"start_date": StringArg(name="analysis-start-date"), # need to be set during pipeline run
"end_date": StringArg(name="analysis-end-date")}, # need to be set during pipeline run
key_path=StringArg(env_var="GOOGLE_APPLICATION_CREDENTIALS") # to be read from env var
)
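At run time, the source presumably fills the placeholders left in the query string with the pipeline arguments, along the lines of this hypothetical illustration (not the library's actual code):
resolved_query = query.format(start_date="20170701", end_date="20170707")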
Now that we have a source, we need to define where to export the results. In this example, let's export the three dataframes we created as CSV files, using the DataFrameToCsvSink with paths supplied as pipeline arguments.
from code_genie.io import DataFrameToCsvSink
sink_bounce_rate = DataFrameToCsvSink(path=StringArg(name="bounce-rate-export-path"))
sink_exit_rate = DataFrameToCsvSink(path=StringArg(name="exit-rate-export-path"))
sink_funnel = DataFrameToCsvSink(path=StringArg(name="funnel-export-path"))
Once we have our source and sinks, we can stitch them together with the genies we have already created to make a pipeline. Read the pipeline docs for more info.
from code_genie.pipeline import PipelineStep, GeniePipeline
# initialize the pipeline steps, starting with the source-fed first step
pipeline_steps = []
pipeline_steps.append(PipelineStep(genie_result=gr_hits_expanded, data=source))
len(pipeline_steps)
1
Note that for each of these steps the input data is the previous genie's result, since we want to carry all of the added columns forward to the very end.
pipeline_steps.extend([
# extract page path
PipelineStep(genie_result=gr_page_path,
data=gr_hits_expanded),
# extract hit type
PipelineStep(genie_result=gr_hit_type,
data=gr_page_path),
# extract hit number
PipelineStep(genie_result=gr_hit_number,
data=gr_hit_type),
# extract action type
PipelineStep(genie_result=gr_action_type,
data=gr_hit_number),
# extract bounce
PipelineStep(genie_result=gr_bounce,
data=gr_action_type)
])
len(pipeline_steps)
6
Now that we have the interim layers added, we can add the final aggregation layer and assign the corresponding sinks so that the output data is generated. Note that the data argument for all of these steps is still gr_bounce, as the result of that genie needs to be fed into each of them.
We also add the sink parameter to these steps so that the results are written out.
pipeline_steps.extend([
PipelineStep(genie_result=gr_landing_page, data=gr_bounce, sink=sink_bounce_rate),
PipelineStep(genie_result=gr_exit_rate, data=gr_bounce, sink=sink_exit_rate),
PipelineStep(genie_result=gr_funnel, data=gr_bounce, sink=sink_funnel)
])
len(pipeline_steps)
9
pipeline = GeniePipeline(name="google-analytics-pipeline", version="1", steps=pipeline_steps, cache_dir=CACHE_DIR)
pipeline.export("pipeline-v1.json")
Now that we have created the pipeline, we can load it into a new session and run it. From this point on, the code does not depend on anything executed earlier, as long as the pipeline has been exported. You can restart your kernel and run again from here.
Let's load our env vars and set up the cache dir as before.
from dotenv import load_dotenv
load_dotenv(".env") # replace with path to your env file
CACHE_DIR = "./_cache_google_analytics"
Let's rerun for the same dates as above to see if we get the same results back.
from code_genie.pipeline import GeniePipeline
pipeline = GeniePipeline.load(f"{CACHE_DIR}/pipeline-v1.json")
# lets create a temporary directory to store the generated datasets
from tempfile import mkdtemp
export_dir = mkdtemp()
pipeline_args = {
"analysis-start-date": "20170701",
"analysis-end-date": "20170707",
"bounce-rate-export-path": f"{export_dir}/bounce_rate.csv",
"exit-rate-export-path": f"{export_dir}/exit_rate.csv",
"funnel-export-path": f"{export_dir}/funnel_stats.csv"
}
pipeline.run(pipeline_args)
Running step 1: separate_hits_70651 Completed in 13.2 seconds
Running step 2: extract_page_path_78866 Completed in 0.0 seconds
Running step 3: run_75120 Completed in 0.0 seconds
Running step 4: extract_hit_number_77980 Completed in 0.0 seconds
Running step 5: run_83895 Completed in 0.0 seconds
Running step 6: add_bounces_column_93508 Completed in 0.0 seconds
Running step 7: get_page_bounce_rate_81411 Completed in 0.0 seconds
Running step 8: aggregate_data_92888 Completed in 0.0 seconds
Running step 9: group_action_type_27490 Completed in 0.0 seconds
# load results
import pandas as pd
pd.read_csv(pipeline_args["bounce-rate-export-path"]).head(5)
| | page_path | views | bounce_rate |
|---|---|---|---|
| 0 | /home | 7698 | 0.489738 |
| 1 | /google+redesign/shop+by+brand/youtube | 3911 | 0.629507 |
| 2 | /google+redesign/shop+by+brand/waze+baby+on+bo... | 681 | 0.654919 |
| 3 | /google+redesign/apparel/mens/mens+t+shirts | 585 | 0.482051 |
| 4 | /signin.html | 225 | 0.342222 |
pd.read_csv(pipeline_args["exit-rate-export-path"]).head(5)
| | page_path | views | exit_rate |
|---|---|---|---|
| 0 | /home | 13174 | 0.287308 |
| 1 | /google+redesign/shop+by+brand/youtube | 4988 | 0.493585 |
| 2 | /basket.html | 2426 | 0.027205 |
| 3 | /signin.html | 1512 | 0.050926 |
| 4 | /google+redesign/apparel/mens/mens+t+shirts | 1468 | 0.192098 |
pd.read_csv(pipeline_args["funnel-export-path"]).head(5)
| | index | action_type | count |
|---|---|---|---|
| 0 | 1 | Click on product list page | 6887 |
| 1 | 2 | Product details page | 5386 |
| 2 | 5 | Checkout | 1115 |
| 3 | 6 | Purchase Complete | 292 |