%load_ext autoreload
%autoreload 2
In this notebook, we will analyze the public Google Analytics sample dataset hosted on BigQuery.
Objectives:
- Load one week of Google Analytics data from BigQuery
- Extract nested fields into flat columns
- Compute page-level bounce rates, exit rates, and an e-commerce action funnel
- Package the steps as a pipeline that can be rerun for any week
Privacy of your data is of prime importance. This library has been specifically designed to NOT share any part of your data with the Genie APIs. Only metadata about your data, such as the names and types of the columns of a pandas dataframe, is shared, which helps in generating high-quality results.
Don't have an access token yet? Sign up for free!
Store your access token in a .env file with the same format as shown in the .env.example file.
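For example, assuming .env.example follows the standard KEY=value dotenv format, your .env would contain a single line with the token variable (CODE_GENIE_TOKEN is the variable name used below):
CODE_GENIE_TOKEN=your-access-token-here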
from dotenv import load_dotenv
load_dotenv(".env") # replace with path to your env file
True
We're using the dotenv library here; alternatively, you can set the environment variable directly:
import os
os.environ["CODE_GENIE_TOKEN"] = "your-access-token-here"
Doing so is not recommended, though, since notebooks are typically shared with others and you should keep your access token private.
By default, the package caches genie invocations in a temp file. This means any genies created will be lost once you restart the kernel or rerun the notebook at a later time. To keep the cached genies, you can set a custom path where they will be stored, so that you don't lose them when you rerun the notebook.
You can modify the cached code if you'd like, and those modifications will be loaded the next time you run the code.
CACHE_DIR = "./_cache_google_analytics"
We will use the Google Analytics dataset and create a pipeline which can be run once every week to generate stats for the given week. We will create three tables containing the following information: bounce rate by landing page, exit rate by page, and an e-commerce funnel with hit counts by action type.
Let's load one week of data from BigQuery, July 1 to July 7, 2017.
start_date = "20170701"
end_date = "20170707"
from google.cloud import bigquery
client = bigquery.Client()
query = f"""
select * from `bigquery-public-data.google_analytics_sample.ga_sessions_*`
where _TABLE_SUFFIX between '{start_date}' and '{end_date}'
""".format(start_date=start_date, end_date=end_date)
df = client.query(query).result().to_dataframe()
df.shape
(15920, 16)
df.head(2)
| | visitorId | visitNumber | visitId | visitStartTime | date | totals | trafficSource | device | geoNetwork | customDimensions | hits | fullVisitorId | userId | clientId | channelGrouping | socialEngagementType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <NA> | 1 | 1499117293 | 1499117293 | 20170703 | {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... | {'referralPath': None, 'campaign': '(not set)'... | {'browser': 'Safari', 'browserVersion': 'not a... | {'continent': 'Americas', 'subContinent': 'Cen... | [{'index': 4, 'value': 'Central America'}] | [{'hitNumber': 1, 'time': 0, 'hour': 14, 'minu... | 1018154947831642966 | None | None | Organic Search | Not Socially Engaged |
| 1 | <NA> | 1 | 1499132474 | 1499132474 | 20170703 | {'visits': 1, 'hits': 16, 'pageviews': 10, 'ti... | {'referralPath': None, 'campaign': '(not set)'... | {'browser': 'Chrome', 'browserVersion': 'not a... | {'continent': 'Americas', 'subContinent': 'Nor... | [{'index': 4, 'value': 'North America'}] | [{'hitNumber': 1, 'time': 0, 'hour': 18, 'minu... | 1770622829105680991 | None | None | Organic Search | Not Socially Engaged |
To perform this analysis, we need to extract the following nested fields into their own columns: page_path, hit_type, hit_number, action_type, and bounces.
from code_genie import Genie
genie = Genie(data=df, cache_dir=CACHE_DIR)
The hits column contains multiple values in a single row; we need to convert those into separate rows.
Note that update_base_input=True means that the data held by the genie is updated, and subsequent genie invocations will use this data as input.
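For intuition, the expansion requested below is roughly equivalent to this plain-pandas sketch (the genie generates its own version of this logic):
# one row per element of the hits list; totals is repeated for each element
hits_expanded = df[["hits", "totals"]].explode("hits")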
gr_hits_expanded = genie.plz("""
each value in hits column contains a list. create a new dataframe where a separate row exists for each value.
copy the totals columns to the new df, keep the same value for each item in the list. drop all other columns except hits and totals.
""", update_base_input=True)
Loading cached genie id: separate_hits_70651, set override = True to rerun
gr_hits_expanded.result.head(2)
| | hits | totals |
|---|---|---|
| 0 | {'hitNumber': 1, 'time': 0, 'hour': 14, 'minut... | {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... |
| 0 | {'hitNumber': 2, 'time': 56142, 'hour': 14, 'm... | {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... |
gr_page_path = genie.plz("""
each value of hits is a dictionary with page as a key. the value of page is also a dictionary with pagePath as a key.
extract pagePath into a new column called page_path
""", update_base_input=True)
Loading cached genie id: extract_page_path_78866, set override = True to rerun
gr_page_path.result.head(2)
| | hits | totals | page_path |
|---|---|---|---|
| 0 | {'hitNumber': 1, 'time': 0, 'hour': 14, 'minut... | {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... | /home |
| 0 | {'hitNumber': 2, 'time': 56142, 'hour': 14, 'm... | {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... | /home |
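For reference, the generated extraction presumably amounts to a one-liner like this pandas sketch, where df stands for the expanded dataframe (not necessarily the genie's exact code):
df["page_path"] = df["hits"].apply(lambda h: h["page"]["pagePath"])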
gr_hit_type = genie.custom("""
def run(df):
    # each value of hits is a dict; pull out its "type" key as a new column
    df["hit_type"] = df["hits"].apply(lambda x: x["type"])
    return df
""", update_base_input=True)
Genie cached with id: run_51794
gr_hit_number = genie.plz("""
each value of hits is a dictionary with hitNumber as a key. extract hitNumber into a new column called hit_number
""", update_base_input=True)
Loading cached genie id: extract_hit_number_77980, set override = True to rerun
Let's see another example of using a custom genie
gr_action_type = genie.custom("""
def run(df):
    # eCommerceAction is a nested dict inside hits; its action_type comes back
    # as a string, so cast it to int
    df["action_type"] = df["hits"].apply(lambda x: int(x["eCommerceAction"]["action_type"]))
    return df
""", update_base_input=True)
Genie cached with id: run_20093
gr_bounce = genie.plz("""
each value of totals is a dictionary. create a new boolean column called bounces which is True if the key bounces in totals has a value of 1
""", update_base_input=True)
Loading cached genie id: add_bounces_column_93508, set override = True to rerun
gr_bounce.result.head(2)
| | hits | totals | page_path | hit_type | hit_number | action_type | bounces |
|---|---|---|---|---|---|---|---|
| 0 | {'hitNumber': 1, 'time': 0, 'hour': 14, 'minut... | {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... | /home | PAGE | 1 | 0 | False |
| 0 | {'hitNumber': 2, 'time': 56142, 'hour': 14, 'm... | {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... | /home | PAGE | 2 | 0 | False |
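A plain-pandas sketch of the bounce flag described above (we assume the bounces key can be absent from totals for non-bounce sessions, hence the .get):
df["bounces"] = df["totals"].apply(lambda t: t.get("bounces") == 1)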
gr_landing_page = genie.plz([
"filter data for hit_number as 1 and hit_type as PAGE",
"""group data by page_path, create 2 aggregate columns:
views: total number of rows in the group
bounce_rate: fraction of rows in the group where bounces column is True
""",
"sort the data by views descending"])
Loading cached genie id: get_page_bounce_rate_81411, set override = True to rerun
gr_landing_page.result.head(5)
| page_path | views | bounce_rate |
|---|---|---|
| /home | 7698 | 0.489738 |
| /google+redesign/shop+by+brand/youtube | 3911 | 0.629507 |
| /google+redesign/shop+by+brand/waze+baby+on+board+window+decal.axd | 681 | 0.654919 |
| /google+redesign/apparel/mens/mens+t+shirts | 585 | 0.482051 |
| /signin.html | 225 | 0.342222 |
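For reference, the aggregation described above corresponds roughly to the following pandas sketch; the mean of a boolean column gives the fraction of True rows (not necessarily the genie's exact code):
landing = df[(df["hit_number"] == 1) & (df["hit_type"] == "PAGE")]
landing.groupby("page_path").agg(
    views=("bounces", "size"),
    bounce_rate=("bounces", "mean"),
).sort_values("views", ascending=False)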
gr_exit_rate = genie.plz([
"filter data for hit_type as PAGE",
"""group data by page_path, create 2 aggregate columns:
views: total number of rows in the group
exit_rate: fraction of rows in the group where bounces column is True
""",
"sort the data by views descending"])
Loading cached genie id: aggregate_data_92888, set override = True to rerun
gr_exit_rate.result.head(5)
| page_path | views | exit_rate |
|---|---|---|
| /home | 13174 | 0.287308 |
| /google+redesign/shop+by+brand/youtube | 4988 | 0.493585 |
| /basket.html | 2426 | 0.027205 |
| /signin.html | 1512 | 0.050926 |
| /google+redesign/apparel/mens/mens+t+shirts | 1468 | 0.192098 |
gr_funnel = genie.plz([
"group data by action_type and count number of rows in each group",
"remove action_type other than 1, 2, 5, 6",
"""replace the action_type values as:
1: Click on product list page
2: Product details page
5: Checkout
6: Purchase Complete
"""])
Loading cached genie id: group_action_type_27490, set override = True to rerun
gr_funnel.result
| | action_type | count |
|---|---|---|
| 1 | Click on product list page | 6887 |
| 2 | Product details page | 5386 |
| 5 | Checkout | 1115 |
| 6 | Purchase Complete | 292 |
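The equivalent plain-pandas logic would look roughly like this sketch (not the generated code):
funnel = df.groupby("action_type").size().rename("count").reset_index()
funnel = funnel[funnel["action_type"].isin([1, 2, 5, 6])]
funnel["action_type"] = funnel["action_type"].map({
    1: "Click on product list page",
    2: "Product details page",
    5: "Checkout",
    6: "Purchase Complete",
})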
The first step in a pipeline is to define a source of external data which will feed the pipeline. We need to use one of the predefined data sources; let's use the BigQuery source here.
from code_genie.io import BigQueryToDataframeSource, StringArg
# difference from above: start_date and end_date now come from pipeline inputs,
# so the query string keeps the {start_date} and {end_date} placeholders unformatted
query = """
select * from `bigquery-public-data.google_analytics_sample.ga_sessions_*`
where _TABLE_SUFFIX between '{start_date}' and '{end_date}'
"""
source = BigQueryToDataframeSource(
query=query,
query_args={"start_date": StringArg(name="analysis-start-date"), # need to be set during pipeline run
"end_date": StringArg(name="analysis-end-date")}, # need to be set during pipeline run
key_path=StringArg(env_var="GOOGLE_APPLICATION_CREDENTIALS") # to be read from env var
)
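At run time, the source presumably fills the placeholders left in the query string with the pipeline arguments, along the lines of this hypothetical illustration (not the library's actual code):
resolved_query = query.format(start_date="20170701", end_date="20170707")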
Now that we have a source, we need to define where to export the results. In this example, let's export the three dataframes we created as CSV files, using the DataFrameToCsvSink with paths supplied as pipeline arguments.
from code_genie.io import DataFrameToCsvSink
sink_bounce_rate = DataFrameToCsvSink(path=StringArg(name="bounce-rate-export-path"))
sink_exit_rate = DataFrameToCsvSink(path=StringArg(name="exit-rate-export-path"))
sink_funnel = DataFrameToCsvSink(path=StringArg(name="funnel-export-path"))
Once we have our source and sinks, we can stitch them together with the genies we have already created to make a pipeline. Read the pipeline docs for more info.
from code_genie.pipeline import PipelineStep, GeniePipeline
# initialize the pipeline steps, starting with the source-fed first step
pipeline_steps = []
pipeline_steps.append(PipelineStep(genie_result=gr_hits_expanded, data=source))
len(pipeline_steps)
1
Note that for each of these steps the input data is the previous genie's result, since we want to carry all of the added columns forward to the very end.
pipeline_steps.extend([
# extract page path
PipelineStep(genie_result=gr_page_path,
data=gr_hits_expanded),
# extract hit type
PipelineStep(genie_result=gr_hit_type,
data=gr_page_path),
# extract hit number
PipelineStep(genie_result=gr_hit_number,
data=gr_hit_type),
# extract action type
PipelineStep(genie_result=gr_action_type,
data=gr_hit_number),
# extract bounce
PipelineStep(genie_result=gr_bounce,
data=gr_action_type)
])
len(pipeline_steps)
6
Now that we have the interim layers added, we can add the final aggregation layer and assign the corresponding sinks so that the output data is generated. Note that the data argument for all of these steps is still gr_bounce, as the result of that genie needs to be fed into each of them.
We also add the sink parameter to these steps so that the results are written out.
pipeline_steps.extend([
PipelineStep(genie_result=gr_landing_page, data=gr_bounce, sink=sink_bounce_rate),
PipelineStep(genie_result=gr_exit_rate, data=gr_bounce, sink=sink_exit_rate),
PipelineStep(genie_result=gr_funnel, data=gr_bounce, sink=sink_funnel)
])
len(pipeline_steps)
9
pipeline = GeniePipeline(name="google-analytics-pipeline", version="1", steps=pipeline_steps, cache_dir=CACHE_DIR)
pipeline.export("pipeline-v1.json")
Now that we have created the pipeline, we can load it into a new session and run it. From this point on, the code does not depend on anything executed earlier, as long as the pipeline has been exported. You can restart your kernel and run again from here.
Let's load our env vars and set up the cache dir as before.
from dotenv import load_dotenv
load_dotenv(".env") # replace with path to your env file
CACHE_DIR = "./_cache_google_analytics"
Let's rerun for the same dates as above to see if we get the same results back.
from code_genie.pipeline import GeniePipeline
pipeline = GeniePipeline.load(f"{CACHE_DIR}/pipeline-v1.json")
# lets create a temporary directory to store the generated datasets
from tempfile import mkdtemp
export_dir = mkdtemp()
pipeline_args = {
"analysis-start-date": "20170701",
"analysis-end-date": "20170707",
"bounce-rate-export-path": f"{export_dir}/bounce_rate.csv",
"exit-rate-export-path": f"{export_dir}/exit_rate.csv",
"funnel-export-path": f"{export_dir}/funnel_stats.csv"
}
pipeline.run(pipeline_args)
Running step 1: separate_hits_70651 Completed in 13.2 seconds
Running step 2: extract_page_path_78866 Completed in 0.0 seconds
Running step 3: run_75120 Completed in 0.0 seconds
Running step 4: extract_hit_number_77980 Completed in 0.0 seconds
Running step 5: run_83895 Completed in 0.0 seconds
Running step 6: add_bounces_column_93508 Completed in 0.0 seconds
Running step 7: get_page_bounce_rate_81411 Completed in 0.0 seconds
Running step 8: aggregate_data_92888 Completed in 0.0 seconds
Running step 9: group_action_type_27490 Completed in 0.0 seconds
# load results
import pandas as pd
pd.read_csv(pipeline_args["bounce-rate-export-path"]).head(5)
| | page_path | views | bounce_rate |
|---|---|---|---|
| 0 | /home | 7698 | 0.489738 |
| 1 | /google+redesign/shop+by+brand/youtube | 3911 | 0.629507 |
| 2 | /google+redesign/shop+by+brand/waze+baby+on+bo... | 681 | 0.654919 |
| 3 | /google+redesign/apparel/mens/mens+t+shirts | 585 | 0.482051 |
| 4 | /signin.html | 225 | 0.342222 |
pd.read_csv(pipeline_args["exit-rate-export-path"]).head(5)
| | page_path | views | exit_rate |
|---|---|---|---|
| 0 | /home | 13174 | 0.287308 |
| 1 | /google+redesign/shop+by+brand/youtube | 4988 | 0.493585 |
| 2 | /basket.html | 2426 | 0.027205 |
| 3 | /signin.html | 1512 | 0.050926 |
| 4 | /google+redesign/apparel/mens/mens+t+shirts | 1468 | 0.192098 |
pd.read_csv(pipeline_args["funnel-export-path"]).head(5)
| | index | action_type | count |
|---|---|---|---|
| 0 | 1 | Click on product list page | 6887 |
| 1 | 2 | Product details page | 5386 |
| 2 | 5 | Checkout | 1115 |
| 3 | 6 | Purchase Complete | 292 |