Check out more notebooks at our Community Notebooks Repository!
Title: How to visualize results from BigQuery
Author: David L Gibbs
Created: 2019-07-17
Purpose: Demonstrate how visualize the results from a query.
URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_plot_BigQuery_results.ipynb
Notes:
In this example, we'll perform a few easy queries, and use the python Seaborn library to visualize the results.
Library documentation: https://seaborn.pydata.org/
There's many python tutorials on using the Seaborn library for plotting. For example, https://elitedatascience.com/python-seaborn-tutorial
# If you start your jupyter notebook session using the
# Community-Notebook/VM-Notebook-Launcher then you're already logged in!
# !gcloud auth application-default login
!pip3 install seaborn
Collecting seaborn Using cached https://files.pythonhosted.org/packages/a8/76/220ba4420459d9c4c9c9587c6ce607bf56c25b3d3d2de62056efe482dadc/seaborn-0.9.0-py3-none-any.whl Collecting matplotlib>=1.4.3 (from seaborn) Using cached https://files.pythonhosted.org/packages/89/61/465fb3bfba684b0f53b5c4829c3c89e86e6fe9fdcdfda93e38f1788090f0/matplotlib-3.0.3-cp35-cp35m-manylinux1_x86_64.whl Collecting numpy>=1.9.3 (from seaborn) Using cached https://files.pythonhosted.org/packages/bb/ef/d5a21cbc094d3f4d5b5336494dbcc9550b70c766a8345513c7c24ed18418/numpy-1.16.4-cp35-cp35m-manylinux1_x86_64.whl Collecting scipy>=0.14.0 (from seaborn) Using cached https://files.pythonhosted.org/packages/14/49/8f13fa215e10a7ab0731cc95b0e9bb66cf83c6a98260b154cfbd0b55fb19/scipy-1.3.0-cp35-cp35m-manylinux1_x86_64.whl Collecting pandas>=0.15.2 (from seaborn) Using cached https://files.pythonhosted.org/packages/74/24/0cdbf8907e1e3bc5a8da03345c23cbed7044330bb8f73bb12e711a640a00/pandas-0.24.2-cp35-cp35m-manylinux1_x86_64.whl Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib>=1.4.3->seaborn) Using cached https://files.pythonhosted.org/packages/dd/d9/3ec19e966301a6e25769976999bd7bbe552016f0d32b577dc9d63d2e0c49/pyparsing-2.4.0-py2.py3-none-any.whl Collecting cycler>=0.10 (from matplotlib>=1.4.3->seaborn) Using cached https://files.pythonhosted.org/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl Collecting kiwisolver>=1.0.1 (from matplotlib>=1.4.3->seaborn) Using cached https://files.pythonhosted.org/packages/ee/18/4cd2e84c6aff0c6a50479118083d20b9e676e5175a913c0ea76d700fc244/kiwisolver-1.1.0-cp35-cp35m-manylinux1_x86_64.whl Collecting python-dateutil>=2.1 (from matplotlib>=1.4.3->seaborn) Using cached https://files.pythonhosted.org/packages/41/17/c62faccbfbd163c7f57f3844689e3a78bae1f403648a6afb1d0866d87fbb/python_dateutil-2.8.0-py2.py3-none-any.whl Collecting pytz>=2011k (from pandas>=0.15.2->seaborn) Using cached https://files.pythonhosted.org/packages/3d/73/fe30c2daaaa0713420d0382b16fbb761409f532c56bdcc514bf7b6262bb6/pytz-2019.1-py2.py3-none-any.whl Collecting six (from cycler>=0.10->matplotlib>=1.4.3->seaborn) Using cached https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl Collecting setuptools (from kiwisolver>=1.0.1->matplotlib>=1.4.3->seaborn) Using cached https://files.pythonhosted.org/packages/ec/51/f45cea425fd5cb0b0380f5b0f048ebc1da5b417e48d304838c02d6288a1e/setuptools-41.0.1-py2.py3-none-any.whl Installing collected packages: pyparsing, six, cycler, setuptools, kiwisolver, python-dateutil, numpy, matplotlib, scipy, pytz, pandas, seaborn Successfully installed cycler-0.10.0 kiwisolver-1.1.0 matplotlib-3.0.3 numpy-1.16.4 pandas-0.24.2 pyparsing-2.4.0 python-dateutil-2.8.0 pytz-2019.1 scipy-1.3.0 seaborn-0.9.0 setuptools-41.0.1 six-1.12.0
!pip3 install google-cloud-bigquery
Collecting google-cloud-bigquery Downloading https://files.pythonhosted.org/packages/2e/95/64e92560983db41ff1de7c08839f38ae7c5326a8aad71f5e893098cd1c85/google_cloud_bigquery-1.15.0-py2.py3-none-any.whl (133kB) 100% |████████████████████████████████| 133kB 4.0MB/s ta 0:00:01 Collecting protobuf>=3.6.0 (from google-cloud-bigquery) Downloading https://files.pythonhosted.org/packages/7c/d2/581ebc3c41879aca2c4fce5c37cdb8d779c4ea79109b6da7f640735ea0a2/protobuf-3.8.0-cp35-cp35m-manylinux1_x86_64.whl (1.2MB) 100% |████████████████████████████████| 1.2MB 1.1MB/s eta 0:00:01 Collecting google-resumable-media>=0.3.1 (from google-cloud-bigquery) Downloading https://files.pythonhosted.org/packages/e2/5d/4bc5c28c252a62efe69ed1a1561da92bd5af8eca0cdcdf8e60354fae9b29/google_resumable_media-0.3.2-py2.py3-none-any.whl Collecting google-cloud-core<2.0dev,>=1.0.0 (from google-cloud-bigquery) Downloading https://files.pythonhosted.org/packages/98/7f/ff56aec313787577e262d5a2e306c04aef61c5c274699ff9fb40095e6691/google_cloud_core-1.0.2-py2.py3-none-any.whl Collecting setuptools (from protobuf>=3.6.0->google-cloud-bigquery) Using cached https://files.pythonhosted.org/packages/ec/51/f45cea425fd5cb0b0380f5b0f048ebc1da5b417e48d304838c02d6288a1e/setuptools-41.0.1-py2.py3-none-any.whl Collecting six>=1.9 (from protobuf>=3.6.0->google-cloud-bigquery) Using cached https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl Collecting google-api-core<2.0.0dev,>=1.11.0 (from google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery) Downloading https://files.pythonhosted.org/packages/10/d6/8b1e8d79a8a56649af3a094e3d90dd213278da942f36d831b57c0ca4a503/google_api_core-1.11.1-py2.py3-none-any.whl (66kB) 100% |████████████████████████████████| 71kB 10.6MB/s ta 0:00:01 Collecting google-auth<2.0dev,>=0.4.0 (from google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery) Downloading https://files.pythonhosted.org/packages/c5/9b/ed0516cc1f7609fb0217e3057ff4f0f9f3e3ce79a369c6af4a6c5ca25664/google_auth-1.6.3-py2.py3-none-any.whl (73kB) 100% |████████████████████████████████| 81kB 10.7MB/s ta 0:00:01 Collecting pytz (from google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery) Using cached https://files.pythonhosted.org/packages/3d/73/fe30c2daaaa0713420d0382b16fbb761409f532c56bdcc514bf7b6262bb6/pytz-2019.1-py2.py3-none-any.whl Collecting requests<3.0.0dev,>=2.18.0 (from google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery) Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB) 100% |████████████████████████████████| 61kB 10.3MB/s ta 0:00:01 Collecting googleapis-common-protos!=1.5.4,<2.0dev,>=1.5.3 (from google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery) Downloading https://files.pythonhosted.org/packages/eb/ee/e59e74ecac678a14d6abefb9054f0bbcb318a6452a30df3776f133886d7d/googleapis-common-protos-1.6.0.tar.gz Collecting rsa>=3.1.4 (from google-auth<2.0dev,>=0.4.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery) Downloading https://files.pythonhosted.org/packages/02/e5/38518af393f7c214357079ce67a317307936896e961e35450b70fad2a9cf/rsa-4.0-py2.py3-none-any.whl Collecting pyasn1-modules>=0.2.1 (from google-auth<2.0dev,>=0.4.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery) Downloading https://files.pythonhosted.org/packages/91/f0/b03e00ce9fddf4827c42df1c3ce10c74eadebfb706231e8d6d1c356a4062/pyasn1_modules-0.2.5-py2.py3-none-any.whl (74kB) 100% |████████████████████████████████| 81kB 11.0MB/s ta 0:00:01 Collecting cachetools>=2.0.0 (from google-auth<2.0dev,>=0.4.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery) Downloading https://files.pythonhosted.org/packages/2f/a6/30b0a0bef12283e83e58c1d6e7b5aabc7acfc4110df81a4471655d33e704/cachetools-3.1.1-py2.py3-none-any.whl Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests<3.0.0dev,>=2.18.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery) Downloading https://files.pythonhosted.org/packages/e6/60/247f23a7121ae632d62811ba7f273d0e58972d75e58a94d329d51550a47d/urllib3-1.25.3-py2.py3-none-any.whl (150kB) 100% |████████████████████████████████| 153kB 8.1MB/s eta 0:00:01 Collecting certifi>=2017.4.17 (from requests<3.0.0dev,>=2.18.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery) Downloading https://files.pythonhosted.org/packages/60/75/f692a584e85b7eaba0e03827b3d51f45f571c2e793dd731e598828d380aa/certifi-2019.3.9-py2.py3-none-any.whl (158kB) 100% |████████████████████████████████| 163kB 7.1MB/s eta 0:00:01 Collecting chardet<3.1.0,>=3.0.2 (from requests<3.0.0dev,>=2.18.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery) Downloading https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl (133kB) 100% |████████████████████████████████| 143kB 7.5MB/s eta 0:00:01 Collecting idna<2.9,>=2.5 (from requests<3.0.0dev,>=2.18.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery) Downloading https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl (58kB) 100% |████████████████████████████████| 61kB 10.2MB/s ta 0:00:01 Collecting pyasn1>=0.1.3 (from rsa>=3.1.4->google-auth<2.0dev,>=0.4.0->google-api-core<2.0.0dev,>=1.11.0->google-cloud-core<2.0dev,>=1.0.0->google-cloud-bigquery) Downloading https://files.pythonhosted.org/packages/7b/7c/c9386b82a25115cccf1903441bba3cbadcfae7b678a20167347fa8ded34c/pyasn1-0.4.5-py2.py3-none-any.whl (73kB) 100% |████████████████████████████████| 81kB 11.4MB/s ta 0:00:01 Building wheels for collected packages: googleapis-common-protos Running setup.py bdist_wheel for googleapis-common-protos ... done Stored in directory: /home/dgibbs/.cache/pip/wheels/9e/3d/a2/1bec8bb7db80ab3216dbc33092bb7ccd0debfb8ba42b5668d5 Successfully built googleapis-common-protos Installing collected packages: setuptools, six, protobuf, google-resumable-media, pyasn1, rsa, pyasn1-modules, cachetools, google-auth, pytz, urllib3, certifi, chardet, idna, requests, googleapis-common-protos, google-api-core, google-cloud-core, google-cloud-bigquery Successfully installed cachetools-3.1.1 certifi-2019.3.9 chardet-3.0.4 google-api-core-1.11.1 google-auth-1.6.3 google-cloud-bigquery-1.15.0 google-cloud-core-1.0.2 google-resumable-media-0.3.2 googleapis-common-protos-1.6.0 idna-2.8 protobuf-3.8.0 pyasn1-0.4.5 pyasn1-modules-0.2.5 pytz-2019.1 requests-2.22.0 rsa-4.0 setuptools-41.0.1 six-1.12.0 urllib3-1.25.3
import seaborn as se
from google.cloud import bigquery
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.rcParams['font.family'] = "serif"
def runQuery ( client, qString, dryRun=False ):
print ( "\n in runQuery ... " )
if ( dryRun ):
print ( " dry-run only " )
## set up QueryJobConfig object
job_config = bigquery.QueryJobConfig()
job_config.dry_run = dryRun
job_config.use_query_cache = True
job_config.use_legacy_sql = False
## run the query
try:
query_job = client.query ( qString, job_config=job_config )
## print ( " query job state: ", query_job.state )
except:
print ( " FATAL ERROR: query execution failed " )
return ( None )
## return results as a dataframe (or an empty dataframe for a dry-run)
if ( not dryRun ):
try:
df = query_job.to_dataframe()
if ( query_job.total_bytes_processed==0 ):
print ( " the results for this query were previously cached " )
else:
print ( " this query processed {} bytes ".format(query_job.total_bytes_processed) )
if ( len(df) < 1 ):
print ( " WARNING: this query returned NO results ")
return ( df )
except:
print ( " FATAL ERROR: query execution failed " )
return ( None )
else:
print ( " if not cached, this query will process {} bytes ".format(query_job.total_bytes_processed) )
## return an empty dataframe
return ( pd.DataFrame() )
def checkQueryResults ( qr ):
print ( "\n in checkQueryResults ... " )
if ( not isinstance(qr, pd.DataFrame) ):
print ( " query execution failed! " )
return ( False )
else:
if ( len(qr) > 0 ):
print ( " # of rows in query results: {} ".format(len(qr)) )
print ( "\n", qr.head() )
else:
print ( " query returned NO results ?!? " )
return ( True )
# User your own project ID
project_id = 'your_project_number'
if project_id == 'your_project_number':
print('Please update the project number with your Google Cloud Project')
else:
bqclient = bigquery.Client(project=project_id)
# We define queries as strings #
sql = '''
SELECT
icd_10,
COUNT(*) as Count
FROM
`isb-cgc.TCGA_bioclin_v0.Clinical`
GROUP BY
1 -- this is the same as 'group by icd_10'
ORDER BY
Count
'''
res0 = runQuery ( bqclient, sql, dryRun=False )
res0[0:5]
in runQuery ... this query processed 74236 bytes
icd_10 | Count | |
---|---|---|
0 | C02.2 | 1 |
1 | C51.9 | 1 |
2 | C44.3 | 1 |
3 | C53.1 | 1 |
4 | C49.8 | 1 |
# first I'll subset the results,
# there's a lot of categories!
res1 = res0[res0['Count'] > 200]
# then we'll make a bar plot
p = se.barplot(data=res1, x = 'icd_10', y = 'Count')
sql = '''
SELECT
avg_percent_neutrophil_infiltration,
avg_percent_lymphocyte_infiltration
FROM
`isb-cgc.TCGA_bioclin_v0.Biospecimen`
GROUP BY
1,2
'''
res3 = runQuery ( bqclient, sql, dryRun=False )
res3[0:5]
in runQuery ... this query processed 138600 bytes
avg_percent_neutrophil_infiltration | avg_percent_lymphocyte_infiltration | |
---|---|---|
0 | 10.0 | 40.0 |
1 | 0.0 | 85.0 |
2 | 1.5 | 8.5 |
3 | 3.0 | 9.0 |
4 | 5.0 | 6.0 |
se.scatterplot(data=res3, x='avg_percent_lymphocyte_infiltration', y='avg_percent_neutrophil_infiltration')
<matplotlib.axes._subplots.AxesSubplot at 0x7f6da51a04e0>