In this notebook we visualize the feature statistics stored in the featurestore for featuregroups and training datasets. The notebook assumes that you have already run the featurestore tour and the notebook FeaturestoreTourPython.ipynb.
The following featuregroups should exist in the featurestore:
games_features
attendances_features
players_features
season_scores_features
teams_features
And the following training dataset should exist in the featurestore:
team_position_prediction
When using Jupyter on Hopsworks, a library called sparkmagic is used to interact with the Hops cluster. When you create a Jupyter notebook on Hopsworks, you first select a kernel. A kernel is simply a program that executes the code in the Jupyter cells; you can think of it as a REPL backend for which your Jupyter notebook acts as the frontend.
Sparkmagic works with a remote REST server for Spark, called Livy, running inside the Hops cluster. Livy is the interface that Jupyter-on-Hopsworks uses to interact with the Hops cluster. When you run Jupyter cells using the pyspark kernel, the kernel automatically sends commands to Livy in the background for execution on the cluster.
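To make the kernel-to-Livy hand-off more concrete, here is a minimal sketch of the kind of request that submits one cell's code to a running Livy session. It only builds the URL and JSON body and sends nothing. The host name, port and session id are placeholders chosen for illustration, and the helper name is hypothetical. Sparkmagic does all of this for you, so you never need to write it yourself.

```python
import json


def build_livy_statement_request(base_url, session_id, code):
    """Build the URL and JSON body for submitting code to a Livy session.

    Hypothetical helper for illustration only; nothing is sent over the network.
    """
    # Livy accepts code for an existing session at /sessions/{id}/statements.
    url = "{}/sessions/{}/statements".format(base_url.rstrip("/"), session_id)
    body = json.dumps({"code": code})
    return url, body


# Placeholder host and session id (assumptions, not values from this cluster).
url, body = build_livy_statement_request(
    "http://livy.example:8998", 0, "spark.range(10).count()"
)
print(url)
print(body)
```

The point of the sketch is that the cell's source travels to the cluster as a plain string, which is why objects created remotely, such as matplotlib figures, are not automatically available to the notebook frontend.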
Since the code in a pyspark notebook is executed remotely, in the spark cluster, regular python plotting will not work. What you can do, however, is use the magic %%local to access the local python kernel, or save figures as pngs to HopsFS and plot them locally later. We will go over both approaches in this tutorial.
import os
from hops import featurestore, hdfs
%%local
%matplotlib inline
from hops import featurestore
%%local
In the local environment we can plot as usual; just remember to execute a cell with %matplotlib inline so that the figures are rendered in the notebook.
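The cells that follow call one visualization function per featuregroup. As an optional alternative, the sketch below loops over the featuregroup names and applies a visualization function to each, collecting failures instead of stopping at the first one. The helper name visualize_all is hypothetical, and the visualization function is passed in as an argument so that the sketch does not depend on the hops library being importable.

```python
def visualize_all(visualize_fn, names):
    """Call visualize_fn(name) for each name and return {name: error message}."""
    failures = {}
    for name in names:
        try:
            visualize_fn(name)
        except Exception as exc:  # keep going if one featuregroup fails
            failures[name] = str(exc)
    return failures


# Featuregroup names listed at the top of this notebook.
FEATUREGROUPS = [
    "games_features",
    "attendances_features",
    "players_features",
    "season_scores_features",
    "teams_features",
]

# Intended use inside a %%local cell, after `from hops import featurestore`:
# failures = visualize_all(featurestore.visualize_featuregroup_distributions, FEATUREGROUPS)
```

Running one featuregroup per cell, as done below, keeps each figure next to the call that produced it; the loop is mainly useful when you want to regenerate everything in one go.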
%%local
featurestore.visualize_featuregroup_distributions("games_features")
Remember to add %%matplotlib inline when doing visualizations in Jupyter notebooks
%%local
featurestore.visualize_featuregroup_distributions("attendances_features")
%%local
featurestore.visualize_featuregroup_distributions("players_features")
%%local
featurestore.visualize_featuregroup_distributions("season_scores_features")
%%local
featurestore.visualize_featuregroup_distributions("teams_features")
%%local
featurestore.visualize_training_dataset_distributions("team_position_prediction")
%%local
featurestore.visualize_featuregroup_correlations("players_features")
%%local
featurestore.visualize_featuregroup_correlations("teams_features")
%%local
featurestore.visualize_featuregroup_correlations("season_scores_features")
%%local
featurestore.visualize_featuregroup_correlations("attendances_features")
%%local
featurestore.visualize_featuregroup_correlations("games_features")
%%local
featurestore.visualize_training_dataset_correlations("team_position_prediction")
%%local
featurestore.visualize_featuregroup_clusters("games_features")
%%local
featurestore.visualize_featuregroup_clusters("attendances_features")
%%local
featurestore.visualize_featuregroup_clusters("season_scores_features")
%%local
featurestore.visualize_featuregroup_clusters("teams_features")
%%local
featurestore.visualize_featuregroup_clusters("players_features")
%%local
featurestore.visualize_training_dataset_clusters("team_position_prediction")
%%local
desc_stats_df = featurestore.visualize_training_dataset_descriptive_stats("team_position_prediction")
desc_stats_df.head()
/srv/hops/anaconda/anaconda/envs/python36/lib/python3.6/site-packages/autovizwidget/widget/utils.py:50: FutureWarning: A future version of pandas will default to `skipna=True`. To silence this warning, pass `skipna=True|False` explicitly.
%%local
desc_stats_df = featurestore.visualize_featuregroup_descriptive_stats("games_features")
desc_stats_df.head()
%%local
desc_stats_df = featurestore.visualize_featuregroup_descriptive_stats("attendances_features")
desc_stats_df.head()
%%local
desc_stats_df = featurestore.visualize_featuregroup_descriptive_stats("season_scores_features")
desc_stats_df.head()
%%local
desc_stats_df = featurestore.visualize_featuregroup_descriptive_stats("teams_features")
desc_stats_df.head()
%%local
desc_stats_df = featurestore.visualize_featuregroup_descriptive_stats("players_features")
desc_stats_df.head()
spark driver or executor
Since the notebook server talks remotely to the spark driver or executor, we can't expect visualizations to work out of the box. What you can do, however, is set the flag plot=False, which returns the figure instead of plotting it. You can then save the figure as a .png or .jpg to HopsFS and access it later on.
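The cells below repeat the same three steps for every figure: get the figure with plot=False, save it to a local file, and copy that file to HopsFS. A minimal sketch of the last two steps as one helper follows. The name save_and_upload and its parameters are hypothetical, and the upload function is passed in so that the sketch does not assume anything about the hops API beyond the hdfs.copy_to_hdfs call already used in this notebook.

```python
def save_and_upload(fig, name, upload_fn, remote_dir="Resources/"):
    """Save fig as <name>.png on the local disk, then hand the file to upload_fn.

    fig is any object with a savefig(path) method, such as a matplotlib figure.
    upload_fn(local_path, remote_dir) is expected to copy the file to HopsFS.
    """
    local_path = name + ".png"
    fig.savefig(local_path)
    upload_fn(local_path, remote_dir)
    return local_path


# Intended use in a pyspark cell, after `from hops import featurestore, hdfs`:
# fig = featurestore.visualize_featuregroup_distributions("teams_features", plot=False)
# save_and_upload(
#     fig,
#     "teams_features_distributions",
#     lambda path, target: hdfs.copy_to_hdfs(path, target, overwrite=True),
# )
```

Passing the upload step in as a function also makes it easy to swap in a different destination without touching the saving logic.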
fig = featurestore.visualize_featuregroup_distributions("teams_features", plot=False)
fig.savefig("teams_features_distributions.png")
hdfs.copy_to_hdfs("teams_features_distributions.png", "Resources/", overwrite=True)
Started copying local path teams_features_distributions.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//teams_features_distributions.png Finished copying
fig = featurestore.visualize_featuregroup_distributions("games_features", plot=False)
fig.savefig("games_features_distributions.png")
hdfs.copy_to_hdfs("games_features_distributions.png", "Resources/", overwrite=True)
Started copying local path games_features_distributions.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//games_features_distributions.png Finished copying
fig = featurestore.visualize_featuregroup_distributions("season_scores_features", plot=False)
fig.savefig("season_scores_features_distributions.png")
hdfs.copy_to_hdfs("season_scores_features_distributions.png", "Resources/", overwrite=True)
Started copying local path season_scores_features_distributions.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//season_scores_features_distributions.png Finished copying
fig = featurestore.visualize_featuregroup_distributions("attendances_features", plot=False)
fig.savefig("attendances_features_distributions.png")
hdfs.copy_to_hdfs("attendances_features_distributions.png", "Resources/", overwrite=True)
Started copying local path attendances_features_distributions.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//attendances_features_distributions.png Finished copying
fig = featurestore.visualize_featuregroup_distributions("players_features", plot=False)
fig.savefig("players_features_distributions.png")
hdfs.copy_to_hdfs("players_features_distributions.png", "Resources/", overwrite=True)
Started copying local path players_features_distributions.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//players_features_distributions.png Finished copying
fig = featurestore.visualize_training_dataset_distributions("team_position_prediction", plot=False)
fig.savefig("team_position_prediction_distributions.png")
hdfs.copy_to_hdfs("team_position_prediction_distributions.png", "Resources/", overwrite=True)
Started copying local path team_position_prediction_distributions.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//team_position_prediction_distributions.png Finished copying
fig = featurestore.visualize_featuregroup_correlations("teams_features", plot=False)
fig.savefig("teams_features_correlations.png")
hdfs.copy_to_hdfs("teams_features_correlations.png", "Resources/", overwrite=True)
Started copying local path teams_features_correlations.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//teams_features_correlations.png Finished copying
fig = featurestore.visualize_featuregroup_correlations("games_features", plot=False)
fig.savefig("games_features_correlations.png")
hdfs.copy_to_hdfs("games_features_correlations.png", "Resources/", overwrite=True)
Started copying local path games_features_correlations.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//games_features_correlations.png Finished copying
fig = featurestore.visualize_featuregroup_correlations("attendances_features", plot=False)
fig.savefig("attendances_features_correlations.png")
hdfs.copy_to_hdfs("attendances_features_correlations.png", "Resources/", overwrite=True)
Started copying local path attendances_features_correlations.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//attendances_features_correlations.png Finished copying
fig = featurestore.visualize_featuregroup_correlations("season_scores_features", plot=False)
fig.savefig("season_scores_features_correlations.png")
hdfs.copy_to_hdfs("season_scores_features_correlations.png", "Resources/", overwrite=True)
Started copying local path season_scores_features_correlations.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//season_scores_features_correlations.png Finished copying
fig = featurestore.visualize_featuregroup_correlations("players_features", plot=False)
fig.savefig("players_features_correlations.png")
hdfs.copy_to_hdfs("players_features_correlations.png", "Resources/", overwrite=True)
Started copying local path players_features_correlations.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//players_features_correlations.png Finished copying
fig = featurestore.visualize_training_dataset_correlations("team_position_prediction", plot=False)
fig.savefig("team_position_prediction_correlations.png")
hdfs.copy_to_hdfs("team_position_prediction_correlations.png", "Resources/", overwrite=True)
Started copying local path team_position_prediction_correlations.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//team_position_prediction_correlations.png Finished copying
fig = featurestore.visualize_featuregroup_clusters("teams_features", plot=False)
fig.savefig("teams_features_clusters.png")
hdfs.copy_to_hdfs("teams_features_clusters.png", "Resources/", overwrite=True)
Started copying local path teams_features_clusters.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//teams_features_clusters.png Finished copying
fig = featurestore.visualize_featuregroup_clusters("players_features", plot=False)
fig.savefig("players_features_clusters.png")
hdfs.copy_to_hdfs("players_features_clusters.png", "Resources/", overwrite=True)
Started copying local path players_features_clusters.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//players_features_clusters.png Finished copying
fig = featurestore.visualize_featuregroup_clusters("attendances_features", plot=False)
fig.savefig("attendances_features_clusters.png")
hdfs.copy_to_hdfs("attendances_features_clusters.png", "Resources/", overwrite=True)
Started copying local path attendances_features_clusters.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//attendances_features_clusters.png Finished copying
fig = featurestore.visualize_featuregroup_clusters("season_scores_features", plot=False)
fig.savefig("season_scores_features_clusters.png")
hdfs.copy_to_hdfs("season_scores_features_clusters.png", "Resources/", overwrite=True)
Started copying local path season_scores_features_clusters.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//season_scores_features_clusters.png Finished copying
fig = featurestore.visualize_featuregroup_clusters("games_features", plot=False)
fig.savefig("games_features_clusters.png")
hdfs.copy_to_hdfs("games_features_clusters.png", "Resources/", overwrite=True)
Started copying local path games_features_clusters.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//games_features_clusters.png Finished copying
fig = featurestore.visualize_training_dataset_clusters("team_position_prediction", plot=False)
fig.savefig("team_position_prediction_clusters.png")
hdfs.copy_to_hdfs("team_position_prediction_clusters.png", "Resources/", overwrite=True)
Started copying local path team_position_prediction_clusters.png to hdfs path hdfs://10.0.2.15:8020/Projects/demo_featurestore_admin000/Resources//team_position_prediction_clusters.png Finished copying
desc_stats_df = featurestore.visualize_featuregroup_descriptive_stats("teams_features")
desc_stats_df.head()
   metric  team_budget   team_id  team_position
0  stddev    5238.9430  14.57738       14.57738
1     min     760.8729   1.00000        1.00000
2    mean    8723.2920  25.50000       25.50000
3   count      50.0000  50.00000       50.00000
4     max   21319.5330  50.00000       50.00000
desc_stats_df = featurestore.visualize_featuregroup_descriptive_stats("games_features")
desc_stats_df.head()
   metric  home_team_id         score  away_team_id
0  stddev     14.334562      0.918359     14.504639
1     min      1.000000      1.000000      1.000000
2    mean     25.589900      2.001300     25.576800
3   count  10000.000000  10000.000000  10000.000000
4     max     50.000000      3.000000     50.000000
desc_stats_df = featurestore.visualize_featuregroup_descriptive_stats("season_scores_features")
desc_stats_df.head()
   metric  sum_position   team_id  average_position
0  stddev      306.1239  14.57738         15.306195
1     min      542.0000   1.00000         27.100000
2    mean     1009.7800  25.50000         50.489000
3   count       50.0000  50.00000         50.000000
4     max     1607.0000  50.00000         80.350000
desc_stats_df = featurestore.visualize_featuregroup_descriptive_stats("attendances_features")
desc_stats_df.head()
   metric  sum_attendance   team_id  average_attendance
0  stddev      284646.200  14.57738          14232.3090
1     min       20770.475   1.00000           1038.5237
2    mean      173387.880  25.50000           8669.3940
3   count          50.000  50.00000             50.0000
4     max     1846021.800  50.00000          92301.0860
desc_stats_df = featurestore.visualize_featuregroup_descriptive_stats("players_features")
desc_stats_df.head()
   metric  sum_player_rating  ...  sum_player_age  average_player_age
0  stddev         118708.750  ...        52.29132            0.522913
1     min          15096.327  ...      2434.00000           24.340000
2    mean          71738.375  ...      2556.84000           25.568400
3   count             50.000  ...        50.00000           50.000000
4     max         719186.300  ...      2700.00000           27.000000

[5 rows x 8 columns]
desc_stats_df = featurestore.visualize_training_dataset_descriptive_stats("team_position_prediction")
desc_stats_df.head()
   metric  team_budget  ...  sum_player_rating  average_attendance
0  stddev    5238.9430  ...         118708.750          14232.3090
1     min     760.8729  ...          15096.327           1038.5237
2    mean    8723.2920  ...          71738.375           8669.3940
3   count      50.0000  ...             50.000             50.0000
4     max   21319.5330  ...         719186.300          92301.0860

[5 rows x 13 columns]