Modify the report's structure

In this notebook we look at two use cases that modify the existing report structure: reordering the report sections and splitting up large reports. Both use cases are based on actual user inquiries. The datasets used in this notebook are obtained through the Kaggle API. If you haven't done so already, set up your API credentials first.

The autoreload instruction reloads modules automatically before code execution, which is helpful for the update below.

In [ ]:
%load_ext autoreload
%autoreload 2

Make sure that we have the latest version of pandas-profiling.

In [ ]:
import sys

!{sys.executable} -m pip install -U pandas-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension

You might want to restart the kernel now.

In [ ]:
from pathlib import Path

import kaggle

from pandas_profiling.utils.common import extract_zip

kaggle.api.authenticate()

Reorder Sections

We can reorder sections by editing the report structure directly. First, we need to generate a profile report.

In [ ]:
# We are using the Craigslist Carstrucks Data
vehicles_dataset = Path("craigslist-carstrucks-data/vehicles.csv")

# Download and extract (~228M)
if not vehicles_dataset.exists():
    kaggle.api.dataset_download_files(
        "austinreese/craigslist-carstrucks-data",
        path="craigslist-carstrucks-data",
        quiet=False,
    )

    extract_zip(
        "craigslist-carstrucks-data/craigslist-carstrucks-data.zip",
        "craigslist-carstrucks-data/",
    )
In [ ]:
import pandas as pd

from pandas_profiling import ProfileReport

# For our demonstration, we only take a fraction of the dataset
df = pd.read_csv(vehicles_dataset, nrows=100)

# Generate the profile report
vehicles_report = ProfileReport(df)

The structure of the report is stored in the report attribute. The report is essentially a tree object. We inspect the root of the report.

In [ ]:
print(repr(vehicles_report.report))

We can see that the root object is of the type "Sequence". Sequence types have at least a name and a list of items, which this notebook accesses as name and content["items"].

In [ ]:
print(vehicles_report.report.content)
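
To see more than the root, a small recursive walk over the tree can help. The sketch below is not part of pandas-profiling: Node is a hypothetical stand-in that exposes only name and content["items"], the two things this notebook relies on, and outline is an illustrative helper. On a real report some nodes may not follow this shape, so treat it as a starting point.

```python
class Node:
    """Hypothetical stand-in for a report node (not the library class)."""

    def __init__(self, name, items=None):
        self.name = name
        self.content = {"name": name}
        # Only container-like nodes carry a list of children
        if items is not None:
            self.content["items"] = items


def outline(node, depth=0):
    """Return one indented line per node, visiting the tree depth-first."""
    lines = ["  " * depth + str(node.name)]
    for child in node.content.get("items", []):
        lines.extend(outline(child, depth + 1))
    return lines


# Illustrative tree; a real report has more sections and deeper nesting
root = Node("Root", [Node("Overview", [Node("Dataset info")]), Node("Variables", [])])
print("\n".join(outline(root)))
```

Printing the outline before editing the structure makes it easier to pick the right indices or names in the steps that follow.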

In this example, we would like to move the samples section up so that it directly follows the overview. The reordered sequence items are then:

  • Overview
  • Samples
  • Missing
  • Variables
  • Interactions
  • Correlations
In [ ]:
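# Reorder the sections: move the item at index 5 to position 1
# (the indices assume six top-level sections; check the printed structure above)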
vehicles_report.report.content["items"] = [
    vehicles_report.report.content["items"][i] for i in [0, 5, 1, 2, 3, 4]
]
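
Hard-coded indices break as soon as the number or order of sections changes, for example when a section is absent for a given dataset. A name-based variant is sketched below under some assumptions: move_to is a hypothetical helper, the sections are stand-in objects that only have a name attribute, and the section names and their initial order are illustrative rather than taken from a real report.

```python
from types import SimpleNamespace


def move_to(items, name, position):
    """Return a new list with the first item called `name` moved to `position`.

    The comparison is case-insensitive. If no item matches, an unchanged
    copy of the list is returned.
    """
    matches = [item for item in items if item.name.lower() == name.lower()]
    if not matches:
        return list(items)
    target = matches[0]
    rest = [item for item in items if item is not target]
    rest.insert(position, target)
    return rest


# Stand-in sections; names and order are illustrative only
sections = [
    SimpleNamespace(name=n)
    for n in ["Overview", "Missing", "Variables", "Interactions", "Correlations", "Samples"]
]

reordered = move_to(sections, "samples", 1)
print([s.name for s in reordered])
```

With a real report, the returned list would be assigned back to the report's content["items"], in the same way as the index-based cell.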

Finally, we can render the report and see that the changes have taken place:

In [ ]:
vehicles_report.to_notebook_iframe()

Split Profile Reports

When profiling large datasets, a monolithic HTML file can become enormous. Using the report structure generated by pandas-profiling, we can create a modular report instead. Here we demonstrate how to split a profile report into multiple files, one per section. We start by generating the report's structure in the usual way, with minimal mode enabled (minimal=True). This step may take a few minutes.

In [ ]:
# We are using the IEEE Fraud Detection transaction training data
ieee_dataset = Path("ieee-fraud-detection/train_transaction.csv")

# Download and extract (~118M)
if not ieee_dataset.exists():
    kaggle.api.competition_download_files(
        "ieee-fraud-detection", path="ieee-fraud-detection", quiet=False
    )

    extract_zip(
        "ieee-fraud-detection/ieee-fraud-detection.zip",
        "ieee-fraud-detection/",
    )
In [ ]:
import pandas as pd

from pandas_profiling import ProfileReport

# Read the dataset
df = pd.read_csv(ieee_dataset)

# Generate the profile report
ieee_fraud_report = ProfileReport(df, minimal=True)
In [ ]:
print(repr(ieee_fraud_report.report))
In [ ]:
print(repr(ieee_fraud_report.report.content))
In [ ]:
from copy import deepcopy

# Make a copy of the original report structure
original_report_structure = deepcopy(ieee_fraud_report.report)
In [ ]:
# Loop over each section
for section in original_report_structure.content["items"]:
    # Optional: to skip empty sections, guard the body of the loop with
    #     if len(section.content['items']) > 0:
    # Set the report structure
    ieee_fraud_report.report = deepcopy(original_report_structure)
    # Overwrite the list of sections with the single section we would like to print
    ieee_fraud_report.report.content["items"] = [section]
    # Output the report to HTML
    ieee_fraud_report.to_file(f"ieee_fraud_report_section_{section.name.lower()}.html")
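
Section names can contain spaces or other characters that are awkward in file names. The sketch below shows one way to derive a safe file name from a section name. The helper section_filename and the example name "Missing values" are assumptions for illustration; only the file name prefix is taken from the cell above.

```python
import re


def section_filename(name, prefix="ieee_fraud_report_section_"):
    """Build an HTML file name from a section name.

    Lowercases the name and collapses every run of characters other than
    a-z and 0-9 into a single underscore.
    """
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return f"{prefix}{slug}.html"


print(section_filename("Overview"))
print(section_filename("Missing values"))
```

The result could be passed to to_file in place of the f-string used in the loop.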

Paginate variables

We can use the same approach to paginate variables:

In [ ]:
# Restore the full structure (the loop above left only the last section in place)
ieee_fraud_report.report = original_report_structure

# Number of variables per page
page_size = 25

# The Root node, which is a sequence of sections
print(repr(ieee_fraud_report.report.content["items"]))

# The variables
variable_section = ieee_fraud_report.report.content["items"][1]
variables = variable_section.content["items"]
variable_count = len(variables)
print(f"Number of variables: {variable_count}")

# Reset the report structure
ieee_fraud_report.report = deepcopy(original_report_structure)

# Only keep the variables section
ieee_fraud_report.report.content["items"] = [
    ieee_fraud_report.report.content["items"][1]
]

for page_num, variable_page in enumerate(
    [variables[i : i + page_size] for i in range(0, variable_count, page_size)]
):
    print(f"Write page {page_num}")
    # Set the report title
    ieee_fraud_report.title = (
        f"IEEE Fraud Detection Dataset, Variables, Page {page_num}"
    )

    # Overwrite the list of variables with the variables for this page
    ieee_fraud_report.report.content["items"][0].content["items"] = variable_page

    # Output the report to HTML
    ieee_fraud_report.to_file(f"ieee_fraud_report_variables_page_{page_num}.html")
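
The slicing used for pagination can be checked in isolation. The sketch below uses a plain list of strings in place of the variable items; paginate is a hypothetical helper and the count of 60 variables is made up for the example.

```python
import math


def paginate(items, page_size):
    """Yield consecutive slices of `items`, each with at most `page_size` elements."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    for start in range(0, len(items), page_size):
        yield items[start : start + page_size]


# Stand-in for the variable items of a report (made-up count)
variables = [f"V{i}" for i in range(1, 61)]
page_size = 25

pages = list(paginate(variables, page_size))
expected_pages = math.ceil(len(variables) / page_size)

print(f"{len(pages)} pages (expected {expected_pages})")
print([len(page) for page in pages])
```

The last page holds the remainder, so it is usually shorter than the others. Page numbers from enumerate start at 0; pass start=1 to enumerate if the titles and file names should count from 1.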

In this notebook we have seen several ways of manipulating the report structure: reordering sections, splitting a report by section, and paginating variables. Advanced users may alter the structure in other ways we have not touched, such as exploring deeper parts of the tree structure or inserting and deleting objects.