import pandas as pd
# ensure that all columns are shown and that colum content is not cut
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width',1000)
pd.set_option('display.max_rows', 500) # ensure that all rows are shown
This Notebook gives some ideas how you could customize the standardizer classes.
All three standardizer classes BalanceSheetStandardizer
, IncomeStatementStandardizer
, and CashFlowStandardizer
are derived from the same base class Standardizer
and share the same constructor parameters. In fact, the whole behavior of the standardizer is defined by these parameters and the three standardizer classes are just containers which define the values for the constructor parameters but do not define additional methods or overwrite existing methods. So, it is simply a configuration of the base class.
Since every constructor parameter can be overwritten when instantiating one of the three standardizer classes, you can customize the standardizer in three ways:
BalanceSheetStandardizer
, IncomeStatementStandardizer
, or CashFlowStandardizer
. A simply way, for instance, to adapt the list of tags/columns that should appear in the final result.BalanceSheetStandardizer
, IncomeStatementStandardizer
, or CashFlowStandardizer
and redefine certain, more complex rules. For instance, maybe you want to define additional Validation
rules, or you want to change the Post
rules so that NaN-values are not set to zero but instead stay undefined.Standardizer
and define everything yourself.==========================================================
If you find this tool useful, a sponsorship would be greatly appreciated!
https://github.com/sponsors/HansjoergW
How to get in touch
==========================================================
The following simple bascic constructors are available to change some details of the behavior.
filter_for_main_statement
¶A quaterly or annual report usually contains many different tables with data. Beside the tables with the primary financial information (Balance Sheet, Income Statement, or the CashFlow) there tables that often contain part of the information from the primary financial statements. Usually, however, you are just interested in the tables that contain the primary financial information.
If this flag is set to true (which is the default value), only the table that contains most data points that generally belong to the appropriate statement, will be returned in the result set.
additional_final_sub_fields
¶When you call the process
method of a standardizer, you will receive a restulting dataframe that just contains the adsh
column as an identifier. In contrary, when you use the present
method, the resulting data frame is enriched with additional information from the sub_df. By default, these are the columns cik
, name
(the last registered name of the company), form
(either 10-K or 10Q), fye
(the financial year ending as MMDD), fy
(the financial year to which the report belongs), fp
(the financial period Q1, Q2, Q3, or FY), filed
(date when the report was filed with the SEC as an integer value in the format YYYYMMDD), data
(same as filed
but as areal date format).
However, there are many more columns in the sub_df available (like contact information). So if you would like to have the zip code of the town where the company is based, you can define this with the additional_final_sub_fields
parameter:
bs_standardizer = BalanceSheetStandardizer(additional_final_sub_fields=['zipba'])
result_df = bs_standardizer.present(joined_bag)
# or via the get_standardize_bag
bs_standardizer.get_standardize_bag().result_df
additional_final_tags
¶Every standardizer defines an internal list final_tags
which defines the tags (resp. the columns) that are contained in the data frame that is returned. This columns are only a subset and sometimes aggregated fields of the fields that actually are avaiable. As the name standardizer suggest, the goal is to just provide information that is available in most of the reports.
There may be situations, when you would like to have additional tags returned as well. For instance, instead of just having LiabilitiesNoncurrent
, you might also be interested in the LongTermDebt
. This is possible by defining the additional_final_tags
parameter:
bs_standardizer = BalanceSheetStandardizer(additional_final_tags=['LongTermDebt'])
result_df = bs_standardizer.present(joined_bag)
# or via the get_standardize_bag
bs_standardizer.get_standardize_bag().result_df
final_tags
¶Instead of just adding additional final tags with the additional_final_tags
parameter, you can redefine the whole list directly with final_tags
parameter. For instance, if you want to remove certain tags from the final result, or if you want them to appear in a certain order.
# The default list is
# ['Assets', 'AssetsCurrent', 'Cash', 'AssetsNoncurrent',
# 'Liabilities', 'LiabilitiesCurrent', 'LiabilitiesNoncurrent',
# 'Equity',
# 'HolderEquity',
# 'RetainedEarnings',
# 'AdditionalPaidInCapital',
# 'TreasuryStockValue',
# 'TemporaryEquity',
# 'RedeemableEquity',
# 'LiabilitiesAndEquity']
# However, we are only interested in a subset of it and in a different order, so we adapt final_tags
bs_standardizer = BalanceSheetStandardizer(final_tags=['LiabilitiesCurrent', 'LiabilitiesNoncurrent', 'Liabilities', 'AssetsCurrent', 'AssetsNoncurrent', 'Assets'])
result_df = bs_standardizer.present(joined_bag)
# or via the get_standardize_bag
bs_standardizer.get_standardize_bag().result_df
Subclassing makes sense when you want to change the more complex parameters. For instance, the definition of rules. Of course, you could also directly do that just by changing the constructor parameter as explained above, but it might make more sense to encapsulate more complex definitions within a special class.
The following example shows, how we could change the definition of the Post
rules, so that unset values are not set to zero at the end of the process in the BalanceSheetStandardizer
. So, we simply remove all PostSetToZero
entries from the original definition of the post_rule_tree
.
class NoSetToZeroBalanceSheetStandardizer(BalanceSheetStandardizer):
# redefined post_rule_tree without any PostSetToZero rules
post_rule_tree = RuleGroup(prefix="BS_POST",
rules=[
# if only Assets is sets, set the AssetsCurrent to value
# of Assets and AssetsNoncurrent to 0
PostCopyToFirstSummand(sum_tag='Assets',
first_summand='AssetsCurrent',
other_summands=[
'AssetsNoncurrent']),
# if only Liabilities is sets, set the LiabilitiesCurrent to
# value of Liabilities and LiabilitiesNoncurrent to 0
PostCopyToFirstSummand(sum_tag='Liabilities',
first_summand='LiabilitiesCurrent',
other_summands=[
'LiabilitiesNoncurrent']),
])
def __init__():
super().__init__(
post_rule_tree=post_rule_tree
)
A crucial thing when standardizing reports is to know or to find out which tags we care about.
As mentioned in other places, many different tags can be used in a report, some have a similar meaning, a lot of them have a hierachical order, sometimes some of them are misused, and so on.
Since we actually have all the data for a certain report (for instance the balance sheet) it is quite esay to get a first impression about the number of tags and how often they are used. In order to analyze, we use the data set that was produced in the notebook 06_bulk_data_processing_deep_dive.ipynb
from secfsdstools.d_container.databagmodel import JoinedDataBag
print("loading data ...")
all_bs_joinedbag:JoinedDataBag = JoinedDataBag.load(target_path="set/parallel/BS/joined")
loading data ...
First, lets see how many tags there are in total and how often they seem to be used
print("number of different tags used in a BalanceSheets:", len(all_bs_joinedbag.pre_num_df.tag.unique()))
number of different tags used in a BalanceSheets: 3148
Now, an average balance sheet has about 20 to 30 positions and in this data set we have about 320'000 balance sheets. As you can imagine, there is a lot if "individuality" in these reports.
So, let's find out, which are the more "common" tags that are used.
counts_df = all_bs_joinedbag.pre_num_df.tag.value_counts()
counts_df.head(50)
StockholdersEquity 1110491 Assets 709747 StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest 575385 LiabilitiesAndStockholdersEquity 394086 InvestmentOwnedAtFairValue 389453 PropertyPlantAndEquipmentNet 375284 CommonStockSharesOutstanding 371743 CashAndCashEquivalentsAtCarryingValue 369591 CommonStockSharesIssued 366020 Goodwill 356448 CommonStockValue 351803 Liabilities 348181 CommonStockSharesAuthorized 348130 InvestmentOwnedAtCost 336960 CommonStockParOrStatedValuePerShare 335472 RetainedEarningsAccumulatedDeficit 329198 AvailableForSaleSecuritiesDebtSecurities 318473 AssetsCurrent 304570 LiabilitiesCurrent 302623 AccumulatedOtherComprehensiveIncomeLossNetOfTax 300826 AvailableForSaleSecurities 264330 PreferredStockSharesAuthorized 236841 PreferredStockValue 225513 AccountsPayableCurrent 217727 PreferredStockSharesIssued 216400 LoansAndLeasesReceivableNetOfDeferredIncome 211806 PreferredStockParOrStatedValuePerShare 211325 CommitmentsAndContingencies 207112 AccountsReceivableNetCurrent 205625 PreferredStockSharesOutstanding 202550 OtherAssetsNoncurrent 200911 AdditionalPaidInCapital 181274 InventoryNet 171734 IntangibleAssetsNetExcludingGoodwill 168389 OtherLiabilitiesNoncurrent 159061 AccruedLiabilitiesCurrent 144543 PrepaidExpenseAndOtherAssetsCurrent 137482 LongTermDebtNoncurrent 132520 AdditionalPaidInCapitalCommonStock 131248 LoansAndLeasesReceivableAllowance 119182 LoansAndLeasesReceivableGrossCarryingAmount 119165 FinancingReceivableExcludingAccruedInterestBeforeAllowanceForCreditLoss 108309 FiniteLivedIntangibleAssetsNet 105873 MinorityInterest 100242 TreasuryStockValue 98472 OtherAssets 91265 OtherAssetsCurrent 88988 NotesReceivableGross 88965 LongTermDebt 87933 OperatingLeaseRightOfUseAsset 85842 Name: tag, dtype: int64
Maybe we are interested in a certain group of tags, like Assets. So lets see what tags we have that have "Assets" in their name:
all_bs_joinedbag.pre_num_df.tag[all_bs_joinedbag.pre_num_df.tag.str.contains("Assets")].value_counts()[:50] # only show the top 50
Assets 709747 AssetsCurrent 304570 OtherAssetsNoncurrent 200911 IntangibleAssetsNetExcludingGoodwill 168389 PrepaidExpenseAndOtherAssetsCurrent 137482 FiniteLivedIntangibleAssetsNet 105873 OtherAssets 91265 OtherAssetsCurrent 88988 DerivativeAssets 44517 DeferredTaxAssetsNetCurrent 36946 DeferredIncomeTaxAssetsNet 35487 FiniteLivedIntangibleAssetsAccumulatedAmortization 34219 DeferredTaxAssetsNetNoncurrent 31841 RegulatoryAssetsNoncurrent 30960 AssetsNoncurrent 24428 PrepaidExpenseAndOtherAssets 18703 AssetsOfDisposalGroupIncludingDiscontinuedOperationCurrent 17427 DerivativeAssetsCurrent 16088 SeparateAccountAssets 13668 RegulatoryAssetsCurrent 12606 DepositsAssetsNoncurrent 12044 DerivativeAssetsNoncurrent 11716 AssetsHeldInTrustNoncurrent 11390 IntangibleAssetsNetIncludingGoodwill 11171 DeferredTaxAssetsLiabilitiesNetNoncurrent 10513 OtherIntangibleAssetsNet 9503 OtherRealEstateAndForeclosedAssets 8885 AssetsOfDisposalGroupIncludingDiscontinuedOperation 6603 DeferredTaxAssetsNet 6397 DeferredTaxAssetsLiabilitiesNetCurrent 6357 IndefiniteLivedIntangibleAssetsExcludingGoodwill 5783 DeferredTaxAssetsLiabilitiesNet 5620 OtherAssetsFairValueDisclosure 5305 DisposalGroupIncludingDiscontinuedOperationAssetsNoncurrent 5132 DefinedBenefitPlanAssetsForPlanBenefitsNoncurrent 5123 ForeclosedAssets 4890 AssetsHeldForSaleNotPartOfDisposalGroupCurrent 4572 DecommissioningTrustAssetsAmount 4535 DepositsAssetsCurrent 4263 VariableInterestEntityConsolidatedCarryingAmountAssets 3903 AssetsNet 3600 DeferredCompensationPlanAssets 3349 AssetsHeldForSaleCurrent 3289 DepositAssets 3278 DepositsAssets 3224 AssetsHeldForSaleNotPartOfDisposalGroup 2836 InvestmentsAndOtherNoncurrentAssets 2576 AssetsSoldUnderAgreementsToRepurchaseRepurchaseLiability 2563 OtherAssetsMiscellaneousNoncurrent 2488 FiniteLivedIntangibleAssetsGross 2420 Name: tag, dtype: int64
Another great tool to get a better understanding about the tag hierarchy is ChatGPT. For instance, try out a prompt like the following:
Please visualize the hierarchy of the 50 most common xbrl tags in a balance sheet that belong to "Assets". Please create the visualization with ASCII characters only.
Result:
Total Assets
│
├── Current Assets
│ ├── Cash and Cash Equivalents
│ │ ├── CashAndCashEquivalentsAtCarryingValue
│ │ ├── CashAndCashEquivalentsRestrictedCashAndRestrictedCashEquivalents
│ │ └── CashAndCashEquivalentsAtCarryingValueIncludingDiscontinuedOperations
│ ├── Receivables
│ │ ├── AccountsReceivableNetCurrent
│ │ └── ReceivablesNetCurrent
│ ├── Inventories
│ │ ├── InventoryNet
│ │ └── InventoriesFinishedGoodsNetOfReserves
│ ├── Prepaid Expenses
│ │ └── PrepaidExpenseAndOtherAssetsCurrent
│ └── Other Current Assets
│ ├── OtherAssetsCurrent
│ └── AssetsHeldForSaleCurrent
│
├── Non-Current Assets
│ ├── Property, Plant, and Equipment (PP&E)
│ │ ├── PropertyPlantAndEquipmentNet
│ │ └── LandAndBuildingsNetOfAccumulatedDepreciation
│ ├── Intangible Assets
│ │ ├── Goodwill
│ │ └── IntangibleAssetsNetExcludingGoodwill
│ ├── Investments
│ │ ├── InvestmentsAndAdvances
│ │ └── InvestmentsNoncurrent
│ ├── Deferred Tax Assets
│ │ └── DeferredTaxAssetsNoncurrent
│ └── Other Non-Current Assets
│ ├── OtherAssetsNoncurrent
│ ├── OtherIntangibleAssetsNet
│ └── AssetsHeldForSaleNoncurrent
│
├── Restricted Assets
│ ├── RestrictedCashAndCashEquivalentsNoncurrent
│ ├── RestrictedCashAndCashEquivalentsAtCarryingValue
│ └── RestrictedCashAndInvestmentsNoncurrent
│
├── Financial Instruments
│ ├── MarketableSecurities
│ │ ├── MarketableSecuritiesCurrent
│ │ └── MarketableSecuritiesNoncurrent
│ ├── AvailableForSaleSecurities
│ │ ├── AvailableForSaleSecuritiesCurrent
│ │ └── AvailableForSaleSecuritiesNoncurrent
│ └── TradingSecurities
│ ├── TradingSecuritiesCurrent
│ └── TradingSecuritiesNoncurrent
│
└── Other Assets
├── OtherAssets
├── MiscellaneousAssets
└── DerivativeAssets
├── DerivativeAssetsCurrent
└── DerivativeAssetsNoncurrent
Is the tag RestrictedCashAndCashEquivalentsNoncurrent
actually being used?
sum(all_bs_joinedbag.pre_num_df.tag == "RestrictedCashAndCashEquivalentsNoncurrent")
15666
As it is with LLMs, sometimes information is made up, or does not actually reflect actual official tags. None the less, LLMs can help to get a good understanding of the hierarchy of tags.
Now, if you want to know for a certain report what tag it used to report a certain position, you can have a look at EDGAR at sec.gov itself: https://www.sec.gov/search-filings
So let us have a look at the latest 10-Q report of Apple.
In the overview for Apple (https://www.sec.gov/edgar/browse/?CIK=320193&owner=exclude), you can use the predefined filter Annual & quarterly reports
to show only the 10-K and 10-Q filings. I also like to see the Accession number
, so i marked that checkbox as well.
As of today (September 2024), the latest quaterly report was filed on 2nd August 2024 (accession number 0000320193-24-000081). So let us open its details (https://www.sec.gov/Archives/edgar/data/320193/000032019324000081/0000320193-24-000081-index.htm).
There are two paths to find the used tag for a certain position. Either you can use the Interactive Data
or you can view the real report as rendered html.
Let's first have a look at how to use the Interactive Data
. So, press on that button and navigate to the balance sheet:
The interactive data directly present the most important tables of a report as a nicely rendered tables.
For instance, let us find out what tag is used to report Deferred revenue
. To do that, we simply click on that entry which opens a dialog box that describes the position. We can then open the details to see which tag is used to report that value:
As we can see, the used tag to report that position is ContractWithCustomerLiabilityCurrent
.
Let's have a look at the second possible path. So, in the report overview instead of clicking on Interactive Data
click on the first link in the Document
column of the table. This opens the actual report.
You can either scroll down till you find the balance sheet, or you can click on Sections
, which opens the Tagged Sections
side bar; click on Financial Statements
and then Condensed Consolidated Balance Sheets
.
This scrolls down to the balance sheet. And here again, you can directly click on a Fact
(a value), which opens a little dialog with detailed information about that position.