How we can use the huggingface_hub library to explore metadata on the Hugging Face Hub.
The Hugging Face Hub has become the de facto place to share machine learning models and datasets.
As the number of models and datasets grows, finding the right model or dataset for your needs becomes more challenging.
There are various ways in which we can try to make it easier for people to find relevant models and datasets.
One of these is by associating metadata with datasets and models.
This blog post will (very briefly) begin to explore metadata on the Hugging Face Hub.
Often you'll want to explore models and datasets via the Hub website, but this isn't the only way to explore the Hub.
As part of exploring metadata on the Hugging Face Hub, we'll briefly look at how we can use the huggingface_hub library to programmatically interact with the Hub.
For this post we'll need a few libraries: pandas, requests and matplotlib are likely old friends (or foes...). The huggingface_hub library might be new to you but will soon become a good friend too! The rich library is fantastically useful for quickly getting familiar with a library (i.e. avoiding reading all the docs!) so we'll import that too.
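If you don't already have these libraries installed, something along these lines should work from a notebook cell (a sketch using the standard PyPI package names; adjust to your environment):
%pip install huggingface_hub pandas matplotlib rich requests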
import requests
from huggingface_hub import hf_api
import pandas as pd
import matplotlib.pyplot as plt
import rich
%matplotlib inline
plt.style.use("ggplot")
Next, we'll create an instance of the HfApi class.
api = hf_api.HfApi()
We can use rich's inspect function to get a better sense of what a function or class instance is all about. Let's see what methods the api has.
rich.inspect(api, methods=True)
╭──────────────────────────────────── <class 'huggingface_hub.hf_api.HfApi'> ─────────────────────────────────────╮ │ ╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │ │ │ <huggingface_hub.hf_api.HfApi object at 0x136a2ce80> │ │ │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │ │ │ │ endpoint = 'https://huggingface.co' │ │ token = None │ │ change_discussion_status = def change_discussion_status(repo_id: str, discussion_num: int, new_status: │ │ Literal['open', 'closed'], *, token: Optional[str] = None, comment: Optional[str] = │ │ None, repo_type: Optional[str] = None) -> │ │ huggingface_hub.community.DiscussionStatusChange: Closes or re-opens a Discussion or │ │ Pull Request. │ │ comment_discussion = def comment_discussion(repo_id: str, discussion_num: int, comment: str, *, token: │ │ Optional[str] = None, repo_type: Optional[str] = None) -> │ │ huggingface_hub.community.DiscussionComment: Creates a new comment on the given │ │ Discussion. │ │ create_branch = def create_branch(repo_id: str, *, branch: str, token: Optional[str] = None, │ │ repo_type: Optional[str] = None) -> None: Create a new branch from `main` on a repo │ │ on the Hub. │ │ create_commit = def create_commit(repo_id: str, operations: │ │ Iterable[Union[huggingface_hub._commit_api.CommitOperationAdd, │ │ huggingface_hub._commit_api.CommitOperationDelete]], *, commit_message: str, │ │ commit_description: Optional[str] = None, token: Optional[str] = None, repo_type: │ │ Optional[str] = None, revision: Optional[str] = None, create_pr: Optional[bool] = │ │ None, num_threads: int = 5, parent_commit: Optional[str] = None) -> │ │ huggingface_hub.hf_api.CommitInfo: Creates a commit in the given repo, deleting & │ │ uploading files as needed. │ │ create_discussion = def create_discussion(repo_id: str, title: str, *, token: Optional[str] = None, │ │ description: Optional[str] = None, repo_type: Optional[str] = None, pull_request: │ │ bool = False) -> huggingface_hub.community.DiscussionWithDetails: Creates a │ │ Discussion or Pull Request. │ │ create_pull_request = def create_pull_request(repo_id: str, title: str, *, token: Optional[str] = None, │ │ description: Optional[str] = None, repo_type: Optional[str] = None) -> │ │ huggingface_hub.community.DiscussionWithDetails: Creates a Pull Request . Pull │ │ Requests created programmatically will be in `"draft"` status. │ │ create_repo = def create_repo(repo_id: str, *, token: Optional[str] = None, private: bool = False, │ │ repo_type: Optional[str] = None, exist_ok: bool = False, space_sdk: Optional[str] = │ │ None) -> str: Create an empty repo on the HuggingFace Hub. │ │ create_tag = def create_tag(repo_id: str, *, tag: str, tag_message: Optional[str] = None, │ │ revision: Optional[str] = None, token: Optional[str] = None, repo_type: │ │ Optional[str] = None) -> None: Tag a given commit of a repo on the Hub. │ │ dataset_info = def dataset_info(repo_id: str, *, revision: Optional[str] = None, timeout: │ │ Optional[float] = None, files_metadata: bool = False, token: Union[bool, str, │ │ NoneType] = None) -> huggingface_hub.hf_api.DatasetInfo: Get info on one specific │ │ dataset on huggingface.co. │ │ delete_branch = def delete_branch(repo_id: str, *, branch: str, token: Optional[str] = None, │ │ repo_type: Optional[str] = None) -> None: Delete a branch from a repo on the Hub. 
│ │ delete_file = def delete_file(path_in_repo: str, repo_id: str, *, token: Optional[str] = None, │ │ repo_type: Optional[str] = None, revision: Optional[str] = None, commit_message: │ │ Optional[str] = None, commit_description: Optional[str] = None, create_pr: │ │ Optional[bool] = None, parent_commit: Optional[str] = None) -> │ │ huggingface_hub.hf_api.CommitInfo: Deletes a file in the given repo. │ │ delete_folder = def delete_folder(path_in_repo: str, repo_id: str, *, token: Optional[str] = None, │ │ repo_type: Optional[str] = None, revision: Optional[str] = None, commit_message: │ │ Optional[str] = None, commit_description: Optional[str] = None, create_pr: │ │ Optional[bool] = None, parent_commit: Optional[str] = None) -> │ │ huggingface_hub.hf_api.CommitInfo: Deletes a folder in the given repo. │ │ delete_repo = def delete_repo(repo_id: str, *, token: Optional[str] = None, repo_type: │ │ Optional[str] = None): Delete a repo from the HuggingFace Hub. CAUTION: this is │ │ irreversible. │ │ delete_tag = def delete_tag(repo_id: str, *, tag: str, token: Optional[str] = None, repo_type: │ │ Optional[str] = None) -> None: Delete a tag from a repo on the Hub. │ │ edit_discussion_comment = def edit_discussion_comment(repo_id: str, discussion_num: int, comment_id: str, │ │ new_content: str, *, token: Optional[str] = None, repo_type: Optional[str] = None) │ │ -> huggingface_hub.community.DiscussionComment: Edits a comment on a Discussion / │ │ Pull Request. │ │ get_dataset_tags = def get_dataset_tags() -> huggingface_hub.utils.endpoint_helpers.DatasetTags: Gets │ │ all valid dataset tags as a nested namespace object. │ │ get_discussion_details = def get_discussion_details(repo_id: str, discussion_num: int, *, repo_type: │ │ Optional[str] = None, token: Optional[str] = None) -> │ │ huggingface_hub.community.DiscussionWithDetails: Fetches a Discussion's / Pull │ │ Request 's details from the Hub. │ │ get_full_repo_name = def get_full_repo_name(model_id: str, *, organization: Optional[str] = None, token: │ │ Union[bool, str, NoneType] = None): │ │ Returns the repository name for a given model ID and optional │ │ organization. │ │ get_model_tags = def get_model_tags() -> huggingface_hub.utils.endpoint_helpers.ModelTags: Gets all │ │ valid model tags as a nested namespace object │ │ get_repo_discussions = def get_repo_discussions(repo_id: str, *, repo_type: Optional[str] = None, token: │ │ Optional[str] = None) -> Iterator[huggingface_hub.community.Discussion]: Fetches │ │ Discussions and Pull Requests for the given repo. │ │ hide_discussion_comment = def hide_discussion_comment(repo_id: str, discussion_num: int, comment_id: str, *, │ │ token: Optional[str] = None, repo_type: Optional[str] = None) -> │ │ huggingface_hub.community.DiscussionComment: Hides a comment on a Discussion / Pull │ │ Request. 
│ │ list_datasets = def list_datasets(*, filter: │ │ Union[huggingface_hub.utils.endpoint_helpers.DatasetFilter, str, Iterable[str], │ │ NoneType] = None, author: Optional[str] = None, search: Optional[str] = None, sort: │ │ Union[Literal['lastModified'], str, NoneType] = None, direction: │ │ Optional[Literal[-1]] = None, limit: Optional[int] = None, cardData: Optional[bool] │ │ = None, full: Optional[bool] = None, token: Optional[str] = None) -> │ │ List[huggingface_hub.hf_api.DatasetInfo]: Get the list of all the datasets on │ │ huggingface.co │ │ list_metrics = def list_metrics() -> List[huggingface_hub.hf_api.MetricInfo]: Get the public list │ │ of all the metrics on huggingface.co │ │ list_models = def list_models(*, filter: Union[huggingface_hub.utils.endpoint_helpers.ModelFilter, │ │ str, Iterable[str], NoneType] = None, author: Optional[str] = None, search: │ │ Optional[str] = None, emissions_thresholds: Optional[Tuple[float, float]] = None, │ │ sort: Union[Literal['lastModified'], str, NoneType] = None, direction: │ │ Optional[Literal[-1]] = None, limit: Optional[int] = None, full: Optional[bool] = │ │ None, cardData: bool = False, fetch_config: bool = False, token: Union[bool, str, │ │ NoneType] = None) -> List[huggingface_hub.hf_api.ModelInfo]: Get the list of all the │ │ models on huggingface.co │ │ list_repo_files = def list_repo_files(repo_id: str, *, revision: Optional[str] = None, repo_type: │ │ Optional[str] = None, timeout: Optional[float] = None, token: Union[bool, str, │ │ NoneType] = None) -> List[str]: Get the list of files in a given repo. │ │ list_spaces = def list_spaces(*, filter: Union[str, Iterable[str], NoneType] = None, author: │ │ Optional[str] = None, search: Optional[str] = None, sort: │ │ Union[Literal['lastModified'], str, NoneType] = None, direction: │ │ Optional[Literal[-1]] = None, limit: Optional[int] = None, datasets: Union[str, │ │ Iterable[str], NoneType] = None, models: Union[str, Iterable[str], NoneType] = None, │ │ linked: bool = False, full: Optional[bool] = None, token: Optional[str] = None) -> │ │ List[huggingface_hub.hf_api.SpaceInfo]: Get the public list of all Spaces on │ │ huggingface.co │ │ merge_pull_request = def merge_pull_request(repo_id: str, discussion_num: int, *, token: Optional[str] = │ │ None, comment: Optional[str] = None, repo_type: Optional[str] = None): Merges a Pull │ │ Request. │ │ model_info = def model_info(repo_id: str, *, revision: Optional[str] = None, timeout: │ │ Optional[float] = None, securityStatus: Optional[bool] = None, files_metadata: bool │ │ = False, token: Union[bool, str, NoneType] = None) -> │ │ huggingface_hub.hf_api.ModelInfo: Get info on one specific model on huggingface.co │ │ move_repo = def move_repo(from_id: str, to_id: str, *, repo_type: Optional[str] = None, token: │ │ Optional[str] = None): Moving a repository from namespace1/repo_name1 to │ │ namespace2/repo_name2 │ │ rename_discussion = def rename_discussion(repo_id: str, discussion_num: int, new_title: str, *, token: │ │ Optional[str] = None, repo_type: Optional[str] = None) -> │ │ huggingface_hub.community.DiscussionTitleChange: Renames a Discussion. 
│ │ repo_info = def repo_info(repo_id: str, *, revision: Optional[str] = None, repo_type: │ │ Optional[str] = None, timeout: Optional[float] = None, files_metadata: bool = False, │ │ token: Union[bool, str, NoneType] = None) -> Union[huggingface_hub.hf_api.ModelInfo, │ │ huggingface_hub.hf_api.DatasetInfo, huggingface_hub.hf_api.SpaceInfo]: Get the info │ │ object for a given repo of a given type. │ │ set_access_token = def set_access_token(access_token: str): │ │ Saves the passed access token so git can correctly authenticate the │ │ user. │ │ space_info = def space_info(repo_id: str, *, revision: Optional[str] = None, timeout: │ │ Optional[float] = None, files_metadata: bool = False, token: Union[bool, str, │ │ NoneType] = None) -> huggingface_hub.hf_api.SpaceInfo: Get info on one specific │ │ Space on huggingface.co. │ │ unset_access_token = def unset_access_token(): Resets the user's access token. │ │ update_repo_visibility = def update_repo_visibility(repo_id: str, private: bool = False, *, token: │ │ Optional[str] = None, organization: Optional[str] = None, repo_type: Optional[str] = │ │ None, name: Optional[str] = None) -> Dict[str, bool]: Update the visibility setting │ │ of a repository. │ │ upload_file = def upload_file(*, path_or_fileobj: Union[str, bytes, BinaryIO], path_in_repo: str, │ │ repo_id: str, token: Optional[str] = None, repo_type: Optional[str] = None, │ │ revision: Optional[str] = None, commit_message: Optional[str] = None, │ │ commit_description: Optional[str] = None, create_pr: Optional[bool] = None, │ │ parent_commit: Optional[str] = None) -> str: │ │ Upload a local file (up to 50 GB) to the given repo. The upload is done │ │ through a HTTP post request, and doesn't require git or git-lfs to be │ │ installed. │ │ upload_folder = def upload_folder(*, repo_id: str, folder_path: Union[str, pathlib.Path], │ │ path_in_repo: Optional[str] = None, commit_message: Optional[str] = None, │ │ commit_description: Optional[str] = None, token: Optional[str] = None, repo_type: │ │ Optional[str] = None, revision: Optional[str] = None, create_pr: Optional[bool] = │ │ None, parent_commit: Optional[str] = None, allow_patterns: Union[List[str], str, │ │ NoneType] = None, ignore_patterns: Union[List[str], str, NoneType] = None): │ │ Upload a local folder to the given repo. The upload is done │ │ through a HTTP requests, and doesn't require git or git-lfs to be │ │ installed. │ │ whoami = def whoami(token: Optional[str] = None) -> Dict: Call HF API to know "whoami". │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Looking through this, you'll see there are a bunch of different things we can now do programmatically via the Hub. For this post we're interested in the list_datasets and list_models methods. If we inspect one of these, we can see it has a number of options we can use when listing datasets or models.
rich.inspect(api.list_models)
╭─────────── <bound method HfApi.list_models of <huggingface_hub.hf_api.HfApi object at 0x136a2ce80>> ────────────╮ │ def HfApi.list_models(*, filter: Union[huggingface_hub.utils.endpoint_helpers.ModelFilter, str, Iterable[str], │ │ NoneType] = None, author: Optional[str] = None, search: Optional[str] = None, emissions_thresholds: │ │ Optional[Tuple[float, float]] = None, sort: Union[Literal['lastModified'], str, NoneType] = None, direction: │ │ Optional[Literal[-1]] = None, limit: Optional[int] = None, full: Optional[bool] = None, cardData: bool = False, │ │ fetch_config: bool = False, token: Union[bool, str, NoneType] = None) -> │ │ List[huggingface_hub.hf_api.ModelInfo]: │ │ │ │ Get the list of all the models on huggingface.co │ │ │ │ 28 attribute(s) not shown. Run inspect(inspect) for options. │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
For our use case we want everything, so we set limit=None. We don't want any filters, so we set filter=None (this is the default behaviour, but we set them explicitly here to make it clearer for our future selves). We also set full=True so we get back more verbose information about our datasets and models. Finally, we wrap the results in iter and list, since the behaviour of these methods will change in future versions to support paging.
hub_datasets = list(iter(api.list_datasets(limit=None, filter=None, full=True)))
hub_models = list(iter(api.list_models(limit=None, filter=None, full=True)))
Let's peek at an example of what we get back.
hub_models[0]
ModelInfo: { modelId: albert-base-v1 sha: aeffd769076a5c4f83b2546aea99ca45a15a5da4 lastModified: 2021-01-13T15:08:24.000Z tags: ['pytorch', 'tf', 'albert', 'fill-mask', 'en', 'dataset:bookcorpus', 'dataset:wikipedia', 'arxiv:1909.11942', 'transformers', 'exbert', 'license:apache-2.0', 'autotrain_compatible', 'has_space'] pipeline_tag: fill-mask siblings: [RepoFile(rfilename='.gitattributes', size='None', blob_id='None', lfs='None'), RepoFile(rfilename='README.md', size='None', blob_id='None', lfs='None'), RepoFile(rfilename='config.json', size='None', blob_id='None', lfs='None'), RepoFile(rfilename='pytorch_model.bin', size='None', blob_id='None', lfs='None'), RepoFile(rfilename='spiece.model', size='None', blob_id='None', lfs='None'), RepoFile(rfilename='tf_model.h5', size='None', blob_id='None', lfs='None'), RepoFile(rfilename='tokenizer.json', size='None', blob_id='None', lfs='None'), RepoFile(rfilename='with-prefix-tf_model.h5', size='None', blob_id='None', lfs='None')] private: False author: None config: None securityStatus: None _id: 621ffdc036468d709f174328 id: albert-base-v1 gitalyUid: 4f35551ea371da7a8762caab54319a54ade836044f0ca7690d21e86b159867eb likes: 1 downloads: 75182 library_name: transformers }
hub_datasets[0]
DatasetInfo: { id: acronym_identification sha: 173af1516c409eb4596bc63a69626bdb5584c40c lastModified: 2022-11-18T17:25:49.000Z tags: ['task_categories:token-classification', 'annotations_creators:expert-generated', 'language_creators:found', 'multilinguality:monolingual', 'size_categories:10K<n<100K', 'source_datasets:original', 'language:en', 'license:mit', 'acronym-identification', 'arxiv:2010.14678'] private: False author: None description: Acronym identification training and development sets for the acronym identification task at SDU@AAAI-21. citation: @inproceedings{veyseh-et-al-2020-what, title={{What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation}}, author={Amir Pouran Ben Veyseh and Franck Dernoncourt and Quan Hung Tran and Thien Huu Nguyen}, year={2020}, booktitle={Proceedings of COLING}, link={https://arxiv.org/pdf/2010.14678v1.pdf} } cardData: {'annotations_creators': ['expert-generated'], 'language_creators': ['found'], 'language': ['en'], 'license': ['mit'], 'multilinguality': ['monolingual'], 'size_categories': ['10K<n<100K'], 'source_datasets': ['original'], 'task_categories': ['token-classification'], 'task_ids': [], 'paperswithcode_id': 'acronym-identification', 'pretty_name': 'Acronym Identification Dataset', 'train-eval-index': [{'config': 'default', 'task': 'token-classification', 'task_id': 'entity_extraction', 'splits': {'eval_split': 'test'}, 'col_mapping': {'tokens': 'tokens', 'labels': 'tags'}}], 'tags': ['acronym-identification'], 'dataset_info': {'features': [{'name': 'id', 'dtype': 'string'}, {'name': 'tokens', 'sequence': 'string'}, {'name': 'labels', 'sequence': {'class_label': {'names': {'0': 'B-long', '1': 'B-short', '2': 'I-long', '3': 'I-short', '4': 'O'}}}}], 'splits': [{'name': 'train', 'num_bytes': 7792803, 'num_examples': 14006}, {'name': 'validation', 'num_bytes': 952705, 'num_examples': 1717}, {'name': 'test', 'num_bytes': 987728, 'num_examples': 1750}], 'download_size': 8556464, 'dataset_size': 9733236}} siblings: [] _id: 621ffdd236468d709f181d58 disabled: False gated: False gitalyUid: 6570517623fa521aa189178e7c7e73d9d88c01b295204edef97f389a15eae144 likes: 9 downloads: 6074 paperswithcode_id: acronym-identification }
Since we want both models and datasets, we create a dictionary that keeps track of each item's type, i.e. whether it is a dataset or a model.
hub_data = {"model": hub_models, "dataset": hub_datasets}
We'll be putting our data inside a pandas DataFrame, so we'll grab the .__dict__ attribute of each hub item to make it more pandas-friendly.
hub_item_dict = []
for hub_type, hub_item in hub_data.items():
    for item in hub_item:
        data = item.__dict__
        data["type"] = hub_type  # record whether this row is a model or a dataset
        hub_item_dict.append(data)
df = pd.DataFrame.from_dict(hub_item_dict)
How many hub items do we have?
len(df)
151343
What info do we have?
df.columns
Index(['modelId', 'sha', 'lastModified', 'tags', 'pipeline_tag', 'siblings', 'private', 'author', 'config', 'securityStatus', '_id', 'id', 'gitalyUid', 'likes', 'downloads', 'library_name', 'type', 'description', 'citation', 'cardData', 'disabled', 'gated', 'paperswithcode_id'], dtype='object')
Models and datasets have a bunch of metadata, e.g. last modified date and number of downloads. We'll focus on tags here. Let's start by looking at a single example.
df.loc[30, "tags"]
['pytorch', 'tf', 'rust', 'safetensors', 'distilbert', 'text-classification', 'en', 'dataset:sst2', 'dataset:glue', 'doi:10.57967/hf/0181', 'transformers', 'license:apache-2.0', 'model-index', 'has_space']
We can see that tags can relate to tasks, e.g. text-classification, supported libraries, e.g. tf, or the licence associated with a model or dataset. As a starting point for exploring tags we can take a look at how many tags models and datasets have. We'll add a new column to capture this number.
def calculate_number_of_tags(tags: list[str]) -> int:
    return len(tags)


df["number_of_tags"] = df["tags"].apply(calculate_number_of_tags)
We can now use describe to see the breakdown of this number.
df.number_of_tags.describe()
count 151343.000000 mean 3.855566 std 6.878613 min 0.000000 25% 0.000000 50% 4.000000 75% 6.000000 max 650.000000 Name: number_of_tags, dtype: float64
We can see quite a range of tag counts, from 0 all the way up to 650! If your brain works anything like mine, you probably want to know what this high value is about!
df[df.number_of_tags > 640][["id", "tags"]]
 | id | tags
---|---|---
136372 | bible-nlp/biblenlp-corpus | [task_categories:translation, annotations_crea...
df[df.number_of_tags > 640]["tags"].tolist()
[['task_categories:translation', 'annotations_creators:no-annotation', 'language_creators:expert-generated', 'multilinguality:translation', 'multilinguality:multilingual', 'size_categories:1M<n<10M', 'source_datasets:original', 'language:aau', 'language:aaz', 'language:abx', 'language:aby', 'language:acf', 'language:acu', 'language:adz', 'language:aey', 'language:agd', 'language:agg', 'language:agm', 'language:agn', 'language:agr', 'language:agu', 'language:aia', 'language:ake', 'language:alp', 'language:alq', 'language:als', 'language:aly', 'language:ame', 'language:amk', 'language:amp', 'language:amr', 'language:amu', 'language:anh', 'language:anv', 'language:aoi', 'language:aoj', 'language:apb', 'language:apn', 'language:apu', 'language:apy', 'language:arb', 'language:arl', 'language:arn', 'language:arp', 'language:aso', 'language:ata', 'language:atb', 'language:atd', 'language:atg', 'language:auc', 'language:aui', 'language:auy', 'language:avt', 'language:awb', 'language:awk', 'language:awx', 'language:azg', 'language:azz', 'language:bao', 'language:bbb', 'language:bbr', 'language:bch', 'language:bco', 'language:bdd', 'language:bea', 'language:bel', 'language:bgs', 'language:bgt', 'language:bhg', 'language:bhl', 'language:big', 'language:bjr', 'language:bjv', 'language:bkd', 'language:bki', 'language:bkq', 'language:bkx', 'language:bla', 'language:blw', 'language:blz', 'language:bmh', 'language:bmk', 'language:bmr', 'language:bnp', 'language:boa', 'language:boj', 'language:bon', 'language:box', 'language:bqc', 'language:bre', 'language:bsn', 'language:bsp', 'language:bss', 'language:buk', 'language:bus', 'language:bvr', 'language:bxh', 'language:byx', 'language:bzd', 'language:bzj', 'language:cab', 'language:caf', 'language:cao', 'language:cap', 'language:car', 'language:cav', 'language:cax', 'language:cbc', 'language:cbi', 'language:cbk', 'language:cbr', 'language:cbs', 'language:cbt', 'language:cbu', 'language:cbv', 'language:cco', 'language:ces', 'language:cgc', 'language:cha', 'language:chd', 'language:chf', 'language:chk', 'language:chq', 'language:chz', 'language:cjo', 'language:cjv', 'language:cle', 'language:clu', 'language:cme', 'language:cmn', 'language:cni', 'language:cnl', 'language:cnt', 'language:cof', 'language:con', 'language:cop', 'language:cot', 'language:cpa', 'language:cpb', 'language:cpc', 'language:cpu', 'language:crn', 'language:crx', 'language:cso', 'language:cta', 'language:ctp', 'language:ctu', 'language:cub', 'language:cuc', 'language:cui', 'language:cut', 'language:cux', 'language:cwe', 'language:daa', 'language:dad', 'language:dah', 'language:ded', 'language:deu', 'language:dgr', 'language:dgz', 'language:dif', 'language:dik', 'language:dji', 'language:djk', 'language:dob', 'language:dwr', 'language:dww', 'language:dwy', 'language:eko', 'language:emi', 'language:emp', 'language:eng', 'language:epo', 'language:eri', 'language:ese', 'language:etr', 'language:faa', 'language:fai', 'language:far', 'language:for', 'language:fra', 'language:fuf', 'language:gai', 'language:gam', 'language:gaw', 'language:gdn', 'language:gdr', 'language:geb', 'language:gfk', 'language:ghs', 'language:gia', 'language:glk', 'language:gmv', 'language:gng', 'language:gnn', 'language:gnw', 'language:gof', 'language:grc', 'language:gub', 'language:guh', 'language:gui', 'language:gul', 'language:gum', 'language:guo', 'language:gvc', 'language:gvf', 'language:gwi', 'language:gym', 'language:gyr', 'language:hat', 'language:haw', 'language:hbo', 'language:hch', 'language:heb', 
'language:heg', 'language:hix', 'language:hla', 'language:hlt', 'language:hns', 'language:hop', 'language:hrv', 'language:hub', 'language:hui', 'language:hus', 'language:huu', 'language:huv', 'language:hvn', 'language:ign', 'language:ikk', 'language:ikw', 'language:imo', 'language:inb', 'language:ind', 'language:ino', 'language:iou', 'language:ipi', 'language:ita', 'language:jac', 'language:jao', 'language:jic', 'language:jiv', 'language:jpn', 'language:jvn', 'language:kaq', 'language:kbc', 'language:kbh', 'language:kbm', 'language:kdc', 'language:kde', 'language:kdl', 'language:kek', 'language:ken', 'language:kew', 'language:kgk', 'language:kgp', 'language:khs', 'language:kje', 'language:kjs', 'language:kkc', 'language:kky', 'language:klt', 'language:klv', 'language:kms', 'language:kmu', 'language:kne', 'language:knf', 'language:knj', 'language:kos', 'language:kpf', 'language:kpg', 'language:kpj', 'language:kpw', 'language:kqa', 'language:kqc', 'language:kqf', 'language:kql', 'language:kqw', 'language:ksj', 'language:ksr', 'language:ktm', 'language:kto', 'language:kud', 'language:kue', 'language:kup', 'language:kvn', 'language:kwd', 'language:kwf', 'language:kwi', 'language:kwj', 'language:kyf', 'language:kyg', 'language:kyq', 'language:kyz', 'language:kze', 'language:lac', 'language:lat', 'language:lbb', 'language:leu', 'language:lex', 'language:lgl', 'language:lid', 'language:lif', 'language:lww', 'language:maa', 'language:maj', 'language:maq', 'language:mau', 'language:mav', 'language:maz', 'language:mbb', 'language:mbc', 'language:mbh', 'language:mbl', 'language:mbt', 'language:mca', 'language:mcb', 'language:mcd', 'language:mcf', 'language:mcp', 'language:mdy', 'language:med', 'language:mee', 'language:mek', 'language:meq', 'language:met', 'language:meu', 'language:mgh', 'language:mgw', 'language:mhl', 'language:mib', 'language:mic', 'language:mie', 'language:mig', 'language:mih', 'language:mil', 'language:mio', 'language:mir', 'language:mit', 'language:miz', 'language:mjc', 'language:mkn', 'language:mks', 'language:mlh', 'language:mlp', 'language:mmx', 'language:mna', 'language:mop', 'language:mox', 'language:mph', 'language:mpj', 'language:mpm', 'language:mpp', 'language:mps', 'language:mpx', 'language:mqb', 'language:mqj', 'language:msb', 'language:msc', 'language:msk', 'language:msm', 'language:msy', 'language:mti', 'language:muy', 'language:mva', 'language:mvn', 'language:mwc', 'language:mxb', 'language:mxp', 'language:mxq', 'language:mxt', 'language:myu', 'language:myw', 'language:myy', 'language:mzz', 'language:nab', 'language:naf', 'language:nak', 'language:nay', 'language:nbq', 'language:nca', 'language:nch', 'language:ncj', 'language:ncl', 'language:ncu', 'language:ndj', 'language:nfa', 'language:ngp', 'language:ngu', 'language:nhg', 'language:nhi', 'language:nho', 'language:nhr', 'language:nhu', 'language:nhw', 'language:nhy', 'language:nif', 'language:nin', 'language:nko', 'language:nld', 'language:nlg', 'language:nna', 'language:nnq', 'language:not', 'language:nou', 'language:npl', 'language:nsn', 'language:nss', 'language:ntj', 'language:ntp', 'language:nwi', 'language:nyu', 'language:obo', 'language:ong', 'language:ons', 'language:ood', 'language:opm', 'language:ote', 'language:otm', 'language:otn', 'language:otq', 'language:ots', 'language:pab', 'language:pad', 'language:pah', 'language:pao', 'language:pes', 'language:pib', 'language:pio', 'language:pir', 'language:pjt', 'language:plu', 'language:pma', 'language:poe', 'language:poi', 'language:pon', 'language:poy', 
'language:ppo', 'language:prf', 'language:pri', 'language:ptp', 'language:ptu', 'language:pwg', 'language:quc', 'language:quf', 'language:quh', 'language:qul', 'language:qup', 'language:qvc', 'language:qve', 'language:qvh', 'language:qvm', 'language:qvn', 'language:qvs', 'language:qvw', 'language:qvz', 'language:qwh', 'language:qxh', 'language:qxn', 'language:qxo', 'language:rai', 'language:rkb', 'language:rmc', 'language:roo', 'language:rop', 'language:rro', 'language:ruf', 'language:rug', 'language:rus', 'language:sab', 'language:san', 'language:sbe', 'language:seh', 'language:sey', 'language:sgz', 'language:shj', 'language:shp', 'language:sim', 'language:sja', 'language:sll', 'language:smk', 'language:snc', 'language:snn', 'language:sny', 'language:som', 'language:soq', 'language:spa', 'language:spl', 'language:spm', 'language:sps', 'language:spy', 'language:sri', 'language:srm', 'language:srn', 'language:srp', 'language:srq', 'language:ssd', 'language:ssg', 'language:ssx', 'language:stp', 'language:sua', 'language:sue', 'language:sus', 'language:suz', 'language:swe', 'language:swh', 'language:swp', 'language:sxb', 'language:tac', 'language:tav', 'language:tbc', 'language:tbl', 'language:tbo', 'language:tbz', 'language:tca', 'language:tee', 'language:ter', 'language:tew', 'language:tfr', 'language:tgp', 'language:tif', 'language:tim', 'language:tiy', 'language:tke', 'language:tku', 'language:tna', 'language:tnc', 'language:tnn', 'language:tnp', 'language:toc', 'language:tod', 'language:toj', 'language:ton', 'language:too', 'language:top', 'language:tos', 'language:tpt', 'language:trc', 'language:tsw', 'language:ttc', 'language:tue', 'language:tuo', 'language:txu', 'language:ubr', 'language:udu', 'language:ukr', 'language:uli', 'language:ura', 'language:urb', 'language:usa', 'language:usp', 'language:uvl', 'language:vid', 'language:vie', 'language:viv', 'language:vmy', 'language:waj', 'language:wal', 'language:wap', 'language:wat', 'language:wbp', 'language:wed', 'language:wer', 'language:wim', 'language:wmt', 'language:wmw', 'language:wnc', 'language:wnu', 'language:wos', 'language:wrk', 'language:wro', 'language:wsk', 'language:wuv', 'language:xav', 'language:xed', 'language:xla', 'language:xnn', 'language:xon', 'language:xsi', 'language:xtd', 'language:xtm', 'language:yaa', 'language:yad', 'language:yal', 'language:yap', 'language:yaq', 'language:yby', 'language:ycn', 'language:yka', 'language:yml', 'language:yre', 'language:yuj', 'language:yut', 'language:yuw', 'language:yva', 'language:zaa', 'language:zab', 'language:zac', 'language:zad', 'language:zai', 'language:zaj', 'language:zam', 'language:zao', 'language:zar', 'language:zas', 'language:zat', 'language:zav', 'language:zaw', 'language:zca', 'language:zia', 'language:ziw', 'language:zos', 'language:zpc', 'language:zpl', 'language:zpo', 'language:zpq', 'language:zpu', 'language:zpv', 'language:zpz', 'language:zsr', 'language:ztq', 'language:zty', 'language:zyp', 'language:be', 'language:br', 'language:cs', 'language:ch', 'language:zh', 'language:de', 'language:en', 'language:eo', 'language:fr', 'language:ht', 'language:he', 'language:hr', 'language:id', 'language:it', 'language:ja', 'language:la', 'language:nl', 'language:ru', 'language:sa', 'language:so', 'language:es', 'language:sr', 'language:sv', 'language:to', 'language:uk', 'language:vi', 'license:cc-by-4.0', 'license:other']]
We can see that in this case many of the tags relate to language. Since the dataset is Bible-related, and the Bible has been heavily translated, this might not be so surprising.
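If we wanted to confirm that impression, a quick sketch (reusing the outlier row we just looked at) could count the language: tags directly:
# count how many of the outlier dataset's tags are language tags
outlier_tags = df[df.number_of_tags > 640]["tags"].iloc[0]
language_tags = [tag for tag in outlier_tags if tag.startswith("language:")]
print(f"{len(language_tags)} of {len(outlier_tags)} tags are language tags")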
Although these high-level stats are somewhat interesting, we probably want to break the numbers down. At a high level we can group by datasets vs. models.
df.groupby("type")["number_of_tags"].describe()
type | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
dataset | 19576.0 | 2.46935 | 13.137220 | 0.0 | 0.0 | 0.0 | 2.0 | 650.0
model | 131767.0 | 4.06151 | 5.327066 | 0.0 | 0.0 | 4.0 | 6.0 | 413.0
We can see that the mean number of tags for models is higher than for datasets. We can also see that at the 75th percentile, models have more tags than datasets. The possible reasons for this (and whether or not it is a problem) are something we may wish to explore further...
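As one quick follow-up sketch, we could, for example, compare the proportion of models and datasets that have no tags at all:
# share of items with zero tags, broken down by type
df.groupby("type")["number_of_tags"].apply(lambda s: (s == 0).mean())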
Since the Hub hosts models from different libraries, we may also want to break things down by library. First let's grab only the model part of our DataFrame.
models_df = df[df["type"] == "model"]
The library_name column contains info about the library. Let's see how many unique libraries we have.
models_df.library_name.unique().shape
(63,)
This is quite a few! We can do a groupby on this column.
models_df.groupby("library_name")["number_of_tags"].describe()
library_name | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
BERT | 1.0 | 7.0 | NaN | 7.0 | 7.0 | 7.0 | 7.0 | 7.0
Doc-UFCN | 2.0 | 4.0 | 0.000000 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0
EveryDream | 2.0 | 7.0 | 0.000000 | 7.0 | 7.0 | 7.0 | 7.0 | 7.0
FastAI | 1.0 | 1.0 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
JoeyNMT | 1.0 | 4.0 | NaN | 4.0 | 4.0 | 4.0 | 4.0 | 4.0
... | ... | ... | ... | ... | ... | ... | ... | ...
ultralytics | 4.0 | 10.0 | 1.414214 | 8.0 | 9.5 | 10.5 | 11.0 | 11.0
ultralyticsplus | 1.0 | 9.0 | NaN | 9.0 | 9.0 | 9.0 | 9.0 | 9.0
yolor | 2.0 | 9.0 | 0.000000 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0
yolov5 | 36.0 | 9.0 | 0.000000 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0
yolov6detect | 1.0 | 10.0 | NaN | 10.0 | 10.0 | 10.0 | 10.0 | 10.0
62 rows × 8 columns
We might find this a bit tricky to look at. We may want to include only the top n libraries, since some of these libraries are not widely used.
models_df.library_name.value_counts()[:15]
transformers 63754 stable-baselines3 3183 diffusers 2802 sentence-transformers 1273 ml-agents 763 keras 470 timm 383 espnet 381 spacy 296 sample-factory 273 adapter-transformers 201 sklearn 113 nemo 103 fastai 99 speechbrain 94 Name: library_name, dtype: int64
top_libraries = models_df.library_name.value_counts()[:9].index.to_list()
top_libraries_df = models_df[models_df.library_name.isin(top_libraries)]
top_libraries_df.groupby("library_name")["number_of_tags"].describe()
library_name | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
diffusers | 2802.0 | 4.374732 | 2.171226 | 1.0 | 3.0 | 4.0 | 5.0 | 18.0
espnet | 381.0 | 6.965879 | 0.595060 | 3.0 | 7.0 | 7.0 | 7.0 | 9.0
keras | 470.0 | 3.842553 | 14.422674 | 1.0 | 1.0 | 2.0 | 5.0 | 311.0
ml-agents | 763.0 | 6.965924 | 0.273775 | 2.0 | 7.0 | 7.0 | 7.0 | 7.0
sentence-transformers | 1273.0 | 6.984289 | 3.221840 | 2.0 | 6.0 | 6.0 | 7.0 | 36.0
spacy | 296.0 | 4.611486 | 0.985180 | 2.0 | 4.0 | 5.0 | 5.0 | 10.0
stable-baselines3 | 3183.0 | 4.997801 | 0.163426 | 3.0 | 5.0 | 5.0 | 5.0 | 8.0
timm | 383.0 | 3.548303 | 1.315291 | 2.0 | 3.0 | 3.0 | 3.0 | 13.0
transformers | 63754.0 | 6.912037 | 5.262633 | 1.0 | 5.0 | 6.0 | 8.0 | 240.0
Let's take a quick look at some examples from the libraries with the highest and lowest mean number of tags.
top_libraries_df[top_libraries_df.library_name == "sentence-transformers"].sample(15)["tags"]
6123 [pytorch, gpt_neo, arxiv:2202.08904, sentence-... 2488 [pytorch, distilbert, sentence-transformers, f... 37669 [pytorch, distilbert, sentence-transformers, f... 71483 [pytorch, bert, sentence-transformers, feature... 20710 [pytorch, tf, roberta, ko, sentence-transforme... 27073 [pytorch, tf, jax, roberta, arxiv:1908.10084, ... 92037 [pytorch, mpnet, sentence-transformers, featur... 90320 [pytorch, mpnet, sentence-transformers, featur... 63555 [pytorch, bert, sentence-transformers, feature... 87707 [pytorch, mpnet, sentence-transformers, featur... 80570 [pytorch, bert, sentence-transformers, feature... 111407 [pytorch, bert, sentence-transformers, feature... 82690 [pytorch, mpnet, sentence-transformers, featur... 36217 [pytorch, bert, pl, dataset:Wikipedia, arxiv:1... 100086 [pytorch, roberta, sentence-transformers, feat... Name: tags, dtype: object
top_libraries_df[top_libraries_df.library_name == "timm"].sample(15)["tags"]
104432 [pytorch, timm, image-classification] 110296 [pytorch, arxiv:2301.00808, timm, image-classi... 24158 [pytorch, timm, image-classification] 26471 [pytorch, timm, image-classification] 104437 [pytorch, timm, image-classification] 61630 [pytorch, dataset:beans, timm, image-classific... 110298 [pytorch, arxiv:2301.00808, timm, image-classi... 104015 [pytorch, timm, image-classification] 101124 [pytorch, timm, image-classification] 57882 [coreml, onnx, en, dataset:imagenet-1k, arxiv:... 83459 [pytorch, timm, image-classification, vision, ... 99461 [pytorch, timm, image-classification] 104029 [pytorch, timm, image-classification] 84402 [pytorch, timm, image-classification, vision, ... 104428 [pytorch, timm, image-classification] Name: tags, dtype: object
We can see here that some tags for sentence-transformers are very closely tied to that library's purpose, e.g. the sentence-similarity tag. This tag might be useful when a user is looking for models to do sentence similarity, but is less useful if you are trying to choose between models for this task, i.e. trying to find the sentence-transformers model that will be useful for you. We should be careful, therefore, in treating the number of tags as a proxy for quality.
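One rough way to see this (a sketch; the 90% threshold is an arbitrary assumption) is to discount tags that nearly every sentence-transformers model carries before counting:
from collections import Counter

st_df = top_libraries_df[top_libraries_df.library_name == "sentence-transformers"]
tag_counts = Counter(tag for tags in st_df["tags"] for tag in tags)
# treat tags that appear on more than 90% of these models as library boilerplate
boilerplate = {tag for tag, count in tag_counts.items() if count / len(st_df) > 0.9}
st_df["tags"].apply(lambda tags: len([t for t in tags if t not in boilerplate])).describe()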
We also have a pipeline_tag column in our DataFrame, which refers to the type of task a model is for. We should be careful about relying too much on this, but we can have a quick look at how often these tags are used.
models_df["pipeline_tag"].value_counts()
text-classification 14479 text2text-generation 8102 text-generation 7602 reinforcement-learning 6885 token-classification 6386 automatic-speech-recognition 6238 fill-mask 5447 question-answering 3147 feature-extraction 2661 translation 1837 conversational 1770 image-classification 1760 text-to-image 1604 sentence-similarity 1248 summarization 735 unconditional-image-generation 428 text-to-speech 244 audio-classification 234 multiple-choice 169 object-detection 158 image-segmentation 134 audio-to-audio 130 tabular-classification 97 zero-shot-classification 97 image-to-text 76 zero-shot-image-classification 56 video-classification 50 table-question-answering 47 tabular-regression 44 image-to-image 43 depth-estimation 37 document-question-answering 18 visual-question-answering 13 voice-activity-detection 6 other 4 time-series-forecasting 1 Name: pipeline_tag, dtype: int64
We may also want to see if there are some types of task that tend to have more tags.
models_df.groupby("pipeline_tag")["number_of_tags"].mean().sort_values().plot.barh()
[bar plot: mean number of tags per pipeline_tag]
We can also look at the breakdown for a particular task.
text_classification_df = models_df[models_df["pipeline_tag"] == "text-classification"]
text_classification_df["number_of_tags"].describe()
count 14479.000000 mean 5.948822 std 3.718800 min 1.000000 25% 4.000000 50% 5.000000 75% 7.000000 max 240.000000 Name: number_of_tags, dtype: float64
Again, we have some extreme outliers.
text_classification_df[text_classification_df.number_of_tags > 230][["tags", "modelId"]]
 | tags | modelId
---|---|---
22457 | [pytorch, tf, roberta, text-classification, mu... | m3hrdadfi/zabanshenas-roberta-base-mix
101628 | [pytorch, canine, text-classification, ace, af... | SebOchs/canine-c-lang-id
We see that these tags mostly seem to relate to language. Let's remove these outliers and look at the distribution of the number of tags without them.
text_classification_df_no_outliers = text_classification_df[
    text_classification_df["number_of_tags"]
    <= text_classification_df["number_of_tags"].quantile(0.95)
]
text_classification_df_no_outliers["number_of_tags"].plot.hist(bins=9)
[histogram: number of tags for text-classification models, outliers removed]
I've already hinted at why looking at the raw number of tags might not be a good idea. Let's close this blog post by briefly digging into at least one reason why. We'll use the toolz library for some of this analysis.
from toolz import concat
First we grab all the tags and put them in a single list.
all_tags = list(concat(df.tags.tolist()))
If we look at some examples, we'll see some tags are in the form something:somethingelse.
all_tags[:10]
['pytorch', 'tf', 'albert', 'fill-mask', 'en', 'dataset:bookcorpus', 'dataset:wikipedia', 'arxiv:1909.11942', 'transformers', 'exbert']
For example, dataset:wikipedia. We should therefore avoid treating all tags as the same, since tags can have a particular purpose, e.g. indicating that a dataset is associated with a model.
def is_special_tag(tag: str) -> bool:
    return ":" in tag
from toolz import countby, valmap
special_tag_vs_normal = countby(is_special_tag, all_tags)
special_tag_vs_normal
{False: 467758, True: 115755}
total = sum(special_tag_vs_normal.values())
valmap(lambda x: x / total, special_tag_vs_normal)
{False: 0.8016239569641121, True: 0.1983760430358878}
We can see that a good chunk of tags are 'special' tags, i.e. they have a 'type' associated with them. If we want to explore tags on the Hub more carefully, we'll need to take this into account...
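As a hint of where that could go, one small sketch is to break the special tags down by their prefix, to see which types of metadata dominate:
from toolz import countby

# group special tags by the part before the colon, e.g. 'dataset', 'language', 'license'
special_tags = [tag for tag in all_tags if is_special_tag(tag)]
tag_type_counts = countby(lambda tag: tag.split(":", 1)[0], special_tags)
sorted(tag_type_counts.items(), key=lambda kv: kv[1], reverse=True)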