from datasets import load_dataset
dataset_large = load_dataset("pierreguillou/DocLayNet-large")
No config specified, defaulting to: doc_lay_net-large/DocLayNet_2022.08_processed_on_2023.01
Downloading and preparing dataset doc_lay_net-large/DocLayNet_2022.08_processed_on_2023.01 to /home/ubuntu/.cache/huggingface/datasets/pierreguillou___doc_lay_net-large/DocLayNet_2022.08_processed_on_2023.01/1.1.0/d06ef3264887eb493e248b8c845cc7610012d0f62e21b0b21a3a4423e461a95e...
Downloading data files: 4 shards (9.42 GB, 9.41 GB, 9.44 GB, 9.38 GB); extracting; generating train, validation, and test splits.
Dataset doc_lay_net-large downloaded and prepared to /home/ubuntu/.cache/huggingface/datasets/pierreguillou___doc_lay_net-large/DocLayNet_2022.08_processed_on_2023.01/1.1.0/d06ef3264887eb493e248b8c845cc7610012d0f62e21b0b21a3a4423e461a95e. Subsequent calls will reuse this data.
dataset_large
DatasetDict({
    train: Dataset({
        features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
        num_rows: 69103
    })
    validation: Dataset({
        features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
        num_rows: 6480
    })
    test: Dataset({
        features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
        num_rows: 4994
    })
})
dataset_large["train"][0]["image"]
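The feature list above carries both `coco_width`/`coco_height` (the processed page size) and `original_width`/`original_height` (the source PDF page size), so bounding boxes may need rescaling before being drawn on the original page. The sketch below is a minimal illustration of that rescaling, assuming boxes are stored as `[x0, y0, x1, y1]` corner coordinates in the processed space (check the dataset card before relying on this); the `example` dict is a hypothetical stand-in for a real row of `dataset_large["train"]`.

```python
def scale_bbox(bbox, coco_size, original_size):
    """Rescale a [x0, y0, x1, y1] box from the processed (COCO) page
    size to the original page size. Corner format is an assumption."""
    sx = original_size[0] / coco_size[0]
    sy = original_size[1] / coco_size[1]
    x0, y0, x1, y1 = bbox
    return [x0 * sx, y0 * sy, x1 * sx, y1 * sy]

# Hypothetical example mirroring the dataset fields shown above.
example = {
    "bboxes_block": [[100, 200, 300, 400]],
    "coco_width": 1025, "coco_height": 1025,
    "original_width": 612, "original_height": 792,
}
scaled = [
    scale_bbox(b,
               (example["coco_width"], example["coco_height"]),
               (example["original_width"], example["original_height"]))
    for b in example["bboxes_block"]
]
```

Keeping the x and y scale factors separate matters because the processed pages are square (1025x1025) while original pages usually are not.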