Pandas

Export a Text-Fabric (TF) dataset to Pandas.

In [1]:

from tf.app import use

In [2]:

A = use("CLARIAH/wp6-ferdinandhuyck:clone", checkout="clone", hoist=globals())

Locating corpus resources ...

app: ~/github/CLARIAH/wp6-ferdinandhuyck/app

data: ~/github/CLARIAH/wp6-ferdinandhuyck/tf/0.1

Text-Fabric: Text-Fabric API 11.2.3, CLARIAH/wp6-ferdinandhuyck/app v3, Search Reference
Data: CLARIAH - wp6-ferdinandhuyck 0.1, Character table, Feature docs

Node types

Name	# of nodes	# slots/node	% coverage
text	1	218025.00	100
body	1	218018.00	100
div	42	5190.90	100
chapter	44	4963.18	100
fileDesc	1	299.00	0
editionStmt	1	268.00	0
p	3725	58.04	99
chunk	3833	56.93	100
lg	41	23.34	0
ebook	1	21.00	0
pod	1	21.00	0
note	9	20.89	0
sourceDesc	1	16.00	0
bibl	2	13.00	0
revisionDesc	1	12.00	0
q	27	9.04	0
head	86	8.40	0
titleStmt	1	8.00	0
l	122	7.80	0
interpGrp	1	7.00	0
change	2	6.00	0
publicationStmt	1	5.00	0
title	3	5.00	0
item	2	4.00	0
hi	602	3.50	1
author	3	3.00	0
imprint	2	3.00	0
encodingDesc	1	2.00	0
notesStmt	1	2.00	0
order	2	2.00	0
availability	3	1.67	0
name	268	1.21	0
idno	9	1.11	0
blurb	2	1.00	0
colofon	2	1.00	0
date	4	1.00	0
figure	5	1.00	0
interp	7	1.00	0
price	2	1.00	0
pubPlace	2	1.00	0
publisher	2	1.00	0
respStmt	2	1.00	0
titlepage	2	1.00	0
xptr	5	1.00	0
word	218380	1.00	100

Sets: no custom sets
Features:

CLARIAH - wp6-ferdinandhuyck

Name	Type	Description
after	str	the text after a word till the next word
chapter	str	name of chapter
chunk	int	number of a chunk within a file
curr	str	this is TEI attribute curr
empty	int	whether a slot has been inserted in an empty element
empty_lb	int	empty TEI element lb follows
empty_link	int	empty TEI element link follows
empty_pb	int	empty TEI element pb follows
empty_pb_n	str	TEI attribute n of empty element pb
is_meta	str	whether a slot or word is in the teiHeader element
is_note	str	whether a slot or word is in the note element
n	str	this is TEI attribute n
otype	str	
place	str	this is TEI attribute place
rend	str	this is TEI attribute rend
rend_1tab	int	whether text is to be rendered as 1tab
rend_b	int	whether text is to be rendered as b
rend_bq	int	whether text is to be rendered as bq
rend_h2	int	whether text is to be rendered as h2
rend_h3	int	whether text is to be rendered as h3
rend_h4	int	whether text is to be rendered as h4
rend_i	int	whether text is to be rendered as i
rend_sc	int	whether text is to be rendered as sc
rend_spat	int	whether text is to be rendered as spat
rend_sup	int	whether text is to be rendered as sup
str	str	the text of a word
to	str	this is TEI attribute to
type	str	this is TEI attribute type
value	str	this is TEI attribute value
oslots	none	

Text-Fabric API: names N F E L T S C TF directly usable

In [3]:

c1 = F.otype.s("chunk")[100]
c1

Out[3]:

In [4]:

A.plain(c1)

Tweede hoofdstuk.@-2 Waarin men lezen zal, wat in en voor de herberg te Zoest voorviel.

In [5]:

A.exportPandas(inTypes="")

  0.00s Create tsv file ...
   |     0.17s   5%   11363 nodes written
   |     0.33s  10%   22726 nodes written
   |     0.49s  15%   34089 nodes written
   |     0.65s  20%   45452 nodes written
   |     0.81s  25%   56815 nodes written
   |     0.97s  30%   68178 nodes written
   |     1.13s  35%   79541 nodes written
   |     1.29s  40%   90904 nodes written
   |     1.45s  45%  102267 nodes written
   |     1.61s  50%  113630 nodes written
   |     1.77s  55%  124993 nodes written
   |     1.93s  60%  136356 nodes written
   |     2.09s  65%  147719 nodes written
   |     2.25s  70%  159082 nodes written
   |     2.41s  75%  170445 nodes written
   |     2.57s  80%  181808 nodes written
   |     2.73s  85%  193171 nodes written
   |     2.89s  90%  204534 nodes written
   |     3.05s  95%  215897 nodes written
   |     3.22s  95%  227255 nodes written and done
  3.22s TSV file is ~/github/CLARIAH/wp6-ferdinandhuyck/_temp/data-0.1.tsv
  3.22s Columns 32:
  3.22s 	nd
  3.22s 	otype
  3.22s 	after
  3.22s 	str
  3.22s 	in_chapter
  3.22s 	in_chunk
  3.22s 	chapter
  3.22s 	chunk
  3.23s 	curr
  3.23s 	empty
  3.23s 	empty_lb
  3.23s 	empty_link
  3.23s 	empty_pb
  3.23s 	empty_pb_n
  3.23s 	is_meta
  3.23s 	is_note
  3.23s 	n
  3.23s 	place
  3.23s 	rend
  3.23s 	rend_1tab
  3.23s 	rend_b
  3.23s 	rend_bq
  3.23s 	rend_h2
  3.23s 	rend_h3
  3.23s 	rend_h4
  3.23s 	rend_i
  3.23s 	rend_sc
  3.23s 	rend_spat
  3.23s 	rend_sup
  3.23s 	to
  3.23s 	type
  3.23s 	value

  3.29s 	227256 rows
  3.29s 	13520413 characters
  3.29s Importing into Pandas ...
   |     0.00s Reading tsv file ...
   |     0.96s Done. Size = 7272160
   |     0.96s Saving as Parquet file ...
   |     1.12s Saved
  4.41s PD  in ~/github/CLARIAH/wp6-ferdinandhuyck/pandas/data-0.1.pd
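
The export above produces one row per node, with a column per feature plus the containment columns `in_chapter` and `in_chunk`. The following is a minimal sketch of how such a table might be queried in Pandas. It runs on a tiny hand-made DataFrame that borrows only the column names from the log. The node numbers and values are invented, and `text_of` is a hypothetical helper, not part of Text-Fabric. Since the log reports that the `.pd` file is saved as Parquet, loading the real export with `pd.read_parquet` is assumed to work, provided a Parquet engine such as `pyarrow` is installed.

```python
import pandas as pd

# Hypothetical miniature of the exported table: column names taken from the
# export log (nd, otype, str, after, in_chapter, in_chunk); values invented.
df = pd.DataFrame(
    {
        "nd": [1, 2, 3, 4, 5],
        "otype": ["word", "word", "word", "chapter", "chunk"],
        "str": ["Tweede", "hoofdstuk", "Waarin", None, None],
        "after": [" ", ". ", " ", None, None],
        "in_chapter": [4, 4, 4, None, 4],
        "in_chunk": [5, 5, 5, None, None],
    }
)

# With the real export, the table would presumably be loaded like this
# (assumption: the .pd file is a regular Parquet file, as the log suggests):
# df = pd.read_parquet("<path to>/pandas/data-0.1.pd")


def text_of(df, column, node):
    """Join str + after for all word rows whose containment column equals node."""
    words = df[(df["otype"] == "word") & (df[column] == node)]
    pieces = words["str"].fillna("") + words["after"].fillna("")
    return "".join(pieces)


# Number of rows per node type, comparable to the node type table above.
counts = df["otype"].value_counts()

# Text of the (invented) chunk node 5, rebuilt from its word rows.
chunk_text = text_of(df, "in_chunk", 5)
print(counts)
print(chunk_text)
```

Keeping the containment information in plain columns means that questions such as "which words belong to this chunk" become ordinary boolean filters, without going through the Text-Fabric API.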
