The CLTK has a distributed infrastructure that lets you download the official CLTK corpora as well as corpora shared by other people. For full docs, see http://docs.cltk.org/en/latest/importing_corpora.html.

To get started, open the Terminal, change into your ~/cltk directory (see notebook 1, "CLTK Setup", for instructions), and start Jupyter with the command jupyter notebook. Then go to http://localhost:8888 and open a new notebook.

See what corpora are available

First we need to "import" the right part of the CLTK library. Think of this as pulling just the book you need off the shelf and having it ready to read.

In [1]:

# This is the import of the right part of the CLTK library

from cltk.corpus.utils.importer import CorpusImporter

In [2]:

# See https://github.com/cltk for all official corpora

my_latin_downloader = CorpusImporter('latin')

# Now 'my_latin_downloader' is the variable by which we call the CorpusImporter

In [3]:

my_latin_downloader.list_corpora

Out[3]:

['latin_text_perseus',
 'latin_treebank_perseus',
 'latin_text_latin_library',
 'phi5',
 'phi7',
 'latin_proper_names_cltk',
 'latin_models_cltk',
 'latin_pos_lemmata_cltk',
 'latin_treebank_index_thomisticus',
 'latin_lexica_perseus',
 'latin_training_set_sentence_cltk',
 'latin_word2vec_cltk',
 'latin_text_antique_digiliblt',
 'latin_text_corpus_grammaticorum_latinorum',
 'latin_text_poeti_ditalia']
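
Most of these names follow a loose language_kind_source pattern, so the list can be filtered with ordinary Python. The following is a minimal sketch, assuming that list_corpora gives back a plain list of strings as shown in Out[3]; the helper name corpora_of_kind is made up for illustration and is not part of the CLTK.

```python
# The list is copied from Out[3] above; in the notebook you could pass
# my_latin_downloader.list_corpora instead of this literal.
latin_corpora = [
    'latin_text_perseus',
    'latin_treebank_perseus',
    'latin_text_latin_library',
    'phi5',
    'phi7',
    'latin_proper_names_cltk',
    'latin_models_cltk',
    'latin_pos_lemmata_cltk',
    'latin_treebank_index_thomisticus',
    'latin_lexica_perseus',
    'latin_training_set_sentence_cltk',
    'latin_word2vec_cltk',
    'latin_text_antique_digiliblt',
    'latin_text_corpus_grammaticorum_latinorum',
    'latin_text_poeti_ditalia',
]


def corpora_of_kind(names, kind):
    """Return the names whose second underscore-separated field equals `kind`.

    Hypothetical helper: names without an underscore (such as 'phi5')
    never match.
    """
    selected = []
    for name in names:
        parts = name.split('_')
        if len(parts) > 1 and parts[1] == kind:
            selected.append(name)
    return selected


text_corpora = corpora_of_kind(latin_corpora, 'text')
print(text_corpora)
```

Calling corpora_of_kind(latin_corpora, 'treebank') picks out the two treebank corpora in the same way.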

Import several corpora

In [4]:

my_latin_downloader.import_corpus('latin_text_latin_library')
my_latin_downloader.import_corpus('latin_models_cltk')

To verify that the files were downloaded, run this in the Terminal: ls -l ~/cltk_data/latin/text/latin_text_latin_library/
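
When you need several corpora, the repeated import_corpus calls can be wrapped in a loop that keeps going if one download fails. The sketch below is a hypothetical helper, not a CLTK function. It assumes only that the importer object has the import_corpus(name) method used above, and it catches a broad Exception because the specific errors the importer may raise are not covered in this notebook.

```python
def import_many(importer, corpus_names):
    """Import each named corpus and collect failures instead of stopping.

    Hypothetical helper. `importer` is any object with an
    `import_corpus(name)` method, such as the CorpusImporter instances
    created in this notebook.
    Returns a dict mapping each failed corpus name to its error message.
    """
    failures = {}
    for name in corpus_names:
        try:
            importer.import_corpus(name)
        except Exception as err:  # broad on purpose: exact error types are an unknown here
            failures[name] = str(err)
    return failures
```

In the notebook this could be called as import_many(my_latin_downloader, ['latin_text_latin_library', 'latin_models_cltk']); an empty dictionary means every import went through.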

In [6]:

# Let's get some Greek corpora, too

my_greek_downloader = CorpusImporter('greek')
my_greek_downloader.import_corpus('greek_models_cltk')
my_greek_downloader.list_corpora

Out[6]:

['greek_software_tlgu',
 'greek_text_perseus',
 'phi7',
 'tlg',
 'greek_proper_names_cltk',
 'greek_models_cltk',
 'greek_treebank_perseus',
 'greek_lexica_perseus',
 'greek_training_set_sentence_cltk',
 'greek_word2vec_cltk',
 'greek_text_lacus_curtius',
 'greek_text_first1kgreek']

In [6]:

my_greek_downloader.import_corpus('greek_text_lacus_curtius')

Likewise, verify with ls -l ~/cltk_data/greek/text/greek_text_lacus_curtius/plain/
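
The same check can be made from inside the notebook without shelling out. This is a minimal sketch using only the standard library; count_files is an illustrative helper, not part of the CLTK, and the path in the last line is the one from the ls command above.

```python
from pathlib import Path


def count_files(directory):
    """Count regular files under `directory`, recursively.

    Returns 0 if the directory does not exist, which is what you would
    see if the import had not run yet.
    """
    root = Path(directory).expanduser()
    if not root.is_dir():
        return 0
    return sum(1 for path in root.rglob('*') if path.is_file())


print(count_files('~/cltk_data/greek/text/greek_text_lacus_curtius/plain/'))
```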

In [3]:

my_greek_downloader.import_corpus('greek_text_first1kgreek')

Downloaded 100% 163.52 MiB | 5.21 MiB/s

In [4]:

!ls -l ~/cltk_data/greek/text/greek_text_first1kgreek/

total 2176
-rw-r--r--   1 root root  126919 Jul 13 10:05 Committing Issues using GitHub.docx
-rwxr-xr-x   1 root root    1889 Jul 13 10:05 cselstats.pl
drwxr-xr-x 118 root root    4096 Jul 13 10:05 data
-rwxr-xr-x   1 root root 1955024 Jul 13 10:05 #gelasius-kg.xml#
-rwxr-xr-x   1 root root    2414 Jul 13 10:05 greek-justwork.txt
-rwxr-xr-x   1 root root    3249 Jul 13 10:05 greek.txt
-rwxr-xr-x   1 root root   19777 Jul 13 10:05 Greek-works.txt
-rw-r--r--   1 root root   19125 Jul 13 10:05 license.md
-rw-r--r--   1 root root   58346 Jul 13 10:05 new_edition_metadata.csv
-rw-r--r--   1 root root     697 Jul 13 10:05 pages.sh
-rwxr-xr-x   1 root root    1901 Jul 13 10:05 pnumber.xsl
-rw-r--r--   1 root root    1658 Jul 13 10:05 README.md
drwxr-xr-x   2 root root    4096 Jul 13 10:05 save
drwxr-xr-x  48 root root    4096 Jul 13 10:05 split
drwxr-xr-x   2 root root    4096 Jul 13 10:05 volume_xml

Convert TEI XML texts

Here we'll convert the First1KGreek (First Thousand Years of Greek) corpus from TEI XML to plain text.
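
To give a sense of what such a conversion involves, here is a small standard-library sketch that pulls the text out of the body element of a TEI document. It is an illustration only, not the CLTK's implementation (which, judging by the error message quoted below, relies on bs4 and lxml). The function name tei_body_text and the sample document are invented for the example, and the sketch assumes the file uses the standard TEI namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: the document declares the standard TEI namespace.
TEI_NS = '{http://www.tei-c.org/ns/1.0}'


def tei_body_text(xml_string):
    """Return the whitespace-normalised text inside the TEI <body> element.

    The header (title, editors, licence) is skipped because it sits
    outside <body>. Returns '' if there is no <body>.
    """
    root = ET.fromstring(xml_string)
    body = root.find(f'.//{TEI_NS}body')
    if body is None:
        return ''
    # itertext() walks all nested elements; split/join collapses whitespace.
    return ' '.join(''.join(body.itertext()).split())


# Invented sample, far smaller than a real corpus file.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
<text><body><div>
<p>μῆνιν ἄειδε <hi>θεὰ</hi></p>
<p>Πηληϊάδεω Ἀχιλῆος</p>
</div></body></text>
</TEI>"""

print(tei_body_text(sample))
```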

In [3]:

from cltk.corpus.greek.tei import onekgreek_tei_xml_to_text

In [4]:

# If you get the error 'Install `bs4` and `lxml` to parse these TEI files.',
# run `pip install bs4 lxml` and then run this cell again.

onekgreek_tei_xml_to_text()
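
You can also check for the two parsers before starting the conversion, rather than waiting for the error. This is a minimal sketch using the standard library's importlib; the helper name missing_modules is illustrative.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names in `module_names` that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


missing = missing_modules(['bs4', 'lxml'])
if missing:
    print('Install before converting:', ', '.join(missing))
else:
    print('bs4 and lxml are both available.')
```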

In [5]:

# Count the converted plaintext files
# (plain `ls` is used because `ls -l` adds a "total" line that would inflate the count by one)

!ls ~/cltk_data/greek/text/greek_text_first1kgreek_plaintext/ | wc -l

Import local corpora

The phi5, phi7, and tlg corpora are not downloaded; you need your own copy on disk. Pass its path as the second argument to import_corpus. The files then appear under the originals directory of cltk_data, as the listing at the end of this section shows.

In [10]:

my_latin_downloader.import_corpus('phi5', '~/cltk/corpora/PHI5/')

In [11]:

my_latin_downloader.import_corpus('phi7', '~/cltk/corpora/PHI7/')

In [7]:

my_greek_downloader.import_corpus('tlg', '~/cltk/corpora/TLG_E/')

In [12]:

!ls -l ~/cltk_data/originals/

total 204
drwxr-xr-x 2 kyle kyle  32768 Mar 30  2014 phi5
drwxr-xr-x 2 kyle kyle  24576 Mar 30  2014 phi7
drwxr-xr-x 2 kyle kyle 151552 Mar 30  2014 tlg