The CLTK has a distributed infrastructure that lets you download official CLTK corpora as well as corpora shared by other users. For full docs, see http://docs.cltk.org/en/latest/importing_corpora.html.
To get started, from the Terminal, open a new Jupyter notebook from within your ~/cltk directory (see notebook 1 "CLTK Setup" for instructions) by running $ jupyter notebook. Then go to http://localhost:8888.
First we need to "import" the right part of the CLTK library. Think of this as pulling just the book you need off the shelf and having it ready to read.
# Import CorpusImporter, the class that lists and downloads corpora
from cltk.corpus.utils.importer import CorpusImporter
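If the import above fails, it can help to check whether the module is visible to the Python interpreter that runs your notebook. The following is a minimal sketch using only the standard library; `module_available` is a hypothetical helper written for this notebook, not part of the CLTK.

```python
import importlib.util


def module_available(name):
    """Return True if the dotted module name can be located, without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example 'cltk') is itself missing
        return False


# Check the module used in this notebook
print(module_available('cltk.corpus.utils.importer'))
```

If this prints False, revisit the setup steps before continuing.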
# See https://github.com/cltk for all official corpora
my_latin_downloader = CorpusImporter('latin')
# Now 'my_latin_downloader' is the variable by which we call the CorpusImporter
my_latin_downloader.list_corpora
['latin_text_perseus', 'latin_treebank_perseus', 'latin_text_latin_library', 'phi5', 'phi7', 'latin_proper_names_cltk', 'latin_models_cltk', 'latin_pos_lemmata_cltk', 'latin_treebank_index_thomisticus', 'latin_lexica_perseus', 'latin_training_set_sentence_cltk', 'latin_word2vec_cltk', 'latin_text_antique_digiliblt', 'latin_text_corpus_grammaticorum_latinorum', 'latin_text_poeti_ditalia']
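Most of the names in this list say what kind of resource they hold (text, treebank, models, and so on), so you can narrow the list with ordinary Python. A minimal sketch, with the names copied from the output above instead of fetched from the CLTK; the variable names are only illustrative:

```python
# Corpus names copied from the list_corpora output above
latin_corpora = [
    'latin_text_perseus', 'latin_treebank_perseus', 'latin_text_latin_library',
    'phi5', 'phi7', 'latin_proper_names_cltk', 'latin_models_cltk',
    'latin_pos_lemmata_cltk', 'latin_treebank_index_thomisticus',
    'latin_lexica_perseus', 'latin_training_set_sentence_cltk',
    'latin_word2vec_cltk', 'latin_text_antique_digiliblt',
    'latin_text_corpus_grammaticorum_latinorum', 'latin_text_poeti_ditalia',
]

# Keep only the plain-text collections
text_corpora = [name for name in latin_corpora if '_text_' in name]

for name in text_corpora:
    print(name)
```

In a live session you could filter my_latin_downloader.list_corpora directly in the same way.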
my_latin_downloader.import_corpus('latin_text_latin_library')
my_latin_downloader.import_corpus('latin_models_cltk')
You can verify the files were downloaded in the Terminal with $ ls -l ~/cltk_data/latin/text/latin_text_latin_library/
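The same check can be made without leaving the notebook. This is a sketch under the assumption that a successful download leaves a non-empty directory at the path shown above; `corpus_downloaded` is a hypothetical helper, not a CLTK function.

```python
from pathlib import Path


def corpus_downloaded(corpus_dir):
    """Return True if corpus_dir exists and contains at least one entry."""
    path = Path(corpus_dir).expanduser()  # expand the leading ~
    return path.is_dir() and any(path.iterdir())


print(corpus_downloaded('~/cltk_data/latin/text/latin_text_latin_library/'))
```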
# Let's get some Greek corpora, too
my_greek_downloader = CorpusImporter('greek')
my_greek_downloader.import_corpus('greek_models_cltk')
my_greek_downloader.list_corpora
['greek_software_tlgu', 'greek_text_perseus', 'phi7', 'tlg', 'greek_proper_names_cltk', 'greek_models_cltk', 'greek_treebank_perseus', 'greek_lexica_perseus', 'greek_training_set_sentence_cltk', 'greek_word2vec_cltk', 'greek_text_lacus_curtius', 'greek_text_first1kgreek']
my_greek_downloader.import_corpus('greek_text_lacus_curtius')
Likewise, verify with ls -l ~/cltk_data/greek/text/greek_text_lacus_curtius/plain/
my_greek_downloader.import_corpus('greek_text_first1kgreek')
Downloaded 100% 163.52 MiB | 5.21 MiB/s
!ls -l ~/cltk_data/greek/text/greek_text_first1kgreek/
total 2176
-rw-r--r--   1 root root  126919 Jul 13 10:05 Committing Issues using GitHub.docx
-rwxr-xr-x   1 root root    1889 Jul 13 10:05 cselstats.pl
drwxr-xr-x 118 root root    4096 Jul 13 10:05 data
-rwxr-xr-x   1 root root 1955024 Jul 13 10:05 #gelasius-kg.xml#
-rwxr-xr-x   1 root root    2414 Jul 13 10:05 greek-justwork.txt
-rwxr-xr-x   1 root root    3249 Jul 13 10:05 greek.txt
-rwxr-xr-x   1 root root   19777 Jul 13 10:05 Greek-works.txt
-rw-r--r--   1 root root   19125 Jul 13 10:05 license.md
-rw-r--r--   1 root root   58346 Jul 13 10:05 new_edition_metadata.csv
-rw-r--r--   1 root root     697 Jul 13 10:05 pages.sh
-rwxr-xr-x   1 root root    1901 Jul 13 10:05 pnumber.xsl
-rw-r--r--   1 root root    1658 Jul 13 10:05 README.md
drwxr-xr-x   2 root root    4096 Jul 13 10:05 save
drwxr-xr-x  48 root root    4096 Jul 13 10:05 split
drwxr-xr-x   2 root root    4096 Jul 13 10:05 volume_xml
Here we'll convert the First 1K Years' Greek corpus from TEI XML to plain text.
from cltk.corpus.greek.tei import onekgreek_tei_xml_to_text
# If you get the error 'Install `bs4` and `lxml` to parse these TEI files.',
# run `pip install bs4 lxml` and try again.
onekgreek_tei_xml_to_text()
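To give a feel for what such a conversion involves, here is a minimal sketch that pulls the running text out of the body of a TEI document with the standard library. It is not the CLTK's implementation (which, as the error message above indicates, relies on bs4 and lxml); the function `tei_body_to_text` and the tiny sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# TEI documents normally declare this XML namespace
TEI_NS = {'tei': 'http://www.tei-c.org/ns/1.0'}


def tei_body_to_text(xml_string):
    """Return the text inside <body>, with whitespace collapsed; '' if there is no body."""
    root = ET.fromstring(xml_string)
    body = root.find('.//tei:body', TEI_NS)
    if body is None:
        return ''
    # itertext() walks every nested element; split/join normalizes whitespace
    return ' '.join(' '.join(body.itertext()).split())


sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample header</title></titleStmt></fileDesc></teiHeader>
  <text><body><div><p>μῆνιν ἄειδε θεὰ</p><p>Πηληϊάδεω Ἀχιλῆος</p></div></body></text>
</TEI>"""

print(tei_body_to_text(sample))
```

The header metadata is skipped because only the body element is searched.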
# Count the converted plaintext files (`ls -l` also prints a 'total' line, so subtract 1)
!ls -l ~/cltk_data/greek/text/greek_text_first1kgreek_plaintext/ | wc -l
677
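Because `ls -l` prints a leading "total" line, piping it to `wc -l` reports one more than the number of entries. If you want an exact file count from Python, a sketch like the following works; `count_files` is a hypothetical helper, and the directory is the one used in the command above.

```python
from pathlib import Path


def count_files(directory):
    """Count the regular files directly inside directory (0 if it does not exist)."""
    path = Path(directory).expanduser()
    if not path.is_dir():
        return 0
    return sum(1 for entry in path.iterdir() if entry.is_file())


print(count_files('~/cltk_data/greek/text/greek_text_first1kgreek_plaintext/'))
```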
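# PHI5, PHI7 and the TLG are licensed corpora that cannot be downloaded;
# pass the path to your own local copy as the second argument
my_latin_downloader.import_corpus('phi5', '~/cltk/corpora/PHI5/')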
my_latin_downloader.import_corpus('phi7', '~/cltk/corpora/PHI7/')
my_greek_downloader.import_corpus('tlg', '~/cltk/corpora/TLG_E/')
!ls -l ~/cltk_data/originals/
total 204
drwxr-xr-x 2 kyle kyle  32768 Mar 30  2014 phi5
drwxr-xr-x 2 kyle kyle  24576 Mar 30  2014 phi7
drwxr-xr-x 2 kyle kyle 151552 Mar 30  2014 tlg