The LTL Corpus started as a semester project for the Linguistics Program undergraduate topics course LING-479 “Language Corpora & Software” at EMU in Winter 2012. The project involves:
- Setup of an online corpus analysis tool: we installed Philologic 3.2 on the corpus site.
- Annotation of freely available books and compilation of a corpus coded in XML, using the markup specified in the TEI P5 guidelines. We drew on resources such as the Project Gutenberg archive for textual material. Initial conversions from many file formats to TEI P5 XML can be done with the OxGarage conversion tool on the TEI pages, which is essentially a web interface to a batch converter that uses OpenOffice plugins and libraries.
- Text processing and linguistic annotation of the corpus, including tokenization, lemmatization, part-of-speech and named entity annotation, and syntactic structures. We use the Stanford CoreNLP tools, the GATE and UIMA tools, and many self-written scripts and components (for example, batch annotation that takes TEI P5 XML as input and produces TEI P5 XML as output, as in this simple demo online).
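To illustrate the idea of annotating TEI P5 XML back into TEI P5 XML, here is a minimal sketch in Python. It is not one of the project's own scripts: the function name `annotate_paragraphs`, the regular-expression tokenizer, and the choice to mark tokens with the TEI `<w>` and `<pc>` elements are assumptions made for this example only. It also flattens any markup nested inside a paragraph, which a real pipeline would need to preserve.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def annotate_paragraphs(tei_xml: str) -> str:
    """Return TEI XML in which the text of every <p> is split into
    <w> (word) and <pc> (punctuation) elements.

    Hypothetical helper for illustration; nested markup inside <p>
    is flattened to plain text before tokenization.
    """
    # Serialize TEI elements with a default namespace instead of ns0: prefixes.
    ET.register_namespace("", TEI_NS)
    root = ET.fromstring(tei_xml)

    for p in root.iter(f"{{{TEI_NS}}}p"):
        text = "".join(p.itertext())
        # Remove the original content of the paragraph.
        for child in list(p):
            p.remove(child)
        p.text = None
        # Naive tokenizer: runs of word characters, or single punctuation marks.
        for token in re.findall(r"\w+|[^\w\s]", text):
            tag = "w" if re.fullmatch(r"\w+", token) else "pc"
            element = ET.SubElement(p, f"{{{TEI_NS}}}{tag}")
            element.text = token
            element.tail = " "

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        f'<TEI xmlns="{TEI_NS}"><text><body>'
        "<p>Call me Ishmael.</p>"
        "</body></text></TEI>"
    )
    print(annotate_paragraphs(sample))
```

A tool such as Stanford CoreNLP would replace the regular expression here and could supply lemma and part-of-speech values as attributes on each `<w>` element.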
Contributors to this project so far: Mohamed C. Beina, Tiffany J. Belcher, Damir Cavar, Malgorzata E. Cavar, Sarah A. Curry, Zac Smith, Mohammad S. Stroshein, David C. Zalewski.
Although this corpus is still a work in progress, we hope you find the corpus and the online Philologic workbench useful. For suggestions, additional material, comments, and ideas, contact the LTL.
We are grateful to Mark Olson from the ARTFL Project at the University of Chicago for his help and advice with Philologic 3, and to the ARTFL project members for creating Philologic and making it available as open source and free software. Special thanks go to Piotr Bański from the Institut für Deutsche Sprache (IDS) in Mannheim for his hints and suggestions related to the TEI P5 XML annotation and markup. We are also grateful to all the contributors to the Stanford CoreNLP, the GATE and UIMA text and language processing tools!