Library Guides: Digital Scholarship Research Guide: Identifying Your Corpus

What is a corpus?

A corpus is a large, structured set of texts used for text analysis. Corpora can be single-language or multi-language, can be organized around a theme, or can contain all the texts of a single author, time period, or genre.
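
As a minimal sketch of what "structured" can mean in practice, the Python below reads a folder of plain-text files into a list of records, each pairing a text with basic metadata. The function name `load_corpus`, the folder layout, and the record fields are illustrative assumptions, not a standard.

```python
from pathlib import Path


def load_corpus(folder):
    """Read every .txt file in `folder` into a list of records.

    Each record keeps the text together with simple metadata
    (file name and word count) so the collection stays structured.
    The field names are hypothetical, chosen for illustration.
    """
    corpus = []
    # Sort the paths so the corpus order is the same on every run.
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        corpus.append({
            "title": path.stem,                # file name without ".txt"
            "text": text,
            "word_count": len(text.split()),   # whitespace-separated tokens
        })
    return corpus
```

Calling `load_corpus("my_texts")` on a hypothetical folder named `my_texts` would return one record per text file, ready to be filtered by title or length before analysis.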

Finding Prepared Text Corpora

Project Gutenberg
A library of free, downloadable eBooks, including books in English, Portuguese, German, and French.
HathiTrust Digital Library
Requires login through Iowa State University to access full-text downloads. Some texts are search-only.
HathiTrust Research Center Analytics
Requires login. Supports large-scale analysis of works held by HathiTrust Digital Library. Full-text datasets or metadata-only sets are available, depending on copyright status.
Documenting the American South
A digital publishing initiative providing access to digitized and downloadable primary sources related to the history of the US South.
The Folger Shakespeare
Full-text, downloadable versions of Shakespeare's plays and sonnets, free for non-commercial use.
.txtLab Data Sets
A number of text data sets compiled and made available by the .txtLab at McGill University; some sets include word counts or metadata only.
Early Modern Drama Collection
Corpora from the Visualizing English Print project.
Corpusdata.org
Downloadable, full-text data for 10 thematic corpora including examples in English, Spanish, and Portuguese.
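
Downloaded texts often carry licensing or header material that should not be counted as part of the work. As a hedged example, Project Gutenberg plain-text files typically wrap the book between "*** START OF ..." and "*** END OF ..." marker lines. The sketch below trims everything outside those markers. The exact marker wording varies between files, so the patterns here are assumptions to check against your own downloads, and `strip_gutenberg_boilerplate` is a hypothetical helper name.

```python
import re

# Assumed marker patterns; verify them against the files you download.
START = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE,
)
END = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE,
)


def strip_gutenberg_boilerplate(raw):
    """Return the text between the start and end marker lines.

    If a marker is missing, that side of the text is left untrimmed.
    """
    start = START.search(raw)
    begin = start.end() if start else 0
    # Look for the end marker only after the start marker.
    end = END.search(raw, begin)
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()
```

Running this over each file before analysis keeps license text from inflating word counts. Files without the markers pass through with only leading and trailing whitespace removed.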

Tools for Building Your Own Text Corpora

Optical Character Recognition (OCR) is the electronic conversion of scanned documents or images of text to machine-readable text files. You can use OCR to create your own corpus for text analysis. Be sure to understand the copyright status and usage allowances of the materials you wish to scan.

Tesseract
Free, open-source OCR engine run through the command line.
ABBYY FineReader
OCR software that requires purchase of a license. May be available on select ISU library computers.
Transkribus
Free OCR platform for handwritten texts.
OpenRefine
Open-source desktop application for text data cleanup.
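
To make the OCR step concrete, here is a minimal Python sketch that calls the Tesseract command-line program on one scanned page and then tidies the result. It assumes Tesseract is installed and on your PATH, and it relies on Tesseract's basic usage of `tesseract imagename outputbase -l lang`, which writes its result to `outputbase.txt`. The function names and the cleanup rules are illustrative assumptions, not part of Tesseract itself.

```python
import re
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for one Tesseract run."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract appends ".txt" to the output base name.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


def clean_ocr_text(text):
    """Apply a few simple, assumed cleanup rules to OCR output."""
    # Rejoin words hyphenated across a line break ("docu-\nment").
    # Caution: this also joins real hyphenated compounds split at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

A typical use would be `clean_ocr_text(ocr_image("page1.png", "page1"))`, where `page1.png` is a hypothetical scan. OCR output usually still needs proofreading, and heavier cleanup across many files may be easier in a tool such as OpenRefine.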