Digital Scholarship Research Guide

What is a corpus?

A corpus (plural: corpora) is a large, structured set of texts used for text analysis. Corpora can be single-language or multilingual, can be organized around a theme, or can contain all the texts of a single author, time period, or genre.
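
As a rough illustration of what "structured" can mean in practice, the Python sketch below represents a tiny corpus as a list of records, each pairing a text with some metadata, and counts word frequencies across it. The sample texts, the metadata fields (author, year), and the helper names tokenize and word_counts are illustrative assumptions, not part of any particular tool or collection.

```python
import re
from collections import Counter

# Hypothetical miniature corpus: each record pairs a text with metadata.
corpus = [
    {"author": "A", "year": 1850, "text": "The sea was calm. The sea was grey."},
    {"author": "B", "year": 1900, "text": "A storm rose over the sea."},
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_counts(records, author=None):
    """Count tokens across records, optionally restricted to one author."""
    counts = Counter()
    for record in records:
        if author is None or record["author"] == author:
            counts.update(tokenize(record["text"]))
    return counts

if __name__ == "__main__":
    # Most frequent words in the whole corpus, then in one author's texts.
    print(word_counts(corpus).most_common(3))
    print(word_counts(corpus, author="A").most_common(3))
```

Keeping metadata alongside each text is what makes it possible to compare subsets, for example one author against another or one period against the next.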

Finding Prepared Text Corpora

Tools for Building Your Own Text Corpora

Optical Character Recognition (OCR) is the electronic conversion of scanned documents or images of text into machine-readable text files. You can use OCR to create your own corpus for text analysis. Before you scan, make sure you understand the copyright status and permitted uses of the materials.
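
OCR output usually needs some cleanup before analysis. The Python sketch below assumes the OCR step has already produced plain-text (.txt) files in UTF-8, one per document, and shows one possible way to tidy them and gather them into a corpus. The function names clean_ocr_text and build_corpus and the specific cleanup rules are illustrative assumptions, not a standard procedure.

```python
import re
from pathlib import Path

def clean_ocr_text(raw):
    """Apply simple, lossy cleanup to OCR output (assumed to be plain text)."""
    # Rejoin words hyphenated across a line break: "recog-\nnition" -> "recognition".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Turn remaining single line breaks into spaces; blank lines stay as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

def build_corpus(folder):
    """Read every .txt file in `folder` into a {filename: cleaned text} mapping."""
    return {
        path.name: clean_ocr_text(path.read_text(encoding="utf-8"))
        for path in sorted(Path(folder).glob("*.txt"))
    }
```

Note that the hyphen rule is a trade-off: it also merges genuine compounds that happen to break across lines (for example "well-" followed by "known"), so it is worth spot-checking the cleaned text against the original scans.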