A corpus (plural: corpora) is a large, structured set of texts used for text analysis. Corpora can be single-language or multi-language, can be organized around a theme, or can contain all the texts of a single author, time period, or genre.
Optical Character Recognition (OCR) is the electronic conversion of scanned documents or images of text into machine-readable text files. You can use OCR to create your own corpus for text analysis. Before scanning, make sure you understand the copyright status and permitted uses of the materials.
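As a rough illustration of how OCR output can become a corpus, the Python sketch below assumes the OCR step has already produced one plain-text `.txt` file per document in a single folder. The function names `load_corpus` and `word_counts` are hypothetical, chosen only for this example, and the tokenizer is deliberately simple (lowercase ASCII letters and apostrophes), so it is a starting point rather than a complete text-analysis workflow.

```python
import re
from collections import Counter
from pathlib import Path


def load_corpus(folder):
    """Read every .txt file in `folder` into a {filename: text} dictionary.

    Assumes the files are UTF-8 plain text, e.g. saved OCR output.
    """
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        corpus[path.name] = path.read_text(encoding="utf-8")
    return corpus


def word_counts(corpus):
    """Count lowercase word tokens across all documents in the corpus."""
    counts = Counter()
    for text in corpus.values():
        # Simplistic tokenizer: runs of ASCII letters and apostrophes only.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts
```

For example, `word_counts(load_corpus("ocr_output")).most_common(10)` would list the ten most frequent words, where `ocr_output` is a placeholder folder name. OCR output often contains recognition errors, so the counts are only as reliable as the underlying text.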