Text analysis, also known as text mining or distant reading and closely related to computational linguistics, is the process of using software to extract meaningful information from a body of text by identifying entities, patterns, and relationships. Text mining can often help address research questions about bodies of text so large that the questions are impossible (or extraordinarily difficult) to answer through traditional human reading alone, which is known as close reading. Text mining tools are meant to complement rather than replace traditional human-driven literary analysis. As with any research method, a content expert’s intervention is necessary to judge how meaningful the results of a text mining process are and to interpret those results responsibly.
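To make "identifying patterns" concrete, here is a minimal sketch of one of the simplest text mining operations, counting word frequencies. It uses only the Python standard library. The function name `word_frequencies`, the sample sentence, and the crude regular-expression tokenizer are illustrative assumptions, not part of any particular text mining tool.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=2):
    """Return the top_n most frequent lowercase word tokens in text."""
    # Crude tokenizer (an assumption for this sketch): lowercase the text
    # and keep runs of letters and apostrophes as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)


# Hypothetical sample text, used only for illustration.
sample = "The whale surfaced. The crew watched the whale, and the whale dove."
print(word_frequencies(sample))  # → [('the', 4), ('whale', 3)]
```

Note that the most frequent word here is "the", which says little about the text. Deciding which results are meaningful is exactly where the content expert's interpretation comes in.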
In a text analysis project, your corpus (plural: corpora) is your data. It is a structured set of texts that you have compiled in order to answer a research question. A corpus can be single-language or multi-language, can focus on a specific theme or genre, or can contain all the texts of a single author or time period.
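As a rough illustration of treating a corpus as data, the sketch below reads a folder of plain-text files into a dictionary keyed by filename and reports its size. It assumes one document per UTF-8 `.txt` file; the names `load_corpus` and `corpus_summary` and the two sample letters are hypothetical, and real corpora often come in other formats (XML, JSON, PDF) that need different handling.

```python
import tempfile
from pathlib import Path


def load_corpus(directory):
    """Read every .txt file in directory into a {filename: text} dict."""
    corpus = {}
    for path in sorted(Path(directory).glob("*.txt")):
        corpus[path.name] = path.read_text(encoding="utf-8")
    return corpus


def corpus_summary(corpus):
    """Return (number of documents, total whitespace-separated tokens)."""
    return len(corpus), sum(len(text.split()) for text in corpus.values())


# Demo: build a throwaway two-document corpus in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "letter_01.txt").write_text(
        "Dear friend, the harvest was good.", encoding="utf-8")
    Path(tmp, "letter_02.txt").write_text(
        "The winter came early this year.", encoding="utf-8")
    demo_corpus = load_corpus(tmp)

print(sorted(demo_corpus))         # → ['letter_01.txt', 'letter_02.txt']
print(corpus_summary(demo_corpus))  # → (2, 12)
```

Keeping the filename as the key preserves a link back to each source document, which helps when you later need to check a result against the original text.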
The following list includes many sources of high-quality corpora to help you get started.
Optical Character Recognition (OCR) is the electronic conversion of scanned documents or images of text into machine-readable text files. You can use OCR to create your own corpus. Before you scan, make sure you understand the copyright status and permitted uses of the materials.