Digital Scholarship Research Guide

What is Text Encoding

Text encoding is a powerful, flexible way of marking up (or “tagging”) a text so that it can be displayed, analyzed, researched, and reused in a variety of ways. Beyond flexible display, encoded texts can be analyzed programmatically and structured for distribution, and they form the basis of many digital editions. XML and TEI are two examples of text encoding options used in digital scholarship. After a text is encoded, languages such as XPath, XSLT, and XQuery can be used to help process, analyze, and interpret it.
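
As a rough illustration of what "encoding, then querying" can look like, here is a minimal sketch using Python's standard-library ElementTree module, which supports a limited subset of XPath. The sample letter and its tag names (letter, person, place) are invented for this example and do not follow any particular schema.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up encoded text. The tag names are hypothetical,
# chosen only to show how markup labels parts of a text.
SAMPLE = """
<letter>
  <sender>Ada</sender>
  <body>
    <p>I met <person>Charles</person> in <place>London</place>.</p>
    <p><person>Mary</person> sends her regards.</p>
  </body>
</letter>
""".strip()

root = ET.fromstring(SAMPLE)

# ".//person" is an XPath-style expression: every <person> element
# anywhere below the root, in document order.
people = [el.text for el in root.findall(".//person")]
places = [el.text for el in root.findall(".//place")]

# itertext() walks an element and yields its text without the tags,
# which gives back the plain reading text of the first paragraph.
first_paragraph = "".join(root.find("./body/p").itertext())

print(people)
print(places)
print(first_paragraph)
```

The same text serves two purposes here: the tags make the people and places retrievable as data, while stripping the tags recovers the sentence for display.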

Quick Introductions

Brief introduction to what text encoding is, how it is done, and the reasons for using XML and TEI by Colin Justin.

How to grow data forests with XML trees, a brief introduction to XML by Elisa Beshero-Bondar.

The Text Encoding Initiative (TEI)

The Text Encoding Initiative (TEI) is the primary text encoding language used in digital scholarship. An XML-based language, TEI is human-readable and standardized, and it helps a computer interpret a file for purposes such as display and analysis. TEI uses a standardized, controlled vocabulary that limits which tags can be used and how. Its tags designate types, relationships, and properties within the encoded text. TEI is extremely flexible and can be used to encode graphs, images, and a variety of other objects. Encoding in TEI also offers flexibility over time: as systems and standards for publishing online and interpreting texts change, a text encoded in TEI will need less adjustment.
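
To make the idea of tags that designate types and properties more concrete, the following is a small sketch of a TEI-style document read with Python's standard-library ElementTree. The sample content (title, names, date) is invented for illustration, and the snippet is not checked against a TEI schema; the element names used (teiHeader, persName, placeName, date) and the TEI namespace are part of the TEI vocabulary.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace, so queries must name it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# An invented sample; real projects would validate against a TEI schema.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A Sample Letter</title></titleStmt>
      <publicationStmt><p>Unpublished example.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p><persName ref="#ada">Ada</persName> wrote from
         <placeName>London</placeName> on
         <date when="1843-07-10">10 July 1843</date>.</p>
    </body>
  </text>
</TEI>
""".strip()

root = ET.fromstring(SAMPLE_TEI)

# The header carries metadata about the text, such as its title.
title = root.find(".//tei:titleStmt/tei:title", TEI_NS).text

# <persName> marks a type (a personal name); its ref attribute
# records a relationship to an identifier defined elsewhere.
names = [(el.get("ref"), el.text) for el in root.findall(".//tei:persName", TEI_NS)]

# The when attribute stores a machine-readable form of the date,
# a property that the displayed text alone does not make explicit.
dates = [el.get("when") for el in root.findall(".//tei:date", TEI_NS)]

print(title)
print(names)
print(dates)
```

Because the vocabulary is shared, a query such as ".//tei:persName" should work the same way across any TEI file that uses that element, which is part of what makes encoded texts reusable.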

Further Reading