Library Guides: Data management and sharing plan guide: 2. Document

Document

In this section, describe how your data will be documented, with particular attention to the standards your data will follow. Standards are agreed-upon methods or systems for accomplishing a task in a consistent way. For example, light bulbs and light sockets come in standardized shapes and sizes so they’re compatible with each other.

File format standards affect which programs can read and use data, while metadata standards provide a uniform way of classifying and structuring data, making it easier to remix and reuse. Lastly, you also need to address how the whole data set will be documented, as metadata alone is typically not enough to facilitate reuse.

File formats

File formats are a standardized way of formatting digital data so it can be read and used by computers consistently. Whenever possible, you should use open, non-proprietary file formats. This means using CSV (.csv) over Excel (.xlsx) and plain text (.txt) over Word (.docx). There are two reasons to do so:

Proprietary formats typically require proprietary software, which costs money and limits who can open and reuse the data.
As proprietary software is updated over time, newer versions may no longer support old files, limiting access to the data.

If you cannot save your data in an open, non-proprietary format then you will need to document the software, tools, and/or code needed to access the data and explain why you’ve chosen this format in your plan.
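
As a rough illustration of the open-format advice above, the following Python sketch writes a small table as CSV text and reads it back using only the standard library. The rows and column names (site_id, sample_date, temp_c) are hypothetical and are not taken from any particular data set.

```python
import csv
import io

# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"site_id": "A1", "sample_date": "2024-03-05", "temp_c": 12.4},
    {"site_id": "B2", "sample_date": "2024-03-06", "temp_c": 11.9},
]


def to_csv_text(records):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv_text(rows)

# Reading the data back needs nothing beyond a CSV parser.
round_trip = list(csv.DictReader(io.StringIO(csv_text)))
```

To write to disk instead, the same writer can be pointed at a file opened with open("data.csv", "w", newline="", encoding="utf-8"). Note that every value comes back from a CSV file as text, which is one reason the format needs accompanying documentation of data types and units.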

Metadata

A simple definition of metadata is that it is "data about data" (how meta!). A better definition is that metadata is structured information that describes a thing. For example, think of how you would describe a book: title, subject matter, length, author, date of publication, and so on. All of these are metadata that describe the book. In other words, metadata are the details that surround data.

Metadata is often formatted as data and may be in the same file or in a supplemental file. Importantly, metadata often follows standards. Good examples include Ecological Metadata Language (EML), which is a vocabulary used to describe ecological data; ISO 8601, which is an international standard for dates and times; and the MRLC Land Cover Classifications, which are a standard vocabulary for land usage in the United States. Using metadata standards such as these ensures data can be easily mixed and reused with similar data with minimal information loss.
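
To make the idea of metadata in a supplemental file concrete, here is a minimal Python sketch that builds a small metadata record and serializes it as JSON, with dates written in ISO 8601 form. The field names and values are hypothetical; they do not follow EML or any other formal standard, and a real project should use the field names its chosen standard defines.

```python
import json
from datetime import date, datetime, timezone

# Hypothetical metadata record; the field names are illustrative only
# and are not drawn from EML or any other formal metadata standard.
metadata = {
    "title": "Stream temperature samples",
    "creator": "Example Lab",
    # date.isoformat() produces an ISO 8601 calendar date (YYYY-MM-DD).
    "date_collected": date(2024, 3, 5).isoformat(),
    # A timezone-aware datetime includes the UTC offset in its ISO 8601 form.
    "last_modified": datetime(2024, 3, 6, 14, 30, tzinfo=timezone.utc).isoformat(),
}

# Serialize to JSON text that could be saved alongside the data file.
metadata_json = json.dumps(metadata, indent=2, sort_keys=True)
```

Saving metadata_json next to the data file, for example as a .json file with a matching name, keeps the description and the data together.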

Documentation

Documentation backs up and supports metadata by providing context. Documentation tells us the details of how the data was made and how different files relate. It provides details that might not be captured in the data and metadata alone. The most common kind of documentation is a readme – a simple text file with information to read first before using a data set.

Data dictionaries and codebooks are a more focused kind of documentation that defines variables, units, data types, and more. They often accompany spreadsheets. In other words, they define your data by providing metadata.

For example, a spreadsheet has columns and rows containing data. Column headers are often written in shorthand, so they need documentation to explain their full names and to give details such as measurement units, allowed ranges, and more. Similarly, your entire dataset needs documentation to help other people understand why it’s important and whether they can reuse it.
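
A data dictionary for such a spreadsheet can itself be stored as a small table. The Python sketch below is one possible layout, not a prescribed one: the columns, descriptions, and allowed ranges are hypothetical, and the helper undocumented_columns is an illustrative check that every spreadsheet header has an entry.

```python
import csv
import io

# Hypothetical data dictionary: one entry per spreadsheet column.
data_dictionary = [
    {"column": "site_id", "full_name": "Sampling site identifier",
     "data_type": "text", "units": "", "allowed_range": ""},
    {"column": "sample_date", "full_name": "Date the sample was taken",
     "data_type": "date (ISO 8601)", "units": "", "allowed_range": ""},
    {"column": "temp_c", "full_name": "Water temperature",
     "data_type": "decimal number", "units": "degrees Celsius",
     "allowed_range": "0 to 40"},
]


def dictionary_as_csv(entries):
    """Serialize the data dictionary to CSV text so it can ship with the data."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(entries[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented_columns(header, entries):
    """Return the spreadsheet headers that have no data dictionary entry."""
    documented = {entry["column"] for entry in entries}
    return [name for name in header if name not in documented]


dictionary_csv = dictionary_as_csv(data_dictionary)

# A spreadsheet with an extra "ph" column that nobody has documented yet.
missing = undocumented_columns(
    ["site_id", "sample_date", "temp_c", "ph"], data_dictionary
)
```

Running a check like this before sharing a data set is a simple way to catch shorthand column headers that were never explained.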

Resources

README Files | Harvard Biomedical Data Management
README File Checklist from Harvard Biomedical Data Management
What is a codebook? | ICPSR

Consider

What file formats will you use?
- Are they a standard in your field?
- Do they require proprietary software? If so, what is the justification for using them?
Which metadata standards will you use?
- How well known is the standard?
- Where is the standard described and who has access to it?
How will you document your data so others will be able to understand and reuse it?

Tips

Sometimes proprietary formats are unavoidable. Try to locate alternate free-to-use software when faced with this situation.
Creating and maintaining documentation helps future you understand your data, not just strangers.
Adopting the date and time standard ISO 8601 (for example, 2024-03-05) helps ensure that dates and times sort chronologically, even when software treats them as plain text.
Research papers do not document data sets, which is why metadata and documentation are required for effective data sharing.
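
The ISO 8601 tip above can be illustrated with a short Python sketch. The dates are made up, and the example assumes the originals are written in month/day/year order; it compares a plain text sort of that style with a text sort of the same dates in ISO 8601 form.

```python
from datetime import datetime

# Hypothetical dates written month/day/year, as commonly typed in spreadsheets.
us_style = ["03/05/2024", "12/01/2023", "01/15/2024"]

# Convert each date to ISO 8601 (YYYY-MM-DD).
iso_style = [
    datetime.strptime(text, "%m/%d/%Y").date().isoformat() for text in us_style
]

# A text sort of month/day/year dates puts December 2023 last,
# even though it is the earliest date.
sorted_us = sorted(us_style)

# A text sort of ISO 8601 dates is also chronological, because the
# largest unit (the year) comes first and every field has a fixed width.
sorted_iso = sorted(iso_style)
```

Here sorted_us ends with "12/01/2023", while sorted_iso begins with "2023-12-01". The same property holds for ISO 8601 timestamps as long as they all use the same precision and UTC offset.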

Writing prompts

USDA-NIFA

For scientific data to be readily accessible and usable it is critical to use an appropriate community-recognized standard and machine-readable formats when they exist. If the data will be managed in domain-specific workspaces or submitted to public databases, indicate that their required formats will be followed. Regardless of the format used, the data set must contain enough information to allow independent use (understand, validate and use) of the data.

NIH

State what common data standards will be applied to the scientific data and associated metadata to enable interoperability of datasets and resources, provide the name(s) of the data standards that will be applied, and describe how these data standards will be applied to the scientific data generated by the research proposed in this project. If applicable, indicate that no consensus standards exist.

Department of Energy

Researchers are encouraged to use the data standards, formats, and other common practices of their research community or scientific domain to ensure maximal interoperability and impact of data sharing.