Data Management in the Humanities: Progress in the Standardisation of Metadata Formats for Language-related Research Data

In July 2019, the International Organization for Standardization (ISO) published a new standard that contributes significantly to the sustainable description of language-related research data during archiving. ISO 24622-2, "Component Metadata Specification Language", standardises the procedure for defining description schemas tailored to the requirements of specific types of research data.

When research data is archived, information about the data is collected and made available in a way that allows other researchers to find the data and to assess its relevance from the description. In addition, potential users can get an idea of how they could incorporate the data into their own research and use it to answer their own research questions. These descriptions are called metadata.

Experience shows that, because types of research data and research questions differ, it is very difficult to find an all-encompassing, universal pattern - or schema - according to which these descriptions can be created. For instance, psychological experiments (number of test persons, research question, free and bound variables, recording system, etc.) are described differently from collections of texts for grammatical investigations or for the creation of word embeddings (number of "words", language, length of texts, source of texts, age of texts, authors, etc.). Even libraries, despite their long tradition of describing books, use a variety of metadata formats, e.g. Dublin Core, MARC 21, PREMIS and MODS. Many metadata schemas have some fields - also called data categories - that resemble each other, as well as some areas where they differ.

In order to enable both an adequate description of research data and the use of similar metadata structures, a framework called Component Metadata Infrastructure (CMDI) was developed. For each type of research data, an appropriate description schema - a metadata profile - is created, with the possibility of reusing parts of metadata schemas created for other types of data. These parts are called components. For example, a component for persons (e.g. researchers who have created a record) might always contain a first name, a last name, an institution and an email address. The metadata schema for another type of research data can then refer to this component wherever a person is specified, so that this part of the schema does not have to be redefined. In addition, such metadata can be processed directly by various tools such as special search engines for research data, editors, workflow tools, etc.
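
The reuse of components across profiles can be illustrated with a small Python sketch. It is only a simplified model of the idea and not the specification language defined by the standard: the classes Element and Component, the method field_paths and all field names are hypothetical and chosen for illustration. A person component is defined once and then referenced by two profiles for different types of research data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A single metadata field (data category), e.g. a last name. Hypothetical."""
    name: str


@dataclass(frozen=True)
class Component:
    """A reusable group of elements and nested components. Hypothetical."""
    name: str
    elements: tuple = ()
    components: tuple = ()

    def field_paths(self, prefix=""):
        """Return all fields as slash-separated paths: own elements first,
        then the fields of each nested component."""
        base = f"{prefix}{self.name}"
        paths = [f"{base}/{element.name}" for element in self.elements]
        for child in self.components:
            paths.extend(child.field_paths(prefix=f"{base}/"))
        return paths


# Defined once: the person component from the example in the text.
PERSON = Component(
    name="Person",
    elements=(
        Element("firstName"),
        Element("lastName"),
        Element("institution"),
        Element("email"),
    ),
)

# Two profiles for different types of research data refer to the same
# PERSON component instead of redefining its fields.
EXPERIMENT_PROFILE = Component(
    name="PsychologicalExperiment",
    elements=(Element("researchQuestion"), Element("numberOfTestPersons")),
    components=(PERSON,),
)

TEXT_CORPUS_PROFILE = Component(
    name="TextCorpus",
    elements=(Element("language"), Element("numberOfWords")),
    components=(PERSON,),
)

if __name__ == "__main__":
    for profile in (EXPERIMENT_PROFILE, TEXT_CORPUS_PROFILE):
        print(profile.name)
        for path in profile.field_paths():
            print("  " + path)
```

Both profiles end up with the same four person fields under their own root, while the fields specific to each type of data remain separate. A tool that knows how to handle the person component can therefore handle it in either profile.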

The new ISO 24622-2 standard describes how to define these components so that they can be used in conjunction with other - possibly customised - components. It is thus an application of the underlying component model, which has already been standardised in ISO 24622-1 and forms the conceptual basis for the implementation.

Metadata experts from the CLARIN infrastructure initiative were significantly involved in the development of this standard. With the CLARIN Component Registry, CLARIN ERIC provides a reference implementation that employs the standard. In addition, CLARIN tools such as the VLO search engine, metadata editors such as COMEDI or ARBIL, and web service workflow tools such as WebLicht are already based on ISO 24622-2. Major contributions to the development were made by CLARIN participants from Germany, Greece, Austria and the Netherlands. Thorsten Trippel, German CLARIN expert for the German Institute for Standardisation (DIN), managed and coordinated the standardisation project.

With the publication of ISO 24622-2, which is based on ISO 24622-1, all CLARIN data centres automatically implement both international standards whenever they deliver valid CMDI metadata whose schemas are registered in the CLARIN Component Registry. The implementation of this standard will contribute to sustainable documentation and thus to a reliable infrastructure for language-related research data in the humanities.

Written by: Thorsten Trippel

