ChatCorpus2CLARIN: Integration of the Dortmund Chat Corpus into CLARIN-D

Project content

In the third curation project of the CLARIN-D working group 1 “German Philology” (F-AG 1), an existing corpus of computer-mediated communication (CMC), the Dortmund Chat Corpus, and samples of other CMC resources will be restructured to conform to current standards for the representation of corpora in the Digital Humanities. The main goal of this work is to pave the way for the inclusion of linguistically annotated CMC resources in the CLARIN-D corpus infrastructures and to create the prerequisites for investigating linguistic peculiarities of CMC with state-of-the-art corpus technology. To this end, the project will (1) transform the metadata and the annotations of the chat corpus into a TEI-compliant format, (2) enrich the data with further linguistic annotations, and (3) integrate the resulting resource into the CLARIN-D corpus infrastructures at the Institute for the German Language (IDS) and the Berlin-Brandenburg Academy of Sciences (BBAW).

Integration into CLARIN-D will allow for a systematic corpus-based comparison of CMC discourse with the language of edited text (as represented in the text corpora at BBAW and IDS) and with spoken conversation (as represented in the spoken language corpora at IDS).

The Dortmund Chat Corpus

The Dortmund Chat Corpus (Beißwenger & Storrer 2008; Beißwenger 2013) was built at TU Dortmund University. The goal of the corpus project was to create a resource for researching the peculiarities of, and the linguistic variation in, written computer-mediated communication. The corpus comprises 478 logfile documents with about 140,000 postings and about 1 million tokens of German chats from different application contexts (social chats, advisory chats, chats in the context of learning and teaching, moderated chats in media contexts). The corpus has been annotated using an XML format (“ChatXML”) that represents (1) the basic structure and properties of chat logfiles and postings, (2) selected “netspeak” phenomena such as emoticons, interaction words, addressing terms, nicknames and acronyms, and (3) selected metadata about the chat users. Since 2005, the corpus has been available at http://www.chatkorpus.tu-dortmund.de as an XML version for download and offline querying and as an HTML version for online browsing. It has been widely used as a resource for studying and teaching the peculiarities of German CMC discourse. In the CLARIN-D context, it has been used as one of the resources of the working group 7 curation project “Linguistic Annotation of Non-standard Varieties – Guidelines and Best Practices”.

Work packages

(1) TEI representation: The corpus will be represented in TEI using the schema drafts and models developed in the TEI special interest group “Computer-mediated communication”, which is working on a proposal for a TEI standard for CMC genres. The ChatXML annotations of the original resource will be transformed into a TEI representation and enriched with additional structural annotations and metadata.
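
A transformation of this kind can be sketched in a few lines of Python. The snippet is only an illustration: the ChatXML-side names (logfile, message, nickname) and the TEI-side names (body, post, who) are assumptions made for this example and do not reproduce the actual ChatXML schema or the schema drafts of the TEI special interest group.

```python
import xml.etree.ElementTree as ET

# Hypothetical ChatXML-like input; the real ChatXML schema may differ.
CHATXML_SAMPLE = """
<logfile>
  <message nickname="anna">hallo zusammen :-)</message>
  <message nickname="ben">hi anna</message>
</logfile>
"""


def chatxml_to_tei(chatxml: str) -> ET.Element:
    """Map each hypothetical <message> onto a TEI-style <post> element."""
    source = ET.fromstring(chatxml)
    body = ET.Element("body")
    for number, message in enumerate(source.iter("message"), start=1):
        post = ET.SubElement(body, "post")
        # Keep the original order of postings as a running number.
        post.set("n", str(number))
        # Point to the author by a local reference built from the nickname.
        post.set("who", "#" + message.get("nickname", "unknown"))
        post.text = (message.text or "").strip()
    return body


tei_body = chatxml_to_tei(CHATXML_SAMPLE)
print(ET.tostring(tei_body, encoding="unicode"))
```

A real conversion would additionally have to carry over the netspeak annotations and the user metadata, which this sketch leaves out.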

(2) Additional linguistic annotations: In order to enhance the possibilities for linguistic querying, a layer of part-of-speech (PoS) annotations will be added. PoS tagging will use a version of the STTS tagset for German that has been adapted to the linguistic peculiarities of CMC (“STTS 2.0”, Beißwenger et al. 2015) and is compatible with the extended STTS tagset used for PoS tagging of the FOLK corpus of spoken language at IDS Mannheim. For automatic PoS tagging, the project cooperates with the BMBF project “Schreibgebrauch” at the University of Saarbrücken.
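
To give an idea of what such a PoS layer makes possible, the following Python sketch queries a single hand-tagged toy posting by tag. The posting is invented, and the tag names (ADR, PDS, VVFIN, EMOASC, AKW) are given for illustration only; they should be checked against the guideline document (Beißwenger et al. 2015) and are not output of the project's tagger.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    pos: str


# Invented, hand-tagged toy posting; the tags are illustrative assumptions.
posting = [
    Token("@anna", "ADR"),    # addressing term
    Token("das", "PDS"),      # demonstrative pronoun
    Token("stimmt", "VVFIN"),  # finite full verb
    Token(":-)", "EMOASC"),   # ASCII emoticon
    Token("grins", "AKW"),    # interaction word
]


def forms_with_tag(tokens, tag):
    """Return the word forms of all tokens that carry the given PoS tag."""
    return [token.form for token in tokens if token.pos == tag]


tag_counts = Counter(token.pos for token in posting)
print(forms_with_tag(posting, "EMOASC"))
print(tag_counts.most_common())
```

Because the CMC-specific categories are expressed as tags in the same layer as the standard STTS categories, the same kind of query can in principle be run on any other STTS-tagged resource.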

(3) Integration into CLARIN-D: The integration of the resource into the CLARIN-D infrastructures comprises its hosting at the CLARIN-D centres BBAW and IDS and its ingestion into the centres' respective repositories for long-term data archiving. It also comprises the development of a CMDI representation of the resource's metadata, which will be harvestable via OAI-PMH and accessible from the CLARIN VLO (Virtual Language Observatory). The resource will be addressable via PIDs, searchable in the CLARIN-D Federated Content Search, and accessible via web services. The licensing conditions for scientific use of the corpus will be defined on the basis of a legal expert opinion that is currently being sought.
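
Harvesting via OAI-PMH means that a client sends plain HTTP requests with a small set of standard parameters. The Python sketch below builds a ListRecords request. The endpoint URL is a placeholder, since the actual repository addresses are not given here, and the metadata prefix "cmdi" is an assumption; a given repository may advertise a different prefix, which can be looked up with the ListMetadataFormats verb.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption); not the address of a real repository.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="cmdi", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for the given endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional restriction to one set (collection) of the repository.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


request_url = list_records_url(BASE_URL)
print(request_url)
```

The response to such a request is an XML document containing the metadata records, which a harvester such as the VLO can then index.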

The Target Resource

After its integration into the CLARIN-D infrastructure, the resource will offer the following added value:

advanced accessibility and retrieval options (incl. retrievability through metadata);
interoperability with other corpus resources that are represented in TEI and with annotation and analysis tools that support the TEI format;
advanced querying options (PoS tags, normalized spellings);
interoperability with other corpus resources that have been tagged with STTS;
advanced options for corpus-based analyses on the peculiarities of CMC discourse as compared to the language of edited text and of spoken language, using the text and speech corpora which are already available in the corpus infrastructures of BBAW and IDS.

Duration

01.05.2015 – 29.02.2016

Applicants

Prof. Dr. Michael Beißwenger (now University of Duisburg-Essen)
Prof. Dr. Angelika Storrer (University of Mannheim)

Responsible Institutions

Cooperation Partners

Executive Staff

Prof. Dr. Michael Beißwenger (University of Duisburg-Essen)
Eric Ehrhardt, M.A. (University of Mannheim)
Axel Herold (BBAW)
Dr. Harald Lüngen (IDS Mannheim)
Prof. Dr. Angelika Storrer (University of Mannheim)

References

Beißwenger, Michael; Ehrhardt, Eric; Horbach, Andrea; Lüngen, Harald; Steffen, Diana; Storrer, Angelika (2015): Adding Value to CMC Corpora: CLARINification and Part-of-Speech Annotation of the Dortmund Chat Corpus. In: Proceedings of the 2nd Workshop on Natural Language Processing for Computer-Mediated Communication / Social Media (NLP4CMC2015). Essen, pp. 12-16. Online (PDF): https://sites.google.com/site/nlp4cmc2015/proceedings
Beißwenger, Michael (2013): Das Dortmunder Chat-Korpus. In: Zeitschrift für germanistische Linguistik 41 (1), 161-164. Extended version: http://www.linse.uni-due.de/tl_files/PDFs/Publikationen-Rezensionen/Chatkorpus_Beisswenger_2013.pdf
Beißwenger, Michael; Ermakova, Maria; Geyken, Alexander; Lemnitzer, Lothar; Storrer, Angelika (2012): A TEI Schema for the Representation of Computer-mediated Communication. In: Journal of the Text Encoding Initiative (jTEI) 3. http://jtei.revues.org/476 (DOI: 10.4000/jtei.476).
Beißwenger, Michael; Ermakova, Maria; Geyken, Alexander; Lemnitzer, Lothar; Storrer, Angelika (2013): DeRiK: A German Reference Corpus of Computer-Mediated Communication. In: Literary and Linguistic Computing 2013 (doi: 10.1093/llc/fqt038). https://academic.oup.com/dsh/article-lookup/doi/10.1093/llc/fqt038
Beißwenger, Michael; Bartz, Thomas; Storrer, Angelika; Westpfahl, Swantje (2015): Tagset und Richtlinie für das PoS-Tagging von Sprachdaten aus Genres internetbasierter Kommunikation. Guideline Document, Dortmund 2015. https://sites.google.com/site/empirist2015/home/annotation-guidelines
Beißwenger, Michael; Storrer, Angelika (2008): Corpora of Computer-Mediated Communication. In: Lüdeling, Anke; Kytö, Merja (eds.): Corpus Linguistics. An International Handbook. Vol. 1, Berlin (de Gruyter), 292-308.