Lexical Resources

Below we present the lexical resources held by the CLARIN-D centers.

Lexical resources are collections of lexical items with additional linguistic information and/or classification of these items. Commonly encountered types of lexical items are words, multi-word units and morphemes.

Bayerisches Archiv für Sprachsignale, München:

The pronunciation dictionary PHONOLEX contains 1.6 million pronunciation codings of German words in SAM-PA. Most entries are derived from spoken language resources; a small proportion of entries also show phonetic realisations of the spoken words together with their frequency.

BAStat is a collection of German phoneme, syllable and word statistics based on spoken language resources.

Berlin-Brandenburgische Akademie der Wissenschaften, Berlin:

The DWDS dictionary (digital dictionary of the German language) is based on the digitized version of the „Wörterbuch der deutschen Gegenwartssprache“ (WDG), published in 6 volumes between 1961 and 1977. The DWDS dictionary is a revised version of this resource. It comprises rich lexicographical information for ca. 120,000 headwords. Audio files of their pronunciations have been provided for the majority of these headwords. In addition, all spelling variants (most of which result from the latest spelling reform) have been integrated.

The etymological dictionary is based on the digitized version of the two-volume „Etymologisches Wörterbuch des Deutschen“, edited by Wolfgang Pfeifer. This dictionary comprises ca. 22,000 entries with grammatical, semantic and etymological information.

The lexical database dlexdb has been built in a DFG-funded joint project by the Cognitive Psychology Department of the University of Potsdam and the BBAW (the DWDS project) and will be maintained and further developed by these institutions. It is a general lexical resource supporting research in the fields of experimental psychology, psycholinguistics, and general linguistics. As such, this resource is complementary to the CELEX DB. The data are derived from the core corpus of the DWDS. Frequency data on the superlexical level (n-grams), the sublexical level (morphemes and syllable structures) as well as on the lexical level are provided. Additionally, research-relevant properties such as contextual diversity and orthographic neighbourhood have been computed.
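To illustrate two of the properties mentioned above, the following Python sketch computes contextual diversity and orthographic neighbourhood on a toy corpus. It rests on assumptions that are not taken from the source: contextual diversity is taken to be the number of documents in which a word occurs, and an orthographic neighbour is taken to be a word of the same length that differs in exactly one letter. The corpus and the function names are hypothetical and do not reflect how the database itself is implemented.

```python
from collections import defaultdict


def contextual_diversity(documents):
    """Map each word to the number of documents it occurs in (assumed definition)."""
    diversity = defaultdict(int)
    for doc in documents:
        # Use a set so that a word counts at most once per document.
        for word in set(doc.lower().split()):
            diversity[word] += 1
    return dict(diversity)


def orthographic_neighbours(word, vocabulary):
    """Return same-length vocabulary words that differ from `word` in exactly one letter."""
    return sorted(
        other
        for other in vocabulary
        if len(other) == len(word)
        and sum(a != b for a, b in zip(word, other)) == 1
    )


# Hypothetical toy corpus, for illustration only.
documents = [
    "das haus ist alt",
    "die maus ist im haus",
    "die laus und die maus",
]
diversity = contextual_diversity(documents)
vocabulary = set(diversity)
neighbours = orthographic_neighbours("haus", vocabulary)
```

In this toy example, "haus" occurs in two of the three documents, and its neighbours are the other four-letter words that differ from it in the first letter only.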

Institut für Deutsche Sprache, Mannheim:

OWID is a portal for scientific, corpus-based lexicography at the Institut für Deutsche Sprache (IDS). It includes online German dictionaries devoted to various scientific fields, as well as a bibliography on digital lexicography and online dictionaries (OBELEX). OWID is the central gateway for Internet lexicography at the IDS.

Institut für Informatik, Abteilung Automatische Sprachverarbeitung, Leipzig:

The project Deutscher Wortschatz aims at documenting the usage of the German language. The content of the Wortschatz portal can be characterized as a collection. Since 1999, texts from newspaper portals, Wikipedia and other sources have been automatically collected and split into single sentences. Multiple language-independent, mostly statistical methods are used to calculate data such as word frequency, frequency class, and sentence and direct-neighbour co-occurrences. In addition to the German portal, an international portal provides access to monolingual lexicons that contain the typical Wortschatz data for over 90 languages.
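The kind of data described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: sentence splitting and tokenization are deliberately naive, all function names are hypothetical, and the frequency class formula (the rounded-down base-2 logarithm of the ratio between the most frequent word's count and the word's own count) is an assumed definition.

```python
import math
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Naively split text at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase a sentence and return its word tokens."""
    return re.findall(r"\w+", sentence.lower())


def corpus_statistics(text):
    """Count word frequencies and two kinds of co-occurrence."""
    word_freq = Counter()
    neighbour_cooc = Counter()  # ordered pairs of directly adjacent words
    sentence_cooc = Counter()   # unordered pairs occurring in the same sentence
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        word_freq.update(tokens)
        neighbour_cooc.update(zip(tokens, tokens[1:]))
        sentence_cooc.update(combinations(sorted(set(tokens)), 2))
    return word_freq, neighbour_cooc, sentence_cooc


def frequency_class(word, word_freq):
    """Assumed definition: floor(log2(f_max / f_word)); the top word has class 0."""
    f_max = max(word_freq.values())
    return math.floor(math.log2(f_max / word_freq[word]))


# Hypothetical three-sentence corpus.
text = "Der Hund bellt. Der Hund rennt. Die Katze rennt."
word_freq, neighbour_cooc, sentence_cooc = corpus_statistics(text)
```

Sentence co-occurrences are stored as alphabetically sorted pairs so that each unordered pair has a single key, while neighbour co-occurrences keep word order because left and right neighbours are usually distinguished.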

Words and phrases mentioned in selected newspapers, the „words of the day“, are extracted on a daily basis. The relevance of a word is calculated by comparing its frequency during a limited observation period to its long-term moving average. The archive for German contains data from April 2002 to the present. Norwegian data has been available since March 2006.
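The relevance calculation can be sketched as follows. The source does not give the exact formula, so this example assumes a simple ratio between the mean count in the observation period and the moving average over the preceding days, with additive smoothing; the function names, parameters and counts are all hypothetical.

```python
def relevance(daily_counts, window=30, observation=1, smoothing=1.0):
    """Ratio of the mean count in the last `observation` days to the moving
    average over the `window` days before them. Additive smoothing avoids
    division by zero for words with no earlier occurrences."""
    if len(daily_counts) < window + observation:
        raise ValueError("not enough history")
    recent = daily_counts[-observation:]
    history = daily_counts[-(window + observation):-observation]
    recent_mean = sum(recent) / observation
    history_mean = sum(history) / window
    return (recent_mean + smoothing) / (history_mean + smoothing)


def words_of_the_day(counts_by_word, top_n=3, **kwargs):
    """Rank words by relevance, highest first."""
    scored = {w: relevance(c, **kwargs) for w, c in counts_by_word.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]


# Hypothetical daily counts (oldest first): five days of history plus today.
counts_by_word = {
    "und": [100, 98, 102, 101, 99, 100],
    "wahl": [2, 1, 3, 2, 2, 40],
    "wetter": [10, 12, 9, 11, 8, 15],
}
ranking = words_of_the_day(counts_by_word, top_n=2, window=5)
```

A very frequent word such as "und" scores close to 1 because its count today matches its long-term average, whereas a word whose count jumps well above its average rises to the top of the ranking.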

Institut für Maschinelle Sprachverarbeitung, Stuttgart:

The IMSLex dictionary database contains information on inflection, word formation, and valence for tens of thousands of German base forms. IMSLex has been used to derive specialized lexicon data for various applications in natural language processing, information retrieval, and information extraction. In order to keep the dictionary data as flexible as possible, it has been encoded in XML. However, for efficiency reasons, the data is stored in a relational database. The lexical data itself has been built up semi-automatically with the help of specialized text mining methods from corpus linguistics.

Seminar für Sprachwissenschaft, Abt. Computerlinguistik, Tübingen:

GermaNet is a lexical-semantic resource for German modeled after the Princeton WordNet for English. It combines several functions of a dictionary, a thesaurus, and a linguistic ontology to create a digital resource useful for a variety of language processing and analysis tasks, particularly word-sense disambiguation and modeling semantic categories.

GermaNet is a wordnet of German adjectives, nouns, and verbs, grouped into sets of conceptual synonyms or near-synonyms (called synsets), each expressing a distinct concept. Lexical entries and synsets are tied together in a network of cognitive and linguistic relations, such as antonymy, hypernymy and hyponymy, and meronymy (part-of relations). The resource is stored in a relational database and distributed in XML format to provide the widest accessibility for computational tools. GermaNet is partly integrated into EuroWordNet, so that synsets with equivalents in other EuroWordNet-compatible resources can be automatically aligned. It also contains subcategorization frame information and example sentences for verbs.
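The structure of such a wordnet can be illustrated with a small in-memory model. The sketch below is not GermaNet's XML schema or any official API; the `Synset` class, the identifiers and the entries are hypothetical and only show how synsets linked by hypernymy form a navigable network.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    synset_id: str
    lexical_units: list                             # orthographic forms of the near-synonyms
    hypernyms: list = field(default_factory=list)   # ids of more general synsets


def hypernym_chain(synset_id, synsets):
    """Follow the first hypernym link upwards until a root synset is reached.
    Assumes the hypernym links contain no cycles."""
    chain = []
    current = synsets[synset_id]
    while current.hypernyms:
        current = synsets[current.hypernyms[0]]
        chain.append(current.lexical_units[0])
    return chain


def hyponyms(synset_id, synsets):
    """Hyponymy is the inverse of hypernymy, so it is derived by scanning all synsets."""
    return sorted(s.synset_id for s in synsets.values() if synset_id in s.hypernyms)


# Hypothetical ids and entries; not taken from GermaNet.
synsets = {
    s.synset_id: s
    for s in [
        Synset("s1", ["Lebewesen"]),
        Synset("s2", ["Tier"], ["s1"]),
        Synset("s3", ["Hund"], ["s2"]),
        Synset("s4", ["Katze"], ["s2"]),
    ]
}
chain = hypernym_chain("s3", synsets)
```

Storing only the hypernym direction and deriving hyponyms on demand keeps the toy data free of redundant links; a real resource would typically store or index both directions for speed.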

Work on GermaNet started in 1997 at the Department of General and Computational Linguistics at the University of Tübingen and is still in progress. The current version of GermaNet (release 8.0, as of April 2013) contains 84,584 synsets and 111,361 lexical units, connected by 96,925 relations.