Tools and Services

Institute of Phonetics and Speech Processing, Munich:

  • WebMAUS
    The web service WebMAUS (CMDI) automatically segments and labels a speech recording into phonemes (SAMPA). The RESTful web services can be called from within applications (e.g. ELAN, WebLicht) or batch scripts. An interactive web interface is also available.
    YouTube tutorial.
  • SpeechRecorder
    SpeechRecorder is software for scripted speech and audio recordings. Its main features are:
    • platform independence
    • support for any audio hardware recognized by the operating system
    • text, image and audio prompting
    • separate views for the recording supervisor and the speaker
    • full support for Unicode
    • localized user interface
    In SpeechRecorder, a recording script consists of sections which in turn contain recording items. Sections are presented in sequence and give the recording its large-scale structure. Within a section, recording items can be recorded in sequence or in random order. A recording item defines the prompt to be displayed to the speaker, a unique item code, the recording duration, and a pre- and post-recording delay. Every recording is saved to a separate WAV file (versioning is supported), and every recording session is saved into a separate directory on the local disk. SpeechRecorder provides fine-grained control of the recordings via a traffic light: red means ‘do not speak’, yellow means ‘prepare to speak’, and green means ‘speak now’. Progress through the script can be manual, semi-automatic or fully automatic.
    The software is available under the LGPL 3.0 license.
    Webpage
    Wikipedia article
  • WikiSpeech
    WikiSpeech is a framework for scripted speech recordings via the WWW. Wherever there is an Internet connection, WikiSpeech can be used for structured recording sessions at high signal quality, without the need to install special software on the local computer. The number of simultaneous recordings is not limited, so that many recordings can be performed in parallel in many locations. WikiSpeech is based on the recording software SpeechRecorder and supports the creation and administration of recording projects. The BAS provides WikiSpeech as a service. Registered users define their own recording projects using a graphical user interface. For the recordings, they simply send a link pointing to their recording project to the speakers. A speaker logs in, fills in a form with the required metadata (e.g. gender, age), and then starts the recording session proper. During a recording session, all recorded data is transferred to the server in a background process, so that recording progress can be closely monitored on the server. When a recording session is done, WikiSpeech makes sure that all data has been transferred to the server and releases any local storage used during the recordings.
    WikiSpeech has been used successfully in speech data collections in Germany, Scotland, England, The Netherlands, and even Antarctica.
    Webpage
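The WebMAUS entry above notes that the RESTful services can be called from batch scripts. The following Python sketch assembles the pieces of such a request; the endpoint path and the parameter names (SIGNAL, TEXT, LANGUAGE, OUTFORMAT) are assumptions for illustration only, and the authoritative parameter list is in the BAS web services documentation.

```python
# Hypothetical sketch of preparing a WebMAUS REST request.
# Endpoint and parameter names are assumptions; check the BAS documentation.
BASE = "https://clarin.phonetik.uni-muenchen.de/BASWebServices/services"

def build_maus_request(audio_path, text_path, language="deu-DE"):
    """Return the URL, form fields and file fields for one MAUS run.
    An actual call would POST these as multipart/form-data."""
    fields = {"LANGUAGE": language,      # language of the recording
              "OUTFORMAT": "TextGrid"}   # assumed output-format parameter
    files = {"SIGNAL": audio_path,       # the speech recording (WAV)
             "TEXT": text_path}          # the orthographic transcript
    return f"{BASE}/runMAUSBasic", fields, files

url, fields, files = build_maus_request("rec.wav", "rec.txt")
```

A real invocation would then POST `fields` and `files` as a multipart form (e.g. with curl or an HTTP library) and receive the phonetic segmentation, e.g. as a Praat TextGrid, in the response.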

Berlin-Brandenburg Academy of Sciences and Humanities, Berlin:

DDC is a linguistic search engine developed at the BBAW. Large text corpora can be searched with the power of logical operators, distance operators and linguistic expressions that refer to the part-of-speech annotation of text words. A search yields a set of text fragments (concordance lines) that match the search criteria.

The POS tagger moot uses statistical methods for the disambiguation of lexical classes. In addition to the traditional bi-/trigram-based tagging methods, the tagger takes into account sets of possible analyses (lexical classes) for each input word. It is thus possible to restrict the tagger’s output to a set of analyses provided by e.g. a rule-based morphological component. This approach has been shown to reduce the error rate by up to 21 % compared with a traditional HMM.
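The idea of restricting an HMM tagger to the analyses licensed by a morphological component can be sketched as follows. This is a toy illustration, not moot itself: a bigram Viterbi search in which each token only considers the candidate tags supplied by a (here hard-coded) morphology, and all probabilities are invented.

```python
# Toy class-restricted HMM tagging: Viterbi over bigram transition and
# emission log-probabilities, with the search space limited per token.
import math

trans = {  # log P(tag | previous tag); invented toy numbers
    ("<s>", "DET"): math.log(0.6), ("<s>", "NOUN"): math.log(0.4),
    ("DET", "NOUN"): math.log(0.9), ("DET", "VERB"): math.log(0.1),
    ("NOUN", "VERB"): math.log(0.7), ("NOUN", "NOUN"): math.log(0.3),
    ("VERB", "NOUN"): math.log(0.5), ("VERB", "DET"): math.log(0.5),
}
emit = {  # log P(word | tag); invented toy numbers
    ("DET", "the"): math.log(0.9), ("NOUN", "can"): math.log(0.2),
    ("VERB", "can"): math.log(0.3), ("NOUN", "fish"): math.log(0.4),
    ("VERB", "fish"): math.log(0.3),
}

def viterbi(words, candidates):
    """Best tag sequence, restricted to each word's candidate tag set."""
    paths = {"<s>": (0.0, [])}          # tag -> (log-prob, best sequence)
    for w in words:
        new = {}
        for tag in candidates[w]:       # restriction: only the analyses
            best = max(                 # allowed by the morphology
                (lp + trans.get((prev, tag), float("-inf"))
                 + emit.get((tag, w), float("-inf")), seq)
                for prev, (lp, seq) in paths.items())
            new[tag] = (best[0], best[1] + [tag])
        paths = new
    return max(paths.values())[1]

# Candidate sets as a morphological analyser might deliver them:
cands = {"the": {"DET"}, "can": {"NOUN", "VERB"}, "fish": {"NOUN", "VERB"}}
print(viterbi(["the", "can", "fish"], cands))  # -> ['DET', 'NOUN', 'VERB']
```

Because "the" is restricted to DET, the tagger never wastes probability mass on impossible analyses, which is the effect the error-rate reduction above refers to.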

BBAW's person name recognizer for historical text follows a rule-based approach. It draws on large lexical resources such as a complete German morphology and large lists of person names. The recognizer is optimized for the recognition of German historical text. The implementation of the tool is based on finite-state techniques.


English Applied Linguistics, Translation and Interpreting, Saarbrücken:

SALTO is a graphical tool for the manual annotation of text corpora. It supports adding a second (typically semantic) annotation layer to an existing syntactic annotation. Originally it was used in the SALSA project for the annotation of semantic roles and semantic classes (FrameNet). The main features are: query-based selection of data sets for annotation; tag set definition; and annotation management (corpus distribution, quality control).

SemiNER is a semi-supervised named entity recognizer (with pre-trained models for German).

ANVIL is a free video annotation tool. It offers multi-layered annotation based on a user-defined coding scheme. During coding, the user can see color-coded elements on multiple tracks in time alignment. Some special features are cross-level links, non-temporal objects, timepoint tracks, coding agreement analysis and a project tool for managing whole corpora of annotation files.


The Hamburg Centre for Language Corpora, Hamburg:

EXMARaLDA ("Extensible Markup Language for Discourse Annotation") is a system of concepts, data formats and tools for the computer-assisted transcription and annotation of spoken language data, and for the creation and analysis of spoken language corpora. The EXMARaLDA system comprises the Partitur Editor, the Corpus Manager and the Analysis and Concordance Tool.

The EXMARaLDA Partitur Editor is used for the transcription and annotation of audio or video recordings. The EXMARaLDA data model allows multi-layer annotations and is best suited for discourse transcription. For widely used transcription systems (HIAT, GAT, CHAT) there is an integrated segmentation function and syntax checker. The Partitur Editor can be used with other tools such as Praat and TreeTagger. Apart from the EXMARaLDA format, there is an export function for e.g. EAF, FOLKER, Praat and TEI. Completed transcriptions are visualised in musical score format (e.g. for HIAT) or list format (e.g. for GAT, CA); further views and analyses (e.g. word lists) of the transcription data are also possible.

The EXMARaLDA Corpus Manager (Coma) is used to create, maintain and analyse spoken language corpora. Metadata on the corpus and its parts (communications, speakers, files) can be managed and analysed, the corpus can be checked for common errors, and commonly needed corpus data can be generated as lists and tables.

The EXMARaLDA Analysis and Concordance Tool (EXAKT) was created to meet the requirements of searching and analysing spoken language corpora. In addition to the classical concordance view (KWIC), selected results are visualised in their original transcription context, with aligned audio or video available. Metadata on the communication or the participating speakers can be displayed and used to filter or analyse the search results. Analyses can be saved and reopened for further elaboration.

TEI Drop is a tool for segmenting (tokenizing) and converting transcription data from various tools (EXMARaLDA, FOLKER, ELAN, Transcriber, CLAN) and various conventions (HIAT, cGAT) into a TEI-conformant exchange format.


Institute of German Language, Mannheim:

COSMAS II (Corpus Search, Management and Analysis System) is a full-text database for linguistic research on the corpora of the Institut für Deutsche Sprache (IDS). It provides access to the steadily growing German Reference Corpus (DeReKo: over 4 billion words from newspapers as well as fictional, non-fictional and specialized works from Germany, Austria and Switzerland, from 1772 to the present) and other written language corpora of the IDS.

CCDB is a corpus-oriented linguistic think tank for research and development. It focuses on modeling and data-guided exploration of language usage. Allowing users to empirically research linguistic phenomena, it is currently used in the formulation of new usage-related linguistic hypotheses, the modeling of semantic kinship and the visualization of aspects of linguistic usage.


Institute of Computer Science, Natural Language Processing Department, Leipzig:

The SOAP web services provide direct access to the data of the project Deutscher Wortschatz. The available services range from word frequencies, co-occurrences and left/right neighbours to synonyms and various other kinds of data.

ASV Toolbox is a modular collection of tools for the exploration of written language data. They work either on word lists or on text, and solve several linguistic classification and clustering tasks. The topics covered include language detection, POS tagging, base-form reduction, named entity recognition, and terminology extraction. On a more abstract level, the algorithms deal with various kinds of word similarity, using pattern-based and statistical approaches. The collection can be used to work on large real-world data sets as well as for studying the underlying algorithms. The ASV Toolbox can work on plain text files and connect to a MySQL database. While it is especially designed to work with corpora of the Leipzig Corpora Collection (http://corpora.informatik.uni-leipzig.de/), it can easily be adapted to other sources.


Institute for Natural Language Processing, Stuttgart:

Besides tokenizers for various languages, the following tools, which were developed by Helmut Schmid (SFST/SMOR, BitPar, TreeTagger) or Helmut Schmid and Florian Laws (RFTagger), are available as web services:

SFST (Stuttgart Finite State Transducer Tools) is a toolbox for the implementation of morphological analysers and other tools based on finite-state transducer technology.

SMOR is a wide-coverage implementation of the morphology of the German language (inflection, derivation and compounding). It was realized using SFST.

BitPar is a parser for highly ambiguous probabilistic context-free grammars (such as treebank grammars). It uses bit-vector operations to speed up the basic parsing operations by parallelization. Pertinent grammars for English (TRACE, based on the Penn Treebank) and German (based on the TIGER treebank) are provided.

The TreeTagger is a tool for annotating text with part-of-speech and lemma information. It has been successfully used to tag German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Greek, Portuguese, Chinese, Swahili, Latin, Estonian and Old French texts, and is adaptable to other languages if a lexicon and a manually tagged training corpus are available.

RFTagger is a POS tagger suitable for fine-grained POS tagsets which encode morpho-syntactic information. It is built on the idea of conceiving of POS tags as attribute vectors and estimating the probability of a POS tag in context as the product of the probabilities of its attributes. Methodically, the estimation is realized using decision trees, as is the case with the TreeTagger tool. RFTagger has been trained on data from German, Czech, Slovene and Hungarian.
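The attribute-vector decomposition described above can be illustrated with a small sketch (not RFTagger's actual code): a fine-grained tag such as "N.Nom.Sg" is split into its attributes, and its probability is the product of per-attribute probabilities. The attribute names and all numbers below are invented; in RFTagger the per-attribute probabilities come from decision trees conditioned on the context.

```python
# Illustrative only: fine-grained tag probability as a product of
# per-attribute probabilities (all values are invented toy numbers).
attr_probs = {
    "pos":  {"N": 0.8, "V": 0.2},               # part of speech
    "case": {"Nom": 0.5, "Acc": 0.3, "Dat": 0.2},
    "num":  {"Sg": 0.7, "Pl": 0.3},             # number
}

def tag_probability(tag):
    """P of a dotted fine-grained tag, e.g. 'N.Nom.Sg', as the product
    of the probabilities of its individual attribute values."""
    p = 1.0
    for attr, value in zip(("pos", "case", "num"), tag.split(".")):
        p *= attr_probs[attr][value]
    return p

print(tag_probability("N.Nom.Sg"))  # 0.8 * 0.5 * 0.7
```

The benefit of this factorization is that a tagset with hundreds of morpho-syntactic tags only requires estimates for a handful of attribute distributions, rather than one probability per full tag.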

Furthermore, the Bohnet parser of Bernd Bohnet (lemmatizer, tagger, dependency parser) and Anders Björkelund (semantic role labeller) will be integrated as web services into the CLARIN infrastructure, e.g. in WebLicht. They provide a pipeline of modules that carry out lemmatization, part-of-speech tagging, dependency parsing and semantic role labelling of a sentence. All components are data-driven, depend on pre-trained models, provide very high accuracy and are fast. The tools are language independent in the sense that they do not use lexica or a morphology that would provide extra language-specific data. Languages covered will include English, German and probably others (e.g. Czech and Chinese).

TIGERSearch is a search engine for retrieving information from a syntactic constituency treebank such as the TIGER corpus. It comes with an optional graphical user interface, so that the user can formulate queries in an intuitive manner by 'drawing' partial graphs. It also includes a powerful textual query language. The tool provides XML-based interfaces for advanced applications, both for corpus import and for the export of results. State-of-the-art technologies can be used to manipulate and to aggregate query results. Basic tools are included for frequency calculations, export of query results in KWIC (keyword in context) format, etc.
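As an illustration of the textual query language, a query asking for an NP node that directly dominates a noun might look like the following sketch (schematic; consult the TIGERSearch documentation for the exact syntax):

```
#np:[cat="NP"] & #n:[pos="NN"] & #np > #n
```

Here #np and #n are node variables bound to feature descriptions in square brackets, & conjoins constraints, and > denotes direct dominance in the tree.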


Max Planck Institute for Psycholinguistics, Nijmegen:

The Language Archiving Technology tools at the Max Planck Institute for Psycholinguistics include ELAN (for the creation of multimedia-based annotations), LEXUS (for the creation of lexica) and ARBIL (for the creation of IMDI and CMDI metadata) among others.

In addition, there is the Virtual Language Observatory, which includes a faceted metadata browser and exploration tools based on geographic information.


General Linguistics and Computational Linguistics section of the Department of Linguistics, Tübingen:

Please see the description of the WebLicht platform which includes many services from Tübingen and other centres.

CiNaViz

Which areas within Europe have the most cities with names ending in "ona"? This and other questions can be answered with CiNaViz, a CLARIN-D web application for searching and visualizing the distribution of city names (or parts of them) across Europe.
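The kind of query CiNaViz answers can be sketched in a few lines. CiNaViz itself is a web application over a large gazetteer; the city list below is a small invented sample for illustration only.

```python
# Toy version of a CiNaViz-style suffix query over (name, country) pairs.
# The data is a tiny invented sample, not CiNaViz's actual gazetteer.
cities = [("Barcelona", "ES"), ("Verona", "IT"), ("Cremona", "IT"),
          ("Munich", "DE"), ("Pamplona", "ES")]

def cities_with_suffix(suffix):
    """Return all (name, country) pairs whose name ends with the suffix."""
    return [(name, country) for name, country in cities
            if name.endswith(suffix)]

print(cities_with_suffix("ona"))
```

CiNaViz additionally plots each match on a map of Europe, so that regional clusters of a name ending become visible at a glance.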

CiNaViz is freely accessible and runs in any modern browser.


TüNDRA – Tübingen Annotated Data Retrieval Application

TüNDRA is a web-based tool for searching and visualizing treebanks. It provides access to both constituency and dependency treebanks, and simultaneously displays token- and segment-level annotations as well as non-syntactic connections between nodes. Users can choose from several tree-display formats according to their visualization needs.

TüNDRA targets diverse linguistic theories and schools of thought concerning annotation. It supports arbitrary features on tokens and constituents, as well as mixed-model data, i.e. dependency data with some constituency markup, constituency data with some dependency annotation, labeled edges, non-projecting trees, and secondary non-tree relations between any elements in the tree.

Queries use a variant of the TIGERSearch syntax originally provided in the widely used TIGERSearch treebank search and visualization program. All features and relations in the annotation are searchable, including features on tokens and constituents, hierarchical relations within syntax trees, surface token order, and ordering within constituents. All searchable relations support negation, so that users can query for nodes that do not match some particular criterion, such as not having a particular kind of descendant. TüNDRA also supports queries on surface order and tree distances bounded by numerical values, as well as restrictions on the number of children and tokens under a node.

TüNDRA is supported by most current browsers, and requires no installation of special software or plug-ins.