Welcome


WebLicht services

Your NLP tool can be easily integrated into the WebLicht tool chain. Wrap it in a RESTful web service that accepts TCF input and/or produces TCF output, and you are done!
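
As an illustration, here is a minimal sketch of such a wrapper using JAX-RS. The path, the media type, and the annotate() helper are assumptions standing in for your own tool; the only requirement is a plain HTTP POST service that consumes and/or produces TCF.

    import javax.ws.rs.Consumes;
    import javax.ws.rs.POST;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;

    // Hypothetical service class wrapping an NLP tool for WebLicht.
    @Path("/annotate")
    public class MyToolService {

        @POST
        @Consumes("text/tcf+xml")   // TCF is XML; check the expected media type for your setup
        @Produces("text/tcf+xml")
        public String process(String tcfInput) {
            // Run the wrapped NLP tool on the incoming TCF document and
            // return the document with the new annotation layer(s) added.
            return annotate(tcfInput);
        }

        private String annotate(String tcf) {
            // Placeholder: parse the TCF, add your layer(s), serialize it back.
            return tcf;
        }
    }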




TCF format

The TCF format (Text Corpus Format) is used by WebLicht services as a machine-readable format for representing and exchanging linguistically annotated texts. It enables interoperability of linguistic tools. The WebLicht developer manual provides detailed information on the TCF format and its background.
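
To give an impression of what TCF looks like, here is a hand-written minimal TCF 0.4 document with only a text and a tokens layer; the exact element and attribute details should be checked against the specification below:

    <?xml version="1.0" encoding="UTF-8"?>
    <D-Spin xmlns="http://www.dspin.de/data" version="0.4">
      <MetaData xmlns="http://www.dspin.de/data/metadata"/>
      <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
        <text>Karin fliegt nach New York.</text>
        <tokens>
          <token ID="t_0">Karin</token>
          <token ID="t_1">fliegt</token>
          <token ID="t_2">nach</token>
          <token ID="t_3">New</token>
          <token ID="t_4">York</token>
          <token ID="t_5">.</token>
        </tokens>
      </TextCorpus>
    </D-Spin>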

  • Downloads:

    - Relax NG schema specifications for the latest TCF schema, together with a document example containing text corpus annotation layers: TCF specifications

    Since version 0.4.5, the TCF specifications have been hosted on GitHub. Please check there for the latest updates and more information.

  • Previous versions, change log:

    - TCF 0.4, 01 Feb 2016 (in MetaData: allow xsi:schemaLocation on parents of the CMD element)

    - TCF 0.4, 11 Feb 2015 (in TextCorpus: textSource layer added)

    - TCF 0.4, 16 May 2014 (in TextCorpus: textstructure layer extended with character offset information in textspans)

    - TCF 0.4, 08 May 2014 (in MetaData: incorporated CMD chain definitions from http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1320657629623/xsd)

    - TCF 0.4, 09 Dec 2013 (in TextCorpus: wsd (word senses) layer added; in Lexicon: entries layer replaces lemmas layer, syllabification, cooccurrences and synonyms layers added, minor modifications to other layers)

    - TCF 0.4, 26 Apr 2013 (in external data: namedentitymodel layer added)

    - TCF 0.4, 14 Mar 2013 (in textstructure layer: textspan element changed to represent a tree structure; a textspan can optionally contain a value or nest further textspan elements)

    - TCF 0.4, 20 Dec 2012 (TextCorpus lang attribute made required; in references layer, the value of the target attribute of the reference element changed from xsd:IDREF to xsd:IDREFS)

    - TCF 0.4, 26 Nov 2012 (textstructure layer extended)

    - TCF 0.4, 31 Jul 2012 (external data section added, geo layer extended, discourseconnectives layer added, minor change in matches layer)

    - TCF 0.4, 25 Jun 2012 (coreferences layer changed into references and extended to allow relations between references; parsing layer extended to allow edge labels; textstructure, orthography, geo layers added)

    - TCF 0.4, 30 Mar 2012 (change in coreferences layer; change in matches layer; change in most layers to allow for empty layers, in case a tool is run but no annotations are found)

    - TCF 0.4, 27 Jan 2012 (change in dependency layer: govIDs made optional)

    - TCF 0.4, 08 Dec 2011

    - TCF 0.4, 29 Sep 2011

    - XML Schema and Relax NG schema specifications for TCF 0.3, 03 Jul 2009

  • Validation:

    - Online validator for TCF

Before you develop a WebLicht service, check the latest TCF specification and decide which layers of annotation your service will consume and which layers it will produce. If you are developing a service that produces an annotation layer that is not part of the latest TCF specification, contact the CLARIN-D development group (info AT d-spin DOT org). New linguistic annotation layers can be integrated into TCF, and the specification will be updated accordingly.

After you have developed a WebLicht service, you can use the online validator to check whether your service output complies with the corresponding TCF schema. Compliance with the TCF schema ensures that your service is interoperable with other WebLicht services and tools.
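
You can also validate locally against the Relax NG schema, for example with the Jing validator library. This is a sketch under assumptions: the schema and document file names are placeholders (use the schema from the specification downloads above), and the exact Jing API should be checked against the version you use.

    import java.io.File;
    import com.thaiopensource.validate.ValidationDriver;

    public class TcfValidation {
        public static void main(String[] args) throws Exception {
            ValidationDriver driver = new ValidationDriver();
            // Load the TCF Relax NG schema (file name is an assumption).
            driver.loadSchema(ValidationDriver.fileInputSource(new File("d-spin_0_4.rng")));
            // Validate the TCF document produced by your service.
            boolean valid = driver.validate(ValidationDriver.fileInputSource(new File("output.tcf.xml")));
            System.out.println(valid ? "TCF document is valid" : "TCF document is NOT valid");
        }
    }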




Working with TCF documents


Create/read/write

We offer a library for TCF - Java binding (WLFXB), so that working with TCF data on the client and server side becomes an easy task. You may want to use this library if you integrate your tool into the WebLicht tool chain using Java; a short usage sketch follows the download list below.

  • Tutorials:

    - An up-to-date tutorial showing how to use the WLFXB library to read, create, and write TCF documents can be found in the TCF section of the WebLicht developer manual

    - The old TCFXB 0.3 tutorial for TCF 0.3 shows how to use the TCF 0.3 - Java object binding library to consume and produce TCF 0.3 documents using Java

  • Downloads:

    - The WLFXB library for TCF 0.4 - Java object binding can be found in the CLARIN EU repository, where you can also find the WLFXB test case sources with examples of how to read/write layers from/into TCF 0.4 using WLFXB

    - TCFXB 0.3, the library for TCF 0.3 - Java object binding, and its API
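
To illustrate the typical pattern, here is a condensed sketch of a processing step with WLFXB, modeled on the tutorial: it reads the tokens layer from an incoming TCF stream, adds a POS tags layer, and streams everything else through unchanged. The tagger call and file names are placeholders, and the exact class names and signatures should be checked against the WLFXB version you use.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.EnumSet;
    import eu.clarin.weblicht.wlfxb.io.TextCorpusStreamed;
    import eu.clarin.weblicht.wlfxb.tc.api.PosTagsLayer;
    import eu.clarin.weblicht.wlfxb.tc.api.Token;
    import eu.clarin.weblicht.wlfxb.tc.api.TokensLayer;
    import eu.clarin.weblicht.wlfxb.tc.xb.TextCorpusLayerTag;

    public class PosTaggingStep {
        public static void main(String[] args) throws Exception {
            // Read only the tokens layer; all other layers are streamed
            // through unchanged to the output.
            TextCorpusStreamed tc = new TextCorpusStreamed(
                    new FileInputStream("input.tcf.xml"),
                    EnumSet.of(TextCorpusLayerTag.TOKENS),
                    new FileOutputStream("output.tcf.xml"));

            TokensLayer tokens = tc.getTokensLayer();
            PosTagsLayer posTags = tc.createPosTagsLayer("STTS");
            for (int i = 0; i < tokens.size(); i++) {
                Token token = tokens.getToken(i);
                // tagOf() is a placeholder for your own POS tagger.
                posTags.addTag(tagOf(token.getString()), token);
            }
            tc.close(); // finalizes the new layer and closes both streams
        }

        private static String tagOf(String tokenString) {
            return "NN"; // placeholder tag
        }
    }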



Visualize

Here you can find applications for viewing NLP tool processing results contained in TCF documents in a user-friendly graphical interface.

  • Tutorials:

    - The TIEWER TCF 0.3 tutorial shows how to view and edit linguistic data in TCF 0.3 documents with the help of the TIEWER desktop application

  • Downloads:

    - TIEWER for TCF 0.3, a pilot desktop application for viewing and editing linguistic data in TCF 0.3

  • Online:

    - visual 0.3, a web application for viewing linguistic data in TCF 0.3

    - visual 0.4, a web application for viewing linguistic data in TCF 0.4

WebLicht - A Web-Based Analysis Tool


WebLicht is a web-based tool to semi-automatically annotate texts for linguistics and humanities research. It removes many of the time-consuming technical aspects of natural language processing - software installation, data format conversion, and error handling - so that users can immediately process their texts from any computer with a current web browser. Development on WebLicht started in October 2008 as part of CLARIN-D's predecessor project D-SPIN, and CLARIN-D is enhancing WebLicht to provide a uniform, distributed, accessible infrastructure for the largest possible research community.

WebLicht is a fully functional virtual research environment for text studies. It provides a user-friendly web interface for constructing and running custom processing chains of web services, each encapsulating a specific linguistic tool, to produce complete processing workflows. Extracting as much information as possible from a text is often a complex task in which multiple tools, written in different programming languages for different environments and using generally incompatible formats, must be run sequentially, or can run in parallel but require users to collate the output. WebLicht handles this complex, low-level processing by wrapping all the tools it provides as services in a framework that uses a multilayered, XML-based format to encode all inputs and outputs.

WebLicht assigns each token an identifier that follows it through all processing steps, so that every added annotation, whether lexical, syntactic, or semantic, is directly connected to the corresponding element of the text. Users have access to enhanced tools for visualizing annotated language data, including easy-to-use tables, tree-drawing suites, and enhancements for mapping locations, viewing search results in texts and annotations, and displaying segmented audio data.
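
In TCF terms, this ID mechanism looks roughly as follows: later layers refer back to tokens via their IDs instead of repeating the text (a hand-written illustration, not taken from an official example):

    <tokens>
      <token ID="t_1">fliegt</token>
    </tokens>
    <POStags tagset="stts">
      <tag tokIDs="t_1">VVFIN</tag>
    </POStags>
    <lemmas>
      <lemma tokIDs="t_1">fliegen</lemma>
    </lemmas>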


You can log in to WebLicht here if you have an account with the CLARIN Service Provider Federation. Detailed information can be found here.

Standards of Language Resources

The CLARIN Standards Guide gives an overview of the various standards and codes of practice in use in the area of language resources and tools, such as text corpora, lexicons, and language databases, and makes them publicly available.

Access to Standards and Guidelines

Standardization and the development of guidelines for the annotation, setup, transformation, and development of frameworks allow linguistic resources to be exchanged and reused between applications and users independently of the operating system. The CLARIN Standards Guide provides information on standards and guidelines that aim to support researchers in all questions related to research data management. It lets users evaluate and compare different annotation and metadata formats to find appropriate standards for their projects. Registered users may also download additional resources and examples of how the standards are applied. Users are also invited to contribute to the project by adding descriptions and information on standards they have expertise in.

>> See the CLARIN Standards Guide
 
 
 

Virtual Language Observatory

The Virtual Language Observatory (VLO) can be characterized as a search engine or a metadata-based portal for language resources. It uses information about data and tools harvested directly from the respective resource providers via the OAI-PMH protocol and is completely based on the Component Metadata (CMDI) and ISOcat standards. After conversion and post-processing, these data are provided in user-friendly form via the VLO web portal. At the time of writing (March 2013), about 360,000 resources are available through the VLO facet browser.


CLARIN-D User Handbook

The CLARIN-D user handbook contains guides for developing, preparing, and archiving one's own resources. It is therefore a reference for research projects both in the preparation phase and during the actual execution of the project.