Parallel search using collections in distributed locations

The CLARIN Federated Content Search (CLARIN FCS) lets you search corpora regardless of which search engine each CLARIN center runs or where its resources are stored. To make its search engine reachable, each center implements a standard application interface: technically, a Contextual Query Language (CQL) interface built on the Search/Retrieve via URL (SRU) protocol.
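To give a rough idea of what such an interface looks like from the client side, the following minimal sketch builds an SRU `searchRetrieve` request URL with a CQL query. The parameter names (`operation`, `version`, `query`, `maximumRecords`) come from the SRU standard; the endpoint `https://example.org/sru` is a placeholder, and the protocol version `1.2` is an assumption, since individual centers may run a different version.

```python
from urllib.parse import urlencode


def build_sru_search_url(endpoint: str, cql_query: str, maximum_records: int = 10) -> str:
    """Build an SRU searchRetrieve URL for a CQL query.

    `endpoint` is a hypothetical SRU base URL; real endpoints differ per center.
    """
    params = {
        "operation": "searchRetrieve",  # standard SRU operation for searching
        "version": "1.2",               # assumed protocol version
        "query": cql_query,             # the CQL query, e.g. a single term
        "maximumRecords": str(maximum_records),
    }
    # Append with "&" if the endpoint already carries a query string.
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode(params)


if __name__ == "__main__":
    # Placeholder endpoint, not a real CLARIN center.
    print(build_sru_search_url("https://example.org/sru", "Prinz", maximum_records=5))
```

The Aggregator sends requests of this kind to the participating centers on your behalf, so you normally do not need to build them yourself.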

The FCS Aggregator provides a web interface that sends a single query to several search engines at once, then collects and displays the results. In a next step, the results can be processed linguistically, for example through automatic analysis with the tools in WebLicht.

Especially relevant for

  • Linguists
  • Computational linguists

Starting point:

A query, for example a word or a phrase.

Task:

Search a wide range of corpora in a single step, regardless of where they are stored. This requires querying the search engines of several CLARIN-D centers.

Solution:

Use the CLARIN Federated Content Search via the Aggregator and send the results to WebLicht for further linguistic processing.

Related CLARIN-D tools and services:

A short guide on how to perform a distributed search and process the search results using the CLARIN-D Infrastructure

How to use the CLARIN Federated Content Search Aggregator

  • Go to the Aggregator [https://clarin.eu/contentsearch/]
  • Type in a query, for example Prinz (German for "prince"), and click the search button with the magnifying glass symbol on the right. Ideally, refine the query by selecting the resource language and the corpora to search. You can also change the number of results per corpus.
  • To display the results in KWIC (Key Word in Context) format, click the "Display as Key Word in Context" button.
  • The results can be downloaded in several formats (CSV, Excel, TCF, plain text): click "Download" and then choose the desired format.
  • You can retrieve more results from a corpus by clicking on the "View" button (eye symbol) and then "... More Results".
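For readers unfamiliar with the KWIC format, the following minimal sketch shows the idea: each hit is displayed with a few words of left and right context. It is an illustration only, not the Aggregator's implementation; the function name `kwic`, the whitespace tokenization, and the sample sentence are assumptions made for this example.

```python
from typing import List, Tuple


def kwic(text: str, keyword: str, width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each hit.

    Tokenization is a plain whitespace split, which is a simplification.
    """
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if token.strip('.,;:!?"()').lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


if __name__ == "__main__":
    sample = "Der kleine Prinz lebt auf einem Planeten."
    for left, hit, right in kwic(sample, "Prinz", width=2):
        # Right-align the left context so the keywords line up in a column.
        print(f"{left:>20}  {hit}  {right}")
```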

Further result processing using WebLicht

  • Click "View" and then "Use Weblicht" on the right side. In the drop-down menu, click "Send to Weblicht".
  • You have to log in to use WebLicht. Shibboleth makes it possible to use your existing login data: choose your institution from the list (for example IDS) and log in with your account.
  • After logging in, you will be redirected to the WebLicht homepage.

How to process CLARIN FCS search results in WebLicht

  1. Click "Start Weblicht" and then "Start" to create a new tool chain.
  2. On the right side, you will find the search results from the Aggregator, already converted to TCF format. Click the "OK" button.
  3. The lower part of the window shows the tool chain, with the input data in TCF format in first position. The upper part lists the tools available for processing the data. Click the "i" button to get information about a tool.
  4. Choose a tool by double-clicking it, for example "SfS Tokenizer-OpenNLP". The tool is added to the tool chain automatically. Click "X" in the tool box to remove it from the tool chain.
  5. You can add further tools to extend the tool chain. Please add the following tools in this order:
    1. "IMS:TreeTagger"
    2. "Berlin:Person Name Recognizer"
  6. Now click "Run Tools".
  7. The results from each tool can be downloaded by clicking the download button (down-arrow symbol) at the bottom of the tool box.
  8. To visualize the results, click the button with the tree symbol next to the download button.