Word-level comparative text analysis

Many questions in the humanities that relate to specific text resources can be reduced to an analysis of vocabulary. Comparing vocabularies is of particular interest. This may mean comparing two of your own text resources, or comparing one text resource with a reference corpus. CLARIN makes such comparative analyses easy to perform with the resources and web tools it provides. This guide demonstrates the process with a simple example. The showcase covers the discovery and selection of resources, their processing, and finally their analysis. The aim is to show scholars how to answer their own research questions with comparative text analysis in CLARIN.

Especially interesting for:

All scholars who compare texts or vocabulary, including:

Scholars from the historical sciences
Scholars from the political sciences
Scholars from all philologies

Requirements:

At least two texts are available.

Aim:

Compare the vocabulary used in the texts to find fundamental differences.

Solution:

Using the CLARIN-D infrastructure, a comparative analysis of the vocabulary of the texts can easily be conducted.

Related CLARIN-D projects:

DiaCollo: collocation analysis in diachronic perspective
Parallel search using collections in distributed locations

A Short Guide to comparative analysis of vocabulary

Search for Text Resources and Processing:

Select text resources for analysis

Entry point for resource search: VLO (Virtual Language Observatory), CLARIN's search engine for language resources.
Search for: enter "English Newspaper" in the search field.
Refinement: under "Resource Type", select "Written Corpus".
Example selection: English newspaper corpus from 2012 containing 3 million sentences.

Thematic restriction of the resource

Browse the content of the text resource: Click on "Plain text search via Federated Content Search".
Example search: "Europe"
Example selection: Display 250 hits

Processing of the text resource

Processing of the output by applying WebLicht to the search results:
1. Click on "View"
2. Click on "Use WebLicht"
3. Click on "Send To WebLicht"
Log in to WebLicht:
- A list of European academic research institutions appears.
- Search for your research institution. If you have no AAI-enabled account at your own institution, select 'clarin.eu website account'.
- You will see the login page of your research institution.
- Log in with your credentials; usually this is your university account.
- You will see the WebLicht interface.
Check the data to be used: In the "Upload" section you'll see the file name of the data provided. Click "OK".
Analyze the vocabulary of the texts by double-clicking the following web services:
- Tokenization: IMS:Tokenizer (Stuttgart)
- POS-Tagging: SfS: POS Tagger - OpenNLP (Tübingen)
- Click on "Run Tools"
- Save the results ("Save Result"):
  - Go to the last web service in the list at the bottom
  - You will see four icons below the line on the right side
  - Click on the arrow pointing down to download the result
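
The downloaded result is a TCF file, which is an XML document. If you want to inspect it locally before uploading it, the following minimal Python sketch reads the tokens and their part-of-speech tags. It assumes that the file contains a "tokens" layer with "token" elements carrying an "ID" attribute and a "POStags" layer with "tag" elements carrying a "tokenIDs" attribute; elements are matched by local name so that the sketch does not depend on a particular namespace URI. The function names and the file name "result.xml" are illustrative and not part of WebLicht.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip an XML namespace such as '{http://...}token' -> 'token'."""
    return tag.rsplit("}", 1)[-1]


def read_tcf(source):
    """Return a list of (token, pos_tag) pairs from a TCF document.

    `source` is a file name or a file object. The layer and attribute
    names used here are assumptions about the TCF layout (see lead-in).
    """
    root = ET.parse(source).getroot()
    order = []    # token IDs in document order
    tokens = {}   # token ID -> surface form
    tags = {}     # token ID -> part-of-speech tag
    for layer in root.iter():
        name = local_name(layer.tag)
        if name == "tokens":
            for tok in layer:
                tid = tok.get("ID")
                tokens[tid] = tok.text or ""
                order.append(tid)
        elif name == "POStags":
            for tag in layer:
                for tid in (tag.get("tokenIDs") or "").split():
                    tags[tid] = tag.text or ""
    return [(tokens[tid], tags.get(tid)) for tid in order]


def vocabulary(pairs, pos_prefix=None):
    """Count lower-cased tokens, optionally only those whose tag
    starts with `pos_prefix` (for example "NN" in a Penn-style tagset)."""
    return Counter(
        tok.lower()
        for tok, tag in pairs
        if pos_prefix is None or (tag or "").startswith(pos_prefix)
    )


# Hypothetical usage with a file saved from WebLicht:
# pairs = read_tcf("result.xml")
# print(vocabulary(pairs, pos_prefix="NN").most_common(20))
```

Which tag names appear depends on the tagset of the POS tagger that was chosen, so the "NN" prefix is only an example.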

Comparative analysis

The actual analysis of the data is conducted with the help of the web application CorpusDiff. The application allows for the comparison of the vocabulary of two or more text resources.

Import the resource

Click on: "Upload Own Corpus"
Load the file that was previously created by WebLicht:
- Click on "Select a file on your computer or drop it here"
- Select the previously created file for upload
- Alternative: depending on your browser and operating system, you can drag a file from your file manager into this area
- Click on "Upload"
Make sure that TCF (Text Corpus Format), the output format of WebLicht, is selected as "File Type".
If you want to skip the preprocessing and use a sample file for the comparative analysis, you can use this file.

Conduct your own analysis

"Configuration"
- Select the imported file
- Choose a reference corpus (for example, a news corpus or a Wikipedia corpus of the same year)
- Optionally, select additional or different corpora; all selected corpora are then compared pairwise.
Enter a "job title" for your analysis.
Press the "Compute" button.

Evaluation

Under "Job Selection" click on the completed analysis
The matrix shows the pairwise similarities between text resources with values between 0 (dissimilar corpora) and 1 (identical corpora).
Click a field of the matrix to see further results that describe the different uses of vocabulary in the texts:
- Lists of words that occur (relatively) much more frequently in one of the two text resources
- Lists of vocabulary that occurs in only one of the texts
- Restrict the displayed results to individual parts of speech, for example nouns or proper nouns
- Example: When comparing the Europe-related texts with the contents of Wikipedia, words such as crisis, debt or fears are prominent, which reveals the thematic focus of the previously generated text resource.
At this point, other analyses are also possible. One possibility would be to compare texts by different authors, texts from different sources (news and Wikipedia) or corpora of individual years in order to identify typical vocabulary or topics.
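
To make the evaluation step more concrete, the following Python sketch computes comparable results for two token lists: a similarity value between 0 and 1, a list of words that are relatively much more frequent in one text, and a list of words that occur in only one text. This guide does not state which similarity measure CorpusDiff uses; cosine similarity over relative word frequencies is used here purely as a stand-in, and all function names and the threshold "min_ratio" are illustrative assumptions.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each lower-cased word to its share of all tokens."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def cosine_similarity(freq_a, freq_b):
    """Similarity between 0 (no shared vocabulary) and 1 (identical
    frequency profile). A stand-in measure, not necessarily CorpusDiff's."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def distinctive_words(freq_a, freq_b, min_ratio=2.0):
    """Words present in both texts whose relative frequency in A is at
    least `min_ratio` times that in B, sorted by descending ratio."""
    ratios = [
        (w, freq_a[w] / freq_b[w])
        for w in freq_a
        if w in freq_b and freq_a[w] / freq_b[w] >= min_ratio
    ]
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)


def exclusive_words(freq_a, freq_b):
    """Words that occur in A but not in B, in alphabetical order."""
    return sorted(set(freq_a) - set(freq_b))


# Toy example with invented token lists:
news = relative_frequencies(["crisis", "crisis", "debt", "Europe"])
wiki = relative_frequencies(["Europe", "Europe", "history", "crisis"])
similarity = cosine_similarity(news, wiki)
more_frequent_in_news = distinctive_words(news, wiki)
only_in_news = exclusive_words(news, wiki)
```

With real corpora, very rare words produce unstable ratios, so a minimum frequency threshold or a statistical measure such as log-likelihood is usually preferable to a plain ratio.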