Word-level comparative text analysis
Many research questions in the humanities that relate to specific text resources can be reduced to an analysis of vocabulary. Comparing vocabularies is of particular interest: a scholar may want to compare two text resources of their own, or one text resource with a reference corpus. CLARIN makes such comparative analyses easy to perform with the resources and web tools it provides. The following guide demonstrates this with a simple example. The showcase covers the discovery and selection of resources, their processing, and finally their analysis. The aim is to show scholars how to answer their own research questions with the help of comparative text analysis within CLARIN.
Especially interesting for
All scholars who compare texts or vocabulary, including:
- Scholars from the historical sciences
- Scholars from the political sciences
- Scholars from all philologies
At least two texts are available.
The vocabulary used is to be compared to find fundamental differences.
Using the CLARIN-D infrastructure, a comparative analysis of the vocabulary of the texts can easily be conducted.
Related CLARIN-D projects:
- DiaCollo: collocation analysis in diachronic perspective
- Parallel search using collections in distributed locations
A Short Guide to comparative analysis of vocabulary
Search for Text Resources and Processing:
Select text resources for analysis
Entry point for resource search: VLO (Virtual Language Observatory), CLARIN's search engine for language resources.
- Enter "English Newspaper" in the search field
Refinement: under "Resource Type", select "Written Corpus".
Example selection: English Newspaper corpus from 2012 containing 3 million sentences
Thematic restriction of the resource
Browse the content of the text resource: Click on "Plain text search via Federated Content Search".
- Example search: "Europe"
- Example selection: Display 250 hits
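For readers who prefer scripting, the same kind of search can be sketched as a URL. CLARIN's Federated Content Search is built on the SRU protocol, so a request can in principle be expressed with standard SRU parameters. The endpoint address below is a placeholder, not a real CLARIN service, and whether a given endpoint returns 250 records in a single response is an assumption.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the SRU base URL of a real FCS endpoint.
FCS_ENDPOINT = "https://example.org/fcs"


def build_fcs_query(term, max_hits=250, endpoint=FCS_ENDPOINT):
    """Return an SRU searchRetrieve URL asking for at most `max_hits` records."""
    params = {
        "operation": "searchRetrieve",  # standard SRU operation name
        "version": "1.2",               # SRU protocol version
        "query": term,                  # simple CQL term query
        "maximumRecords": max_hits,     # upper bound on returned hits
    }
    return f"{endpoint}?{urlencode(params)}"


url = build_fcs_query("Europe")
print(url)
```

The function only builds the request URL; sending it and parsing the XML response are left out, because the response layout depends on the endpoint.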
Processing of the text resource
Processing of the output by applying WebLicht to the search results:
- Click on "View"
- Click on "Use WebLicht"
- Click on "Send To WebLicht"
- Log in to WebLicht
- A list of European academic research institutions appears.
- Search for your research institution. If you have no AAI-enabled account at your own institution, select 'clarin.eu website account'.
- You will see a login page of your research institution.
- Log in with your credentials; usually this is your university account.
- You will see the WebLicht interface.
- Check the data to be used: In the "Upload" section you'll see the file name of the data provided. Click "OK".
- Analyze the vocabulary of the texts by double-clicking the following web services:
- Tokenization: IMS:Tokenizer (Stuttgart)
- POS-Tagging: SfS: POS Tagger - OpenNLP (Tübingen)
- Click on "Run Tools"
- Save the results ("Save Result")
- Go to the last web service in the list at the bottom
- You will see four icons below the line on the right side
- Click on the arrow pointing down to download the result
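The downloaded result is a TCF (Text Corpus Format) file, which is XML. As a minimal sketch, the tokens and their part-of-speech tags can be read with Python's standard library. The namespace, element names, and attribute names below follow TCF 0.4 as commonly documented and should be checked against the actual file. The embedded sample document and its `tagset` value are made up for illustration, and `read_tcf` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of the TextCorpus layer (assumed from TCF 0.4; verify in your file).
NS = "{http://www.dspin.de/data/textcorpus}"

# Made-up miniature TCF document standing in for the WebLicht download.
SAMPLE = """<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="en">
    <tokens>
      <token ID="t1">Europe</token>
      <token ID="t2">faces</token>
      <token ID="t3">debt</token>
    </tokens>
    <POStags tagset="penn">
      <tag tokenIDs="t1">NNP</tag>
      <tag tokenIDs="t2">VBZ</tag>
      <tag tokenIDs="t3">NN</tag>
    </POStags>
  </TextCorpus>
</D-Spin>"""


def read_tcf(xml_text):
    """Return a list of (word, pos) pairs in document order; pos may be None."""
    root = ET.fromstring(xml_text)
    # Token layer: map token ID to surface form (dicts keep insertion order).
    tokens = {tok.get("ID"): tok.text for tok in root.iter(f"{NS}token")}
    # POS layer: each tag points back to one or more token IDs.
    pos = {}
    for layer in root.iter(f"{NS}POStags"):
        for tag in layer.iter(f"{NS}tag"):
            for token_id in (tag.get("tokenIDs") or "").split():
                pos[token_id] = tag.text
    return [(word, pos.get(token_id)) for token_id, word in tokens.items()]


tagged = read_tcf(SAMPLE)
print(tagged)
```

For a real download, the file content would be read from disk and passed to `read_tcf` in place of `SAMPLE`.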
The actual analysis of the data is conducted with the help of the web application CorpusDiff. The application allows for the comparison of the vocabulary of two or more text resources.
Import the resource
Click on "Upload Own Corpus"
- Load the file that was previously created by WebLicht.
- Click on "Select a file on your computer or drop it here"
- Select the previously created file for upload
- Alternative: Depending on the browser and operating system, you can also drag a file from your file manager into this area
- Click on "Upload"
- Make sure that TCF (Text Corpus Format), the output format of WebLicht, is selected as "File Type".
- If you want to skip the preprocessing and use a sample file for the comparative analysis, you can use this file
Conduct your own analysis
- Select the imported file
- Choose a reference corpus (for example, a news corpus or a Wikipedia corpus of the same year)
- Optionally, additional or different corpora can be selected; they are then compared pairwise.
- Enter a "job title" for your analysis
- Press the "Compute" button.
Under "Job Selection", click on the completed analysis.
The matrix shows the pairwise similarities between the text resources, with values between 0 (dissimilar corpora) and 1 (identical corpora).
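To give an intuition for a similarity value between 0 and 1, here is a minimal sketch that uses cosine similarity over word frequencies. This guide does not state which measure CorpusDiff actually uses, so cosine similarity is an assumption chosen only because it has the same range. The function name and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two word-frequency vectors, in the range 0..1."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both corpora contribute to the dot product.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty corpus is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Toy corpora (made up) as lists of lower-cased tokens.
news = ["europe", "debt", "crisis", "europe"]
wiki = ["europe", "river", "history", "city"]
other = ["mountain", "lake"]

same = cosine_similarity(news, news)       # identical corpora
partial = cosine_similarity(news, wiki)    # some shared vocabulary
disjoint = cosine_similarity(news, other)  # no shared vocabulary
```

Identical corpora score (up to rounding) 1, corpora without any shared word score 0, and partially overlapping corpora fall in between, which matches how the matrix values are read.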
Click a field of the matrix to see further results that describe the different uses of vocabulary in the texts.
- Lists of words that occur relatively much more frequently in one of the two text resources
- Lists of vocabulary that occurs in only one of the texts
- Restrict the displayed results to individual parts of speech, for example nouns or proper nouns
- Example: When comparing the Europe-related texts with the contents of Wikipedia, words such as crisis, debt or fears stand out, which reflects the thematic focus of the previously generated text resource.
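The kinds of lists described above can be approximated with a few lines of Python. This is a sketch under stated assumptions, not CorpusDiff's actual computation: "much more frequently" is modeled as a ratio of relative frequencies above a threshold, the part-of-speech restriction is a simple tag-prefix match, and the function name, the threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def compare_vocab(tagged_a, tagged_b, pos_prefix=None, min_ratio=2.0):
    """Compare two lists of (word, pos) pairs.

    Returns (overrepresented_in_a, only_in_a, only_in_b) as sorted word lists.
    """
    def counts(tagged):
        # Optionally keep only words whose POS tag starts with pos_prefix.
        return Counter(
            word.lower()
            for word, pos in tagged
            if pos_prefix is None or (pos or "").startswith(pos_prefix)
        )

    freq_a, freq_b = counts(tagged_a), counts(tagged_b)
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())

    def ratio(word):
        # Relative frequency in A divided by relative frequency in B.
        return (freq_a[word] * total_b) / (freq_b[word] * total_a)

    shared = freq_a.keys() & freq_b.keys()
    overrepresented = sorted(
        (w for w in shared if ratio(w) >= min_ratio),
        key=lambda w: (-ratio(w), w),
    )
    only_a = sorted(freq_a.keys() - freq_b.keys())
    only_b = sorted(freq_b.keys() - freq_a.keys())
    return overrepresented, only_a, only_b


# Toy data (made up): Europe-related news versus encyclopedia text.
europe_news = [("crisis", "NN"), ("crisis", "NN"), ("debt", "NN"),
               ("Europe", "NNP"), ("fears", "NNS"), ("grows", "VBZ")]
encyclopedia = [("crisis", "NN"), ("city", "NN"), ("river", "NN"),
                ("Europe", "NNP"), ("history", "NN"), ("is", "VBZ")]

# Restrict to tags starting with "NN" (nouns and proper nouns in Penn-style tagsets).
over, only_news, only_ency = compare_vocab(europe_news, encyclopedia, pos_prefix="NN")
print(over, only_news, only_ency)
```

With real corpora, a minimum absolute frequency would usually be added so that rare words do not dominate the ratio list.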
At this point, other analyses are also possible. For example, texts by different authors, texts from different sources (news and Wikipedia), or corpora of individual years could be compared to identify typical vocabulary or topics.