
Digital archive construction and modern text analysis: a two-way street

13 October 2020

By Shohreh Haddadan

With the aim of constructing an archive of the letters written throughout his entire life by Alcide De Gasperi, journalist, politician and the first prime minister of the Italian Republic, the Digital Humanities group at Fondazione Bruno Kessler (FBK) started the project Epistolario De Gasperi in 2016, coordinated by the “Fondazione Trentina Alcide De Gasperi”. Figure 1 shows a screenshot of the project’s landing page.

Figure 1: Landing page of the EPISTOLARIO website, a digital archive of Alcide De Gasperi’s letters throughout his life.

Dr. Sara Tonelli, the head of the Digital Humanities group at FBK, gave an online presentation as part of the lecture series of the Digital History and Hermeneutics doctoral school. She walked us through the process of building the digital archive: the insertion and assessment of metadata, transcription and annotation, quality control using text mining methods, and the development of a web publication for the created archive. Figure 2, taken from Dr. Tonelli’s slides, illustrates this process.

In this blog post, I briefly summarize the lecture, accompanied by a reflection on the trading zone between NLP and digital humanities.

I structure the summary along the title of the lecture: Collecting, Analyzing and Visualizing Documents in the Political Domain.

Collecting: As Dr. Tonelli pointed out, a tool for collecting the digital archive requires an offline client version for users profiled as transcribers, and a server version that aggregates the transcriptions into a unified database which editors and supervisors can access. Such a tool needs to embed functionality for compiling standard metadata as well as additional user-defined metadata fields that extend the software’s usability. For this project, a tool called LETTERE was developed. It frees the project members on both sides from the burden of cloud sharing services or email chains, and thereby grants the collection process integrity and security.1
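
To make the idea concrete, a letter record that such a client-server workflow might synchronize could look roughly like the sketch below; all field names are illustrative assumptions on my part, not the actual LETTERE schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LetterRecord:
    """Illustrative letter record; hypothetical fields, not the LETTERE schema."""
    letter_id: str                  # unique identifier assigned by the server
    sender: str                     # e.g. "Alcide De Gasperi"
    receiver: str                   # e.g. "Luigi Sturzo"
    date: Optional[str] = None      # ISO date if known, e.g. "1921-05-04"
    place: Optional[str] = None     # place of writing, if recorded
    transcription: str = ""         # text produced offline by the transcriber
    status: str = "draft"           # workflow state: draft / submitted / approved
    custom: dict = field(default_factory=dict)  # user-defined metadata fields

# A transcriber fills records like this offline; the client later pushes them
# to the server database, where editors and supervisors review them.
record = LetterRecord(letter_id="EP-0001", sender="Alcide De Gasperi",
                      receiver="Luigi Sturzo", date="1921-05-04",
                      transcription="Caro don Sturzo, ...")
```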

Figure 2: Work process of constructing a digital archive, as presented by Dr. Tonelli.

Analyzing: The idea of distant reading was introduced as the size of available digital archives increased dramatically. Roy Rosenzweig describes the need for the shift from close reading to distant reading (accompanied by close reading) as follows:

The injunction of traditional historians to look at “everything” cannot survive in a digital era in which “everything” has survived.2

Figure 3: Top: frequency of De Gasperi’s letters over his lifespan, from the EPISTOLARIO project. Bottom: frequency of De Gasperi’s public documents over his lifespan, from the ALCIDE project.

Distant reading methods can be applied not only to the content of the texts but also to the metadata available for the collection. For instance, a temporal analysis of De Gasperi’s letters based on the date metadata reflects his personal development: a minor journalist in his youth, he became one of the most prominent politicians in Italy in later years [Figure 3].
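
To sketch how simple this kind of metadata-level distant reading can be, suppose the letter metadata were exported to a CSV file with a date column (the file name and column are hypothetical):

```python
import pandas as pd

# Hypothetical export of the archive's metadata: one row per letter.
letters = pd.read_csv("epistolario_metadata.csv", parse_dates=["date"])

# Letters per year: the temporal profile that the top panel of Figure 3 shows.
per_year = letters["date"].dt.year.value_counts().sort_index()
print(per_year)
# per_year.plot(kind="bar")  # simple frequency plot over the life span
```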

Figure 4: Sankey diagram of the senders and receivers in the De Gasperi letters archive, built from the metadata.

Visualizing the sender and receiver metadata fields shows that the largest share of his letters was sent to the Italian Roman Catholic priest and prominent politician Luigi Sturzo [Figure 4].
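
A sender-receiver flow of this kind can be drafted with the plotly library; the sketch below uses invented placeholder counts, not figures from the archive.

```python
import plotly.graph_objects as go

# Placeholder counts (invented); in practice they would be aggregated from
# the sender/receiver metadata fields of the archive.
labels = ["Alcide De Gasperi", "Luigi Sturzo", "Other correspondents"]
fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[0, 0],     # letters flowing out of De Gasperi ...
        target=[1, 2],     # ... to Sturzo and to all other correspondents
        value=[120, 300],  # invented letter counts
    ),
))
fig.show()
```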

Furthermore, content analysis of the data can benefit from NLP methods. Retrieval of relevant documents, context extraction, named entity recognition and other natural language processing algorithms assist historians in creating digital archives with convenient access and tools for distant reading. The ALCIDE (Analysis of Language and Content In a Digital Environment) project contains 2,762 public documents written by Alcide De Gasperi between 1901 and 1954, amounting to around 3 million words. Figure 3 shows the temporal frequency of the documents collected in ALCIDE; compared with the Epistolario metadata, which covers only De Gasperi’s letters, it matches the frequency patterns across the active years of his life. Figure 5 depicts the architecture of ALCIDE as a text analysis system that contributes to the digital archive of political documents by leveraging different NLP methods.

It is quite interesting how linguistic analysis with even the simplest automatic NLP algorithms can contribute to a meaningful analysis of huge amounts of data. An example given by Dr. Tonelli was the detection of future tense in De Gasperi’s documents using part-of-speech tagging, which shows a significant increase when comparing the periods before and after 1943. This can be interpreted as a shift towards future-oriented rhetoric, building hope and trust in his plans to rebuild Italy and reorganise the Christian Democratic party at the end of World War II.
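
A minimal sketch of such a future-tense measurement, using spaCy’s Italian model as a stand-in (this reconstructs the general technique, not the project’s actual pipeline):

```python
import spacy

# Requires: python -m spacy download it_core_news_sm
nlp = spacy.load("it_core_news_sm")

def future_tense_ratio(text: str) -> float:
    """Share of verb tokens carrying the morphological feature Tense=Fut."""
    doc = nlp(text)
    verbs = [t for t in doc if t.pos_ in ("VERB", "AUX")]
    if not verbs:
        return 0.0
    future = [t for t in verbs if "Fut" in t.morph.get("Tense")]
    return len(future) / len(verbs)

# Tracking this ratio per year across the corpus would surface the kind of
# rhetorical shift around 1943 described above.
print(future_tense_ratio("Ricostruiremo l'Italia e riorganizzeremo il partito."))
```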

Figure 5: Diagram illustrating the development of the ALCIDE project, from Dr. Tonelli’s slides.

Visualizing: Both of the aforementioned projects benefit from visualization tools that enable distant reading of the archives. Several of these visualizations have been shown in the figures above; further visualizations are provided by the extraction of named entities, geo-tagging and key concepts using NLP methods.
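
As a rough sketch of the entity extraction behind such views, again using spaCy’s Italian model as a stand-in for the project’s own tools (the example sentence is invented):

```python
import spacy

nlp = spacy.load("it_core_news_sm")
doc = nlp("Alcide De Gasperi incontrò Luigi Sturzo a Roma nel 1921.")

# Persons and places extracted this way can drive geo-tagging and
# entity-frequency visualizations over the whole archive.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. PER and LOC labels
```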

Trading Zone3 between NLP and Digital Humanities:

Leveraging different NLP methods for text analysis and automatic annotation of textual archives contributes to research within the digital humanities by helping to construct annotated digital archives and to interpret huge amounts of textual data. In this sense, natural language processing is used as a tool from which digital humanities benefits.

However, it is important to note that this is not a one-way street. There are two main ways in which digital humanities studies, and in this case archive digitization, can contribute to NLP:

Firstly, the introduction of novel tasks provides new ground for NLP researchers. In connection with this project, the researchers organized a shared task called DaDoEval2020 on the data and metadata, in which participants propose automated systems for dating documents from the transcribed archive. DaDoEval2020 thus introduces a task to the NLP community that builds on the capacities of the digitized archive. The shared task comprises two subtasks, in which participants apply NLP-based methodologies to the data to either classify a document into a period of time or assign an exact year to it.
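
To illustrate what a simple baseline for the period-classification subtask could look like, here is a minimal scikit-learn sketch; the training snippets and period labels are invented placeholders, and actual DaDoEval systems were certainly more sophisticated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented placeholder data: document texts paired with period labels.
train_texts = [
    "discorso sulla questione sociale nel Trentino",
    "la ricostruzione della nazione dopo la guerra",
]
train_periods = ["1901-1918", "1943-1954"]  # hypothetical period labels

# TF-IDF features over word unigrams/bigrams, plus a linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(train_texts, train_periods)

# Predict the period of an unseen document from the transcribed archive.
print(model.predict(["un appello per la ricostruzione dell'Italia"]))
```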

Secondly, as the field of NLP, like many other subfields of machine learning, benefits from deep learning methods that need huge amounts of data (annotated or not), digitization adds to the pool of available data in different languages. As Dr. Tonelli mentioned in the Q&A, in response to a question about the automatic transcription of the letters yet to be added to the archive, OCR algorithms can be trained on the previously manually transcribed data. Archives containing huge amounts of manually tagged and transcribed textual data thus provide grounds on which NLP algorithms can be improved and extended.

Critical reflection on the use of NLP methods in Digital Humanities:

One of the remarks made by Dr. Tonelli in the lecture concerned the trade-off between the accuracy and the transparency of the algorithms used for text analysis of the digital archive. Although the state of the art in many NLP tasks is already dominated by deep learning methods (or, more generally, statistical machine learning methods), these lack the transparency that rule-based methods provide. Much recent research therefore focuses on the explainability and transparency of such methods. In practical cases where transparency is vital, it is important to balance the accuracy we expect against the transparency of the algorithms used. In this project, for instance, an unsupervised topic modeling algorithm was replaced with a rule-based key-concept extraction method.
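
To show how transparent such a rule-based alternative can be, consider a plain frequency count over noun lemmas; this is my illustration of the general idea, not the project’s actual extraction method.

```python
import spacy
from collections import Counter

nlp = spacy.load("it_core_news_sm")  # spaCy Italian model as a stand-in

def key_concepts(texts, top_n=10):
    """Rank noun lemmas by frequency: every step is inspectable,
    unlike the internals of a statistical topic model."""
    counts = Counter()
    for doc in nlp.pipe(texts):
        counts.update(tok.lemma_.lower() for tok in doc
                      if tok.pos_ in ("NOUN", "PROPN") and not tok.is_stop)
    return counts.most_common(top_n)

print(key_concepts(["La ricostruzione dell'Italia è il nostro compito."]))
```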

Edited by Juliane Tatarinov

Notes
  1. Moretti, G., Sprugnoli, R. & Tonelli, S. (2018). LETTERE: LETters Transcription Environment for REsearch. In Proceedings of the 7th Annual Conference of the Italian Association for Digital Humanities and Cultures (AIUCD).

  2. Rosenzweig, R. (2003). Scarcity or Abundance? Preserving the Past in a Digital Era. The American Historical Review, 108(3); see furthermore Max Kemman, Distant Reading, presentation at the University of Luxembourg, 2018.

  3. Kemman, M. (2016). Trading Zones of Digital History.
