With the aim of building an archive of the letters written by Alcide De Gasperi, journalist, politician and first prime minister of the Italian Republic, throughout his entire life, the digital humanities group at Fondazione Bruno Kessler (FBK) started the project Epistolario De Gasperi, coordinated by the “Fondazione Trentina Alcide De Gasperi” since 2016. Figure 1 shows a screenshot of the project’s landing page.
Dr Sara Tonelli, head of the Digital Humanities group at FBK, gave an online presentation as part of the lecture series of the Digital Humanities and Hermeneutics doctoral school. She walked us through the process of building the digital archive: inserting and assessing metadata, transcription and annotation, quality control using text mining methods, and the development of a web publication for the resulting archive. Figure 2, taken from Dr Tonelli’s slides, illustrates this process.
In this blog post, I briefly summarize the lecture and reflect on the trading zone between NLP and digital humanities.
I structure the summary along the three parts of the lecture’s title: Collecting, Analyzing and Visualizing Documents in Political Domain.
Collecting: As pointed out by Dr Tonelli, a tool for collecting a digital archive requires an offline client version for users profiled as transcribers and a server version that aggregates the transcriptions in a unified database for editors and supervisors to access. Such a tool needs to support the compilation of standard metadata as well as further user-defined metadata fields, which extend the usability of the software. A tool called LETTERE was developed for this project. It frees project members on both sides from the burden of cloud sharing services or email chains, and thereby gives the collection process integrity and security. [1]
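To make the client/server split more concrete, here is a minimal sketch of what a letter record and the server-side aggregation step could look like. It is not LETTERE’s actual data model: the class `LetterRecord`, its field names, the function `aggregate` and all values are hypothetical and only illustrate the idea of standard metadata plus user-defined fields being merged into one database.

```python
from dataclasses import dataclass, field


@dataclass
class LetterRecord:
    # Standard metadata fields (hypothetical names, not LETTERE's real schema)
    letter_id: str
    sender: str
    receiver: str
    date: str  # ISO format, e.g. "1946-05-12"
    transcription: str = ""
    # Free-form, user-defined metadata fields
    custom: dict = field(default_factory=dict)


def aggregate(batches):
    """Merge batches uploaded by transcribers into one store keyed by letter id.

    If the same letter appears in several batches, the later batch wins.
    """
    database = {}
    for batch in batches:
        for record in batch:
            database[record.letter_id] = record
    return database


# Two transcribers upload their offline work; all values are placeholders.
batch_a = [LetterRecord("L001", "Sender A", "Receiver B", "1925-03-02", "first draft")]
batch_b = [
    LetterRecord("L001", "Sender A", "Receiver B", "1925-03-02", "revised text"),
    LetterRecord("L002", "Sender A", "Receiver C", "1946-05-12",
                 custom={"archive": "placeholder"}),
]
database = aggregate([batch_a, batch_b])
print(len(database), database["L001"].transcription)
```

The "later batch wins" rule is a deliberate simplification; a real editorial workflow would more likely keep both versions for a supervisor to review.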
Analyzing: The idea of distant reading was introduced as the size of the available digital archives increased dramatically. Roy Rosenzweig describes the need for a shift from close reading to distant reading (accompanied by close reading) as follows:
The injunction of traditional historians to look at “everything” cannot survive in a digital era in which “everything” has survived. [2]
Distant reading methods can be applied not only to the content of the texts but also to the metadata that is available for the collection. For instance, a temporal analysis of De Gasperi’s letters, based on the dates recorded in the metadata, reflects his personal development: a minor journalist in his youth, he became one of the most prominent politicians of Italy in later years [Figure 3].
The visualization of the metadata fields for sender and receiver shows that the largest number of his letters were sent to the Italian Roman Catholic priest and prominent politician Luigi Sturzo [Figure 4].
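Both metadata analyses boil down to simple counting. The sketch below assumes each letter is available as a record with `date`, `sender` and `receiver` fields; these field names and the placeholder records are my own illustration and are not taken from the Epistolario database.

```python
from collections import Counter

# Placeholder records; names and dates are invented for illustration only.
letters = [
    {"date": "1910-04-01", "sender": "Author", "receiver": "Receiver A"},
    {"date": "1946-02-10", "sender": "Author", "receiver": "Receiver B"},
    {"date": "1946-07-21", "sender": "Author", "receiver": "Receiver B"},
    {"date": "1946-09-03", "sender": "Receiver B", "receiver": "Author"},
]


def letters_per_year(records):
    """Count letters per year, taking the year from an ISO date string."""
    return Counter(int(record["date"][:4]) for record in records)


def top_receivers(records, sender, n=3):
    """Return the n most frequent receivers of letters written by `sender`."""
    counts = Counter(r["receiver"] for r in records if r["sender"] == sender)
    return counts.most_common(n)


print(sorted(letters_per_year(letters).items()))
print(top_receivers(letters, "Author", n=1))
```

Plotting the per-year counts as a bar chart would give a timeline of the kind described above, and the receiver counts correspond to the sender/receiver visualization.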
Furthermore, context analysis of the data can benefit from NLP methods. Retrieval of relevant documents, context extraction, named entity identification and other natural language processing algorithms help historians create digital archives with convenient access and tools for distant reading. The ALCIDE (Analysis of Language and Content In a Digital Environment) project contains 2,762 public documents written by Alcide De Gasperi, amounting to around 3 million words and published between 1901 and 1954. Figure 3 shows the frequency of the documents by De Gasperi collected in the ALCIDE project, based on their metadata; compared with the Epistolario metadata, which covers only De Gasperi’s letters, it shows a matching pattern of frequency across the active years of his life. Figure 4 depicts the architecture of ALCIDE as a text analysis system that contributes to the digital archive of political documents by leveraging different NLP methods.
It is quite interesting how linguistic analysis with even the simplest automatic NLP algorithms can contribute to a meaningful analysis of huge amounts of data. An example given by Dr Tonelli was the detection of the future tense in De Gasperi’s written documents using part-of-speech tagging. Comparing the periods before and after 1943 shows a significant increase, which can be interpreted as a shift towards a future-oriented rhetoric meant to build up hope and trust in his plans to rebuild Italy and reorganise the Christian Democratic party at the end of World War II.
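The future-tense comparison can be sketched in a few lines once the texts have been tagged. The sketch assumes that a tagger has already produced, for every token, a part-of-speech tag and morphological features in the Universal Dependencies style (where the future tense is marked as `Tense=Fut`); the lecture did not specify which tagger or tagset the ALCIDE team used. The two toy documents are invented, and counting the pivot year 1943 itself as "after" is my own choice.

```python
def future_share(docs):
    """Share of verb tokens (VERB or AUX) that carry the feature Tense=Fut."""
    verbs = 0
    future = 0
    for doc in docs:
        for _form, upos, feats in doc["tokens"]:
            if upos in {"VERB", "AUX"}:
                verbs += 1
                if "Tense=Fut" in feats.split("|"):
                    future += 1
    return future / verbs if verbs else 0.0


def compare_periods(docs, pivot_year=1943):
    """Return the future-tense share before and from the pivot year onwards."""
    before = [d for d in docs if d["year"] < pivot_year]
    after = [d for d in docs if d["year"] >= pivot_year]
    return future_share(before), future_share(after)


# Invented, pre-tagged toy documents: (form, UPOS tag, morphological features)
docs = [
    {"year": 1930, "tokens": [
        ("scrivo", "VERB", "Mood=Ind|Tense=Pres|VerbForm=Fin"),
        ("lettera", "NOUN", "Gender=Fem|Number=Sing"),
    ]},
    {"year": 1945, "tokens": [
        ("ricostruiremo", "VERB", "Mood=Ind|Tense=Fut|VerbForm=Fin"),
        ("abbiamo", "AUX", "Mood=Ind|Tense=Pres|VerbForm=Fin"),
    ]},
]

before, after = compare_periods(docs)
print(before, after)
```

Normalizing by the number of verbs rather than reporting raw counts matters here, because the number of documents per year varies a lot over De Gasperi’s life.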
Visualizing: Both of the aforementioned projects benefit from visualization tools that enable distant reading of the archives. Several of these visualizations are shown in the figures. Further visualizations are built on named entities, geo-tags and key concepts extracted with NLP methods.
Trading Zone [3] between NLP and Digital Humanities:
Leveraging different NLP methods for text analysis and automatic annotation of textual archives contributes to research within the digital humanities by helping to construct annotated digital archives and to interpret huge amounts of textual data. In this sense, natural language processing is used as a tool from which digital humanities benefits.
However, it is important to note that this is not a one-way street. There are two main ways in which digital humanities studies, and in this case archive digitization, can contribute to NLP studies:
Firstly, digitized archives introduce novel tasks for NLP researchers. In connection with this project, the researchers have organized a shared task called DaDoEval2020, which works on both the data and the metadata level and invites participants to propose systems that automatically date documents from the transcribed archive. DaDoEval2020 thus introduces a task to the NLP community that builds on the capacities of the digitized archive. The shared task is organized into two different subtasks, in which participants apply NLP-based methods to the data either to classify the documents by period of time or to assign an exact year to each document.
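The difference between the two kinds of subtask can be illustrated with a deliberately naive baseline: predict the label of the training document that shares the most words with the text to be dated, where the label is either a period or an exact year. Everything in this sketch is an assumption made for illustration: the period boundaries in `PERIODS`, the English toy texts and the word-overlap method are not the shared task’s actual definitions, data or baseline.

```python
from collections import Counter

# Hypothetical period boundaries (start year, label); not DaDoEval2020's own.
PERIODS = [(1901, "early"), (1919, "middle"), (1943, "late")]


def year_to_period(year):
    """Map a year to the last period whose start year it reaches."""
    label = PERIODS[0][1]  # years before the first boundary fall back to it
    for start, name in PERIODS:
        if year >= start:
            label = name
    return label


def overlap(text_a, text_b):
    """Number of word tokens shared by two texts (multiset intersection)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    return sum((counts_a & counts_b).values())


def predict(text, training, label_of):
    """Return the label of the training document most similar to `text`."""
    best = max(training, key=lambda doc: overlap(text, doc["text"]))
    return label_of(best)


# Invented toy training set.
training = [
    {"text": "newspaper article about regional elections", "year": 1910},
    {"text": "speech on rebuilding the country after the war", "year": 1946},
]
query = "notes on rebuilding after the war"

coarse = predict(query, training, lambda doc: year_to_period(doc["year"]))
fine = predict(query, training, lambda doc: doc["year"])
print(coarse, fine)
```

The same predictor serves both settings; only the label function changes, which is why period classification is the easier target: a wrong year can still fall into the right period.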
Secondly, since NLP, like many other sub-fields of machine learning, benefits from deep learning methods that need huge amounts of data (annotated or not), digitization adds to the pool of available data in different languages. In the Q&A, Dr Tonelli answered a question about the automatic transcription of the letters that are yet to be added to the archive: OCR algorithms can be trained on the data that has already been transcribed manually. Archives containing huge amounts of manually tagged and transcribed textual data provide a basis on which NLP algorithms can be improved and extended.
Critical reflection on the use of NLP methods in Digital Humanities:
One of the points Dr Tonelli made in the lecture is the trade-off between the accuracy and the transparency of the algorithms used for text analysis of the digital archive. Although the state of the art in many NLP tasks is already dominated by deep learning methods (or, more generally, statistical machine learning methods), these lack the transparency that rule-based methods provide. Much recent research focuses on the explainability and transparency of such new methods. In practical cases in which transparency is vital, it is important to balance the accuracy we expect against the transparency of the algorithms used. In this project, for instance, an unsupervised topic modeling algorithm was replaced with a rule-based key-concept extraction method.
Edited by Juliane Tatarinov
[1] Moretti, G., Sprugnoli, R. & Tonelli, S. (2018). LETTERE: LETters Transcription environment for REsearch. 7th Annual Conference of the Italian Association for Digital Humanities and Cultures (AIUCD).
[2] Rosenzweig, R. (2003). Scarcity or Abundance? Preserving the Past in a Digital Era. The American Historical Review, Volume 108, Issue 3, 1; see furthermore Max Kemman, Distant Reading, presentation at the University of Luxembourg, 2018.
[3] Kemman, M. (2016). Trading Zones of Digital History.