Text as data at DHBenelux 2019 – Doctoral Training Unit “Digital History & Hermeneutics”

DHBenelux 2019 was held at the University of Liège (Belgium), and it was the first time I attended this conference. As someone working in computational text analysis for historical and literary studies, I was excited to see quite a few text-related projects.

Why do humanities researchers turn to text as social and cultural data? Edward Said argued that “texts … are always enmeshed in circumstance, time, place, and society – in short, they are in the world, and hence worldly”.¹ Therefore, by studying texts we can gain insight into the culture and worldview of a particular group of people, a society, or an epoch.

Terminology

Recent advances in Natural Language Processing (NLP) include improvements in machine translation, text generation, question-answering tools and speech recognition. How is NLP related to computational linguistics (CompLing), which studies how humans understand and produce language? Sometimes these terms are used interchangeably, but NLP is a more pragmatic discipline aiming to actually build language-related systems. There is also corpus linguistics, which studies large collections of texts – as a rule, using NLP tools. What about text mining? Just as data mining involves extracting useful information from data, text mining means searching for meaningful patterns in text. CompLing and NLP try to understand how human language works and reproduce human brain functions, while text mining explores the data at hand. To further complicate things, 2007 saw the appearance of the term cultural analytics, referring to the mining of large sets of ‘culturally-relevant’ data – text, images, films, video games, books, any other print publications, artworks, or other media. Fast forward to 2010, and we are introduced to culturomics – “a form of computational lexicology that studies human behavior and cultural trends through the quantitative analysis of digitised texts”.² Researchers use computational tools to explore the ever-growing amount of born-digital and digitised textual materials, to answer existing humanities research questions and to pose new, often unexpected ones. At DHBenelux 2019, a wide range of text-related presentations discussed everything from NLP challenges and solutions to text mining for historical, social and cultural analysis.

Data pre-processing, storage, and retrieval

A few projects concerning textual data preprocessing, storage and retrieval were presented. Facebook posts, tweets and other born-digital data can be used as is (after preprocessing). Often, however, researchers have to digitise printed materials, and before they can start analysing the resulting data they face the first challenge: Optical Character Recognition (OCR) errors add noise to the data, which in turn can distort the results of any analysis, according to the “rubbish in, rubbish out” principle. Modern scanners are very powerful, and the software has certainly improved. Unfortunately, working with historical sources can complicate things. One presentation discussed the challenges of dealing with OCRed books of ordinances from the sixteenth to nineteenth centuries, printed in a gothic and a roman font. To reduce the amount of noise in the data, machine learning (ML) was used to preprocess files with poor OCR quality of the hand-carved letters, and NLP tools such as Named Entity Recognition (NER) were used to extract dates, titles and names of persons.
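To make the idea of OCR noise more concrete, here is a minimal sketch of one rough proxy for it: the share of tokens that do not appear in a word list. This is not the method used in the presented project; the function name `oov_rate`, the toy lexicon and the sample sentences are all invented for illustration, and a real historical corpus would need a period-appropriate lexicon that tolerates spelling variation.

```python
import re

def oov_rate(text, lexicon):
    """Return the share of alphabetic tokens that are absent from the lexicon."""
    # [^\W\d_]+ matches runs of Unicode letters only (no digits, no underscore).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in lexicon)
    return unknown / len(tokens)

# Hypothetical toy lexicon and sentences, for illustration only.
lexicon = {"the", "ordinance", "was", "issued", "in", "city"}
clean = "The ordinance was issued in the city"
noisy = "Tbe ordinauce was iffued in the citv"

print(f"clean: {oov_rate(clean, lexicon):.2f}")
print(f"noisy: {oov_rate(noisy, lexicon):.2f}")
```

A score like this can only flag pages that probably need attention; it cannot tell a genuine OCR error from a rare or archaic word.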

Another preprocessing issue discussed at the conference was the importance of considering existing structural elements, such as division into chapters, tables of contents, indexes and visual cues (bold or italic font, capitals etc.), to avoid missing implicit knowledge about the text. The researchers presented a case study of charter books in which place names from the index were automatically linked to the charters that mention them, and therefore to the dates of those charters. This made it possible to link 17,000 historical place names to evidence of when these names existed.
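The linking step described above can be pictured as a simple join between two lookups: index entries pointing to charter numbers, and charter numbers pointing to dates. The sketch below assumes that both have already been extracted into dictionaries; the data, the charter numbers and the function name `attestations` are hypothetical and are not taken from the presented project.

```python
# Hypothetical toy data: charter number -> year of the charter.
charters = {12: 1284, 47: 1310, 63: 1352}

# Hypothetical index: place name -> charter numbers listed in the index.
index = {"Huy": [12, 47], "Namur": [47, 63], "Dinant": [63, 99]}

def attestations(index, charters):
    """Map each place name to the sorted years of the charters that mention it."""
    linked = {}
    for place, numbers in index.items():
        # Skip index references to charters we have no date for (here: 99).
        linked[place] = sorted({charters[n] for n in numbers if n in charters})
    return linked

linked = attestations(index, charters)
for place, years in linked.items():
    print(place, years)
```

In practice most of the work lies in the extraction that comes before this join: parsing the index reliably and normalising spelling variants of the same place name.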

Peculiarities of certain languages also pose challenges. In historical Hebrew and Aramaic texts, dates are expressed through Hebrew letters, not numerals. How do you teach an algorithm to differentiate between the two? Luxembourgish, meanwhile, is often characterised as one of the most under-described, under-resourced and under-utilised languages of Europe. This, however, did not stop the STRIPS project team from building a system for the retrieval, storage and analysis of information extracted from the RTL text collection.
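To illustrate the Hebrew date problem mentioned above, here is a minimal sketch of one possible heuristic, not the approach taken by the presenters. It treats a token as a candidate numeral when it contains a geresh or gershayim mark, and converts it by summing the conventional letter values. The function names are invented, and giving final letter forms the same value as their regular forms is an assumption that should be checked against the sources at hand.

```python
# Letter values in Unicode order, from alef (U+05D0) to tav (U+05EA).
# Assumption: final forms carry the same value as their regular forms.
_VALUES = [
    1, 2, 3, 4, 5, 6, 7, 8, 9,        # alef .. tet
    10, 20, 20, 30, 40, 40, 50, 50,   # yod, final kaf, kaf, lamed, final mem, mem, final nun, nun
    60, 70, 80, 80, 90, 90,           # samekh, ayin, final pe, pe, final tsadi, tsadi
    100, 200, 300, 400,               # qof, resh, shin, tav
]
LETTER_VALUE = {chr(0x05D0 + i): value for i, value in enumerate(_VALUES)}

# Geresh, gershayim, and the ASCII quotes often typed in their place.
MARKS = {"\u05F3", "\u05F4", "'", '"'}

def looks_like_numeral(token):
    """Heuristic: the token has a numeral mark and otherwise only Hebrew letters."""
    letters = [ch for ch in token if ch not in MARKS]
    has_mark = any(ch in MARKS for ch in token)
    return has_mark and bool(letters) and all(ch in LETTER_VALUE for ch in letters)

def numeral_value(token):
    """Sum the values of the Hebrew letters, ignoring the marks."""
    return sum(LETTER_VALUE[ch] for ch in token if ch in LETTER_VALUE)

token = "\u05EA\u05E9\u05E2\u05F4\u05D8"   # tav, shin, ayin, gershayim, tet
plain_word = "\u05E9\u05DC\u05D5\u05DD"    # an ordinary word without a mark

print(looks_like_numeral(token), numeral_value(token))
print(looks_like_numeral(plain_word))
```

The heuristic is deliberately crude: abbreviations and acronyms are also written with gershayim, and historical sources do not always include the marks, so context (for example, a preceding word for "year") would be needed to separate real dates from false positives.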

Stylometry and Computational Literary Studies

There were several talks on stylometry and computational literary studies. One examined 1.5 million words of descriptions of eighteenth- and nineteenth-century satirical prints, written between 1930 and 1954 by the historian Mary Dorothy George. The researchers used corpus methods to investigate language patterns in George’s work and claim that she deliberately chose a “clear, neutral, and confident voice”, which helped her readers imagine the prints she described and led them to perceive her as an authority. Another project explored the relationship between an author’s gender and writing style in Dutch suspense novels. The researchers looked at 134 novels and clustered them using ML to see whether the algorithm could identify the author’s gender correctly. The algorithm did classify the books by Suzanne Vermeer, a man writing under a female pseudonym, as non-female works. This finding suggests that language to some extent does reflect an author’s gender. Other digital literary studies presentations included poetry analysis and a project that applied word embedding models to a collection of memoirs, diaries and letters of Indonesian soldiers, with the aim of studying the war crimes perpetrated during the 1945-50 Indonesian War of Independence.
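For readers new to stylometry, the sketch below shows the kind of feature extraction and comparison that usually comes before any clustering: each text is reduced to the relative frequencies of a few function words, and two texts are compared by cosine similarity. The features and algorithm used in the Dutch novels study are not described in this post, so everything here (the word list, the toy English sentences, the function names) is an illustrative assumption.

```python
import math
import re
from collections import Counter

# Hypothetical, very short function-word list; real studies use far more words.
FUNCTION_WORDS = ["the", "and", "of", "i", "you", "not", "she", "he"]

def profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[word] / total for word in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy texts, invented for illustration.
text_a = "The cat and the dog"
text_b = "She did not see the end of the road and he did not care"

print(profile(text_a))
print(round(cosine(profile(text_a), profile(text_b)), 3))
```

A clustering algorithm would then group the novels by these profiles. Whether the resulting groups line up with author gender is exactly the kind of result that needs careful validation.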

Discourse Analysis

Digitised newspapers are commonly used in Digital Humanities research, especially to study a particular discourse (for example, anti-modern discourse in Swiss newspapers, or the relevance of the social sciences and humanities in societal debates) or to explore certain features of the newspaper as a medium (for example, how representative the British Newspaper Corpus is of the general newspaper landscape). Media studies can also benefit from using newspapers as data, as shown by one of the presented projects, which aims to extract and analyse film listings in historical newspapers. A contribution to the history of psychiatry was presented by my DTU colleague Eva Andersen and Lars Wieneke (C2DH); the team used topic modelling to search for potential patterns in the transnational dissemination of ideas within the psychiatric community, as reflected in the historical publications of psychiatric associations.

All in all, the conference was very enjoyable, with informative presentations and friendly, inspiring discussions. Digital text analysis has been a fascinating direction of Digital Humanities for decades, and with the rapid advancement of NLP research and the digitisation of sources, the number and quality of text mining projects will only increase. However, as Gold points out in his most recent interview, it is crucial to reflect critically on the tools and methods we apply in digital humanities research, looking beyond the opportunities offered by technology and into its limitations. Who compiled the corpus used in our research, and why? What is inside the black box of the algorithms we apply? How do we validate the results? No doubt these and related questions will inspire many academic papers, panel discussions and conference presentations, and I hope that the next DHBenelux conference will present an opportunity to dive deeper into the issues of responsible use of technology in text mining in humanities research. Meanwhile, let’s do things with words and inspire other members of the DH community to experiment and share their journeys, as we did at DHBenelux 2019 in Liège.

¹ Said, E. (1983). The World, the Text, and the Critic. Cambridge, Mass.: Harvard University Press.
² Michel, J. B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Google Books Team, … Aiden, E. L. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331(6014), 176–182.
