Mining ethnicity: Discourse-driven topic modelling (DDTM) of immigrant discourses in the United States, 1898-1920

DTU-Seminar conducted by Lorella Viola (DHARPA, C²DH)

27 May 2020 | 14:-15:00 | Webex-Platform

If you want to participate, please send an email to in order to get the Webex-link.

Topic modelling (TM) is a computational, statistical method for discovering patterns and topics in large collections of (unstructured) text (e.g., emails, book chapters, newspaper articles, letters, reports). Its main value lies in finding semantic patterns that would otherwise be difficult to identify. Such patterns may, for instance, help to categorize documents in large archives or to discover the underlying topical structure of big corpora without the need to read each individual document. Although TM is potentially a very efficient distant reading technique, its output can be difficult to interpret.

In this seminar, I’ll present a method that uses the Discourse-Historical Approach (DHA – Reisigl & Wodak 2001), a close reading technique, to refine TM. This combined methodology, “discourse-driven topic modelling” (DDTM), aims to enable researchers to triangulate linguistic, social, and historical data in order to make the topics more interpretable and unlock the full potential of TM. To test the methodology, I investigate public discourses produced by Italian migrants in the United States using a corpus of digitized Italian ethnic newspapers published in the United States between 1898 and 1920 (ChroniclItaly – Viola 2018). The results proved DDTM to be effective in obtaining a relatively quick categorization of the topics discussed in the immigrant press. Moreover, the changing distribution of topics over time revealed how the Italian immigrant community negotiated its sense of connectedness with both the host country and the homeland, and how the changing experience of migration, identity construction, and assimilation was reflected over time in the accounts of the minorities themselves. At the same time, without jeopardizing the analytical depth of the findings, the method proved its value in minimizing the risk of bias in topic identification, since the topics stemmed from the data rather than from preconceived assumptions.
