Do historians use digital tools in the same way? This question is central to this blog post. As historians of medicine who work with digitised medical journals and digital tools, and who share an interest in the development of discourses, concepts and ideas, we took the opportunity to compare our research methods. In her PhD, Eva Andersen (University of Luxembourg) focuses on the exchange of psychiatric knowledge across Europe, while the PhD project of Jolien Gijbels (KU Leuven) investigates the role of religion and ideology in debates between Catholic and liberal gynaecologists and obstetricians in nineteenth-century Belgium. In our research, we both used the text analysis tool AntConc, developed by Laurence Anthony, to analyse the historical development of particular concepts and ideas.1 As our research has much in common in terms of research field and methodology, you might imagine that we have used AntConc in the same way. After all, digital tools are often assumed to be “one-size-fits-all”. As it turns out, however, our research methods in AntConc were quite different.
In general terms, AntConc is a useful tool for research into the historical development of ideas. It allows large corpora to be loaded as plain-text files and analysed, and its interface makes it easy to use. A number of functions allow historians to search for words and word combinations in the selected text files. In short, the tool can help historians to identify meaningful passages with the concordance function (showing the words occurring to the left and right of a keyword), to detect major debates with the concordance plot (which displays the high or low concentration of hits plotted in ‘barcode’ format) and to investigate co-occurring words with the collocate function (showing the words that co-occur with a keyword frequently or in a statistically meaningful way). It is also possible to look more closely at the paragraphs surrounding the keyword by means of the file view (displaying the raw text of the individual text files). Other functions in AntConc, such as the word list and keyword list (counting the words in the corpus, with or without comparison to the words in a reference corpus), served our research less, because the context of the individual words is lost within those lists.
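To make the idea of a concordance concrete, the following minimal Python sketch approximates a keyword-in-context view. It is not AntConc's implementation; the function name `concordance`, the `width` parameter and the sample sentence are invented for illustration.

```python
import re

def concordance(text, keyword, width=30):
    """Return (left, hit, right) context tuples for each whole-word match."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters on either side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(0), right))
    return lines

# Invented sample sentence, not taken from the corpus.
sample = "La religion du médecin et la religion de la mère furent discutées."
for left, hit, right in concordance(sample, "religion", width=15):
    print(f"{left:>15} | {hit} | {right}")
```

Aligning the hits in a column, as the `print` call does, is what lets the eye scan the left and right context quickly.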
In what follows, we will reflect on the different ways that we have used the functions in AntConc. First, we will discuss the specificity of medical journals and the challenges that working with their digitised form poses. Second, we will explain how we became acquainted with our digital sources by using, among other things, AntConc. Finally, we will pay particular attention to some specific challenges in terms of interpreting the research results in AntConc. As we will show, in every step of our research process, methods of close reading were essential to overcome the difficulties of distant reading in AntConc (the use of different functions in AntConc to detect passages of text and patterns of language for further scrutiny).
The nineteenth century is known for the rise in the availability of medical journals all across Europe. This upsurge included the spread of publications directed at a public of general medical practitioners, as well as expert journals. Towards the end of the nineteenth century, the medical field became increasingly professionalised, creating different specialities such as psychiatry, gynaecology or ophthalmology. Psychiatry, for example, saw the birth of the French Annales médico-psychologiques (AMP) in 1843, making it one of the oldest European journals dedicated to the field. In addition, periodicals covering medicine in general also kept circulating, such as the Bulletin de l’Académie Royale de Médecine (BARMB), established in 1841, which assembled the verbatim meeting reports of the Royal Academy of Medicine of Belgium.
For medical historians these journals offer insights into many facets of medical knowledge and practices such as the development of pathologies, treatments and therapies. They also shed light on socio-cultural aspects such as physicians’ religious attitudes, the medical construction of gender, and the role of class in medical treatments. The medical journal as a genre forms another promising avenue for research, which includes studies on its role in professional community building and on the relation between medical periodicals and the general press.
Medical journals consisted of different sections covering, among other things, original research articles, book reviews, meeting reports, conference announcements, obituaries and a section reporting on the employment status of scholars in the field. These sections give the historian a wide variety of content that can be researched. Furthermore, periodicals are an interesting source because of their serial character, which allows the researcher to investigate the long-term development of ideas more easily. However, close-reading a large number of journal volumes in their non-digital form is often not feasible.
Digitised corpora can overcome these problems to a certain extent, but also create their own specific obstacles. After digitisation the historian often works with yearly print runs or specific journal volumes, which does not allow the investigation of one single section within a journal (e.g. all obituaries). Breaking down the digitised journal volumes into sections can be done manually, but this can be very time-consuming. The creation of sub-corpora can be automated by detecting certain boundaries between journal sections (e.g. by training a computer programme to detect the beginning of a section based on a different font size or bold lettering2) or by annotating and tagging the different sections within a journal, for example via XML encoding.3 In both Jolien’s and Eva’s case the digital corpora had no easily detectable boundaries between journal sections, which made splitting the journals automatically impossible. Furthermore, investing in annotating and tagging the different journal sections was not feasible.
Besides, the automation of sub-corpora by no means implies that researchers do not affect the end result. During an experiment with automated splitting, Jolien noticed several problems, such as the absence of section names at the start of a significant number of sections (with the result that sections without names were not incorporated in the sub-corpora) and the changing names of journal sections over time. The miscellaneous section in Belgian medical journals, for instance, had names such as Variétés, Nouvelles de la semaine, mélanges scientifiques, faits divers, etc. Overcoming such obstacles in automated sub-corpora hence requires active intervention by the researcher. Correcting automated sections means assigning section names manually to particular content of the corpus. To be able to compare all the miscellaneous sections of a particular journal, for instance, researchers have to assign them the same name.
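The manual step of giving differently named sections one shared label can be sketched in Python as follows. This is an illustration under strong assumptions, not the procedure used in the experiment: it presumes that each heading stands alone on its own line, and the alias table `SECTION_ALIASES`, the label "miscellaneous" and the function `split_sections` are hypothetical.

```python
# Hypothetical table mapping heading variants to one canonical label.
SECTION_ALIASES = {
    "variétés": "miscellaneous",
    "nouvelles de la semaine": "miscellaneous",
    "mélanges scientifiques": "miscellaneous",
    "faits divers": "miscellaneous",
}

def split_sections(text):
    """Split text at lines that match a known heading; return (label, body) pairs."""
    sections = []
    label, body = "unlabelled", []
    for line in text.splitlines():
        key = line.strip().lower()
        if key in SECTION_ALIASES:
            # A known heading closes the previous section and opens a new one.
            if body:
                sections.append((label, "\n".join(body)))
            label, body = SECTION_ALIASES[key], []
        else:
            body.append(line)
    if body:
        sections.append((label, "\n".join(body)))
    return sections
```

Text that precedes any recognised heading ends up under "unlabelled", which mirrors the problem described above: sections without a name cannot be assigned automatically and still need a human decision.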
Getting to know the digital corpus
Using digitised corpora means that the historian no longer solely relies on the physical copy of a journal, but uses its digital equivalent in PDF or text format. The latter is preferable because most digital tools are not able to process the content of PDF files.
Machine readability is not affected in these text files (if accurate OCR is available4). Yet, plain text can be hard for the human eye to read, depending on how the words and sentences are structured within the text files. The file formats used and the order in which they are converted (PDF, XML, JPEG) to a text format can compromise human readability because the original layout is not preserved (e.g. paragraph breaks are lost) [Fig. 1 vs Fig. 2].
In addition, the researcher’s decision to clean or not clean the corpus (e.g. hyphenation problems, corpus normalisation, eliminating OCR errors) and more specifically his or her access to scripts that can reduce these problems is essential, as it can change both a researcher’s experience with digital tools and the research outcomes tremendously.
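As an example of what such a cleaning script might look like, the following Python sketch joins words that were hyphenated across a line break. It is a deliberately naive illustration, not a script we used; the function name `join_hyphenated_line_breaks` is invented.

```python
import re

def join_hyphenated_line_breaks(text):
    """Join words split by a hyphen at a line end: 'accouche-\\nments' -> 'accouchements'."""
    # A word character, a hyphen, a line break (with optional spaces), a word character.
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

print(join_hyphenated_line_breaks("les accouche-\nments difficiles"))
```

The rule cannot tell a line-break hyphen from a genuine compound that happens to fall at a line end, so "non-" followed by "restraint" on the next line would become "nonrestraint". Any such cleaning step therefore changes which keywords will match afterwards.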
Historians’ search strategies constantly change, depending on the digital format(s) of their corpus, their knowledge of their sources and their research topic. Before and while using an application like AntConc, one especially needs to have some affinity with medical journals and some background knowledge about frequently discussed medical themes and common medical terms. Without such knowledge, it is not only difficult to think of meaningful keywords, it is also difficult to make sense of the outcomes of search results. One way to get insight into the terminology of a corpus is via the collocates function in AntConc. For example, in the BARMB (1842-1914) the word “religion” collocates with frequently used words such as “génitale” and “accouchements”, which hints at the importance of religion in obstetric and gynaecological debates [Fig. 3]. In addition, the concordance function in AntConc, showing one or more keywords in their context, allows the researcher to see at a glance in what type of debates a specific word was used. These two functionalities were, besides close-reading methods, frequently used by Jolien to familiarise herself with the content of medical journals.
In contrast, Eva used these functionalities little or not at all. Due to previous research she was already familiar with dominant discussions and terminology within psychiatry. For example, when researching the medical debate about “non-restraint” (which referred to the disuse of mechanical restraint apparatuses on psychiatric patients and was largely supported by psychiatrists in the United Kingdom), the collocate tables and concordance function did not offer much additional and useful information. In Eva’s case the concordance function only gave slight variations for the word “non-restraint” such as “restraint”, “système”, “pratique”, “anglaise” and thus did not provide Eva with new insights or specific information about the use of non-restraint. Therefore, Eva relied much more on the concordance plot, which gave her a better overview of where to find crucial moments of debate within the journals about this specific topic. In addition, unlike Jolien, who works with a corpus of French-language medical journals, Eva uses corpora in different languages, which complicates the research process and requires additional steps. Non-restraint was mainly developed and acknowledged within the United Kingdom but was also discussed by alienists in other countries. This means that the researcher needs to be aware of (faulty) translations of this keyword (e.g. “no-restraint”, “nulle contrainte”, “non-restreinte”) and needs to take these into account as well while working with tools such as AntConc. Thus, a different starting point and previous knowledge of the researcher can alter why and how the same tool and its functionalities are used.
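One way to take such translation variants into account is to search for all of them in a single pass. The Python sketch below does this with one regular expression; it is an illustration outside AntConc, and the names `VARIANTS` and `find_variants` as well as the sample sentence in the usage line are invented.

```python
import re

# Spelling and translation variants mentioned above.
VARIANTS = ["non-restraint", "no-restraint", "nulle contrainte", "non-restreinte"]

# One alternation of all variants, matched as whole words, ignoring case.
VARIANT_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(v) for v in VARIANTS) + r")\b",
    re.IGNORECASE,
)

def find_variants(text):
    """Return every variant found in text, lower-cased, in order of appearance."""
    return [match.group(0).lower() for match in VARIANT_PATTERN.finditer(text)]

print(find_variants("Le No-restraint, ou nulle contrainte, fut discuté."))
```

The list only finds what the researcher has already thought of, so it has to grow as close reading turns up further renderings of the term.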
Interpreting Search Results
Once we were familiar with the data, some of the medical debates and the vocabulary used in the debates, we were ready to interpret the search results based on the keywords that we entered in AntConc. It is important to note here that interpretation means ‘going back and forth’ between the different functionalities in AntConc: from the concordance tool (where the word appears in its immediate context), to the concordance plot (where the concentration of the word is plotted over the different journal volumes), to the file view (in which the word appears in the text file). While Eva mainly relied on the concordance plot in combination with the file view, Jolien used the concordance function alongside the concordance plot and the file view. Besides, Eva and Jolien both regularly consulted the PDF-format of the journal volumes to read the complete journal contributions and to check in which sections of the journal they appeared.
Both the concordance plot and the concordance function have the advantage that they not only lead to important debates, but also uncover (sometimes very) small but highly interesting passages that are difficult to detect without text-mining tools. The concordance function allows you to see a small part of the context of the keyword, on the basis of which the researcher can decide whether it is relevant to take a closer look at the keyword within a larger context (i.e. a paragraph or the full article). An advantage that is specific to the concordance plot is that it is a visually attractive function that allows historians to detect ‘important years’ in which a keyword appeared more often than usual.
Yet, the functions also present difficulties. A high concentration of occurrences of a certain word in the concordance plot, for instance, can lead to interesting debates, yet it can also be misleading. Figures 4 and 5 show the occurrences of the word “catholique” in the annual volumes of the BARMB. Both the years 1852 and 1898 appear interesting at first sight. However, while the former year indeed leads to interesting passages about the sacrament of baptism and the religious commandment “Thou shalt not kill” in the context of a debate about medical abortion, the latter year mentions “catholique” in a table that refers to the religion of patients suffering from tuberculosis. While close reading affirms that the year 1852 is interesting for historical research about the religious beliefs of gynaecologists and obstetricians, the results from the year 1898 are irrelevant in that respect. Closer analysis, in fact, revealed that more than half of the hits for “catholique” in the BARMB for the period 1842-1914 were irrelevant for Jolien’s research: not only because they did not appear in debates on gynaecology and obstetrics, but also more generally because they were meaningless in the context of research on medicine and religion. The word “catholique” was for instance often mentioned as part of the institutional name of the Leuven university (“Université Catholique de Louvain”). Such examples show that historical explanations about the importance or weight of debates cannot be derived from the statistical occurrence of particular words alone.
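The effect of such institutional names on raw hit counts can be illustrated with a small Python sketch that counts a keyword while ignoring hits inside phrases the researcher has already judged irrelevant. This is a hypothetical post-processing step, not an AntConc feature; the function `count_hits` and the toy sentence are invented and do not come from the BARMB.

```python
import re

def count_hits(text, keyword, exclude_phrases=()):
    """Count whole-word hits of keyword, skipping hits inside excluded phrases."""
    cleaned = text
    for phrase in exclude_phrases:
        # Blank out each excluded phrase before counting.
        cleaned = re.sub(re.escape(phrase), " ", cleaned, flags=re.IGNORECASE)
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, cleaned, flags=re.IGNORECASE))

# Invented toy text for illustration only.
toy = "Professeur à l'Université Catholique de Louvain. Un médecin catholique répondit."
print(count_hits(toy, "catholique"))
print(count_hits(toy, "catholique", ["Université Catholique de Louvain"]))
```

A filter like this only removes false positives that are already known. Hits such as the 1898 table still have to be recognised by reading them.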
In this context it is also important to note that the usefulness of the tool can vary and that search strategies diverge between researchers. In her example, Jolien often used non-medical terms such as the word “catholique”, which will statistically appear less frequently than medical keywords. In Eva’s case, where mostly medical keywords were used, searching with only a single keyword (e.g. “non-restraint”) sometimes left more room for false positives. For example, in 1914 the word “non-restraint” is mentioned within the AMP but not within the context of an actual debate; rather, it is only a reminiscence of important past discussions within the French Société Médico-Psychologiques, as the following sentence shows: “C’est ainsi que sont restées mémorables les discussions sur les monomanies, les névroses extraordinaires, les folies sympathiques, les folies héréditaires, le délire chronique, le non-restraint, l’assistance des aliénés, les aliénés dangereux”. In other words, the use of a single keyword does not always provide the researcher with enough “hits” in the corpus to trace this specific debate in all its facets or to understand what made the debate about non-restraint highly polemic amongst European psychiatrists.
To counteract this problem, Eva worked with keyword lists that were steadily built up for the different language corpora through close and distant reading, combining computer and human intervention. This means that related words were also used as search terms (e.g. “padded room”, “conollyism”, “straightjacket”, “cellule matelassée”, “mechanischen Zwangsmittel”, …), which returned many more useful results and fewer false positives, allowing the debate to be studied with more focus and precision [Fig. 6 vs Fig. 7].
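Why a keyword list can reduce false positives may be easier to see in a small Python sketch. It keeps only paragraphs that contain at least two different terms from the list and ranks them by the number of distinct terms. This is one possible reading of the strategy, labelled here as an assumption, and not a description of Eva's actual workflow; the function `rank_paragraphs`, the threshold `min_distinct` and the toy text are invented.

```python
def rank_paragraphs(text, keywords, min_distinct=2):
    """Return (distinct_term_count, paragraph) pairs, highest count first."""
    ranked = []
    # Paragraphs are assumed to be separated by blank lines.
    for paragraph in text.split("\n\n"):
        lowered = paragraph.lower()
        found = {term for term in keywords if term.lower() in lowered}
        if len(found) >= min_distinct:
            ranked.append((len(found), paragraph))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked

# Invented toy text for illustration only.
toy_text = (
    "The padded room replaced the straightjacket.\n\n"
    "Only a padded room is mentioned here.\n\n"
    "Nothing relevant."
)
terms = ["padded room", "straightjacket", "conollyism"]
print(rank_paragraphs(toy_text, terms))
```

A passing mention with a single term, like the 1914 reminiscence, would fall below the threshold, while passages that bring several related terms together rise to the top. The sketch depends on paragraph breaks surviving the conversion to plain text, which is not guaranteed.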
The comparison of Jolien’s and Eva’s use of the tool AntConc showed that digital research strategies can be very diverse. Every researcher creates his/her own research process and deals with the constraints of the tool differently. Due to the non-medical character of Jolien’s keywords within a medical corpus her search strategy and research goals were manageable within a tool such as AntConc. In Eva’s case this was less feasible as she dealt with medical keywords, making it less easy to find the proverbial needle in a haystack. The use of a single keyword did not provide enough information about the non-restraint debate itself, but rather required the use of additional keywords in order to find passages that were purely focused on debates about non-restraint. This brought Eva to additional explorations with other tools that facilitated her research goals more easily, for example the use of topic modelling.5
However, we still share the same critical views on how to approach digital sources and/or tools and the not always straightforward research output these applications create. Experimenting with analysing texts in AntConc has taught us that distant and close reading need to go hand in hand within digital historical research. As Nan Z. Da has recently stated, “Literary objects are too few, and too complex, to respond interestingly to computational interpretation — not mathematically complex, but complex with respect to meaning, which is in turn activated by the quality of thought, experience, and writing that attends it”.6 While digital analysis leads to statistically significant patterns of language, these patterns only have meaning once they are subject to verification and can be properly explained. For this reason, and because all algorithms produce measurement errors, we consider close reading methods as an interpretational necessity and a controlling mechanism.
Edited by Juliane Tatarinov
1. Anthony, L. 2005. “AntConc: Design and Development of a Freeware Corpus Analysis Toolkit for the Technical Writing Classroom.” In IPCC 2005, 729–37; Anthony, L. 2013. “Developing AntConc for a New Generation of Corpus Linguists.” In Proceedings of the Corpus Linguistics Conference, 14–16. Lancaster University.
2. However, font size and other text characteristics are almost never preserved in plain text files.
3. Pablo Picasso Feliciano de Faria, Fabio Natanael Kepler, and Maria Clara Paixão de Sousa, ‘An Integrated Tool for Annotating Historical Corpora’, Uppsala, Sweden, Proceedings of the Fourth Linguistic Annotation Workshop, July 2010, 217–221.
4. The OCR rate depends on the quality of the original PDF or image scan, the software used to apply OCR and convert the files to plain text, as well as the language of the source because some languages are better recognised by software than others.
5. Eva Andersen, Maria Biryukov, Roman Kalyakin and Lars Wieneke, ‘How to read the 52.000 pages of the British Journal of Psychiatry? A collaborative approach to source exploration’, Journal of Data Mining and Digital Humanities, In Press.
6. Nan Z. Da, ‘The Digital Humanities Debacle: Computational Methods Repeatedly Come up Short’, The Chronicle of Higher Education, 27 March 2019, https://www.chronicle.com/article/The-Digital-Humanities-Debacle/245986.