“You shall know a word by the company it keeps”
I have grown up eating almonsay. Almonsay fishing is a popular activity worldwide, but almonsay population levels are of concern in both the Atlantic and the Pacific Oceans. Different varieties of almonsay taste different, and there are so many ways to cook it.
I doubt you have ever tried almonsay, because this is just the word ‘salmon’ translated into Pig Latin. Yet while reading the first paragraph, what kind of image appeared in your mind: a vegetable, a little furry animal, or rather something else, cold-blooded, with fins and gills?
In the middle of the 20th century, Firth (1957) and Harris (1954) proposed and popularised the distributional hypothesis, according to which words used in similar contexts tend to have similar meanings. This is why one can guess the meaning of ‘almonsay’ quite accurately by looking at the neighbouring words – “eating”, “fishing”, “varieties”, “taste”, “cook”, “Atlantic”, “Pacific” and “ocean” (or, as Firth (1957) famously put it, “You shall know a word by the company it keeps”).
Can we teach a machine to read?
Humanities scholars have long turned to computational methods to complement traditional approaches, particularly when it comes to text analysis (think of Father Busa and his famous Index Thomisticus, a project begun in 1949 that produced one of the first large searchable digital corpora). Nowadays, applying computer science and artificial intelligence techniques (e.g. natural language processing and machine learning) is becoming almost a necessity in digital history and literary studies, fields that rely on an ever-growing amount of textual data. The need to process and interpret this data has given rise to such approaches to humanistic research as distant reading (https://www.nytimes.com/2011/06/26/books/review/the-mechanic-muse-what-is-distant-reading.html) (Moretti), macroanalysis (https://lareviewofbooks.org/article/an-impossible-number-of-books/#!) (Jockers) and culturomics (http://www.culturomics.org/cultural-observatory-at-harvard/papers) (Michel et al., 2011), all of which are believed to offer alternative perspectives on texts, bring out some (often unexpected) patterns and prompt researchers to ask new questions.
Word embeddings have recently become not only a buzzword among digital humanists (Schmidt, 2015) but also an essential part of many natural language processing systems. The technique represents the words in a text as vectors (https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/modal/v/vector-introduction-linear-algebra) based on each word’s context and ‘embeds’ these vectors in a so-called vector space, which lets us apply algebraic operations to them (for example, addition and subtraction, as in the most famous (and, perhaps, overused) example of “king – man + woman = queen”), as well as find words semantically close to a query, calculate the semantic similarity between a pair of words, visualise semantic relations between words in a corpus, or extract the raw vectors for use in another algorithm (Kutuzov & Kuzmenko, 2017).
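To make the vector arithmetic concrete, here is a minimal sketch in Python. The vectors below are hand-made, three-dimensional toys invented purely for illustration; a real model such as word2vec learns vectors of 100–300 dimensions from a large corpus, but the arithmetic is the same:

```python
import numpy as np

# Hand-made 3-dimensional "embeddings" for a few words. These numbers are
# invented for illustration only; a trained model would learn them from text.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "fish":  np.array([0.5, 0.5, 0.5]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The famous analogy: king - man + woman should land closest to queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
candidates = [w for w in vectors if w not in {"king", "man", "woman"}]
nearest = max(candidates, key=lambda w: cosine(target, vectors[w]))
print(nearest)  # prints "queen" with these toy vectors
```

The same cosine function also answers the other queries mentioned above: the semantic similarity between a pair of words is simply the cosine of the angle between their vectors, and a word’s “nearest neighbours” are the words maximising that value.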
How can word vectors help humanities scholars?
The way humanities scholars apply word embeddings depends on the research question, and here are some of the options that have been described in the literature.
1. Tracking semantic evolution (diachronic word embeddings)
If you have a time-stamped corpus, it is possible to track the evolution of certain concepts (think of words such as ‘viral’ or ‘cloud’, for example, and how their meanings have changed over the last decades). HistWords (https://nlp.stanford.edu/projects/histwords/) is a project studying the semantic evolution of more than 30,000 words across four languages. Papers by Hamilton, Leskovec and Jurafsky (2016), Barranco, Dos Santos and Hossain (2018), Kutuzov, Velldal and Øvrelid (2017) and Tahmasebi and Risse (2017) showcase the wide range of topics the authors have approached with word embeddings: from exploring the laws of semantic change (Hamilton et al., 2016) to tracing and predicting armed conflicts (Kutuzov et al., 2017).
Studying semantic shifts helps researchers learn more about human language and the driving forces of semantic evolution, which include linguistic, psychological, social and cultural factors. Kutuzov, Øvrelid, Szymanski and Velldal (2018) offer an overview of diachronic word embeddings, with a section on their applications.
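One common recipe for detecting such shifts is to train a separate model on each time slice of the corpus and then compare a word’s nearest neighbours across slices. The sketch below illustrates the idea under strong simplifying assumptions: the vectors are invented toys, and a real study would also need to align the vector spaces of the different slices (for example with Procrustes alignment, as in Hamilton et al., 2016) before comparing them:

```python
import numpy as np

def nearest_neighbours(word, vectors, k=2):
    """Return the set of k words whose vectors are most similar to `word`'s."""
    v = vectors[word]
    sims = {
        w: float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
        for w, u in vectors.items() if w != word
    }
    return set(sorted(sims, key=sims.get, reverse=True)[:k])

# Invented toy vectors for two time slices. In the 1990s slice 'cloud' sits
# near weather words; in the 2010s slice it has drifted towards computing.
slice_1990s = {
    "cloud":   np.array([0.9, 0.1, 0.1]),
    "rain":    np.array([0.8, 0.2, 0.1]),
    "sky":     np.array([0.9, 0.2, 0.2]),
    "server":  np.array([0.1, 0.9, 0.1]),
    "storage": np.array([0.1, 0.8, 0.2]),
}
slice_2010s = dict(slice_1990s, cloud=np.array([0.2, 0.9, 0.1]))

old = nearest_neighbours("cloud", slice_1990s)  # {"rain", "sky"}
new = nearest_neighbours("cloud", slice_2010s)  # {"server", "storage"}

# Jaccard distance between neighbour sets: 0 = stable meaning, 1 = full shift.
shift = 1 - len(old & new) / len(old | new)
print(old, new, shift)
```

Comparing neighbour sets is only one of several shift measures used in the literature; others compare the (aligned) vectors of the same word across slices directly.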
2. Tracing historical development of a vocabulary
Van Lange and Futselaar (2018) attempted to track the vocabulary of Dutch parliamentary discussions within a specific time period and to link developments in this vocabulary to historical events and discourse changes. The researchers ask historiographical questions and seek to assess the validity of traditional scholarship using computational methods, thus aiming to complement, not replace, traditional historical approaches. In the field of literary studies, Maslinsky (2018) uses word vectors to explore the history of Soviet children’s literature and changes in the use of emotional language in a corpus of 400 books published from the 1920s to the 1980s.
3. Understanding relationships between words
Word embeddings seem to be a promising tool for modelling interactions between people and concepts in literature. Examples of such projects include exploring 18th-century literature (Heuser, 2016), clustering characters in 19th-century novels (Grayson, Mulvany, Wade, Meaney and Green, 2016), comparing novels by Austen and Edgeworth (Kerr, 2017), and investigating relationships between persons and concepts in over one billion words of the works of a 6th-century scholar (Bjerva and Praet, 2015).
Challenges and promises
There are certain challenges involved in applying word vectors to humanities research. Firstly, there is a steep learning curve and the necessity to acquire certain technical skills (for example, a programming language and the fundamentals of machine learning, neural networks and natural language processing). However, there are ways around this: for example, WebVectors is a free open-source toolkit developed to assist digital humanists wishing to experiment with word embeddings trained on different corpora (at the moment, one can choose from the British National Corpus, English Wikipedia, the Google News corpus and even a corpus in Norwegian; for Russian vectors, check out http://rusvectores.org/ru/ (Kutuzov & Kuzmenko, 2017), developed by the same researchers). Secondly, it is not always clear how to interpret the results, which is why word embeddings should be used as a complementary tool, often requiring the active participation of a domain expert.
Nevertheless, word embeddings have proved an interesting approach to a variety of problems in digital humanities, particularly in digital history and literary studies, as demonstrated by the examples in this post, and they may be worth trying out as part of your “thinkering” strategy.
Image: Embedding projector – a web application for interactive visualisation and analysis of high-dimensional data (https://ai.googleblog.com/2016/12/open-sourcing-embedding-projector-tool.html)
Barranco, R. C., Dos Santos, R. F. and Hossain, M. (2018). Tracking the Evolution of Words with Time-reflective Text Representations.
Bjerva, J. and Praet, R. G. L. M. (2015). In Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. Association for Computational Linguistics (ACL), pp. 53-57
Busa, R. (1974-1980). Index Thomisticus (http://www.corpusthomisticum.org/it/index.age)
Firth, J. R. (1957). A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis, pp. 1-32. Oxford: Philological Society. Reprinted in F. R. Palmer (ed.), Selected Papers of J. R. Firth 1952-1959, London: Longman (1968)
Grayson, S., Mulvany, M., Wade, K., Meaney, G. and Green, D. (2016). Novel2Vec: Characterising 19th Century Fiction via Word Embeddings. In: 24th Irish Conference on Artificial Intelligence and Cognitive Science (AICS'16), University College Dublin, Dublin, Ireland, 20-21 September 2016
Hamilton, W., Leskovec, J. and Jurafsky, D. (2016). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
Harris, Z. (1954). Distributional structure. Word, 10(2-3): 146-162
Heuser, R. (2016). Word vectors in the eighteenth century, episode 4: Semantic networks. Adventures of the Virtual [online]. Available at: http://ryanheuser.org/word-vectors-4
Kerr, S. J. (2017). When Computer Science Met Austen and Edgeworth. NPPSH Reflections, 1, pp. 38-52. ISSN 2565-6031
Kutuzov, A. and Kuzmenko, E. (2017). WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models. In: Ignatov, D. et al. (eds) Analysis of Images, Social Networks and Texts. AIST 2016. Communications in Computer and Information Science, vol. 661. Springer, Cham
Kutuzov, A., Velldal, E. and Øvrelid, L. (2017). Tracing armed conflicts with diachronic word embedding models. In Proceedings of the Events and Stories in the News Workshop (ACL Anthology: W17-2705).
Kutuzov, A., Øvrelid, L., Szymanski, T. and Velldal, E. (2018). Diachronic word embeddings and semantic shifts: a survey.
Maslinsky, K. (2018). From history of emotions back to literary history: a case of Soviet realistic prose for children and young adults. In: Presentation Abstracts. 6th Estonian Digital Humanities Conference Data, humanities & language: tools & applications. 26-28 September 2018. Available at: https://drive.google.com/file/d/13k2Bj7SZrdSYyb7L12iffAyQe6Q1IJdj/view?usp=sharing
Schmidt, B. (2015). Vector Space Models for the Digital Humanities. Available at: http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html
Tahmasebi, N. and Risse, T. (2017). On the Uses of Word Sense Change for Research in the Digital Humanities. In J. Kamps, G. Tsakonas, Y. Manolopoulos, L. Iliadis and I. Karydis (eds.), Research and Advanced Technology for Digital Libraries: 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017, Thessaloniki, Greece, September 18-21, 2017, Proceedings (pp. 246-257). Springer International Publishing. ISBN: 978-3-319-67008-9