Christof Schöch lectures on the Use and Abuse of Word Embedding

4 December 2019 | 14.00-15.00 | Speaker: Christof Schöch, Professor of Digital Humanities at the University of Trier


Word embedding modelling is currently at the cutting edge of natural language processing, and a significant body of research explores its opportunities and limitations. That research, however, comes mainly from the computer science community. Digital humanists have recently been applying, discussing and thinking about word embeddings a great deal, but can we say that the DH community is catching up with the computer scientists in this seemingly promising area? Should it? If so, how? And how can word embedding models help humanities scholars in their investigations?

These and other questions were discussed on 4 December at a talk in the DTU Lecture Series, which attracted quite a crowd from both the humanities and computer science departments. Christof Schöch, Professor of Digital Humanities at the University of Trier, shared his thoughts on and experience with word embedding models, offering an excellent introduction both to the intuition behind the algorithms that implement distributional semantics and to the method's applications in the digital humanities.

The talk began with an overview of the history of word embeddings. Schöch drew the audience's attention to the fact that the development of the method and its application in the humanities owe much to the convergence of conceptual and technical advances. For example, distributional semantics, on which word embedding modelling is based, dates back to the 1970s, long before word embedding algorithms were developed. Vector space models were initially used extensively in information retrieval, where documents are represented as vectors so that distances between them can be computed. Those distances serve as measures of semantic closeness and are used to retrieve documents semantically related to a given query. These days, according to Schöch, word embeddings are used in, and are potentially useful for, just about any kind of research in the humanities. He presented two examples from his own projects: a model built on French Wikipedia (https://zenodo.org/record/1627920) and a model built on a corpus of French books.
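The retrieval idea above can be sketched in a few lines: documents become term-frequency vectors, and the cosine of the angle between two vectors measures how semantically close they are. This toy example (not from the talk; the vocabulary and counts are invented) uses only the Python standard library:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical term-frequency vectors over the vocabulary
# ["novel", "poem", "theatre"]:
doc_a = [4, 1, 0]   # mostly about novels
doc_b = [3, 2, 0]   # similar topical profile
doc_c = [0, 0, 5]   # about theatre

print(cosine_similarity(doc_a, doc_b))  # close to 1: semantically near
print(cosine_similarity(doc_a, doc_c))  # 0: no shared vocabulary
```

A query can be vectorised the same way, and the documents with the highest cosine similarity to it are returned first.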


Modelling can be performed with different programming languages and algorithms. The Gensim package for Python is widely used nowadays because of its relative simplicity, and word2vec is the most popular algorithm. In a nutshell, word2vec infers relationships between words in a text or a collection of texts by learning to predict a missing word from its context words (this is the CBOW variant of word2vec; the skip-gram variant instead predicts the context words from the target word).
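To make the CBOW framing concrete, here is a minimal sketch (not from the talk) of how the training data is prepared: a window slides over the text, and each position yields a pair of (context words, missing target word) from which the model learns:

```python
def cbow_pairs(tokens, window=2):
    """Generate (context_words, target_word) training pairs, as in the
    CBOW variant of word2vec: the model learns to predict the target
    word from the words surrounding it."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

sentence = "the quick brown fox jumps".split()
for context, target in cbow_pairs(sentence):
    print(context, "->", target)
# e.g. ['the', 'quick', 'fox', 'jumps'] -> brown
```

Skip-gram simply reverses each pair: the target word is the input and each context word becomes a prediction target.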

Evaluating the quality of the embeddings is a challenge, though. There is no gold standard of relations between words; scholars often compile domain-specific lists of relations by hand and measure how far their model deviates from them. Another task often used for evaluation is "odd one out": asking the model to pick the word that does not fit (for example, "banana", "apple", "car", "pear" – which one is the odd one?). A model that captures semantics well should be able to spot it.
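The usual way to implement odd-one-out (it is the idea behind Gensim's `doesnt_match()`) is to average the word vectors and return the word least similar to that mean. A toy illustration with hypothetical 2-dimensional embeddings:

```python
import math

def odd_one_out(words, vectors):
    """Return the word farthest (by cosine similarity) from the mean
    of all the words' vectors."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    dim = len(next(iter(vectors.values())))
    mean = [sum(vectors[w][k] for w in words) / len(words) for k in range(dim)]
    return min(words, key=lambda w: cosine(vectors[w], mean))

# Hypothetical embeddings: the fruits cluster together, "car" does not.
toy = {
    "banana": [0.9, 0.1],
    "apple":  [0.8, 0.2],
    "pear":   [0.85, 0.15],
    "car":    [0.1, 0.9],
}
print(odd_one_out(list(toy), toy))  # -> car
```

With a real model the same logic runs over hundreds of dimensions, but the principle is identical.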

Schöch discussed two possible applications of word embeddings in the digital humanities: topic modelling and sentiment analysis. Topic modelling has been used extensively in the digital humanities over the last decade, but according to Schöch the hype is now over: in the family of tools digital humanists use for text analysis, topic modelling is a "hammer" compared to word embedding models, which can produce significantly more precise, more fine-grained results. Topic models can nevertheless be enhanced with word embeddings, for instance by using the embeddings to evaluate topic coherence. As for sentiment analysis, by designating specific words as centroids for positivity and negativity (for example, "joy" and "happiness" for positive, "sorrow" and "sadness" for negative), one can map every word in the corpus vocabulary onto these axes and explore the sentiment-based structure of a text. Here Christof Schöch referred to Ryan Heuser, who has published work exploring this curious research direction.
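One plausible way to realise the centroid idea (a sketch, not the speaker's or Heuser's exact method; the 2-d vectors are invented for illustration) is to score each word by the difference between its cosine similarity to the positive centroid and to the negative one:

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def sentiment_score(word_vec, pos_centroid, neg_centroid):
    """Positive score = closer to the positive seed words,
    negative score = closer to the negative seed words."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    return cosine(word_vec, pos_centroid) - cosine(word_vec, neg_centroid)

# Hypothetical 2-d embeddings for seed and test words.
vecs = {
    "joy":       [0.9, 0.1],
    "happiness": [0.8, 0.1],
    "sorrow":    [0.1, 0.9],
    "sadness":   [0.1, 0.8],
    "smile":     [0.7, 0.2],
    "grief":     [0.2, 0.7],
}
pos = centroid([vecs["joy"], vecs["happiness"]])
neg = centroid([vecs["sorrow"], vecs["sadness"]])
print(sentiment_score(vecs["smile"], pos, neg))  # positive lean
print(sentiment_score(vecs["grief"], pos, neg))  # negative lean
```

Running this over a whole vocabulary yields the sentiment-based map of the corpus described above.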

The lecture was followed by a lively discussion, and one of the most prominent topics raised was the future of word embedding modelling. Christof Schöch expressed confidence that the future looks quite bright. The difference between applying distributional semantics in the 1970s and today is the amount of data and processing power available to scholars. In his opinion, it is the confluence of the big data revolution, hardware advances (graphics cards/GPUs) and more intense thinking about the method that has led to such a breakthrough in the use of vectors in natural language processing. The future, Schöch suggested, is not entirely clear, but word embeddings are a very active and very promising area of NLP research at the moment. The key is to make sure that the digital humanities keep up with NLP advances. In stylometry, for example, it took researchers decades to move from Manhattan and Euclidean distance to cosine distance (a step computer scientists had taken long before), but when it happened it made a real difference. Word embedding modelling seems to be part of the same story: the first paper was published in 2013; the first DH application paper, in 2015. That lag seems to be a trend – and a challenge for the digital humanities. Topic modelling is no longer at the forefront, and the pressure on digital humanists to catch up with computer science advances is increasing.
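Why did the move to cosine distance matter in stylometry? One standard explanation is that cosine compares only the *profile* of word frequencies, not their magnitude, so it is insensitive to text length, whereas Euclidean distance is not. A small sketch (invented counts, not from the talk):

```python
import math

def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1 - dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

# Raw word counts for the same authorial profile at two text lengths:
short_text = [10, 5, 2]
long_text  = [100, 50, 20]   # identical proportions, ten times longer

print(euclidean(short_text, long_text))        # large: misled by length
print(cosine_distance(short_text, long_text))  # ~0: same stylistic profile
```

The two texts have the same relative word frequencies, so cosine distance correctly treats them as stylistically identical, while Euclidean distance reports them as far apart.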

The audience was also interested in whether word embedding models can handle polysemy. Christof Schöch explained that standard models cannot: the word "bank" in "river bank", for example, is represented by the same vector as "bank" in "robbed a bank". Some newer models, however, take context into account and look very promising in this regard, producing different vectors for different senses of the same word.
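Contextual models such as those mentioned above are neural networks far beyond a blog-post snippet, but the core intuition can be caricatured in a few lines: if a token's representation mixes in its neighbours' vectors, the same word automatically gets different vectors in different sentences. A deliberately crude toy (invented vectors; not how real contextual models work internally):

```python
def contextual_vector(tokens, index, static_vecs, window=2):
    """Toy stand-in for contextual embeddings: blend a token's static
    vector with the average of its neighbours', so the same word ends
    up with different vectors in different sentences."""
    dim = len(next(iter(static_vecs.values())))
    neighbours = (tokens[max(0, index - window):index] +
                  tokens[index + 1:index + 1 + window])
    ctx = [sum(static_vecs[w][k] for w in neighbours) / len(neighbours)
           for k in range(dim)]
    own = static_vecs[tokens[index]]
    return [(o + c) / 2 for o, c in zip(own, ctx)]

# Hypothetical static vectors: axis 0 = "finance", axis 1 = "nature".
vecs = {"bank": [0.5, 0.5], "robbed": [0.9, 0.0], "a": [0.5, 0.5],
        "river": [0.0, 0.9], "the": [0.5, 0.5]}

financial = contextual_vector(["robbed", "a", "bank"], 2, vecs)
riverside = contextual_vector(["the", "river", "bank"], 2, vecs)
print(financial, riverside)  # same word, two different vectors
```

Real contextual models learn this blending with deep networks trained on huge corpora, but the effect is the same: "bank" near "robbed" drifts toward finance, "bank" near "river" toward nature.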

Finally, a provocative question was asked: "Human thought cannot be captured by math – or can it?" Christof replied by referring to Willard McCarty's "residue of uniqueness": models never capture all the details of the reality being modelled, but the act of constructing a model makes the researcher more aware of those details – as you adjust the model, you push further. By failing to model everything, we begin to think about the problem more precisely and develop better intuitions about the underlying matters. Even imperfect modelling is therefore useful, because it brings hidden factors to light.
