From 16 to 17 November 2017 we had a DTU skills training on Introduction to Programming with Python, organised by dr. Folgert Karsdorp. After the workshop we built a small program that creates concordance tables from a file containing lists of terms related to Eva’s research on the exchange of psychiatric knowledge in the nineteenth century. As a case study we wanted to delve into the travelling patterns of these psychiatrists.
At first, we printed each sentence within the source, including the highlighted country names, for a specific document. Next, we decided to export the results to a comma-separated values (csv) file to work on afterwards. The second extension we wrote asked for input inside the terminal, so that the program could be executed for a specific document with a specific file path and for a single list of terms (such as country, …). Since Eva is working on a fairly large number of text files, the third and final extension allowed her to specify only the file path to a folder, so that all text files within the folder would be processed for a single list of terms. This creates several csv files, each named after the original text file and the list of terms used. A final variable that could be changed within the terminal was the number of words to be printed before and after the search term.
1. Highlighting terms in a concordance table
We started with an outline of all the steps we would need to take, such as: ‘make a list of terms’, ‘transform the terms list into lowercase’, ‘make connection to the text files and read the documents into Python’, etc. In the next step we actually wrote out the program. As an example, we started with a simple list of countries (copy-pasted from a document), as shown below.
landen_lijst = ['États-Unis', 'Amerika', 'Verenigde Staten',
                'United States', 'Grande-Bretagne', 'Angleterre', 'Royaume-uni', 'Ecosse']
Later on we replaced this hard-coded list with a file path to a list on Eva’s computer and built a dictionary out of it; for this we also needed to import the right libraries, such as collections. This workflow made it much easier to account for changes in the list of terms when new keywords were added. In this piece of code we first opened the file at that path, read all of its lines into the variable termlines, and closed the file when it was done. Closing the file releases the file handle and is considered good practice in programming. The lines were then grouped in the dictionary terms, where each line starting with # serves as the header for the terms that follow it.
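import collections

# Read the term list into memory, then close the file
interms = open('.../Termenlijst.txt')
termlines = interms.readlines()
interms.close()

# Group the terms under the most recent '#' header line
terms = collections.defaultdict(list)
header = None
for line in termlines:
    if line.startswith('#'):
        header = line.strip()
    else:
        terms[header].append(line.strip())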
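The same parsing step can be sketched with a context manager, which closes the file automatically even if an error occurs. This is only an illustration: the function name parse_terms and the sample content are assumptions, since the real Termenlijst.txt is not shown here.

```python
import collections
import io


def parse_terms(lines):
    """Group terms under the most recent '#' header line (hypothetical helper)."""
    terms = collections.defaultdict(list)
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('#'):
            header = line
        else:
            terms[header].append(line)
    return terms


# Invented sample standing in for the term list file
sample = io.StringIO("# country\nFrance\nAngleterre\n# travel\nvoyage\nétranger\n")
with sample as interms:  # with a real file: open(path, encoding='UTF-8')
    terms = parse_terms(interms)

print(dict(terms))
```

Each header, including its # sign, becomes a key, which is why the # has to be removed again later when the term list name is reused in a file name.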
We also converted all terms in the list to lowercase to make processing and searching for particular words throughout the sources much easier. Afterwards, we repeated the same steps for the actual sources (loading and reading them, converting them to lowercase).
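The lowercasing step might look like the sketch below. The shortened landen_lijst and the sentence in issue are invented sample values, and landen_lower is a name introduced here for illustration.

```python
# Shortened, invented sample of the country list
landen_lijst = ['États-Unis', 'Angleterre', 'Royaume-uni']

# Lowercase every term so that matching does not depend on capitalisation
landen_lower = [term.lower() for term in landen_lijst]

# Invented sample sentence standing in for a source document
issue = "Le docteur visita l'Angleterre et les États-Unis."
issue_lower = issue.lower()

print(landen_lower)
print(issue_lower)
```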
Now we had to think about how to combine the list of terms and the actual sources to find the right words and put them in context by showing some of the text before and after the specific keywords. We tried a few different things, such as defining the number of characters we wanted, but this was not convenient. Then we tried to split the source text into sentences, as seen in the code below, to search more easily for specific words. One of the reasons this method also did not work is that a search for ‘France’ matched inside longer words, so the results also included ‘souffrance’. This was of course not the intention, so we kept on searching for other options.
# Split on any one sentence-ending mark; the character class [.!?;] matches a single mark
split_sentence = re.split(r'[.!?;]', issue_lower)
print(split_sentence[0])
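The ‘France’ versus ‘souffrance’ problem comes from substring matching. A minimal sketch of the difference, using an invented sample sentence, compares a plain substring count with a regular expression anchored at word boundaries (\b):

```python
import re

# Invented sample text containing both the country name and a longer word
issue_lower = "la souffrance des malades en france."

# A plain substring search also counts the 'france' inside 'souffrance'
substring_hits = issue_lower.count('france')

# \b anchors the match at word boundaries, so only the whole word counts
whole_word_hits = len(re.findall(r'\bfrance\b', issue_lower))

print(substring_hits, whole_word_hits)
```

Splitting the text into separate words, as described next, has the same effect as the word-boundary pattern: a term only matches when it equals a complete token.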
In the end, the best option was to split the source text into words with the help of the regular expression library. Locating words from the term list in the source was finally possible. We also included the option to specify how many words before and after a certain term we would like to contextualize (see num_words = 200).
# Tokens are either runs of word characters or runs of punctuation
tok_regex = r'\w+|[^\w\s]+'
re_tokens = re.findall(tok_regex, issue_lower)
concordance = []
num_words = 200
for i in range(len(re_tokens)):
    word = re_tokens[i]
    # Context windows; max() keeps the left slice from starting before the first token
    prev_words = re_tokens[max(i-num_words, 0):i]
    next_words = re_tokens[i+1:i+1+num_words]
    if word in landen_lijst:
        print(' '.join(prev_words))
        click.secho(word, bg='yellow')
        print(' '.join(next_words))
        print('---------------')
        concordance.append([str(i), ' '.join(prev_words), word, ' '.join(next_words)])
print(concordance[5])
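To see what a concordance row looks like, the loop above can be wrapped in a small function and run on a short text. This is a sketch under assumptions: build_concordance is a hypothetical name, the sample sentence is invented, and the highlighting with click is left out so that the example only needs the standard library.

```python
import re


def build_concordance(text, terms, num_words=3):
    """Return [position, left context, term, right context] rows (hypothetical helper)."""
    tokens = re.findall(r'\w+|[^\w\s]+', text.lower())
    wanted = {term.lower() for term in terms}  # a set makes the lookup fast
    rows = []
    for i, word in enumerate(tokens):
        if word in wanted:
            left = tokens[max(i - num_words, 0):i]
            right = tokens[i + 1:i + 1 + num_words]
            rows.append([str(i), ' '.join(left), word, ' '.join(right)])
    return rows


# Invented sample sentence
sample = "Le docteur fit un voyage en France pour visiter un asile."
rows = build_concordance(sample, ['France', 'voyage'], num_words=2)
for row in rows:
    print(row)
```

Each row holds the token position, the words before the term, the term itself and the words after it, which is the same layout that is later written to the csv file.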
Highlighting the original terms in the concordance table made it easier to find them. Here we used another library named click, which can give the text or its background any colour specified by the user. The call click.secho(word, bg='yellow') takes the place of the print command for the search term, as shown above. The output in the terminal can be seen in the screenshot below. In this example the program included terms that describe travels of psychiatrists, words such as ‘étranger, voyage, visit, etc’. The fragment below describes the visit of doctor Webster (from the UK) to the French Charenton asylum. These results could in turn lead back to the structured pdf text and to a more directed search in archives and libraries to check if Webster wrote an account of his visit to the asylum.
To export the data outside of the terminal environment, an extra piece of code generated a concordance csv file. First we needed to open a file to write in, specifying the title of the file and the .csv extension, as well as the UTF-8 encoding. We used the csv library, with the csv.writer() function followed by writerows(), which received the concordance list we created above.
with open('test.csv', 'w', encoding='UTF-8') as f:
    writer = csv.writer(f)
    writer.writerows(concordance)
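A slightly extended sketch of the export step is given below. It adds two things that are not in the original program: a header row (the column names are invented for illustration) and the newline='' argument that the csv module documentation recommends when opening files for csv.writer. The file is written to a temporary folder so the example can be run anywhere.

```python
import csv
import os
import tempfile

# One invented concordance row: position, left context, term, right context
concordance = [['4', 'fit un', 'voyage', 'en france']]

out_path = os.path.join(tempfile.mkdtemp(), 'test.csv')

# newline='' lets the csv module control line endings itself
with open(out_path, 'w', encoding='UTF-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['position', 'left', 'term', 'right'])  # hypothetical column names
    writer.writerows(concordance)

# Read the file back to check what was written
with open(out_path, encoding='UTF-8', newline='') as f:
    rows_back = list(csv.reader(f))

print(rows_back)
```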
2. Specifying text name, file path and term list
In order to prompt a user for input in the terminal, Python 3 (as opposed to Python 2) uses a function called input(), where the question you want to print in the terminal goes between the parentheses. Because the user’s response was used throughout the program, we stored the result in a variable. We posed three different questions, stored the answers in different variables, and printed the results to double check our code.
path = input('Which folder is this file in? ')
txtname = input('Which file do you want? ')
filename = path + txtname
print(filename)
termname = input('Which termlist do you want? ')
print(termname)
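Joining the folder and file name with + only works when the folder answer ends with a path separator. A minimal alternative, assuming the same two answers, is os.path.join, which inserts the separator when it is missing. The function name build_filename and the sample values are introduced here for illustration only.

```python
import os


def build_filename(path, txtname):
    """Join folder and file name, adding a separator if needed (hypothetical helper)."""
    return os.path.join(path, txtname)


# Invented sample answers; in the program these come from input()
filename = build_filename('journals', 'issue1.txt')
print(filename)
```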
At first we simply asked for the path to a specific file, but we needed a new filename for each resulting csv file. Therefore, we reused the name of the text file at the end of the program and stored it as a separate variable. However, we also needed to read in the correct file within our code, so we concatenated the path and the text name under the variable filename for internal use. To store the result we created a new csv file. First we stripped the .txt from the name of the text file and made sure that the # symbol was removed from the name of the term list, since Eva formatted the titles in the term lists with a #. We then created the correct string under the variable csvfile and printed it, both to double check and to be able to find the result afterwards. Finally, we replaced the fixed name of the document in the export step with our variable csvfile.
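# Remove only a trailing '.txt'; str.strip('.txt') would strip any of those characters from both ends
if txtname.endswith('.txt'):
    txtname = txtname[:-4]
termname = termname.strip('# ')
csvfile = '%s.%s.csv' % (txtname, termname)
print(csvfile)

with open(csvfile, 'w', encoding='UTF-8') as f:
    writer = csv.writer(f)
    writer.writerows(concordance)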
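One detail deserves care when building the output name: str.strip() removes a set of characters from both ends of a string, not a fixed suffix, so stripping '.txt' can eat into the file name itself. A sketch of a safer variant uses os.path.splitext from the standard library; make_csv_name is a hypothetical helper name and the sample values are invented.

```python
import os


def make_csv_name(txtname, termname):
    """Build '<text name>.<term list>.csv' (hypothetical helper)."""
    stem, _ext = os.path.splitext(txtname)  # splits off the last extension only
    label = termname.strip('# ')            # drop the leading '#' and spaces
    return '%s.%s.csv' % (stem, label)


csvfile = make_csv_name('text.txt', '# country')
print(csvfile)

# For comparison: strip() treats '.txt' as the characters '.', 't' and 'x'
stripped = 'text.txt'.strip('.txt')
print(stripped)
```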
3. Batch processing a folder
Because we did not need the text name for batch processing, we simply left out the line asking for the name of the text file. We did add a line to ask for the number of words needed “as context”, that is, the number of words shown before and after a certain term. The initial prompts thus changed to:
path = input('Which folder is this file in? ')
word_range = int(input('How many words do you need as context? '))
termname = input('Which termlist do you want? ')
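Calling int() directly on the answer raises a ValueError when the user types something that is not a whole number. One possible way to guard against this is sketched below; the helper name parse_word_range and the fallback value of 10 are assumptions made for this example, not part of the original program.

```python
def parse_word_range(answer, default=10):
    """Convert the typed answer to a non-negative integer (hypothetical helper)."""
    try:
        value = int(answer.strip())
    except ValueError:
        return default  # fall back when the answer is not a whole number
    return max(value, 0)  # a negative context size makes no sense


# In the program the argument would be the result of input(...)
print(parse_word_range(' 25 '))
print(parse_word_range('many'))
```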
We had to convert the string for word_range into an integer. In order to create a list of text files within the folder, we wrote a small for loop, going over all the files in the directory and appending them to our new list of journals.
journals = []
for file in os.scandir(path):
    journals.append(file.path)
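os.scandir() returns every entry in the folder, including subfolders and files that are not text files, such as csv files written by an earlier run. A sketch of a filtered version is given below, assuming the sources all end in .txt; list_text_files is a hypothetical name and the demo folder is created only for this example.

```python
import os
import tempfile


def list_text_files(path):
    """Return the paths of all .txt files in a folder, sorted (hypothetical helper)."""
    journals = []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.lower().endswith('.txt'):
                journals.append(entry.path)
    return sorted(journals)


# Demo folder with two text files and one file that should be skipped
demo_dir = tempfile.mkdtemp()
for name in ('b.txt', 'a.txt', 'notes.csv'):
    with open(os.path.join(demo_dir, name), 'w', encoding='UTF-8') as f:
        f.write('x')

journals = list_text_files(demo_dir)
print([os.path.basename(p) for p in journals])
```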
The function os.scandir() can only be used when the os library is imported at the start of the Python file: import os. Once we had added all the text files to our journals list, we needed to loop over them and perform the same actions as in our initial Python code. We made sure the indentation was adapted to the new loop and recreated the variable containing the file name of the csv file.
for journal in journals:
    infile = open(journal, 'rb')
    issue = infile.read()
    infile.close()
    # initial code was inserted here
    # Remove only a trailing '.txt'; str.strip('.txt') would strip any of those characters
    txtname = journal[:-4] if journal.endswith('.txt') else journal
    termname = termname.strip('# ')
    csvfile = '%s.%s.csv' % (txtname, termname)
    print(csvfile)
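Putting the pieces together, the batch step could look roughly like the sketch below. It is not the program as we wrote it: the function names are hypothetical, the files are assumed to be UTF-8 encoded text (the code above opens them in binary mode), the terminal highlighting is left out, and the demo folder and its content are invented.

```python
import csv
import os
import re
import tempfile


def concordance_rows(text, terms, num_words):
    """Return [position, left context, term, right context] rows."""
    tokens = re.findall(r'\w+|[^\w\s]+', text.lower())
    wanted = {term.lower() for term in terms}
    rows = []
    for i, word in enumerate(tokens):
        if word in wanted:
            rows.append([str(i),
                         ' '.join(tokens[max(i - num_words, 0):i]),
                         word,
                         ' '.join(tokens[i + 1:i + 1 + num_words])])
    return rows


def process_folder(path, terms, termname, num_words):
    """Write one csv per .txt file in path and return the csv paths."""
    written = []
    for entry in sorted(os.scandir(path), key=lambda e: e.name):
        if not entry.name.endswith('.txt'):
            continue  # skip anything that is not a text file
        with open(entry.path, encoding='UTF-8') as infile:  # assumed encoding
            issue = infile.read()
        stem, _ext = os.path.splitext(entry.path)
        csvfile = '%s.%s.csv' % (stem, termname.strip('# '))
        with open(csvfile, 'w', encoding='UTF-8', newline='') as f:
            csv.writer(f).writerows(concordance_rows(issue, terms, num_words))
        written.append(csvfile)
    return written


# Invented demo folder with a single short text file
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, 'issue1.txt'), 'w', encoding='UTF-8') as f:
    f.write('Un voyage en France.')

written = process_folder(demo_dir, ['France'], '# country', num_words=2)
print([os.path.basename(p) for p in written])
```

Because the csv name is derived from the full path of the text file, each csv file ends up in the same folder as its source.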
Conclusion
For Eva it was not always clear how to construct certain parts of the code or to understand their meaning, but with help from Sytze and Shoreh she gained more insight into how the code operated inside and outside of the Terminal workspace. As instructor Folgert Karsdorp said: “everything starts with being able to read/understand code”. Since Sytze had an introduction to Python before, the workshop demonstrated the differences between Python 2.7 and Python 3.6 to her. Furthermore, she now learned how to parse documents using existing libraries. Finally, the exercise with Eva’s data provided a nice challenge.