Nearly everyone uses search engines such as Google on a regular basis, and researchers at our institution are no different. Powerful as this tool can be, we should be mindful of what lies hidden behind the search bar and what information we receive through these search portals. Search engines are not impartial, nor do they search the entire web. Perhaps we need to think of other places to look for information that are less commercially oriented and support Open Access.
Over the past two weeks I followed a “presentation skills training” in which we had to choose a topic that was completely new to us. Because I was quite interested in a tool most of us use every day, I decided to take a closer look at the Anatomy and Workings of Search Engines. Since one of those search engines has become a common verb in most languages, let’s start from the Oxford English Dictionary’s definition of “to google”:
Pronunciation: Brit. /ˈɡuːɡl/, U.S. /ˈɡuɡ(ə)l/
1. intr. To use the Google search engine to find information on the Internet.
2. trans. To enter (a search term) into the Google search engine to find information on the Internet; to search for information about (a person or thing) in this way.
Because people use Google and other internet search engines so regularly, they tend to rely on four common assumptions, as identified by Bettina Fabos in 2006:
1. Search engines are impartial information tools.
2. Search engines search the entire web, gleaning the most relevant results.
3. Search engines vary greatly, thus offering competition and a [...]
4. Search engines are the only place to go for relevant information on the web.
Before I can contradict these four assumptions, it is important to understand the structure of the search engine industry.
Search Industry Structure
LESSON 1: Search Engines are not impartial and they are part of an industry.
Search Industry structure, based on the article by Bettina Fabos in 2006, using the Eleanor template from slidescarnival.com.
Directories are quite simply databases that contain information. In this case they contain web pages in indexed lists that feed the search engine providers. Search engine providers may use existing directories or build their own by crawling the web. A good starting point for understanding how search engine providers work is the famous 1998 paper by Sergey Brin and Lawrence Page, then PhD students at Stanford University and the founders of Google. You can picture crawlers as spiders that are sent out over the web to the locations handed to them by the URL server, gathering the pages they find into a web repository. After the web pages have been collected, the indexer converts each page (also called a document) into a list of word occurrences, or hits, and adds the necessary metadata, all of which is stored in barrels. The sorter then needs to invert this index, converting the list of words attached to each document into a list of documents for each word.
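To make the indexing and sorting steps a little more concrete, here is a minimal sketch in Python. It is not the implementation described by Brin and Page: the function names, the three toy documents and the plain dictionaries standing in for the barrels are all assumptions made for illustration. It only shows the idea of turning a list of words per document into a list of documents per word.

```python
from collections import defaultdict


def build_forward_index(documents):
    """Indexer step (simplified): document id -> list of (word, position) hits."""
    forward = {}
    for doc_id, text in documents.items():
        words = text.lower().split()
        forward[doc_id] = [(word, pos) for pos, word in enumerate(words)]
    return forward


def invert(forward):
    """Sorter step (simplified): word -> {document id: [positions]}."""
    inverted = defaultdict(dict)
    for doc_id, hits in forward.items():
        for word, pos in hits:
            inverted[word].setdefault(doc_id, []).append(pos)
    return dict(inverted)


# Hypothetical toy "web repository" of three documents.
documents = {
    "doc1": "katy perry releases new single",
    "doc2": "perry the platypus",
    "doc3": "interview with katy perry",
}

inverted_index = invert(build_forward_index(documents))
print(inverted_index["perry"])
```

Keeping the word positions alongside the document ids is what later allows a search engine to prefer documents in which the query words appear close to each other.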
So imagine looking for Katy Perry (a famous example in this video). Simply put, the search engine provider compares the list of documents containing the word ‘Katy’ with the list containing the word ‘Perry’ and returns only those documents containing both words, preferably close to each other. However, you don’t want just any document that contains Katy Perry; you need relevant web pages. This is where the ranking algorithm comes into play, in particular Google’s PageRank algorithm. The algorithm looks at the ‘popularity’ of a web page based on how many other (relevant) web pages refer to it. The PageRank algorithm has since been updated and modified to include individual and collective search history, as well as localised information. For a more comprehensive overview of current information retrieval, see the book by Croft, Metzler and Strohman from 2015.
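The two steps in this paragraph, finding documents that contain both words and then ranking them by popularity, can be sketched as follows. This is a simplified textbook version of PageRank (power iteration with a damping factor, normalised so that the scores sum to 1), not Google's current ranking. The tiny index, the link graph and the function names are hypothetical.

```python
def and_query(index, terms):
    """Return the documents that contain every query term."""
    postings = [set(index.get(term, ())) for term in terms]
    return set.intersection(*postings) if postings else set()


def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical inverted index: word -> documents containing it.
index = {"katy": {"doc1", "doc3"}, "perry": {"doc1", "doc2", "doc3"}}
# Hypothetical link graph: doc1 and doc2 link to doc3, doc3 links back to doc1.
links = {"doc1": ["doc3"], "doc2": ["doc3"], "doc3": ["doc1"]}

matches = and_query(index, ["katy", "perry"])
ranks = pagerank(links)
results = sorted(matches, key=ranks.get, reverse=True)
print(results)
```

In this toy example doc2 never appears in the results because it lacks the word ‘katy’, and doc3 is listed before doc1 because more pages link to it.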
Finally, the search engine portal is any website containing a search bar, which means even this page could be understood as a search engine portal. The importance of portals lies in their usability and their user friendliness.
LESSON 2: Search engines do not search the entire web, especially not the deep web. Some directories only include paid-for content.
Models of sponsorship, based on the article by Bettina Fabos in 2006, using the Eleanor template from slidescarnival.com
Google’s initial strategy was to sell its technology as a search engine provider, not only to other search engines but also to other websites that use Google’s technology to power search on their own pages. This strategy brings in some money, but not a continuous flow of cash. Furthermore, there are open source technologies available nowadays that allow much more customisation than Google search (e.g. Solr, Elasticsearch). Some other search engines chose to focus on marketing agencies buying space, and thus on including sponsored links. Within the search industry structure, it is clear that the best way to appear in the results of different search engines is to pay to have (commercial) content added to the repository. However, in order to measure the effectiveness of advertisements, search engines and marketers have agreed on a pay-by-performance strategy, which means advertisers only pay the search engine when someone actually clicks on a link. Other search engines went for more aggressive strategies such as paid inclusion, where the advertisement appears in every search, but they lost users because of it, and the practice has therefore largely disappeared.
LESSON 3: There is little competition in the marketplace, with mainly American companies dominating the search industry.
The search engine industry used to be dominated by three main companies: Google, Yahoo!, and Microsoft. Add to that the current trend towards voice search. As this infographic beautifully demonstrates, Amazon might not be the richest company, but it certainly has the first-mover advantage and is quickly connecting its Alexa voice assistant to cars and home devices, effectively taking over parts of our lives. Another, less well-known company is Baidu, which offers a better technology in DeepSpeech 2 and is funded by the government of China.
Search Engine Watch uncovered the challenges related to monetising voice search. Noble as Google’s goals may be in its promises of search engine integrity, what counts at the end of the day is money. The reason Amazon holds such a large segment of the market could be the connection to its online store and to apps and services such as music streaming.
LESSON 4: There are places to go for information other than search engines. Perhaps the library?
Independent institutions such as universities and public libraries should guarantee access to knowledge that does not necessarily have commercial value. These public institutions therefore need to counteract the commercial interests of search engines by focusing on Open Access. Scholars, on the other hand, also use this search technology in their daily research, and the visibility of online research projects largely depends on users finding the project via Google. The Digital Humanities field is one of the first and most important advocates of Open Access; it is this year’s main theme at the Alliance of Digital Humanities Organizations conference in Montréal. Another important, though political and legal, institution working on Open Access is the Open Access Infrastructure for Research in Europe, or OpenAIRE. Preferably, Open Access information is accessible through Google, but we also need to think about back-ups and other ways of disseminating information. One final question that is open to debate is how commercial search engines in turn profit from Open Access, as it allows them to access information and integrate it into their search index.
Bettina Fabos, “Search Engine Anatomy: The industry and its commercial structure,” in Libraries: Changing Information Space and Practices, ed. Cushla Kapitzke and Chip Bruce (Lawrence Erlbaum Associates, 2006), 229-249.
Sergey Brin and Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” see infolab.stanford.edu/~backrub/google.
W. Bruce Croft, Donald Metzler, and Trevor Strohman, Search Engines: Information Retrieval in Practice (Pearson Education, 2015).
I would like to thank Lars Wieneke for his suggestions in this regard.