Information retrieval – Searching of text using Elasticsearch

While working on a project, I needed to write code for a search engine. During my research I came across the term “Information retrieval”, which I found pretty interesting, and I wanted to share my understanding through this post.

Information retrieval is the process of obtaining, from a collection of resources, the information resources that are relevant to an information need. It is most visible in web search engines. E.g. if we have a blog with heaps of posts on movie reviews and somebody searches for “Comedy movies of Jim Carrey”, how would we search the collection of information resources that we have (the blog posts) to show the most relevant results?

You get the following keywords if you split the phrase on spaces:

Comedy

movies

Jim

Carrey

Note that the phrase is split on spaces to get the keywords. Now, for each keyword, you can search every post and record the number of occurrences of the term. Based on the frequency of occurrences, you can show the best match. Broadly, this is how a full-text search works. Many database engines, such as MySQL and PostgreSQL, provide full-text search. Each word of the articles is converted to lower case and indexed. The lower-cased keywords can then be matched against those indexed values.
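The counting idea above can be sketched in a few lines of Python. This is only a minimal illustration, not how MySQL or PostgreSQL implement full-text search; the sample posts and the helper names `tokenize` and `score` are made up for the example.

```python
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()

def score(query, post):
    """Add up how often each query keyword occurs in the post."""
    counts = Counter(tokenize(post))
    return sum(counts[term] for term in tokenize(query))

# Hypothetical blog posts, keyed by article id.
posts = {
    "A1": "Jim Carrey stars in one of the best comedy movies of the decade",
    "A2": "A documentary about deep sea fishing",
    "A3": "Comedy movies and drama movies compared",
}

query = "Comedy movies of Jim Carrey"

# Posts with the highest keyword frequency come first.
ranked = sorted(posts, key=lambda pid: score(query, posts[pid]), reverse=True)
print(ranked)
```

Notice that the word "of" contributes to the score of A1 even though it says nothing about the topic, which is one reason a raw frequency count is only a starting point.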

Articles: A1, A2, A3….

Mapping each indexed word to the articles that contain it is called inverted indexing. However, there are a lot of words with a very high frequency, such as “is”, “a” and “the”. We call these stop words and, depending on our search requirements, we can remove them while indexing to reduce the index size. This is pretty cool, as it’s different from the regular “exact match” and “like” queries.
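Here is a rough sketch of an inverted index with stop-word removal. The stop-word list and the sample articles are assumptions made for the example (real engines ship with much longer, language-specific lists), and `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict

# Illustrative stop words only, not a standard list.
STOP_WORDS = {"is", "a", "the", "of", "and", "in"}

def build_inverted_index(articles):
    """Map each lower-cased word to the set of article ids containing it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:  # skip stop words to keep the index small
                index[word].add(article_id)
    return dict(index)

# Hypothetical articles.
articles = {
    "A1": "The mask is a comedy",
    "A2": "A comedy of errors",
    "A3": "The history of cinema",
}

index = build_inverted_index(articles)
print(sorted(index["comedy"]))
```

Looking up a keyword is now a single dictionary access instead of a scan over every article.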

But what if the blog post had the term ‘movie’ instead of ‘movies’? Would the search engine still give relevant results? This is where stemming and lemmatisation come into play. Imagine that, instead of indexing the actual word, we index its root form, e.g. movies = movie. The root form may not be an actual English word, e.g. nurses = nurs, studying = studi. This helps us match nurses, nurse and nursing, which is much better than finding only the exact “nurse”. This process is called stemming. Lemmatisation, on the other hand, uses morphological analysis of words to link each word back to its lemma. E.g. the lemma of studying, studies and study is study, so the word that gets indexed should be study. It requires dictionaries that the algorithm can search through to find the lemma.
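To make the difference concrete, here is a deliberately naive sketch of both ideas. The suffix rules are my own simplification and not a real stemming algorithm such as Porter's, and the tiny `LEMMAS` dictionary stands in for the much larger dictionaries a real lemmatiser would need. With these rules, movie and movies both reduce to the non-word "movi".

```python
# Simplified suffix rules (an assumption for this sketch, not Porter's algorithm).
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("es", ""), ("s", ""), ("e", "")]

def naive_stem(word):
    """Strip one common suffix, then turn a trailing 'y' into 'i'."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least three characters of the word as the stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)] + replacement
            break
    if word.endswith("y") and len(word) > 3:
        word = word[:-1] + "i"
    return word

# Toy dictionary for lemmatisation; a real one covers the whole language.
LEMMAS = {"studying": "study", "studies": "study", "study": "study"}

def lemmatise(word):
    """Look the word up in the dictionary, falling back to the word itself."""
    return LEMMAS.get(word.lower(), word.lower())

for w in ["nurses", "nurse", "nursing", "studying", "studies", "study"]:
    print(w, naive_stem(w), lemmatise(w))
```

The stemmer only needs rules, so it is cheap but can produce non-words; the lemmatiser returns real words but is only as good as its dictionary.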

Now, what if somebody searches for a comedy movie using the keyword ‘funny’ or ‘humorous’? It’s quite possible to miss a match. Another example is searching for U.S.A, USA or United States of America; each of them should ideally give the same relevant results. To tackle this situation we could use synonyms, i.e. index the words comedy, humour and funny as comedy (or as any one of those). We just need to make sure that the search term also goes through the same translation.
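A minimal sketch of the synonym idea follows, assuming a hand-written mapping (`SYNONYMS` and `normalize` are names I made up). It only handles single words; a multi-word phrase such as "United States of America" would need phrase-aware matching, which is left out here.

```python
# Hypothetical synonym map: every variant points at one canonical term.
SYNONYMS = {
    "funny": "comedy",
    "humorous": "comedy",
    "humour": "comedy",
    "u.s.a": "usa",
}

def normalize(text):
    """Lower-case, split on whitespace and replace synonyms by their canonical term."""
    return [SYNONYMS.get(token, token) for token in text.lower().split()]

# The same translation is applied at index time and at search time.
indexed_terms = normalize("A funny movie from the U.S.A")
search_terms = normalize("humorous movie USA")

print(indexed_terms)
print(search_terms)
print(set(search_terms) <= set(indexed_terms))
```

Because both sides go through `normalize`, a search for "humorous" finds a post that only ever says "funny".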

In some cases, a combination of terms, or a phrase, is more important than the individual terms. E.g. ‘Movie of 2018’ could be of more significance than its individual terms, and searching for the phrase could yield more relevant results. For this scenario, we could select the minimum and maximum number of terms to combine and index all of the resulting combinations. Wikipedia says:

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. When the items are words, n-grams may also be called shingles. For sequences of words, the trigrams (shingles) that can be generated from “the dog smelled like a skunk” are “# the dog”, “the dog smelled”, “dog smelled like”, “smelled like a”, “like a skunk” and “a skunk #”
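Word-level n-grams with a minimum and maximum size can be generated with a short function like the one below. This is just a sketch: `shingles` is a hypothetical helper, and unlike the Wikipedia example it does not add the "#" padding at the start and end of the sentence.

```python
def shingles(text, min_size=2, max_size=3):
    """Return every run of min_size..max_size consecutive words in the text."""
    words = text.lower().split()
    result = []
    for size in range(min_size, max_size + 1):
        # Slide a window of `size` words across the text.
        for start in range(len(words) - size + 1):
            result.append(" ".join(words[start:start + size]))
    return result

print(shingles("the dog smelled like a skunk", 3, 3))
print(shingles("Movie of 2018"))
```

Indexing these shingles alongside the single terms lets a phrase query match the exact phrase as one entry, at the cost of a larger index.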

Elasticsearch helps us make such searches without having to worry much about writing these filters ourselves. It has out-of-the-box “character filters” and “token filters” which we can use to create our own analysers. An analyser in Elasticsearch is the component that takes in the main text and converts it into tokens for creating the index.
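As a rough idea of what that looks like, the Python dictionary below builds the kind of settings body you could send when creating an index. The names `blog_analyzer` and `movie_synonyms` are made up for this example, while `html_strip`, `standard`, `lowercase`, `stop`, `synonym` and `porter_stem` are built-in Elasticsearch components. I have not tested this against a particular Elasticsearch version, so treat it as a sketch and check the documentation for your release.

```python
import json

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Hypothetical custom token filter built on the "synonym" type.
                "movie_synonyms": {
                    "type": "synonym",
                    "synonyms": ["funny, humorous, humour => comedy"],
                }
            },
            "analyzer": {
                # Hypothetical analyser name; the parts are applied in this order.
                "blog_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],  # character filter
                    "tokenizer": "standard",        # splits text into tokens
                    "filter": [                     # token filters
                        "lowercase",
                        "stop",
                        "movie_synonyms",
                        "porter_stem",
                    ],
                }
            },
        }
    }
}

# The JSON form is what would go in the body of the create-index request.
print(json.dumps(index_settings, indent=2))
```

The order of the token filters matters: here the text is lower-cased first, so that the stop-word, synonym and stemming steps all see the same normalised tokens.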
