Building a Simple Full Text Search

Full Text Search

Full Text Search refers to the technique of searching text across stored documents and returning the matched results in order of a search score. When a user searches for a word or a set of words across documents, the search engine has to fetch all the documents that contain one or more occurrences of those search terms and return them after scoring them against some criteria. One such scoring technique is TF-IDF.

TF-IDF

TF-IDF is short for term frequency–inverse document frequency. In this method, a result document is scored using two quantities: term frequency and inverse document frequency. Term frequency is the number of occurrences of the search term in the document. Inverse document frequency is the inverse of the number of documents that contain that search term (usually taken on a log scale), so rarer terms carry more weight.

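Concretely, for a single search term and a single document, the score we compute later in this post boils down to something like the following sketch (tf_idf is just an illustrative helper name; the exact weighting scheme used here is one common variant):

import math


def tf_idf(term_count_in_doc, num_docs, docs_with_term):
    # term frequency: how many times the term occurs in this document
    tf = term_count_in_doc
    # inverse document frequency: rarer terms get a higher weight
    idf = math.log(num_docs / docs_with_term)
    return tf * idf
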
Let's try to create such an engine using Python. For this we can use the BBC news corpus.

To search across documents, we first need to build an inverted index from those documents. An inverted index can be thought of as a hash map with words as keys and, as values, the list of documents that contain those words.

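For example, after indexing two hypothetical documents the index could look like this (the document ids and words below are made up):

index = {
    "election": ["doc1", "doc2"],
    "minister": ["doc1"],
}
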
Building an inverted index in Python

The corpus consists of a set of directories (one per news category), and inside each directory we have text files. For this example, we can use the text files from the politics category. We will break the text in those files into individual tokens and, instead of indexing the raw words, index the lower-cased lemma of each token. For this we use spaCy, an NLP library for Python.

from pathlib import Path
import spacy


nlp = spacy.load("en_core_web_sm")
index = {}
documents = {}


def index_file(file_):
    # read the file, store its raw text and index its words
    with open(file_, "r") as f:
        data = f.read()
    file_id = Path(file_).stem
    documents[file_id] = data
    words = parse_text(data)
    index_words(file_id, words)


def parse_text(data):
    # tokenise with spaCy and keep the lower-cased lemma of each token
    parsed_text = nlp(data)
    words = []
    for word in parsed_text:
        words.append(word.lemma_.lower())
    return words


def index_words(file_id, words):
    # index maps each word to the documents containing it (one entry per occurrence)
    for word in words:
        try:
            index[word].append(file_id)
        except KeyError:
            index[word] = [file_id]

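To build the index, we can call index_file on every text file in the chosen category. The path below is an assumption about where the corpus is extracted:

for file_ in Path("bbc/politics").glob("*.txt"):
    index_file(file_)
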
Now, to search, all we have to do is get the search terms from the user and use the inverted index to find the documents that contain them.

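Before adding scoring, a naive lookup could simply collect every document listed under each search term (naive_search is just an illustrative name):

def naive_search(words):
    file_ids = set()
    for word in words:
        file_ids.update(index.get(word, []))
    return [documents[file_id] for file_id in file_ids]
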
Let's slightly modify the index_words function so that, for each word, we keep a reference to both the document and the number of occurrences. With this, the term frequency required for TF-IDF can be read directly from the index. For the inverse document frequency, we can use the inverse of the number of entries in that word's dictionary, which is the number of documents containing the word.

def index_words(file_id, words):
    # index now maps each word to {file_id: occurrence count}
    for word in words:
        try:
            current_count = index[word][file_id]
        except KeyError:
            current_count = 0
        try:
            index[word][file_id] = current_count + 1
        except KeyError:
            index[word] = {file_id: 1}

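With this change, each index entry maps a word to a dictionary of document ids and occurrence counts, something like (values are made up):

index = {
    "election": {"doc1": 3, "doc2": 1},
}
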
Now let's build a simple function which fetches the documents for the given search terms, calculates the TF-IDF score for each, and returns them in descending order of score (highest score first).

import math


def search(words):
    docs = []
    num_of_docs = len(documents)
    for word in words:
        docs_ = index.get(word, {})
        docs_with_word = len(docs_)
        for file_id, count in docs_.items():
            # tf-idf: occurrence count * log(total docs / docs containing the word)
            score = count * math.log(num_of_docs / docs_with_word)
            doc_ = documents[file_id]
            docs.append((score, doc_))
    # highest score first
    docs.sort(key=lambda x: x[0], reverse=True)
    return docs

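Putting it together, a query could be run like this. The query string is illustrative, and note that the query terms go through the same parse_text normalisation as the indexed text:

results = search(parse_text("election campaign"))
for score, doc in results[:3]:
    print(round(score, 2), doc[:80])
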
We can enhance this further by storing both the documents and the inverted index in Redis and doing the search using Redis data structures. The hash data structure is a good fit for this purpose.

Let's create functions to add and delete documents, which will store the documents and create / update the inverted index.

from pathlib import Path
import math
import spacy
import redis


nlp = spacy.load("en_core_web_sm")
redis_client = redis.Redis(host="172.20.0.2", port=6379)

# token filters: skip stop words, punctuation and whitespace
# (assumed filtering criteria)
filters = [
    lambda word: word.is_stop,
    lambda word: word.is_punct,
    lambda word: word.is_space,
]


def parse_text(data):
    # keep the lower-cased lemma of every token that passes the filters
    parsed_text = nlp(data)
    words = []
    for word in parsed_text:
        if not any(filter_(word) for filter_ in filters):
            words.append(word.lemma_.lower())
    return words


def index_words(index_name, file_id, data, words):
    # store the raw document in a hash keyed by the index name
    redis_client.hset(f"__{index_name}", file_id, data)
    # one hash per word: field = document id, value = occurrence count
    for word in words:
        hash_key = f"{index_name}::{word}"
        redis_client.hincrby(hash_key, file_id, 1)


def index_file(index_name, file_):
    with open(file_, "r") as f:
        data = f.read()
    file_id = Path(file_).stem
    words = parse_text(data)
    index_words(index_name, file_id, data, words)


def remove_file(index_name, file_):
    # look up the stored document, remove its words from the index,
    # then delete the stored document itself
    file_id = Path(file_).stem
    doc = redis_client.hget(f"__{index_name}", file_id)
    doc = doc.decode()
    words = parse_text(doc)
    for word in words:
        hash_key = f"{index_name}::{word}"
        redis_client.hdel(hash_key, file_id)
    redis_client.hdel(f"__{index_name}", file_id)


def update_file(index_name, file_):
    # re-index a changed file by removing the old entries and indexing again
    remove_file(index_name, file_)
    index_file(index_name, file_)

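Assuming the same corpus layout as before, indexing the politics files into Redis could look like this (the index name and path are assumptions):

for file_ in Path("bbc/politics").glob("*.txt"):
    index_file("politics", file_)
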
Now we can modify the search function defined earlier to look up the documents stored in Redis; all we have to do is perform the same computation using Redis lookups.

def search(index_name, words):
    docs = []
    num_of_docs = redis_client.hlen(f"__{index_name}")
    words = [f"{index_name}::{word}" for word in words]
    for word in words:
        docs_ = redis_client.hgetall(word)
        docs_with_word = redis_client.hlen(word)
        for doc_, count_ in docs_.items():
            file_id = doc_.decode()
            count = int(count_.decode())
            # tf-idf: occurrence count * log(total docs / docs containing the word)
            score = count * math.log(num_of_docs / docs_with_word)
            doc_ = (redis_client.hget(f"__{index_name}", file_id).decode(), score)
            docs.append(doc_)
    # highest score first
    docs.sort(key=lambda x: x[1], reverse=True)
    return docs

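A search against the Redis-backed index can then be run the same way as before (the query is again illustrative):

results = search("politics", parse_text("election campaign"))
for doc, score in results[:3]:
    print(round(score, 2), doc[:80])
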
The above implementation is a simple one, but it can be a helpful starting point for building such a search engine. This can be further improved by