Search an inverted positional index and return ranked references to documents relevant to the search phrase.



The components of this library:

  • parse a free-text phrase to a query;
  • search the dictionary and postings of a text index for the query terms;
  • perform iterative scoring and ranking of the returned dictionary entries and postings; and
  • return ranked references to documents relevant to the search phrase.
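
The four steps above can be sketched in a few lines. This is only an illustrative Python sketch, not this library's API: `POSTINGS`, `DOC_COUNT`, `parse_phrase` and `search` are hypothetical names, the postings are hand-written sample data, and a plain tf-idf sum stands in for the library's iterative scoring, which is not detailed here.

```python
import math

# Hypothetical sample postings: term -> {docId: [positions of the term]}.
POSTINGS = {
    "quick": {"doc1": [1], "doc2": [5]},
    "fox": {"doc1": [3]},
    "dog": {"doc2": [2]},
}
DOC_COUNT = 2  # number of documents in the (assumed) corpus


def parse_phrase(phrase):
    """Parse a free-text phrase to a list of query terms (lower-case, split on whitespace)."""
    return phrase.lower().split()


def search(phrase):
    """Return docIds ranked by a simple tf-idf score (an assumed scoring scheme)."""
    scores = {}
    for term in parse_phrase(phrase):
        docs = POSTINGS.get(term, {})
        if not docs:
            continue  # the term is not in the vocabulary
        # Rarer terms get a higher weight.
        idf = math.log(1 + DOC_COUNT / len(docs))
        for doc_id, positions in docs.items():
            # Term frequency is the number of recorded positions in the document.
            scores[doc_id] = scores.get(doc_id, 0.0) + len(positions) * idf
    # Highest score first; ties are broken by docId for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


print(search("quick fox"))
```

With this sample data, "doc1" ranks ahead of "doc2" for the phrase "quick fox" because it contains both query terms, and "fox" is the rarer of the two.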

Free text search overview


class FreeTextQuery

class QueryParser


The following definitions are used throughout the documentation:

  • corpus – the collection of documents for which an index is maintained.
  • dictionary – a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
  • document – a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
  • index – an inverted index used to look up document references from the corpus against a vocabulary of terms. The implementation in this package builds and maintains a positional inverted index, which also includes the positions of the indexed term in each document.
  • postings – a separate index that records the documents in which each term of the vocabulary occurs. In this implementation we also record the positions of each term in the text to create a positional inverted index.
  • postings list – a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
  • term – a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
  • text – the indexable content of a document.
  • token – representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) in the text or frequency of occurrence.
  • tokenizer – a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and/or lemmatizer.
  • vocabulary – the collection of terms indexed from the corpus.
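
To make the relationship between these definitions concrete, the following Python sketch builds a dictionary and positional postings from a two-document corpus. It is an illustration under stated assumptions rather than this package's implementation: `tokenize` and `build_index` are hypothetical names, the tokenizer only lower-cases and splits on non-alphanumeric characters (no stemmer or lemmatizer), and the dictionary is assumed to store the number of documents that contain each term.

```python
import re


def tokenize(text):
    """Hypothetical tokenizer: return (position, term) pairs for the text."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return list(enumerate(terms))


def build_index(corpus):
    """Build (dictionary, postings) from a corpus given as {docId: text}."""
    postings = {}  # term -> {docId: [positions of the term in the text]}
    for doc_id, text in corpus.items():
        for position, term in tokenize(text):
            postings.setdefault(term, {}).setdefault(doc_id, []).append(position)
    # Assumption: dictionary frequency = number of documents containing the term.
    dictionary = {term: len(docs) for term, docs in postings.items()}
    return dictionary, postings


corpus = {
    "doc1": "The quick brown fox",
    "doc2": "The lazy dog and the quick cat",
}
dictionary, postings = build_index(corpus)

print(postings["the"])      # postings lists for "the", keyed by docId
print(dictionary["quick"])  # number of documents containing "quick"
```

Here the vocabulary is the set of keys of `dictionary`, and each inner list in `postings` is a postings list: the positions of one term in one document.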



