free_text_search

Search an inverted positional index and return ranked references to documents relevant to the search phrase.

THIS PACKAGE IS IN BETA DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.

Objective

The components of this library:

  • parse a free-text phrase to a query;
  • search the dictionary and postings of a text index for the query terms;
  • perform iterative scoring and ranking of the returned dictionary entries and postings; and
  • return ranked references to documents relevant to the search phrase.
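
The four steps above can be sketched in a few lines. This is an illustration only and does not use this package's API: the `POSTINGS` data, the `parse_query` and `search` helpers, and the tf-idf style weighting are all assumptions chosen for brevity, and the package's own scoring may differ.

```python
import math
import re

# Hypothetical postings for a three-document corpus:
# term -> {docId: [positions of the term in that document]}
POSTINGS = {
    "quick": {"doc1": [1], "doc2": [5]},
    "fox": {"doc1": [3]},
    "dog": {"doc2": [2], "doc3": [0, 4]},
}
CORPUS_SIZE = 3  # assumed number of documents in the corpus


def parse_query(phrase):
    """Parse a free-text phrase into lowercase query terms."""
    return re.findall(r"[a-z0-9]+", phrase.lower())


def search(phrase):
    """Return docIds ranked by a simple tf-idf style score (an assumption)."""
    scores = {}
    for term in parse_query(phrase):
        docs = POSTINGS.get(term, {})
        if not docs:
            continue  # term is not in the vocabulary
        # Rarer terms contribute more to the score.
        idf = math.log(1 + CORPUS_SIZE / len(docs))
        for doc_id, positions in docs.items():
            # Term frequency is the number of recorded positions.
            scores[doc_id] = scores.get(doc_id, 0.0) + len(positions) * idf
    # Highest score first; ties are broken by docId for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


print(search("quick fox"))  # ['doc1', 'doc2']
```

Here `doc1` ranks first because it matches both query terms, while `doc2` matches only "quick".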

Free text search overview

API

class FreeTextQuery

class QueryParser

Usage

TODO: describe usage.

Definitions

The following definitions are used throughout the documentation:

  • corpus – the collection of documents for which an index is maintained.
  • dictionary – a hash of terms (vocabulary) to their frequency of occurrence in the corpus documents.
  • document – a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more indexed text fields.
  • index – an inverted index used to look up document references from the corpus against a vocabulary of terms. The implementation in this package builds and maintains a positional inverted index, which also includes the positions of each indexed term in each document.
  • postings – a separate index that records which documents each term in the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
  • postings list – a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
  • term – a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
  • text – the indexable content of a document.
  • token – a representation of a term in a text source, returned by a tokenizer. The token may include information about the term such as its position(s) in the text or frequency of occurrence.
  • tokenizer – a function that returns a collection of tokens from text after applying a character filter, term filter, stemmer and/or lemmatizer.
  • vocabulary – the collection of terms indexed from the corpus.
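
To make the definitions concrete, here is a minimal sketch of a dictionary and positional postings built from a two-document corpus. It is not this package's implementation: `tokenize` and `build_index` are hypothetical helpers, the tokenizer only lowercases and splits on non-alphanumeric characters (no stemming or lemmatizing), and the dictionary is assumed to hold document frequency, that is, the number of documents containing each term.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Hypothetical tokenizer returning (term, position) pairs.

    The position is the index of the term in the list of all terms
    in the text.
    """
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [(term, position) for position, term in enumerate(terms)]


def build_index(corpus):
    """Build (dictionary, postings) from a mapping of docId -> text."""
    postings = defaultdict(dict)  # term -> {docId: [positions]}
    for doc_id, text in corpus.items():
        for term, position in tokenize(text):
            postings[term].setdefault(doc_id, []).append(position)
    # Assumption: the dictionary maps each term to its document frequency.
    dictionary = {term: len(docs) for term, docs in postings.items()}
    return dictionary, dict(postings)


corpus = {
    "doc1": "The quick brown fox",
    "doc2": "The lazy dog and the quick cat",
}
dictionary, postings = build_index(corpus)

print(dictionary["quick"])  # 2
print(postings["the"])      # {'doc1': [0], 'doc2': [0, 4]}
```

The vocabulary is the set of keys in `dictionary`, and each inner list such as `postings["the"]["doc2"]` is a postings list in the sense defined above.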

References

Issues

If you find a bug, please file an issue.

This project is a supporting package for a revenue-generating project that has first call on our resources, so please be patient if we don't respond immediately to issues or pull requests.
