Search an inverted positional index and return ranked references to documents relevant to the search phrase.
THIS PACKAGE IS IN BETA DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.
The components of this library:
- parse a free-text phrase to a query;
- search the postings of a text index for the query;
- perform iterative scoring and ranking of the returned dictionary entries and postings; and
- return ranked references to documents relevant to the search phrase.
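The steps above can be illustrated with a minimal Python sketch. It is not this package's API: the `POSTINGS` layout, the `parse` and `search` functions, and the tf-idf scoring are all assumptions chosen for illustration, and the package's own iterative scoring may work differently.

```python
import math
from collections import defaultdict

# Hypothetical postings: term -> {docId: [positions of the term in the text]}.
POSTINGS = {
    "inverted": {"doc1": [0], "doc2": [3]},
    "index": {"doc1": [1, 5], "doc2": [4], "doc3": [2]},
    "search": {"doc3": [0]},
}
CORPUS_SIZE = 3  # number of documents in the illustrative corpus


def parse(phrase):
    """Parse a free-text phrase to a query: a list of lower-cased terms."""
    return phrase.lower().split()


def search(phrase, postings=POSTINGS, corpus_size=CORPUS_SIZE):
    """Return docIds ranked by a simple tf-idf score (an assumed scheme)."""
    scores = defaultdict(float)
    for term in parse(phrase):
        entry = postings.get(term, {})
        if not entry:
            continue  # the term is not in the vocabulary
        # Terms that occur in fewer documents receive a higher weight.
        idf = math.log(1 + corpus_size / len(entry))
        for doc_id, positions in entry.items():
            scores[doc_id] += len(positions) * idf
    # Highest score first; ties are broken by docId for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


print(search("Inverted index"))  # ['doc1', 'doc2', 'doc3']
```

Here `doc1` ranks first because it contains both query terms and contains "index" twice. A phrase with no terms in the vocabulary returns an empty list.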
TODO: describe usage.
The following definitions are used throughout the documentation:
- corpus – the collection of documents for which an index is maintained.
- dictionary – a hash of terms (vocabulary) to the frequency of occurrence in the corpus.
- document – a record in the corpus that has a unique identifier (docId) in the corpus’s primary key and that contains one or more text fields that are indexed.
- index – an inverted index used to look up document references from the terms. The implementation in this package builds and maintains a positional inverted index, which also includes the positions of the indexed terms in each document.
- postings – a separate index that records which documents the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
- postings list – a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
- term – a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- text – the indexable content of a document.
- token – representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) in the text or frequency of occurrence.
- tokenizer – a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and / or lemmatizer.
- vocabulary – the collection of terms indexed from the corpus.
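To make these definitions concrete, the following Python sketch builds a dictionary and positional postings from a tiny corpus. The names `tokenize` and `build_index`, the regular-expression tokenizer, and the choice to count every occurrence of a term in the dictionary are assumptions for illustration, not this package's API. A real tokenizer would typically also apply a term filter and a stemmer or lemmatizer.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Return (term, position) tokens for a text.

    The position is the index of the term in the list of all terms in the
    text. Lower-casing and dropping punctuation stand in for a character
    filter.
    """
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [(term, position) for position, term in enumerate(terms)]


def build_index(corpus):
    """Build (dictionary, postings) from a corpus mapping docId -> text."""
    dictionary = defaultdict(int)  # term -> occurrences in the corpus (assumed)
    postings = defaultdict(dict)   # term -> {docId: postings list of positions}
    for doc_id, text in corpus.items():
        for term, position in tokenize(text):
            dictionary[term] += 1
            postings[term].setdefault(doc_id, []).append(position)
    return dict(dictionary), dict(postings)


corpus = {
    "doc1": "The quick brown fox",
    "doc2": "The lazy dog and the quick cat",
}
dictionary, postings = build_index(corpus)

print(dictionary["the"])   # 3
print(postings["quick"])   # {'doc1': [1], 'doc2': [5]}
print(sorted(dictionary))  # the vocabulary, in alphabetical order
```

In this sketch the keys of `dictionary` form the vocabulary, and each inner list in `postings` is a postings list for one term in one document.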
- Manning, Raghavan and Schütze, “Introduction to Information Retrieval”, Cambridge University Press, 2008
- University of Cambridge, “Information Retrieval”, course notes, Dr Ronan Cummins, 2016
- Wikipedia (1), “Inverted Index”, from Wikipedia, the free encyclopedia
- Wikipedia (2), “Lemmatisation”, from Wikipedia, the free encyclopedia
- Wikipedia (3), “Stemming”, from Wikipedia, the free encyclopedia
If you find a bug please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don’t respond immediately to issues or pull requests.