free_text_search
Search an inverted positional index and return ranked references to documents relevant to the search phrase.
THIS PACKAGE IS IN BETA DEVELOPMENT AND SUBJECT TO DAILY BREAKING CHANGES.
Objective
The components of this library:
- parse a free-text phrase to a query;
- search the dictionary and postings of a text index for the query terms;
- perform iterative scoring and ranking of the returned dictionary entries and postings; and
- return ranked references to documents relevant to the search phrase.
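The four steps above can be illustrated with a minimal sketch. It is not this package's API: `POSTINGS`, `DOC_COUNT`, `parse_query` and `search` are hypothetical names, the index contents are made up for illustration, and the tf-idf style weighting is an assumed stand-in for whatever scoring the library actually performs.

```python
import math

# Hypothetical postings for a two-document corpus: term -> {docId: [positions]}.
POSTINGS = {
    "quick": {"doc1": [1], "doc2": [5]},
    "fox": {"doc1": [3]},
    "lazy": {"doc2": [1]},
}
DOC_COUNT = 2  # number of documents in the assumed corpus


def parse_query(phrase):
    """Step 1: parse a free-text phrase to a list of query terms."""
    return phrase.lower().split()


def search(phrase):
    """Steps 2 to 4: look up postings, score, and return ranked docIds."""
    scores = {}
    for term in parse_query(phrase):
        docs = POSTINGS.get(term, {})
        if not docs:
            continue  # term is not in the vocabulary
        # Assumed weighting: rarer terms contribute more to the score.
        idf = math.log(1 + DOC_COUNT / len(docs))
        for doc_id, positions in docs.items():
            term_frequency = len(positions)
            scores[doc_id] = scores.get(doc_id, 0.0) + term_frequency * idf
    # Highest score first; ties are broken by docId for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


print(search("quick fox"))
```

In this sketch both documents match "quick", but only doc1 also matches "fox", so doc1 ranks first.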
API
class FreeTextQuery
class QueryParser
Usage
TODO: describe usage.
Definitions
The following definitions are used throughout the documentation:
- corpus – the collection of documents for which an index is maintained.
- dictionary – a hash of terms (vocabulary) to the frequency of occurrence in the corpus documents.
- document – a record in the corpus that has a unique identifier (docId) in the corpus's primary key and that contains one or more text fields that are indexed.
- index – an inverted index used to look up document references from the corpus against a vocabulary of terms. The implementation in this package builds and maintains a positional inverted index, which also includes the positions of the indexed term in each document.
- postings – a separate index that records which documents the vocabulary occurs in. In this implementation we also record the positions of each term in the text to create a positional inverted index.
- postings list – a record of the positions of a term in a document. A position of a term refers to the index of the term in an array that contains all the terms in the text.
- term – a word or phrase that is indexed from the corpus. The term may differ from the actual word used in the corpus depending on the tokenizer used.
- text – the indexable content of a document.
- token – representation of a term in a text source returned by a tokenizer. The token may include information about the term such as its position(s) in the text or frequency of occurrence.
- tokenizer – a function that returns a collection of tokens from text, after applying a character filter, term filter, stemmer and / or lemmatizer.
- vocabulary – the collection of terms indexed from the corpus.
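To make the relationship between these terms concrete, here is a minimal sketch that builds a dictionary and positional postings from a tiny corpus. All names (`tokenize`, `build_index`) and the sample documents are hypothetical, the tokenizer applies only lower-casing and punctuation stripping (no stemmer or lemmatizer), and the dictionary is assumed to store the number of documents each term occurs in; the package's own data structures may differ.

```python
def tokenize(text):
    """Hypothetical tokenizer: lower-case, strip punctuation, split on whitespace."""
    terms = []
    for raw in text.lower().split():
        term = raw.strip(".,;:!?\"'()")
        if term:
            terms.append(term)
    return terms


def build_index(corpus):
    """Build (dictionary, postings) from a corpus of docId -> text."""
    postings = {}  # term -> {docId: [positions]}
    for doc_id, text in corpus.items():
        for position, term in enumerate(tokenize(text)):
            # The inner list is the postings list for (term, document).
            postings.setdefault(term, {}).setdefault(doc_id, []).append(position)
    # Assumption: frequency means the number of documents containing the term.
    dictionary = {term: len(docs) for term, docs in postings.items()}
    return dictionary, postings


corpus = {
    "doc1": "The quick brown fox.",
    "doc2": "The lazy dog and the quick cat.",
}
dictionary, postings = build_index(corpus)
print(postings["the"])
print(dictionary["fox"])
```

Here the vocabulary is the set of keys of `dictionary`, and each position is the index of the term in the array of terms returned by the tokenizer for that document's text.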
References
- Manning, Raghavan and Schütze, “Introduction to Information Retrieval”, Cambridge University Press, 2008
- University of Cambridge, “Information Retrieval”, course notes, Dr Ronan Cummins, 2016
- Wikipedia (1), “Inverted Index”, from Wikipedia, the free encyclopedia
- Wikipedia (2), “Lemmatisation”, from Wikipedia, the free encyclopedia
- Wikipedia (3), “Stemming”, from Wikipedia, the free encyclopedia
Issues
If you find a bug, please file an issue.
This project is a supporting package for a revenue project that has priority call on resources, so please be patient if we don’t respond immediately to issues or pull requests.
