Class TermPruningPolicy
- java.lang.Object
  - org.apache.lucene.index.pruning.PruningPolicy
    - org.apache.lucene.index.pruning.TermPruningPolicy
Direct Known Subclasses:
CarmelTopKTermPruningPolicy, CarmelUniformTermPruningPolicy, RIDFTermPruningPolicy, TFTermPruningPolicy
public abstract class TermPruningPolicy extends PruningPolicy
Policy for producing a smaller index out of an input index, by examining its terms and removing from the index some or all of their data, as follows:
- all terms of a certain field - see pruneAllFieldPostings(String)
- all data of a certain term - see pruneTermEnum(TermEnum)
- all positions of a certain term in a certain document - see pruneAllPositions(TermPositions, Term)
- some positions of a certain term in a certain document - see pruneSomePositions(int, int[], Term)
For many types of queries, the pruned, smaller index returns top-N results nearly identical to those of the original index, with better search performance.
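One rough way to check the "nearly identical top-N" claim for a given query is to compare the document ids returned by the original and the pruned index. The sketch below is not part of the pruning package: the class and method names are hypothetical, and it assumes that document ids are in sync between the two indexes.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not part of Lucene: compares two top-N result lists. */
class TopNOverlapSketch {
    /**
     * Returns the fraction of the original top-N document ids that also
     * appear in the pruned top-N. An empty original list counts as 1.0.
     */
    static double overlap(int[] originalTopN, int[] prunedTopN) {
        if (originalTopN.length == 0) {
            return 1.0;
        }
        Set<Integer> pruned = new HashSet<Integer>();
        for (int doc : prunedTopN) {
            pruned.add(doc);
        }
        int shared = 0;
        for (int doc : originalTopN) {
            if (pruned.contains(doc)) {
                shared++;
            }
        }
        return (double) shared / originalTopN.length;
    }
}
```

Averaging this value over a representative query set gives a simple indication of how much a given policy changes the results.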
Pruning is handy for producing small first-tier indexes that fit completely in RAM. Such indexes can be stored using IndexWriter.addIndexes(IndexReader...).
Interestingly, if the input index is optimized (i.e. doesn't contain deletions), then the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve internal document ids, so that they stay in sync with the original index. This means that all other auxiliary information not necessary for first-tier processing, such as some stored fields, can also be removed and then quickly retrieved on demand from the original index using the same internal document id. See StorePruningPolicy for information about removing stored fields.
Please note that while this family of policies produces good results for term queries, it often leads to poor results for phrase queries (because postings are removed without considering whether they belong to an important phrase).
More aggressive pruning policies produce smaller indexes: search performance increases, but recall decreases (i.e. search quality deteriorates).
See the following papers for a discussion of this problem and the proposed solutions to improve the quality of a pruned index (not implemented here):
- Pruned query evaluation using pre-computed impacts, V. Anh et al, ACM SIGIR 2006
- A document-centric approach to static index pruning in text retrieval systems, S. Buettcher et al, ACM SIGIR 2006
- Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee, A. Ntoulas et al, ACM SIGIR 2007.
Field Summary
| Modifier and Type | Field | Description |
|---|---|---|
| protected Map<String,Integer> | fieldFlags | Pruning operations to be conducted on fields. |
| protected IndexReader | in | |
Fields inherited from class org.apache.lucene.index.pruning.PruningPolicy: DEL_ALL, DEL_PAYLOADS, DEL_POSTINGS, DEL_STORED, DEL_VECTOR
Constructor Summary
| Modifier | Constructor | Description |
|---|---|---|
| protected | TermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags) | Construct a policy. |
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| abstract void | initPositionsTerm(TermPositions in, Term t) | Called when moving TermPositions to a new Term. |
| boolean | pruneAllFieldPostings(String field) | Pruning of all postings for a field. |
| abstract boolean | pruneAllPositions(TermPositions termPositions, Term t) | Prune all postings per term (invoked once per term per doc). |
| boolean | prunePayload(TermPositions in, Term curTerm) | Called when checking for the presence of a payload for the current term at the current position. |
| abstract int | pruneSomePositions(int docNum, int[] positions, Term curTerm) | Prune some postings per term (invoked once per term per doc). |
| abstract boolean | pruneTermEnum(TermEnum te) | Pruning of all postings for a term (invoked once per term). |
| abstract int | pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, TermFreqVector v) | Pruning of individual terms in term vectors. |
| boolean | pruneWholeTermVector(int docNumber, String field) | Term vector pruning. |
Field Detail
in
protected IndexReader in
Constructor Detail
TermPruningPolicy
protected TermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags)
Construct a policy.
Parameters:
in - input reader
fieldFlags - a map, where keys are field names and values are bitwise-OR flags of operations to be performed (see PruningPolicy for more details).
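As an illustration of the fieldFlags parameter, the following sketch builds such a map by OR-ing flags together. It is self-contained and does not use Lucene: the numeric flag values are placeholders (the real constants are defined in PruningPolicy and may differ), and the class name and field names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of building a fieldFlags map; not the Lucene API. */
class FieldFlagsSketch {
    // Placeholder bit values for illustration only; the real constants
    // are defined in PruningPolicy and may have different values.
    static final int DEL_POSTINGS = 0x01;
    static final int DEL_STORED = 0x02;
    static final int DEL_VECTOR = 0x04;
    static final int DEL_PAYLOADS = 0x08;

    /** Builds a map from field name to OR-ed flags; field names are made up. */
    static Map<String, Integer> buildFieldFlags() {
        Map<String, Integer> fieldFlags = new HashMap<String, Integer>();
        // Request removal of both term vectors and payloads for "body".
        fieldFlags.put("body", DEL_VECTOR | DEL_PAYLOADS);
        // Request removal of all postings for "debug".
        fieldFlags.put("debug", DEL_POSTINGS);
        return fieldFlags;
    }
}
```

Fields that are absent from the map carry no flags, so in this sketch nothing is requested for them.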
Method Detail
pruneWholeTermVector
public boolean pruneWholeTermVector(int docNumber, String field) throws IOException
Term vector pruning.
Parameters:
docNumber - document number
field - field name
Returns:
true if the complete term vector for this field should be removed (as specified by the PruningPolicy.DEL_VECTOR flag).
Throws:
IOException
pruneAllFieldPostings
public boolean pruneAllFieldPostings(String field) throws IOException
Pruning of all postings for a field.
Parameters:
field - field name
Returns:
true if all postings for all terms in this field should be removed (as specified by the PruningPolicy.DEL_POSTINGS flag).
Throws:
IOException
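Both pruneWholeTermVector and pruneAllFieldPostings are described as returning true when a particular flag is specified for the field. A minimal sketch of such a per-field flag test is shown below. It is an assumption about how the check could be written, not the library's actual code; the flag values are placeholders and the helper name hasFlag is hypothetical.

```java
import java.util.Map;

/** Hypothetical sketch of a per-field flag test; not the Lucene implementation. */
class FlagCheckSketch {
    // Placeholder bit values for illustration only; the real constants
    // are defined in PruningPolicy and may have different values.
    static final int DEL_POSTINGS = 0x01;
    static final int DEL_VECTOR = 0x04;

    /**
     * Returns true if the given flag bit is set for the field.
     * Fields that are not listed in the map are treated as having no flags.
     */
    static boolean hasFlag(Map<String, Integer> fieldFlags, String field, int flag) {
        Integer flags = fieldFlags.get(field);
        return flags != null && (flags & flag) != 0;
    }
}
```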
initPositionsTerm
public abstract void initPositionsTerm(TermPositions in, Term t) throws IOException
Called when moving TermPositions to a new Term.
Parameters:
in - input term positions
t - current term
Throws:
IOException
prunePayload
public boolean prunePayload(TermPositions in, Term curTerm)
Called when checking for the presence of a payload for the current term at the current position.
Parameters:
in - positioned term positions
curTerm - current term associated with these positions
Returns:
true if the payload should be removed, false otherwise.
pruneTermVectorTerms
public abstract int pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, TermFreqVector v) throws IOException
Pruning of individual terms in term vectors.
Parameters:
docNumber - document number
field - field name
terms - array of terms
freqs - array of term frequencies
v - the original term frequency vector
Returns:
0 if no terms are to be removed, or a positive number to indicate how many terms need to be removed. The same number of entries in the terms array must be set to null to indicate which terms to remove.
Throws:
IOException
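The return contract above (null out the removed entries and return how many were nulled) can be illustrated without Lucene. The sketch below uses a made-up criterion, a minimum term frequency, purely as an example; the class name, method name and minFreq parameter are hypothetical.

```java
/** Hypothetical sketch of the "set to null and count" contract; not the Lucene API. */
class TermVectorPruningSketch {
    /**
     * Sets terms[i] to null for every term whose frequency is below minFreq
     * and returns the number of entries that were nulled.
     * The terms and freqs arrays are assumed to be parallel.
     */
    static int pruneLowFrequencyTerms(String[] terms, int[] freqs, int minFreq) {
        int removed = 0;
        for (int i = 0; i < terms.length; i++) {
            if (terms[i] != null && freqs[i] < minFreq) {
                terms[i] = null; // mark this term for removal
                removed++;
            }
        }
        return removed;
    }
}
```

The returned count matches the number of null entries the method created, which is what the contract requires.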
pruneTermEnum
public abstract boolean pruneTermEnum(TermEnum te) throws IOException
Pruning of all postings for a term (invoked once per term).
Parameters:
te - positioned term enum.
Returns:
true if all postings for this term should be removed, false otherwise.
Throws:
IOException
pruneAllPositions
public abstract boolean pruneAllPositions(TermPositions termPositions, Term t) throws IOException
Prune all postings per term (invoked once per term per doc).
Parameters:
termPositions - positioned term positions. Implementations MUST NOT advance this by calling TermPositions methods that advance either the position pointer (next, skipTo) or term pointer (seek).
t - current term
Returns:
true if the current posting should be removed, false otherwise.
Throws:
IOException
pruneSomePositions
public abstract int pruneSomePositions(int docNum, int[] positions, Term curTerm)
Prune some postings per term (invoked once per term per doc).
Parameters:
docNum - current document number
positions - original term positions in the document (and indirectly term frequency)
curTerm - current term
Returns:
0 if no postings are to be removed, or a positive number to indicate how many postings need to be removed. The same number of entries in the positions array must be set to -1 to indicate which positions to remove.
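To make the -1 marking convention concrete, here is a small Lucene-free sketch. The pruning criterion (dropping positions at or beyond a cutoff offset) is invented for illustration, and the class name, method name and cutoff parameter are hypothetical.

```java
/** Hypothetical sketch of the "set to -1 and count" contract; not the Lucene API. */
class PositionPruningSketch {
    /**
     * Sets every position that is at or beyond the cutoff to -1 and
     * returns the number of entries that were marked this way.
     */
    static int pruneLatePositions(int[] positions, int cutoff) {
        int removed = 0;
        for (int i = 0; i < positions.length; i++) {
            if (positions[i] != -1 && positions[i] >= cutoff) {
                positions[i] = -1; // mark this position for removal
                removed++;
            }
        }
        return removed;
    }
}
```

Because the array is modified in place, the caller can see which positions survive, and the number of surviving entries gives the new term frequency for the document.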