The Remove Duplicates Token Filter is a very specialized filter that is only useful in very specific circumstances. It has been so named for brevity, even though it is potentially misleading.

Factory class: solr.RemoveDuplicatesTokenFilterFactory

One example of where RemoveDuplicatesTokenFilterFactory is useful is in situations where a synonym file is being used in conjunction with a stemmer. In these situations, both the stemmer and the synonym filter can cause completely identical terms with the same positions to end up in the stream, increasing index size with no benefit. Consider the following entry from a synonyms.txt file:

This filter factory instantiates a language-specific stemmer generated by Snowball. Snowball is a software package that generates pattern-based word stemmers. This type of stemmer is not as accurate as a table-based stemmer, but is faster and less complex. Table-driven stemmers are labor intensive to create and maintain and so are typically commercial products.

Solr contains Snowball stemmers for Armenian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish. StopFilterFactory, CommonGramsFilterFactory, and CommonGramsQueryFilterFactory can optionally read stopwords in Snowball format (specify format="snowball" in the configuration of those FilterFactories).

Factory class: solr.SnowballPorterFilterFactory

Like Stop Filter, the Suggest Stop Filter discards, or stops analysis of, tokens that are on the given stop words list. Suggest Stop Filter differs from Stop Filter in that it will not remove the last token unless it is followed by a token separator. For example, a query "find the" would preserve the 'the' since it was not followed by a space, punctuation, etc., and mark it as a KEYWORD so that following filters will not change or remove it. By contrast, a query like "find the popsicle" would remove 'the' as a stopword, since it's followed by a space.
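The Suggest Stop Filter rule can be approximated in a few lines of Python. This is only a sketch of the behaviour described in the text, not Solr's implementation: the stop list, the `suggest_stop` function name, and the choice to treat any non-word character as a token separator are assumptions made for illustration.

```python
import re

# Hypothetical stop list, used only for this illustration.
STOP_WORDS = {"the", "a", "an"}


def suggest_stop(query, stop_words=STOP_WORDS):
    """Return (token, is_keyword) pairs after a Suggest-Stop-style pass.

    A stop word is dropped unless it is the last token and the query
    does not end with a separator; in that case it is kept and flagged
    as a keyword so later filters would leave it alone.
    """
    tokens = re.findall(r"\w+", query)
    # Assumption: the query "ends with a separator" when its final
    # character is not a word character (space, punctuation, etc.).
    ends_with_separator = re.search(r"\w$", query) is None
    result = []
    for index, token in enumerate(tokens):
        is_last = index == len(tokens) - 1
        if token.lower() in stop_words:
            if is_last and not ends_with_separator:
                result.append((token, True))  # preserved, marked as keyword
            # otherwise the stop word is discarded
        else:
            result.append((token, False))
    return result


print(suggest_stop("find the"))
print(suggest_stop("find the popsicle"))
```

With this sketch, "find the" keeps 'the' flagged as a keyword, while "find the popsicle" and "find the " (with a trailing space) both drop it.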
For the Flatten Graph Filter, see the examples below for Synonym Graph Filter and Word Delimiter Graph Filter.

The Hunspell Stem Filter provides support for several languages. You must provide the dictionary (.dic) and rules (.aff) files for each language you wish to use with the Hunspell Stem Filter. You can download those language files here.

Be aware that your results will vary widely based on the quality of the provided dictionary and rules files. For example, some languages have only a minimal word list with no morphological information. On the other hand, for languages that have no stemmer but do have an extensive dictionary file, the Hunspell stemmer may be a good choice.

Factory class: solr.HunspellStemFilterFactory

The Limit Token Offset Filter limits tokens to those before a configured maximum start character offset. This can be useful to limit highlighting, for example.

By default, this filter ignores any tokens in the wrapped TokenStream once the limit has been reached, which can result in reset() being called prior to incrementToken() returning false. For most TokenStream implementations this should be acceptable, and faster than consuming the full stream. If you are wrapping a TokenStream which requires that the full stream of tokens be exhausted in order to function properly, use the consumeAllTokens="true" option.

Factory class: solr.LimitTokenOffsetFilterFactory

The Remove Duplicates Token Filter removes duplicate tokens in the stream. Tokens are considered to be duplicates ONLY if they have the same text and position values.

Because positions must be the same, this filter might not do what a user expects it to do based on its name.
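The "same text and same position" rule is easy to misread, so here is a minimal Python sketch of it. It is not the Lucene implementation: tokens are modelled as plain `(text, position)` tuples with absolute positions, and the `remove_duplicates` function and the sample stream are hypothetical.

```python
from typing import List, Tuple

# (text, position): a simplified stand-in for a token in the stream.
Token = Tuple[str, int]


def remove_duplicates(tokens: List[Token]) -> List[Token]:
    """Drop a token only if an earlier token had the same text AND position."""
    seen = set()
    kept = []
    for text, position in tokens:
        key = (text, position)
        if key in seen:
            continue  # identical text at the identical position: a true duplicate
        seen.add(key)
        kept.append((text, position))
    return kept


# Hypothetical stream: "tv" occurs twice at position 1 and once more at position 3.
stream = [("tv", 1), ("tv", 1), ("show", 2), ("tv", 3)]
print(remove_duplicates(stream))
```

Only the second `("tv", 1)` is removed. The `("tv", 3)` token survives because its position differs, which is why the filter may not do what its name suggests.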
Filters examine a stream of tokens and keep them, transform them, or discard them, depending on the filter type being used. You configure each filter with a <filter> element in schema.xml as a child of <analyzer>, following the <tokenizer> element. Filter definitions should follow a tokenizer or another filter definition because they take a TokenStream as input.

In: "the quick brown fox jumped over the lazy dog"
Tokenizer to Filter: "the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"
Out: "brown_dog_fox_jumped_lazy_over_quick_the"

Flatten Graph Filter
This filter must be included on index-time analyzer specifications that include at least one graph-aware filter, including Synonym Graph Filter and Word Delimiter Graph Filter.
Factory class: solr.FlattenGraphFilterFactory
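To make the tokenizer-then-filter ordering and the In/Out example above concrete, here is a small Python sketch using generators as a stand-in for a TokenStream. None of this is Solr code: `whitespace_tokenizer`, `sorted_unique_join_filter`, and `analyze` are hypothetical names, and the filter simply does what the Out line shows (de-duplicate, sort, join with "_").

```python
from typing import Iterable, Iterator, List


def whitespace_tokenizer(text: str) -> Iterator[str]:
    """Stand-in for a tokenizer: turn raw text into a stream of tokens."""
    for token in text.split():
        yield token


def sorted_unique_join_filter(tokens: Iterable[str], separator: str = "_") -> Iterator[str]:
    """Stand-in for a filter: it takes a token stream as input.

    It consumes the whole stream and emits one token made of the
    unique input tokens, sorted and joined by the separator.
    """
    yield separator.join(sorted(set(tokens)))


def analyze(text: str) -> List[str]:
    """Chain the tokenizer and the filter, in that order."""
    return list(sorted_unique_join_filter(whitespace_tokenizer(text)))


print(analyze("the quick brown fox jumped over the lazy dog"))
```

The filter cannot be applied to the raw string directly; it needs the token stream produced by the tokenizer, which mirrors why a filter definition has to follow a tokenizer or another filter.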