Understanding Analyzers, Tokenizers, and Filters in Solr
Field analyzers are used both during ingestion, when a document is indexed, and at query time. An analyzer examines the text of fields and generates a token stream.
Tokenizers break field data into lexical units, or tokens.
Filters examine a stream of tokens and either keep them, transform them, discard them, or create new ones.
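The three roles above form a simple pipeline: a tokenizer produces tokens, and filters transform the stream in sequence. As a rough, illustrative sketch (plain Python, not Solr's actual API), with a hypothetical lowercase filter and stopword filter:

```python
# Illustrative sketch only -- not Solr's API. Models an analysis chain as
# a tokenizer followed by a sequence of filters applied in order.

def whitespace_tokenizer(text):
    # Break the field text into tokens on whitespace.
    return text.split()

def lowercase_filter(tokens):
    # Transform: set every token to lowercase.
    return [t.lower() for t in tokens]

def stopword_filter(tokens, stopwords=frozenset({"a", "an", "the", "of"})):
    # Discard: drop tokens that appear in the stopword set.
    return [t for t in tokens if t not in stopwords]

def analyze(text, tokenizer, filters):
    # An "analyzer": run the tokenizer, then each filter in order.
    tokens = tokenizer(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

print(analyze("The Quick Brown Fox", whitespace_tokenizer,
              [lowercase_filter, stopword_filter]))
# ['quick', 'brown', 'fox']
```

The same chain runs at index time and at query time, which is why the document and the query end up with comparable tokens.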
Analyzers are specified as a child of the <fieldType> element in the schema.xml configuration file. For example:
<fieldType name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldType>
In this case a single class, WhitespaceAnalyzer, is responsible for analyzing the content of the named text field and emitting the corresponding tokens.
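To see what whitespace-only analysis implies, here is a quick Python approximation (not Solr code): tokens are split on whitespace and otherwise left untouched, so punctuation and case survive intact.

```python
# Approximation of whitespace-only analysis (not Solr code). The sample
# text is hypothetical; note punctuation and case are preserved.
text = "Please, email Rob.Smith@example.com today."
tokens = text.split()
print(tokens)
# ['Please,', 'email', 'Rob.Smith@example.com', 'today.']
```

This is why a whitespace-only chain is usually followed by filters in practice: "Please," and "please" would otherwise never match.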
The job of a tokenizer is to break up a stream of text into tokens, where each token is (usually) a sub-sequence of the characters in the text. An analyzer is aware of the field it is configured for, but a tokenizer is not. Tokenizers read from a character stream (a Reader) and produce a sequence of Token objects (a TokenStream). For example:
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
The class named in the tokenizer element is not the actual tokenizer, but rather a class that implements the org.apache.solr.analysis.TokenizerFactory interface. This factory class will be called upon to create new tokenizer instances as needed. Objects created by the factory must derive from org.apache.lucene.analysis.TokenStream, which indicates that they produce sequences of tokens. If the tokenizer produces tokens that are usable as is, it may be the only component of the analyzer. Otherwise, the tokenizer's output tokens will serve as input to the first filter stage in the pipeline.
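The factory indirection can be sketched roughly as follows (plain Python, not Solr's actual Java classes, and the class names here are hypothetical): the schema names a factory, and the framework asks the factory for a fresh tokenizer instance whenever one is needed.

```python
# Rough sketch of the factory pattern (not Solr's actual classes).
# The analysis framework holds a factory and calls create() per use,
# so tokenizer instances need not be shared or thread-safe.

class WhitespaceTokenizer:
    def tokenize(self, text):
        # Produce the token sequence for one piece of text.
        return text.split()

class WhitespaceTokenizerFactory:
    # Named in the schema; builds tokenizer instances on demand.
    def create(self):
        return WhitespaceTokenizer()

factory = WhitespaceTokenizerFactory()
tokenizer = factory.create()
print(tokenizer.tokenize("hello world"))
# ['hello', 'world']
```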
Filters consume input and produce a stream of tokens. The job of a filter is usually easier than that of a tokenizer, since in most cases a filter looks at each token in the stream sequentially and decides whether to pass it along, replace it, or discard it. For example:
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
This example starts with Solr’s standard tokenizer, which breaks the field’s text into tokens. Those tokens then pass through Solr’s standard filter, which removes dots from acronyms, and performs a few other common operations. All the tokens are then set to lowercase, which will facilitate case-insensitive matching at query time.
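The effect of that chain can be approximated in a few lines of Python (illustrative only; the tokenizing regex and the acronym-dot heuristic here are simplifications, not Solr's actual logic):

```python
# Rough approximation (not Solr code) of the chain above:
# tokenize, strip dots from acronym-like tokens, then lowercase.
import re

def tokenize(text):
    # Crude stand-in for the standard tokenizer: runs of word chars and dots.
    return re.findall(r"[\w.]+", text)

def remove_acronym_dots(tokens):
    # Mimic the acronym dot removal: "I.B.M." -> "IBM".
    return [t.replace(".", "") if re.fullmatch(r"(\w\.)+\w?\.?", t) else t
            for t in tokens]

def lowercase(tokens):
    # Case-fold every token for case-insensitive matching.
    return [t.lower() for t in tokens]

tokens = tokenize("I.B.M. Faxes Arrived")
for stage in (remove_acronym_dots, lowercase):
    tokens = stage(tokens)
print(tokens)
# ['ibm', 'faxes', 'arrived']
```

Because the same chain runs on queries, a search for "ibm" will match a document that originally contained "I.B.M.".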
ProsperaSoft offers Solr development solutions. You can email at firstname.lastname@example.org to get in touch with ProsperaSoft Solr experts and consultants.