WARNING: Version 5.4 of Elasticsearch has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
Pattern Analyzer
editPattern Analyzer
editThe pattern
analyzer uses a regular expression to split the text into terms.
The regular expression should match the token separators not the tokens
themselves. The regular expression defaults to \W+
(or all non-word characters).
Beware of Pathological Regular Expressions
The pattern analyzer uses Java Regular Expressions.
A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.
Read more about pathological regular expressions and how to avoid them.
Definition
editIt consists of:
- Tokenizer
- Token Filters
-
- Lower Case Token Filter
- Stop Token Filter (disabled by default)
Example output
editPOST _analyze { "analyzer": "pattern", "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." }
The above sentence would produce the following terms:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
Configuration
editThe pattern
analyzer accepts the following parameters:
|
A Java regular expression, defaults to |
|
Java regular expression flags.
Flags should be pipe-separated, eg |
|
Should terms be lowercased or not. Defaults to |
|
A pre-defined stop words list like |
|
The path to a file containing stop words. |
See the Stop Token Filter for more information about stop word configuration.
Example configuration
editIn this example, we configure the pattern
analyzer to split email addresses
on non-word characters or on underscores (\W|_
), and to lower-case the result:
PUT my_index { "settings": { "analysis": { "analyzer": { "my_email_analyzer": { "type": "pattern", "pattern": "\\W|_", "lowercase": true } } } } } POST my_index/_analyze { "analyzer": "my_email_analyzer", "text": "[email protected]" }
The above example produces the following terms:
[ john, smith, foo, bar, com ]
CamelCase tokenizer
editThe following more complicated example splits CamelCase text into tokens:
PUT my_index { "settings": { "analysis": { "analyzer": { "camel": { "type": "pattern", "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])" } } } } } GET my_index/_analyze { "analyzer": "camel", "text": "MooseX::FTPClass2_beta" }
The above example produces the following terms:
[ moose, x, ftp, class, 2, beta ]
The regex above is easier to understand as:
([^\p{L}\d]+) # swallow non letters and numbers, | (?<=\D)(?=\d) # or non-number followed by number, | (?<=\d)(?=\D) # or number followed by non-number, | (?<=[ \p{L} && [^\p{Lu}]]) # or lower case (?=\p{Lu}) # followed by upper case, | (?<=\p{Lu}) # or upper case (?=\p{Lu} # followed by upper case [\p{L}&&[^\p{Lu}]] # then lower case )