WARNING: Version 2.2 of Elasticsearch has passed its EOL date.

This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.

CJK Bigram Token Filter

The cjk_bigram token filter forms bigrams out of the CJK terms that are generated by the standard tokenizer or the icu_tokenizer (see analysis-icu plugin).

By default, a CJK character that has no adjacent CJK character to pair with is output as a unigram. To always output both unigrams and bigrams, set the output_unigrams flag to true; this supports a combined unigram+bigram approach.
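
The effect of output_unigrams can be illustrated with a small Python sketch. This is a simplified model and not the filter's actual implementation: the function name cjk_bigrams is hypothetical, the input is assumed to be a single run of adjacent CJK characters, and token positions, offsets and script detection are ignored.

```python
def cjk_bigrams(chars, output_unigrams=False):
    """Simplified model of bigram formation over one run of adjacent CJK characters.

    `chars` is assumed to contain only CJK characters with no gaps between them.
    """
    n = len(chars)
    # A lone character has no neighbour to pair with, so it is emitted as a unigram.
    if n == 1:
        return [chars]
    tokens = []
    for i in range(n):
        # With output_unigrams enabled, every character is also emitted on its own.
        if output_unigrams:
            tokens.append(chars[i])
        # Emit the bigram that starts at this character, if a next character exists.
        if i + 1 < n:
            tokens.append(chars[i:i + 2])
    return tokens


default_tokens = cjk_bigrams("東京都")
combined_tokens = cjk_bigrams("東京都", output_unigrams=True)
single_token = cjk_bigrams("東")
```

Under this model, the three-character input yields two overlapping bigrams by default, and the same bigrams interleaved with the three single characters when output_unigrams is true.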

Bigrams are generated for characters in han, hiragana, katakana and hangul, but bigrams can be disabled for particular scripts with the ignored_scripts parameter. All non-CJK input is passed through unmodified.

{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "han_bigrams" : {
                    "tokenizer" : "standard",
                    "filter" : ["han_bigrams_filter"]
                }
            },
            "filter" : {
                "han_bigrams_filter" : {
                    "type" : "cjk_bigram",
                    "ignored_scripts": [
                        "hiragana",
                        "katakana",
                        "hangul"
                    ],
                    "output_unigrams" : true
                }
            }
        }
    }
}
