
Tair (Redis® OSS-Compatible): Search tokenizers

Last Updated: Mar 30, 2026

Analyzers parse and tokenize text fields so that TairSearch can build an index and answer full-text queries. Each analyzer runs three stages in order: character filters, a tokenizer, and token filters. TairSearch ships with nine built-in analyzers. For specialized requirements, you can build a custom analyzer from individual components.

Analyzer workflow
| Stage | Purpose | Count per analyzer |
| --- | --- | --- |
| Character filter | Preprocesses the raw text before tokenization (for example, replace :) with _happy_) | Zero or more, run in order |
| Tokenizer | Splits the preprocessed text into tokens | Exactly one |
| Token filter | Post-processes each token (for example, lowercase, stop-word removal, stemming) | Zero or more, run in order |

Choose an analyzer

| Analyzer | Best for | Tokenizes by | Lowercases | Filters stop words |
| --- | --- | --- | --- | --- |
| Standard | Most languages | Unicode word boundaries | Yes | Yes |
| Stop | Most languages (stop-word focus) | Non-letter characters | Yes | Yes |
| Jieba | Chinese text | Trained dictionary | English tokens only | Yes |
| IK | Chinese text (Elasticsearch-compatible) | Trained dictionary (two modes) | Yes (default) | Optional |
| Pattern | Custom delimiter logic | Regex pattern | Optional | Optional |
| Whitespace | Pre-tokenized or structured text | Whitespace characters | No | No |
| Simple | Western text, case-insensitive | Non-letter characters | Yes | No |
| Keyword | Exact-match fields | No splitting (whole field = one token) | No | No |
| Language | Specific natural languages | Language-specific rules | Yes | Yes |

How it works

An analyzer processes a document through three sequential stages.

Stage 1 — Character filter: Zero or more character filters preprocess the raw document text. Filters run in the order they are listed. For example, a mapping character filter can replace ":)" with "_happy_" before tokenization begins.

Stage 2 — Tokenizer: Exactly one tokenizer splits the (possibly filtered) text into tokens. For example, the whitespace tokenizer splits "I am very happy" into ["I", "am", "very", "happy"].

Stage 3 — Token filter: Zero or more token filters post-process the tokens from the tokenizer. Filters run in the order they are listed. For example, the stop token filter removes common words such as "the" and "is".
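The three stages above can be sketched in plain Python. This is an illustrative approximation, not TairSearch code: a mapping character filter, the whitespace tokenizer, then the lowercase and stop token filters, with a hypothetical stop-word list.

```python
STOPWORDS = {"is", "a", "the"}

def char_filter(text, mappings):
    # Stage 1: apply each "key => value" replacement in order.
    for key, value in mappings:
        text = text.replace(key, value)
    return text

def whitespace_tokenize(text):
    # Stage 2: split the filtered text at whitespace.
    return text.split()

def token_filters(tokens):
    # Stage 3: lowercase every token, then drop stop words.
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text):
    text = char_filter(text, [(":)", "_happy_")])
    return token_filters(whitespace_tokenize(text))

print(analyze("Tair is a :) database"))  # ['tair', '_happy_', 'database']
```

Each stage consumes the previous stage's output, which is why filter order matters.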

Built-in analyzers

Standard

The standard analyzer is the default choice for most languages. It splits text on Unicode word boundaries (per Unicode Standard Annex #29), lowercases all tokens, and removes common stop words.

Components: standard tokenizer → lowercase token filter → stop token filter

No character filters are included.

Optional parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| stopwords | Array of stop words to filter. Replaces the default list entirely. | See below |
| max_token_length | Maximum character length per token. Tokens longer than this are split at the limit. | 255 |

Default stop words:

["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
 "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
 "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}
// Custom stop words and token length
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "max_token_length": 10,
          "stopwords": ["memory", "disk", "is", "a"]
        }
      }
    }
  }
}

Stop

The stop analyzer splits text at any non-letter character, lowercases all tokens, and removes stop words.

Components: lowercase tokenizer → stop token filter

Optional parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| stopwords | Array of stop words to filter. Replaces the default list entirely. | Same as standard |

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "stop"
      }
    }
  }
}
// Custom stop words
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "stop",
          "stopwords": ["memory", "disk", "is", "a"]
        }
      }
    }
  }
}

Jieba

The jieba analyzer is recommended for Chinese text. It splits text using jieba dictionary-based segmentation, lowercases English tokens, and removes stop words.

Components: jieba tokenizer → lowercase token filter → stop token filter

Important
  • The jieba analyzer loads a 20 MB built-in dictionary into memory. Only one copy is loaded globally. The first use of jieba may cause a brief latency spike while the dictionary loads.

  • Words in a custom dictionary cannot contain spaces, tabs (\t), line feeds (\n), or commas (,).

Optional parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| userwords | Array of strings added to the default dictionary. See the default jieba dictionary. | Empty |
| use_hmm | Use a hidden Markov model (HMM) to handle out-of-vocabulary words. | true |
| stopwords | Array of stop words to filter. Replaces the default list entirely. See the default jieba stop words. | Built-in list |

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "jieba"
      }
    }
  }
}
// Custom dictionary, stop words, and HMM
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "jieba",
          "stopwords": ["memory", "disk", "is", "a"],
          "userwords": ["Redis", "open-source", "flexible"],
          "use_hmm": true
        }
      }
    }
  }
}

IK

The IK analyzer splits Chinese text and is compatible with the IK analyzer plug-in for Alibaba Cloud Elasticsearch. It supports two modes:

  • `ik_max_word`: identifies all possible tokens (maximum granularity).

  • `ik_smart`: filters the results of the `ik_max_word` mode to identify the most likely tokens.

Components: IK tokenizer (no token filters by default)

Optional parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| stopwords | Array of stop words to filter. Replaces the default list entirely. | Same as standard |
| userwords | Array of strings added to the default IK dictionary. See the default IK dictionary. | Empty |
| quantifiers | Array of quantifiers added to the default IK quantifier dictionary. See the default quantifier dictionary. | Empty |
| enable_lowercase | Convert uppercase letters to lowercase before tokenization. | true |
Important

If your custom dictionary contains uppercase letters, set enable_lowercase to false. Lowercase conversion happens before splitting, so uppercase entries in the dictionary would never match.
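The pitfall described above can be shown with a minimal Python sketch (an illustration of the ordering problem, not the IK implementation): dictionary matching runs after lowercasing, so an uppercase entry such as "Redis" can never match unless lowercasing is disabled.

```python
def match_userwords(text, userwords, enable_lowercase=True):
    # Lowercase conversion runs before dictionary matching,
    # mirroring the IK analyzer's enable_lowercase behavior.
    if enable_lowercase:
        text = text.lower()
    return [w for w in userwords if w in text]

# "Redis" never matches once the input is lowercased to "redis cluster".
print(match_userwords("Redis cluster", ["Redis"]))                          # []
print(match_userwords("Redis cluster", ["Redis"], enable_lowercase=False))  # ['Redis']
```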

Configuration examples:

// Default configuration: both IK modes
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "f1": {
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}
// Custom stop words, dictionary, and quantifiers
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_ik_smart_analyzer"
      },
      "f1": {
        "type": "text",
        "analyzer": "my_ik_max_word_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_smart_analyzer": {
          "type": "ik_smart",
          "stopwords": ["memory", "disk", "is", "a"],
          "userwords": ["Redis", "open-source", "flexible"],
          "quantifiers": ["ns"],
          "enable_lowercase": false
        },
        "my_ik_max_word_analyzer": {
          "type": "ik_max_word",
          "stopwords": ["memory", "disk", "is", "a"],
          "userwords": ["Redis", "open-source", "flexible"],
          "quantifiers": ["ns"],
          "enable_lowercase": false
        }
      }
    }
  }
}

Pattern

The pattern analyzer splits text using a regular expression. By default, the matched text is treated as a delimiter (tokens are the text between matches). It also lowercases tokens and filters stop words.

Components: pattern tokenizer → lowercase token filter → stop token filter

Optional parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| pattern | Regular expression. Text matching the pattern is used as a delimiter. See RE2 syntax. | \W+ |
| stopwords | Array of stop words. Replaces the default list entirely. | Same as standard |
| lowercase | Convert tokens to lowercase. | true |
| flags | Set to CASE_INSENSITIVE to make the regex case-insensitive. | Empty (case-sensitive) |
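The default behavior (matches of \W+ act as delimiters, tokens are lowercased, then stop words are removed) can be approximated in Python. This sketch uses Python's re module as a stand-in for RE2 and a hypothetical two-word stop list:

```python
import re

STOPWORDS = {"is", "a"}

def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    # Text matching the pattern acts as a delimiter;
    # tokens are the stretches of text between matches.
    tokens = [t for t in re.split(pattern, text) if t]
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]

print(pattern_analyze("Tair is a Redis-compatible KV store"))
# ['tair', 'redis', 'compatible', 'kv', 'store']
```

Note that the hyphen is a non-word character, so "Redis-compatible" splits into two tokens.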

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "pattern"
      }
    }
  }
}
// Custom pattern with case-insensitive matching
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "pattern",
          "pattern": "\\'([^\\']+)\\'",
          "stopwords": ["aaa", "@"],
          "lowercase": false,
          "flags": "CASE_INSENSITIVE"
        }
      }
    }
  }
}

Whitespace

The whitespace analyzer splits text at whitespace characters. It does not lowercase tokens or remove stop words.

Components: whitespace tokenizer

Optional parameters: None

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}

Simple

The simple analyzer splits text at any non-letter character and lowercases all tokens. It does not filter stop words.

Components: lowercase tokenizer

Optional parameters: None

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "simple"
      }
    }
  }
}

Keyword

The keyword analyzer treats the entire field value as a single token without any splitting. Use it for fields that require exact-match queries, such as IDs, status codes, or tags.

Components: keyword tokenizer

Optional parameters: None

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "keyword"
      }
    }
  }
}

Language

The language analyzer supports language-specific tokenization and stop-word removal for a fixed set of languages: arabic, cjk, chinese, brazilian, czech, german, greek, persian, french, dutch, and russian.

Optional parameters:

| Parameter | Description | Default | Supported languages |
| --- | --- | --- | --- |
| stopwords | Array of stop words. Replaces the default list. See Appendix 4 for defaults. | Language-specific | All except chinese |
| stem_exclusion | Array of words whose stems are not extracted. For example, adding "apples" prevents it from being reduced to "apple". | Empty | brazilian, german, french, dutch |

The stop words of the chinese analyzer cannot be modified.

Configuration examples:

// Default configuration (Arabic)
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "arabic"
      }
    }
  }
}
// Custom stop words and stem exclusion (German)
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "german",
          "stopwords": ["ein"],
          "stem_exclusion": ["speicher"]
        }
      }
    }
  }
}

Custom analyzers

Build a custom analyzer when no built-in analyzer fits your needs. Define the analyzer in settings and reference it by name in mappings.

Parameters:

| Parameter | Required | Description | Valid values |
| --- | --- | --- | --- |
| type | Yes | Identifies this as a custom analyzer. | custom |
| tokenizer | Yes | The tokenizer to use. Only one is allowed. | whitespace, lowercase, standard, classic, letter, keyword, jieba, pattern, ik_max_word, ik_smart |
| char_filter | No | Array of character filters to apply before tokenization. | mapping (see Appendix 1) |
| filter | No | Array of token filters to apply after tokenization. | classic, elision, lowercase, snowball, stop, asciifolding, length, arabic_normalization, persian_normalization (see Appendix 3) |

Example: custom analyzer with emoticon replacement and stop-word removal

// Character filters replace emoticons and expand "&" before tokenization.
// The whitespace tokenizer splits on spaces.
// Token filters lowercase and remove stop words.
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "stop"],
          "char_filter": ["emoticons", "conjunctions"]
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => _happy_", ":( => _sad_"]
        },
        "conjunctions": {
          "type": "mapping",
          "mappings": ["&=>and"]
        }
      }
    }
  }
}

Appendix 1: Supported character filters

Mapping character filter

Replaces specified strings using key-value pairs. When the input contains a key, it is replaced with the corresponding value. Multiple mapping character filters can be used in a single analyzer.

Parameters:

| Parameter | Required | Description |
| --- | --- | --- |
| mappings | Yes | Array of replacement rules. Each rule must use the format "key => value". For example: "& => and". |

Configuration example:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["emoticons"]
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => _happy_", ":( => _sad_"]
        }
      }
    }
  }
}

Appendix 2: Supported tokenizers

whitespace

Splits text at whitespace characters. Tokens that exceed max_token_length are split at the limit.

Optional parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| max_token_length | Maximum character length per token. Tokens longer than this are split at the limit. | 255 |
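The splitting-at-the-limit behavior can be sketched in Python (an illustration of the rule, not the TairSearch implementation): whitespace-delimited tokens longer than max_token_length are cut into fixed-size chunks rather than dropped.

```python
def whitespace_tokenize(text, max_token_length=255):
    tokens = []
    for token in text.split():
        # Tokens longer than the limit are split at the limit.
        for i in range(0, len(token), max_token_length):
            tokens.append(token[i:i + max_token_length])
    return tokens

print(whitespace_tokenize("redis memory", max_token_length=2))
# ['re', 'di', 's', 'me', 'mo', 'ry']
```

Contrast this with the classic tokenizer below, which skips over-length tokens instead of splitting them.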

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}
// Custom max token length
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "token1"
        }
      },
      "tokenizer": {
        "token1": {
          "type": "whitespace",
          "max_token_length": 2
        }
      }
    }
  }
}

standard

Splits text using the Unicode Text Segmentation algorithm (Unicode Standard Annex #29). Suitable for most languages.

Optional parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| max_token_length | Maximum character length per token. Tokens longer than this are split at the limit. | 255 |

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
// Custom max token length
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "token1"
        }
      },
      "tokenizer": {
        "token1": {
          "type": "standard",
          "max_token_length": 2
        }
      }
    }
  }
}

classic

Splits text using English grammar rules and handles specific patterns specially:

  • Splits at punctuation and removes it. Periods (.) surrounded by non-whitespace are kept — for example, red.apple is not split, but red. apple produces red and apple.

  • Splits at hyphens, unless the token contains digits (interpreted as a product number and kept intact).

  • Recognizes email addresses and hostnames as single tokens.

Tokens that exceed max_token_length are skipped, not split.

Optional parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| max_token_length | Maximum character length per token. Tokens longer than this are skipped. | 255 |

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "classic"
        }
      }
    }
  }
}
// Custom max token length
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "token1"
        }
      },
      "tokenizer": {
        "token1": {
          "type": "classic",
          "max_token_length": 2
        }
      }
    }
  }
}

letter

Splits text at any non-letter character. Works well for European languages.

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "letter"
        }
      }
    }
  }
}

lowercase

Splits text at any non-letter character and converts all tokens to lowercase. Equivalent to combining the letter tokenizer with the lowercase token filter, but faster because it traverses the document only once.

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "lowercase"
        }
      }
    }
  }
}

keyword

Treats the entire input as a single token without splitting. Typically paired with a token filter such as lowercase.

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "keyword"
        }
      }
    }
  }
}

jieba

Splits Chinese text using a trained dictionary. Recommended for Chinese-language fields.

Important

Words in a custom dictionary cannot contain spaces, tabs (\t), line feeds (\n), or commas (,).

Optional parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| userwords | Array of strings added to the default dictionary. See the default jieba dictionary. | Empty |
| use_hmm | Use a hidden Markov model (HMM) to handle out-of-vocabulary words. | true |

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "jieba"
        }
      }
    }
  }
}
// Custom dictionary
{
  "mappings": {
    "properties": {
      "f1": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "token1"
        }
      },
      "tokenizer": {
        "token1": {
          "type": "jieba",
          "userwords": ["Redis", "open-source", "flexible"],
          "use_hmm": true
        }
      }
    }
  }
}

pattern

Splits text using a regular expression. The matched text is treated as a delimiter by default. Use the group parameter to treat matched text as tokens instead.

Optional parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| pattern | Regular expression. See RE2 syntax. | \W+ |
| group | Controls how the regex result is used. -1 uses matched text as delimiters. 0 uses the full match as a token. 1 or higher uses the corresponding capture group as a token. | -1 |
| flags | Set to CASE_INSENSITIVE to make the regex case-insensitive. | Empty (case-sensitive) |

Example of `group` behavior:

Regex: "a(b+)c", input: "abbbcdefabc"

  • group: 0 → tokens: [ abbbc, abc ] (full matches)

  • group: 1 → tokens: [ bbb, b ] (first capture group)
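The group example above can be reproduced in Python (using the re module as a stand-in for RE2) to show how group selects the token text from each match:

```python
import re

def pattern_tokens(text, pattern, group):
    # group 0 -> the full match becomes the token;
    # group >= 1 -> that capture group becomes the token.
    return [m.group(group) for m in re.finditer(pattern, text)]

print(pattern_tokens("abbbcdefabc", r"a(b+)c", group=0))  # ['abbbc', 'abc']
print(pattern_tokens("abbbcdefabc", r"a(b+)c", group=1))  # ['bbb', 'b']
```

Text between matches ("def" here) is discarded whenever group is 0 or higher.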

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "pattern"
        }
      }
    }
  }
}
// Custom pattern with capture group
{
  "mappings": {
    "properties": {
      "f1": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "pattern_tokenizer"
        }
      },
      "tokenizer": {
        "pattern_tokenizer": {
          "type": "pattern",
          "pattern": "AB(A(\\w+)C)",
          "flags": "CASE_INSENSITIVE",
          "group": 2
        }
      }
    }
  }
}

IK

Splits Chinese text. Supports two modes:

  • `ik_max_word`: identifies all possible tokens (maximum granularity).

  • `ik_smart`: identifies the most likely tokens (coarser granularity).

Optional parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| stopwords | Array of stop words. Replaces the default list entirely. | Same as standard |
| userwords | Array of strings added to the default IK dictionary. See the default IK dictionary. | Empty |
| quantifiers | Array of quantifiers added to the default quantifier dictionary. See the default quantifier dictionary. | Empty |
| enable_lowercase | Convert uppercase letters to lowercase before tokenization. | true |
Important

If your custom dictionary contains uppercase letters, set enable_lowercase to false. Lowercase conversion happens before splitting.

Configuration examples:

// Default configuration: both IK modes
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_ik_smart_analyzer"
      },
      "f1": {
        "type": "text",
        "analyzer": "my_custom_ik_max_word_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_ik_smart_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart"
        },
        "my_custom_ik_max_word_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word"
        }
      }
    }
  }
}
// Custom dictionary, stop words, and quantifiers
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_ik_smart_analyzer"
      },
      "f1": {
        "type": "text",
        "analyzer": "my_custom_ik_max_word_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_ik_smart_analyzer": {
          "type": "custom",
          "tokenizer": "my_ik_smart_tokenizer"
        },
        "my_custom_ik_max_word_analyzer": {
          "type": "custom",
          "tokenizer": "my_ik_max_word_tokenizer"
        }
      },
      "tokenizer": {
        "my_ik_smart_tokenizer": {
          "type": "ik_smart",
          "userwords": ["The tokenizer for the Chinese language", "The custom stop words"],
          "stopwords": ["about", "test"],
          "quantifiers": ["ns"],
          "enable_lowercase": false
        },
        "my_ik_max_word_tokenizer": {
          "type": "ik_max_word",
          "userwords": ["The tokenizer for the Chinese language", "The custom stop words"],
          "stopwords": ["about", "test"],
          "quantifiers": ["ns"],
          "enable_lowercase": false
        }
      }
    }
  }
}

Appendix 3: Supported token filters

classic

Removes possessive 's from the end of tokens and strips periods from acronyms. For example, Redis's becomes Redis, and I.B.M. becomes IBM.
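A rough Python sketch of the filter's two rules (an approximation for illustration, not the actual implementation): trailing possessive 's is stripped, and dots are removed from tokens that look like acronyms.

```python
def classic_filter(token):
    # Rule 1: strip a trailing possessive 's.
    if token.endswith("'s"):
        token = token[:-2]
    # Rule 2: remove dots from acronyms such as I.B.M.
    # (every dot-separated part is at most one character).
    if "." in token and all(len(part) <= 1 for part in token.split(".")):
        token = token.replace(".", "")
    return token

print(classic_filter("Redis's"))  # Redis
print(classic_filter("I.B.M."))   # IBM
print(classic_filter("v1.2.3"))   # v1.2.3 (not an acronym, unchanged)
```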

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "classic",
          "filter": ["classic"]
        }
      }
    }
  }
}

elision

Removes specified elisions from the beginning of tokens. Primarily used for French text (for example, l'avion becomes avion).

Optional parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| articles | Array of elisions to remove. Replaces the default list entirely. | ["l", "m", "t", "qu", "n", "s", "j"] |
| articles_case | Whether elision matching is case-sensitive. | false |
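The core of the filter can be sketched in a few lines of Python (an illustration of the rule, ignoring the articles_case option): a token is shortened only when the text before its apostrophe is in the article list.

```python
ARTICLES = {"l", "m", "t", "qu", "n", "s", "j"}

def elision_filter(token, articles=ARTICLES):
    # Drop a leading article plus apostrophe: l'avion -> avion.
    head, sep, rest = token.partition("'")
    if sep and head in articles:
        return rest
    return token

print(elision_filter("l'avion"))  # avion
print(elision_filter("qu'il"))    # il
print(elision_filter("don't"))    # don't ("don" is not an article)
```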

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["elision"]
        }
      }
    }
  }
}
// Custom elisions with case-sensitive matching
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["elision_filter"]
        }
      },
      "filter": {
        "elision_filter": {
          "type": "elision",
          "articles": ["l", "m", "t", "qu", "n", "s", "j"],
          "articles_case": true
        }
      }
    }
  }
}

lowercase

Converts all tokens to lowercase.

Optional parameters:

| Parameter | Description | Valid values |
| --- | --- | --- |
| language | Apply language-specific lowercasing rules. If not set, standard English rules apply. | greek, russian |

Configuration examples:

// Default configuration (English)
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
// Language-specific lowercasing (Greek and Russian)
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_greek_analyzer"
      },
      "f1": {
        "type": "text",
        "analyzer": "my_custom_russian_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_greek_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["greek_lowercase"]
        },
        "my_custom_russian_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["russian_lowercase"]
        }
      },
      "filter": {
        "greek_lowercase": {
          "type": "lowercase",
          "language": "greek"
        },
        "russian_lowercase": {
          "type": "lowercase",
          "language": "russian"
        }
      }
    }
  }
}

snowball

Extracts the stem from each token. For example, cats becomes cat and running becomes run.

Optional parameters:

| Parameter | Description | Default | Valid values |
| --- | --- | --- | --- |
| language | The language whose stemming rules to apply. | english | english, german, french, dutch |

Configuration examples:

// Default configuration (English)
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["snowball"]
        }
      }
    }
  }
}
// English stemming with standard tokenizer
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["my_filter"]
        }
      },
      "filter": {
        "my_filter": {
          "type": "snowball",
          "language": "english"
        }
      }
    }
  }
}

stop

Removes stop words from the token stream.

Optional parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| stopwords | Array of stop words. Replaces the default list entirely. | Same as standard |
| ignore_case | Whether stop-word matching is case-insensitive. | false |

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["stop"]
        }
      }
    }
  }
}
// Custom stop words with case-insensitive matching
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["stop_filter"]
        }
      },
      "filter": {
        "stop_filter": {
          "type": "stop",
          "stopwords": ["the"],
          "ignore_case": true
        }
      }
    }
  }
}

asciifolding

Converts alphabetic, numeric, and symbolic characters outside the Basic Latin Unicode block to their ASCII equivalents. For example, é becomes e and ü becomes u. Use this filter to normalize accented characters in European text.
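The accent-stripping behavior can be approximated in Python using Unicode decomposition (a rough stand-in for the filter — Lucene-style ASCII folding covers many more symbols than this sketch):

```python
import unicodedata

def asciifolding(token):
    # Decompose accented characters (e.g. é -> e + combining accent),
    # then drop everything outside the ASCII range.
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(asciifolding("café"))  # cafe
print(asciifolding("über"))  # uber
```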

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["asciifolding"]
        }
      }
    }
  }
}

length

Removes tokens that are shorter or longer than specified character lengths.

Optional parameters:

| Parameter | Description | Default |
| --- | --- | --- |
| min | Minimum number of characters a token must have to be kept. | 0 |
| max | Maximum number of characters a token can have to be kept. | 2147483647 (2^31 - 1) |
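The filter is a simple length check, sketched here in Python for illustration (parameter names are hypothetical; the TairSearch parameters are min and max):

```python
def length_filter(tokens, min_len=0, max_len=2**31 - 1):
    # Keep only tokens whose length is within [min_len, max_len].
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(length_filter(["a", "disk", "memory", "io"], min_len=2, max_len=5))
# ['disk', 'io']
```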

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["length"]
        }
      }
    }
  }
}
// Keep only tokens between 2 and 5 characters
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["length_filter"]
        }
      },
      "filter": {
        "length_filter": {
          "type": "length",
          "max": 5,
          "min": 2
        }
      }
    }
  }
}

Normalization

Normalizes language-specific character variants. Use arabic_normalization for Arabic text and persian_normalization for Persian text. Both filters are typically paired with the arabic tokenizer, as in the example below.

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_arabic_analyzer"
      },
      "f1": {
        "type": "text",
        "analyzer": "my_persian_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_arabic_analyzer": {
          "type": "custom",
          "tokenizer": "arabic",
          "filter": ["arabic_normalization"]
        },
        "my_persian_analyzer": {
          "type": "custom",
          "tokenizer": "arabic",
          "filter": ["persian_normalization"]
        }
      }
    }
  }
}

Appendix 4: Default stop words for language analyzers

arabic

["من","ومن","منها","منه","في","وفي","فيها","فيه","و","ف","ثم","او","أو","ب","بها","به","ا","أ","اى","اي","أي","أى","لا","ولا","الا","ألا","إلا","لكن","ما","وما","كما","فما","عن","مع","اذا","إذا","ان","أن","إن","انها","أنها","إنها","انه","أنه","إنه","بان","بأن","فان","فأن","وان","وأن","وإن","التى","التي","الذى","الذي","الذين","الى","الي","إلى","إلي","على","عليها","عليه","اما","أما","إما","ايضا","أيضا","كل","وكل","لم","ولم","لن","ولن","هى","هي","هو","وهى","وهي","وهو","فهى","فهي","فهو","انت","أنت","لك","لها","له","هذه","هذا","تلك","ذلك","هناك","كانت","كان","يكون","تكون","وكانت","وكان","غير","بعض","قد","نحو","بين","بينما","منذ","ضمن","حيث","الان","الآن","خلال","بعد","قبل","حتى","عند","عندما","لدى","جميع"]

cjk

["with","will","to","this","there","then","the","t","that","such","s","on","not","no","it","www","was","is","","into","their","or","in","if","for","by","but","they","be","these","at","are","as","and","of","a"]

brazilian

["uns","umas","uma","teu","tambem","tal","suas","sobre","sob","seu","sendo","seja","sem","se","quem","tua","que","qualquer","porque","por","perante","pelos","pelo","outros","outro","outras","outra","os","o","nesse","nas","na","mesmos","mesmas","mesma","um","neste","menos","quais","mediante","proprio","logo","isto","isso","ha","estes","este","propios","estas","esta","todas","esses","essas","toda","entre","nos","entao","em","eles","qual","elas","tuas","ela","tudo","do","mesmo","diversas","todos","diversa","seus","dispoem","ou","dispoe","teus","deste","quer","desta","diversos","desde","quanto","depois","demais","quando","essa","deles","todo","pois","dele","dela","dos","de","da","nem","cujos","das","cujo","durante","cujas","portanto","cuja","contudo","ele","contra","como","com","pelas","assim","as","aqueles","mais","esse","aquele","mas","apos","aos","aonde","sua","e","ao","antes","nao","ambos","ambas","alem","ainda","a"]

czech

["a","s","k","o","i","u","v","z","dnes","cz","tímto","budeš","budem","byli","jseš","muj","svým","ta","tomto","tohle","tuto","tyto","jej","zda","proc","máte","tato","kam","tohoto","kdo","kterí","mi","nám","tom","tomuto","mít","nic","proto","kterou","byla","toho","protože","asi","ho","naši","napište","re","což","tím","takže","svých","její","svými","jste","aj","tu","tedy","teto","bylo","kde","ke","pravé","ji","nad","nejsou","ci","pod","téma","mezi","pres","ty","pak","vám","ani","když","však","neg","jsem","tento","clánku","clánky","aby","jsme","pred","pta","jejich","byl","ješte","až","bez","také","pouze","první","vaše","která","nás","nový","tipy","pokud","muže","strana","jeho","své","jiné","zprávy","nové","není","vás","jen","podle","zde","už","být","více","bude","již","než","který","by","které","co","nebo","ten","tak","má","pri","od","po","jsou","jak","další","ale","si","se","ve","to","jako","za","zpet","ze","do","pro","je","na","atd","atp","jakmile","pricemž","já","on","ona","ono","oni","ony","my","vy","jí","ji","me","mne","jemu","tomu","tem","temu","nemu","nemuž","jehož","jíž","jelikož","jež","jakož","nacež"]

german

["wegen","mir","mich","dich","dir","ihre","wird","sein","auf","durch","ihres","ist","aus","von","im","war","mit","ohne","oder","kein","wie","was","es","sie","mein","er","du","daß","dass","die","als","ihr","wir","der","für","das","einen","wer","einem","am","und","eines","eine","in","einer"]

greek

["ο","η","το","οι","τα","του","τησ","των","τον","την","και","κι","κ","ειμαι","εισαι","ειναι","ειμαστε","ειστε","στο","στον","στη","στην","μα","αλλα","απο","για","προσ","με","σε","ωσ","παρα","αντι","κατα","μετα","θα","να","δε","δεν","μη","μην","επι","ενω","εαν","αν","τοτε","που","πωσ","ποιοσ","ποια","ποιο","ποιοι","ποιεσ","ποιων","ποιουσ","αυτοσ","αυτη","αυτο","αυτοι","αυτων","αυτουσ","αυτεσ","αυτα","εκεινοσ","εκεινη","εκεινο","εκεινοι","εκεινεσ","εκεινα","εκεινων","εκεινουσ","οπωσ","ομωσ","ισωσ","οσο","οτι"]

persian

["انان","نداشته","سراسر","خياه","ايشان","وي","تاكنون","بيشتري","دوم","پس","ناشي","وگو","يا","داشتند","سپس","هنگام","هرگز","پنج","نشان","امسال","ديگر","گروهي","شدند","چطور","ده","و","دو","نخستين","ولي","چرا","چه","وسط","ه","كدام","قابل","يك","رفت","هفت","همچنين","در","هزار","بله","بلي","شايد","اما","شناسي","گرفته","دهد","داشته","دانست","داشتن","خواهيم","ميليارد","وقتيكه","امد","خواهد","جز","اورده","شده","بلكه","خدمات","شدن","برخي","نبود","بسياري","جلوگيري","حق","كردند","نوعي","بعري","نكرده","نظير","نبايد","بوده","بودن","داد","اورد","هست","جايي","شود","دنبال","داده","بايد","سابق","هيچ","همان","انجا","كمتر","كجاست","گردد","كسي","تر","مردم","تان","دادن","بودند","سري","جدا","ندارند","مگر","يكديگر","دارد","دهند","بنابراين","هنگامي","سمت","جا","انچه","خود","دادند","زياد","دارند","اثر","بدون","بهترين","بيشتر","البته","به","براساس","بيرون","كرد","بعضي","گرفت","توي","اي","ميليون","او","جريان","تول","بر","مانند","برابر","باشيم","مدتي","گويند","اكنون","تا","تنها","جديد","چند","بي","نشده","كردن","كردم","گويد","كرده","كنيم","نمي","نزد","روي","قصد","فقط","بالاي","ديگران","اين","ديروز","توسط","سوم","ايم","دانند","سوي","استفاده","شما","كنار","داريم","ساخته","طور","امده","رفته","نخست","بيست","نزديك","طي","كنيد","از","انها","تمامي","داشت","يكي","طريق","اش","چيست","روب","نمايد","گفت","چندين","چيزي","تواند","ام","ايا","با","ان","ايد","ترين","اينكه","ديگري","راه","هايي","بروز","همچنان","پاعين","كس","حدود","مختلف","مقابل","چيز","گيرد","ندارد","ضد","همچون","سازي","شان","مورد","باره","مرسي","خويش","برخوردار","چون","خارج","شش","هنوز","تحت","ضمن","هستيم","گفته","فكر","بسيار","پيش","براي","روزهاي","انكه","نخواهد","بالا","كل","وقتي","كي","چنين","كه","گيري","نيست","است","كجا","كند","نيز","يابد","بندي","حتي","توانند","عقب","خواست","كنند","بين","تمام","همه","ما","باشند","مثل","شد","اري","باشد","اره","طبق","بعد","اگر","صورت","غير","جاي","بيش","ريزي","اند","زيرا","چگونه","بار","لطفا","مي","درباره","من","ديده","همين","گذاري","برداري","علت","گذاشته","هم","فوق","نه","ها","شوند","اباد","همواره","هر","او
ل","خواهند","چهار","نام","امروز","مان","هاي","قبل","كنم","سعي","تازه","را","هستند","زير","جلوي","عنوان","بود"]

french

["ô","être","vu","vous","votre","un","tu","toute","tout","tous","toi","tiens","tes","suivant","soit","soi","sinon","siennes","si","se","sauf","s","quoi","vers","qui","quels","ton","quelle","quoique","quand","près","pourquoi","plus","à","pendant","partant","outre","on","nous","notre","nos","tienne","ses","non","qu","ni","ne","mêmes","même","moyennant","mon","moins","va","sur","moi","miens","proche","miennes","mienne","tien","mien","n","malgré","quelles","plein","mais","là","revoilà","lui","leurs","๺","toutes","le","où","la","l","jusque","jusqu","ils","hélas","ou","hormis","laquelle","il","eu","nôtre","etc","est","environ","une","entre","en","son","elles","elle","dès","durant","duquel","été","du","voici","par","dont","donc","voilà","hors","doit","plusieurs","diverses","diverse","divers","devra","devers","tiennes","dessus","etre","dessous","desquels","desquelles","ès","et","désormais","des","te","pas","derrière","depuis","delà","hui","dehors","sans","dedans","debout","vôtre","de","dans","nôtres","mes","d","y","vos","je","concernant","comme","comment","combien","lorsque","ci","ta","nບnmoins","lequel","chez","contre","ceux","cette","j","cet","seront","que","ces","leur","certains","certaines","puisque","certaine","certain","passé","cependant","celui","lesquelles","celles","quel","celle","devant","cela","revoici","eux","ceci","sienne","merci","ce","c","siens","les","avoir","sous","avec","pour","parmi","avant","car","avait","sont","me","auxquels","sien","sa","excepté","auxquelles","aux","ma","autres","autre","aussi","auquel","aujourd","au","attendu","selon","après","ont","ainsi","ai","afin","vôtres","lesquels","a"]

dutch

["andere","uw","niets","wil","na","tegen","ons","wordt","werd","hier","eens","onder","alles","zelf","hun","dus","kan","ben","meer","iets","me","veel","omdat","zal","nog","altijd","ja","want","u","zonder","deze","hebben","wie","zij","heeft","hoe","nu","heb","naar","worden","haar","daar","der","je","doch","moet","tot","uit","bij","geweest","kon","ge","zich","wezen","ze","al","zo","dit","waren","men","mijn","kunnen","wat","zou","dan","hem","om","maar","ook","er","had","voor","of","als","reeds","door","met","over","aan","mij","was","is","geen","zijn","niet","iemand","het","hij","een","toen","in","toch","die","dat","te","doen","ik","van","op","en","de"]

russian

["а","без","более","бы","был","была","были","было","быть","в","вам","вас","весь","во","вот","все","всего","всех","вы","где","да","даже","для","до","его","ее","ей","ею","если","есть","еще","же","за","здесь","и","из","или","им","их","к","как","ко","когда","кто","ли","либо","мне","может","мы","на","надо","наш","не","него","нее","нет","ни","них","но","ну","о","об","однако","он","она","они","оно","от","очень","по","под","при","с","со","так","также","такой","там","те","тем","то","того","тоже","той","только","том","ты","у","уже","хотя","чего","чей","чем","что","чтобы","чье","чья","эта","эти","это","я"]