Configure search analyzers and their components

Analyzers parse and tokenize text fields so that TairSearch can build an index and answer full-text queries. Each analyzer runs three stages in order: character filters, a tokenizer, and token filters. TairSearch ships with nine built-in analyzers. For specialized requirements, you can build a custom analyzer from individual components.

Stage	Purpose	Count per analyzer
Character filter	Preprocesses the raw text before tokenization (for example, replace `:)` with `_happy_`)	Zero or more, run in order
Tokenizer	Splits the preprocessed text into tokens	Exactly one
Token filter	Post-processes each token (for example, lowercase, stop-word removal, stemming)	Zero or more, run in order

Choose an analyzer

Analyzer	Best for	Tokenizes by	Lowercases	Filters stop words
Standard	Most languages	Unicode word boundaries	Yes	Yes
Stop	Most languages (stop-word focus)	Non-letter characters	Yes	Yes
Jieba	Chinese text	Trained dictionary	English tokens only	Yes
IK	Chinese text (Elasticsearch-compatible)	Trained dictionary (two modes)	Yes (default)	Optional
Pattern	Custom delimiter logic	Regex pattern	Optional	Optional
Whitespace	Pre-tokenized or structured text	Whitespace characters	No	No
Simple	Western text, case-insensitive	Non-letter characters	Yes	No
Keyword	Exact-match fields	No splitting (whole field = one token)	No	No
Language	Specific natural languages	Language-specific rules	Yes	Yes

How it works

An analyzer processes a document through three sequential stages.

Stage 1 (character filter): Zero or more character filters preprocess the raw document text. Filters run in the order they are listed. For example, a mapping character filter can replace `:)` with `_happy_` before tokenization begins.

Stage 2 (tokenizer): Exactly one tokenizer splits the (possibly filtered) text into tokens. For example, the whitespace tokenizer splits "I am very happy" into ["I", "am", "very", "happy"].

Stage 3 (token filter): Zero or more token filters post-process the tokens from the tokenizer. Filters run in the order they are listed. For example, the stop token filter removes common words such as "the" and "is".
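The three stages can be pictured as plain function composition. The following Python sketch is only an illustration of the data flow, not TairSearch code: the function names are hypothetical, and the mapping, tokenizer, and stop-word list are the small examples used in the stage descriptions.

```python
from typing import Dict, List, Set


def char_filter(text: str, mappings: Dict[str, str]) -> str:
    """Stage 1: replace each mapping key with its value in the raw text."""
    for key, value in mappings.items():
        text = text.replace(key, value)
    return text


def whitespace_tokenizer(text: str) -> List[str]:
    """Stage 2: split the filtered text on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Stage 3a: lowercase every token."""
    return [token.lower() for token in tokens]


def stop_filter(tokens: List[str], stopwords: Set[str]) -> List[str]:
    """Stage 3b: drop tokens that appear in the stop-word set."""
    return [token for token in tokens if token not in stopwords]


def analyze(text: str) -> List[str]:
    """Run the stages in order: character filter, tokenizer, token filters."""
    text = char_filter(text, {":)": "_happy_"})
    tokens = whitespace_tokenizer(text)
    tokens = lowercase_filter(tokens)
    return stop_filter(tokens, {"the", "is"})


print(analyze("The weather is nice :)"))  # ['weather', 'nice', '_happy_']
```

Note that the order of the token filters matters here: "The" is removed only because it is lowercased before the stop filter runs.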

Built-in analyzers

Standard

The standard analyzer is the default choice for most languages. It splits text on Unicode word boundaries (per Unicode Standard Annex #29), lowercases all tokens, and removes common stop words.

Components: standard tokenizer → lowercase token filter → stop token filter

No character filters are included.

Optional parameters:

Parameter	Description	Default
`stopwords`	Array of stop words to filter. Replaces the default list entirely.	See below
`max_token_length`	Maximum character length per token. Tokens longer than this are split at the limit.	255

Default stop words:

["a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
 "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
 "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"]

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}

// Custom stop words and token length
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "max_token_length": 10,
          "stopwords": ["memory", "disk", "is", "a"]
        }
      }
    }
  }
}
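To get a feel for what the custom configuration above does, the following Python sketch approximates it. It is a rough model under stated assumptions: `\w+` stands in for the Unicode word-boundary rules, and the function names are hypothetical. It shows two effects described in the parameter table: over-long tokens are split at `max_token_length`, and a custom `stopwords` array replaces the default list instead of extending it.

```python
import re
from typing import FrozenSet, List


def split_long_token(token: str, max_token_length: int) -> List[str]:
    """Cut a token into consecutive pieces of at most max_token_length characters."""
    return [
        token[i:i + max_token_length]
        for i in range(0, len(token), max_token_length)
    ]


def approx_standard_analyze(
    text: str,
    max_token_length: int = 255,
    stopwords: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Approximate the standard analyzer: tokenize, lowercase, split, drop stop words."""
    tokens: List[str] = []
    for word in re.findall(r"\w+", text):  # crude stand-in for UAX #29 segmentation
        for piece in split_long_token(word.lower(), max_token_length):
            if piece not in stopwords:
                tokens.append(piece)
    return tokens


custom_stopwords = frozenset(["memory", "disk", "is", "a"])
print(approx_standard_analyze(
    "Redis is a memory database with persistence",
    max_token_length=10,
    stopwords=custom_stopwords,
))
# ['redis', 'database', 'with', 'persistenc', 'e']
```

In this run, "with" is kept even though it is in the default stop-word list, because the custom list replaces the defaults, and the 11-character "persistence" is split after the tenth character.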

Stop

The stop analyzer splits text at any non-letter character, lowercases all tokens, and removes stop words.

Components: lowercase tokenizer → stop token filter

Optional parameters:

Parameter	Description	Default
`stopwords`	Array of stop words to filter. Replaces the default list entirely.	Same as standard

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "stop"
      }
    }
  }
}

// Custom stop words
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "stop",
          "stopwords": ["memory", "disk", "is", "a"]
        }
      }
    }
  }
}

Jieba

The jieba analyzer is recommended for Chinese text. It splits text using jieba dictionary-based segmentation, lowercases English tokens, and removes stop words.

Components: jieba tokenizer → lowercase token filter → stop token filter

Important

The jieba analyzer loads a 20 MB built-in dictionary into memory. Only one copy is loaded globally. The first use of jieba may cause a brief latency spike while the dictionary loads.
Words in a custom dictionary cannot contain spaces or any of the following characters: \t, \n, ，, 。

Optional parameters:

Parameter	Description	Default
`userwords`	Array of strings added to the default dictionary. See the default jieba dictionary.	Empty
`use_hmm`	Use a hidden Markov model (HMM) to handle out-of-vocabulary words.	`true`
`stopwords`	Array of stop words to filter. Replaces the default list entirely. See the default jieba stop words.	Built-in list

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "jieba"
      }
    }
  }
}

// Custom dictionary, stop words, and HMM
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "jieba",
          "stopwords": ["memory", "disk", "is", "a"],
          "userwords": ["Redis", "open-source", "flexible"],
          "use_hmm": true
        }
      }
    }
  }
}

IK

The IK analyzer splits Chinese text and is compatible with the IK analyzer plug-in for Alibaba Cloud Elasticsearch. It supports two modes:

`ik_max_word`: identifies all possible tokens.
`ik_smart`: filters the results of the `ik_max_word` mode to identify the most likely tokens.

Components: IK tokenizer (no token filters by default)

Optional parameters:

Parameter	Description	Default
`stopwords`	Array of stop words to filter. Replaces the default list entirely.	Same as standard
`userwords`	Array of strings added to the default IK dictionary. See the default IK dictionary.	Empty
`quantifiers`	Array of quantifiers added to the default IK quantifier dictionary. See the default quantifier dictionary.	Empty
`enable_lowercase`	Convert uppercase letters to lowercase before tokenization.	`true`

Important

If your custom dictionary contains uppercase letters, set enable_lowercase to false. Lowercase conversion happens before splitting, so uppercase entries in the dictionary would never match.
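The effect of this ordering can be shown with a toy model. The sketch below is not how IK matches dictionary entries internally; it uses a plain substring check and a hypothetical function name, purely to illustrate why a user word that contains uppercase letters cannot match text that has already been lowercased.

```python
from typing import List


def dictionary_hits(text: str, userwords: List[str], enable_lowercase: bool) -> List[str]:
    """Return the user words found in the text after optional lowercasing.

    Toy model: lowercasing is applied to the text only, before any lookup,
    which mirrors the ordering described in the note above.
    """
    if enable_lowercase:
        text = text.lower()
    return [word for word in userwords if word in text]


text = "Tair is Redis compatible"
print(dictionary_hits(text, ["Redis"], enable_lowercase=True))   # []
print(dictionary_hits(text, ["Redis"], enable_lowercase=False))  # ['Redis']
```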

Configuration examples:

// Default configuration: both IK modes
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "f1": {
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}

// Custom stop words, dictionary, and quantifiers
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_ik_smart_analyzer"
      },
      "f1": {
        "type": "text",
        "analyzer": "my_ik_max_word_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_smart_analyzer": {
          "type": "ik_smart",
          "stopwords": ["memory", "disk", "is", "a"],
          "userwords": ["Redis", "open-source", "flexible"],
          "quantifiers": ["ns"],
          "enable_lowercase": false
        },
        "my_ik_max_word_analyzer": {
          "type": "ik_max_word",
          "stopwords": ["memory", "disk", "is", "a"],
          "userwords": ["Redis", "open-source", "flexible"],
          "quantifiers": ["ns"],
          "enable_lowercase": false
        }
      }
    }
  }
}

Pattern

The pattern analyzer splits text using a regular expression. By default, the matched text is treated as a delimiter (tokens are the text between matches). It also lowercases tokens and filters stop words.

Components: pattern tokenizer → lowercase token filter → stop token filter

Optional parameters:

Parameter	Description	Default
`pattern`	Regular expression. Text matching the pattern is used as a delimiter. See RE2 syntax.	`\W+`
`stopwords`	Array of stop words. Replaces the default list entirely.	Same as standard
`lowercase`	Convert tokens to lowercase.	`true`
`flags`	Set to `CASE_INSENSITIVE` to make the regex case-insensitive.	Empty (case-sensitive)

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "pattern"
      }
    }
  }
}

// Custom pattern with case-insensitive matching
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "pattern",
          "pattern": "\\'([^\\']+)\\'",
          "stopwords": ["aaa", "@"],
          "lowercase": false,
          "flags": "CASE_INSENSITIVE"
        }
      }
    }
  }
}

Whitespace

The whitespace analyzer splits text at whitespace characters. It does not lowercase tokens or remove stop words.

Components: whitespace tokenizer

Optional parameters: None

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}

Simple

The simple analyzer splits text at any non-letter character and lowercases all tokens. It does not filter stop words.

Components: lowercase tokenizer

Optional parameters: None

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "simple"
      }
    }
  }
}

Keyword

The keyword analyzer treats the entire field value as a single token without any splitting. Use it for fields that require exact-match queries, such as IDs, status codes, or tags.

Components: keyword tokenizer

Optional parameters: None

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "keyword"
      }
    }
  }
}

Language

The language analyzer supports language-specific tokenization and stop-word removal for a fixed set of languages: arabic, cjk, chinese, brazilian, czech, german, greek, persian, french, dutch, and russian.

Optional parameters:

Parameter	Description	Default	Supported languages
`stopwords`	Array of stop words. Replaces the default list. See Appendix 4 for defaults.	Language-specific	All except `chinese`
`stem_exclusion`	Array of words whose stems are not extracted. For example, adding `"apples"` prevents it from being reduced to `"apple"`.	Empty	`brazilian`, `german`, `french`, `dutch`

The stop words of the chinese analyzer cannot be modified.

Configuration examples:

// Default configuration (Arabic)
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "arabic"
      }
    }
  }
}

// Custom stop words and stem exclusion (German)
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "german",
          "stopwords": ["ein"],
          "stem_exclusion": ["speicher"]
        }
      }
    }
  }
}

Custom analyzers

Build a custom analyzer when no built-in analyzer fits your needs. Define the analyzer in settings and reference it by name in mappings.

Parameters:

Parameter	Required	Description	Valid values
`type`	Yes	Identifies this as a custom analyzer.	`custom`
`tokenizer`	Yes	The tokenizer to use. Only one is allowed.	`whitespace`, `lowercase`, `standard`, `classic`, `letter`, `keyword`, `jieba`, `pattern`, `ik_max_word`, `ik_smart`
`char_filter`	No	Array of character filters to apply before tokenization.	`mapping` (see Appendix 1)
`filter`	No	Array of token filters to apply after tokenization.	`classic`, `elision`, `lowercase`, `snowball`, `stop`, `asciifolding`, `length`, `arabic_normalization`, `persian_normalization` (see Appendix 3)

Example: custom analyzer with emoticon replacement and stop-word removal

// Character filters replace emoticons and expand "&" before tokenization.
// The whitespace tokenizer splits on spaces.
// Token filters lowercase and remove stop words.
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "stop"],
          "char_filter": ["emoticons", "conjunctions"]
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => _happy_", ":( => _sad_"]
        },
        "conjunctions": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      }
    }
  }
}
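Before sending a configuration like this to the server, it can help to check that every name a custom analyzer references resolves to something. The following Python sketch is a hypothetical client-side check, not part of TairSearch: the built-in names are copied from the parameter table above, and it assumes that character filters must be defined under `char_filter` because the mapping filter requires a `mappings` parameter.

```python
from typing import Any, Dict, List

# Built-in component names, copied from the parameter table above.
BUILTIN_TOKENIZERS = {
    "whitespace", "lowercase", "standard", "classic", "letter",
    "keyword", "jieba", "pattern", "ik_max_word", "ik_smart",
}
BUILTIN_TOKEN_FILTERS = {
    "classic", "elision", "lowercase", "snowball", "stop", "asciifolding",
    "length", "arabic_normalization", "persian_normalization",
}


def check_custom_analyzers(config: Dict[str, Any]) -> List[str]:
    """Return a list of unresolved references; an empty list means none were found."""
    analysis = config.get("settings", {}).get("analysis", {})
    tokenizers = BUILTIN_TOKENIZERS | set(analysis.get("tokenizer", {}))
    token_filters = BUILTIN_TOKEN_FILTERS | set(analysis.get("filter", {}))
    char_filters = set(analysis.get("char_filter", {}))  # assumed: must be defined
    problems: List[str] = []
    for name, spec in analysis.get("analyzer", {}).items():
        if spec.get("type") != "custom":
            continue  # analyzers based on built-in types are not checked here
        tokenizer = spec.get("tokenizer")
        if not isinstance(tokenizer, str):
            problems.append(f"{name}: exactly one tokenizer is required")
        elif tokenizer not in tokenizers:
            problems.append(f"{name}: unknown tokenizer '{tokenizer}'")
        for char_filter in spec.get("char_filter", []):
            if char_filter not in char_filters:
                problems.append(f"{name}: undefined char_filter '{char_filter}'")
        for token_filter in spec.get("filter", []):
            if token_filter not in token_filters:
                problems.append(f"{name}: unknown filter '{token_filter}'")
    return problems


example = {
    "settings": {"analysis": {
        "analyzer": {"my_custom_analyzer": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": ["lowercase", "stop"],
            "char_filter": ["emoticons", "conjunctions"],
        }},
        "char_filter": {
            "emoticons": {"type": "mapping", "mappings": [":) => _happy_", ":( => _sad_"]},
            "conjunctions": {"type": "mapping", "mappings": ["& => and"]},
        },
    }},
}
print(check_custom_analyzers(example))  # []
```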

Appendix 1: Supported character filters

Mapping character filter

Replaces specified strings using key-value pairs. When the input contains a key, it is replaced with the corresponding value. Multiple mapping character filters can be used in a single analyzer.

Parameters:

Parameter	Required	Description
`mappings`	Yes	Array of replacement rules. Each rule must use the format `"key => value"`. For example: `"& => and"`.

Configuration example:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["emoticons"]
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => _happy_", ":( => _sad_"]
        }
      }
    }
  }
}
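The `"key => value"` rule format can be read as a small replacement table. The Python sketch below shows one way to interpret it. This is an illustration with hypothetical function names, and it applies the rules one after another with plain string replacement, which may differ from how the server resolves overlapping keys.

```python
from typing import Dict, List


def parse_rules(rules: List[str]) -> Dict[str, str]:
    """Turn ["key => value", ...] into a {key: value} replacement table."""
    table: Dict[str, str] = {}
    for rule in rules:
        key, separator, value = rule.partition("=>")
        if not separator or not key.strip():
            raise ValueError(f"rule is not in 'key => value' format: {rule!r}")
        table[key.strip()] = value.strip()
    return table


def apply_mappings(text: str, rules: List[str]) -> str:
    """Apply each rule in listed order to the raw text."""
    for key, value in parse_rules(rules).items():
        text = text.replace(key, value)
    return text


rules = [":) => _happy_", ":( => _sad_", "& => and"]
print(apply_mappings("fish & chips :)", rules))  # fish and chips _happy_
```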

Appendix 2: Supported tokenizers

whitespace

Splits text at whitespace characters. Tokens that exceed max_token_length are split at the limit.

Optional parameters:

Parameter	Description	Default
`max_token_length`	Maximum character length per token. Tokens longer than this are split at the limit.	255

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}

// Custom max token length
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "token1"
        }
      },
      "tokenizer": {
        "token1": {
          "type": "whitespace",
          "max_token_length": 2
        }
      }
    }
  }
}

standard

Splits text using the Unicode Text Segmentation algorithm (Unicode Standard Annex #29). Suitable for most languages.

Optional parameters:

Parameter	Description	Default
`max_token_length`	Maximum character length per token. Tokens longer than this are split at the limit.	255

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}

// Custom max token length
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "token1"
        }
      },
      "tokenizer": {
        "token1": {
          "type": "standard",
          "max_token_length": 2
        }
      }
    }
  }
}

classic

Splits text using English grammar rules and handles specific patterns specially:

Splits at punctuation and removes it. Periods (.) surrounded by non-whitespace are kept — for example, red.apple is not split, but red. apple produces red and apple.
Splits at hyphens, unless the token contains digits (interpreted as a product number and kept intact).
Recognizes email addresses and hostnames as single tokens.

Tokens that exceed max_token_length are skipped, not split.

Optional parameters:

Parameter	Description	Default
`max_token_length`	Maximum character length per token. Tokens longer than this are skipped.	255

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "classic"
        }
      }
    }
  }
}

// Custom max token length
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "token1"
        }
      },
      "tokenizer": {
        "token1": {
          "type": "classic",
          "max_token_length": 2
        }
      }
    }
  }
}

letter

Splits text at any non-letter character. Works well for European languages.

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "letter"
        }
      }
    }
  }
}

lowercase

Splits text at any non-letter character and converts all tokens to lowercase. Equivalent to combining the letter tokenizer with the lowercase token filter, but faster because it traverses the document only once.

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "lowercase"
        }
      }
    }
  }
}
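The equivalence between the lowercase tokenizer and the letter tokenizer plus the lowercase token filter can be sketched in Python. This is an approximation with hypothetical function names: the character class `[^\W\d_]` is used as a stand-in for "letter" and may not match the server's definition for every Unicode character.

```python
import re
from typing import List

# Word characters minus digits and underscore: an approximation of "letters".
LETTER_RUN = re.compile(r"[^\W\d_]+")


def letter_tokenizer(text: str) -> List[str]:
    """Split at any non-letter character, keeping the original case."""
    return LETTER_RUN.findall(text)


def lowercase_tokenizer(text: str) -> List[str]:
    """Split at any non-letter character and lowercase in the same pass."""
    return [match.group().lower() for match in LETTER_RUN.finditer(text)]


text = "Tair2Search: Full-Text SEARCH"
print(letter_tokenizer(text))     # ['Tair', 'Search', 'Full', 'Text', 'SEARCH']
print(lowercase_tokenizer(text))  # ['tair', 'search', 'full', 'text', 'search']

# Two passes (tokenize, then lowercase) give the same tokens as the single pass.
two_pass = [token.lower() for token in letter_tokenizer(text)]
print(two_pass == lowercase_tokenizer(text))  # True
```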

keyword

Treats the entire input as a single token without splitting. Typically paired with a token filter such as lowercase.

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "keyword"
        }
      }
    }
  }
}

jieba

Splits Chinese text using a trained dictionary. Recommended for Chinese-language fields.

Important

Words in a custom dictionary cannot contain spaces or any of the following characters: \t, \n, ，, 。

Optional parameters:

Parameter	Description	Default
`userwords`	Array of strings added to the default dictionary. See the default jieba dictionary.	Empty
`use_hmm`	Use a hidden Markov model (HMM) to handle out-of-vocabulary words.	`true`

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "jieba"
        }
      }
    }
  }
}

// Custom dictionary
{
  "mappings": {
    "properties": {
      "f1": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "token1"
        }
      },
      "tokenizer": {
        "token1": {
          "type": "jieba",
          "userwords": ["Redis", "open-source", "flexible"],
          "use_hmm": true
        }
      }
    }
  }
}

pattern

Splits text using a regular expression. The matched text is treated as a delimiter by default. Use the group parameter to treat matched text as tokens instead.

Optional parameters:

Parameter	Description	Default
`pattern`	Regular expression. See RE2 syntax.	`\W+`
`group`	Controls how the regex result is used. `-1` uses matched text as delimiters. `0` uses the full match as a token. `1` or higher uses the corresponding capture group as a token.	`-1`
`flags`	Set to `CASE_INSENSITIVE` to make the regex case-insensitive.	Empty (case-sensitive)

Example of `group` behavior:

Regex: "a(b+)c", input: "abbbcdefabc"

group: 0 → tokens: [ abbbc, abc ] (full matches)
group: 1 → tokens: [ bbb, b ] (first capture group)
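The `group` semantics can be reproduced with Python's `re` module. The sketch below is an illustration only: the function name is hypothetical, and Python's regex engine is not RE2, so patterns that rely on engine-specific syntax may behave differently on the server.

```python
import re
from typing import List


def pattern_tokenize(text: str, pattern: str, group: int = -1, flags: int = 0) -> List[str]:
    """Tokenize with a regex, following the group semantics described above."""
    regex = re.compile(pattern, flags)
    if group >= 0:
        # group 0 is the full match; 1 and higher select a capture group.
        return [m.group(group) for m in regex.finditer(text) if m.group(group)]
    # group -1: matches are delimiters, tokens are the text between them.
    tokens: List[str] = []
    last_end = 0
    for m in regex.finditer(text):
        if m.start() > last_end:
            tokens.append(text[last_end:m.start()])
        last_end = m.end()
    if last_end < len(text):
        tokens.append(text[last_end:])
    return tokens


text = "abbbcdefabc"
print(pattern_tokenize(text, "a(b+)c", group=0))   # ['abbbc', 'abc']
print(pattern_tokenize(text, "a(b+)c", group=1))   # ['bbb', 'b']
print(pattern_tokenize(text, "a(b+)c", group=-1))  # ['def']
```

With `group` set to `-1`, only "def" remains because both matches are consumed as delimiters.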

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "pattern"
        }
      }
    }
  }
}

// Custom pattern with capture group
{
  "mappings": {
    "properties": {
      "f1": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "pattern_tokenizer"
        }
      },
      "tokenizer": {
        "pattern_tokenizer": {
          "type": "pattern",
          "pattern": "AB(A(\\w+)C)",
          "flags": "CASE_INSENSITIVE",
          "group": 2
        }
      }
    }
  }
}

IK

Splits Chinese text. Supports two modes:

`ik_max_word`: identifies all possible tokens (maximum granularity).
`ik_smart`: identifies the most likely tokens (coarser granularity).

Optional parameters:

Parameter	Description	Default
`stopwords`	Array of stop words. Replaces the default list entirely.	Same as standard
`userwords`	Array of strings added to the default IK dictionary. See the default IK dictionary.	Empty
`quantifiers`	Array of quantifiers added to the default quantifier dictionary. See the default quantifier dictionary.	Empty
`enable_lowercase`	Convert uppercase letters to lowercase before tokenization.	`true`

Important

If your custom dictionary contains uppercase letters, set enable_lowercase to false. Lowercase conversion happens before splitting.

Configuration examples:

// Default configuration: both IK modes
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_ik_smart_analyzer"
      },
      "f1": {
        "type": "text",
        "analyzer": "my_custom_ik_max_word_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_ik_smart_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart"
        },
        "my_custom_ik_max_word_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word"
        }
      }
    }
  }
}

// Custom dictionary, stop words, and quantifiers
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_ik_smart_analyzer"
      },
      "f1": {
        "type": "text",
        "analyzer": "my_custom_ik_max_word_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_ik_smart_analyzer": {
          "type": "custom",
          "tokenizer": "my_ik_smart_tokenizer"
        },
        "my_custom_ik_max_word_analyzer": {
          "type": "custom",
          "tokenizer": "my_ik_max_word_tokenizer"
        }
      },
      "tokenizer": {
        "my_ik_smart_tokenizer": {
          "type": "ik_smart",
          "userwords": ["The tokenizer for the Chinese language", "The custom stop words"],
          "stopwords": ["about", "test"],
          "quantifiers": ["ns"],
          "enable_lowercase": false
        },
        "my_ik_max_word_tokenizer": {
          "type": "ik_max_word",
          "userwords": ["The tokenizer for the Chinese language", "The custom stop words"],
          "stopwords": ["about", "test"],
          "quantifiers": ["ns"],
          "enable_lowercase": false
        }
      }
    }
  }
}

Appendix 3: Supported token filters

classic

Removes the possessive 's from the end of tokens and strips periods from acronyms. For example, Fig. becomes Fig

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "classic",
          "filter": ["classic"]
        }
      }
    }
  }
}

elision

Removes specified elisions from the beginning of tokens. Primarily used for French text (for example, l'avion → avion).

Optional parameters:

Parameter	Description	Default
`articles`	Array of elisions to remove. Replaces the default list entirely.	`["l", "m", "t", "qu", "n", "s", "j"]`
`articles_case`	Whether elision matching is case-sensitive.	`false`

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["elision"]
        }
      }
    }
  }
}

// Custom elisions with case-sensitive matching
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["elision_filter"]
        }
      },
      "filter": {
        "elision_filter": {
          "type": "elision",
          "articles": ["l", "m", "t", "qu", "n", "s", "j"],
          "articles_case": true
        }
      }
    }
  }
}
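The basic behavior of elision removal can be sketched in a few lines of Python. This is a simplified model with a hypothetical function name: it handles only the straight apostrophe, compares the prefix exactly as written, and does not model `articles_case`.

```python
from typing import Iterable, List

DEFAULT_ARTICLES = ("l", "m", "t", "qu", "n", "s", "j")


def elision_filter(tokens: Iterable[str], articles: Iterable[str] = DEFAULT_ARTICLES) -> List[str]:
    """Strip a leading elision such as l' when the prefix is in the articles list."""
    article_set = set(articles)
    result: List[str] = []
    for token in tokens:
        prefix, apostrophe, rest = token.partition("'")
        if apostrophe and rest and prefix in article_set:
            result.append(rest)   # drop the elided article and the apostrophe
        else:
            result.append(token)  # no listed elision at the start: keep as-is
    return result


print(elision_filter(["l'avion", "qu'il", "aujourd'hui"]))
# ['avion', 'il', "aujourd'hui"]
```

The token "aujourd'hui" is left unchanged because "aujourd" is not in the articles list.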

lowercase

Converts all tokens to lowercase.

Optional parameters:

Parameter	Description	Valid values
`language`	Apply language-specific lowercasing rules. If not set, standard English rules apply.	`greek`, `russian`

Configuration examples:

// Default configuration (English)
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

// Language-specific lowercasing (Greek and Russian)
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_greek_analyzer"
      },
      "f1": {
        "type": "text",
        "analyzer": "my_custom_russian_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_greek_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["greek_lowercase"]
        },
        "my_custom_russian_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["russian_lowercase"]
        }
      },
      "filter": {
        "greek_lowercase": {
          "type": "lowercase",
          "language": "greek"
        },
        "russian_lowercase": {
          "type": "lowercase",
          "language": "russian"
        }
      }
    }
  }
}

snowball

Extracts the stem from each token. For example, cats becomes cat and running becomes run.

Optional parameters:

Parameter	Description	Default	Valid values
`language`	The language whose stemming rules to apply.	`english`	`english`, `german`, `french`, `dutch`

Configuration examples:

// Default configuration (English)
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["snowball"]
        }
      }
    }
  }
}

// English stemming with standard tokenizer
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["my_filter"]
        }
      },
      "filter": {
        "my_filter": {
          "type": "snowball",
          "language": "english"
        }
      }
    }
  }
}

stop

Removes stop words from the token stream.

Optional parameters:

Parameter	Description	Default
`stopwords`	Array of stop words. Replaces the default list entirely.	Same as standard
`ignore_case`	Whether stop-word matching is case-insensitive.	`false`

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["stop"]
        }
      }
    }
  }
}

// Custom stop words with case-insensitive matching
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["stop_filter"]
        }
      },
      "filter": {
        "stop_filter": {
          "type": "stop",
          "stopwords": ["the"],
          "ignore_case": true
        }
      }
    }
  }
}
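The effect of case-insensitive matching in the second example can be sketched in Python. The function name is hypothetical and the keyword argument mirrors the `ignore_case` key used in the configuration above. The sketch also shows why, with the default case-sensitive behavior, a capitalized "The" survives unless a lowercase filter runs first.

```python
from typing import Iterable, List


def stop_filter(tokens: Iterable[str], stopwords: Iterable[str], ignore_case: bool = False) -> List[str]:
    """Remove stop words, optionally comparing without regard to case."""
    if ignore_case:
        lowered = {word.lower() for word in stopwords}
        return [token for token in tokens if token.lower() not in lowered]
    exact = set(stopwords)
    return [token for token in tokens if token not in exact]


tokens = ["The", "quick", "the", "fox"]
print(stop_filter(tokens, ["the"]))                    # ['The', 'quick', 'fox']
print(stop_filter(tokens, ["the"], ignore_case=True))  # ['quick', 'fox']
```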

asciifolding

Converts alphabetic, numeric, and symbolic characters outside the Basic Latin Unicode block to their ASCII equivalents. For example, é becomes e and ü becomes u. Use this filter to normalize accented characters in European text.

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["asciifolding"]
        }
      }
    }
  }
}
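For accented Latin letters, the folding can be approximated in Python with Unicode decomposition. This sketch is not the filter's actual mapping table: it only strips combining marks after NFKD normalization, so characters without a decomposition (for example ß or ø) are left untouched here, and the function name is hypothetical.

```python
import unicodedata


def ascii_fold(token: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(ascii_fold("é"))     # e
print(ascii_fold("über"))  # uber
print(ascii_fold("café"))  # cafe
```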

length

Removes tokens that are shorter or longer than specified character lengths.

Optional parameters:

Parameter	Description	Default
`min`	Minimum number of characters a token must have to be kept.	`0`
`max`	Maximum number of characters a token can have to be kept.	`2147483647` (2^31 - 1)

Configuration examples:

// Default configuration
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["length"]
        }
      }
    }
  }
}

// Keep only tokens between 2 and 5 characters
{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["length_filter"]
        }
      },
      "filter": {
        "length_filter": {
          "type": "length",
          "max": 5,
          "min": 2
        }
      }
    }
  }
}
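The second configuration can be pictured with a short Python sketch. `length_filter` is a hypothetical helper, and the sketch assumes both bounds are inclusive, which this page does not state explicitly.

```python
def length_filter(tokens, min_len=0, max_len=2**31 - 1):
    """Keep tokens whose character count lies within [min_len, max_len]."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


tokens = ["a", "to", "hello", "banana"]

kept_default = length_filter(tokens)                       # defaults keep every token
kept_bounded = length_filter(tokens, min_len=2, max_len=5)  # mirrors "min": 2, "max": 5

print(kept_default)
print(kept_bounded)
```

With the default bounds the filter is effectively a no-op, so the first configuration only becomes useful once `min` or `max` is set.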

Normalization

Normalizes language-specific characters. Use arabic_normalization for Arabic text and persian_normalization for Persian text. Pair this filter with the standard tokenizer for best results.

Configuration:

{
  "mappings": {
    "properties": {
      "f0": {
        "type": "text",
        "analyzer": "my_arabic_analyzer"
      },
      "f1": {
        "type": "text",
        "analyzer": "my_persian_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_arabic_analyzer": {
          "type": "custom",
          "tokenizer": "arabic",
          "filter": ["arabic_normalization"]
        },
        "my_persian_analyzer": {
          "type": "custom",
          "tokenizer": "arabic",
          "filter": ["persian_normalization"]
        }
      }
    }
  }
}
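To give a sense of what character normalization means for Arabic text, here is a simplified Python sketch modeled on the rules of Apache Lucene's `ArabicNormalizer`. Whether TairSearch applies exactly these rules is an assumption, and `arabic_normalize` is a hypothetical helper, not part of any Tair API.

```python
import re

# Fold letter variants to a single canonical form (rules assumed from Lucene).
_FOLD = str.maketrans({
    "\u0622": "\u0627",  # alef with madda above -> alef
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0649": "\u064A",  # alef maksura (dotless yeh) -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
})

# Tatweel (elongation mark) and the short-vowel diacritics U+064B..U+0652.
_STRIP = re.compile("[\u0640\u064B-\u0652]")


def arabic_normalize(token):
    """Fold alef/yeh/teh-marbuta variants and remove tatweel and diacritics."""
    return _STRIP.sub("", token.translate(_FOLD))


print(arabic_normalize("\u0623\u0646"))              # alef with hamza is folded to bare alef
print(arabic_normalize("\u0643\u0640\u062A\u0628"))  # tatweel is removed
```

Normalizing spelling variants this way lets a query written with one variant match documents written with another.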

Appendix 4: Default stop words for language analyzers

arabic

["من","ومن","منها","منه","في","وفي","فيها","فيه","و","ف","ثم","او","أو","ب","بها","به","ا","أ","اى","اي","أي","أى","لا","ولا","الا","ألا","إلا","لكن","ما","وما","كما","فما","عن","مع","اذا","إذا","ان","أن","إن","انها","أنها","إنها","انه","أنه","إنه","بان","بأن","فان","فأن","وان","وأن","وإن","التى","التي","الذى","الذي","الذين","الى","الي","إلى","إلي","على","عليها","عليه","اما","أما","إما","ايضا","أيضا","كل","وكل","لم","ولم","لن","ولن","هى","هي","هو","وهى","وهي","وهو","فهى","فهي","فهو","انت","أنت","لك","لها","له","هذه","هذا","تلك","ذلك","هناك","كانت","كان","يكون","تكون","وكانت","وكان","غير","بعض","قد","نحو","بين","بينما","منذ","ضمن","حيث","الان","الآن","خلال","بعد","قبل","حتى","عند","عندما","لدى","جميع"]

cjk

["with","will","to","this","there","then","the","t","that","such","s","on","not","no","it","www","was","is","","into","their","or","in","if","for","by","but","they","be","these","at","are","as","and","of","a"]

brazilian

["uns","umas","uma","teu","tambem","tal","suas","sobre","sob","seu","sendo","seja","sem","se","quem","tua","que","qualquer","porque","por","perante","pelos","pelo","outros","outro","outras","outra","os","o","nesse","nas","na","mesmos","mesmas","mesma","um","neste","menos","quais","mediante","proprio","logo","isto","isso","ha","estes","este","propios","estas","esta","todas","esses","essas","toda","entre","nos","entao","em","eles","qual","elas","tuas","ela","tudo","do","mesmo","diversas","todos","diversa","seus","dispoem","ou","dispoe","teus","deste","quer","desta","diversos","desde","quanto","depois","demais","quando","essa","deles","todo","pois","dele","dela","dos","de","da","nem","cujos","das","cujo","durante","cujas","portanto","cuja","contudo","ele","contra","como","com","pelas","assim","as","aqueles","mais","esse","aquele","mas","apos","aos","aonde","sua","e","ao","antes","nao","ambos","ambas","alem","ainda","a"]

czech

["a","s","k","o","i","u","v","z","dnes","cz","tímto","budeš","budem","byli","jseš","muj","svým","ta","tomto","tohle","tuto","tyto","jej","zda","proc","máte","tato","kam","tohoto","kdo","kterí","mi","nám","tom","tomuto","mít","nic","proto","kterou","byla","toho","protože","asi","ho","naši","napište","re","což","tím","takže","svých","její","svými","jste","aj","tu","tedy","teto","bylo","kde","ke","pravé","ji","nad","nejsou","ci","pod","téma","mezi","pres","ty","pak","vám","ani","když","však","neg","jsem","tento","clánku","clánky","aby","jsme","pred","pta","jejich","byl","ješte","až","bez","také","pouze","první","vaše","která","nás","nový","tipy","pokud","muže","strana","jeho","své","jiné","zprávy","nové","není","vás","jen","podle","zde","už","být","více","bude","již","než","který","by","které","co","nebo","ten","tak","má","pri","od","po","jsou","jak","další","ale","si","se","ve","to","jako","za","zpet","ze","do","pro","je","na","atd","atp","jakmile","pricemž","já","on","ona","ono","oni","ony","my","vy","jí","ji","me","mne","jemu","tomu","tem","temu","nemu","nemuž","jehož","jíž","jelikož","jež","jakož","nacež"]

german

["wegen","mir","mich","dich","dir","ihre","wird","sein","auf","durch","ihres","ist","aus","von","im","war","mit","ohne","oder","kein","wie","was","es","sie","mein","er","du","daß","dass","die","als","ihr","wir","der","für","das","einen","wer","einem","am","und","eines","eine","in","einer"]

greek

["ο","η","το","οι","τα","του","τησ","των","τον","την","και","κι","κ","ειμαι","εισαι","ειναι","ειμαστε","ειστε","στο","στον","στη","στην","μα","αλλα","απο","για","προσ","με","σε","ωσ","παρα","αντι","κατα","μετα","θα","να","δε","δεν","μη","μην","επι","ενω","εαν","αν","τοτε","που","πωσ","ποιοσ","ποια","ποιο","ποιοι","ποιεσ","ποιων","ποιουσ","αυτοσ","αυτη","αυτο","αυτοι","αυτων","αυτουσ","αυτεσ","αυτα","εκεινοσ","εκεινη","εκεινο","εκεινοι","εκεινεσ","εκεινα","εκεινων","εκεινουσ","οπωσ","ομωσ","ισωσ","οσο","οτι"]

persian

["انان","نداشته","سراسر","خياه","ايشان","وي","تاكنون","بيشتري","دوم","پس","ناشي","وگو","يا","داشتند","سپس","هنگام","هرگز","پنج","نشان","امسال","ديگر","گروهي","شدند","چطور","ده","و","دو","نخستين","ولي","چرا","چه","وسط","ه","كدام","قابل","يك","رفت","هفت","همچنين","در","هزار","بله","بلي","شايد","اما","شناسي","گرفته","دهد","داشته","دانست","داشتن","خواهيم","ميليارد","وقتيكه","امد","خواهد","جز","اورده","شده","بلكه","خدمات","شدن","برخي","نبود","بسياري","جلوگيري","حق","كردند","نوعي","بعري","نكرده","نظير","نبايد","بوده","بودن","داد","اورد","هست","جايي","شود","دنبال","داده","بايد","سابق","هيچ","همان","انجا","كمتر","كجاست","گردد","كسي","تر","مردم","تان","دادن","بودند","سري","جدا","ندارند","مگر","يكديگر","دارد","دهند","بنابراين","هنگامي","سمت","جا","انچه","خود","دادند","زياد","دارند","اثر","بدون","بهترين","بيشتر","البته","به","براساس","بيرون","كرد","بعضي","گرفت","توي","اي","ميليون","او","جريان","تول","بر","مانند","برابر","باشيم","مدتي","گويند","اكنون","تا","تنها","جديد","چند","بي","نشده","كردن","كردم","گويد","كرده","كنيم","نمي","نزد","روي","قصد","فقط","بالاي","ديگران","اين","ديروز","توسط","سوم","ايم","دانند","سوي","استفاده","شما","كنار","داريم","ساخته","طور","امده","رفته","نخست","بيست","نزديك","طي","كنيد","از","انها","تمامي","داشت","يكي","طريق","اش","چيست","روب","نمايد","گفت","چندين","چيزي","تواند","ام","ايا","با","ان","ايد","ترين","اينكه","ديگري","راه","هايي","بروز","همچنان","پاعين","كس","حدود","مختلف","مقابل","چيز","گيرد","ندارد","ضد","همچون","سازي","شان","مورد","باره","مرسي","خويش","برخوردار","چون","خارج","شش","هنوز","تحت","ضمن","هستيم","گفته","فكر","بسيار","پيش","براي","روزهاي","انكه","نخواهد","بالا","كل","وقتي","كي","چنين","كه","گيري","نيست","است","كجا","كند","نيز","يابد","بندي","حتي","توانند","عقب","خواست","كنند","بين","تمام","همه","ما","باشند","مثل","شد","اري","باشد","اره","طبق","بعد","اگر","صورت","غير","جاي","بيش","ريزي","اند","زيرا","چگونه","بار","لطفا","مي","درباره","من","ديده","همين","گذاري","برداري","علت","گذاشته","هم","فوق","نه","ها","شوند","اباد","همواره","هر","اول","خواهند","چهار","نام","امروز","مان","هاي","قبل","كنم","سعي","تازه","را","هستند","زير","جلوي","عنوان","بود"]

french

["ô","être","vu","vous","votre","un","tu","toute","tout","tous","toi","tiens","tes","suivant","soit","soi","sinon","siennes","si","se","sauf","s","quoi","vers","qui","quels","ton","quelle","quoique","quand","près","pourquoi","plus","à","pendant","partant","outre","on","nous","notre","nos","tienne","ses","non","qu","ni","ne","mêmes","même","moyennant","mon","moins","va","sur","moi","miens","proche","miennes","mienne","tien","mien","n","malgré","quelles","plein","mais","là","revoilà","lui","leurs","ça","toutes","le","où","la","l","jusque","jusqu","ils","hélas","ou","hormis","laquelle","il","eu","nôtre","etc","est","environ","une","entre","en","son","elles","elle","dès","durant","duquel","été","du","voici","par","dont","donc","voilà","hors","doit","plusieurs","diverses","diverse","divers","devra","devers","tiennes","dessus","etre","dessous","desquels","desquelles","ès","et","désormais","des","te","pas","derrière","depuis","delà","hui","dehors","sans","dedans","debout","vôtre","de","dans","nôtres","mes","d","y","vos","je","concernant","comme","comment","combien","lorsque","ci","ta","néanmoins","lequel","chez","contre","ceux","cette","j","cet","seront","que","ces","leur","certains","certaines","puisque","certaine","certain","passé","cependant","celui","lesquelles","celles","quel","celle","devant","cela","revoici","eux","ceci","sienne","merci","ce","c","siens","les","avoir","sous","avec","pour","parmi","avant","car","avait","sont","me","auxquels","sien","sa","excepté","auxquelles","aux","ma","autres","autre","aussi","auquel","aujourd","au","attendu","selon","après","ont","ainsi","ai","afin","vôtres","lesquels","a"]

dutch

["andere","uw","niets","wil","na","tegen","ons","wordt","werd","hier","eens","onder","alles","zelf","hun","dus","kan","ben","meer","iets","me","veel","omdat","zal","nog","altijd","ja","want","u","zonder","deze","hebben","wie","zij","heeft","hoe","nu","heb","naar","worden","haar","daar","der","je","doch","moet","tot","uit","bij","geweest","kon","ge","zich","wezen","ze","al","zo","dit","waren","men","mijn","kunnen","wat","zou","dan","hem","om","maar","ook","er","had","voor","of","als","reeds","door","met","over","aan","mij","was","is","geen","zijn","niet","iemand","het","hij","een","toen","in","toch","die","dat","te","doen","ik","van","op","en","de"]

russian

["а","без","более","бы","был","была","были","было","быть","в","вам","вас","весь","во","вот","все","всего","всех","вы","где","да","даже","для","до","его","ее","ей","ею","если","есть","еще","же","за","здесь","и","из","или","им","их","к","как","ко","когда","кто","ли","либо","мне","может","мы","на","надо","наш","не","него","нее","нет","ни","них","но","ну","о","об","однако","он","она","они","оно","от","очень","по","под","при","с","со","так","также","такой","там","те","тем","то","того","тоже","той","только","том","ты","у","уже","хотя","чего","чей","чем","что","чтобы","чье","чья","эта","эти","это","я"]