Getting Started with Languages
Elasticsearch ships with a collection of language analyzers that provide good, basic, out-of-the-box support for many of the world’s most common languages:
Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.
These analyzers typically perform four roles:
- Tokenize text into individual words: The quick brown foxes → [The, quick, brown, foxes]
- Lowercase tokens: The → the
- Remove common stopwords: [The, quick, brown, foxes] → [quick, brown, foxes]
- Stem tokens to their root form: foxes → fox
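
The four roles can be sketched as a small pipeline. This is only a toy illustration in Python, not how Elasticsearch implements its analyzers: the `STOPWORDS` set, the `naive_stem` rule, and all function names are hypothetical stand-ins chosen for the example, and the real analyzers use much longer stopword lists and proper stemming algorithms.

```python
import re

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of"}


def tokenize(text):
    """Split text into word tokens, treating any non-letter as a separator."""
    return re.findall(r"[A-Za-z]+", text)


def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Drop tokens that appear in the (toy) stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Crude plural stripping; a stand-in for a real stemming algorithm."""
    if token.endswith("es") and token[-3:-2] in {"x", "s", "z"}:
        return token[:-2]
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Run the four steps in order: tokenize, lowercase, stop, stem."""
    tokens = tokenize(text)
    tokens = lowercase(tokens)
    tokens = remove_stopwords(tokens)
    return [naive_stem(t) for t in tokens]


print(analyze("The quick brown foxes"))  # ['quick', 'brown', 'fox']
```

In this sketch, lowercasing runs before stopword removal so that `The` matches the lowercase entry `the` in the stopword list.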
Each analyzer may also apply other transformations specific to its language in order to make words from that language more searchable:
- The english analyzer removes the possessive 's: John’s → john
- The french analyzer removes elisions like l' and qu' and diacritics like ¨ or ^: l'église → eglis
- The german analyzer normalizes terms, replacing ä and ae with a, or ß with ss, among others: äußerst → ausserst
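
The language-specific transformations above can be approximated with a few string operations. The following is a rough Python sketch under stated assumptions, not the analyzers' actual code: every function name is hypothetical, the rules are deliberately simplified (the real german normalization, for example, applies more rules than shown here), and stemming is left out, so the French example yields `eglise` rather than the fully stemmed `eglis`.

```python
import re
import unicodedata


def strip_english_possessive(token):
    """Drop a trailing 's, written with a straight or curly apostrophe."""
    return re.sub(r"['’]s$", "", token)


def strip_french_elision(token):
    """Drop a leading elided form such as l' or qu'."""
    return re.sub(r"^(?:l|qu)['’]", "", token, flags=re.IGNORECASE)


def fold_diacritics(token):
    """Decompose characters and discard combining marks (é becomes e)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def normalize_german(token):
    """Simplified rules: ß becomes ss, umlauts are folded, ae becomes a."""
    token = token.replace("ß", "ss")
    token = fold_diacritics(token)
    return token.replace("ae", "a")


print(strip_english_possessive("John’s".lower()))         # john
print(fold_diacritics(strip_french_elision("l'église")))  # eglise
print(normalize_german("äußerst"))                        # ausserst
```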