PUBLIC - Liferay Portal Community Edition
LPS-105835

Improve Elasticsearch catalan and other analyzers configuration

    Details

      Description

      It is necessary to improve the Elasticsearch catalan analyzer configuration, as the out-of-the-box configuration has some issues:

      • Search results are not the same when you search for a word without accents that was indexed with accents.
      • For example, this is reproduced if you search for "diputacio" but the original indexed word was "diputació" (with an accent on the ó).
      • The root cause of the issue is inside the catalan analyzer of Elasticsearch, which performs the following tokenization:
        • diputació => diput
        • diputacio => diputac
      • So at indexing time, "diputació" is tokenized and stored in Elasticsearch as "diput".
      • But at search time, if somebody doesn't type the ó accent and searches for "diputacio", that query is tokenized to "diputac" and no results are returned (because the information in Elasticsearch was stored as "diput").
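      The divergent tokenization above can be checked with Elasticsearch's _analyze API (no index is needed for the built-in catalan analyzer; the request below is a sketch in the style of the Elasticsearch reference, and the resulting tokens are the ones listed above):

      ```
      POST /_analyze
      {
        "analyzer": "catalan",
        "text": ["diputació", "diputacio"]
      }
      ```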

      This issue can also be reproduced with other Elasticsearch analyzers, for example brazilian or romanian. It cannot be reproduced with analyzers that use a "light" stemmer (see https://www.elastic.co/guide/en/elasticsearch/reference/6.8/analysis-stemmer-tokenfilter.html)
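      For comparison, a filter using one of the light stemmer variants listed in the stemmer token filter reference above would be defined like this (the filter name light_spanish_stemmer here is just an illustrative choice; light_spanish is one of the documented language values). Light stemmers truncate words less aggressively, which is why accented and unaccented forms are more likely to end up as the same token:

      ```
      "light_spanish_stemmer": {
        "type":     "stemmer",
        "language": "light_spanish"
      }
      ```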

      Omitting accents is a very common mistake in both Catalan and Spanish, so in my opinion it is fair to expect searches to be insensitive to accents.

      This issue can be solved by redefining the catalan analyzer with an asciifolding filter during index creation, in the default src/main/resources/META-INF/index-settings.json configuration file:

          "analysis": {
            "filter": {
              "catalan_elision": {
                "type":       "elision",
                "articles":   [ "d", "l", "m", "n", "s", "t"],
                "articles_case": true
              },
              "catalan_stop": {
                "type":       "stop",
                "stopwords":  "_catalan_" 
              },
              "catalan_stemmer": {
                "type":       "stemmer",
                "language":   "catalan"
              }
            },
            "analyzer": {
              "catalan": {
                "tokenizer":  "standard",
                "filter": [
                  "catalan_elision",
                  "lowercase",
                  "asciifolding",
                  "catalan_stop",
                  "catalan_stemmer"
                ]
              }
            }
          }
      
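      As a minimal sketch (plain Python, not Liferay or Elasticsearch code), the effect of the added asciifolding step can be approximated with Unicode decomposition: the accented form is mapped onto the unaccented one before the stemmer ever runs, so both inputs stem to the same token.

      ```python
      import unicodedata

      def ascii_fold(text: str) -> str:
          """Rough approximation of Elasticsearch's asciifolding token filter:
          decompose characters (NFD), then drop the combining accent marks."""
          decomposed = unicodedata.normalize("NFD", text)
          return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

      # With folding applied, both spellings reach the stemmer as the same string.
      print(ascii_fold("diputació"))  # diputacio
      print(ascii_fold("diputacio"))  # diputacio
      ```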

      This custom analyzer was created as a copy of the original catalan analyzer, see:

        People

        • Assignee: support-lep@liferay.com SE Support
        • Reporter: jorge.diaz Jorge Diaz