  1. PUBLIC - Liferay Portal Community Edition
  2. LPS-67687

Liferay not searching correctly in Japanese even though proper tuning is performed

    Details

    • Branch Version/s:
      7.0.x
    • Backported to Branch:
      Committed
    • Story Points:
      1.5
    • OS:
      Mac OS X 10.9
    • JDK:
      Oracle Sun JDK 8
    • Application Servers:
      Apache Tomcat 8.0.x
    • Browsers:
      Safari 9
    • Databases:
      HSQLDB 2
    • Device Type:
      Desktop

      Description

      Issue
      Liferay does not search correctly in Japanese even after proper tuning is applied. Content that should not match appears in the search results.

      Steps to reproduce

      1. Download Elasticsearch elasticsearch-2.3.5
      2. Unzip and edit <elasticsearch>/config/elasticsearch.yml
      3. Uncomment cluster.name and set it to LiferayElasticsearchCluster
        cluster.name: LiferayElasticsearchCluster
        
      4. Install the additional analysis plugins that are installed by default in the embedded ES
        bin/plugin install analysis-icu
        bin/plugin install analysis-kuromoji
        bin/plugin install analysis-smartcn
        bin/plugin install analysis-stempel
        
      5. Download Kibana kibana-4.5.4
      6. Unzip and edit <kibana>/config/kibana.yml
      7. Uncomment elasticsearch.url and make it point to ES (http://localhost:9200)
        elasticsearch.url: "http://localhost:9200"
        
      8. Install Sense inside Kibana
        bin/kibana plugin --install elastic/sense
        
      9. Start Elasticsearch
        bin/elasticsearch
        
      10. Confirm it is running by executing the following command (or accessing http://localhost:9200/)
        curl -X GET http://localhost:9200/
        
      11. Start Kibana
        bin/kibana
        
      12. Access Sense at http://localhost:5601/app/sense
      13. Confirm that Kibana is connected successfully by running the following in Sense (the .kibana index should be listed)
        GET _cat/indices
        
      14. Prepare and start a clean DXP bundle
      15. Login as administrator
      16. Change to Japanese language (or access http://localhost:8080/ja/)
      17. Go to Control Panel > Configuration > System Settings > Foundation
      18. Click on Elasticsearch
      19. Change the Operation mode to REMOTE
      20. Set Network hosts to localhost
      21. Set Transport Port to 9300
      22. Save
      23. Shutdown Liferay
      24. Remove <liferay>/data/elasticsearch folder (embedded ES)
      25. Start Liferay again
      26. Login and reindex
      27. Note that <liferay>/data/elasticsearch is not created (successfully using remote ES)
      28. Go back to Sense and list all indices
        GET _cat/indices
        
      29. Note that you should see liferay-0 and liferay-20116 indices
      30. Create web content with the following titles:
        作戦大成功
        新大阪
        新規作成
        東京特許許可局局長
        京都
      31. Go back to home page
      32. Search for 新規 (Make sure you're still viewing the site in Japanese)
      33. Note that both 新規作成 and 新大阪 are displayed, although only 新規作成 should match
      34. Search for 作成 (Make sure you're still viewing the site in Japanese)
      35. Note that both 新規作成 and 作戦大成功 are displayed, although only 新規作成 should match
      36. By analogy, the search behaves as if it matches by letter instead of by word
      37. Go back to Sense
      38. Test the analyzer
        GET /liferay-20116/_search
        {
          "query": { "match_phrase": { "title": "新規" } }
        }
        
        GET /liferay-20116/_search
        {
          "query": { "match_phrase": { "title": "作成" } }
        }
        
      39. Note that only 新規作成 content is picked (correct results)
      40. Go to Control Panel > Configuration > System Settings > Foundation > Elasticsearch
      41. In "Additional index configurations" add the following search tuning (tuned kuromoji)
        {
          "analysis": {
            "filter": {
              "pos_filter": {
                "type": "kuromoji_part_of_speech",
                "stoptags": [
                  "助詞-格助詞-一般",
                  "助詞-格助詞-引用",
                  "助詞-係助詞",
                  "助詞-接続助詞",
                  "助詞-終助詞",
                  "助詞-特殊",
                  "助詞-副詞化",
                  "助詞-副助詞",
                  "助詞-連体化"
                ]
              }
            },
            "tokenizer": {
              "liferay_kuromoji_tokenizer": {
                "type": "kuromoji_tokenizer",
                "mode": "search"
              }
            },
            "analyzer": {
              "liferay_kuromoji_analyzer": {
                "type": "custom",
                "tokenizer": "liferay_kuromoji_tokenizer",
                "char_filter": [
                  "html_strip",
                  "kuromoji_iteration_mark"
                ],
                "filter": [
                  "lowercase",
                  "kuromoji_baseform",
                  "pos_filter"
                ]
              }
            }
          }
        }
        
      42. In "Additional type mappings", add the following mapping to use the tuned kuromoji analyzer
        {
          "dynamic_templates": [
            {
              "template_ja": {
                "mapping": {
                  "index": "analyzed",
                  "store": "true",
                  "analyzer": "liferay_kuromoji_analyzer",
                  "type": "string",
                  "term_vector": "with_positions_offsets"
                },
                "match": "\\w+_ja\\b|\\w+_ja_[A-Z]{2}\\b",
                "match_mapping_type": "string",
                "match_pattern": "regex"
              }
            }
          ]
        }
        
      43. Go to Control Panel > Configuration > Server Administration
      44. Reindex with "Reindex all search indexes"
      45. Go back to home page
      46. Repeat the searches (Make sure you're still viewing the site in Japanese)
      47. Note that the results are the same as before the tuning
      48. Go back to Sense
      49. Confirm that the index settings are the ones specified in step 41
        GET /liferay-20116/_settings
        
      50. Confirm that the index mapping settings for Japanese (template_ja) are the ones specified in step 42
        GET /liferay-20116/_mappings/LiferayDocumentType
        
      51. Test the tokenizer to confirm how the string is tokenized
        GET /liferay-20116/_analyze?analyzer=liferay_kuromoji_analyzer
        { "text":"新規" }
        
        GET /liferay-20116/_analyze?analyzer=liferay_kuromoji_analyzer
        { "text":"作成" }
        
      52. Note that each string appears to be treated as one word rather than split into two tokens
      53. Test the analyzer
        GET /liferay-20116/_search
        {
          "query": { "match_phrase": { "title": "新規" } }
        }
        
        GET /liferay-20116/_search
        {
          "query": { "match_phrase": { "title": "作成" } }
        }
        
      54. Note that only 新規作成 content is picked (correct results)
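
      The contrast between steps 33 and 35 (extra hits in the portal) and steps 39 and 54 (correct hits with match_phrase) can be illustrated with a toy model. The sketch below is not Liferay or Elasticsearch code. It assumes, purely for illustration, that the title is tokenized one character at a time and that the portal search hits on any shared token. Both are assumptions, not findings from this ticket, and the function names are hypothetical.

```python
# Illustrative toy model only; not Liferay or Elasticsearch code.
# Assumption: titles are tokenized into single characters (unigrams).

TITLES = ["作戦大成功", "新大阪", "新規作成", "東京特許許可局局長", "京都"]


def unigrams(text):
    """Split text into single-character tokens (assumed tokenization)."""
    return list(text)


def match_any(query, titles):
    """OR semantics: a title hits if it shares at least one token with the query."""
    wanted = set(unigrams(query))
    return [t for t in titles if wanted & set(unigrams(t))]


def match_phrase(query, titles):
    """Phrase semantics: the query tokens must appear consecutively and in order."""
    q = unigrams(query)
    hits = []
    for t in titles:
        tokens = unigrams(t)
        windows = range(len(tokens) - len(q) + 1)
        if any(tokens[i:i + len(q)] == q for i in windows):
            hits.append(t)
    return hits


if __name__ == "__main__":
    for query in ("新規", "作成"):
        print(query, "any-token:", match_any(query, TITLES))
        print(query, "phrase:   ", match_phrase(query, TITLES))
```

      Under these assumptions, any-token matching returns the same extra titles reported in steps 33 and 35, while phrase matching returns only 新規作成. This shows that the symptoms are consistent with per-character matching, but it does not establish the actual cause.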

      Expected results
      Searching from Liferay produces correct results with the default bigram configuration, and also produces correct results when it is configured to use kuromoji

      Actual results
      Searching from Liferay produces incorrect results even though the tokenizer, analyzer, and mappings are tuned correctly
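
      The Sense checks in the steps above use match_phrase, and this report does not state which query type the portal search sends. To compare query types against the same index, a sketch like the following prints Sense-style request bodies for match_phrase, a plain match, and a match with "operator": "and". The helper name search_request is hypothetical, and the idea that the portal might issue a plain match query is an assumption, not something confirmed here. The index and field names are taken from the steps above.

```python
import json

# Index name as listed in the reproduction steps.
INDEX = "liferay-20116"


def search_request(query_type, text, operator=None, field="title"):
    """Return (path, body) for a Sense-style GET request (hypothetical helper)."""
    # A plain string uses the query's defaults; the object form sets the operator.
    clause = text if operator is None else {"query": text, "operator": operator}
    return "/%s/_search" % INDEX, {"query": {query_type: {field: clause}}}


if __name__ == "__main__":
    variants = (("match_phrase", None), ("match", None), ("match", "and"))
    for text in ("新規", "作成"):
        for query_type, operator in variants:
            path, body = search_request(query_type, text, operator)
            print("GET", path)
            print(json.dumps(body, ensure_ascii=False, indent=2))
```

      Pasting each printed request into Sense and comparing the hits would show whether the extra results depend on the query type rather than on the analyzer settings.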

        Attachments

        1. ee-7.0.x.png (37 kB)
        2. master.png (30 kB)
        3. reproduce.png (46 kB)

              People

              • Votes: 1
              • Watchers: 5

                Dates

                • Days since last comment: 2 years, 20 weeks, 1 day ago

                  Packages

                  Version Package
                  7.0.0 DXP FP24
                  7.0.X EE
                  7.0.4 CE GA5
                  Master