Details
- Bug
- Status: Closed
- Resolution: Fixed
- 7.0.0 DXP SP1
- 7.0.x
- Committed
- 1.5
- Mac OS X 10.9
- Oracle Sun JDK 8
- Apache Tomcat 8.0.x
- Safari 9
- HSQLDB 2
- Desktop
Description
Issue
Liferay does not search correctly in Japanese even after proper tuning is applied. Content that should not match appears in the search results.
Steps to reproduce
- Download Elasticsearch elasticsearch-2.3.5
- Unzip and edit <elasticsearch>/config/elasticsearch.yml
- Uncomment cluster.name and set it to LiferayElasticsearchCluster
cluster.name: LiferayElasticsearchCluster
- Install the additional analysis plugins that are installed by default in the embedded ES
bin/plugin install analysis-icu
bin/plugin install analysis-kuromoji
bin/plugin install analysis-smartcn
bin/plugin install analysis-stempel
- Download Kibana kibana-4.5.4
- Unzip and edit <kibana>/config/kibana.yml
- Uncomment elasticsearch.url and make it point to ES (http://localhost:9200)
elasticsearch.url: "http://localhost:9200"
- Install Sense inside Kibana
bin/kibana plugin --install elastic/sense
- Start Elasticsearch
bin/elasticsearch
- Confirm it is running by executing the following command (or accessing http://localhost:9200/)
curl -X GET http://localhost:9200/
- Start Kibana
bin/kibana
- Access Sense at http://localhost:5601/app/sense
- Confirm that Kibana is connected successfully by running the following in Sense (the .kibana index should be listed)
GET _cat/indices
- Prepare and start a clean DXP bundle
- Log in as administrator
- Change to Japanese language (or access http://localhost:8080/ja/)
- Go to Control Panel > Configuration > System Settings > Foundation
- Click on Elasticsearch
- Change the Operation mode to REMOTE
- Set Network hosts to localhost
- Set Transport Port to 9300
- Save
- Shut down Liferay
- Remove <liferay>/data/elasticsearch folder (embedded ES)
- Start Liferay again
- Log in and reindex
- Note that <liferay>/data/elasticsearch is not created (successfully using remote ES)
- Go back to Sense and list all indices
GET _cat/indices
- Note that you should see liferay-0 and liferay-20116 indices
- Create web content with the following titles:
作戦大成功
新大阪
新規作成
東京特許許可局局長
京都
- Go back to home page
- Search for 新規 (Make sure you're still viewing the site in Japanese)
- Note that both 新規作成 and 新大阪 are displayed, whereas only 新規作成 should match
- Search for 作成 (Make sure you're still viewing the site in Japanese)
- Note that both 新規作成 and 作戦大成功 are displayed, whereas only 新規作成 should match
- To make an analogy, it is searching by letter instead of by word
- Go back to Sense
- Test the analyzer
GET /liferay-20116/_search
{ "query": { "match_phrase": { "title": "新規" } } }
GET /liferay-20116/_search
{ "query": { "match_phrase": { "title": "作成" } } }
- Note that only 新規作成 content is picked (correct results)
- Go to Control Panel > Configuration > System Settings > Foundation > Elasticsearch
- In "Additional index configurations" add the following search tuning (tuned kuromoji)
{
  "analysis": {
    "filter": {
      "pos_filter": {
        "type": "kuromoji_part_of_speech",
        "stoptags": [
          "助詞-格助詞-一般",
          "助詞-格助詞-引用",
          "助詞-係助詞",
          "助詞-接続助詞",
          "助詞-終助詞",
          "助詞-特殊",
          "助詞-副詞化",
          "助詞-副助詞",
          "助詞-連体化"
        ]
      }
    },
    "tokenizer": {
      "liferay_kuromoji_tokenizer": {
        "type": "kuromoji_tokenizer",
        "mode": "search"
      }
    },
    "analyzer": {
      "liferay_kuromoji_analyzer": {
        "type": "custom",
        "tokenizer": "liferay_kuromoji_tokenizer",
        "char_filter": [ "html_strip", "kuromoji_iteration_mark" ],
        "filter": [ "lowercase", "kuromoji_baseform", "pos_filter" ]
      }
    }
  }
}
- In "Additional type mappings", add the following map to use the tuned kuromoji
{
  "dynamic_templates": [
    {
      "template_ja": {
        "mapping": {
          "index": "analyzed",
          "store": "true",
          "analyzer": "liferay_kuromoji_analyzer",
          "type": "string",
          "term_vector": "with_positions_offsets"
        },
        "match": "\\w+_ja\\b|\\w+_ja_[A-Z]{2}\\b",
        "match_mapping_type": "string",
        "match_pattern": "regex"
      }
    }
  ]
}
- Go to Control Panel > Configuration > Server Administration
- Reindex with "Reindex all search indexes"
- Go back to home page
- Repeat the searches (Make sure you're still viewing the site in Japanese)
- Note that we get the same results as before the tuning
- Go back to Sense
- Confirm that the index settings are the ones entered in "Additional index configurations"
GET /liferay-20116/_settings
- Confirm that the index mapping settings for Japanese (template_ja) are the ones entered in "Additional type mappings"
GET /liferay-20116/_mappings/LiferayDocumentType
- Test the tokenizer to confirm how the string is tokenized
GET /liferay-20116/_analyze?analyzer=liferay_kuromoji_analyzer
{ "text":"新規" }
GET /liferay-20116/_analyze?analyzer=liferay_kuromoji_analyzer
{ "text":"作成" }
- Note that each string appears to be treated as 1 word (a single token), not as 2 different tokens
- Test the analyzer
GET /liferay-20116/_search
{ "query": { "match_phrase": { "title": "新規" } } }
GET /liferay-20116/_search
{ "query": { "match_phrase": { "title": "作成" } } }
- Note that only 新規作成 content is picked (correct results)
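The tuned analyzer only applies to fields whose names match the template_ja pattern, so it can help to check that pattern against field names offline. The following is a minimal sketch, assuming the pattern must match the whole field name; the helper name uses_kuromoji and the sample field names are hypothetical and not taken from the index.

```python
import re

# Pattern copied from the template_ja "match" entry, with JSON escaping removed.
TEMPLATE_JA = re.compile(r"\w+_ja\b|\w+_ja_[A-Z]{2}\b")


def uses_kuromoji(field_name: str) -> bool:
    """Return True if the field name matches the template_ja pattern.

    Assumption: the pattern is applied to the entire field name,
    so re.fullmatch is used rather than re.search.
    """
    return TEMPLATE_JA.fullmatch(field_name) is not None


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    for name in ("title_ja", "title_ja_JP", "title_en_US", "title"):
        print(name, uses_kuromoji(name))
```

Under that assumption, names ending in _ja or _ja_ followed by two uppercase letters are covered, while non-localized names such as title are not.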
Expected results
Searching from Liferay produces correct results with the default bigram configuration, and also produces correct results when it is configured to use kuromoji
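For reference, the Sense requests used in the steps to obtain the correct hits can be generated programmatically. This is a small sketch that only builds the request text; the function names match_phrase_body and sense_request are hypothetical, and nothing is sent to Elasticsearch.

```python
import json


def match_phrase_body(field: str, text: str) -> dict:
    """Build the match_phrase query body used in the Sense steps."""
    return {"query": {"match_phrase": {field: text}}}


def sense_request(index: str, field: str, text: str) -> str:
    """Format a request as it would be pasted into Sense."""
    # ensure_ascii=False keeps the Japanese text readable instead of \u escapes.
    body = json.dumps(match_phrase_body(field, text), ensure_ascii=False, indent=2)
    return f"GET /{index}/_search\n{body}"


if __name__ == "__main__":
    for term in ("新規", "作成"):
        print(sense_request("liferay-20116", "title", term))
```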
Actual results
Searching from Liferay produces incorrect results even though the tokenizer, analyzer and mappings are tuned correctly
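The analogy from the steps (searching by letter instead of by word) can be pictured with a toy model. The sketch below is not Elasticsearch's or Liferay's analysis chain: it simply splits each title into overlapping character n-grams and reports a hit when any query term is shared. The function names ngrams and hits are hypothetical, and the model illustrates the symptom only; it does not establish the root cause.

```python
from typing import List, Set

# Web content titles from the steps to reproduce.
TITLES = ["作戦大成功", "新大阪", "新規作成", "東京特許許可局局長", "京都"]


def ngrams(text: str, n: int) -> Set[str]:
    """Return the set of overlapping character n-grams of text."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def hits(query: str, n: int) -> List[str]:
    """Titles sharing at least one n-gram with the query (OR semantics)."""
    query_terms = ngrams(query, n)
    return [title for title in TITLES if query_terms & ngrams(title, n)]


if __name__ == "__main__":
    for q in ("新規", "作成"):
        print(q, "per character:", hits(q, 1), "per bigram:", hits(q, 2))
```

In this toy model, per-character matching (n = 1) returns 新大阪 alongside 新規作成 for 新規, and 作戦大成功 alongside 新規作成 for 作成, which mirrors the hits seen in Liferay. Bigram matching (n = 2) returns only 新規作成 for both queries.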
Issue Links
- is related to
  - LPS-69805 DDM search not picking up correct results in Japanese (Closed)
  - LPS-69806 Calendar event search not picking up correct results in Japanese (Closed)
- relates
  - LPS-70315 Journal Article non-localized content field is being searched in Elasticsearch when searching with other Assets/portlets (Closed)
- Testing discovered
  - LPS-78119 Search is giving locale based results even for administrator role for Web Content articles (Closed)