  1. PUBLIC - Liferay Portal Community Edition
  2. LPS-70810

Japanese text is not properly being extracted from documents for indexing

    Details

      Description

      Steps to reproduce

      1) Add the attached text files to Documents and Media:
      ① en-ja-sjis-crlf-word.txt
      ② en-ja-sjis-crlf.txt
      ③ en-ja-sjis-lf-word.txt
      ④ en-ja-sjis-lf.txt
      ⑤ en-sjis-crlf.txt
      ⑥ en-sjis-lf.txt
      ⑦ ja-sjis-crlf-short.txt
      ⑧ ja-sjis-lf-short.txt

      2) Go to a site page and run these two searches from the Search portlet:
      Case 1:
      Search for "東京"
      Case 2:
      Search for "ahead"
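
      The attachment contents are not reproduced in this report, so the following is only a rough sketch of how comparable Shift_JIS test data could be built and checked. The class name SjisSample, the helper methods, and the sample text are all hypothetical. It encodes a line containing "ahead" and a line containing "東京" with either CRLF or LF line endings, then shows that the Japanese term is only recoverable when the bytes are decoded as Shift_JIS, while the ASCII term survives a wrong decoding. Whether charset handling is the actual cause of this issue is an assumption, not something the report confirms.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SjisSample {

    // "東京" written with Unicode escapes so the source file encoding does not matter.
    static final String TOKYO = "\u6771\u4eac";

    static final Charset SJIS = Charset.forName("Shift_JIS");

    // Hypothetical file body; the real attachments may contain different text.
    static byte[] encode(String eol) {
        String text = "go ahead" + eol + TOKYO + eol;
        return text.getBytes(SJIS);
    }

    // Decodes the bytes with the given charset and checks for the search term.
    static boolean containsAfterDecoding(byte[] bytes, Charset charset, String term) {
        return new String(bytes, charset).contains(term);
    }

    public static void main(String[] args) {
        for (String eol : new String[] {"\r\n", "\n"}) {
            byte[] bytes = encode(eol);
            String label = eol.equals("\n") ? "LF" : "CRLF";
            System.out.println(label + " decoded as Shift_JIS finds term: "
                + containsAfterDecoding(bytes, SJIS, TOKYO));
            System.out.println(label + " decoded as UTF-8 finds term: "
                + containsAfterDecoding(bytes, StandardCharsets.UTF_8, TOKYO));
        }
    }
}
```

      If an extractor guessed the wrong charset for a file, a search for "東京" would miss that file while a search for "ahead" could still match it, so a check like this may help narrow down which stage drops the text.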

      Expected behavior
      Case 1:
      ①②③④⑦⑧ should be returned
      Case 2:
      ①②③④⑤⑥ should be returned

      Actual behavior
      Case 1:
      Only ②④ are returned in the results
      Case 2:
      Only ②④⑤⑥ are returned in the results

      Reproduced on
      master - e9ed60f

        Attachments

        1. 1-en-ja-sjis-crlf-word.txt
          0.0 kB
        2. 2-en-ja-sjis-crlf.txt
          0.2 kB
        3. 3-en-ja-sjis-lf-word.txt
          0.0 kB
        4. 4-en-ja-sjis-lf.txt
          0.2 kB
        5. 5-en-sjis-crlf.txt
          0.1 kB
        6. 6-en-sjis-lf.txt
          0.1 kB
        7. 7-ja-sjis-crlf-short.txt
          0.0 kB
        8. 8-ja-sjis-lf-short.txt
          0.0 kB

              Dates

              • Days since last comment:
                2 years, 31 weeks ago

                Packages

                Version Package
                7.0.0 DXP SP2
                7.0.0 DXP FP13
                7.0.0 DXP SP3
                7.0.3 CE GA4
                7.1.X
                Master