LPS-91059

FileUtil.extractText does not work correctly for documents in docx format (Word 2007+)

Details

    Description

      FileUtil.extractText does not work correctly for documents in docx format (Word 2007+).

      A docx file is handled as if it were a plain zip file: it is unzipped and its internal XML files are indexed instead of the document text.
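
      To illustrate why a docx file can be mistaken for a zip file, here is a minimal sketch that uses only java.util.zip. It is not Liferay's implementation. The class DocxIsZipDemo and its helper methods are hypothetical names, and the archive it builds is a stripped-down stand-in for a real Word document, which has more parts. It shows that such a file starts with the zip signature and that a generic zip reader sees only internal XML entries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of Liferay.
public class DocxIsZipDemo {

    // Builds a tiny docx-like archive in memory. Assumption: two entries are
    // enough for illustration; a real docx contains more parts.
    static byte[] buildMinimalDocx(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zip.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();

            zip.putNextEntry(new ZipEntry("word/document.xml"));
            String xml = "<w:document><w:body><w:p><w:r><w:t>" + text
                + "</w:t></w:r></w:p></w:body></w:document>";
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // True if the data starts with the zip local file header signature "PK\3\4".
    static boolean looksLikeZip(byte[] data) {
        return data.length >= 4 && data[0] == 'P' && data[1] == 'K'
            && data[2] == 3 && data[3] == 4;
    }

    // Lists the entry names, which is all a generic zip handler gets to see.
    static List<String> listEntries(byte[] archive) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        byte[] docx = buildMinimalDocx("This is a test");
        System.out.println("Looks like zip: " + looksLikeZip(docx));
        System.out.println("Entries: " + listEntries(docx));
    }
}
```

      A detector that checks only the leading bytes cannot tell this file apart from an ordinary zip, which is consistent with the indexed content coming from the internal XML files.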
       
      Steps to reproduce

      1. Create test files in docx, odt, doc and rtf formats with the content:

        This is a test

      2. Alternatively, use the files attached to this JIRA ticket
      3. Create a site and upload those files to its document library
      4. Get the indexed data from Elasticsearch by opening the following URL (note: set your groupId):
        http://localhost:9200/_search?q=%2BgroupId:<<YOUR_GROUPID>>+%2BentryClassName:com.liferay.document.library.kernel.model.DLFileEntry&pretty
      5. Review the "content_en_US" field of each returned document:
        • Expected behavior: all documents have the text "This is a test" in the "content_en_US" field
        • Wrong behavior: the Test.docx document does not have the text "This is a test" in its "content_en_US" field; it has different text instead. Test.odt, Test.doc and Test.rtf are fine
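
      The check in the last two steps can be sketched as a small helper. This is only an illustration: the class IndexedContentCheck, its method names and the group ID 20126 are hypothetical, and the code builds the URL and tests a field value without contacting Elasticsearch.

```java
// Hypothetical helper for the manual verification steps; not part of Liferay.
public class IndexedContentCheck {

    static final String ENTRY_CLASS =
        "com.liferay.document.library.kernel.model.DLFileEntry";

    static final String EXPECTED_TEXT = "This is a test";

    // Builds the search URL from the steps above for a given group ID.
    // "%2B" is the URL-encoded "+" (required term) and the bare "+" is a space.
    static String buildSearchUrl(String host, long groupId) {
        return host + "/_search?q=%2BgroupId:" + groupId
            + "+%2BentryClassName:" + ENTRY_CLASS + "&pretty";
    }

    // Returns true if the value of a "content_en_US" field contains the test text.
    static boolean hasExpectedContent(String contentField) {
        return contentField != null && contentField.contains(EXPECTED_TEXT);
    }

    public static void main(String[] args) {
        // 20126 is a placeholder group ID; replace it with your site's groupId.
        System.out.println(buildSearchUrl("http://localhost:9200", 20126L));
    }
}
```

      Copy the "content_en_US" value of each returned document into hasExpectedContent; with the behavior reported here, the value for Test.docx would be expected to fail the check while the other three pass.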

            People

              yvonne.han Yvonne Han
              jorge.diaz Jorge Diaz
              Marta Elicegui

              Dates

                Created:
                Updated:
                Resolved: 4 years, 3 weeks ago

                Packages

                  Version Package
                  6.2.X EE
                  7.0.0 DXP FP79
                  7.0.10.11 DXP SP11
                  7.0.X
                  7.1.10 DXP FP10
                  7.1.10.2 SP2
                  7.1.3 CE GA4
                  7.1.X
                  Master