  1. PUBLIC - Liferay Portal Community Edition
  2. LPS-91059

FileUtil.extractText does not work correctly for documents in docx format (Word 2007+)

    Details

      Description

      FileUtil.extractText does not work correctly for documents in docx format (Word 2007+).

      The file is handled as if it were a zip file: it is unzipped and its internal XML files are indexed instead of the document text.
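A docx file is a ZIP-based package of XML parts, which is why a generic zip handler can open it. The sketch below is for illustration only and is not Liferay code: it builds an in-memory archive that mimics the docx layout (the helper names are hypothetical and the result is not a valid Word document) and shows that it carries the ZIP signature and lists XML entries such as word/document.xml.

```python
import io
import zipfile


def build_minimal_docx_like_package(text: str) -> bytes:
    """Build an in-memory ZIP that mimics the layout of a docx package.

    Assumption for illustration: this is NOT a valid Word document, it only
    reproduces the container structure (a ZIP archive holding XML parts).
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr(
            "word/document.xml",
            "<w:document><w:body><w:p><w:r><w:t>"
            + text
            + "</w:t></w:r></w:p></w:body></w:document>",
        )
    return buffer.getvalue()


def list_parts(data: bytes) -> list:
    """Return the entry names a generic zip handler would see."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return package.namelist()


docx_bytes = build_minimal_docx_like_package("This is a test")

# The archive starts with the ZIP local file header signature.
starts_with_zip_signature = docx_bytes[:4] == b"PK\x03\x04"

# A zip-based extractor sees XML parts, not the visible document text.
parts = list_parts(docx_bytes)
```

An extractor that stops at "this is a zip" would index the raw XML parts listed in `parts` rather than the visible text, which matches the behavior described above.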
       
      Steps to reproduce

      1. Create test files in docx, odt, doc and rtf formats, each with the content:

        This is a test

      2. Alternatively, use the files attached to this JIRA ticket
      3. Create a site and upload those files to its document library
      4. Get the indexed data from Elasticsearch by opening the following URL (note: set your groupId):
        http://localhost:9200/_search?q=%2BgroupId:<<YOUR_GROUPID>>+%2BentryClassName:com.liferay.document.library.kernel.model.DLFileEntry&pretty
      5. Review the "content_en_US" field of each returned document:
        • Expected behavior: every document has the text "This is a test" in its "content_en_US" field
        • Wrong behavior: the Test.docx document does not have the text "This is a test" in its "content_en_US" field and has different content instead. Test.odt, Test.doc and Test.rtf are fine
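The check in the last two steps can be scripted. The following is a minimal sketch, assuming Elasticsearch is reachable at http://localhost:9200 as in the URL above and that each hit exposes its fields under `_source`; the function names are hypothetical, and the use of a `title` field to label documents is an assumption.

```python
import json
import urllib.parse
import urllib.request

DL_FILE_ENTRY = "com.liferay.document.library.kernel.model.DLFileEntry"
EXPECTED_TEXT = "This is a test"


def build_search_url(group_id, base_url="http://localhost:9200"):
    """Build the same query as in step 4, URL-encoding the '+' operators."""
    query = f"+groupId:{group_id} +entryClassName:{DL_FILE_ENTRY}"
    return f"{base_url}/_search?" + urllib.parse.urlencode({"q": query})


def find_wrong_documents(search_response, expected_text=EXPECTED_TEXT):
    """Return a label for every hit whose content_en_US lacks the expected text.

    Assumption: the document name is available as "title"; the hit id is
    used as a fallback label when it is not.
    """
    wrong = []
    for hit in search_response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        content = source.get("content_en_US") or ""
        if expected_text not in content:
            wrong.append(source.get("title", hit.get("_id")))
    return wrong


def fetch_search_response(group_id):
    """Run the search against a local Elasticsearch node (needs network access)."""
    with urllib.request.urlopen(build_search_url(group_id)) as response:
        return json.load(response)
```

With the expected behavior, `find_wrong_documents(fetch_search_response(group_id))` returns an empty list; with the behavior reported here it would include the docx document. Note that `_search` returns only the first 10 hits by default, so a larger test set would need a `size` parameter.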
