Details
Bug
Status: Closed
Resolution: Fixed
6.2.X EE, 7.0.X, 7.1.X, Master
7.1.x, 7.0.x, 6.2.x
Committed
3
021 - Spearow
Regression Bug
Description
FileUtil.extractText does not work correctly for documents in docx format (Word 2007+).
The file is handled as if it were a zip file: it is unzipped and its internal XML files are indexed instead of the document text.
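The following is a minimal sketch, not Liferay code, of why this misdetection is plausible: a docx file is a ZIP container, so a check based only on the leading bytes cannot tell it apart from an ordinary zip archive. The class `DocxContainerSketch` and all of its methods are hypothetical names used for illustration, and the in-memory archive only mimics the entry layout of a typical Word file; it is not a valid document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class; not part of Liferay.
class DocxContainerSketch {

    // Builds a tiny docx-like ZIP in memory (entry names only; not a valid Word file).
    static byte[] buildDocxLikeZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zip.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<w:t>This is a test</w:t>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // True if the data starts with the ZIP local file header signature "PK\3\4".
    static boolean hasZipSignature(byte[] data) {
        return data.length >= 4
            && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
    }

    // Lists the entry names of a ZIP archive held in memory.
    static List<String> listEntries(byte[] data) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Heuristic (an assumption for this sketch): typical Word packages contain these two entries.
    static boolean looksLikeDocx(List<String> entryNames) {
        return entryNames.contains("[Content_Types].xml")
            && entryNames.contains("word/document.xml");
    }

    public static void main(String[] args) {
        byte[] data = buildDocxLikeZip();
        List<String> names = listEntries(data);
        System.out.println("ZIP signature: " + hasZipSignature(data));
        System.out.println("Entries: " + names);
        System.out.println("Looks like docx: " + looksLikeDocx(names));
    }
}
```

A detector that stops at the signature check would route this file to zip handling and index the XML entries, which matches the symptom described above. Inspecting the entry names first is one possible way to tell the two apart; whether the actual fix does this is not stated in this ticket.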
Steps to reproduce
- Create test files in docx, odt, doc and rtf formats, each with the content:
This is a test
- You can also use the files attached to this JIRA ticket
- Create a site and add those files to its Document Library
- Get the indexed data from Elasticsearch by opening the following URL (note: set your groupId):
http://localhost:9200/_search?q=%2BgroupId:<<YOUR_GROUPID>>+%2BentryClassName:com.liferay.document.library.kernel.model.DLFileEntry&pretty
- Review the "content_en_US" field of each returned document:
- Expected behavior: all documents have the text "This is a test" in the "content_en_US" field
- Wrong behavior: the Test.docx document does not have the text "This is a test" in the "content_en_US" field; it has different content instead. Test.odt, Test.doc and Test.rtf are fine
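The verification steps above can be sketched as a small helper that builds the search URL for a given groupId and checks a "content_en_US" value against the expected text. This is only an illustration: the class `IndexedContentCheck`, its methods, and the groupId value 20143 are hypothetical, and the sketch does not contact Elasticsearch itself.

```java
// Hypothetical helper for the reproduction steps; not part of Liferay.
class IndexedContentCheck {

    static final String ENTRY_CLASS_NAME =
        "com.liferay.document.library.kernel.model.DLFileEntry";

    // Builds the search URL from the steps, substituting the given groupId.
    static String buildSearchUrl(String host, long groupId) {
        return host + "/_search?q=%2BgroupId:" + groupId
            + "+%2BentryClassName:" + ENTRY_CLASS_NAME + "&pretty";
    }

    // True if the indexed content field contains the expected document text.
    static boolean hasExpectedContent(String contentField, String expectedText) {
        return contentField != null && contentField.contains(expectedText);
    }

    public static void main(String[] args) {
        // 20143 is a made-up groupId; replace it with the groupId of your site.
        System.out.println(buildSearchUrl("http://localhost:9200", 20143L));
        System.out.println(hasExpectedContent("This is a test", "This is a test"));
    }
}
```

The value passed to `hasExpectedContent` would be copied from the "content_en_US" field of each returned document; for Test.docx, the bug shows up as this check failing.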