  1. PUBLIC - Liferay Portal Community Edition
  2. LPS-91446

Microsoft OpenXML files are handled as zip archives by Tika

    Details

      Description

      When you upload a file to the Documents and Media library, the Tika library analyzes it to extract common metadata.

      Tika provides parsers for many formats, including the Microsoft OpenXML formats (.docx, .xlsx, .pptx).

       

      Liferay customizes the list of available parsers in the tika.xml file. Unfortunately, this file does not include the OpenXML parsers.

      As a result, a .docx file is handled as a zip archive and no metadata is extracted (see view_file.png).

       

      This causes a more serious problem for document indexing, because the abstract is built from text extracted from the archive's internal XML files (see search_result.png).
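
      To illustrate why this happens, here is a minimal sketch using only the JDK. It does not use Tika or Liferay code. It builds a tiny in-memory archive laid out like a .docx file (the class name, helper names, and sample XML are hypothetical) and lists what a generic zip reader sees: internal XML parts rather than a Word document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class; not part of Liferay or Tika.
public class DocxAsZipSketch {

    // Builds a tiny in-memory archive with the same layout as a .docx file.
    // The XML content is a simplified stand-in, not a valid Word document.
    static byte[] buildMinimalDocx() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            addEntry(zip, "[Content_Types].xml", "<Types/>");
            addEntry(zip, "word/document.xml",
                "<w:document><w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body></w:document>");
        }
        // The zip stream is closed here, so the archive is complete.
        return bytes.toByteArray();
    }

    // Writes one text entry into the archive.
    static void addEntry(ZipOutputStream zip, String name, String content) throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(content.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }

    // Lists entry names the way a format-agnostic zip reader would see them.
    static List<String> listEntries(byte[] archive) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Prints the internal parts, including word/document.xml.
        System.out.println(listEntries(buildMinimalDocx()));
    }
}
```

      A parser that only understands the zip container walks these entries and extracts the text of each XML part, which is consistent with the raw XML text appearing in the search abstract.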

       

      Please add the following parsers to the tika.xml file:

       

      <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
      <parser class="org.apache.tika.parser.microsoft.xml.WordMLParser" />
      <parser class="org.apache.tika.parser.microsoft.xml.SpreadsheetMLParser" />
      



                Packages

                Version Package
                7.0.0 DXP FP79
                7.0.X
                7.1.10 DXP FP9
                7.1.X
                Master