  1. PUBLIC - Liferay Portal Community Edition
  2. LPS-91446

Microsoft OpenXML files are treated as zip archives by Tika

    Details

      Description

      When you upload a file to the Documents and Media library, it is analyzed by the Tika library to extract common metadata.

      Tika provides parsers for many formats, including the Microsoft OpenXML formats (.docx, .xlsx, .pptx).

      Liferay customizes the list of available parsers in the file tika.xml. Unfortunately, this file does not include the OpenXML parsers.

      As a result, a .docx file is handled as a plain zip archive and no metadata is extracted (see view_file.png).

      This causes a more serious problem for document indexing, because the abstract shown in search results is built from text extracted from the archive's internal XML files rather than from the document's readable content (see search_result.png).
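To illustrate why this happens, here is a minimal sketch using only the Java standard library, not Tika itself. It assumes a tiny stand-in for a .docx built in memory: the part names `docProps/core.xml` and `word/document.xml` follow the real OOXML package layout, but the XML content is simplified (no namespaces) and the class and method names are hypothetical. It contrasts what a generic zip view sees (raw markup) with what a format-aware reader would pick out (title and body text).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class OoxmlAsZipDemo {

    // Builds a simplified, hypothetical stand-in for a .docx: a zip holding XML parts.
    static byte[] buildSampleDocx() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            addPart(zip, "docProps/core.xml",
                "<cp:coreProperties><dc:title>Quarterly report</dc:title></cp:coreProperties>");
            addPart(zip, "word/document.xml",
                "<w:document><w:body><w:p><w:r><w:t>Hello world</w:t></w:r></w:p></w:body></w:document>");
        }
        // The stream is closed here, so the zip central directory has been written.
        return buffer.toByteArray();
    }

    static void addPart(ZipOutputStream zip, String name, String xml) throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(xml.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }

    // Reads every entry of the archive into a map of part name to text content.
    static Map<String, String> readParts(byte[] archive) throws IOException {
        Map<String, String> parts = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            byte[] chunk = new byte[4096];
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    content.write(chunk, 0, n);
                }
                parts.put(entry.getName(), new String(content.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return parts;
    }

    // Generic zip view: the concatenated raw XML of all parts, markup included.
    static String naiveText(Map<String, String> parts) {
        return String.join(" ", parts.values());
    }

    // Format-aware view (very crude): strip tags from the main document part only.
    static String bodyText(Map<String, String> parts) {
        return parts.get("word/document.xml")
            .replaceAll("<[^>]+>", " ").trim().replaceAll("\\s+", " ");
    }

    // Format-aware metadata: the title stored in the core properties part.
    static String title(Map<String, String> parts) {
        Matcher m = Pattern.compile("<dc:title>(.*?)</dc:title>")
            .matcher(parts.get("docProps/core.xml"));
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> parts = readParts(buildSampleDocx());
        System.out.println("Parts:      " + parts.keySet());
        System.out.println("Zip view:   " + naiveText(parts));
        System.out.println("Title:      " + title(parts));
        System.out.println("Body text:  " + bodyText(parts));
    }
}
```

The zip view is roughly what ends up in the search abstract when no OpenXML parser is registered; a dedicated parser knows which parts hold metadata and which hold body text.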

      Please add the following parser entries to the tika.xml file:

      <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
      <parser class="org.apache.tika.parser.microsoft.xml.WordMLParser" />
      <parser class="org.apache.tika.parser.microsoft.xml.SpreadsheetMLParser" />
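One way to check whether a given tika.xml already declares these parsers is sketched below with the JDK's built-in XML parser. The class name `TikaConfigCheck`, the method `missingParsers`, and the `<properties><parsers>` wrapper in the sample string are assumptions for illustration; the real layout of Liferay's tika.xml may differ. The lookup only relies on `<parser class="..."/>` elements, as shown in the entries above, wherever they are nested.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class TikaConfigCheck {

    // The three parser classes requested in this issue.
    static final String[] REQUIRED = {
        "org.apache.tika.parser.microsoft.ooxml.OOXMLParser",
        "org.apache.tika.parser.microsoft.xml.WordMLParser",
        "org.apache.tika.parser.microsoft.xml.SpreadsheetMLParser"
    };

    // Returns the required parser classes that the given config text does not declare.
    static List<String> missingParsers(String configXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(configXml.getBytes(StandardCharsets.UTF_8)));

        // Collect the class attribute of every <parser> element, at any depth.
        NodeList parsers = doc.getElementsByTagName("parser");
        List<String> declared = new ArrayList<>();
        for (int i = 0; i < parsers.getLength(); i++) {
            declared.add(((Element) parsers.item(i)).getAttribute("class"));
        }

        List<String> missing = new ArrayList<>();
        for (String required : REQUIRED) {
            if (!declared.contains(required)) {
                missing.add(required);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical config content; the wrapper elements are an assumption.
        String sample = "<properties><parsers>"
            + "<parser class=\"org.apache.tika.parser.microsoft.ooxml.OOXMLParser\"/>"
            + "</parsers></properties>";
        System.out.println("Missing parsers: " + missingParsers(sample));
    }
}
```

To run it against a real file, the config text could be loaded with `java.nio.file.Files.readString` (Java 11+) and passed to `missingParsers`.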
      


            People

            Assignee:
            michael.saechang Michael Saechang
            Reporter:
            maumar Mauro Mariuzzo
            Participants of an Issue:
            Recent user:
            Michael Saechang

              Dates

              Created:
              Updated:
              Resolved:
              Days since last comment:
              2 years, 41 weeks, 5 days ago

                Packages

                Version Package
                7.0.0 DXP FP79
                7.0.X
                7.1.10 DXP FP9
                7.1.X
                Master