

Problems indexing Office documents/extraction of metadata/plugins


  • Problems indexing Office documents/extraction of metadata/plugins

    Hi,

    I have a site with a large number of Office 97 - 2010 documents (doc, docx, ppt, pptx, xls and xlsx). I am indexing the site offline, and had initially switched on the option to Extract metadata for the Office 2007 file types. I am not indexing content. However, the resulting database seems to be corrupt: the same document title is used multiple times in the search results for unrelated files. For example, if 10 results are returned, the title can be the same for most or all of them even though the summary and the URL are different. Switching off the option to Extract metadata from Office 2007 files fixes the issue, but I would rather it worked, because switching it off reveals another issue (see below).

    If I switch off the option to Extract metadata, the summary text appears to be corrupted, both in the search results and in the zdat files. I can see that the summary has come from the text within the document, but numbers have been inserted into it. For Word docx files, these numbers seem to be either "4207510 -948690 0 0" or "493395 36830 0 0 2238375 36830 0 0 4013835 36830 0 0". They appear where there is a carriage return or where the text has been laid out in a table. For example, a two-line heading which is presented as:

    Age
    Data Profile

    in the Word document, is being extracted and displayed in the summary as "Age 4207510 -948690 0 0 Data Profile". Then there is a 4-column table where the text is presented as:

    Level | Usage | Compilation | Product Availability
    Person | Prospecting | Modelled | ConsumerView

    and this is being extracted and displayed in the summary as "Level Usage Compilation Product Availability 493395 36830 0 0 2238375 36830 0 0 4013835 36830 0 0 ConsumerView Monthly Person Prospecting Modelled". Note that the order of the text is different to how it is presented in the document.

    I also noticed that the summary text is not from where I would have expected it. In the example, the heading has been extracted followed by text within a table at the bottom of a page. But there are more meaningful introductory paragraphs which could have been used but haven't. The same is true of other Office-based files.

    Any thoughts or solutions on the above? I know that I could create ".desc" files to solve the issue, but I would need to create these for hundreds of files and also set up a MIME type on the server, so I really don't want to go down this route if at all possible. If I could solve the first issue, and have Extract metadata switched on without the repetition of titles, this would be a step in the right direction. I have a rather tight deadline to meet this week, so could really do with an early solution or fix.

    Many thanks,

    Russ

  • #2
    None of what you describe is typical, so I think the best way to diagnose the problem is for you to send us the .docx files in question and your .zcfg configuration file, so we can try to reproduce the problem here.

    Without seeing the files in question, I can't speculate about what might be leading to the additional numbers being extracted. They could, for example, be values from a hidden data field or something else.

    Likewise with the summary text being extracted, I would have to see the document to determine what is extracted. For some data, the order and layout are easy to determine from the file format structure. Other data can be more difficult if the document has floating layers or scripting whose layout cannot easily be determined until rendering occurs.

    The sooner you can get us the files (more than one DOCX if you want to show us how the repetition of titles occurs, plus the ZCFG configuration file), the more data we have and the sooner we can hope to have a solution for you. You can find our email details on the Contact Us page.
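    One way to narrow this down before sending files is to look inside the DOCX itself. A .docx file is a ZIP archive whose main text lives in word/document.xml, so a short script can report which XML elements hold a given string such as "4207510". This is only a diagnostic sketch in Python: the function names are made up for illustration, and it says nothing about how the Zoom indexer reads the file.

```python
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_text_owners(docx_path, needle, part="word/document.xml"):
    """Return the element paths in `part` whose own text equals `needle`.

    Hypothetical helper: it walks the XML tree and records the chain of
    element names leading to every element whose text matches exactly.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read(part))

    hits = []

    def walk(element, trail):
        trail = trail + [local_name(element.tag)]
        if (element.text or "").strip() == needle:
            hits.append("/".join(trail))
        for child in element:
            walk(child, trail)

    walk(root, [])
    return hits
```

    Calling find_text_owners("example.docx", "4207510") on an affected file (the file name here is a placeholder) lists the elements that carry the number. If those elements are not ordinary text runs, that would suggest the numbers come from layout data rather than from the visible text.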



    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #3
      Thanks Ray,

      I have created a small Zip file and attached it to an e-mail which I have just sent to you at info@wrensoft.com. Hopefully, this will give you sufficient information to come up with a fix.

      I also spotted another issue to do with indexing v1.5 (Acrobat 6.x) PDFs, and have included this in the e-mail as well.

      Many thanks and kind regards,

      Russell



      • #4
        For anyone experiencing similar issues to those described above: the issue with replicated titles appearing in the DB was a failing on my part. The metadata in the original Office documents had not been updated, and we had been creating new documents from the originals, so they all carried the same title.
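        For anyone wanting to check their own files for the same problem, here is a rough Python sketch. It relies on the fact that docx, pptx and xlsx files are ZIP archives that store the document title as dc:title in docProps/core.xml. The function names are made up for illustration, and it does not cover the older doc, ppt and xls formats.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path

# Dublin Core title element, as it appears in docProps/core.xml.
DC_TITLE = "{http://purl.org/dc/elements/1.1/}title"
OOXML_SUFFIXES = {".docx", ".pptx", ".xlsx"}


def read_title(path):
    """Return the title stored in the file's core properties, or None."""
    try:
        with zipfile.ZipFile(path) as archive:
            with archive.open("docProps/core.xml") as core:
                root = ET.parse(core).getroot()
    except (KeyError, zipfile.BadZipFile, ET.ParseError):
        # No core properties part, not a ZIP archive, or unreadable XML.
        return None
    element = root.find(DC_TITLE)
    if element is None or not (element.text or "").strip():
        return None
    return element.text.strip()


def find_shared_titles(folder):
    """Map each title used by more than one file to the files using it."""
    by_title = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.suffix.lower() in OOXML_SUFFIXES:
            title = read_title(path)
            if title:
                by_title[title].append(path)
    return {title: paths for title, paths in by_title.items() if len(paths) > 1}
```

        Running find_shared_titles on the folder being indexed returns a dictionary of the titles that appear in more than one file, which are the ones likely to show up repeated in the search results when Extract metadata is switched on.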

        The issue to do with strange numbers appearing in the summary for Office documents was fixed by the 7.1.1011 release.

        Thanks Ray,

        Russ

