Improved PDF Indexing

  • Improved PDF Indexing

    Hi folks - I was just doing some research on improving PDF search for a little project and stumbled across this - https://www.wrensoft.com/forum/zoom-...fields-in-zoom

    Does anyone know if this will work for custom metadata created via custom properties in a PDF file?

    Thank you!

  • #2
    I don't have Acrobat Pro here to modify Custom Properties to test this out. But it does look probable that it's the same data structure within the document -- given that it says specifically you are not allowed to use "Title", "Author", "Subject", "Keywords" etc. as the custom property name.

    So give it a try, and let us know how you go.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine

    • #3
      Hi Ray - I set up a test for this but it's not conclusive. The test site is at http://www.trs-80.computer/zoomtest/search.php

      Scan Options for the file type pdf are set to retrieve the internal meta information.

      If you do a regular search on a word like "transistor" you get results, so that's good.

      I edited one of the PDF files and added a custom property with a name of "TEST" and a value of "TestTest". I then added a Custom Metadata field to my ZOOM config for this field for a partial match and re-indexed. If you do a search on the test page for "test" in the Test field it returns no results. I even tried "TestTest" and "Test".

      To check this further I added myself as the Author of the document - if a search on Author worked, it would show that the problem is specific to custom properties. I re-indexed after adding a Custom Metadata field for Author. If you do a search in the author field on "parker", that does not return a result either.

      It appears that the retrieval of internal meta information may not actually be working, which is why this test is inconclusive.

      FYI the source files are Acrobat 10 files.

      • #4
        Can you tell us exactly which PDF file (URL) you've added these fields to, so we can download it and take a closer look?

        Also confirm in your index log that this exact file has been indexed.

        Make sure to also select "Reload all files (do not use cache)" to ensure the latest edited version of the PDF file got indexed, and not a cached copy (without the newly added data).
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine

        • #5
          Thanks Ray - the file in question is http://www.trs-80.computer/zoomtest/...001-NUM001.pdf

          I have checked the log files (I had to rerun it today as I didn't have logging turned on before):

          11|09/25/17 09:19:42|Queued URL: http://www.trs-80.computer/zoomtest/...001-NUM001.pdf

          and

          14|09/25/17 09:19:42|DL Thread #4, got URL (http://www.trs-80.computer/zoomtest/...001-NUM002.pdf) off queue
          04|09/25/17 09:19:42|Downloading file http://www.trs-80.computer/zoomtest/...001-NUM001.pdf

          and

          14|09/25/17 09:19:43|Index Thread got ready buffer for http://www.trs-80.computer/zoomtest/...001-NUM001.pdf (Content-type: Acrobat document)

          and

          06|09/25/17 09:19:43|Processing PDF file http://www.trs-80.computer/zoomtest/...001-NUM001.pdf
          00|09/25/17 09:19:43|Indexing http://www.trs-80.computer/zoomtest/...001-NUM001.pdf
          Last edited by kpa; Sep-25-2017, 01:03 AM.
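For anyone repeating this check on a longer log, here is a minimal Python sketch that looks for an "Indexing" entry for a given URL. It assumes every line has the `code|timestamp|message` shape seen in the excerpt above and that the timestamp is MM/DD/YY HH:MM:SS; both are inferred from those few lines rather than taken from Zoom's documentation. The names `parse_line` and `was_indexed`, and the example.com URL, are hypothetical.

```python
from datetime import datetime
from typing import Iterable, NamedTuple, Optional


class LogEntry(NamedTuple):
    code: str
    timestamp: datetime
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one 'code|timestamp|message' line; return None if it does not fit."""
    parts = line.strip().split("|", 2)  # the message itself may contain '|'
    if len(parts) != 3:
        return None
    code, stamp, message = parts
    try:
        # Assumed layout MM/DD/YY HH:MM:SS, inferred from the excerpt above.
        when = datetime.strptime(stamp, "%m/%d/%y %H:%M:%S")
    except ValueError:
        return None
    return LogEntry(code, when, message)


def was_indexed(lines: Iterable[str], url: str) -> bool:
    """True if some entry's message is exactly 'Indexing <url>'."""
    target = "Indexing " + url
    return any(
        entry is not None and entry.message == target
        for entry in map(parse_line, lines)
    )


# Hypothetical URL and log lines, shaped like the excerpt above.
SAMPLE_URL = "http://example.com/docs/sample.pdf"
SAMPLE_LOG = [
    "11|09/25/17 09:19:42|Queued URL: " + SAMPLE_URL,
    "06|09/25/17 09:19:43|Processing PDF file " + SAMPLE_URL,
    "00|09/25/17 09:19:43|Indexing " + SAMPLE_URL,
]

print(was_indexed(SAMPLE_LOG, SAMPLE_URL))
```

To run it against a real log, pass the open file object as `lines` and use the full, untruncated URL of the PDF.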

          • #6
            Thanks for the test file.

            We attempted to index this and confirm that the custom fields are not being picked up.

            Also looked closer at the format, and confirmed that the "pdftotext" plugin is unable to extract the custom fields at this point.

            So unfortunately, this won't be supported until the pdftotext plugin (which is from a 3rd party but almost industry standard project known as "xpdf") adds support for this. We also tried the latest build of pdftotext from xpdf (Version 4.0) and confirmed its behaviour is the same.
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine
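To make the standard/custom distinction concrete, here is a small Python sketch that separates the keys the PDF specification defines for the document information dictionary from everything else. It assumes the entries have already been read into a plain dict by some PDF library (not shown), and that custom properties sit alongside the standard entries, which was suspected earlier in the thread but not confirmed. The function name `split_info` and the sample values are hypothetical.

```python
from typing import Dict, Tuple

# Keys the PDF specification defines for the document information dictionary.
STANDARD_INFO_KEYS = frozenset({
    "Title", "Author", "Subject", "Keywords", "Creator",
    "Producer", "CreationDate", "ModDate", "Trapped",
})


def split_info(info: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (standard, custom) entries from an already-extracted info dict."""
    standard = {k: v for k, v in info.items() if k in STANDARD_INFO_KEYS}
    custom = {k: v for k, v in info.items() if k not in STANDARD_INFO_KEYS}
    return standard, custom


# Hypothetical entries modelled on the test described earlier in the thread.
info = {"Author": "Parker", "TEST": "TestTest"}
standard, custom = split_info(info)
print("standard:", standard)
print("custom:", custom)
```

Anything that lands in `custom` is the kind of entry the extractor was reported not to pass through.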

            • #7
              Thanks Ray - while I can understand the custom field not being picked up it appears the standard fields (in this test case the Author field) are not being picked up - was this demonstrated in your test as well? As per my original post on this, it appears that the 4 standard fields should be/are supported. Thank you.

              • #8
                Hi guys - I thought I'd experiment with using description files to see if I could get around some of my indexing requirements. That doesn't work either. I've set the flag in Scan Options | PDF Indexing Options to use description files ("Retrieve internal meta information" was turned on in my first test). This did not pick up any data from my .desc files. I tried turning off "Retrieve internal meta information", which didn't resolve the issue either. (It did prove that ZOOM had been harvesting the standard metadata fields, though, because with the option off the results layout fell back to showing just the file name.) Any clues please?

                • #9
                  Hi guys - just spotted this - https://www.wrensoft.com/zoom/suppor...html#descfiles - I'll check with my hosting provider first.

                  • #10
                    Hi guys - I added .desc as a text MIME type in the hosting service and that fixed the description files, but I'm still having an issue searching on author etc. in custom metadata. Thank you!

                    • #11
                      Hi there. Sorry I missed this thread.

                      Regarding the "Author" meta field, make sure of the following:

                      1) "Retrieve internal meta information" is checked for the .pdf file format
                      2) If you want to be able to search for the Author field's content from the main search query box, you should have "Meta author" checked under "Configure"->"Indexing options".
                      3) If you want to be able to search for the Author field via a Custom Meta Field, you have to DISABLE the above ("Configure"->"Indexing options"->Uncheck "Meta author") and set up "Custom Meta Fields" with a Meta Name of "Author".

                      Note that the last 2 features are basically mutually exclusive. Hope that helps.


                      --Ray
                      Wrensoft Web Software
                      Sydney, Australia
                      Zoom Search Engine
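The three points above can be sketched as a small checklist in Python. This is only a model of the rule as described in this post: `PdfAuthorSettings` and `problems` are hypothetical names, and the class does not reflect Zoom's actual configuration file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PdfAuthorSettings:
    """Hypothetical model of the three settings named above (not Zoom's file format)."""
    retrieve_internal_meta: bool
    index_meta_author: bool
    custom_meta_names: List[str] = field(default_factory=list)


def problems(settings: PdfAuthorSettings, want: str) -> List[str]:
    """List what blocks the wanted mode: 'main' (search box) or 'custom' (meta field)."""
    if want not in ("main", "custom"):
        raise ValueError("want must be 'main' or 'custom'")
    found = []
    if not settings.retrieve_internal_meta:
        found.append('check "Retrieve internal meta information" for .pdf')
    if want == "main" and not settings.index_meta_author:
        found.append('check "Meta author" under Indexing options')
    if want == "custom":
        if settings.index_meta_author:
            found.append('uncheck "Meta author" under Indexing options')
        # Exact-case comparison is an assumption; the thread does not say
        # whether the Meta Name is case-sensitive.
        if "Author" not in settings.custom_meta_names:
            found.append('add a Custom Meta Field with Meta Name "Author"')
    return found


# Example: a Custom Meta Field exists, but "Meta author" is still checked.
current = PdfAuthorSettings(True, True, ["Author"])
for issue in problems(current, "custom"):
    print(issue)
```

An empty list means the settings match the mode you asked for; the example reports the one change needed, since "Meta author" and the custom field cannot both be in effect.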
