New (I think) bug with PDFs

  • New (I think) bug with PDFs

    Using the latest beta with the PDF plugin and JS, I searched a site containing several PDFs for "blue", knowing that "bluebird" was in one or more of them. Zoom listed the appropriate PDFs, but clicking on one opened it in Adobe Reader, which reported no results found. Apparently Zoom indexed it because "bluebird" contains "blue", but Adobe Reader couldn't find it because "blue" was part of another word. I can warn users to search only for complete words (fine if they want "bluebird"), but if they really want just "blue", then Zoom will tell them it is in the PDF that contains "bluebird" and Acrobat will tell them it isn't. What to do?
    Walter K.

  • #2
    It sounds like you have enabled the "Substring match for all searches" option (on the "Languages" tab of the Configuration window) in Zoom.

    This is not the default behaviour of Zoom. With the above option off (by default), a search for "blue" will not match "bluebird".

    Since Adobe Acrobat Reader does not support this feature, you will see the behaviour you describe whenever this option is enabled.

    If you want to search for exact words only (which is the default), disable the above-mentioned option.
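
The difference between the two matching modes can be illustrated with a minimal sketch. This is not Zoom's or Acrobat's code; the `matches` function and its `substring` parameter are hypothetical names used only to show why "blue" hits "bluebird" in one mode and not the other.

```python
import re

def matches(query: str, text: str, substring: bool = False) -> bool:
    """Return True if `query` occurs in `text`.

    substring=False: the query must be a complete word.
    substring=True: the query may appear inside a longer word.
    (Hypothetical helper for illustration only.)
    """
    if substring:
        return query.lower() in text.lower()
    # \b anchors the query to a word boundary on both sides.
    pattern = r"\b" + re.escape(query) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

text = "A bluebird landed on the fence."
print(matches("blue", text))                  # whole-word mode
print(matches("blue", text, substring=True))  # substring mode
print(matches("bluebird", text))              # whole-word mode
```

In whole-word mode the first call finds nothing, because "blue" is followed directly by another letter; in substring mode the same query matches.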
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine


    • #3
      I just looked at the Configuration window and the substring option is not checked, nor do I remember setting it. Also, var SearchAsSubstring = 0 in settings.js. However, the word "blue" is listed (as well as "bluebird") in Zoom-Index.js because the word "blue" is in other PDF files on the site. As I said, the Zoom search results list one PDF file that has only "bluebird" in it, not "blue" (and several others that do have "blue"), and clicking on that result brings up Adobe Reader (version 7), which says there is no match for "blue" (whole word only). Could the fact that "blue" by itself is in other PDF files, and therefore in the Zoom-Index.js file, be causing this?
      Walter K.


      • #4
        Is your website online? If so, could you give us the URL of your search page? Otherwise, could you zip up your search files and e-mail them to us? Please include the PDF file in question.

        There are several other possibilities I can think of:
        1.) The word "blue" appears in the meta description or title of the document (under "Document Summary" in Acrobat Reader). Acrobat does not find the word when searching within the document, but Zoom finds it because it has "Use meta information when available for plugins" enabled.
        2.) The word "blue" does not appear on its own at all, but the layout of the PDF file gets in the way, e.g. it has multiple columns, or the word "bluebird" is split between two lines like so:
        blue-
        bird
        In that case, it is possible that Zoom's PDF plugin interpreted it as two words. The PDF format does not lend itself particularly well to separating content from layout, and an intricate layout can make the content hard to index correctly.
        3.) Or it might be something else altogether, and your index files are mixed up.

        If you can send us the files, we will be able to confirm this for you.
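
Possibility 2 can be sketched as follows. This is only an illustration of the general effect, not the PDF plugin's actual logic; `tokenize` and `dehyphenate` are hypothetical helpers, and the sample sentence is made up.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split into lowercase words; anything that is not a letter ends a word,
    so a hyphen at the end of a line ends the word too."""
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]

def dehyphenate(text: str) -> str:
    """Join a word broken as 'blue-' + line break + 'bird' back into 'bluebird'.
    Assumption: a hyphen directly before a line break is a soft hyphen."""
    return re.sub(r"([A-Za-z])-\s*\n\s*([a-z])", r"\1\2", text)

extracted = "We saw a blue-\nbird near the river."

print(tokenize(extracted))               # 'blue' and 'bird' appear as separate words
print(tokenize(dehyphenate(extracted)))  # 'bluebird' appears as one word
```

A naive tokenizer indexes "blue" for this text even though a reader only ever sees "bluebird". The joining pass has its own cost: it would also merge genuinely hyphenated compounds that happen to break at a line end, such as "well-" / "known".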
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine


        • #5
          That seems to be it

          Ray:
          Your suggestion #2 seems to be the cause. In two different sentences in the pdf file the word bluebird is split as you suggested:
          blue-
          bird.
          So, now that the cause is known, I don't know if you still want the PDF and the Zoom files, but I am emailing a zip file (Blue.zip) to you anyway. Keep in mind that the Zoom indexes cover more PDFs than the problem one I am sending.
          I don't know what to do about this type of problem. I could reformat so that "bluebird" isn't split, but lots of other words are split too, which could lead to similar errors. I hate to turn off hyphenation altogether.
          Walter K.
          Last edited by whk; Nov-03-2006, 03:33 PM. Reason: specify zip file name


          • #6
            I had a look at the file, and can confirm that the issue is what was suspected: the layout and column formatting of the PDF causes the word to be split.

            Unfortunately, there is not much we can do about this, as it is a technical limitation of the PDF document format. You can see this by opening the document in Acrobat Reader itself and attempting to select text spanning several lines in one column using the "Text Select Tool". You will find that Acrobat is unable to do this, and its selection will span several columns. The format simply does not store the data needed to tell text in one column from text in another, and Acrobat merely guesses when a gap is wide enough to be treated as a separate line of text (because of this, it may have more success selecting text from some columns than others).
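
The ambiguity described above can be sketched with positioned text fragments. This is a simplified model, not how Acrobat or Zoom actually reads a PDF; the `Fragment` type, the coordinates, and the `column_width` parameter are all assumptions made for illustration.

```python
from typing import NamedTuple

class Fragment(NamedTuple):
    x: float   # left edge of the fragment on the page
    y: float   # distance from the top of the page
    text: str

def row_order(frags: list[Fragment]) -> list[str]:
    """Read across the full page width, line by line."""
    return [f.text for f in sorted(frags, key=lambda f: (f.y, f.x))]

def column_order(frags: list[Fragment], column_width: float) -> list[str]:
    """Read each column top to bottom. The column is guessed from x alone,
    because the page carries no explicit column information."""
    return [f.text for f in sorted(frags, key=lambda f: (int(f.x // column_width), f.y))]

# A two-column page: the left column breaks "bluebird" across two lines.
page = [
    Fragment(0, 0, "a blue-"),
    Fragment(300, 0, "the river"),
    Fragment(0, 12, "bird sang"),
    Fragment(300, 12, "was calm"),
]

print(row_order(page))
print(column_order(page, column_width=300))
```

Read row by row, text from the right-hand column lands between "blue-" and "bird". Read column by column, the two halves are adjacent, but only because the column width was guessed correctly, and they are still separate fragments until something joins them.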

            Due to issues like this, we offer two different scanning methods, which you can select by double-clicking on the ".pdf" extension in the Extensions list. The default is "Scan text by presentation layout" and the other is "Scan text in raw formatting order". The latter helps in some cases with multiple-column layouts, but in this case it is still unable to join "blue-" and "birds" into one word.

            And just to confirm: this is not a new issue with V5.0. The same issue can arise in previous versions. It is simply a limitation of the PDF file format, and unfortunately there is little we can do about it.
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine
