Indexing of scanned pdfs


  • Indexing of scanned pdfs

    Hello, I have just purchased a Zoom Search v6 Professional license. It seems very fast and does all it says.

    Recently my organization began photocopying documents for distribution, and posting the resulting PDFs to our website. These scanned copies appear to be graphic files, which I would not expect Zoom or any other engine to index. However, Google and Google Custom Search appear to be indexing them somehow. Does Google do OCR as it indexes, and is that a feature Zoom Search might incorporate at some point?

    Thank you kindly for any response,

  • #2
    You need to have a close look at the PDF files.
    Some scanning software does automatic OCR on the pages and inserts a text layer into the document, alongside the image of the page. (And if your scanning software isn't doing this, then it should be.)

    Zoom would also normally pick up text layers. But there are other possibilities, such as the document having been flagged with security settings indicating that text extraction should not be done.
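
    As a rough illustration of the security flags mentioned above: in an encrypted PDF that uses the standard security handler, the encryption dictionary carries an integer /P entry, and bit 5 of it (value 16) governs copying or extracting text and graphics. The sketch below only decodes that one bit. The function name and the sample values are made up for illustration, and it says nothing about how Zoom itself reads these flags.

```python
# Bit 5 (1-based) of the /P permissions integer, i.e. value 16, is the
# "copy or otherwise extract text and graphics" permission in the PDF
# standard security handler.
EXTRACT_BIT = 1 << 4


def extraction_allowed(p_value: int) -> bool:
    """Return True if a /P permissions value permits text extraction.

    p_value is the signed integer stored under /P in the encryption
    dictionary. PDFs without encryption have no /P entry at all, in
    which case extraction is not restricted by these flags.
    """
    return bool(p_value & EXTRACT_BIT)


# Hypothetical sample values, chosen only to show both outcomes.
for sample in (-44, -3904):
    print(sample, extraction_allowed(sample))
```

    Python's `&` treats negative integers as two's complement, so the signed /P value can be tested directly without converting it first.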

    Feel free to E-mail us, or post a link to an example PDF if you want us to take a look.

    • #3
      OK, this is one file: http://www.capeelizabeth.com/council_packets/2011/01%2010%202011/Jones%20Lease%2001%202011%5B1%5D.pdf

      I added it to be indexed by Zoom Search as an incremental file, just in case for some reason it was not indexed when I did the whole site. I uploaded the files (I am using CGI). When I search for text that I know is in this document, it does not show up. However, the same query does find it in Google. Am I doing something wrong? Many thanks -

      • #4
        There's no text layer in that PDF file.

        Normally, most document scanning software would run it through OCR to create a text layer (which is invisible to the user) that is stored within the PDF. That way, when you try to copy text out of it in Acrobat Reader (by selecting with your mouse, hitting CTRL-C and pasting into Notepad), you will be able to get the text content out of it.

        With your PDF file, there is no text layer, so you can't ever copy and paste text from the file.
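
        Besides the copy-and-paste test described above, a quick scripted check is possible. The following is a minimal heuristic sketch, not how Zoom or Acrobat detect text: it uses only the Python standard library, inflates each stream it finds, and looks for PDF text-showing operators (Tj or TJ) after a BT marker. The function name is hypothetical, and the check can be fooled (for example by encrypted files, object streams, or coincidental byte patterns in image data), so treat the result as a hint only.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A string, hex string, or array operand followed by a text-showing operator.
TEXT_OP_RE = re.compile(rb"[)\]>]\s*(?:Tj|TJ)\b")


def has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristically report whether any content stream draws text."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # not Flate-compressed; scan the bytes as they are
        if b"BT" in data and TEXT_OP_RE.search(data):
            return True
    return False
```

        To try it on a file, read the file in binary mode and pass the bytes in, for example `has_text_layer(open("scan.pdf", "rb").read())`, where `scan.pdf` is a placeholder name. A result of False for a scanned document suggests it needs to go through OCR before a text-based indexer can find anything in it.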

        About 6 months ago, Google started to add OCR functionality to Google Docs and various other places. I think they might have added it to their indexing, but there were still a lot of people saying that it wasn't very accurate. Generally it is best to run OCR yourself and have at least one human confirm that the results look vaguely acceptable before saving the text with the PDF file.

        So the better way to go is still to add text layers to your scanned PDF files yourself.

        Adobe provides the Paper Capture Plugin to do this:
        http://www.adobe.com/support/downloa...jsp?ftpID=1907
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine

        • #5
          Thank you for the info!
