PDF 'image & text' indexing question

  • PDF 'image & text' indexing question

    Can Zoom v6 index PDF documents that are scanned images converted into searchable 'image and [hidden] text' PDFs?
    I've installed the plugin but nothing seems to be happening, and when I search for text that appears in the PDF image I get no results.
    If I do a search on 'stream' I get a lot of results with this beneath:
    " ... n trailer] startxref 0 %% EOF 25 0 obj endobj 36 0 obj stream xœc``d`` b`b`/ac@ ..."

    I hope I'm doing something wrong, or I've bought Zoom for no reason.
    Last edited by Ted; 02-03-2009, 09:23 AM.

  • #2
    Yes, Zoom can index the text in PDF files.

    But you need to ensure that the initial scan was of high enough quality to allow the OCR to do an accurate job.

    Can you post a link to the PDF in question so we can see what text is in the file?


    • #3
      I thought it did

      Scans are done at a minimum of 240dpi or 300dpi which is OK for OCR. One of the PDFs is here:
      http://www.tedmount.demon.co.uk/00000004.pdf


      • #4
        The source document is a low-quality fax. So even if you scan at 1000 dpi it won't fix the blurring, fading and unreadable handwriting.

        Nevertheless, the OCR seems to have picked up 80% or more of the text correctly. But you do have text like this in the document:
        Pay t0: Co-oper2tive~Bank plc
        Removal of old bro ken fridge 1 If 60 .00 60.04
        BaCkColor }~','lwr;

        But I suspect the real problem is elsewhere. In Zoom, check the scan options window. You might be scanning PDF files as HTML files instead of as Acrobat files. To fix this, remove the PDF extension from the list and then re-add it.
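
The symptom in the first post (PDF syntax such as `endobj` and `startxref` showing up in result snippets) is what you would expect when a PDF is read as plain text or HTML rather than parsed as a PDF. As a rough illustration only, and not something Zoom itself provides, a small Python check like the one below could flag such snippets. The function name, the token list and the threshold are all assumptions made for this sketch.

```python
# Hypothetical helper, not part of Zoom.
# The token list and the default threshold are assumptions for illustration.
PDF_SYNTAX_TOKENS = ("endobj", "startxref", "trailer", "stream")


def looks_like_raw_pdf(snippet: str, min_hits: int = 2) -> bool:
    """Return True if the snippet contains several PDF syntax keywords.

    Seeing these keywords in a search-result snippet suggests the file
    was indexed as raw text instead of being parsed as a PDF.
    """
    # Split on whitespace so only whole words are counted.
    words = set(snippet.split())
    hits = sum(1 for token in PDF_SYNTAX_TOKENS if token in words)
    return hits >= min_hits
```

Requiring at least two distinct keywords keeps an ordinary sentence that happens to contain a word like "stream" from being flagged.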


        • #5
          Sorted... yes, the PDF files were being scanned as HTML (I did wonder about that when I first added the .pdf extension).
          All now working great.
          The document isn't actually a fax (a fax is usually lower resolution than this scan); it is a 200 dpi scan, which is our standard for document archiving. If we go ahead with full-text searching for this client we'll increase the resolution to 240 or 300 dpi.

          Thanks for your help... loving Zoom.
