PDFs not getting indexed after splitting

  • PDFs not getting indexed after splitting

    I have a PHP routine that uploads a PDF from a user's system, then uses Ghostscript to break the PDF into separate files (one page per file). For this test, I am only taking the first 8 pages of the main file and putting each page into its own file. That means there are 9 files (the original and the 8 small one-page files).

    Here's the page that is in development, so you can see the PDFs:
    http://207.158.22.22/~admin32/CurrentIssue.php

    You can see 8 of the small page files here...but if you do a search, you'll see that it isn't picking up any of them in the index. When I try to index the PDFs, it seems to index the main file, but does not index the small files at all. I've tried in Offline mode and spider mode.

    The indexing status shows 4169 unique words found and it shows that it indexed 9 files...but the only file that seems to be indexed is the big file.

    In offline mode, I removed the large file from the directory and indexed again, and it shows 194 unique words in 8 files indexed. I don't know where the 194 words are coming from, because when I search on words that should be in the files, no results are found.

    I'm stuck - any ideas? Could there be something different about the PDFs I've created through Ghostscript?
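
The original PHP routine and its exact Ghostscript command are not shown in the thread, so the following is only a minimal sketch of what a one-page-per-file split might look like, written in Python for illustration. The function names and the output naming scheme are hypothetical; the switches used (`-sDEVICE=pdfwrite`, `-dFirstPage`, `-dLastPage`, `-sOutputFile`) are standard Ghostscript options, but this is an assumed reconstruction, not the poster's actual command.

```python
import subprocess
from pathlib import Path


def gs_split_command(src, out_dir, page):
    """Build a Ghostscript argv that writes a single page of src to its own PDF.

    The "<stem>_page<N>.pdf" naming is an assumption made for this sketch.
    """
    out_file = Path(out_dir) / f"{Path(src).stem}_page{page}.pdf"
    return [
        "gs",
        "-sDEVICE=pdfwrite",      # re-emit the page as a new PDF
        "-dNOPAUSE",
        "-dBATCH",
        f"-dFirstPage={page}",    # restrict output to exactly one page
        f"-dLastPage={page}",
        f"-sOutputFile={out_file}",
        str(src),
    ]


def split_first_pages(src, out_dir, count=8):
    """Run Ghostscript once per page for pages 1..count (8 in the test described)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for page in range(1, count + 1):
        subprocess.run(gs_split_command(src, out_dir, page), check=True)
```

Because `pdfwrite` rewrites the page content and font data rather than copying it untouched, the output files can differ internally from the source PDF even when they look identical on screen.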

  • #2
    We've had a look at the files, and the problem appears to be a bug in Ghostscript, creating dodgy PDFs.

    Zoom uses the Xpdf package to process the PDF files, and it reports multiple instances of the following error:

    "Error: Illegal entry in bfrange block in ToUnicode CMap"

    It then finishes processing the file, although it is unable to extract any text from it besides a single "."

    Looking up this error on Google returns a few references such as this one which says:

    Originally posted by Taco Hoekwater
    ToUnicode CMaps vectors in PDF are used to aid in searches,
    and they have a pretty strict "official" format. Your pdf file
    is apparently breaking one of the constraints.

    I am fairly certain of that, because some of the constraints
    are a cumbersome and seemingly arbitrary so it is likely that
    both gs and AR just ignore them.
    Here is an article that explains how to use Ghostscript to do the opposite of what you're doing (merging multiple PDFs into one file), and the bottom comment reports the same error when he tries to open the file.

    I have also tried opening the file in another third-party PDF viewer called Eroiica Viewer, and it fails to render any of the text.

    So I think it is safe to deduce that the problem is very likely to be a Ghostscript bug rather than an Xpdf one.

    Having said that, we will add a preventive measure in V6 and check the length of the text extracted by the PDF plugin, so that it can report "No text found within PDF file" rather than letting the file go through without an error.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine
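
Until a check like the one described above is available in the indexer, a similar test can be run by hand on each split file. The sketch below is an assumption-laden illustration, not Zoom's implementation: it calls Xpdf's `pdftotext` command-line tool (assumed to be installed and on the PATH), and the function names and the `min_chars` threshold are hypothetical. A file whose extracted text is only a "." would be flagged.

```python
import subprocess


def extract_text(pdf_path):
    """Return the text pdftotext extracts; the "-" argument sends it to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_indexable_text(text, min_chars=2):
    """True when the text contains at least min_chars letters or digits.

    The threshold is an arbitrary choice for this sketch.
    """
    return sum(ch.isalnum() for ch in text) >= min_chars


def report(pdf_path):
    """Describe whether a PDF yields any text worth indexing."""
    text = extract_text(pdf_path)
    if not has_indexable_text(text):
        return f"{pdf_path}: No text found within PDF file"
    return f"{pdf_path}: {len(text.split())} words extracted"
```

Running `report()` over the original file and each one-page file should show quickly which of them the indexer has nothing to work with.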


    • #3
      Do you know any other way I can break a PDF into separate files, or any suggestions on how I could work around the gs bug? This is the end of a project, and instead of being finished, I'm back to more development.


      • #4
        There are other applications out there to split up PDF files.

        For example, the open-source pdfsam.org and the paid-for artspdf.com.

        Plus Adobe Acrobat will also do it.

        Quote from Lisa Rieger on the Adobe Forums:
        -=-=-=-=-=-
        Go to:
        Document > Extract Pages
        and enter the range of pages. If you want to split an entire document into single pages, click "Extract pages as separate files".

        Acrobat will automatically save all these files, but you may wish to re-name them since you can't specify how you'd like them named. It just tacks a space and then sequential numbers after your original file name, for each.

        This automation takes a lot of the repetitive work out of individually extracting single pages from a multi-page PDF
        -=-=-=-=-=-
