Search results for PDF ignores columns



pststerrye
02-19-2009, 09:46 PM
I am using Zoom to index PDFs of a newspaper. The files are a spread (so there is page1.pdf, pages2-3.pdf, pages4-5.pdf, etc.). The index seems to cross over columns on the page, so the result words that appear before the search term are from the previous page.

Search term: raceway

Example text from the PDF:
Raceway rep addresses fire, rescue procedures (headline)
BYERS — The Byers Fire Protection District Board met with a representative from High Plains Raceway at its Feb. 9 meeting.
Joe Gilmore, regional executive of Colorado Region of the Sports Car Club of America, attended the meeting to discuss fire and rescue procedures at the race track, which opens in April.

Results description:
. 8 The I-70 Scout Tuesday, February 17, 2009 Tuesday, February 17, 2009 The I-70 Scout 9 Rural women focus Raceway rep addresses of MCC biz seminar fire, rescue procedures money go to the local citizens." Women for Rural Business, a one ...

The article about the Rural Women's biz seminar appears on page 8, and the Raceway article appears on page 9. Here's a link to the search page and to the resulting PDF.

This edition will only be up for a few more days, but you will see the same result with other pages. This has been going on a while - I'm just now getting some time to work on it.

Any ideas?

pststerrye
02-19-2009, 09:54 PM
More on what the results are from:

. 8 The I-70 Scout Tuesday, February 17, 2009 Tuesday, February 17, 2009 The I-70 Scout 9 Rural women focus Raceway rep addresses of MCC biz seminar fire, rescue procedures money go to the local citizens." Women for Rural Business, a one ...

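The green text ("8 The I-70 Scout Tuesday, February 17, 2009") is from the header of page 8.

The pink text ("Tuesday, February 17, 2009 The I-70 Scout 9") is from the header of page 9.

The turquoise text ("Rural women focus" and "of MCC biz seminar") is from the headline of the article at the top of page 8.

The blue text ("Raceway rep addresses" and "fire, rescue procedures") is from the headline of the raceway story at the top of page 9.

The purple text ("money go to the local citizens."") is from the top of column 2 of the raceway story on page 9.

The grey text ("Women for Rural Business, a one ...") is the first line of the story on page 8.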

Ray
02-20-2009, 12:21 AM
pststerrye wrote:
"I am using Zoom to index PDFs of a newspaper. The files are a spread (so there is page1.pdf, pages2-3.pdf, pages4-5.pdf, etc.). The index seems to cross over columns on the page, so the result words that appear before the search term are from the previous page."

Double-click the ".pdf" extension in the "Scan Options" panel of the Configure Indexer tab. You can set the "Scan Method" from there.

From the Users Guide (http://www.wrensoft.com/zoom/usersguide.html) (chapter 2.17.5) and Help file:

Scan Method (PDF only)
This option allows you to utilize alternative methods of extracting the text content from PDF files. Due to the technical limitations of the PDF file format, the textual content stored within a PDF file can be ambiguous in its order of presentation. For example, text may be split up in several columns, but this may not be defined within the PDF file itself as to when a sentence ends and when it wraps around. It is only structured visually.

For some PDF files (it depends on how they were created), the default scan method ("presentation layout") may not be the best at preserving the order of text as intended, and in such situations, you should try the other two methods available: "raw formatting order", and "text layer".
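
To illustrate why the order is ambiguous, here is a rough Python sketch. It is not how Zoom works internally, only a toy model: the coordinates, the page boundary at x = 500, and the function names are all made up for the example. The text fragments are the ones from the search result earlier in this thread. Reading straight across the spread reproduces the jumbled description, while reading one page at a time keeps each headline together.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the fragment on the spread (hypothetical units)
    y: float   # distance from the top of the spread (hypothetical units)
    text: str


# Hypothetical positions for text quoted in this thread.
# Assumption: page 8 occupies x < 500 and page 9 occupies x >= 500.
FRAGMENTS = [
    Fragment(20, 10, "8 The I-70 Scout Tuesday, February 17, 2009"),
    Fragment(520, 10, "Tuesday, February 17, 2009 The I-70 Scout 9"),
    Fragment(20, 40, "Rural women focus"),
    Fragment(520, 40, "Raceway rep addresses"),
    Fragment(20, 60, "of MCC biz seminar"),
    Fragment(520, 60, "fire, rescue procedures"),
]


def row_order(fragments):
    """Read straight across the spread: top to bottom, then left to right."""
    ordered = sorted(fragments, key=lambda f: (f.y, f.x))
    return " ".join(f.text for f in ordered)


def column_order(fragments, boundary=500):
    """Read everything left of the boundary first, then everything right of it."""
    ordered = sorted(fragments, key=lambda f: (f.x >= boundary, f.y, f.x))
    return " ".join(f.text for f in ordered)


def keeps_phrase(extracted, phrase):
    """Check whether a known phrase survived extraction in one piece."""
    return phrase.lower() in extracted.lower()


HEADLINE = "Raceway rep addresses fire, rescue procedures"
rows = row_order(FRAGMENTS)
columns = column_order(FRAGMENTS)

print(rows)
print(columns)
print(keeps_phrase(rows, HEADLINE), keeps_phrase(columns, HEADLINE))
```

The same kind of check works by hand when comparing scan methods: pick a headline you know is in the PDF, re-index with each method, and see which one returns the headline unbroken in the result description.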

pststerrye
02-21-2009, 02:25 PM
That worked great! Thanks for the assistance and quick reply!