Zoom Professional Not Indexing PDF Content
hvacwebsite
06-11-2008, 07:51 PM
Is there anything special one needs to do in order to have the content of PDF files indexed? I have the .pdf extension specified in the search configuration, have the Zoom Indexer plugins installed...but when I search for something contained in a PDF it doesn't always work.
Maybe there are types of PDFs that are supported and some types that are not?
THANKS.
wrensoft
06-11-2008, 09:58 PM
Zoom converts PDF files to plain text and indexes all words found in the entire PDF or DOC document. Images, diagrams, graphs, etc. will, however, not be indexed.
See also this FAQ:
Q. Why can't I find words from my scanned PDF files? (PDFs created from scanning in physical documents) (http://www.wrensoft.com/zoom/support/faq_plugins.html#scannedpdfs)
It is also possible that the PDF is encrypted (in which case you can enter the decryption password in Zoom).
Or maybe the spider isn't finding your PDF files at all? In that case, see these FAQs:
Q. Why are links in my Javascript menus being skipped? (http://www.wrensoft.com/zoom/support/faq_problems.html#javascriptmenus)
Q. I am indexing with spider mode but it is not finding all the pages on my web site (http://www.wrensoft.com/zoom/support/faq_problems.html#spider_finding)
If you still have a problem, please post the URL of the site, the PDF file, your search function and let us know the search word used.
hvacwebsite
06-12-2008, 12:22 AM
Thanks for the response!
This site isn't using any flash or javascript. There is a literature page that has a list of PDF links. Here are the details:
site: http://thermostatsusa.com/literature.asp
pdf: http://thermostatsusa.com/pdfs/c11ns_submittal.pdf
search function: all words
search words tried: s1-thec11ns, hydronic, sleek styling
I've tried another PDF on this page and it didn't work either. Kinda strange...if you look at my home page (http://www.thermostatsusa.com) you'll see a direct html link to the literature.asp page which then has the direct PDF links.
Thanks for the help!
I tried indexing the PDF file in question, and all the search words you mentioned were found.
The problem is that you have configured Zoom to not index the PDF file at all. You should be able to see this in the index log.
As described in this FAQ (http://www.wrensoft.com/zoom/support/faq_problems.html#skipped), you should enable "Verbose Mode" if you need to see why certain pages are not indexed. It will display the reason why certain links are not being scanned.
When I attempt to spider from the following URL with Verbose Mode:
http://thermostatsusa.com/literature.asp
I see a list of PDF skipped messages such as the following:
10:44:39 - [SKIPPED] Skipping http://thermostatsusa.com/pdfs/c11ns_submittal.pdf (Blocked by robots.txt)
10:44:39 - [SKIPPED] Skipping http://thermostatsusa.com/pdfs/c11p5s_submittal.pdf (Blocked by robots.txt)
10:44:39 - [SKIPPED] Skipping http://thermostatsusa.com/pdfs/c11p5s_manual.pdf (Blocked by robots.txt)
10:44:39 - [SKIPPED] Skipping http://thermostatsusa.com/pdfs/c11p5s_installation.pdf (Blocked by robots.txt)
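If the verbose log is long, a small script can group the skipped URLs by reason. This is only a rough sketch: the line format is inferred from the four log lines quoted above, and the regex and the function name skipped_by_reason are my own assumptions, not part of Zoom.

```python
import re
from collections import defaultdict

# Sample log text copied from the lines quoted in this thread.
LOG = """\
10:44:39 - [SKIPPED] Skipping http://thermostatsusa.com/pdfs/c11ns_submittal.pdf (Blocked by robots.txt)
10:44:39 - [SKIPPED] Skipping http://thermostatsusa.com/pdfs/c11p5s_submittal.pdf (Blocked by robots.txt)
10:44:39 - [SKIPPED] Skipping http://thermostatsusa.com/pdfs/c11p5s_manual.pdf (Blocked by robots.txt)
10:44:39 - [SKIPPED] Skipping http://thermostatsusa.com/pdfs/c11p5s_installation.pdf (Blocked by robots.txt)
"""

# Assumed pattern: "[SKIPPED] Skipping <url> (<reason>)".
SKIP_RE = re.compile(r"\[SKIPPED\] Skipping (?P<url>\S+) \((?P<reason>[^)]*)\)")


def skipped_by_reason(log_text):
    """Return a dict mapping each skip reason to the list of skipped URLs."""
    grouped = defaultdict(list)
    for line in log_text.splitlines():
        match = SKIP_RE.search(line)
        if match:
            grouped[match.group("reason")].append(match.group("url"))
    return dict(grouped)


if __name__ == "__main__":
    for reason, urls in skipped_by_reason(LOG).items():
        print(f"{reason}: {len(urls)} URL(s)")
        for url in urls:
            print("  " + url)
```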
And as expected, when I checked the robots.txt file on your site, this is what I see:
# FULL access (All Spiders)
User-agent: *
Disallow: /stats
Disallow: /images
Disallow: /pdf
This means that the "robots.txt" file on your site is explicitly telling all spiders to not index your PDF files. Disallow rules match by path prefix, so "Disallow: /pdf" also covers everything under "/pdfs/". So it is little surprise that Zoom has also ignored your PDF files.
You can configure Zoom to not obey the "robots.txt" file (on the "Scan Options" tab, uncheck the option that is labelled "Enable 'robots.txt' support"). Or you can change your robots.txt file so that Zoom is allowed to access the PDF folder.
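If you want to check this sort of thing yourself before re-indexing, here is a minimal sketch using Python's standard urllib.robotparser module. It is not how Zoom evaluates robots.txt internally, just an independent check of the same rules. The agent name "ExampleSpider" is a made-up placeholder (the "User-agent: *" group applies to any spider), and the second rule set is only one possible edit, with the "/pdf" line removed.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above from the site's robots.txt.
CURRENT_RULES = """\
# FULL access (All Spiders)
User-agent: *
Disallow: /stats
Disallow: /images
Disallow: /pdf
"""

# One possible edit (an assumption, not the only fix): drop the /pdf rule.
EDITED_RULES = """\
User-agent: *
Disallow: /stats
Disallow: /images
"""

PDF_URL = "http://thermostatsusa.com/pdfs/c11ns_submittal.pdf"
PAGE_URL = "http://thermostatsusa.com/literature.asp"


def is_allowed(robots_txt, url, user_agent="ExampleSpider"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "/pdfs/..." starts with "/pdf", so the current rules block it.
    print("PDF allowed with current rules:", is_allowed(CURRENT_RULES, PDF_URL))
    print("Page allowed with current rules:", is_allowed(CURRENT_RULES, PAGE_URL))
    print("PDF allowed with edited rules:", is_allowed(EDITED_RULES, PDF_URL))
```

Running it against the rules as they stand should report the PDF as not allowed and literature.asp as allowed, which matches the skipped messages in the log.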
hvacwebsite
06-12-2008, 06:38 PM
oh....well guess I'm the moron. THANKS for taking the time to help me fix this!
Have already told another web developer about your product - love it so far.
Thanks again for the help.