efficient spidering
01-28-2009, 03:35 AM
I have Zoom Search V6 Professional and am trying to spider a collection of blogs within a similar niche. There are currently 100 blogs, but I may add more in the future. These are not my blogs, and I have attempted to spider them by adding each blog as a separate starting point.
I've experienced a few challenges. The main problem appears to be that the same content is indexed several times. I think this happens primarily because of the blogs' linking structure.
E.g. www.blogsite.com/categoryX/postX is also listed as www.blogsite.com/2007/08/postx
Sorry if this is covered elsewhere, but I couldn't find it. Ideally, if there is a collection of best practices for spidering other people's sites, let me know.
As these blog sites tend to be running a content-management/blog script of sorts, this FAQ would be relevant:
Q. How should I index my site if it features a message board, forum, or calendar and other similarly complex scripts? (http://www.wrensoft.com/zoom/support/msgboards.html)
The topic of indexing external sites is too diverse and vague to provide a comprehensive "best practice". It depends on the sites being indexed.
The first thing to remember is to have "robots.txt" support enabled (on the "Spider options" panel of the Configure tab), so that you are obeying any instructions that the web admin has specifically set out for spiders.
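To illustrate what obeying robots.txt means in practice, here is a minimal Python sketch using the standard library's urllib.robotparser. It is not how Zoom implements its robots.txt support; the robots.txt content, the "ExampleSpider" user agent name, and the URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that the given robots.txt permits for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    candidates = [
        "http://www.blogsite.com/2007/08/postx",
        "http://www.blogsite.com/private/draft",
    ]
    # "ExampleSpider" is an assumed user agent name.
    for url in allowed_urls(ROBOTS_TXT, "ExampleSpider", candidates):
        print(url)
```

A polite spider would run a check like this before requesting each page, and skip anything the site admin has disallowed.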
There are an infinite number of ways sites can be designed and URLs can be used by these sites. All we can do is offer advice for the sites as you see them. In the case of the example you gave, having multiple URLs to the same page is generally poor practice and bad for spiders.
You can use "Duplicate page detection" in Zoom to prevent identical pages from being indexed. Often, however, these pages feature rotating ads or some other element that changes between requests, so the two copies are no longer identical and this function becomes ineffective.
If there is a comprehensive list of links to all the posts in one particular style (e.g. the links on the sitemap page are all in the style of www.blogsite.com/2007/08/postx), you can use that page as your start point and then add Skip Page entries (such as "/categoryX/") to prevent the second style of links from being indexed.
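A rough Python sketch of why a changing element defeats this kind of check, assuming duplicates are found by comparing a hash of each page's content. That is an assumption made for illustration, not a description of Zoom's actual algorithm, and the page contents are made up.

```python
import hashlib


def content_fingerprint(html):
    """Hash the page content after collapsing runs of whitespace."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Return (duplicate_url, first_seen_url) pairs for pages with identical content.

    `pages` maps URL -> HTML and is assumed to be already downloaded.
    """
    seen = {}
    duplicates = []
    for url, html in pages.items():
        fingerprint = content_fingerprint(html)
        if fingerprint in seen:
            duplicates.append((url, seen[fingerprint]))
        else:
            seen[fingerprint] = url
    return duplicates


if __name__ == "__main__":
    # Hypothetical pages: the same post under two URLs, plus a copy with a different ad.
    pages = {
        "www.blogsite.com/2007/08/postx": "<p>Post text</p><div>Ad 1</div>",
        "www.blogsite.com/categoryX/postX": "<p>Post text</p><div>Ad 1</div>",
        "www.blogsite.com/tag/postx": "<p>Post text</p><div>Ad 2</div>",
    }
    print(find_duplicates(pages))
```

The first two pages hash identically and are caught. The third carries the same post with a different ad, so its hash differs and it would be indexed as a separate page.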
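The effect of such skip entries can be sketched in Python as a simple filter, assuming each entry is matched as a plain substring of the URL. The function name and the URLs are hypothetical, and the Zoom documentation is the place to confirm the exact matching rules.

```python
def should_skip(url, skip_entries):
    """Return True if the URL contains any of the skip entries as a substring."""
    return any(entry in url for entry in skip_entries)


if __name__ == "__main__":
    # Assumed skip list: drop the category-style links, keep the date-style ones.
    skip_entries = ["/categoryX/"]
    found_links = [
        "www.blogsite.com/2007/08/postx",
        "www.blogsite.com/categoryX/postX",
    ]
    for link in found_links:
        if not should_skip(link, skip_entries):
            print(link)
```

With the date-style URLs reachable from the start point, each post is then indexed once, under a single URL style.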