
Scheduled indexing & spidering


  • Scheduled indexing & spidering

    Folks,

    I really like the fact that one can schedule indexing & I do make use of that. But, in a way like Google, Yahoo, etc. spider resources, I'd like to enable a "slow crawl" such that the indexing process doesn't cripple a server incapable of rapidly returning pages as they are rapidly indexed. Perhaps something like a forced pause for X seconds to take place every Y pages indexed. That way, I can schedule more frequent indexing routines without fear of crippling my server.

    Dave
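
The "pause for X seconds every Y pages" idea above can be sketched as follows. This is only an illustration of the request, not how Zoom works internally; `crawl_with_pauses`, `fetch`, `pages_per_batch`, and `pause_seconds` are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Iterable, List


def crawl_with_pauses(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    pages_per_batch: int,
    pause_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in order, pausing after every full batch of pages.

    `fetch` is a caller-supplied function (hypothetical here) that
    downloads one page and returns its content.
    """
    if pages_per_batch < 1:
        raise ValueError("pages_per_batch must be at least 1")
    pages = []
    for count, url in enumerate(urls, start=1):
        pages.append(fetch(url))
        # After every Y-th page, pause for X seconds so the backend
        # gets a chance to recover before the next burst of requests.
        if count % pages_per_batch == 0:
            sleep(pause_seconds)
    return pages
```

Passing `sleep` as a parameter is just a convenience so the pausing logic can be exercised without actually waiting.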

  • #2
    If you are using a single thread in Zoom, you should never have a problem, unless the server is already broken.

    There is an in-depth discussion of the load throttling issue here.

    But we do plan on adding a throttle in a future release. Maybe in V5.1



    • #3
      Please do add this! You are correct, I can index with a single thread and I never have a problem. It's not the web server that's the issue so much as it is the backend having to respond to a rapid succession of open/close events + feed all the content. Throttling is a #1 request for me! And, if Zoom Search could act as an XP Service with a config file, I could just let it do its thing and never have to worry about stressing my backend.



      • #4
        There is an argument that you are better off hitting a site harder at a known time (e.g. 4am) rather than adding more background load to your busy period.



        • #5
          4AM in whose timezone? I have a consistent hitload throughout the 24-hour clock. Hardware is the issue here, not the time of day. But this shouldn't exclude me from using a great product that helps sites of all sizes... mine's funded out-of-pocket. What I have is two configs/indexes/CGIs for two parts of the site I maintain: one for the side with fewer pages but more active content, and a second for thousands of pages with less active content. I have a schedule for the former with 4 concurrent threads, but I have to resort to manual indexing for the latter because my backend can't cope. If I could throttle Zoom and have one CGI for the entire site that took about an hour to spider & index in the background while I worked on other things, I'd be much happier than having to walk through the manual steps of re-creating an index on an irregular schedule.

          My 2 cents.



          • #6
            The 4am was just an example - the idea is to schedule indexing for whatever time your server has the least load. As you say, if your reports show a consistent hitload throughout the day, this would not help. But most sites have several hours where the server has relatively less traffic, and it's worth confirming with your web stats.

            Nonetheless, throttling is on our list of features planned for V5.1.

            [ UPDATE 18/June/2007: Zoom V5.1 has now been released with a throttling feature. You can now specify a delay between requests to be sent to the web server. This can be useful when you are crawling a web server which is under heavy load and you wish to minimize any additional load that can be placed on the server during the spidering process. ]
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine
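
For readers wondering what a "delay between requests" amounts to in practice, here is a minimal sketch. It is not Zoom's implementation (the thread does not describe that); `RequestThrottle` and its parameters are hypothetical names for illustration, and the only assumption is that a crawler calls `wait()` before each request.

```python
import time


class RequestThrottle:
    """Enforce a minimum delay between consecutive requests (illustrative only)."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.delay_seconds = delay_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._sleep = sleep
        self._last_request = None    # time of the previous request, if any

    def wait(self):
        """Block until delay_seconds have passed since the previous request."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.delay_seconds - (now - self._last_request)
            if remaining > 0:
                # Too soon after the last request: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now
```

A crawler loop would call `throttle.wait()` immediately before each page request. Measuring from the previous request, rather than always sleeping a fixed amount, means a slow server response already counts toward the delay.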

