How do I slow down spidering?


  • How do I slow down spidering?

    My web hosting provider has implemented spider rate control code that clocks spider requests and returns a 503 if the spider is too aggressive.

    When I attempt to index my site, the spider goes too fast. Is there a way to slow the spider down to, say, one request per second or one request every two seconds?

  • #2
    I would start looking for a new host. If your hosting company can only deal with 1 page request per second, then real users will also get a 503, "Service Unavailable" message from time to time.

    I am also sure that spiders like Google, MSN and Yahoo aren't going to take it slow for your benefit. So you'll end up not being indexed by the majors as well.

    Nevertheless, here are some ideas for how to slow Zoom down, or get around the issue.
    - Set Zoom to use only a single thread. (the default is two)
    - Index your site from a slow dial up connection (not broadband)
    - Index your site in offline mode from your local hard disk (if possible)
    - Set up your own local copy of Apache, PHP, etc. and index your site locally, then convert the URLs before uploading the index files.
    - Ask your host if they have a low traffic period when they don't mind you loading the server. Then schedule Zoom to run during that period (e.g. 4am).
    - If all the pages on your site are dynamic PHP / ASP pages, use scripting to insert a delay of 500ms before generating the page.
    - Run the indexer on a slow machine. But Zoom is pretty efficient, so it will need to be a rather slow old machine.
    - Tell your hosting company they are a bunch of jerks and that there is no need to refuse requests. There are better ways of managing load, like the Apache module, "mod_throttle" http://www.snert.com/Software/mod_throttle/ that can delay, rather than reject traffic.
    - Look around for tools to limit your bandwidth use, e.g. SoftPerfect Bandwidth Manager (we have never used this product and cannot say how good it is).
    - And did I already mention, get a new host?

    -------
    David



    • #3
      So there is nothing within the indexer to do this...

      I have no intention of looking for a new host. Their rate controls are appropriate for community servers. My hosting provider has been reliable, dependable, flexible and very professional. The likelihood that a user would ever get a 503 is minimal because the rate limiting code is crafted in such a way that a normal user should never get a 503. Only a spider hammering away would ever get the 503.

      Your other suggestions for slowing down Zoom really don't have much appeal, other than possibly inserting a delay in the dynamic code, assuming the Zoom spider identifies itself. Looking at the logs, I can't see that it does, but more investigation is required.

      What I have seen so far with Zoom looks very good, but being able to throttle the spidering seems like a reasonable feature. I have 20+ sites that I would like to use this on, but the lack of a throttle may prevent that.



      • #4
        Coding a delay works.

        The Zoom spider does identify itself, so I was able to easily build a delay into the code generating the pages. It would still be nice if the spider had a configurable throttle, but this is a reasonable workaround.

        Before I buy a professional license I have a few questions. I scanned the documentation but wasn't 100% sure of the answers. Assuming a professional license:

        I have 20+ sites that I want to spider from a single PC running XP. Each site would be spidered daily and the resulting data files uploaded. Sites vary from 20 pages to 1000 pages. This needs to be an automatic process.

        Can I do this?



        • #5
          Yes, Zoom V4 uses the user agent string, "ZoomSpider - wrensoft.com", which you could use to identify which page requests come from Zoom.

          Yes, having a throttle to make Zoom slower is a reasonable request. But we have higher development priorities in the short term. (It is kind of ironic as well, because in the last two versions of the software we spent a lot of development time making the software faster and more efficient).

          Regarding your other questions.
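
For anyone wanting to try the server-side delay workaround, here is a rough sketch of the idea. It is written in Python as a WSGI middleware purely for illustration (the sites in this thread use PHP / ASP, where the same check would go at the top of the page script). The only detail taken from this thread is the "ZoomSpider" user agent text; the class name, the demo app and the 0.5 second delay are assumptions, not anything shipped with Zoom.

```python
import time

ZOOM_AGENT_MARKER = "ZoomSpider"  # substring of the user agent string quoted in this thread
DELAY_SECONDS = 0.5               # assumed value; tune it to the host's rate limit


class SpiderDelayMiddleware:
    """WSGI middleware that pauses before serving requests from the Zoom spider."""

    def __init__(self, app, delay=DELAY_SECONDS, sleep=time.sleep):
        self.app = app
        self.delay = delay
        self.sleep = sleep  # injectable so the delay can be tested without waiting

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if ZOOM_AGENT_MARKER in user_agent:
            # Only the spider is slowed down; normal visitors are served immediately.
            self.sleep(self.delay)
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real page-generating code."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


# Wrap the real application once, at startup.
application = SpiderDelayMiddleware(demo_app)
```

Matching on a substring rather than the full string means the check should keep working if the rest of the user agent text changes between versions.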

          You need a license per installation of the indexer. But you can add search and maintain the search function on many sites, from a single install of the indexer.

          The best way in your case would be to create a series of Zoom configuration files (xxxxx.zcfg) and then schedule them to run during the night. If possible schedule the times to avoid any overlap. (i.e. don't run multiple copies of Zoom on the same PC at the same time).
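
One way to guarantee the runs never overlap is to launch them from a single script that starts each job only after the previous one has exited, and then schedule just that script. The sketch below shows the idea in Python. Note that the command line it builds (indexer path followed by the config file) is a hypothetical placeholder, as are the paths in the usage note; check the Zoom documentation for the actual command-line options.

```python
import subprocess


def build_command(indexer_path, config_path):
    """Build the command line for one indexing run.

    ASSUMPTION: this argument layout is hypothetical; consult the Zoom
    documentation for the real command-line options.
    """
    return [indexer_path, config_path]


def run_all(indexer_path, config_paths, runner=subprocess.run):
    """Run one indexing job per config file, strictly one after another."""
    results = []
    for config_path in config_paths:
        command = build_command(indexer_path, config_path)
        # runner blocks until the process exits, so two jobs never overlap.
        completed = runner(command)
        results.append((config_path, completed.returncode))
    return results
```

A nightly scheduled task could then call something like `run_all(r"C:\path\to\indexer.exe", ["site1.zcfg", "site2.zcfg"])` (both paths are made up here) and log any non-zero return codes.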

          --------
          David

          [ UPDATE 18/June/2007: Zoom V5.1 has now been released with a throttling feature. You can now specify a delay between requests to be sent to the web server. This can be useful when you are crawling a web server which is under heavy load and you wish to minimize any additional load that can be placed on the server during the spidering process. ]
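
To illustrate what a delay between requests means in practice, here is a minimal sketch of a client-side throttle in Python. It is not Zoom's implementation (the note above does not describe how the feature works internally); the class and parameter names are made up for the example.

```python
import time


class RequestThrottle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self.clock = clock                # injectable for testing
        self.sleep = sleep                # injectable for testing
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now
```

A crawler would call `throttle.wait()` immediately before each request. With `RequestThrottle(1.0)` that works out to at most one request per second, which is the rate asked about in the first post.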
