Spider load throttling and server load while indexing

  • Spider load throttling and server load while indexing

    This is a copy of a support E-Mail that others might find useful.

    =================================
    Customer
    =================================
    Today, I want to direct your attention to your SPIDER SPEED.
    I bet most people writing to you about spidering want to make
    things faster. In contrast, I am suggesting that you build into
    Zoom 5.1 some mechanism to make your spider run SLOWER,
    sometimes considerably slower.

    Let me explain:
    In a scenario where the Zoom owner runs the spider against
    their own dedicated web server, maximum speed is naturally
    desirable.

    However, in a shared-hosting scenario, and even more so when
    running your program against someone else's server, the sheer
    number of pages accessed in rapid succession may:

    a) alarm the webmaster that the site is under attack
    b) on a shared server, have the ISP cancel the site entirely
    c) on larger, more sophisticated sites, trigger their anti-attack
    countermeasures and get your IP address blocked,
    and possibly have other undesirable effects.

    I am sure you know that large spiders like Google, Yahoo and
    others access web sites rather slowly, meaning with pauses
    between pages ranging from 1 second to more than 20.
    When I discussed the subject with Amazon.com engineers (not that
    I would like to index their site, but I know them well and they
    have a LOT of insight), I was told their site would block my
    access in a heartbeat; to be precise, after the first 500 pages
    accessed in rapid succession. The access blockage is triggered
    automatically.

    It is not the number of pages that would get me in trouble,
    it is the speed of the access. Slowing down my access to, say,
    one page every 5 seconds would let me index their entire site
    if I so desired - but your program does not make that possible.

    <snip>

    I can imagine at this point you will probably say I can reduce
    the number of threads all the way to 1. I think (don't hold me
    to it) I saw somewhere in your FAQs that setting the number of
    threads to 1 makes your indexer behave like 1 regular site user.

    <snip>

    I would suggest giving the user a sliding scale from
    zero (0) to 300, where zero would insert no pause at all
    while the maximum of 300 would insert a 30-second pause.
    In other words, the user could fine-tune your spider speed
    in increments of 1/10 of a second, from zero to 30 secs.
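
    To make the idea concrete, here is a minimal sketch (plain
    Python, purely illustrative - none of these names come from
    Zoom) of how such a slider value could translate into a pause
    between page requests:

        # Proposed throttle slider: a value of 0..300 maps to a
        # pause of 0.0..30.0 seconds (tenths of a second) inserted
        # between page requests.
        import time
        import urllib.request

        def crawl(urls, throttle_setting):
            delay = max(0, min(int(throttle_setting), 300)) / 10.0  # seconds
            for url in urls:
                with urllib.request.urlopen(url) as response:
                    page = response.read()
                # ... hand `page` to the indexer here ...
                if delay:
                    time.sleep(delay)  # pause before the next request

        # e.g. crawl(start_urls, 50) would wait 5 seconds between pages.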


    =================================
    Response
    =================================
    Yes, you are correct that Zoom can place a reasonable amount of load on a server if you are using a lot of threads.

    And yes, a throttle control is already on our to-do list, probably for a 5.1 release, but that isn't certain yet as V5.1 has not been defined yet.

    However, I would argue that the current situation isn't nearly as bad as you think it is, because:

    1) Servers that are under load start to report errors, and on the vast majority of sites we don't see any errors, even with 10 threads running. In fact, I think you would be hard pressed to find a site that does have a problem. Can you point to an example?

    2) The load from 1 PC running Zoom is a drop in the bandwidth ocean for big sites like Amazon.

    3) Google and other big engines have 100,000+ machines that could hit and disable a site, so they need to be careful. Zoom only runs on 1 PC and so there is a natural limit to how much load you can generate from a single PC.

    4) There is an argument that you are better off hitting a site harder at a known time (e.g. 4am) rather than adding more background load to your busy period.

    5) Amazon certainly doesn't block in a 'heartbeat', and doesn't block after 500 hits either. We did a test and got to over 600 hits (with 1 thread) without any blocking. To be fair, a lot of these hits just resulted in Amazon redirecting the HTTP request to another page. But they aren't as draconian as their engineers make out.

    6) Servers naturally throttle their own load. In the Amazon example above we only got about 1 hit per second. If your site allows 30 hits/second, it can probably deal with that kind of load from 1 PC without a significant problem.

    7) Google and some of the other big engines do hit a site multiple times per second (at least that's what our logs show). But I agree they are significantly slower than Zoom.

    8) A reasonable server can serve 20 to 100 HTTP requests per second, depending on caching, page type, etc. We get a peak of 83 pages/sec from our server. But I am betting the figures you are seeing with 1 thread from Zoom are more like 2 to 3 pages per second (~5% of total capacity).
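
    As a rough sanity check of that ~5% figure, using mid-range values (an illustration only, not anything measured by Zoom):

        server_capacity_pps = 50    # pages/sec (middle of the 20-100 range)
        zoom_rate_pps = 2.5         # pages/sec (middle of the 2-3 range)
        print(zoom_rate_pps / server_capacity_pps)  # 0.05, i.e. about 5% of capacity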

    In 6 years of selling Zoom, we have never had any reports of an IP address being blocked, nor of anyone's account being cancelled by an ISP.

    Of course, you can present it in such a way as to make it sound much scarier than it is.

    This is not to say you don't have a valid point - you do, and we will address the issue. But it doesn't cause nearly as many problems in the real world as you might imagine.


  • #2
    What would happen if ZoomSpider received a 503 error?

    I work for an ISP which does shared web hosting. We have vhost servers that host up to 250 vhosts with CGI/PHP functionality. Sometimes these machines experience unusually heavy load that results in excessive child spawning (Apache 1.3), swap usage, CPU load and, in some cases, the crashing of the server. These episodes usually correspond to poorly behaving spiders crawling our customers' poorly written PHP/CGI scripts that have a MySQL backend (in some cases our MySQL server also experiences heavy load as a result).

    In an effort to prevent these demi-service attacks from occurring, we are looking to throttle clients that make too many requests at once to our servers. In the testing phase of this project, the only legitimate spider we've seen so far that exceeds our thresholds has been the ZoomSpider. While we'd like our customers to be able to spider their sites as they please, we also have to give consideration to the stability of the server. To this end, I'm here to see if it's possible to throttle ZoomSpider as the poster suggested. Since this is not yet possible in v5.0, how would the spider behave if it encountered a 503 error (the error we throw when we throttle)? Will it come back another time?
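
    For reference, this is roughly the behaviour we would hope to see from any spider that hits our 503 (a generic Python sketch, not a claim about how ZoomSpider actually behaves):

        import time
        import urllib.request
        import urllib.error

        def polite_fetch(url, max_retries=3, default_backoff=60):
            """Fetch a URL, backing off whenever the server answers 503."""
            for attempt in range(max_retries + 1):
                try:
                    with urllib.request.urlopen(url) as response:
                        return response.read()
                except urllib.error.HTTPError as err:
                    if err.code != 503 or attempt == max_retries:
                        raise
                    # Honour Retry-After if the server sent it, otherwise guess.
                    retry_after = err.headers.get("Retry-After")
                    wait = int(retry_after) if retry_after and retry_after.isdigit() else default_backoff
                    time.sleep(wait)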

    It looks like for the time being we can suggest our customers use 1 thread.

    Regards,

    Eric Waters



    • #3
      This is a timely post

      We have just finished work on implementing the Spider Throttling feature, which will be included in the next build (Version 5.0 build 1009), most likely to be released in the next few weeks. This will allow users to throttle their spider crawling and lighten the load placed on their servers. They will be able to insert a delay between each HTTP request ranging from as little as 0.2 seconds to as much as 15 seconds.

      In the meantime, yes, limiting the indexer to use a single thread will be the best alternative.
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine



      • #4
        In the end the feature was included in Version 5.1 of Zoom, which has now been released (15/June/07). You can now specify a delay between the page requests sent to the web server. This is useful when you are crawling a web server that is under heavy load and you wish to minimize the additional load placed on it during the spidering process. The throttle can be adjusted from the 'general' tab of the Zoom configuration window.



        • #5
          Throttling for multiple sites

          Hi,

          I've just started using Zoom Search Engine, and I'm really impressed by the accuracy of the search results it returns and by its speed.

          I do have two suggestions which I hope you will consider for a future release:

          Firstly: I work with a number of partner sites and need to add all of them to a single index. However, the sites are crawled in sequence, so with the throttling set to a very polite 15 seconds it takes a very long time to do a complete crawl of all the sites.

          The crawl could complete much more quickly if the sites were accessed in parallel with each other, whilst still observing the throttle on a per site basis.

          Secondly, an equally useful feature would be the ability to merge two sets of index files into one. With this feature, whenever I add a new partner site to my list, I could run a separate crawl of the new site and then merge the results into the original index, without having to recrawl all the other sites at the same time.

          Good luck with your fantastic product, and please keep up the good work supporting it.

          Thanks.



          • #6
            Parallel indexing of different web sites

            Yes, using the throttling option will significantly slow down the indexing process. It also largely negates the speed benefit of using multiple threads.

            We agree that parallel indexing of multiple sites would speed things up. But we haven't done it because a) it is technically complex to implement, b) most of our users are only indexing a small number of sites (often just their own site) and c) it would be more resource intensive, with more HTTP sessions being held open and internal memory structures being duplicated for things like tracking which pages on which sites have already been indexed.
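
            For illustration only, a bare-bones sketch (Python, not Zoom's actual code) of per-site parallel crawling with a per-site throttle shows the duplicated state involved - every site needs its own thread, its own queue and visited set, and its own delay timer:

                # Generic illustration (not Zoom's implementation).
                import threading
                import time
                import urllib.request

                def crawl_site(start_url, delay_seconds):
                    queue, visited = [start_url], set()
                    while queue:
                        url = queue.pop()
                        if url in visited:
                            continue
                        visited.add(url)
                        with urllib.request.urlopen(url) as response:
                            page = response.read()
                        # ... extract links from `page` and append new ones to `queue` ...
                        time.sleep(delay_seconds)  # per-site throttle

                # Hypothetical partner sites, one crawl thread each:
                sites = ["http://partner-a.example/", "http://partner-b.example/"]
                threads = [threading.Thread(target=crawl_site, args=(s, 15.0)) for s in sites]
                for t in threads:
                    t.start()
                for t in threads:
                    t.join()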



            • #7
              Add new sites to an existing index

              For the 2nd part of your question, you can add new sites to an existing index using the incremental indexing options.

              In the Zoom menu options, select
              Index / Incremental indexing / Add start points.



              • #8
                Thanks

                OK, no problem. I'll try the incremental update, thanks.
