V5 development progress - Indexing enormous sites

  • V5 development progress - Indexing enormous sites

    Zoom V5 is looking to be a great enhancement over the existing software. This is a short update on one aspect of the development process for V5 of Zoom.

    But before I get into that, I would like to remind everyone that we offer free upgrades for 6 months after a purchase, so if you purchase V4 now, it will be a free upgrade to V5 when it becomes available.

    Over the last couple of weeks we have been looking deeply into the problem of indexing enormous web sites. By enormous, we mean one or more web sites having more than 250,000 pages in total.

    At the moment in V4 the indexer requires a fair amount of RAM to index this many pages (around 1.5GB for 250,000 pages). It uses a lot of RAM because it holds part of the index in RAM while it is being built. This gives better indexing speed, provided you have enough RAM. But not having enough RAM made indexing enormous sites impossible. So the challenge was to move some of this data from RAM onto the hard disk without significantly reducing the indexing speed. (Access to the hard disk is at least 10 times slower than access to RAM.)

    So our plan was to write additional partial index files to disk during indexing and merge the partial files at the end into a larger index, with the merge process hopefully not taking too long and not using too much RAM.
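
    As an aside, for anyone curious what such a merge looks like, here is a minimal sketch (our own illustration, not Zoom's actual code), assuming each partial index file holds its postings as text lines already sorted by word. A priority queue keeps just one pending line per partial file in RAM while merging, which is what keeps the memory cost of the merge low.

      #include <fstream>
      #include <functional>
      #include <queue>
      #include <string>
      #include <vector>

      struct HeapItem {
          std::string line;  // one posting, e.g. "word<TAB>docId"
          size_t      file;  // index of the partial file it came from
          bool operator>(const HeapItem& other) const { return line > other.line; }
      };

      // k-way merge of sorted partial index files into one merged file,
      // holding only one pending line per input file in memory.
      void mergePartialIndexes(const std::vector<std::string>& partialPaths,
                               const std::string& mergedPath)
      {
          std::vector<std::ifstream> inputs;
          for (const auto& path : partialPaths)
              inputs.emplace_back(path);

          std::priority_queue<HeapItem, std::vector<HeapItem>,
                              std::greater<HeapItem>> heap;

          // Prime the heap with the first posting from every partial file.
          for (size_t i = 0; i < inputs.size(); ++i) {
              std::string line;
              if (std::getline(inputs[i], line))
                  heap.push({line, i});
          }

          std::ofstream out(mergedPath);
          while (!heap.empty()) {
              HeapItem smallest = heap.top();
              heap.pop();
              out << smallest.line << '\n';          // emit the smallest posting
              std::string next;
              if (std::getline(inputs[smallest.file], next))
                  heap.push({next, smallest.file});  // refill from the same file
          }
      }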

    Today was the first test of this new V5 code. For the first time we successfully indexed 500,000 small HTML documents on an old machine with only 512MB of RAM! This is a huge improvement on the ~2.5GB that would have been required to do the same thing with V4.

    The downside was that writing out and merging the partial indexes on disk added nine minutes to the overall indexing time, which came to 56 minutes in total for the 500,000 files.

    So we have reduced RAM usage 5-fold for this enormous site, at the expense of 16% longer indexing times.

    This new code only kicks in when you index more than 65,000 pages. For small sites under this limit there is no impact from this change.

    But this is just the first run. With further code optimization and profiling, we hope to get down to maybe only a 5% performance drop while still saving just as much RAM. Even this 5% will probably be offset by optimisations in other areas of the code, so V5 should still be faster overall. During the next week we also plan to push our test scenarios out to 1,000,000 HTML documents on the same old 1.8GHz CPU, 512MB machine.

    As I get time I'll write about some of the other aspects of V5.

    -----
    David

  • #2
    As we hoped, after further optimization of the code we were able to reduce the merge time from 9 minutes to 71 seconds for the 500K page scenario.

    1.6GB of data needed to be read and written during the merge. Doing this in 71 seconds equates to around 22MB/sec, which is getting close to the maximum speed of this hard drive. So this indicates that the code is now close to fully optimised and any further work can only result in very minor gains. Better to move on and spend our time elsewhere now.

    So the merge overhead is now only 2% of the overall indexing time. And this is really a worst case, as this test was done using offline mode. In the alternate scenario, where pages are being downloaded from the web in Spider mode, the merge overhead will drop to less than 0.5%. An excellent result considering the massive capacity gains seen so far.

    Now we plan to move on to testing the 1M page scenario. This is a big step and we are looking forward to seeing the results.
    -----
    David

    • #3
      Success. We hit a million pages today!

      It was close to a best case test scenario, with smallish HTML pages and no outgoing links, but it looks like we should be able to handle 1M 'average' sized documents within 1GB of RAM.

      What we did notice however was that the index files have grown to around 2GB in size. This means we are very likely to start hitting 32bit operating system addressing limits (4GB) if we try to further double the number of pages or double the size of each document.

      The easiest way to avoid the 4GB pointer limits associated with 32bits is to switch to 64bits. The best way to do that is to develop a native 64bit version of Zoom. This is a lot of work and won't happen overnight, but will provide a path to get to the 2M+ document level on a single 64bit machine.

      We'll probably examine this later in the development process, or just after the V5 32bit release.

      ---
      David

      • #4
        Testing continued on indexing enormous sites this week (~1M pages). As we half expected, the removal of the RAM limitations exposed new limits in the index structure that we hadn't encountered before.

        The two main issues we have come across are:

        1) The internal file pointers fail once the index files grow to be greater than 4GB in size.

        2) The coding we have been using for representing words in the index failed once more than about 1.2M unique dictionary words were encountered. The coding scheme was very efficient for around 50,000 unique words but became much less efficient once we got to around the 1M level.

        So we have decided that V5 will need to have a limit of 4GB for any individual index file, corresponding to the address range you can get with 32 bits. We have been adding code to make sure there is a graceful failure once this level is exceeded.
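
        As a rough illustration of what a graceful failure can look like (a hedged sketch with made-up names, not the actual Zoom code), the idea is simply to check the would-be file size against the 32-bit ceiling before writing, and stop with a clear error rather than letting a 32-bit offset silently wrap around:

          #include <cstdint>
          #include <stdexcept>

          // 2^32 - 1: the largest offset a 32-bit file pointer can address.
          constexpr std::uint64_t kMaxIndexFileBytes = 0xFFFFFFFFull;

          // Hypothetical guard called before appending to an index file.
          void checkIndexFileGrowth(std::uint64_t currentSize, std::uint64_t bytesToAppend)
          {
              if (currentSize + bytesToAppend > kMaxIndexFileBytes)
                  throw std::runtime_error(
                      "Index file would exceed the 4GB limit for this index format; "
                      "indexing stopped gracefully instead of corrupting the index.");
          }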

        Secondly, we have decided that we need to overhaul the dictionary coding scheme. This was not something we originally planned on and it will probably delay V5 by a week or so, but it will raise the limit from 1.2M unique words to around 16M unique words. It will also reduce the size of the index files for large indexes.
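
        To illustrate the kind of trade-off involved (again just a sketch under our own assumptions, not the real dictionary format), a variable-length integer coding stays at 1-2 bytes for small word IDs but can still represent IDs in the millions by spending extra bytes only when needed:

          #include <cstdint>
          #include <vector>

          // Encode a word ID as a LEB128-style varint: 7 data bits per byte,
          // with the high bit set on every byte except the last.
          // IDs up to 127 take 1 byte, up to ~16K take 2 bytes,
          // up to ~2M take 3 bytes, and up to ~268M take 4 bytes.
          std::vector<std::uint8_t> encodeWordId(std::uint32_t id)
          {
              std::vector<std::uint8_t> bytes;
              do {
                  std::uint8_t b = id & 0x7F;
                  id >>= 7;
                  if (id != 0)
                      b |= 0x80;         // more bytes follow
                  bytes.push_back(b);
              } while (id != 0);
              return bytes;
          }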

        It should be noted that a typical English dictionary only contains around 50,000 words, so getting to the 1M+ level is a fairly extreme case. (If the same word is used many times in many documents, it is still only 1 unique word as far as the index is concerned.)

        -----
        David

        • #5
          I hope you don't mind me posting here...

          The above sounds really great!!!!

          Can you list any new features you are striving for in this version (besides the speed and how much data it can index)?
          ____________________________
          Terry Remsik

          • #6
            There will be lots of new stuff. Over the next few weeks I'll post more details on image indexing, mp3 indexing, incremental indexing, image thumbnails, XML output and more on search speed improvements.

            There are also some features for which we have not determined the final specification as yet, including enhanced categories, enhanced support for mixed character sets, and debug logging.

            At some point we'll also make a comprehensive list; what's above is far from being one.

            -----
            David

            • #7
              WOW! Sounds sweet!! I am very much looking forward to the new version!
              ____________________________
              Terry Remsik

              • #8
                Any chance you'll separate the crawler and indexing process? Currently if a large remote index job fails, everything must be downloaded again, and for two hundred thousand documents that takes about 3 days.

                • #9
                  If it is taking you 3 days for 200K pages then this is less than 1 document per second. I assume this is because the remote server is very slow? I would investigate why it is so slow. In our indexing benchmarks we get between 2.6 and 10 pages per second. How many threads are you running? How much RAM is in your machine?

                  Then I would investigate the cause of your 'failure' and try and get to the bottom of whatever is causing the trouble.

                  V5 should help in a few ways. Indexing is quicker and uses less RAM (but if the remote server is the problem, this won't help). We are working on incremental indexing which will help some large sites. This has the potential to avoid downloading a lot of files for some sites.

                  No, it is not really possible to have the crawler and indexing process run at different times. What you are implying is having a massive cache of downloaded files, which would take up a huge amount of disk space and be much slower with all the disk activity.

                  As to progress with testing V5 on large sites: we have hit a surprising number of different limits, both internal to Zoom and in the operating system. Yesterday's problem was inefficient searching of dictionary words once we got past the 1M unique word level. Today's problem was hitting the 2GB virtual memory limit in Windows before we ran out of physical RAM (at 1.4M unique words and 300,000 pages).

                  We have a fix for both of these problems via some sophisticated hash tables and virtual memory management, but it means more coding and more testing. We are hoping to have a new beta done with this new large-site code later this week.
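
                  For readers wondering what a hash table buys here, the sketch below (our own illustration with a hypothetical Dictionary class, not Zoom's internal structure) shows the basic idea: a hash map keyed on the word keeps lookups roughly constant-time even when the dictionary grows well past the 1M unique word level.

                    #include <cstddef>
                    #include <cstdint>
                    #include <string>
                    #include <unordered_map>

                    // Hypothetical dictionary: maps each unique word to a dense numeric ID.
                    class Dictionary {
                    public:
                        // Return the existing ID for 'word', or assign the next free one.
                        std::uint32_t idFor(const std::string& word)
                        {
                            auto it = words_.find(word);
                            if (it != words_.end())
                                return it->second;
                            std::uint32_t id = static_cast<std::uint32_t>(words_.size());
                            words_.emplace(word, id);
                            return id;
                        }

                        std::size_t uniqueWords() const { return words_.size(); }

                    private:
                        std::unordered_map<std::string, std::uint32_t> words_;  // word -> ID
                    };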

                  -----
                  David

                  [Update]: There is now also this FAQ for indexing enormous sites.

                  • #10
                    Who wrote the million HTML pages?

                    Sounds really good, although even my biggest clients' sites are only a few hundred pages except for those dynamically created by PHP and MySQL...

                    • #11
                      I'm using 4GB of RAM on Windows XP Pro, with a 3.2GHz Intel CPU.

                      Yes, Windows does provide 2GB of virtual address space to programs and keeps 2GB for kernel-mode processes, but it still provides the full RAM address space. Regardless, I wonder if the /3GB startup switch would help?

                      I'm indexing pages from 72 thousand different domains, set to 3 pages each. The number of crawlers is set to 4. This is because your crawler will index one domain at a time before moving on to the next; 10 crawlers in this scenario would be wasted because of the one-domain-at-a-time nature of the Zoom crawler. Bandwidth is a 5 megabit cable download.

                      Using Zoom 4.2 (Build: 1013) Professional.

                      Would love to test your new beta …

                      • #12
                        Sorry, reported wrong version, it's Zoom 4.3 (Build: Beta 6b)

                        • #13
                          Yes, indexing 2 pages from each of 50 different domains is slower than indexing 50 pages from each of 2 domains (all else being equal). It is necessary to set up a new HTTP session for each new domain, which costs some time. And as you say, the benefit of threading is reduced when there are only a few pages to work on at any one time.

                          "Who wrote the million HTML pages?"
                          We wrote an application that generates random-sized HTML pages with random dictionary words and random links between the files. But the English dictionary doesn't contain enough words, so we had to start making up some words as well to get to >1M unique words. We are using this for a lot of in-house testing with localhost.
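
                          A toy version of such a generator might look like the sketch below (our own illustration with made-up word lists and file names, not the actual test tool): it writes a set of HTML files, each containing random words plus a few random links to the other generated files.

                            #include <fstream>
                            #include <random>
                            #include <string>
                            #include <vector>

                            int main()
                            {
                                const int pageCount = 1000;  // scale up towards 1M for real tests
                                const std::vector<std::string> words = {
                                    "alpha", "bravo", "charlie", "delta", "echo", "zqxwv"  // incl. made-up words
                                };

                                std::mt19937 rng(12345);
                                std::uniform_int_distribution<int> wordPick(0, (int)words.size() - 1);
                                std::uniform_int_distribution<int> lengthPick(50, 500);  // words per page
                                std::uniform_int_distribution<int> linkPick(0, pageCount - 1);

                                for (int page = 0; page < pageCount; ++page) {
                                    std::ofstream out("page" + std::to_string(page) + ".html");
                                    out << "<html><body>\n";
                                    int length = lengthPick(rng);
                                    for (int w = 0; w < length; ++w)
                                        out << words[wordPick(rng)] << ' ';
                                    // a few random links between the generated files
                                    for (int l = 0; l < 5; ++l)
                                        out << "\n<a href=\"page" << linkPick(rng) << ".html\">link</a>";
                                    out << "\n</body></html>\n";
                                }
                                return 0;
                            }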

                          More limitations were found and fixed yesterday. The most interesting and unexpected one was our use of CRC-32 (Cyclic Redundancy Check, 32-bit) for detecting duplicate pages via their URLs and/or content.

                          A careful examination of our 1M page index showed that we didn't in fact index 1M files. For some reason we had indexed only 999,994 files; six files were skipped and missing from the index. It turned out that with 1M+ files we were starting to see significant collision rates in the CRC algorithm that didn't happen with fewer files, i.e. several files had the same CRC-32 but different content!

                          There is a nice summary of the CRC collision problem here.

                          The solution is to switch to CRC-64, but there is a RAM usage and speed trade-off as we need to store and search millions of these values. In the end we decided it was better to use slightly more RAM in this area and avoid around 1 in 200,000 pages being silently dropped at random.

                          With CRC-64, theory predicts that we should only have a collision every 2 trillion pages or so (which is much more acceptable), assuming we have done a good job on the CRC algorithm.
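
                          For anyone who wants to check the ballpark maths themselves, the standard birthday approximation puts the expected number of colliding pairs at roughly p(p-1)/2^(n+1) for p pages and an n-bit checksum. The little program below (our own back-of-envelope, not Wrensoft's exact figures; real drop counts also depend on how the URL and content checksums are combined) shows how dramatically that expectation falls when going from 32 to 64 bits:

                            #include <cmath>
                            #include <cstdio>

                            int main()
                            {
                                const double pages = 1.0e6;  // roughly the size of the test index
                                const int widths[] = {32, 64};
                                for (int bits : widths) {
                                    // Birthday approximation: expected colliding pairs ~= p(p-1) / 2^(n+1)
                                    double expectedPairs = pages * (pages - 1.0) / std::pow(2.0, bits + 1);
                                    std::printf("CRC-%d: ~%g expected colliding pairs among %.0f pages\n",
                                                bits, expectedPairs, pages);
                                }
                                return 0;
                            }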

                          We really thought we would have finished development of this 'indexing enormous sites' feature a couple of weeks back, but it continues to surprise. On the plus side, however, most of the changes we are making will improve indexing speed and resource usage for small sites as well as large ones.

                          E-Mail us if you want to try the next beta (when it is done).

                          -----
                          David

                          • #14
                            I just installed the V5 beta and noticed that the limits are not adjustable. I have a very active site with over 200,000 different pages of real content in a good mix of types like HTML, PDF, DOC, TXT, etc. I have been using Zoom now for a couple of years, and many of the new features in V5 are ones I have been waiting for. I could do some extensive testing for you if you wish. Let me know.

                            • #15
                              You can upgrade to V5 now and receive a key which will work for the beta as well as the final release.

                              See details for upgrading to V5 here:
                              http://www.wrensoft.com/forum/showthread.php?t=1124

                              Contact us by email if you have any questions.
                              --Ray
                              Wrensoft Web Software
                              Sydney, Australia
                              Zoom Search Engine
