Zoom V5 is looking to be a great enhancement over the existing software. This is a short update on one aspect of the development process for V5 of Zoom.
But before I get into that, I would like to remind everyone that we offer free upgrades for 6 months after a purchase, so if you purchase V4 now, you will receive a free upgrade to V5 when it becomes available.
Over the last couple of weeks we have been looking deeply into the problem of indexing enormous web sites. By enormous, we mean one or more web sites having more than 250,000 pages in total.
At the moment in V4 the indexer requires a fair amount of RAM to index this many pages (around 1.5GB for 250,000 pages). It uses a lot of RAM because it holds part of the index in RAM while it is being built. This gives better indexing speed, provided you have enough RAM. But not having enough RAM made indexing enormous sites impossible. So the challenge was to move some of this data from RAM onto the hard disk without significantly reducing the indexing speed. (Access to the hard disk is at least 10 times slower than access to RAM.)
So our plan was to write additional partial index files to disk during indexing and merge the partial files at the end into a larger index, with the merge process hopefully not taking too long and not using too much RAM.
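This partial-file-and-merge plan is essentially an external merge sort. Here is a minimal sketch of the idea in Python, purely for illustration: the function names and the tab-separated file layout are hypothetical, not Zoom's actual (C++) implementation or index format.

```python
import heapq
import os

def write_partial_index(postings, run_id, out_dir):
    """Write one sorted partial index ('run') to disk, so the
    in-RAM postings list can be discarded and RAM reused."""
    path = os.path.join(out_dir, f"partial_{run_id}.idx")
    with open(path, "w") as f:
        for word, doc_id in sorted(postings):
            f.write(f"{word}\t{doc_id}\n")
    return path

def merge_partial_indexes(paths, merged_path):
    """K-way merge of the sorted runs into one final index file.
    Only one line per run is held in RAM at any moment."""
    files = [open(p) for p in paths]
    with open(merged_path, "w") as out:
        for line in heapq.merge(*files):
            out.write(line)
    for f in files:
        f.close()
```

Because each run is already sorted, the final merge is a single sequential pass over all the partial files, which is why RAM usage stays low regardless of the total index size.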
Today was the first test of this new V5 code. For the first time we successfully indexed 500,000 small HTML documents on an old machine with only 512MB of RAM! A huge improvement on the ~2.5GB that would have been required to do the same thing with V4.
The downside was that writing out and merging the partial indexes on disk added nine minutes to the overall indexing time, which was 56 minutes in total for the 500,000 files.
So we have reduced RAM usage five-fold for this enormous site, at the expense of 16% longer indexing times.
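These figures can be sanity-checked with some quick arithmetic, using the ~2.5GB V4 estimate quoted above:

```python
ram_v4_gb = 2.5            # estimated RAM V4 would need for 500,000 pages
ram_v5_gb = 0.512          # RAM on the old test machine
merge_overhead_min = 9     # extra time writing/merging partial indexes
total_time_min = 56        # total indexing time

ram_reduction = ram_v4_gb / ram_v5_gb                     # just under 5 fold
overhead_pct = 100 * merge_overhead_min / total_time_min  # ~16%
```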
This new code only kicks in when you index more than 65,000 pages. For small sites under this limit there is no impact from this change.
But this is just the first run. With further code optimization and profiling, we hope to get down to only maybe a 5% performance drop while still saving just as much RAM. Even this 5% will probably be offset by optimisations in other areas of the code. So V5 should still be faster overall. We also plan during the next week to push our test scenarios out to 1,000,000 HTML documents on the same old 1.8GHz CPU, 512MB machine.
As I get time I'll write about some of the other aspects of V5.
As we hoped, after further optimization of the code we were able to reduce the merge time from 9 minutes to 71 seconds for the 500K page scenario.
1.6GB of data needed to be read and written during the merge. Doing this in 71 seconds equates to about 22MB/sec, which is getting close to the maximum speed of this hard drive. So this indicates that the code is now close to fully optimised and any further work can only result in very minor gains. Better to move on and spend our time elsewhere now.
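The throughput claim checks out (treating 1.6GB as 1600MB):

```python
data_mb = 1600        # data read and written during the merge
merge_seconds = 71
throughput = data_mb / merge_seconds   # ~22.5 MB/sec, near disk limits
```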
So the merge overhead is now only 2% of the overall indexing time. And this is really a worst case, as this test was done using offline mode. In the alternate scenario, where pages are being downloaded from the web in Spider mode, the merge overhead will drop to less than 0.5%. An excellent result considering the massive capacity gains seen so far.
Now we plan to move on to testing the 1M page scenario. This is a big step and we are looking forward to seeing the results.
Success. We hit a million pages today!
It was close to a best case test scenario, with smallish HTML pages and no outgoing links, but it looks like we should be able to handle 1M 'average' sized documents within 1GB of RAM.
What we did notice however was that the index files have grown to around 2GB in size. This means we are very likely to start hitting 32bit operating system addressing limits (4GB) if we try to further double the number of pages or double the size of each document.
The easiest way to avoid the 4GB pointer limit associated with 32 bits is to switch to 64 bits. The best way to do that is to develop a native 64-bit version of Zoom. This is a lot of work and won't happen overnight, but it will provide a path to the 2M+ document level on a single 64-bit machine.
We'll probably examine this later in the development process, or just after the V5 32bit release.
Testing continued on indexing enormous sites this week (~1M pages). As we half expected the removal of the RAM limitations exposed new limits in the index structure that we hadn't encountered before.
The two main issues we have come across are:
1) The internal file pointers fail once the index files grow to be greater than 4GB in size.
2) The coding we have been using for representing words in the index failed once more than about 1.2M unique dictionary words were encountered. The coding scheme was very efficient for around 50,000 unique words but became much less efficient once we got to around the 1M level.
So we have decided that V5 will need to have a limit of 4GB for any individual index file, corresponding to the address range you can get with 32 bits. We have been adding code to make sure there is a graceful failure once this level is exceeded.
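The graceful-failure guard could look something like this sketch (hypothetical names, in Python rather than Zoom's actual C++):

```python
MAX_INDEX_FILE_BYTES = 2**32 - 1   # 4GB: the range of a 32-bit file offset

class IndexFileTooLargeError(Exception):
    """Raised so indexing stops with a clear error message instead of
    silently wrapping a 32-bit file pointer and corrupting the index."""

def check_index_write(current_size, bytes_to_write):
    """Validate a pending write against the 4GB per-file limit."""
    if current_size + bytes_to_write > MAX_INDEX_FILE_BYTES:
        raise IndexFileTooLargeError(
            "index file would exceed the 4GB limit of 32-bit offsets")
    return current_size + bytes_to_write
```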
Secondly, we have decided that we needed to overhaul the dictionary coding scheme. This was not something we originally planned on and it will probably delay V5 by a week or so, but it will raise the limit from 1.2M unique words to around 16M unique words. Plus it will reduce the size of the index files for large indexes.
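The post doesn't describe either the old or the new coding scheme, but the trade-off involved can be illustrated with a common dictionary technique: variable-length integer (varint) encoding of word IDs, where small IDs stay compact while the scheme still scales well past 16M unique words. This is only an assumed illustration, not Zoom's actual encoding.

```python
def encode_varint(n):
    """Encode a non-negative word ID in 7-bit groups, low bits first.
    IDs under 128 take 1 byte; up to ~2M take 3 bytes; up to ~268M, 4."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data):
    """Inverse of encode_varint: reassemble the 7-bit groups."""
    n = 0
    for shift, byte in enumerate(data):
        n |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:
            return n
```

The point of a scheme like this is that a 50,000-word dictionary pays almost nothing, while a 16M-word dictionary still works, just with slightly longer codes.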
It should be noted that there are only about 50,000 words in the English language, so getting to the 1M+ level is a fairly extreme case. (If the same word is used many times in many documents, it is still only 1 unique word as far as the index is concerned)
I hope you don't mind me posting here....
The above sounds really great!!!!
Can you list any new features you are striving for in this version (besides the speed and how much data it can index)?
There will be lots of new stuff. Over the next few weeks I'll post more details on image indexing, mp3 indexing, incremental indexing, image thumbnails, XML output and more on search speed improvements.
There are also some features for which we have not determined the final specification as yet, including enhanced categories, enhanced support for mixed character sets, and debug logging.
At some point we'll also make a comprehensive list; the list above is far from complete.
WOW! Sounds sweet!! I am very much looking forward to the new version!
Any chance you'll separate the crawler and indexing process? Currently if a large remote index job fails, everything must be downloaded again, and for two hundred thousand documents it takes about 3 days.
If it is taking you 3 days for 200K pages then this is less than 1 document per second. I assume this is because the remote server is very slow? I would investigate why it is so slow. In our indexing benchmarks we get between 2.6 and 10 pages per second. How many threads are you running? How much RAM is in your machine?
Then I would investigate the cause of your 'failure' and try and get to the bottom of whatever is causing the trouble.
V5 should help in a few ways. Indexing is quicker and uses less RAM (but if the remote server is the problem, this won't help). We are working on incremental indexing which will help some large sites. This has the potential to avoid downloading a lot of files for some sites.
No, it is not really possible to have the crawler and indexing process run at different times. What you are implying is having a massive cache of downloaded files, which would take up a huge amount of disk space and be much slower with all the disk activity.
As to progress with testing V5 on large sites: we have hit a surprising number of different limits, both internal to Zoom and operating system limits. Yesterday's problem was inefficient searching of dictionary words once we got past the 1M unique word level. Today's problem was hitting the 2GB virtual memory limit in Windows before we ran out of physical RAM (at 1.4M unique words and 300,000 pages).
We have a fix for both these problems via some sophisticated hash tables and virtual memory management but it is more coding and more testing. We are hoping to have a new beta done with this new large site code later this week.
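The actual hash tables in Zoom aren't described here, but the basic idea of replacing a search through the dictionary words with constant-time hashed lookup can be sketched. Python's built-in dict is itself a hash table, so a hypothetical word-interning helper is enough to show the pattern:

```python
class Dictionary:
    """Word -> ID interning with O(1) average lookup, instead of
    searching through 1M+ unique words for every word indexed."""

    def __init__(self):
        self._ids = {}       # hash table: word -> integer ID
        self._words = []     # ID -> word, for reverse lookup

    def intern(self, word):
        """Return the existing ID for word, or assign the next one."""
        word_id = self._ids.get(word)
        if word_id is None:
            word_id = len(self._words)
            self._ids[word] = word_id
            self._words.append(word)
        return word_id
```

With a structure like this, lookup cost stays roughly flat as the dictionary grows, which is exactly the property that matters once the unique-word count passes the 1M level.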
[Update]: There is now also this FAQ for indexing enormous sites.
Who wrote the million HTML pages?
Sounds really good, although even my biggest clients' sites are only a few hundred pages except for those dynamically created by PHP and MySQL...