Best Way: Large Indexes?
We're working on indexing travel websites. The biggest problem is simply the sheer number of city/town names (hundreds of thousands) around the world. Each city name is accordingly a unique word to be indexed in addition to other terms.
It appears we might be best off if we created multiple indexes for different geographical regions (Europe, Asia, etc.) or countries. Or should we consider those divisions to be categories?
To help limit the number of sites, we will likely limit the URLs scanned to each site's top page, except for sites we consider excellent, which we would manually allow in to a certain depth. We are also using +/- filtering for travel terms.
Almost forgot: the above are all external sites, but we'll also want to merge in our own site's pages (which we will weight more heavily).
Any suggestions/tips would be appreciated.
PS: We are using the current Enterprise edition of Zoom Search, but are keeping in mind the option of using MasterNode in the future.
The biggest tip is to use the CGI option and don't even attempt to use ASP or PHP. The CGI option is much faster and will support much larger indexes.
Using categories slightly increases the size of the database, so it is not an efficient way to improve performance. Using multiple per-region indexes will reduce the size of each individual set of index files, but then you need something like MasterNode to combine the output from the different indexes.
But the size of the problem depends largely on how many pages / sites you are planning to index. 50,000 pages is a fairly easy job. 500,000 pages requires some planning. 1M+ is a fairly large project.
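To illustrate the multiple-indexes approach, here is a minimal sketch of combining results from several per-region indexes, in the spirit of what a front-end like MasterNode does. This is not Zoom's or MasterNode's actual API; the region names, toy data, and `search_region` stub are all illustrative assumptions.

```python
# Hypothetical federated search over per-region indexes.
# Each "index" here is just a dict mapping a query term to scored pages.

def search_region(region_index, query):
    """Stub for querying one region's index; returns (url, score) pairs."""
    return region_index.get(query, [])

def federated_search(indexes, query, limit=10):
    # Collect hits from every regional index, then merge them by score
    # so the user sees one ranked result list instead of one per region.
    merged = []
    for region, index in indexes.items():
        merged.extend(search_region(index, query))
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:limit]

# Toy data: two small regional indexes with overlapping query terms.
indexes = {
    "europe": {"paris": [("example.com/paris-hotels", 0.9)]},
    "asia":   {"paris": [("example.com/bangkok-to-paris", 0.1)]},
}
print(federated_search(indexes, "paris"))
```

The point of the split is that each regional index stays small enough to search quickly; the cost is this extra merge step at query time.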
Downtime when re-indexing a large collection?
Is there any downtime on the search side of things? If you have a large PDF collection that you re-index nightly, can search still function while the indexer is running? I was thinking about just adding new files to the index, but if old documents are updated, it becomes more work to remove the old version and then add the new one. Still, that might be the plan of attack if there are issues with indexing and searching the old indexes at the same time.
The index files are created and uploaded as ".tmp" files during the indexing process and the uploading process. This means that it will not interfere with the live search while indexing.
When indexing (or uploading) is completed, there will be a few unavoidable seconds of downtime while the files are renamed to the final ".zdat" files, but this is minimal.
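The pattern described above is a standard write-to-temp-then-rename swap. Here is a minimal sketch of the idea, assuming generic filenames; Zoom's actual process differs in detail (e.g. it swaps several files and may go through an upload step).

```python
import os

def publish_index(index_data, final_path):
    """Build a new index file without disturbing the live one."""
    # Write the new index beside the live file under a ".tmp" name,
    # so searches keep using the old file while this one is written.
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(index_data)
    # Then rename it into place. os.replace is atomic on both POSIX
    # and Windows, so a search sees either the old index or the new
    # one, never a half-written file.
    os.replace(tmp_path, final_path)
```

The brief downtime mentioned above comes from that final rename step being the only moment when the live files change.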
OK, that is what I thought was happening, and that is great compared to K2 Verity Search, which forces you to turn search off during indexing. Thanks, Chris.
Which is all rather ironic, as K2 Verity Search should be a significantly better product, given it is 100 times more expensive. I guess you don't always get what you pay for.
The other way to look at it is that you're getting more than what you pay for with Zoom - the value of a product worth 100 times more!