Indexing large systems and strategy


  • Indexing large systems and strategy

    I know I have touched on this subject before, and that other posts cover aspects of it, but I'm still not content with how this is working. I accept that some of this may come down to strategy, my lack of understanding of best practice with ZOOM, or the way ZOOM works as opposed to the way I'd like it to work. Make no mistake, though: I'm a big fan of ZOOM.

    It seems to me there are several barriers to indexing large systems with ZOOM despite Wrensoft's performance claims. That's not a criticism: I accept the metrics, but it's the practicality of taking ZOOM that far that is an issue for me. Here are the issues, questions, concerns and suggestions I have:

    Changes to your .cfg file - I know you can add new starting points and keep going, which is great, but invariably when you add new starting points you are inclined to tweak your settings, in which case ZOOM requires you to start all over again. I usually tweak the settings to improve search results or to reduce indexing load and time. On large operations, redoing the index entirely is a pain and a great consumer of resources.

    System disruption - if something goes pear-shaped during a long indexing operation (e.g. loss of connectivity or an app crash, the most common one for me), you have the joy of starting all over again. It would be nice if, on longer operations, ZOOM saved in the background, say every 10 minutes or so, with something like an FTP resume so that it picks up from the last save.

    Maximum file size - this is a real pain for me. In some cases I'm indexing (or trying to index) files, usually PDFs, way beyond ZOOM's default limit. I have played with various settings, but that's not the real issue. It looks like ZOOM downloads the file up to the maximum configured file size and, once that is reached, ditches it. Is that right? If so, this is a great waste of resources, and ultimately the document is not in the index in any way. It would work better if ZOOM indexed the file up to the maximum size, if that's possible. Alternatively, if the maximum size is reached (or preferably detected before download), it could index the first X pages of the document. This is actually a good strategy: if you're indexing a lot of vintage magazines, as I am, the contents are usually listed within the first 10 pages or so, so you pick up a "snapshot" of the magazine in your index rather than it being ditched and not indexed at all.

    I did have a tinker with MasterNode, creating smaller indexes and then using MasterNode to search them collectively, but I didn't particularly like it because of the trade-off in features and the need to synchronize several .cfg files. As an interim solution I have deployed an ad-free Google Custom Search, but I prefer the level of control that ZOOM gives me over the UX, what is indexed and so on, so I want to get this working.

    Thanks folks!!!

  • #2
    As pointed out on the FAQ page for indexing large sites, indexing huge amounts of data can be a big job. There isn't any getting around it.

    ...you might be inclined to tweak your settings in which case ZOOM requires you to start all over again.
    Correct. There are many instances where changing settings results in changes to the structure and content of the index files. For example, if you added a new file type to be indexed, maybe .PDF files, then of course the indexer needs to rescan the sites previously indexed to index all the PDF files.

    The solution to this is to, where possible, finalize your settings with a small data set (e.g. 500 files) before moving on to indexing the full data set.

    System disruption - if something goes pear shaped
    For very large indexes it can take a couple of minutes to save the index, so saving the index every 10 minutes would be a significant overhead.

    You can script the incremental indexing of each site. It doesn't provide any mechanism for automatic recovery and it will be slower than indexing everything in one go, but you will have a partial index in the case of problems.
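
    A minimal sketch of what such a script could look like, assuming one configuration file per site. The command template is a placeholder, not a real Zoom command line: the executable name and switches must be taken from the Zoom documentation. `build_command` and `index_sites` are hypothetical helper names.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List, Sequence


def build_command(template: Sequence[str], config: Path) -> List[str]:
    """Substitute the config path for the "{config}" placeholder.

    ASSUMPTION: the template (executable and switches) is supplied by you;
    nothing here reflects the real indexer command line.
    """
    return [part.replace("{config}", str(config)) for part in template]


def index_sites(
    configs: Iterable[Path],
    template: Sequence[str],
    run: Callable[[List[str]], int] = subprocess.call,
) -> List[Path]:
    """Run the indexer once per config, in order, and return those that finished.

    Stops at the first non-zero exit code, so whatever was indexed before
    the failure is kept and the run can be restarted from the failed config.
    """
    done: List[Path] = []
    for config in configs:
        if run(build_command(template, config)) != 0:
            break
        done.append(config)
    return done
```

    This does not recover automatically. It only records which sites completed, so a rerun can skip them.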

    loss of connectivity
    It doesn't really make sense to run a major search engine over an unreliable internet connection. Can you run the indexer on the same machine that hosts the site, or on a machine at the same hosting company? That would completely remove the problem of an intermittent internet connection.

    app crash
    I don't know if you are referring to Zoom or some app you have running on your server. But if Zoom is crashing you should let us know. Zoom should never crash (assuming your hardware is stable).

    Maximum file size
    Can you give us an example URL for one of these large files?
    Some servers correctly report the file size before any download starts. Others don't report a file size, so the client doesn't know the file size until after the file is downloaded.
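
    To see which case a given server falls into, a HEAD request can be sent and its `Content-Length` header inspected. The sketch below uses only the Python standard library; `declared_size`, `exceeds_limit` and `head_headers` are hypothetical helper names, and nothing here describes how Zoom itself performs the check.

```python
import urllib.request
from typing import Mapping, Optional


def declared_size(headers: Mapping[str, str]) -> Optional[int]:
    """Return the Content-Length in bytes, or None if absent or malformed."""
    lowered = {key.lower(): value for key, value in headers.items()}
    value = lowered.get("content-length")
    if value is None:
        return None
    try:
        size = int(value)
    except ValueError:
        return None
    return size if size >= 0 else None


def exceeds_limit(headers: Mapping[str, str], max_bytes: int) -> Optional[bool]:
    """True/False when the server reports a size, None when it does not."""
    size = declared_size(headers)
    return None if size is None else size > max_bytes


def head_headers(url: str, timeout: float = 10.0) -> Mapping[str, str]:
    """Issue a HEAD request (no body is downloaded) and return the headers."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())
```

    A `None` result from `exceeds_limit` corresponds to the second case above: the size is only known once the download has finished.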

    If you are mostly indexing PDF files, can you use offline mode? It would solve many of your problems.

    I did have a tinker with MasterNode
    We stopped development on MasterNode. There wasn't enough interest in federated search or indexing truly huge datasets.


    • #3
      Originally posted by kpa
      ...

      Changes to your .cfg file - I know you can add new starting points and keep going, which is great, but invariably when you add new starting points you are inclined to tweak your settings, in which case ZOOM requires you to start all over again. I usually tweak the settings to improve search results or to reduce indexing load and time. On large operations, redoing the index entirely is a pain and a great consumer of resources.
      I'm not sure this would help in your case, and I don't know if there is any risk to the integrity of your configuration files, but I have a good number of Zoom configuration files (.zcfg) that start with a common base set of rules and settings, then diverge greatly in terms of search paths and output paths as the base set is tweaked or extended for various purposes.

      I handle this by building my "base" or common configuration directly with the Zoom GUI and saving it as a .zcfg "base template". I then build a "benchmark template", again in Zoom, which incorporates the base set and extends it with as many of the features, paths and so on that I may need as possible.

      At that point I can maintain the .zcfg files directly with a text editor, which lets me easily compare files against the base or benchmark templates to see where the tweaks are, or where they need to be added. I can also use standard or regex search and replace to modify all my .zcfg files in one go.
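
      The comparison step can also be scripted. This is a minimal sketch using Python's standard `difflib`, assuming only that the .zcfg files can be read as plain text (as the text-editor workflow above implies). `config_changes` is a hypothetical helper name, and the setting lines in the usage example are made up for illustration, not real Zoom settings.

```python
import difflib
from typing import List


def config_changes(base_text: str, variant_text: str) -> List[str]:
    """List lines removed from ("- ") or added to ("+ ") the base template."""
    diff = difflib.ndiff(base_text.splitlines(), variant_text.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop them.
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    # Hypothetical contents; real files would be read with Path.read_text().
    base = "setting_a=1\nsetting_b=2\nsetting_c=3"
    variant = "setting_a=1\nsetting_b=5\nsetting_c=3\nsetting_d=4"
    for change in config_changes(base, variant):
        print(change)
```

      Running this over every configuration file against the base template gives a quick list of where each one has diverged.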


      • #4
        Good points in the two posts above, but just wanted to add:

        Originally posted by kpa
        Maximum file size - this is a real pain for me. In some cases I'm indexing (or trying to index) files, usually PDFs, way beyond ZOOM's default limit. I have played with various settings, but that's not the real issue. It looks like ZOOM downloads the file up to the maximum configured file size and, once that is reached, ditches it. Is that right? If so, this is a great waste of resources, and ultimately the document is not in the index in any way. It would work better if ZOOM indexed the file up to the maximum size, if that's possible. Alternatively, if the maximum size is reached (or preferably detected before download), it could index the first X pages of the document. This is actually a good strategy: if you're indexing a lot of vintage magazines, as I am, the contents are usually listed within the first 10 pages or so, so you pick up a "snapshot" of the magazine in your index rather than it being ditched and not indexed at all.
        As mentioned above, but worth repeating -- if the files you are indexing are your own, and they are a large collection of PDFs, then it would really make sense to index with Offline Mode instead of Spider Mode. It would be far more efficient, much faster, and avoid almost all the abovementioned grief.

        Second point -- assuming you really have to index with Spider Mode -- can you clarify how this is "a real pain"? It sounds like the issue you have is with the time it takes to download the full PDF file (which is apparently very large).

        We can't process the PDF file until we have the full file downloaded. The format is a proprietary Adobe binary format, and the external module used to handle this is an industry standard "xpdf" solution that is used by OpenOffice and other tools. As far as I know, there is no way to reliably process the content of a PDF file when it is downloaded partially.
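
        One reason a truncated download is hard to use: a PDF reader normally starts from the end of the file, where a `startxref` entry followed by `%%EOF` points to the cross-reference table that locates every object. The sketch below is only a heuristic check for that trailer and is not how Zoom or xpdf work internally; `startxref_offset` and `looks_complete` are hypothetical helper names.

```python
import re
from typing import Optional


def startxref_offset(data: bytes, tail: int = 1024) -> Optional[int]:
    """Return the last startxref offset found near the end of the file, if any."""
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data[-tail:])
    return int(matches[-1]) if matches else None


def looks_complete(data: bytes) -> bool:
    """Heuristic: PDF header at the start and a trailer pointer at the end."""
    return data.startswith(b"%PDF-") and startxref_offset(data) is not None
```

        A file cut off at a size limit will usually fail this check, because the trailer is the last thing to arrive.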

        Again, if you switch to using Offline Mode, then there would be zero download time.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine
