Laborious indexing - incremental indexing

  • Laborious indexing - incremental indexing

    Hi there

    Would it be possible to make something that updates the index files when another file is uploaded to the user's website? The user would then not need to reindex the whole site.

    Thanx

  • #2
    We don't believe there is any good way to do incremental indexing.

    Scheduling Zoom to run during the night is a good option, however. There is a built-in scheduling function. This might not be as bad as you think, because Zoom uses caching in the same way Internet Explorer does, so old files that have not been updated will not be downloaded again (if they are cached).
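
    As a rough illustration of the kind of caching described above (a minimal sketch only, not Zoom's actual implementation; the cache structure and function name are made up), a crawler can send a conditional GET and reuse its cached copy whenever the server answers "304 Not Modified":

    ```python
    # Sketch: skip re-downloading pages that have not changed since the last crawl.
    import time
    import urllib.error
    import urllib.request
    from email.utils import formatdate

    cache = {}  # url -> (timestamp of last fetch, cached page body)

    def fetch(url):
        request = urllib.request.Request(url)
        if url in cache:
            fetched_at, _ = cache[url]
            # Ask the server to reply "304 Not Modified" if nothing has changed.
            request.add_header("If-Modified-Since", formatdate(fetched_at, usegmt=True))
        try:
            with urllib.request.urlopen(request) as response:
                body = response.read()
        except urllib.error.HTTPError as err:
            if err.code == 304 and url in cache:
                return cache[url][1]  # unchanged: reuse the cached copy
            raise
        cache[url] = (time.time(), body)
        return body
    ```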

    Long indexing sessions are only a problem for large sites, and large sites typically use dynamic pages generated by a script or some type of content management system.

    In this case I don't think any external search engine can do incremental indexing. If a page is generated by a script or CMS, then there is no page creation date (the page gets recreated each time it is downloaded). If there is no page or file creation date, how can a search engine know whether a page is new or old? It needs to download the page to find out, and if the page must then be indexed, that leads to a long indexing session.

    By contrast, plain HTML pages have a file creation date, so this is not an issue for them.

    Even if a page could be identified as not having changed, it still needs to be downloaded so that the links on it can be followed, to see whether the linked pages have changed.
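
    To make that last point concrete (a small illustrative sketch, not Zoom's own code): the links on a page only become visible once its HTML has been downloaded and parsed, for example:

    ```python
    # Sketch: links can only be discovered by downloading and parsing a page,
    # so even an apparently unchanged page must be fetched to find what it links to.
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def extract_links(html_text):
        collector = LinkCollector()
        collector.feed(html_text)
        return collector.links

    print(extract_links('<p><a href="news.html">News</a> <a href="faq.html">FAQ</a></p>'))
    # -> ['news.html', 'faq.html']
    ```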

    In a future version of the software, we hope to be able to offer some improvement, however. Maybe a way for the user to manually specify which pages are new and need to be re-indexed.

    -----
    David

    [Late Update: 5/Dec/06 - V5 of Zoom now contains some functions for incremental indexing. While it won't work for all sites, it will help out some people.]
    Last edited by David; Dec-05-2006, 05:23 AM. Reason: Added comment about V5 having some new functions in this area

    • #3
      I have to say that indexing my site is extremely quick, but I also noticed that it is many times quicker when indexing offline, as you would perhaps expect. If your indexing is slow, why not try it offline?

      • #4
        I don't try it offline because I index many sites that I do not own myself, and hence cannot index them offline.

        • #5
          Well... most CMS sites do have publication dates or the like, at least mine do. But yes, it would have to download the file to find that out anyway. On the other hand, the query used on a custom index page can be given a date range. But then I'm thinking: what if a file's content is changed?

          It would have to be reindexed anyway, and with so many pages to index, how will you know if it's changed?

          • #6
            I am looking for an alternative to the search functionality included in the bug tracking system used within our organisation. It provides an easy way to list the most recently updated cases. Ideally, we would like to frequently (every 30 minutes) re-index just the pages that have changed. In other words, append the latest cases to the index and replace any pages that already exist.

            As this is an internal tool, it is not accessible to any other search engines. Indexing the entire database (thousands of large pages) each time is not practical and would significantly impact server load. We also have client sites that use our own proprietary CMS, and incremental updating would be vital for these.

            Are any improvements likely to be made in the near future?

            Thanks.

            • #7
              We are working on some improvements for V5 of the software, but I don't want people to get their hopes up too high. Although it sounds easy to incrementally add some new pages, it isn't. It is technically complex.

              Some of the issues are:

              - The index files can have several different formats (e.g. the CGI version uses a 4-byte page number, the PHP version a 2-byte page number). So how do we stop people from creating an index with one configuration and then adding new data in a different format, resulting in a corrupted index?

              - Adding pages is only 1/3 of the problem. Deletes and edits also need to be handled.

              - Deletes and edits result in holes in the index files. Eventually, after a lot of deletes and edits, the index files will be mostly holes and will need to be re-compressed.

              - The index files can be very large for some huge sites (200,000+ pages). We can't assume the PC has enough RAM to hold the entire index at once, but holding everything in RAM would be the most efficient way to update it. Doing the editing on disk would be very slow.

              - What happens if a user creates a set of index files with V3 of the software and then tries to add new pages with V4? (The index formats are different.)

              - The whole problem of actually detecting which files are new (discussed above). On many sites it is not technically possible to know if a file is new (or updated / deleted) without downloading it. And if we download it, then we aren't saving any significant time in the end (see the rough sketch below).
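
              As a rough illustration of the change-detection problem (a generic sketch, not something Zoom actually does): even without reliable dates, a crawler can remember a hash of each page's content from the previous crawl and compare it against a freshly computed one. Note that this only tells you whether re-indexing can be skipped after the page has already been downloaded:

              ```python
              # Sketch: detect changed or deleted pages by comparing content hashes
              # against the previous crawl. The download itself is still unavoidable.
              import hashlib

              def content_hash(page_body: bytes) -> str:
                  return hashlib.sha1(page_body).hexdigest()

              def changes_since_last_crawl(previous_hashes, current_pages):
                  """previous_hashes: url -> hash stored from the last crawl.
                  current_pages: url -> freshly downloaded page body (bytes)."""
                  changed = [url for url, body in current_pages.items()
                             if previous_hashes.get(url) != content_hash(body)]
                  deleted = [url for url in previous_hashes if url not in current_pages]
                  return changed, deleted
              ```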

              Despite all this, we know it is an important feature and we will deliver something in this area for V5. Exactly what, I can't say at this point. It is still early days for V5 development.

              -----
              David
