

Indexing multiple sites into a single set of ZDAT files


  • #1

    Hi, I need a little help with indexing my site.
    I was wondering if there is any way to store all the indexing data in one set of files. That is, if I index www.site.us/ZZ and store the data in .zdat files, can I then index www.site.us/RR and put its index data into the same .zdat files as the data for /ZZ?
    Thank you.

  • #2
    If I understand you correctly, you are trying to index multiple web sites and have a single search function that searches across all of them.

    You can do this by adding additional web site start points in the indexer. Click on the "More" button in the main window to add new start points.

    Doing this means you'll end up with a single set of .zdat files that covers both sites. (You can index hundreds of web sites into a single index this way.)
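
    For example, using the two sections from your post, the start point list would simply contain both URLs:

      www.site.us/ZZ
      www.site.us/RR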

    -----
    David



    • #3
      Originally posted by Wrensoft
      You can do this by adding additional web site start points in the indexer.
      That would work as long as the scan, skip, exclusion and indexing options remain the same for all sites.

      For my application, it would be useful to merge a number of indexes into one. This would let me use different options for each index and re-index only the section that has changed.


      Fred



      • #4
        One additional note: an index merge function would also let me work around the memory, word and file limitations I anticipate with 600k files (most of which only need the title indexed, whereas others require the full page content).

        Fred



        • #5
          You can still have different skip options for each site by including the full path in the skip list.

          So for example, this line
          www.SiteA.com/file1.html

          would only skip file1 on SiteA, and not a file called file1 on SiteB.
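
          For instance, a combined skip list might look something like this (hypothetical paths, just for illustration):

            www.SiteA.com/file1.html
            www.SiteA.com/private/
            www.SiteB.com/drafts/

          Because each line starts with the site's domain, each entry only affects that one site.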

          A merge function is not easy to implement. The format of the index files is optimised for the options you select (to keep it as small as possible). As an example, if you index with the PHP option, then page pointers are 2-byte integers, but if you index with the CGI option, page pointers are 4-byte integers. If you index with the JavaScript option, then integers aren't used at all. Other options like date sorting, context results and categories can all affect the data stored.

          So if you are using different options to index different sites, the number and structure of the index files can vary, making a sensible and comprehensive merge very complex to implement.
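
          To make that concrete, here is a rough sketch in Python of how the same page pointer would be stored under two of those options (the byte widths match the description above, but everything else, including the endianness, is illustrative rather than the actual .zdat layout):

            import struct

            page_id = 1234  # a page number in the index

            # PHP option: page pointers are 2-byte unsigned integers (max 65,535 pages)
            php_pointer = struct.pack("<H", page_id)

            # CGI option: page pointers are 4-byte unsigned integers
            cgi_pointer = struct.pack("<I", page_id)

            print(len(php_pointer), len(cgi_pointer))  # prints: 2 4

          An index written one way can't simply be appended to one written the other way; a merge would have to decode and rewrite every pointer.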

          In your particular case, I would suggest that for the pages that need only their title indexed, you use the <!--ZOOMSTOP--> and <!--ZOOMRESTART--> tags to force the page content to be ignored.
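
          For example, a page marked up like this (a minimal sketch) would have its title indexed but its body text ignored:

            <html>
            <head><title>This title is still indexed</title></head>
            <body>
            <!--ZOOMSTOP-->
            This body text is skipped by the indexer.
            <!--ZOOMRESTART-->
            </body>
            </html>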

          However, 600,000 files is a lot. I don't know what percentage will only need the title indexed, but I still think you are going to run out of RAM during indexing. What type of hardware do you have available?

