skip words - Large number of files and words.

  • #16
    What is the maximum limit you can set without triggering the "Not enough memory" error, and are you actually hitting this limit? How much memory do you have on your machine, and how much do you think Zoom is using?

    When physical memory is close to exhaustion, Windows can start behaving erratically. We don't want to encourage users to overstate their limits (people can get lazy and simply set the highest limits Zoom allows, without even roughly considering how much data they are indexing). When Zoom then chews up that much memory, Windows starts acting poorly, services crash or swap in and out indefinitely without completing any task, and the user blames Zoom for this behaviour.

    At the moment, Zoom will not proceed with indexing if it estimates that you need more than 135% of the total amount of RAM installed on your computer (note that this is the total installed RAM, not just the available RAM).
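    As a rough illustration only (this is not Zoom's actual code), the check amounts to comparing the indexer's memory estimate against 1.35 times the machine's total RAM:

```python
# Illustrative sketch of the sanity check described above: refuse to start
# indexing when the estimated memory requirement exceeds 135% of total RAM.
def can_start_indexing(estimated_bytes, total_ram_bytes, headroom=1.35):
    """Return True if the estimate fits within the allowed headroom."""
    return estimated_bytes <= headroom * total_ram_bytes

# Example: a 9 GB estimate on a 6 GB machine is rejected (9 > 1.35 * 6 = 8.1),
# while a 7 GB estimate would be allowed to proceed.
print(can_start_indexing(9 * 1024**3, 6 * 1024**3))  # False
print(can_start_indexing(7 * 1024**3, 6 * 1024**3))  # True
```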
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #17
      I got an error message and the indexer stopped and started finishing up after indexing around 720,000 pages with 11,600,000 unique words. (I also got a warning about a 32-bit addressing issue when configuring the indexing limits.)

      Error: out of memory, the 32-bit address space limitation of your system has been reached (page...

      I have 6 GB of total physical memory at the moment, and Task Manager said 2.56 GB was in use (1.6 GB of it by zoomindexer.exe) when the program stopped. The system is Windows Web Server 2008 x64, and I set the CGI/Win32 option for the indexer.

      It seems that the indexer did not recognize the 64-bit environment and therefore could not access the 6 GB of RAM.



      • #18
        11,600,000 unique words is a truly massive number. Are you sure you're not indexing some binary files (e.g. you added some unrecognized binary file extensions to be scanned), or that some file formats are being indexed incorrectly (e.g. if a web server serves a PDF file but specifies a text content-type, it would be indexed as text)?

        Consider that the English dictionary has only around 50,000 unique words, and we're not even looking at the total number of words here. What is the nature of the content you are indexing? Is it a database of product codes and serial numbers? Are there really over 11 million of those in your content?

        Are you indexing content that spans several languages besides English? If so, have you considered creating separate indexes/search functions for them? It is rare to need them to be indexed together.

        The current release of the Zoom Indexer is a 32-bit executable. You can run 32-bit executables on 64-bit platforms, but they will not use the 64-bit address space. As mentioned in the other thread, we have yet to come across someone who really needed the 64-bit address space. If you can provide some more information in response to the questions above, we will have a better idea of what your requirements are.
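        If you want to confirm this for yourself, here is a small diagnostic sketch (assuming Windows and Python, with the indexer's PID taken from Task Manager; this is not part of Zoom) that asks the Win32 IsWow64Process API whether a process is running as a 32-bit process on a 64-bit system:

```python
# Diagnostic sketch, assuming Windows: ask the Win32 IsWow64Process API
# whether a given process (e.g. zoomindexer.exe, using its PID from Task
# Manager) is running as a 32-bit (WOW64) process on 64-bit Windows.
import ctypes
from ctypes import wintypes

PROCESS_QUERY_INFORMATION = 0x0400
kernel32 = ctypes.windll.kernel32
kernel32.OpenProcess.restype = wintypes.HANDLE

def is_32bit_process_on_64bit_windows(pid):
    """Return True if the process with this PID runs under WOW64 (i.e. is 32-bit)."""
    handle = kernel32.OpenProcess(PROCESS_QUERY_INFORMATION, False, pid)
    if not handle:
        raise OSError("could not open process %d" % pid)
    try:
        flag = wintypes.BOOL(False)
        if not kernel32.IsWow64Process(handle, ctypes.byref(flag)):
            raise OSError("IsWow64Process failed")
        return bool(flag.value)
    finally:
        kernel32.CloseHandle(handle)
```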
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine



        • #19
          They are technical documents, all in English. The large number of unique words is due to words like XXXX_XXX_XXXXXX, XXX.XXX, XXX-XXXX, or words containing numbers or made up entirely of numbers. There are no binary files among them.

          The only solution for me, I think, is the 64-bit Zoom. Does it exist now? I really want to use it as soon as possible.



          • #20
            You can stop underscores, dots and hyphens from joining words (on the "Indexing Options" tab of the Configuration window). That way, a word like "ZOOM_TEST_WORD" would be broken up into three common English words in the index, without the need for a new unique word. A search for "ZOOM_TEST_WORD" will still yield a result (especially if you put quotes around it for an exact phrase, and select "match all words" on the search form). The only difference is that you could also get pages containing the individual words, especially if "match any word" is selected and you are not using exact phrase matching.
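            To make the effect concrete, here is a small sketch (illustrative only, not the indexer's actual tokenizer) of what happens to such a token once those characters no longer join words:

```python
# Illustrative only: the effect of no longer letting underscores, dots and
# hyphens join words. The real option lives on the "Indexing Options" tab;
# this just shows how such a token decomposes into ordinary words.
import re

def split_joined_word(token):
    """Break a token on the characters that would otherwise join words."""
    return [part for part in re.split(r"[_.\-]+", token) if part]

print(split_joined_word("ZOOM_TEST_WORD"))  # ['ZOOM', 'TEST', 'WORD']
print(split_joined_word("XXX.XXX-1234"))    # ['XXX', 'XXX', '1234']
```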

            I would recommend trying the above method first and seeing how the results suit you.
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine



            • #21
              I will give it a try and let you know the results when it is done.



              • #22
                It did not work. This time the number of unique words dropped to 7,100,000, but the indexer still stopped after indexing 720,000 pages.
                Can you send me a copy of the current 64bit zoom to try? Thanks.



                • #23
                  You are probably now hitting the 32-bit file system limits, not the RAM limits.

                  There is no 64-bit release at the moment. We will finish the V6 beta release, then look at 64-bit.

                  What are your actual requirements? How many files/pages do you have to index in total?

                  It might also be interesting if you could zip up your zoom_dictionary.zdat and put it on a server where we can download it, just to see why it is so large.

                  You could also look at MasterNode, as doing a distributed search can significantly increase capacity.



                  • #24
                    What are the "32-bit file system limits"? How do I diagnose that? I did not get an error message in the index log for that. I am running the indexer on a Windows Web Server 2008 x64 system, and the zoom_pagetext.zdat file is already 4 GB.

                    Can you tell me how to make the indexer access 3 GB of RAM? You mentioned before that there is a trick to do that. When the program stopped, it was only using 1.6 GB, and the total RAM in use was 2.5 GB. I also wonder if you are aware of any software that can set aside a certain amount of RAM and assign it to individual applications.

                    Now I have 2.5 million files to be indexed, and another 3 million files need to be added to the index very soon.

                    It is not possible for me to send you the dictionary file. But I can list all the types of unusual unique words here:
                    Type 1: 'd0\'bf\'d0....... and \'c3\'b41 (around 400,000)
                    Type 2: abddeggeccedeedddeffefecdegeedcbba (around 200,000)
                    Type 3: abee35tyui4rtt5umo (around 200,000)
                    Type 4: a long string of numbers (around 800,000)
                    Type 5: a string of unrecognizable characters (around 200,000)

                    The total number of unique words in the dictionary file is not 7,200,000 but around 4,500,000.
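                    A rough way to bucket dictionary words into types like the ones above (the patterns are guesses based on these examples and purely illustrative, not Zoom code):

```python
# Rough sketch (not Zoom code): bucket dictionary words into types like the
# ones listed above. The regular expressions are guesses based on the examples.
import re

PATTERNS = [
    ("escape-like sequences", re.compile(r"(\\'[0-9a-f]{2})+", re.IGNORECASE)),
    ("all digits",            re.compile(r"^[0-9]+$")),
    ("long letter strings",   re.compile(r"^[a-z]{20,}$", re.IGNORECASE)),
    ("letter/digit mixtures", re.compile(r"^(?=.*[0-9])[0-9a-z]{10,}$", re.IGNORECASE)),
]

def classify(word):
    """Return the first matching bucket name, or 'other'."""
    for name, pattern in PATTERNS:
        if pattern.search(word):
            return name
    return "other"

for w in [r"\'d0\'bf\'d0", "abddeggeccedeedddeffefecdegeedcbba",
          "abee35tyui4rtt5umo", "123456789012345"]:
    print(w, "->", classify(w))
```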



                    • #25
                      I think it is pointless to talk about RAM, as you are not running out of RAM (even with the 32-bit limits on RAM).

                      The 32-bit file limits are operating system limits that prevent files of 4GB or larger from being created (the limit is even smaller, 2GB, on some Linux systems). This really only affects old operating systems, but in order to have our software work on a wide range of systems (and have an efficient, compact set of index files) we enforce this lowest-common-denominator O/S limit. There is nothing you can do to change or remove this limit.
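                      In practice this means an index file such as zoom_pagetext.zdat cannot grow past the 4GB mark. A minimal sketch of that kind of check (the file name is taken from this thread; the helper itself is illustrative and not how Zoom manages its files internally):

```python
# Sketch of the kind of lowest-common-denominator limit described above:
# refuse to grow an index file past 4 GB. Illustrative only, not Zoom code.
import os

MAX_INDEX_FILE_BYTES = 4 * 1024**3 - 1  # largest size addressable with a 32-bit offset

def can_append(path, extra_bytes):
    """Return True if appending extra_bytes keeps the file under the 4 GB cap."""
    current = os.path.getsize(path) if os.path.exists(path) else 0
    return current + extra_bytes <= MAX_INDEX_FILE_BYTES

# e.g. a zoom_pagetext.zdat that has already reached ~4 GB cannot grow further:
# can_append("zoom_pagetext.zdat", 64 * 1024)  ->  False
```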

                      Now, we could have a different software package for every operating system (and this is effectively what we will do for 64-bit), which would remove the lowest-common-denominator effect. But as already pointed out, the index files risk becoming large and inefficient.

                      What you can do is continue to look at why 30% of your index is effectively rubbish data. We would love to help, but if you can't provide us with a copy of the data to look at, it is going to be an exercise in frustration for both of us.

                      Having said that, you are never going to get to 5.5M documents with 32-bit. Even with 64-bit it isn't certain (it depends on what we do with the internal data structures and how much we trade off wasted space in the index files against theoretical capacity).



                      • #26
                        Do you mean that Zoom cannot do the work for me?



                        • #27
                          If you are aiming to index over 5 million files, you really should be looking at a distributed solution like MasterNode, which allows you to accumulate multiple sets of index files created by Zoom. This was already mentioned above.

                          Also as mentioned above, you should be investigating why your index contains so much meaningless data. My guess is that your content is not purely HTML as you believe. You're likely indexing files that you don't even need to include in the index. The fact that you have things like "d0\'bf\'d0" would suggest RTF files, or various other file formats. These files may be indexed incorrectly because they are being served with the wrong content-type, have the wrong file extension, etc. There are many possible reasons, but we can't help you if you won't provide us with the necessary data.
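                          As a quick check (a sketch only, assuming Python on the machine doing the crawling; the function name is illustrative and not part of Zoom), you can compare the advertised Content-Type of a suspect URL with what the body actually starts with, since RTF files begin with the signature {\rtf:

```python
# Illustrative diagnostic: fetch a suspect URL and report whether the body
# starts with the RTF signature, regardless of the advertised Content-Type.
import urllib.request

def check_for_misserved_rtf(url):
    """Return (content_type, True if the body starts with the RTF signature)."""
    with urllib.request.urlopen(url) as response:
        content_type = response.headers.get("Content-Type", "")
        head = response.read(16)
    return content_type, head.lstrip().startswith(b"{\\rtf")

# A page reported as text/html whose body begins with "{\rtf1" is being
# served with the wrong content-type and will be indexed as garbage text.
```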
                          --Ray
                          Wrensoft Web Software
                          Sydney, Australia
                          Zoom Search Engine
