Indexer character limit - max word length

  • Indexer character limit - max word length

    Hi,

    I have searched the forums and it seems that there is a limit of about 35 characters per word in the Indexer. We have run into a few problems with this because people search for single-word terms longer than 34-35 characters and the search returns no results. One example is searching for long Java package names like:

    com.novell.bordermanager.proxy.migration.Addrlist

    Can we somehow change the Indexer limit? Or, if not, will it be changed in the near future? This would be a great help.

    Thanks,
    D

  • #2
    Yes, this is correct. There is a hard-coded limit of 35 characters per word, at least in V5. The value was somewhat arbitrary, but it was selected so as not to use too much RAM while still comfortably holding all English words.

    I don't know if it is an option on your site, but maybe you could turn off "Dots" as a join character on the "indexing options" tab. Using your example, this would result in your long word being broken up into 6 small words:
    com
    novell
    bordermanager
    proxy
    migration
    Addrlist
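
A minimal sketch of the effect described above, assuming a simple regex-based tokenizer. The `tokenize` function and its `dot_joins_words` flag are hypothetical names for illustration; they are not the Indexer's actual code or option names. Only the 35-character limit comes from this thread.

```python
import re

MAX_WORD_LEN = 35  # hard-coded V5 limit quoted in this thread


def tokenize(text, dot_joins_words):
    """Split text into words (hypothetical tokenizer, for illustration only).

    When dot_joins_words is True, '.' is kept inside a word, so a dotted
    package name stays one long word. When False, '.' separates words.
    """
    pattern = r"[A-Za-z0-9.]+" if dot_joins_words else r"[A-Za-z0-9]+"
    return re.findall(pattern, text)


term = "com.novell.bordermanager.proxy.migration.Addrlist"

joined = tokenize(term, dot_joins_words=True)   # one word, over the limit
split = tokenize(term, dot_joins_words=False)   # six words, all under it

for label, words in (("dots join", joined), ("dots split", split)):
    too_long = [w for w in words if len(w) > MAX_WORD_LEN]
    print(label, words, "over limit:", too_long)
```

With dots treated as separators, a visitor would also need to search for the parts (or the search would need to split the query the same way) for the pages to be found.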


    • #3
      but maybe you could turn off "Dots" as a join character on the "indexing options" tab
      This is a pretty good idea. We will try it out.

      Also, would it be feasible to request a feature allowing words longer than 35 letters to be indexed?

      Thanks,
      D


      • #4
        This change would require a lot of small changes in the scripts and indexer, plus some UI and config file changes. We'll have a look at how much work it is, and it may or may not get into V6.

        Needless to say, the longer the max word length, the more RAM is required during indexing and searching.


        • #5
          I've noticed that when searching for a word (not necessarily an actual word, but any string of characters) that is longer than the limit, no results are returned, even though the word exists on a page that was indexed. Perhaps the search could automatically chop the word at the character length limit and search for the chopped word instead of the full-length word. Some of the results may not be correct, but at least the correct results would show up as well.
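
A rough sketch of the chopping idea, under one loud assumption: that the index stores the first 35 characters of an over-long word rather than dropping it. The thread does not confirm how the Indexer treats such words, and `index_words` and `search` are hypothetical helpers, not part of the product.

```python
MAX_WORD_LEN = 35  # limit quoted earlier in this thread


def index_words(words):
    """Build a toy index. ASSUMPTION: long words are stored truncated."""
    return {w[:MAX_WORD_LEN].lower() for w in words}


def search(index, query):
    """Chop the query at the limit before looking it up."""
    return query[:MAX_WORD_LEN].lower() in index


index = index_words([
    "com.novell.bordermanager.proxy.migration.Addrlist",
    "proxy",
])

# The full-length query now matches because both sides are chopped alike.
hit = search(index, "com.novell.bordermanager.proxy.migration.Addrlist")

# A different word sharing the same first 35 characters also matches:
# this is the "some results may not be correct" case.
false_positive = search(index, "com.novell.bordermanager.proxy.migrate.Other")

print(hit, false_positive)
```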

          If you do plan to increase that limit, I can see how more RAM would be needed to do the search (I can speak for the CGI because I've seen the code, but not for the indexer). However, instead of using a two-dimensional array to hold all of the dictionary words, you could have two one-dimensional arrays. One array would act as a long string of characters holding back-to-back, null-separated words. The other would be a list of "word" pointers pointing into that long string. This way you can use the pointers to identify where the words are, and there is no wasted space. A word could then also be of any arbitrary size.
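
The two-array layout proposed above can be sketched as follows. Python has no raw pointers, so integer offsets into the byte string stand in for the "word" pointers; `pack_words` and `get_word` are illustrative names, not functions from the CGI.

```python
from array import array


def pack_words(words):
    """Return (blob, offsets).

    blob    : all words back to back, each followed by a null byte
    offsets : start position of each word in blob (the "pointer" array)
    """
    blob = bytearray()
    offsets = array("L")
    for word in words:
        offsets.append(len(blob))
        blob += word.encode("utf-8") + b"\x00"
    return bytes(blob), offsets


def get_word(blob, offsets, i):
    """Fetch word i by following its offset up to the next null byte."""
    start = offsets[i]
    end = blob.index(b"\x00", start)
    return blob[start:end].decode("utf-8")


words = ["com", "com.novell.bordermanager.proxy.migration.Addrlist", "proxy"]
blob, offsets = pack_words(words)

# No per-word length cap: the 49-character word is stored whole.
print([get_word(blob, offsets, i) for i in range(len(words))])
```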


          • #6
            The array of pointers uses an additional 4 bytes per word (8 bytes on a 64-bit machine). Plus there is the minor complication that the dictionary is an array of structures and not a plain two-dimensional array. But I agree that, overall, less RAM would be used.
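
To put rough numbers on this, here is a back-of-the-envelope comparison. It assumes each fixed slot is 35 characters plus one terminator byte and ignores any other fields in the dictionary structure, neither of which is confirmed in this thread; the function names are made up for the example.

```python
def fixed_width_bytes(words, max_len=35):
    """Fixed slots: every word costs max_len + 1 bytes (assumed terminator)."""
    return len(words) * (max_len + 1)


def packed_bytes(words, pointer_size=4):
    """Packed string plus one pointer per word."""
    text = sum(len(w.encode("utf-8")) + 1 for w in words)  # word + null byte
    return text + len(words) * pointer_size


words = ["com", "novell", "bordermanager", "proxy", "migration", "Addrlist"]

fixed = fixed_width_bytes(words)                  # 6 * 36 = 216
packed32 = packed_bytes(words, pointer_size=4)    # 56 + 24 = 80
packed64 = packed_bytes(words, pointer_size=8)    # 56 + 48 = 104

print(fixed, packed32, packed64)
```

Under these assumptions a word only saves space when its length plus terminator plus pointer is below 36 bytes, so the gain depends on the dictionary being dominated by short words.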

            There is also the downside that additional CPU time is required to build the array of pointers as the dictionary is read in, plus the additional dereferencing each time a word is accessed. It also doesn't help with the ASP, PHP and JavaScript code. So there is a RAM usage / code complexity / performance trade-off.

            There is no doubt, however, that this bit of the code could be slightly better in the CGI. The real optimisation would be to have the two arrays of structures already built in place on the disk in binary format and to execute a single disk read on each. Structure 1 is the packed dictionary and Structure 2 holds double pointers (pointer 1 is a reference to the word sorted alphabetically, pointer 2 is a reference to the original word, which is also required for the context output). We have considered this in the past and may still do it for the next major release (but this also doesn't work for ASP & PHP).
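
One possible reading of that two-structure layout, sketched with in-memory buffers standing in for the two files. The pairing of offsets (entry k holds the k-th word alphabetically and the k-th word in original order) is an interpretation of the description above, and `build`, `load` and `word_at` are hypothetical names.

```python
import io
from array import array


def word_at(blob, off):
    """Read the null-terminated word starting at byte offset off."""
    return blob[off:blob.index(b"\x00", off)].decode("utf-8")


def build(words):
    """Return (dictionary_bytes, table_bytes) ready to be written to disk."""
    blob = bytearray()
    original = []
    for w in words:
        original.append(len(blob))
        blob += w.encode("utf-8") + b"\x00"
    by_alpha = sorted(original, key=lambda off: word_at(blob, off).lower())
    table = array("I")
    for alpha_off, orig_off in zip(by_alpha, original):
        table.extend((alpha_off, orig_off))  # the "double pointer" pair
    # tobytes() uses native byte order, so this is not portable across CPUs.
    return bytes(blob), table.tobytes()


def load(dict_file, table_file):
    """One read per structure; nothing is rebuilt word by word."""
    blob = dict_file.read()
    table = array("I")
    table.frombytes(table_file.read())
    return blob, table


words = ["proxy", "com", "novell"]
dict_bytes, table_bytes = build(words)
blob, table = load(io.BytesIO(dict_bytes), io.BytesIO(table_bytes))

alphabetical = [word_at(blob, table[i]) for i in range(0, len(table), 2)]
in_original_order = [word_at(blob, table[i]) for i in range(1, len(table), 2)]
print(alphabetical, in_original_order)
```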
