Inherent Indexing word rules for Bengali language make search unusable


  • Inherent Indexing word rules for Bengali language make search unusable

    I am rather disappointed with the way Zoom deals with
    Unicode when indexing pages containing Bengali, an
    Asian language. It makes Zoom search unusable.

    Words like this

    Code:
    গ্র
    are being broken up at & # 2509 ; (read without spaces)

    so that what is indexed is

    Code:
    গ ্ র
    The "Indexing word rules" offer no option
    to add a rule of my own or to correct this.

    Is there any quick fix or workaround?


    PS: I find that this character

    Code:
    ্
    is being allotted -1. I think this may be causing the problem. How can I prevent this? Incidentally, this is the same as & # 2509 ; (read without spaces).
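
For readers who want to check what this character is, it can be inspected with Python's standard `unicodedata` module. This is only an illustrative sketch and says nothing about Zoom's internals; the helper name `describe` is made up for the example.

```python
import unicodedata

# The conjunct from the post: GA + hasanta (virama) + RA
word = "\u0997\u09cd\u09b0"

def describe(text):
    """Return (code point, decimal value, general category, name) per character."""
    return [
        (f"U+{ord(ch):04X}", ord(ch), unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]

for row in describe(word):
    print(row)
```

The middle character is U+09CD (decimal 2509). Its general category is Mn (nonspacing mark) rather than a letter category, so a word splitter that accepts only letters would cut the word at that point.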


    For example, you can create a text or HTML file containing the following

    Code:
    গ্র
    and then enter the search term
    Code:
    গ্র
    Zoom search will NOT find it.

    However, it will find

    Code:
    Last edited by abbas; Mar-16-2008, 07:08 AM.

  • #2
    While it is true we have never done any testing with Bengali, we do support Unicode. But text files are assumed to be in single byte ANSI text (as there is no way to specify their character set). For HTML files you need to specify the character set in the HTML file.

    You mention Unicode, but you still didn't state if you are using UTF-8 (multi-byte) or UTF-16 (double byte) or something more obscure.
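
The distinction matters because the same Bengali word occupies different byte sequences in each encoding. A small Python sketch, offered only as an illustration: Latin-1 stands in here for "single byte ANSI text", which is an assumption made for the example and not necessarily the code page Zoom uses.

```python
# The five-code-point word from this thread (GA, hasanta, RA, vowel sign AA, MA)
word = "\u0997\u09cd\u09b0\u09be\u09ae"

utf8_bytes = word.encode("utf-8")       # 3 bytes per Bengali code point
utf16_bytes = word.encode("utf-16-le")  # 2 bytes per code point, no byte order mark

print(len(word), len(utf8_bytes), len(utf16_bytes))

# Misreading the UTF-8 bytes with a single-byte code page yields one bogus
# character per byte. Latin-1 is only a stand-in for "ANSI" here.
misread = utf8_bytes.decode("latin-1")
print(len(misread))
```

If the indexer and the page disagree about the encoding, the indexed "words" are made of those bogus characters, so no Bengali query can match them.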

    What version of Zoom are you using?

    What character set did you select in Zoom?

    What character set did you specify on your HTML page?

    Also by default 'words' of 1 character length are not indexed. You can change this from the "skip options" tab in Zoom.



    • #3
      But text files are assumed to be in single byte ANSI text (as there is no way to specify their character set). For HTML files you need to specify the character set in the HTML file.
      No, not a text file. It is an HTML or PHP file, with the proper character set defined.

      but you still didn't state if you are using UTF-8 (multi-byte) or UTF-16 (double byte) or something more obscure.
      UTF-8

      What version of Zoom are you using?
      Version 5.1

      What character set did you select in Zoom?
      Unicode UTF-8
      I also tried every combination of the International Searching options,
      from all off to all on.

      What character set did you specify on your HTML page?
      Code:
      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
      <html xmlns="http://www.w3.org/1999/xhtml" lang="en">
      <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
      and various others, adding or omitting each of the standard declarations.
      However, please note that other search engines (e.g. Google) are able to search Unicode text irrespective of what is declared in the header.
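
As a way to double-check which character set a page really declares, the `meta` declaration can be read back programmatically. The following is a rough sketch using Python's standard `html.parser`; the names `CharsetSniffer` and `declared_charset` are hypothetical, and this is not how Zoom itself detects the character set.

```python
import re
from html.parser import HTMLParser

class CharsetSniffer(HTMLParser):
    """Remember the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if attr.get("charset"):
            # HTML5 style: <meta charset="utf-8">
            self.charset = attr["charset"].strip().lower()
        elif (attr.get("http-equiv") or "").lower() == "content-type":
            # Older style: content="text/html; charset=UTF-8"
            match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)",
                              attr.get("content") or "", re.I)
            if match:
                self.charset = match.group(1).lower()

def declared_charset(html_text):
    sniffer = CharsetSniffer()
    sniffer.feed(html_text)
    return sniffer.charset

page_head = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en">\n'
    '<head>\n'
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n'
)
print(declared_charset(page_head))
```

Note that the declaration only states what the page claims; the bytes on disk still have to be saved in that encoding.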

      Also by default 'words' of 1 character length are not indexed. You can change this from the "skip options" tab in Zoom.
      No, I did not search for a one-character word; that was JUST an example to keep things simple. For actual searches I have always used words of more than 5 characters.

      I suggest you make an appropriate HTML file with the above characters (the very first ones in the first post) copy-pasted 3 or 4 times (to give length), then ask Zoom to index it, then search again by copy-pasting. You will find that it does NOT give any result.

      It is frustrating, because I think that character (see above) is being allotted -1,
      and this is causing one word to break into two.

      BETTER STILL, try a real life example

      ask zoom to index ( just for example ) http://banglarkrishi.gov.in/prani_dev.htm

      Now search the following word with zoom

      Code:
      গ্রাম
      Zoom does NOT find it

      Now copy and paste the above word into Google search. It finds it easily, and that
      page will probably appear somewhere between positions 4 and 6.

      Something is seriously wrong in the way Zoom is indexing these.
      Incidentally, I use the Drupal CMS (free and open source), which has no such problem in search, BUT not all pages on the site are in Drupal, and Drupal obviously will not search those.
      Please let me know, if possible, what you find on your end for that sample URL with Zoom and with Google.



      • #4
        As mentioned before, we have not tested with Bengali, and we are not at all familiar with the language so it is difficult for us to understand the nuances of the written text. It seems, for example, that some of the characters are diacritic marks which appear differently depending on the character that follows or precedes it.

        However, we have just spent some time looking into it, and believe we are able to improve the current behaviour for Bengali. We have added code to handle certain nonspacing characters, such as these diacritic/vowel marks, so that they no longer break up the word. In our tests, the new implementation seems to work much better, and to us it appears to be searching the content fine. We will likely include this change in a future release.
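
The general idea of keeping combining marks inside a word can be sketched in a few lines of Python. This is an illustration of the technique only, under the assumption that "nonspacing characters" corresponds roughly to the Unicode categories Mn and Mc; it is not Zoom's actual code, and the function names are invented for the example.

```python
import unicodedata

def is_word_char(ch, keep_marks):
    """Letters and digits always count; combining marks only if keep_marks."""
    category = unicodedata.category(ch)
    if category[0] in ("L", "N"):
        return True
    return keep_marks and category in ("Mn", "Mc")

def tokenize(text, keep_marks=True):
    """Split text into words at every character that is not a word character."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch, keep_marks):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

word = "\u0997\u09cd\u09b0\u09be\u09ae"    # the search term from post #3
print(tokenize(word, keep_marks=False))    # letters only: word falls apart
print(tokenize(word, keep_marks=True))     # marks kept: one word
```

With `keep_marks=False` the hasanta (U+09CD) and the vowel sign (U+09BE) act as separators, leaving three one-letter fragments, which would also fall under the default rule that skips one-character words. With `keep_marks=True` the term is indexed as a single word and an exact query can match it.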

        If you would like a preliminary build with this feature, contact us via e-mail and we can send you a copy for further testing.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine



        • #5
          Thanks, I will write to you via the above link for a test version when I get back home. When will this be included in a stable, downloadable version (like the free 50-page limited version), as many local users would like to test it?

          Did you do the above test with the improved version? What were the results?
          Huge thanks for your time and support.



          • #6
            Yes, we did test with the page given above (which is not actually encoded in UTF-8 but in windows-1252, with every Bengali character encoded as an HTML entity). We also tested with some sites that were actually encoded in UTF-8. The results look good to us, but we're not sure what the best settings for Bengali would be (e.g. it may be better to enable single-case support or substring matching, but we don't know what would match common searching habits).
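
To illustrate the point about entities: a page can be stored in windows-1252 and still display Bengali, because each character is written as an ASCII numeric reference that is only turned into a code point when the HTML is parsed. A minimal Python sketch using the standard `html` module (the entity string below was built for this example from the search term in post #3, not copied from the page in question):

```python
import html

# The search term written as decimal numeric character references
entity_text = "&#2455;&#2509;&#2480;&#2494;&#2478;"

# Pure ASCII, so it round-trips through windows-1252 without loss
raw_bytes = entity_text.encode("cp1252")

# Decoding the references yields the real Bengali code points
decoded = html.unescape(raw_bytes.decode("cp1252"))
print([f"U+{ord(ch):04X}" for ch in decoded])
```

An indexer therefore has to decode the entities before applying its word-splitting rules; otherwise the page and a UTF-8 query would never be compared on equal terms.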

            We've indexed a few pages and put up a search page here you can test with. Try searching for words in that page you mentioned above.
            --Ray



            • #7
              Thanks to your hard work, things seem to be working nicely.
              It would be great if the changes were incorporated into the limited version that anyone can download, so that some detailed testing can be done. Many users I know actually tested this and were disappointed, but did not bother to write here in the forum (though some issues were probably raised in Mozilla forums).

              It would be very beneficial for all if the updated version were available for everyone to test, as that could start a fresh round of purchases of the Professional edition by many potential users.

              Thanks again.



              • #8
                We intend to include this in a future build. But since we are unable to do much testing in the Bengali language, we would not be adding it immediately, and would prefer some time to iron out any problems (we do not want to be reissuing new builds on a daily basis).

                But if you have had a look at our example posted above, and think that looks pretty reasonable, we should be able to include it in the next build given no other problems arise. This would probably not be for a few weeks from now (this is why I offered a preliminary build).

                On which other forums are users posting about this topic?
                --Ray



                • #9
                  The new build (5.1 build 1014) with improved Bengali support is available for download:
                  http://www.wrensoft.com/zoom/whatsnew.html

                  Note that you should have "UTF-8" selected in Zoom (and your search template page created in UTF-8) for this to work.
                  --Ray
