problems with arabic diacritic marks


  • #16
    OK I see, you are asking for the 3 types of alif character to be treated as the same character. So when you search for one of them, it matches the other 2 versions of the character. Correct?

    Like we do for French accents, é and e for example.



    • #17
      Originally posted by wrensoft View Post
      OK I see, you are asking for the 3 types of alif character to be treated as the same character. So when you search for one of them, it matches the other 2 versions of the character. Correct?

      Like we do for French accents, é and e for example.
      Exactly. There is also a fourth type of alif character: آ (alif with madda).
      Last edited by mrbasserby; Nov-30-2012, 10:11 PM.
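
A minimal sketch of the folding discussed above, in the same spirit as treating é and e alike: the three marked alif forms are mapped onto plain alif before comparing. The helper name `foldAlif` is hypothetical and is not part of Zoom's scripts.

```javascript
// Hypothetical helper: fold the marked alif variants onto plain alif so
// that a search for any one of them matches the others.
const ALIF_VARIANTS = /[\u0622\u0623\u0625]/g; // alif with madda, hamza above, hamza below
const PLAIN_ALIF = "\u0627";

function foldAlif(text) {
  return text.replace(ALIF_VARIANTS, PLAIN_ALIF);
}

// The same word spelled with hamza-above alif and with plain alif.
const withHamza = "\u0623\u062D\u0645\u062F";
const withPlain = "\u0627\u062D\u0645\u062F";
console.log(foldAlif(withHamza) === foldAlif(withPlain)); // true
```

Folding both the indexed word and the query the same way keeps the comparison symmetric, so it does not matter which variant the user types.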



      • #18
        Hi, I wonder whether highlight.js could be made to highlight words with diacritics, as in this example:

        http://jsfiddle.net/FUg85/15/



        • #19
          We can probably add something like that into V7. However, we're not familiar with Arabic lettering so I'm not entirely sure how universal the above suggestion is. Did you write that bit of code yourself, or is it from someone else? Are you aware that it simply strips the following 5 characters:







          From the two strings being compared? Is that enough to fix all issues with diacritic marks in Arabic or are there other marks that are not addressed by this approach?
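
For reference, the approach described above (strip the marks from both strings, then compare) can be sketched as follows. Which five marks the linked script removes is not reproduced here, so the range U+064B to U+0652 (the tanween forms, the short vowels, shadda and sukun) is an assumption, and the function names are hypothetical.

```javascript
// Assumed set of marks: U+064B..U+0652 (fathatan through sukun).
const ARABIC_MARKS = /[\u064B-\u0652]/g;

function stripMarks(text) {
  return text.replace(ARABIC_MARKS, "");
}

// Compare two strings after removing the marks from both of them.
function sameIgnoringMarks(a, b) {
  return stripMarks(a) === stripMarks(b);
}

const vocalised = "\u0643\u064E\u062A\u064E\u0628\u064E"; // kaf, taa, baa, each with fatha
const bare = "\u0643\u062A\u0628";                         // the same letters without marks
console.log(sameIgnoringMarks(vocalised, bare)); // true
```

Marks outside the assumed range, such as the superscript alif (U+0670), would not be removed by this sketch.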
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine



          • #20
            I found the script example above on the net, but it could serve as a model for the JavaScript needed to highlight words with diacritics in the highlight script file. These are the standard Arabic characters:



            أ alif with above hamza
            ب baa
            ت taa
            ث close to thaa
            ج jaa
            ح haa or 7aa
            خ khaa
            د daa
            ذ thaa
            ر raa
            ز zaa
            س saa
            ش shaa
            ص close to saa
            ض close to daa
            ط close to taa
            ظ close to thaa
            ع ayin or close to aaa
            غ close to khaa
            ف faa
            ق close to kaa
            ك kaa
            ل laa
            م maa
            ن naa
            هـ haa
            و waa
            ي yaa



            These are the diacritics used with them. I attach each one to a tatweel ( ـ ) as a placeholder for an Arabic character:


            ( ـُ )
            damma

            ( ـَ )
            fatha

            ( ـِ )
            kasra

            ( ـٌ )
            tanween damma, or double damma

            ( ـً )
            tanween fatha, or double fatha

            ( ـٍ )
            tanween kasra, or double kasra

            ( ـْ )
            sukun

            ( ـّ )
            shadda

            ( ـَّ ) = -ّ + -َ
            fatha above shadda

            ( ـُّ ) = -ّ + -ُ
            damma above shadda

            ( ـِّ ) = -ّ + -ِ
            shadda above kasra

            ( ـٌّ ) = -ّ + -ٌ
            double damma above shadda

            ( ـٍّ ) = -ّ + -ٍ
            shadda above tanween kasra


            إ = ا + ء
            stand alone character
            hamza under alif


            أ = ا + ء
            stand alone character
            hamza above alif


            آ = ا + ~
            stand alone character
            madda above alif


            لأ = ل + أ
            stand alone character
            laa with hamza above alif

            لإ = ل + إ
            stand alone character
            laa with hamza under alif

            لآ = ل + آ
            stand alone character
            laa with madda above alif

            ( ؤ )= و + ء
            stand alone character
            hamza above waa

            ئ = ى + ء
            stand alone character
            hamza above short alif

            ( ى ) short alif stand alone character

            ( ء ) hamza on its own, considered a stand alone character
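
The compositions listed above can be cross-checked against Unicode's own canonical decomposition (NFD), which splits each precomposed hamza or madda letter into a base letter plus a combining mark. This is only a sketch; the helper names `decompose` and `baseLetters` are hypothetical.

```javascript
// List the code points a character decomposes into under NFD.
function decompose(ch) {
  return Array.from(ch.normalize("NFD"), (c) =>
    "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Drop the combining madda / hamza marks (U+0653 to U+0655) after NFD,
// leaving only the base letters.
function baseLetters(text) {
  return text.normalize("NFD").replace(/[\u0653-\u0655]/g, "");
}

// alif with madda, alif with hamza above, alif with hamza below,
// waw with hamza above, yaa with hamza above
for (const ch of ["\u0622", "\u0623", "\u0625", "\u0624", "\u0626"]) {
  console.log(ch, "=", decompose(ch).join(" + "));
}
```

One difference from the list above: Unicode decomposes ئ into yaa (ي, U+064A) plus hamza above, not into short alif (ى) plus hamza, so a folding table based on NFD alone would not group ئ with ى.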

            You can paste these into Notepad to see them more clearly. To hear how the characters sound, you can use an Arabic text-to-speech program such as this one:

            https://acapela-box.com/AcaBox/index.php

            If you need more information on how to type them with a keyboard, I'm glad to help with further details. The wiki page also has images and more information:
            http://en.wikipedia.org/wiki/Arabic_diacritics
            Last edited by mrbasserby; Dec-03-2012, 04:23 PM.



            • #21
              My suggestion now: why not make the search engine find words with or without diacritics by default, instead of giving the user an option to strip diacritics? If the text typed into the search box contains diacritics, the script would first strip them from the query and then look for matching words whether or not those words carry diacritics. As users, we want the search system to find both forms, and also to jump to and highlight both forms.

              For example, I want to find the character alif written as ا, أ, إ or even آ, or any alif with diacritics.

              I want the results to include all of these types, no matter which type of alif I type.
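
A rough sketch of that behaviour: the query and every candidate word are normalised the same way (diacritics stripped, alif variants folded) and then compared. The function names and the sample word list are hypothetical, and the set of marks stripped is an assumption.

```javascript
// Strip common diacritics (U+064B..U+0652 plus superscript alif U+0670)
// and fold the marked alif forms onto plain alif.
function normalizeArabic(text) {
  return text
    .replace(/[\u064B-\u0652\u0670]/g, "")
    .replace(/[\u0622\u0623\u0625]/g, "\u0627");
}

// Return the original words whose normalised form contains the normalised query.
function findMatches(query, words) {
  const key = normalizeArabic(query);
  return words.filter((word) => normalizeArabic(word).includes(key));
}

const words = [
  "\u0623\u064E\u062D\u0652\u0645\u064E\u062F", // hamza-above alif, with diacritics
  "\u0627\u062D\u0645\u062F",                   // plain alif, no diacritics
  "\u0645\u062D\u0645\u062F",                   // a different word
];
console.log(findMatches("\u0625\u062D\u0645\u062F", words).length); // 2
```

The matches are returned in their original spelling, so the result page can still display the text with its diacritics.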



              • #22
                In the search.php file it could look like this example:

                PHP Code:
                // Strip Arabic diacritic marks from the string before it is used for matching.
                function strip_Dia($string)
                {
                    return preg_replace('/َ|ِ|ً|ٍ|ُ|ٌ|ّ|ْ|ٰ/', '', strtolower($string));
                }

                // we use the method=GET and 'query' parameter now (for sub-result pages etc)
                $IsZoomQuery = 0;
                if (isset($_GET['zoom_query']))
                {
                    $inputStri = $_GET['zoom_query'];
                    // Remove the diacritics from the query before searching.
                    $outputStri = strip_Dia($inputStri);
                    $query = $outputStri;
                    $IsZoomQuery = 1;
                }
                else
                    $query = "";
                The same principle applies to other Unicode (UTF-8) languages.
                I still haven't found a solution for highlight.js to mark both types.
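
One possible direction for the highlighting side, offered only as a sketch and not as Zoom's actual highlight.js: build a regular expression from the stripped query that allows any run of diacritics after each letter, then wrap the matches found in the original text. The function names, the `<mark>` wrapper and the set of marks are assumptions.

```javascript
// Character class source for the assumed diacritics (U+064B..U+0652, U+0670).
const MARKS = "[\\u064B-\\u0652\\u0670]";

// Turn a query into a regex that tolerates diacritics after every letter.
function buildHighlightRegex(query) {
  const letters = Array.from(query.replace(new RegExp(MARKS, "g"), ""));
  if (letters.length === 0) return null; // nothing to highlight
  const body = letters
    .map((ch) => ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&") + MARKS + "*")
    .join("");
  return new RegExp(body, "g");
}

// Wrap every match in <mark>...</mark>, keeping the original diacritics.
function highlight(text, query) {
  const re = buildHighlightRegex(query);
  return re ? text.replace(re, "<mark>$&</mark>") : text;
}

const vocalisedText = "\u0643\u064E\u062A\u064E\u0628\u064E";
console.log(highlight(vocalisedText, "\u0643\u062A\u0628"));
```

Because the query is stripped before the pattern is built, the same call works whether the user typed the word with or without diacritics.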



                • #23
                  I also want to point out that this regular expression strips the Arabic diacritics listed above, plus the superscript alif:

                  PHP Code:
                  return preg_replace('/َ|ِ|ً|ٍ|ُ|ٌ|ّ|ْ|ٰ/', '', strtolower($string));
                  The only thing that remains is to match Arabic characters that are merged with hamza ( ء ).
                  In this case only the alif character comes in two types: long alif ( ا ) and short alif ( ى ).

                  For example, for the input:
                  أَبْتَغِى

                  the forms that should match would be:
                  أَبْتَغِى
                  or
                  (ا أ إ آ)بتغ(ا آ أ ئ ى)
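
The bracketed pattern above can be generated mechanically. The sketch below strips the diacritics and then replaces each alif-like letter with a character class; the two classes mirror the groups in the example and are an assumption about which letters should be interchangeable. The function name is hypothetical.

```javascript
const MARKS_RE = /[\u064B-\u0652\u0670]/g;                  // assumed diacritics
const ALIF_CLASS = "[\u0627\u0623\u0625\u0622]";            // first group in the example
const SHORT_ALIF_CLASS = "[\u0627\u0622\u0623\u0626\u0649]"; // second group in the example

// Build a regex source in which alif-like letters match any member of their group.
function toLoosePattern(word) {
  const stripped = word.replace(MARKS_RE, "");
  return Array.from(stripped, (ch) => {
    if ("\u0627\u0623\u0625\u0622".includes(ch)) return ALIF_CLASS;
    if ("\u0649\u0626".includes(ch)) return SHORT_ALIF_CLASS;
    return ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters
  }).join("");
}

// The input word from the example, written with escapes.
const input = "\u0623\u064E\u0628\u0652\u062A\u064E\u063A\u0650\u0649";
const loose = new RegExp("^" + toLoosePattern(input) + "$");
console.log(loose.test("\u0627\u0628\u062A\u063A\u0649")); // true
```

The pattern is meant to be tested against text that has already had its diacritics stripped; matching vocalised text would additionally need optional marks after each letter.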



                  • #24
                    Originally posted by mrbasserby View Post
                    My suggestion now: why not make the search engine find words with or without diacritics by default, instead of giving the user an option to strip diacritics?
                    We can't do that because it would very likely cause issues when the script is used to search sites in other languages. Some characters overlap between character sets, and even where the pattern is correct for Unicode, it can cause problems when the script is used on a site that isn't using Unicode: the stripping will match the wrong characters, or the middle of a multi-byte character.

                    Having said that, we already have an option to toggle "Strip Arabic diacritic marks", so we can change the script behaviour according to this. There might still be complications with different charsets used for Arabic websites, e.g. UTF-8 or windows-1256 or iso-8859-6.

                    So this is more involved to get it to work properly and for everybody.

                    We've added this to our V7 todo list.
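
As a rough illustration of gating the behaviour on that option, here is a minimal sketch. The settings object and its field names (`stripArabicDiacritics`, `charset`) are hypothetical and do not reflect Zoom's real configuration. JavaScript strings are already decoded, so the `charset` field only stands in for the page encoding that a server-side script would have to take into account.

```javascript
// Strip diacritics from the query only when the option is enabled and the
// site is declared as UTF-8; otherwise leave the query untouched.
function prepareQuery(query, settings) {
  const isUtf8 = settings.charset.toLowerCase() === "utf-8";
  if (!settings.stripArabicDiacritics || !isUtf8) {
    return query;
  }
  return query.replace(/[\u064B-\u0652\u0670]/g, ""); // assumed set of marks
}

const typed = "\u0643\u064E\u062A\u064E\u0628\u064E";
console.log(prepareQuery(typed, { stripArabicDiacritics: true, charset: "UTF-8" }).length); // 3
console.log(prepareQuery(typed, { stripArabicDiacritics: false, charset: "UTF-8" }).length); // 6
```

Sites in windows-1256 or iso-8859-6 would need their own handling, which this sketch deliberately leaves out.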
                    --Ray
