French Site - Query Problems


  • French Site - Query Problems

    Greetings. I have a French site that I just finished. The accents (in the HTML) are all in ASCII. This is very good for viewing in browsers, but may be causing a problem with Zoom, particularly with querying.

    I have all of my pages in UTF-8. I did use the "Enable accent/diacritic/ligature insensitivity" setting. I did use the UTF-8 setting.

    What happens:

    #1, if I run a query actually using words with accent marks, it doesn't pull up the results in the index.

    #2, if I run a query without the accents, it pulls up the results (which have the accents) but doesn't highlight them.

    What am I doing wrong? I need this to work properly. I would like a person to be able to search for words with or without the accent marks and for it to pull up the right words.

    (I have V5 Pro)
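
For readers following along, the behaviour asked for above (a query matching with or without accent marks) can be sketched in a few lines of Python. This is only an illustration of the general idea of accent folding, not Zoom's actual implementation; `fold` is a hypothetical helper name.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase text and strip combining accent marks (hypothetical helper)."""
    # NFD splits "é" into "e" followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Both spellings of a query reduce to the same key.
print(fold("délibérèrent"))                          # delibererent
print(fold("delibererent") == fold("délibérèrent"))  # True
print(fold("ôtera même"))                            # otera meme
```

If both the indexed words and the query are folded the same way, either spelling finds the same entries. Note that this approach does not split ligatures such as œ, which have no Unicode decomposition and would need their own mapping.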

  • #2
    Forum Members

    I look forward to receiving help from Ray and/or David, but, if there are any forum members that are French speaking and/or have experience crawling, indexing and querying French sites, I would appreciate your contribution in this thread!



    • #3
      Can you give a couple of examples of the words you are searching for, and let us know which Zoom script option you are using (PHP, ASP, etc.)?

      If you are only using plain ASCII, then you probably don't need UTF-8 and might try using the Windows-1252 (English/Latin) character set. Having said that, UTF-8 should still work. I am wondering, however, if the UTF-8 encoding of the French accented characters is different from the Windows-1252 encoding, and if this is what is messing up the accent insensitivity option.
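
On the question of whether the two encodings differ for French accented characters: they do, and a short Python check shows it. This is a sketch for illustration only; it says nothing about how Zoom handles the bytes internally.

```python
word = "délibérèrent"

utf8_bytes = word.encode("utf-8")
cp1252_bytes = word.encode("cp1252")

print(utf8_bytes)    # b'd\xc3\xa9lib\xc3\xa9r\xc3\xa8rent'
print(cp1252_bytes)  # b'd\xe9lib\xe9r\xe8rent'

# Same text, different bytes: a byte-wise comparison fails.
print(utf8_bytes == cp1252_bytes)  # False

# Reading UTF-8 bytes as Windows-1252 gives mojibake, not the original word.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)  # dÃ©libÃ©rÃ¨rent
```

So if the index is built with one character set and the query arrives in the other, an accented word will not match byte for byte, while an unaccented (pure ASCII) word still will.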



      • #4
        PHP

        Originally posted by wrensoft View Post
        Can you give a couple of examples of the words you are searching for, and let us know which Zoom script option you are using (PHP, ASP, etc.)?

        If you are only using plain ASCII, then you probably don't need UTF-8 and might try using the Windows-1252 (English/Latin) character set. Having said that, UTF-8 should still work. I am wondering, however, if the UTF-8 encoding of the French accented characters is different from the Windows-1252 encoding, and if this is what is messing up the accent insensitivity option.
        Thanks for your response. Hopefully this can be resolved. I am using PHP. The index is of about 1200 pages. I wondered if I needed UTF-8, since I am using ASCII. It's probably not necessary, but I also tend to think it doesn't hurt.

        Here are a couple examples of searches:

        délibérèrent

        ôtera même

        I almost tried the Windows-1252 crawl/index option. I can still try that and then encode the search page in Windows-1252, unless you think it is not necessary.

        I look forward to further responses, ASAP.



        • #5
          I spent some time making up some example pages but, in the end, couldn't reproduce the main part of the problem you described.

          I made two example pages using ASCII characters (no multibyte and no character entities in the HTML). I then set the page character sets to UTF-8 and ISO-8859-1 for the two files.

          In Zoom I selected the PHP option with the UTF-8 character set and checked this was carried over to the search_template file. I also set the 'Enable accent/diacritic/ligature insensitivity' option.

          I then did searches for the words you mentioned, both with and without accents. I got the same set of results with and without accents. As expected.

          Here is a screen shot.




          However I think you are right about the highlighting of the search word not working with this combination of configuration settings, character sets and accented search words. So we need to have a look at this part of the problem to see if it can be fixed or improved on for the next patch release.
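
To illustrate why highlighting is the harder half of the problem: the match has to be found on the accent-folded text but wrapped around the original, accented text. A minimal Python sketch follows, assuming precomposed (NFC) input and simple substring matching. It is not Zoom's code; `fold_char` and `highlight` are hypothetical names.

```python
import unicodedata


def fold_char(ch: str) -> str:
    """Fold one character: strip accents, then lowercase (hypothetical helper)."""
    base = "".join(
        c for c in unicodedata.normalize("NFD", ch) if not unicodedata.combining(c)
    )
    return base.lower()


def highlight(text: str, query: str, open_tag: str = "<b>", close_tag: str = "</b>") -> str:
    """Wrap accent-insensitive matches of query in text with the given tags."""
    folded = []  # folded characters of text
    origin = []  # origin[i] = index in text that produced folded[i]
    for i, ch in enumerate(text):
        for f in fold_char(ch):
            folded.append(f)
            origin.append(i)
    folded_text = "".join(folded)
    needle = "".join(fold_char(c) for c in query)
    if not needle:
        return text

    out, pos, last = [], 0, 0
    while True:
        hit = folded_text.find(needle, pos)
        if hit == -1:
            break
        # Map the match in the folded text back to a span of the original text.
        start = origin[hit]
        end = origin[hit + len(needle) - 1] + 1
        out.append(text[last:start])
        out.append(open_tag + text[start:end] + close_tag)
        last = end
        pos = hit + len(needle)
    out.append(text[last:])
    return "".join(out)


print(highlight("Les ténèbres étaient à la surface", "tenebres"))
# Les <b>ténèbres</b> étaient à la surface
```

The index map is the key design choice: searching the folded copy alone finds the word, but without `origin` there is no way to know which accented characters to wrap.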



          • #6
            Originally posted by wrensoft View Post
            I spent some time making up some example pages
            Thank you for your efforts.

            Originally posted by wrensoft View Post
            ...but couldn't reproduce the main part of the problem you described in the end.

            I made two example pages using ASCII characters
            What method did you use and what software, out of curiosity?

            Originally posted by wrensoft View Post
            (no multibyte and no character entities in the HTML).
            Please explain more, exactly what you mean.

            Originally posted by wrensoft View Post
            I then set the page character sets to UTF-8 and ISO-8859-1 for the two files.

            In Zoom I selected the PHP option with the UTF-8 character set and checked this was carried over to the search_template file. I also set the 'Enable accent/diacritic/ligature insensitivity' option.

            I then did searches for the words you mentioned, both with and without accents. I got the same set of results with and without accents. As expected.
            I am curious how you actually got the results when searching with the accents. I wonder if it matters how someone enters the words into the search field. I am asking out of curiosity. If someone types into the field using a French keyboard layout, copies and pastes, or uses another method, would that affect things? I also wonder whether you were copying from a page that used the ASCII in the HTML, or from a text editor that had the words not in ASCII. Again, these are just things that occur to me and that I wonder about.

            Originally posted by wrensoft View Post
            ...However I think you are right about the highlighting of the search word not working with this combination of configuration settings, character sets and accented search words. So we need to have a look at this part of the problem to see if it can be fixed or improved on for the next patch release.
            This is good then, that you see this and can work on the issue. I also noticed that, when I click on the links from the results, the "jump to" worked but the highlighting did not.



            • #7
              The test files were made using a text editor.

              Please explain more, exactly what you mean.
              Multi-byte is when more than 1 byte is required to represent a character in the alphabet. ASCII is always single byte. UTF-8 is a mix of single byte and multi-byte: ASCII characters take 1 byte, while accented and other non-ASCII characters require 2, 3 or 4 bytes.
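
To make the byte counts concrete, here is a small Python sketch that measures a few sample characters. Note that é fits in a single byte in ISO-8859-1 or Windows-1252, but takes two bytes in UTF-8.

```python
samples = ["e", "é", "œ", "€", "😀"]

# Number of bytes each character occupies in UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, size in utf8_lengths.items():
    print(repr(ch), size)

# The same accented letter in a single-byte character set.
latin1_length = len("é".encode("iso-8859-1"))
print(latin1_length)  # 1
```

The loop prints 1 for `e`, 2 for `é` and `œ`, 3 for `€` and 4 for the emoji.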

              HTML character entities are special strings, defined in the WWW standards, that are used to represent special characters, including accented characters that are not available in some character sets.
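
As an illustration of character entities, the Python sketch below decodes entity-encoded HTML text into real accented characters and back again. The sample string is made up from the search words mentioned earlier in the thread; it is an assumption that the pages in question look like this.

```python
import html

# Pure ASCII source text that uses named and numeric character entities.
source = "d&eacute;lib&eacute;r&egrave;rent &#244;tera m&ecirc;me"
print(source.isascii())  # True

# Entities decoded into real accented characters.
decoded = html.unescape(source)
print(decoded)  # délibérèrent ôtera même

# Going the other way: replace non-ASCII characters with numeric entities.
reencoded = decoded.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(reencoded)  # d&#233;lib&#233;r&#232;rent &#244;tera m&#234;me
```

A browser shows the same word either way, but an indexer has to decode the entities before it can treat `&eacute;` and `é` as the same letter.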

              It should not matter if you cut and paste or type in the accented characters, provided of course that you aren't forcing a Unicode to single-byte conversion on a multibyte character. That should not be the case here, as the accented characters in question can each be represented by a single byte in Windows-1252 and ISO-8859-1.

              So we need more details & maybe copies of your HTML pages if we are going to reproduce the problem.
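
On typing versus cut and paste: one general case where the input method can change the underlying characters is Unicode normalization, where the same visible letter is either one precomposed code point or a base letter plus a combining mark. The Python sketch below shows the difference. Nothing in this thread indicates that this is the cause here, and whether Zoom normalizes its input is not stated, so treat it purely as background.

```python
import unicodedata

typed = "t\u00e9n\u00e8bres"      # precomposed é and è
pasted = "te\u0301ne\u0300bres"   # e + combining acute, e + combining grave

# The two strings look identical on screen but are not equal.
print(typed == pasted)            # False
print(len(typed), len(pasted))    # 8 10

# Normalizing to NFC makes them comparable.
normalized = unicodedata.normalize("NFC", pasted)
print(normalized == typed)        # True
```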



              • #8
                ISO-8859-15

                I found this information helpful (found at http://en.wikipedia.org/wiki/ISO_8859-1 ):

                ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1," consisting of 191 characters from the Latin script. Each character is encoded as a single eight-bit code value. These code values can be used in almost any data interchange system to communicate in the following European languages (with a few exceptions due to missing characters, as noted):

                ...# French (missing Œ, œ and rare Ÿ)

                * Note that Windows-1252 and ISO-8859-15 do contain these

                ...Relationship to ISO/IEC 8859-15

                Although ISO/IEC 8859-1 has enough characters for most French text, it is missing a few less-common letters. It is also missing a single-glyph representation for the letter IJ, two Finnish letters used for transcription of some foreign names and in a few loanwords (Š and Ž), typographic quotation marks and dashes, and common symbols such as the euro sign (€) and dagger (†).

                In order to provide some of these characters, ISO/IEC 8859-15 was developed as an update of ISO/IEC 8859-1. This required, however, the removal of some infrequently-used characters from ISO/IEC 8859-1, including fraction symbols and letter-free diacritics: ¤, ¦, ¨, ´, ¸, ¼, ½, and ¾.
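
The difference between the two character sets can be listed directly. This Python sketch decodes every byte value under both encodings and prints the positions where they disagree.

```python
differences = {}
for byte in range(256):
    raw = bytes([byte])
    latin1 = raw.decode("iso-8859-1")
    latin9 = raw.decode("iso-8859-15")
    if latin1 != latin9:
        differences[byte] = (latin1, latin9)

for byte, (old, new) in differences.items():
    print(f"0x{byte:02X}: {old} -> {new}")
# 0xA4: ¤ -> €
# 0xA6: ¦ -> Š
# 0xA8: ¨ -> š
# 0xB4: ´ -> Ž
# 0xB8: ¸ -> ž
# 0xBC: ¼ -> Œ
# 0xBD: ½ -> œ
# 0xBE: ¾ -> Ÿ
```

Only eight byte values differ, and none of them is an ordinary French accented vowel, so é, è, à, ô and ê have the same byte in both character sets.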



                • #9
                  More testing

                  I am going to do some more testing this weekend, including converting the ASCII characters to real French characters IN the code. (Don't worry! I'll do testing on a copy of the site.)



                  • #10
                    ténèbres étaient à la surface

                    Okay, I just changed all of the ASCII characters to actual accented French vowels. I also changed all of the encoding to ISO-8859-15 (I did that prior to changing all of the vowels). I reran the zoom crawler (locally). I uploaded the new files and ran a query with the following words:

                    ténèbres étaient à la surface

                    The search result page displayed:

                    Résultats de la recherche pour : ta©našbres a©taient a la surface dans toutes les categories

                    and in fact, the actual search field displays:

                    ténÚbres étaient à la surface

                    instead of:

                    ténèbres étaient à la surface

                    I had also changed the encoding on the search template to ISO-8859-15. So, I am not sure what to make of this.
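
One possible explanation for the garbled result line, offered as a guess rather than a diagnosis: if the browser submits the query as UTF-8 bytes and the script reads those bytes as ISO-8859-15, each accented letter turns into two characters. The Python sketch below reproduces that under exactly this assumption. Replacing "Ã" with "a" is a stand-in for whatever accent stripping is applied afterwards.

```python
query = "ténèbres étaient à la surface"

# Assumption: the form data arrives as UTF-8 bytes...
wire_bytes = query.encode("utf-8")

# ...but is interpreted as ISO-8859-15 on the server side.
misread = wire_bytes.decode("iso-8859-15")
print(misread)

# Stand-in for accent stripping: "Ã" loses its tilde and is lowercased.
stripped = misread.replace("Ã", "a")
print(stripped)
```

Under this assumption, "é" (bytes C3 A9) becomes "Ã©" and "è" (bytes C3 A8) becomes "Ãš", because byte A8 is "š" in ISO-8859-15. After the stand-in stripping step the text starts with "ta©našbres a©taient", which matches the result line quoted above. The "à" becomes "a" followed by a non-breaking space.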



                    • #11
                      ISO-8859-15 setting

                      By the way, it should go without saying, that when I crawled the files locally, with Zoom, that I used the ISO-8859-15 setting.



                      • #12
                        Vouliez-vous dire: tenebres autant au la surface?

                        I noticed that the suggested search isn't correct either:

                        Vouliez-vous dire: tenebres autant au la surface?

                        instead of:

                        Vouliez-vous dire: tenebres etaient a la surface?



                        • #13
                          UTF-8

                          Okay, I switched everything over to UTF-8 again and recrawled the files that had been converted from ASCII text to French accents. Reposted everything with the changes. Now we're back to the old/original problem. The queries are not pulling up results when I do a search with accented characters.
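
A way to narrow this down without sharing the site is to look at the query string in the browser's address bar after submitting a search, and compare the percent-encoded bytes with what each character set would produce. The Python sketch below shows the two forms for one of the example words. It only illustrates URL encoding in general and assumes nothing about Zoom's parameter handling.

```python
from urllib.parse import quote, unquote

query = "délibérèrent"

# How the word looks in a URL when the form is submitted as UTF-8...
as_utf8 = quote(query, encoding="utf-8")
print(as_utf8)    # d%C3%A9lib%C3%A9r%C3%A8rent

# ...and when it is submitted as Windows-1252.
as_cp1252 = quote(query, encoding="cp1252")
print(as_cp1252)  # d%E9lib%E9r%E8rent

# Decoding with the matching character set restores the word.
print(unquote(as_utf8, encoding="utf-8") == query)  # True

# Decoding with the wrong one does not: invalid bytes become U+FFFD.
wrong = unquote(as_cp1252, encoding="utf-8")
print("\ufffd" in wrong)  # True
```

If the URL shows `%E9` while the index was built as UTF-8 (or `%C3%A9` while it was built with a single-byte character set), the accented query and the indexed words will not agree.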



                          • #14
                            Can you put the HTML pages in question on a public web site where we can see the files, or put the entire search function on a public site and post the URL? E-mailing us your Zoom configuration file would also help us match your configuration.



                            • #15
                              I'd rather not

                              Originally posted by wrensoft View Post
                              Can you put the HTML pages in question on a public web site where we can see the files, or put the entire search function on a public site and post the URL? E-mailing us your Zoom configuration file would also help us match your configuration.
                              I know it is limiting, but I'd rather not (and there's a lot of people that feel the same way I do). So, let's continue to communicate through the forum. What other questions can you think of?
