
Incorrect display of UTF-8 characters (Czech)


  • Incorrect display of UTF-8 characters (Czech)

    All pages on this Czech language site are encoded in utf-8. search.asp is #included in a custom search page called zoomsearch.asp, however the problem is the same with the default HTML template.

    I set language encoding in Zoom Indexer configuration to utf-8 and also saved search_template.html and zoomsearch.asp in utf-8. All multi-byte characters are displayed incorrectly in search results. Adding a preprocessing directive <%@ CODEPAGE=65001%> to the top of my ASP page does not help.

    When I save search_template.html and zoomsearch.asp back in ANSI encoding and specify <%@ CODEPAGE=1252%> as suggested on your international language support page, page titles in search results suddenly display correctly but context descriptions do not. The word in the search box and in the "Search results for:" line is also displayed incorrectly.

    Nothing displays correctly without the preprocessing directive.

    If I change the preprocessing directive to <%@ CODEPAGE=1250%>, everything displays correctly except page titles.

    I should add that I left "charset=utf-8" in the META tag of zoomsearch.asp despite it being saved in ANSI encoding. If I change this to "charset=windows-1250" all text is again displayed incorrectly.

    It seems that context descriptions are copied directly from the utf-8 coded pages while page titles are displayed from ANSI coded zoom_pagedata.zdat. I don't understand why the <%@ CODEPAGE=1250%> directive displays utf-8 correctly, but it does.

    An image of the search result page is attached.

    What am I missing?
    Jan

    Last edited by jansynek; Dec-30-2007, 09:26 PM.

  • #2
    What is the URL of one of the "HonorA!A?e" pages above? We can then do some testing here on our server.



    • #3
      Thanks for replying. We really like Zoom Search but this display problem, if not solved, would keep us from using it. It's driving me nuts.

      The URLs are
      http://fotomonitor.info/www/hon/index.asp
      http://fotomonitor.info/www/hon/af.asp

      Jan



      • #4
        We have a few people away on holidays this week. So it will be early next week (7th Jan) before we can test these pages and get back to you.

        I would note, however, that classic ASP/VBScript (the programming language) has known problems when dealing with things like case conversions of some foreign characters. If we can't get good results with Classic ASP, is using the CGI, PHP or ASP.NET version an option?



        • #5
          Switching to PHP or CGI is not an option for us and neither is ASP.NET at this point of site development.

          From my observations, the problem is almost certainly not with ASP but with the fact that two different representations of utf-8 characters are used. It can be seen even with settings where nothing in the search results displays correctly:

          E.g. the word "Honoráře" displays as "HonorÃ¡Å™e" in the page title line and as "HonorĂˇĹ™e" in the description with no preprocessing directive or with <%@ CODEPAGE=65001%>.

          When I set <%@ CODEPAGE=1252%>, the word displays correctly in the page title line and as "HonorA?L™e" in the description.

          With <%@ CODEPAGE=1250%> it's correct in the description and displays as "HonorA!A™e" in the page title line.

          Hope these observations help.
          Jan
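
The two "representations" described in the post above are what you would expect if the same UTF-8 bytes were decoded with two different single-byte code pages. The following is a minimal Python sketch of that effect, not Zoom's code; the variable names are made up for illustration, and the choice of cp1252 and cp1250 is an assumption based on the CODEPAGE values mentioned in this thread.

```python
# Illustration only: decode the same UTF-8 bytes with two different
# single-byte code pages and compare the results.
word = "Honoráře"

# The bytes as they would be stored in a UTF-8 encoded file.
utf8_bytes = word.encode("utf-8")

# Misread the bytes as Western European (Windows-1252).
as_cp1252 = utf8_bytes.decode("cp1252")

# Misread the same bytes as Central European (Windows-1250).
as_cp1250 = utf8_bytes.decode("cp1250")

print(utf8_bytes)
print(as_cp1252)
print(as_cp1250)

# Each misreading is reversible, but only with the code page that produced it.
print(as_cp1252.encode("cp1252") == utf8_bytes)
print(as_cp1250.encode("cp1250") == utf8_bytes)
```

Each two-byte UTF-8 character turns into two unrelated characters, and which two depends on the code page used for decoding, so one word can show up in two differently garbled forms on the same page.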



          • #6
            Hi Jan,

            Can I check with you the version and build of Zoom you are using? The latest is V5.1 build 1010. Make sure that you are using the very latest "search.asp" script too, and not the script from a previous version.

            Originally posted by jansynek
            It seems that context descriptions are copied directly from the utf-8 coded pages while page titles are displayed from ANSI coded zoom_pagedata.zdat.
            No, the context descriptions are formed from words in the zoom_dictionary.zdat file (which are UTF-8 encoded). But the zoom_pagedata.zdat file should also be UTF-8 encoded considering the settings you are using.

            I have tested indexing the above pages using the latest version, and the words are UTF-8 encoded in both the abovementioned files. My search page is also showing the words correctly in both the titles and the context descriptions without any CODEPAGE directive.

            The reason for this difference may be a server setting. My test server is running IIS 6.0 and it is set up with English regional settings. If your server is set up otherwise, it may be causing ASP to perform additional re-encoding automatically on the text that it reads from a file. This is why the CODEPAGE preprocessor directive is necessary, and why it should only be set to 1252 as per our instructions.

            I am, however, unsure as to why the context description appears incorrectly when you set your CODEPAGE to 1252. Have you modified "search.asp" in any other way, and/or the ZDAT files? Could you zip up the relevant files and e-mail them to us so we can take a closer look?
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine



            • #7
              Hi Ray,

              I just e-mailed you a complete set of my Zoom files. Search.asp has not been modified at all and neither have the ZDAT files. My Zoom Indexer version is the latest V5.1 build 1010, just downloaded a couple of weeks ago. Our web server is also running IIS 6.0 and ASP 3.0 but it is almost certainly set up with Czech regional settings. We do not have access to modify any server settings.

              I don't quite understand why ASP would perform additional re-encoding on one part of the text that it reads but not on another part. You can clearly see that the two-byte utf-8 characters are encoded differently in the title line and in the description, yet they both come from the same source.

              Jan



              • #8
                Originally posted by jansynek
                I don't quite understand why ASP would perform additional re-encoding on one part of the text that it reads but not on another part. You can clearly see that the two-byte utf-8 characters are encoded differently in the title line and in the description, yet they both come from the same source.
                They come from different sources actually (the title is stored in zoom_pagedata.zdat and the context description is generated from data stored in zoom_dictionary.zdat and zoom_pagetext.zdat).

                There are different ways to open and read from a file in ASP. The search script employs different methods for different files, based on how it needs to access that file (e.g., it may use the FileSystemObject to read an entire file into memory, but use the ADODB.Stream object to open a handle to a file which it needs to seek within and read only a portion of the file).

                The problem here is that one of these methods can sometimes re-encode the data incorrectly depending on the server regional settings. The data read from other files using different methods would not exhibit the same behaviour.

                We are looking into any possible changes that we could make to minimize or eliminate this dependency on the server (possibly by coming up with a different way to read in the file). We will update with more information as we have it.
                --Ray
                Wrensoft Web Software
                Sydney, Australia
                Zoom Search Engine
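
The behaviour described in the post above, where one read method re-encodes the data and another does not, can be modelled with a short Python sketch. This is a hypothetical model and not the actual search.asp logic: the function names are invented, and the assumption that titles are decoded as cp1252 and descriptions as cp1250 is inferred only from the symptoms reported earlier in this thread.

```python
# Hypothetical model of two read paths feeding one response.
# The code pages assigned to each path are assumptions for illustration.
RAW = "Honoráře".encode("utf-8")  # bytes as stored in the UTF-8 index files


def read_title(raw: bytes) -> str:
    # Assumed: this read path decodes with Western European cp1252.
    return raw.decode("cp1252")


def read_description(raw: bytes) -> str:
    # Assumed: this read path decodes with the server's Czech default, cp1250.
    return raw.decode("cp1250")


def write_response(text: str, codepage: str) -> bytes:
    # Stand-in for the effect of a CODEPAGE directive: encode on output.
    # Characters missing from the code page are replaced with "?".
    return text.encode(codepage, errors="replace")


def displays_correctly(text: str, codepage: str) -> bool:
    # The browser decodes as UTF-8 (charset=utf-8 in the META tag), so the
    # text is shown correctly only if the original bytes come back out.
    return write_response(text, codepage) == RAW


# For each output code page: (title correct?, description correct?)
results = {
    cp: (
        displays_correctly(read_title(RAW), cp),
        displays_correctly(read_description(RAW), cp),
    )
    for cp in ("utf-8", "cp1252", "cp1250")
}
print(results)

# Decoding explicitly as UTF-8 removes the dependency on any code page.
fixed = displays_correctly(RAW.decode("utf-8"), "utf-8")
print(fixed)
```

Under these assumptions, each output code page only undoes the misreading made with that same code page, which would be consistent with titles looking right under 1252 and descriptions under 1250, and with neither looking right under 65001. Python substitutes "?" for characters it cannot encode, whereas Windows may pick a similar-looking letter instead, so the exact garbled output will differ from what a real server produces.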



                • #9
                  Hi Raymond,

                  I received a modified search.asp from you this morning, which solved the problem. Good work and excellent support.

                  Thanks,
                  Jan
