Chinese Language Search


lfeuling
01-16-2008, 03:03 PM
I’m looking for some help in getting Zoom Search to work on a Chinese language website. The search works great on the English section of the website, but I haven’t had any success in getting it functional on the Chinese section. I’m guessing it’s an encoding issue, but I think I’ve tried all the possibilities (both UTF-8 and Unicode). I’m using the PHP 5.1 version of Zoom. Is there someone out there using PHP to search a Chinese website who can give me some pointers? If you would like to look at the website, it can be found at: http://www.kongandallan.com/cn_index.html
The Chinese language search boxes can be found in the bottom 3 menu button selections on the right-hand side of the main page (or directly at: http://www.kongandallan.com/cn_casestudies.html ). At present, the search is using the pure generic search template. Thanks!

Ray
01-17-2008, 12:33 AM
Are you using Offline Mode or Spider Mode?

When I tried to index your site using Spider Mode, I discovered that your server is responding in UTF-16 (double byte). This is pretty rare (most web servers use UTF-8 for Unicode, or various single-byte encodings). The current version of Zoom does not handle a UTF-16 response in Spider Mode, but we should be able to add this in the next build (V5.1 build 1012).

On the other hand, your webpages contain a meta tag specifying the GB2312 charset, so this was pretty confusing.

The behaviour may be different in Offline Mode, as I can't be certain how your files are stored on disk (they could be in UTF-16 on disk as well, or they could be GB2312 on disk and converted to UTF-16 by the web server).

In any case, if you can change your web server to respond in UTF-8, this would get around the problem. Otherwise, you can wait for the next build or e-mail us (http://www.wrensoft.com/contactus.html) for a test copy in the meantime.
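
For anyone wanting to check for this kind of mismatch themselves, here is a minimal Python sketch (not part of Zoom, and not how Zoom detects encodings). It compares a file's byte-order mark with the charset named in its meta tag. The function names are made up for illustration, UTF-32 is ignored, and the meta tag lookup is a crude pattern match rather than a real HTML parse.

```python
import codecs
import re

# Byte-order marks to look for. UTF-32 is left out to keep the sketch short,
# so a UTF-32 LE file would be misreported as UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Crude match for charset=... anywhere in the first few kilobytes.
_META_CHARSET = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def bom_encoding(data: bytes):
    """Return the encoding implied by a leading byte-order mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def meta_charset(data: bytes):
    """Return the charset declared in the markup (lowercased), or None."""
    # Dropping NUL bytes lets the ASCII pattern also match UTF-16 markup,
    # where every ASCII character is padded with a zero byte.
    head = data[:4096].replace(b"\x00", b"")
    match = _META_CHARSET.search(head)
    return match.group(1).decode("ascii").lower() if match else None
```

Calling both functions on the raw bytes of a page (for example, read with open(path, "rb").read()) and comparing the results would flag a page that is stored as UTF-16 but declares GB2312.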

lfeuling
01-17-2008, 02:54 PM
Thanks for the quick response! I'm using the off-line mode, but I've also tried the spider mode with the same result. The pages were originally all defined as UTF-8 (files saved as UTF-8, with matching meta tag definitions). When I wasn't able to get it working in that format, I changed the meta tag to GB2312 because my Chinese contact said I needed to make that change (this is my first experience with a multi-language site - so I have lots to learn). That change didn't have any effect. After more reading on the subject, I was led to think I needed to save all the files as UTF-16. Still no luck. I'm guessing the web server is responding in UTF-16 because that is how the files were formatted when I uploaded them. Should I set everything back to UTF-8 and try again? Or does something need to be changed on the web server?
I appreciate your help!

Ray
01-17-2008, 11:38 PM
UTF-16 is rarely used on the web. This is because it wastes a lot of space: you are essentially doubling the file size for all alphanumeric characters (which even Chinese pages will contain, due to the HTML markup).
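
As a rough illustration of the size difference, here is a small Python sketch. The sample strings and the function name are made up for the example. It counts the bytes needed for a piece of markup and for a short run of Chinese text in each encoding.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8 and in UTF-16 (without a BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


markup = '<p class="title">'  # 17 ASCII characters
chinese = "中文搜索"           # 4 Chinese characters

print(encoded_sizes(markup))   # {'utf-8': 17, 'utf-16': 34}
print(encoded_sizes(chinese))  # {'utf-8': 12, 'utf-16': 8}
```

The Chinese characters themselves are slightly smaller in UTF-16 (2 bytes each versus 3 in UTF-8), but the markup doubles, and in a typical HTML page there is a lot of markup.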

I think the behaviour should be different when you index with UTF-8 rather than UTF-16. Take note of these things:

- How many files are indexed
- Check that you have the same encoding selected in the Zoom Configuration window as the charset you are using for the webpages.
- Check that your search_template.html is in the same encoding
- Turn on "Support single case languages" and "Substring match for all searches" on the "Languages" tab

See this page (http://www.wrensoft.com/zoom/support/languages.html#asian) for more information.
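
The encoding checks in the list above can be partly automated outside of Zoom. The following is a minimal Python sketch under some assumptions: the function name is hypothetical, and "decodes without errors" is only a necessary condition, not proof (a pure ASCII file passes as UTF-8, for instance). It reports which files, search_template.html included, cannot be read in the encoding you expect.

```python
from pathlib import Path


def files_not_in_encoding(paths, encoding="utf-8"):
    """Return the paths whose bytes do not decode cleanly as `encoding`."""
    bad = []
    for path in paths:
        data = Path(path).read_bytes()
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            # The file contains byte sequences that are invalid in `encoding`.
            bad.append(str(path))
    return bad
```

For example, passing the list of Chinese pages plus search_template.html with encoding="utf-8" should return an empty list if everything was saved as UTF-8. A file saved as UTF-16 or GB2312 with Chinese text in it would normally show up in the result.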

lfeuling
01-22-2008, 03:34 PM
I reformatted all the Chinese pages in UTF-8 and now everything works perfectly! I'm guessing I missed setting the search_template.html to utf-8 my first time around. Thanks again for your help!