French Site - Query Problems



JCF1976
01-01-2007, 07:33 AM
Greetings. I have a French site that I just finished. The accents (in the HTML) are all in ASCII. This is very good for viewing in browsers, but may be causing a problem with Zoom, particularly with querying.

I have all of my pages in UTF-8. I did use the "Enable accent/diacritic/ligature insensitivity" setting. I did use the UTF-8 setting.

What happens:

#1, if I run a query actually using words with accent marks, it doesn't pull up the results in the index.

#2, if I run a query without the accents, it pulls up the results (which have the accents) but doesn't highlight them.

What am I doing wrong? I need this to work properly. I would like a person to be able to search for words with or without the accent marks and for it to pull up the right words.

(I have V5 Pro)

JCF1976
01-01-2007, 07:37 AM
I look forward to receiving help from Ray and/or David, but, if there are any forum members that are French speaking and/or have experience crawling, indexing and querying French sites, I would appreciate your contribution in this thread!

wrensoft
01-01-2007, 09:49 PM
Can you give a couple of examples of the words you are searching for, and let us know which Zoom script option you are using (PHP, ASP, etc.)?

If you are only using plain ASCII, then you probably don't need UTF-8 and might try using the Windows-1252 (English/Latin) character set. Having said that, UTF-8 should still work. I am wondering, however, if the UTF-8 encoding of the French accented characters is different from the Windows-1252 encoding and if this is what is messing up the accent insensitivity option.

JCF1976
01-01-2007, 10:51 PM
Thanks for your response. Hopefully this can be resolved. I am using PHP. The index is of about 1200 pages. I wondered if I needed UTF-8, since I am using ASCII. It's probably not necessary, but I also tend to think it doesn't hurt.

Here are a couple examples of searches:

délibérèrent

ôtera même

I almost tried the Windows-1252 crawl/index option. I can still try that and then encode the search page in Windows-1252, unless you think it is not necessary.

I look forward to further responses, ASAP.

wrensoft
01-02-2007, 02:27 AM
I spent some time making up some example pages but couldn't reproduce the main part of the problem you described in the end.

I made two example pages. Using ASCII characters (no multibyte and no character entities in the HTML). I then set the page character sets to UTF-8 and ISO-8859-1 for the two files.

In Zoom I selected the PHP option with the UTF-8 character set and checked this was carried over to the search_template file. I also set the 'Enable accent/diacritic/ligature insensitivity' option.

I then did searches for the words you mentioned, both with and without accents. I got the same set of results with and without accents. As expected.
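For readers unfamiliar with the idea, accent-insensitive matching can be sketched in a few lines of Python. This is only an illustration of the general technique (fold both the indexed word and the query to an unaccented, lowercase form before comparing). It is not Zoom's implementation, and the names `fold`, `search` and `index_words` are made up for the example.

```python
import unicodedata


def fold(text: str) -> str:
    """Return text lowercased with accents/diacritics removed."""
    # NFD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# Hypothetical index containing the words mentioned in this thread.
index_words = ["délibérèrent", "ôtera", "même"]


def search(query: str) -> list:
    """Return indexed words equal to the query once both are folded."""
    key = fold(query)
    return [word for word in index_words if fold(word) == key]


print(search("délibérèrent"))
print(search("delibererent"))
```

Both calls return the same accented word from the index, which is the behaviour described above. Note that ligatures such as œ are not decomposed by NFD and would need separate handling.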

Here is a screen shot.

http://www.wrensoft.com/images/forumimages/frenchaccent.gif


However I think you are right about the highlighting of the search word not working with this combination of configuration settings, character sets and accented search words. So we need to have a look at this part of the problem to see if it can be fixed or improved on for the next patch release.
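As a rough illustration of why highlighting is harder than matching when accent insensitivity is on: the match is found in the folded (unaccented) text, but the tags have to be placed in the original text. The Python sketch below keeps a position map between the two. It is a hypothetical example, not Zoom's highlighting code, and `fold_char` and `highlight` are names invented here.

```python
import unicodedata


def fold_char(ch: str) -> str:
    """Lowercase one character and strip its combining accents."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def highlight(text: str, query: str, open_tag: str = "<b>", close_tag: str = "</b>") -> str:
    """Wrap accent-insensitive matches of query in text with tags."""
    folded_chars, origin = [], []  # origin[i]: index in text that produced folded_chars[i]
    for i, ch in enumerate(text):
        for f in fold_char(ch):
            folded_chars.append(f)
            origin.append(i)
    folded_text = "".join(folded_chars)
    folded_query = "".join(fold_char(c) for c in query)
    if not folded_query:
        return text

    out, pos = [], 0  # pos: how much of the original text has been emitted
    start = folded_text.find(folded_query)
    while start != -1:
        end = start + len(folded_query)
        o_start, o_end = origin[start], origin[end - 1] + 1
        if o_start >= pos:  # skip a match that overlaps the previous one
            out.append(text[pos:o_start])
            out.append(open_tag + text[o_start:o_end] + close_tag)
            pos = o_end
        start = folded_text.find(folded_query, end)
    out.append(text[pos:])
    return "".join(out)


print(highlight("Les ténèbres étaient à la surface", "tenebres"))
```

The sketch assumes the page text uses precomposed characters (NFC); text stored with separate combining marks would need normalizing first.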

JCF1976
01-02-2007, 04:01 AM
I spent some time making up some example pages

Thank you for your efforts.


...but couldn't reproduce the main part of the problem you described in the end.

I made two example pages. Using ASCII characters

What method and what software did you use, out of curiosity?


(no multibyte and no character entities in the HTML).

Please explain more, exactly what you mean.


I then set the page character sets to UTF-8 and ISO-8859-1 for the two files.

In Zoom I selected the PHP option with the UTF-8 character set and checked this was carried over to the search_template file. I also set the 'Enable accent/diacritic/ligature insensitivity' option.

I then did searches for the words you mentioned, both with and without accents. I got the same set of results with and without accents. As expected.

I am curious how you actually got the results when searching with the accents. Does it matter how someone enters the words into the search field? I am asking out of curiosity. If someone types into the field using a French keyboard layout, or copies and pastes, or uses another method, would that affect things? I also wonder whether you were copying from a page that used ASCII in the HTML, or from a text editor where the words were not in ASCII. Again, these are just things that occur to me.


...However I think you are right about the highlighting of the search word not working with this combination of configuration settings, character sets and accented search words. So we need to have a look at this part of the problem to see if it can be fixed or improved on for the next patch release.

It is good, then, that you see this and can work on the issue. I also noticed that when I click on the links from the results, the "jump to" worked but the highlighting did not.

wrensoft
01-02-2007, 10:48 PM
The test files were made using a text editor.


Please explain more, exactly what you mean.

Multi-byte is when more than 1 byte is required to represent a character in the alphabet. ASCII is always single byte. UTF-8 is a mix of single byte and multi-byte (http://en.wikipedia.org/wiki/Unicode): plain ASCII characters take 1 byte, while accented and other non-ASCII characters take 2, 3 or 4 bytes. In single-byte character sets such as ISO-8859-1 and Windows-1252, the French accented characters take 1 byte each.
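To make the byte counts concrete, here is a small Python sketch (purely illustrative, nothing to do with Zoom itself) that prints how many bytes a few characters need in UTF-8 and in ISO-8859-1. The helper name `byte_length` is made up for this example.

```python
def byte_length(ch: str, codec: str) -> int:
    """Number of bytes needed to store ch in the given character set."""
    return len(ch.encode(codec))


for ch in ["e", "é", "è", "ô"]:
    print(ch, "UTF-8:", byte_length(ch, "utf-8"), "ISO-8859-1:", byte_length(ch, "latin-1"))

# Characters outside Latin-1 need more bytes in UTF-8.
print("€", "UTF-8:", byte_length("€", "utf-8"))
```

The same accented letter therefore has a different byte representation depending on the character set, which is why the page encoding, the index encoding and the search page encoding all need to agree.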

HTML character entities (http://www.w3.org/TR/html401/sgml/entities.html) are special strings, defined in the WWW standards, that are used to represent special characters, including accented characters that may not be available in the page's character set.

It should not matter if you cut and paste or type in the accented characters, provided of course that you aren't forcing a Unicode to single-byte conversion on a multibyte character. That should not be the case here, as the accented characters in question are represented by a single byte in the single-byte Latin character sets.
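As a quick illustration of what character entities look like in practice, the Python standard library can convert between them and real characters. This sketch is only an example of the concept; the sample strings are made up from words in this thread.

```python
import html

# Named and numeric entities both decode to the same accented characters.
named = "t&eacute;n&egrave;bres"
numeric = "t&#233;n&#232;bres"
print(html.unescape(named))
print(html.unescape(numeric))

# Going the other way: write non-ASCII characters as numeric entities,
# so that the HTML source contains only plain ASCII bytes.
ascii_only = "ténèbres".encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_only)
```

A page written this way is pure ASCII at the byte level, even though the browser displays accented letters.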

So we need more details & maybe copies of your HTML pages if we are going to reproduce the problem.

JCF1976
01-06-2007, 08:54 AM
I found this information helpful (found at http://en.wikipedia.org/wiki/ISO_8859-1 ):

ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1," consisting of 191 characters from the Latin script. Each character is encoded as a single eight-bit code value. These code values can be used in almost any data interchange system to communicate in the following European languages (with a few exceptions due to missing characters, as noted):

...# French (missing Œ, œ and rare Ÿ)

* Note that Windows-1252 and ISO-8859-15 do contain these

...Relationship to ISO/IEC 8859-15

Although ISO/IEC 8859-1 has enough characters for most French text, it is missing a few less-common letters. It is also missing a single-glyph representation for the letter IJ, two Finnish letters used for transcription of some foreign names and in a few loanwords (Š and Ž), typographic quotation marks and dashes, and common symbols such as the euro sign (€) and dagger (†).

In order to provide some of these characters, ISO/IEC 8859-15 was developed as an update of ISO/IEC 8859-1. This required, however, the removal of some infrequently-used characters from ISO/IEC 8859-1, including fraction symbols and letter-free diacritics: ¤, ¦, ¨, ´, ¸, ¼, ½, and ¾.
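The differences quoted above can be checked with Python's built-in codecs. This is just a small sketch for illustration; the helper name `can_encode` is made up.

```python
def can_encode(text: str, codec: str) -> bool:
    """True if every character of text exists in the given character set."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


codecs = ["iso8859_1", "iso8859_15", "cp1252"]  # cp1252 is Windows-1252
for ch in ["é", "œ", "€", "½"]:
    support = {codec: can_encode(ch, codec) for codec in codecs}
    print(ch, support)
```

The common French vowels such as é are available in all three sets, œ and € are missing from ISO-8859-1 only, and ½ is one of the characters that ISO-8859-15 dropped.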

JCF1976
01-06-2007, 08:57 AM
I am going to do some more testing this weekend, including converting the ASCII characters to real French characters IN the code. (Don't worry! I'll do testing on a copy of the site. :-)

JCF1976
01-06-2007, 11:15 AM
Okay, I just changed all of the ASCII characters to actual accented French vowels. I also changed all of the encoding to ISO-8859-15 (I did that prior to changing all of the vowels). I reran the zoom crawler (locally). I uploaded the new files and ran a query with the following words:

ténèbres étaient à la surface

The search result page displayed:

Résultats de la recherche pour : ta©našbres a©taient a la surface dans toutes les categories

and in fact, the actual search field displays:

ténÚbres étaient à la surface

instead of:

ténèbres étaient à la surface

I had changed the encoding on the search template to ISO-8859-15 too. So, I am not sure what to make of this.
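One possible explanation, offered only as a guess: the "©" and "š" in the results line are what appears when UTF-8 bytes are read as ISO-8859-15. The Python sketch below reproduces that pattern. It assumes (this is not confirmed anywhere in this thread) that the query reached the script as UTF-8 while something in the chain treated it as ISO-8859-15.

```python
original = "ténèbres"

# Assumption: the query is sent as UTF-8 bytes...
utf8_bytes = original.encode("utf-8")

# ...but is interpreted as ISO-8859-15, one character per byte.
garbled = utf8_bytes.decode("iso8859_15")
print(garbled)  # each accented letter becomes "Ã" plus a second character

# If this is what happened, the damage is reversible.
repaired = garbled.encode("iso8859_15").decode("utf-8")
print(repaired)
```

In the garbled form, é turns into "Ã©" and è turns into "Ãš". With lowercasing and accent removal applied to the "Ã", that would look like "a©" and "aš", the same shape as in the results line above.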

JCF1976
01-06-2007, 11:21 AM
By the way, it should go without saying, that when I crawled the files locally, with Zoom, that I used the ISO-8859-15 setting.

JCF1976
01-06-2007, 11:31 AM
I noticed that the suggested search isn't correct either:

Vouliez-vous dire: tenebres autant au la surface?

instead of:

Vouliez-vous dire: tenebres etaient a la surface?

JCF1976
01-06-2007, 01:07 PM
Okay, I switched everything over to UTF-8 again and recrawled the files that had been converted from ASCII text to French accents. Reposted everything with the changes. Now we're back to the old/original problem. The queries are not pulling up results when I do a search with accented characters.

wrensoft
01-07-2007, 04:24 AM
Can you put the HTML pages in question on a public web site where we can see the files, or put the entire search function on a public site and post the URL? E-mailing us your Zoom configuration file would also help us match your configuration.

JCF1976
01-07-2007, 09:54 PM
I know it is limiting, but I'd rather not (and there are a lot of people who feel the same way I do). So, let's continue to communicate through the forum. What other questions can you think of?

wrensoft
01-07-2007, 10:40 PM
I have uploaded the set of working example files I made to our server. So instead of us trying to reproduce the problem with your files (which we don't have), you can try and provoke the problem by editing our files or work out what is different by comparing your files to our files.

You can download the set of files here,
http://www.wrensoft.com/test/french/accenttest.zip

and see it working here,
http://www.wrensoft.com/test/french/search.php?zoom_query=m%C3%AAme
http://www.wrensoft.com/test/french/search.php?zoom_query=meme

This set of index files was generated with the UTF-8 option selected in Zoom 5 on a Windows XP machine. I tested the search behaviour on Windows/PHP and Unix/PHP and it was the same.
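For anyone wondering about the `m%C3%AAme` in the first URL: it is the word "même" with the ê percent-encoded as its two UTF-8 bytes. A short Python sketch (illustration only) shows how the same word is encoded under UTF-8 and under a single-byte character set:

```python
from urllib.parse import quote, unquote

word = "même"

# Percent-encoding depends on the character set used for the bytes.
print(quote(word, encoding="utf-8"))    # ê becomes two bytes
print(quote(word, encoding="latin-1"))  # ê becomes one byte

# Decoding must use the same character set as encoding.
print(unquote("m%C3%AAme", encoding="utf-8"))
```

If the search form submits the query in one character set and the script decodes it in another, the accented query will not match what is in the index, while the unaccented query (pure ASCII) is unaffected.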

JCF1976
01-08-2007, 12:13 PM
Thanks! I'll download it and test it this evening!

m00di
01-30-2007, 03:21 AM
Hi

Did you find a solution for this? I am having the same issue.

Thanks

JCF1976
01-30-2007, 05:25 PM
m00di, I have been meaning to take the time to test this and get back to David. I have been swamped with other project work. This is still very important to me. Please contribute to this thread with your own findings and maybe David will be able to offer a fix for this.

JCF1976
03-18-2007, 07:22 AM
David, I did further work on this problem. I made another copy of the whole site and did a find and replace function to replace all of the ASCII characters with normal characters. I did download the most recent version of your software to use in crawling the site again.

I am restricted by doing offline searches. I don't know if that makes a difference. I am sure you would say that it does not. ...If Zoom would obey my online robots.txt file, I could try to crawl the site online to see if there would be a difference. This is another reason why you would not be able to crawl the site and do testing.

I see a slight improvement, after the work I did and possibly the work you have done on your program. If I do a one-word search with the accents in the word, it does come up in the results, so it appears that Zoom did indeed index the accented words. If I do a multi-word search, it also gives results, but it's hard to tell exactly what kind of results I am getting. What I have also tried is a multi-word search in quotes, where the words contain accented characters. This does not work. So, this appears to be where your indexing breaks down.

Also, the words are not being highlighted, if I do searches with accented words. This is disappointing, of course. I hope you will be able to fix this.

I did download your files and took a look at them. I also reviewed the searches you performed. Again, I saw that you did not try to perform any searches with two or more words in quotes, where the words are accented. This kind of search is critical on my site.

I look forward to your further responses. Based on m00di's posts, it's evident that others are also interested in this being resolved.

wrensoft
03-18-2007, 08:25 PM
As far as we are aware, there is no issue to be resolved. We did some testing and posted the results of our tests (see above), but didn't see the problem you are talking about. Once you get the config correct, it works fine for French as far as we know, and no one has provided an example to the contrary.

So unless you are prepared to provide exact details of your configuration and copies of your input files we don't plan on investigating this issue. Otherwise there is nothing for us to investigate.