View Full Version : problem with dashes and indexing



derdan
07-20-2005, 05:42 PM
I'm indexing a set of HTML pages for offline access. Included among the indexed terms are command line flags, which consist of a dash followed by a single letter (e.g., "-f"). For the most part, there are no problems with these. However, I did notice that I wasn't getting the correct "hits" when searching for -b or -v. I looked in the zoom_index.js file, and sure enough, I found something peculiar:

"-v,6,4",
"-b,6,4",
"-b,6,8,14,15,35,15,269,15,271,16,272,15,281,75,282,15,288,15,289,15,306,16",
"-v,6,4,10,8,81,15,122,20,124,15,133,30,271,4,273,15",

For some reason, -b and -v both have two array entries. When I perform a search, only the web pages listed in the first occurrence of a given flag in the array are returned. If I transpose the two array elements for either -b or -v, I get the longer list of results when performing a search, as expected.
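
To illustrate the behaviour described above, here is a minimal sketch of a lookup that stops at the first entry whose term matches. This is not Zoom's actual search code: the function name `firstMatch` and the assumption that each entry is a term followed by comma-separated numbers are hypothetical, chosen only to show why a duplicate entry later in the array would never be reached.

```javascript
// Hypothetical sketch, NOT Zoom's real search routine.
// The four entries are taken from the zoom_index.js excerpt above.
const entries = [
  "-v,6,4",
  "-b,6,4",
  "-b,6,8,14,15,35,15,269,15,271,16,272,15,281,75,282,15,288,15,289,15,306,16",
  "-v,6,4,10,8,81,15,122,20,124,15,133,30,271,4,273,15",
];

// Return the numeric data of the FIRST entry whose term equals the query.
// Assumption: the term is everything before the first comma.
function firstMatch(index, term) {
  for (const entry of index) {
    const fields = entry.split(",");
    if (fields[0] === term) {
      return fields.slice(1).map(Number);
    }
  }
  return [];
}

// With the original order, only the short entry is found.
console.log(firstMatch(entries, "-b")); // [ 6, 4 ]

// Transposing the two "-b" entries makes the longer list win.
const transposed = [entries[0], entries[2], entries[1], entries[3]];
console.log(firstMatch(transposed, "-b").length); // 22
```

Under these assumptions, any lookup that returns on the first hit would reproduce the symptom: the second entry for the same term is simply never consulted.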

So I'm wondering why this indexing anomaly is occurring, and if there's a possible workaround. It's not difficult for me to fix the problem manually, but the index file is over 400 KB and I haven't checked it thoroughly for other similar issues. Better to avoid the problem altogether.
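
Rather than checking a 400 KB index by eye, a short script could list every term that appears more than once. The sketch below is only an illustration: `findDuplicateTerms` is a hypothetical helper, it assumes each entry is a string of the form "term,number,number,...", and it compares terms case-insensitively as a precaution. How the array is loaded from zoom_index.js is left out, since the variable name used in that file is not shown here.

```javascript
// Hypothetical helper: report terms that occur in more than one entry.
// Assumption: the term is everything before the first comma of an entry.
function findDuplicateTerms(index) {
  const counts = new Map();
  for (const entry of index) {
    const comma = entry.indexOf(",");
    const rawTerm = comma === -1 ? entry : entry.slice(0, comma);
    // Fold case so that variants such as "-B" and "-b" count as one term.
    const term = rawTerm.toLowerCase();
    counts.set(term, (counts.get(term) || 0) + 1);
  }
  // Keep only the terms seen more than once, in first-seen order.
  return [...counts].filter(([, n]) => n > 1).map(([term]) => term);
}

// Small sample in the same shape as the excerpt above.
const sampleIndex = [
  "-v,6,4",
  "-b,6,4",
  "-f,3,9",
  "-b,6,8,14,15",
  "-v,6,4,10,8",
];

console.log(findDuplicateTerms(sampleIndex)); // [ '-v', '-b' ]
```

An empty result would suggest that no other terms are affected, at least under the format assumed here.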

I'm using Zoom v4.1 Professional, by the way.

Thanks in advance.

Ray
07-21-2005, 12:57 AM
Check whether the encoding/charset of the files you are indexing matches your encoding configuration in Zoom (click "Configure" -> "Languages"). A mismatch can sometimes cause characters or words to be mis-encoded when they are written out to file, producing what appears to be a duplicate entry.

If this is not the problem, it would be best to e-mail us your ZCFG file and some of the HTML pages you are indexing (if the files are not online). Depending on the number of pages that contain these terms ("-b" and "-v"), send us enough pages to reproduce the problem (multiple entries for the same term in the zoom_index.js file). We can then take a closer look.

You should also ensure that you are using the latest build (Version 4.1 build 1003) available at:
http://www.wrensoft.com/zoom/whatsnew.html

derdan
07-22-2005, 02:05 PM
The charset of the files and Zoom configuration are the same, and I'm using the latest build.

I'll e-mail the relevant files so you can investigate further...

Ray
07-25-2005, 04:14 AM
We had a look at your files and have determined that this is a bug.

Note that this only affects words beginning with a punctuation character (e.g. "-b", ".net", etc.) and only on the JavaScript platform. It occurs when these words appear in different upper/lowercase forms.

This was triggered in your files because you have "-b" and "-v" on most pages, except for one ("errors_ndlm.html"), which mentions them as "-B" and "-V" (note: in upper case form).

We will have this bug fixed in the next public build - most likely Version 4.2.

In the meantime, you may want to work around the issue by modifying that single file, replacing "-B" with "-b" and "-V" with "-v".
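
If editing the file by hand is inconvenient, the replacement could be scripted. The following is a rough sketch only: `lowercaseFlags` is a hypothetical helper, and it assumes a flag is a dash plus a single letter that is not attached to other word characters or dashes. It operates on plain text, so it would also touch matching text inside HTML attributes; check the result before re-indexing.

```javascript
// Hypothetical helper: lowercase stand-alone single-letter flags.
// `letters` lists the uppercase letters to convert, e.g. "BV".
function lowercaseFlags(text, letters) {
  // Match "-X" only when it is not glued to a word character or
  // another dash on either side (so "-Bx" and "x-B" are left alone).
  const pattern = new RegExp(`(?<![\\w-])-([${letters}])(?![\\w-])`, "g");
  return text.replace(pattern, (match, letter) => "-" + letter.toLowerCase());
}

const sampleHtml = "<p>Use <code>-B</code> or -V, not -Bx or x-B.</p>";
console.log(lowercaseFlags(sampleHtml, "BV"));
// <p>Use <code>-b</code> or -v, not -Bx or x-B.</p>
```

The contents of the affected file would be passed through this function and written back before rebuilding the index.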

Let us know if you have any questions, or if you continue to have problems.