**David** · Oct-19-2010, 06:10 PM

For PDF files the data for the search result title can come from the file name, or the internal document title (in the PDF), or an associated meta file (.desc file). What was the name or URL of the PDF file? What was the internal title of the file? Are you indexing in spider mode or off line mode? What character set are you using in Zoom? What character set are you using on your web site?

**delario** · Oct-20-2010, 02:58 PM

Oh, I see. I left out a lot of information:

The filename of the file is:
"MueKo 2010 § 13 i.pdf". There is no title in the PDF properties.

I don't use a .desc file.
I use spider mode.
If I use offline mode, the section sign is shown correctly, but some other information is displayed wrongly.

I use the UTF-8 character set in Zoom (I tried some others too) and UTF-8 on the website as well.

The URL is non-conforming (changing the folder and document names would take weeks):
http://server-name/aufs%C3%A4tze/MueKo%202010%20%C2%A7%2013%20i.pdf

Hope this info helps.
Regards

**Ray** · Oct-21-2010, 01:29 AM

We tried creating some test situations but we could not reproduce the problem.

Can you tell us:

1) Which scripting platform you are using - PHP? ASP? ASP.NET? CGI? Javascript?

2) Is the search script being embedded within another page? For example, did you use #include or similar to embed "search.asp" within your own ASP page? If so, have you tried accessing the search script directly (not via your embedded page), and does it exhibit this problem?

3) Have you modified the search script (search.php, search.asp, search.js, etc.) in any way?

**delario** · Oct-21-2010, 09:21 AM

Thanks for the reply,

I use the CGI platform with IIS 7 on Windows 2008 R2.
For testing I use the search.cgi without template modifications.

To eliminate some possible sources of error, I also tried the ASP platform, with the same result.

This is the source code of the wrong result:

Code:

<div class="result_image">
<a href="http://server-name/aufs%C3%A4tze/MueKo 2010 %C2%A7 13 i.pdf#search=&quot;bgh&quot;" target="_blank">
<img src="http://ip-address/search/pdf.gif" alt="" class="result_image" />
</a>
</div>
<div class="result_title">
<b>2.</b>
&nbsp;
<a href="http://server-name/aufs%C3%A4tze/MueKo 2010 %C2%A7 13 i.pdf#search=&quot;bgh&quot;" target="_blank">
MueKo 2010 § 13 i.pdf
</a>
</div>

Regards

**Ray** · Oct-22-2010, 02:41 AM

OK, I just reproduced the issue with Spider Mode. I had overlooked that and was testing with Offline Mode, which didn't have the problem.

The problem is that, ultimately, you are using a character in a filename which is bad practice for URLs (note that a character may be okay for a filename but ill-advised for a URL, which has different requirements).

This is because the standards did not originally accommodate such characters, and it was ambiguous how such characters were to be handled.

Because of this, Microsoft originally implemented their Windows API functions (UrlEscape and UrlUnescape) to handle such characters as ASCII. Likewise with some old web servers and browsers. But since then, and more recently, it has become more accepted and expected to handle such characters as UTF-8.

Microsoft has only recently addressed this with a feature in the form of URL_ESCAPE_AS_UTF8, which is only available in Windows 7.

To complicate matters, older web servers may use ASCII escaped characters. So there's no one "correct" way of handling this.
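To illustrate the ambiguity, here is a minimal sketch in Python (used purely for illustration; it is not part of Zoom and not how Zoom decodes URLs). It takes the file name portion of the URL posted earlier in this thread and decodes it twice: once treating the escaped bytes as UTF-8, and once treating every escaped byte as its own character, with Latin-1 standing in as an assumed single-byte interpretation.

```python
from urllib.parse import quote, unquote

# File name part of the URL posted earlier in this thread.
escaped = "MueKo%202010%20%C2%A7%2013%20i.pdf"

# Treat the escaped bytes as UTF-8: %C2%A7 together form one character.
as_utf8 = unquote(escaped, encoding="utf-8")

# Treat each escaped byte as a separate character (Latin-1 is an assumed
# stand-in for a single-byte reading): %C2 and %A7 become two characters.
as_single_byte = unquote(escaped, encoding="latin-1")

print(as_utf8)         # MueKo 2010 § 13 i.pdf
print(as_single_byte)  # MueKo 2010 Â§ 13 i.pdf

# Encoding in the other direction shows the same split.
print(quote("§", encoding="utf-8"))    # %C2%A7
print(quote("§", encoding="latin-1"))  # %A7
```

The same escaped URL yields two different titles depending on which convention the decoder assumes, which is why a decoder and a server that disagree produce a wrong character in the result.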

We recently added support for this (in V6.0 build 1021) when percent encoding URLs. In your situation, however, the issue is percent decoding of the URL given by the web server. So we need to change that too for the next release**. But as mentioned, even with this change in place, it will only work on Windows 7.

Quoting from my previous post on this issue, ultimately you need to consider this:

It is worth noting that non-alphanumeric characters in URLs have always been a nasty/problematic area, as URLs were never designed for them and these measures were all attempts at making the syntax do something it wasn't originally capable of. So it would be wise to seriously consider renaming the folder/filenames if you wish to avoid this kind of trouble.

This applies to any other software which will interact with your web site, including servers, browsers and other spiders and bots.

**EDIT: Okay, it turns out Microsoft didn't even add support to UrlUnescape to handle UTF-8 percent encoding in Windows 7. It's only in UrlEscape, not the reverse. Go figure. I even tested this by implementing it first, thinking it might just be undocumented (as it was originally in UrlEscape). But no, it just doesn't work. This would suggest that proper support for unescaping UTF-8 URLs is even less common than first expected.

So it's going to be more than just a quick patch to address this; we'll have to write new code, which will more likely go into V7 instead (where it will be a user toggle-able behaviour). I would recommend addressing this as suggested above by getting to the cause of the problem. As much work as it may seem, you will benefit in the long run.

There are various tools out there to perform mass file renaming so that you don't have to do it manually.
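As a rough idea of what such a mass rename could look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: the replacement table (for example mapping "§" to "Par"), the function names `ascii_safe_name` and `rename_tree`, and the choice to replace spaces with underscores. It defaults to a dry run that only reports what would be renamed.

```python
import os
import unicodedata

# Hypothetical replacements; pick whatever suits your own naming scheme.
REPLACEMENTS = {"§": "Par", "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
                "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

def ascii_safe_name(name: str) -> str:
    """Return a name containing only ASCII characters and no spaces."""
    for old, new in REPLACEMENTS.items():
        name = name.replace(old, new)
    # Strip accents from anything left, then drop remaining non-ASCII.
    name = unicodedata.normalize("NFKD", name)
    name = name.encode("ascii", "ignore").decode("ascii")
    return name.replace(" ", "_")

def rename_tree(root: str, dry_run: bool = True) -> list[tuple[str, str]]:
    """Rename files and folders under root; a dry run only returns the plan."""
    plan = []
    # Walk bottom-up so entries are renamed before their parent folders.
    for folder, dirs, files in os.walk(root, topdown=False):
        for name in files + dirs:
            new_name = ascii_safe_name(name)
            src = os.path.join(folder, name)
            dst = os.path.join(folder, new_name)
            # Skip unchanged names and avoid overwriting existing entries.
            if new_name == name or os.path.exists(dst):
                continue
            plan.append((src, dst))
            if not dry_run:
                os.rename(src, dst)
    return plan

print(ascii_safe_name("MueKo 2010 § 13 i.pdf"))  # MueKo_2010_Par_13_i.pdf
```

Reviewing the list returned by `rename_tree(root)` before calling it again with `dry_run=False` gives a chance to catch unwanted names. Any links pointing at the old file names would still need to be updated separately.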

**delario** · Oct-22-2010, 07:59 AM

Okay, thank you very much for analyzing this problem, Ray!

When I was analyzing this, I saw that there is the zoom_pagedata.zdat file.
If I automatically replace this ugly sign with nothing or a blank, sometimes it's OK and sometimes I get the message "Error: Corrupted or invalid index files. Please re-index your site and make sure to re-upload all Required Files listed at the end of indexing. " on the search page.

Is there a workaround to edit this file without getting the message?
If I see it right, I am replacing only the part for the filename, not the URL or any other important data.

Thanks for the reply and best regards

**David** · Oct-22-2010, 08:22 AM

We don't recommend direct editing of the data in the index files. You will end up corrupting them (as you have discovered). If you didn't change the length of any of the records you might get away with it. But why not put the effort into fixing the root cause of the problem instead?
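To show why the record length changes, here is a small Python sketch. It is not a recommendation to edit the files, and it rests on an assumption: that the title is stored as UTF-8 bytes. The actual layout of the index files is not described in this thread.

```python
# Hypothetical title field, assumed to be stored as UTF-8 bytes; the real
# layout of the Zoom index files is not documented in this thread.
original = "MueKo 2010 § 13 i.pdf".encode("utf-8")
sign = "§".encode("utf-8")  # two bytes in UTF-8: C2 A7

removed = original.replace(sign, b"")   # sign replaced with nothing
blanked = original.replace(sign, b" ")  # sign replaced with one blank
padded = original.replace(sign, b"  ")  # sign replaced with two blanks

print(len(original), len(removed), len(blanked), len(padded))  # 22 20 21 22
```

Under that assumption, replacing the sign with nothing or with a single blank shortens the record, and only a replacement of exactly the same byte length leaves it unchanged.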

**delario** · Oct-22-2010, 08:32 AM

Fixing the root problem isn't an option, because it is very necessary for our users to find terms like "§ 123 Abs. 2".

If I understand correctly, we would have to remove all section signs from the filenames, and that isn't possible because the documents don't have a title in their file properties.

Section Sign shown wrong in Title