Bug? Accented character in directory path

  • Bug? Accented character in directory path

    I'm using Zoom 5.0.1004 with CGI/Win32. One of my directories includes an accented character, which is indexed fine.

    But when I click a link in the search results, the accented character is corrupted and the document is not found.

    Indexed directory path
    C:\web\..\papers\Cafén

    On clicking on the link, I get:
    The requested URL /../papers/Cafén/doc.pdf was not found on this server


    Regards,
    Ian Tresman

  • #2
    You should not use a raw 'é' as part of a URL. Accented characters are not valid URL characters; they can only appear in escaped form (% hex hex).

    The formal definition is:

    -=-=-=-=-=-=-=-=
    The URL standard, RFC 1738, <http://www.ietf.org/rfc/rfc1738.txt>.

    ; URL schemeparts for ip based protocols:
    ip-schemepart = "//" login [ "/" urlpath ]
    login = [ user [ ":" password ] "@" ] hostport
    hostport = host [ ":" port ]
    host = hostname | hostnumber
    hostname = *[ domainlabel "." ] toplabel
    domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
    toplabel = alpha | alpha *[ alphadigit | "-" ] alphadigit
    alphadigit = alpha | digit
    hostnumber = digits "." digits "." digits "." digits
    port = digits
    user = *[ uchar | ";" | "?" | "&" | "=" ]
    password = *[ uchar | ";" | "?" | "&" | "=" ]
    urlpath = *xchar ; depends on protocol see section 3.1

    ; HTTP
    httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
    hpath = hsegment *[ "/" hsegment ]
    hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
    search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

    ; Miscellaneous definitions
    lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
               "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
               "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
               "y" | "z"
    hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
              "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
              "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
    alpha = lowalpha | hialpha
    digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
            "8" | "9"
    safe = "$" | "-" | "_" | "." | "+"
    extra = "!" | "*" | "'" | "(" | ")" | ","
    national = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
    punctuation = "<" | ">" | "#" | "%" | <">

    reserved = ";" | "/" | "?" | ":" | "@" | "&" | "="
    hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
          "a" | "b" | "c" | "d" | "e" | "f"
    escape = "%" hex hex
    unreserved = alpha | digit | safe | extra
    uchar = unreserved | escape
    xchar = unreserved | reserved | escape
    digits = 1*digit
    -=-=-=-=-=-=-=-=
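
    To make the grammar concrete, here is a minimal Python sketch that checks a path against the `hpath`/`hsegment` rules quoted above. The function name `is_valid_hpath` is made up for illustration; this is not part of Zoom, only a reading of the RFC 1738 rules.

```python
import re

# Character classes copied from the RFC 1738 grammar quoted above.
SAFE = "$-_.+"
EXTRA = "!*'(),"
UNRESERVED = "A-Za-z0-9" + re.escape(SAFE) + re.escape(EXTRA)

# hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
# uchar    = unreserved | escape,  escape = "%" hex hex
HSEGMENT_RE = re.compile("(?:[" + UNRESERVED + ";:@&=]|%[0-9A-Fa-f]{2})*")


def is_valid_hpath(path: str) -> bool:
    """Return True if every '/'-separated segment matches hsegment."""
    return all(HSEGMENT_RE.fullmatch(seg) is not None for seg in path.split("/"))


print(is_valid_hpath("papers/Cafén/doc.pdf"))    # raw accented character
print(is_valid_hpath("papers/Caf%E9n/doc.pdf"))  # escaped form
```

    The raw 'é' fails the check because it is neither in `unreserved` nor an `escape`, while the escaped form passes.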

    But having said that, we could probably look at encoding the illegal characters in the URL (as % hex hex). In fact I think we already do this for spider mode. I assume you are using offline mode?
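
    As a rough illustration of what "% hex hex" encoding means for a path, the sketch below escapes every character that the RFC 1738 `hsegment` rule does not allow. It is not Zoom's actual code: the names `encode_segment` and `encode_path` are hypothetical, and the default `latin-1` charset is an assumption about how the file names are stored.

```python
import string

# Characters that may appear unescaped in an hsegment (RFC 1738):
# unreserved (alpha, digit, safe, extra) plus ";" ":" "@" "&" "=".
ALLOWED = set(string.ascii_letters + string.digits + "$-_.+" + "!*'()," + ";:@&=")


def encode_segment(segment: str, charset: str = "latin-1") -> str:
    """Escape every character outside ALLOWED as one %XX per encoded byte."""
    out = []
    for ch in segment:
        if ch in ALLOWED:
            out.append(ch)
        else:
            # The charset is an assumption; it must match what the server expects.
            out.extend("%{:02X}".format(b) for b in ch.encode(charset))
    return "".join(out)


def encode_path(path: str, charset: str = "latin-1") -> str:
    """Encode each segment separately so the '/' separators are preserved."""
    return "/".join(encode_segment(seg, charset) for seg in path.split("/"))


print(encode_path("../papers/Cafén/doc.pdf"))
```

    Note that a literal "%" in a file name is itself escaped (as %25), which is what you want when the input is a raw file system path rather than an already-encoded URL.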


    • #3
      Originally posted by wrensoft:
      "But having said that, we could probably look at encoding the illegal characters in the URL (as % hex hex). In fact I think we already do this for spider mode. I assume you are using offline mode?"

      Correct. For speed, I'm indexing in Offline mode, then uploading and running it from a web site.

      Regards,
      Ian Tresman


      • #4
        I think I'll just change the directory name and remove the accent.

        Regards,
        Ian


        • #5
          This will be addressed in the upcoming build (5.0.1005) which will encode any unsafe characters in the filename/path when indexing in Offline Mode. So in the above example, it will now link to "../papers/Caf%E9n/doc.pdf".
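
          For anyone comparing links: the %E9 in that example is the single-byte Latin-1 (ISO-8859-1) value for 'é', whereas a UTF-8 encoder would produce %C3%A9. The short Python sketch below shows both forms using the standard library; which form a given web server resolves to the file on disk is an assumption you would need to verify for your own setup.

```python
from urllib.parse import quote, unquote

name = "Cafén"

# Same directory name, percent-encoded under two different charsets.
latin1_form = quote(name, encoding="latin-1")
utf8_form = quote(name, encoding="utf-8")

# Decoding with the matching charset recovers the original name.
round_trip = unquote(latin1_form, encoding="latin-1")

# Decoding the Latin-1 form as UTF-8 does not: the lone 0xE9 byte is invalid
# UTF-8 and is replaced, which is one way an accent ends up "corrupted".
mismatch = unquote(latin1_form, encoding="utf-8")

print(latin1_form, utf8_form)
```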
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine
