Percent encode URLs in UTF-8 gone

  • Percent encode URLs in UTF-8 gone

    Hello,
    I am missing the "Percent encode URLs in UTF-8" option under "Configure"->"Advanced".
    We have problems opening URLs that contain spaces.
    Also, a '+' in the path is not replaced with %2B in the link.

    Regards
    joern
    Last edited by Schluej; Jan-29-2015, 09:00 AM.

  • #2
    Can you give us some URLs as examples?

    The "+" character should not be percent encoded generally. And we tested scenarios where space characters are correctly encoded as "%20". So we'll need to see some example URLs of what you mean.

    Also make sure you have the latest build.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #3
      Hello Ray,
      here is an example:
      http://127.0.0.1:8091/thb2009tl/veed...rch=%22edim%22
      After replacing "Lost+found" with "Lost%2Bfound", the browser is able to find the PDF.

      Regards
      Joern



      • #4
        Hello,

        I'm experiencing similar problems with commas (,) and parentheses () in the titles of PDF files indexed locally on our intranet.
        If I cannot resolve this, the search engine for our company site is a no-go.

        Thomas



        • #5
          I was able to reproduce the problems you had with the comma and parentheses (note for other readers: this happens in IE specifically, and for offline URLs specifically; please see this thread). However, I could not reproduce the problem you are having with + signs.

          I have also checked pretty thoroughly that the standards (RFC1738, RFC2396, RFC3986) expect the + sign in the path component of a URL to be left unencoded.

          However, I do note that I'm using IE11. It's possible this was a bug in IE9 that they've since fixed. You may want to check if you have the latest version of IE9 while you're at it, since Microsoft's support for it has been stripped down to a limited number of environments.
          --Ray


          • #6
            Hello,
            I have tested this link in 3 browsers:
            PHP Code:
            http://127.0.0.1:8091/thb2009tl/Endress+Hauser/Promass-64-Betriebsanleitung.pdf#search=%22promass%22 
            Firefox 35.0.1
            IE 11.0.15
            Opera 25
            Same result in all of them: HTTP 404. After replacing the "+" with %2B, the link works.

            Links with "," and "(" work:
            PHP Code:
            http://127.0.0.1:8091/thb2009tl/Gilbarco-Schulungsunterlagen/Training%20Files/Laptop_Tools/Gilbarco_Tools/SK700%20Laptop-Tool/Setup,SK700%20Laptop-Tool/Setup1.0.0.15.exe?zoom_highlight=setup1+0+0+15+exe
            http://127.0.0.1:8091/thb2009tl/VEEDER-ROOT%20(TIM)/ESSO-Spezial/ATG-350_350-Plus-Rev-1.pdf#search=%22atg-350_350-plus-rev-1%20pdf%22 
            Regards
            Schluej
            p.s.: http://tools.ietf.org/html/rfc3986
            Code:
            Berners-Lee, et al.         Standards Track                    [Page 12]
            
             
            RFC 3986                   URI Generic Syntax               January 2005
            
            
                  reserved    = gen-delims / sub-delims
            
                  gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
            
                  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                              / "*" / "+" / "," / ";" / "="
            
               The purpose of reserved characters is to provide a set of delimiting
               characters that are distinguishable from other data within a URI.
               URIs that differ in the replacement of a reserved character with its
               corresponding percent-encoded octet are not equivalent.  Percent-
               encoding a reserved character, or decoding a percent-encoded octet
               that corresponds to a reserved character, will change how the URI is
               interpreted by most applications.  Thus, characters in the reserved
               set are protected from normalization and are therefore safe to be
               used by scheme-specific and producer-specific algorithms for
               delimiting data subcomponents within a URI.
            Last edited by Schluej; Feb-03-2015, 09:14 AM.



            • #7
              Originally posted by Schluej
              PHP Code:
              http://127.0.0.1:8091/thb2009tl/Endress+Hauser/Promass-64-Betriebsanleitung.pdf#search=%22promass%22 
              Firefox 35.0.1
              IE 11.0.15
              Opera 25
              Same result in all of them: HTTP 404. After replacing the "+" with %2B, the link works.

              I just tested this same file path and file name, and it worked fine here with Firefox, IE 11.0.9, and Chrome 40.0.

              So we would really need more details to recreate the situation. Something else must be different.

              If the search page is available online, then ideally, give us the URL and we can take a look.

              If not, zip up your search files (index files) and your .zcfg configuration file and email this to us.

              Possible differences include: how is the link being found by the indexer? I have tried:
              1) Indexing with spider mode, and having a page which links to the abovementioned URL
              2) Indexing with spider mode where the start URL is the abovementioned URL
              3) Indexing with offline mode

              None of this yielded a problem. But it's possible you have a particular usage scenario that is different. Your configuration file should hopefully show us what it is.

              Originally posted by Schluej
              p.s.: http://tools.ietf.org/html/rfc3986

              Code:
              Berners-Lee, et al.         Standards Track                    [Page 12]
              
               
              RFC 3986                   URI Generic Syntax               January 2005
              
              
                    reserved    = gen-delims / sub-delims
              
                    gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
              
                    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                                / "*" / "+" / "," / ";" / "="
              
                 The purpose of reserved characters is to provide a set of delimiting
                 characters that are distinguishable from other data within a URI.
                 URIs that differ in the replacement of a reserved character with its
                 corresponding percent-encoded octet are not equivalent.  Percent-
                 encoding a reserved character, or decoding a percent-encoded octet
                 that corresponds to a reserved character, will change how the URI is
                 interpreted by most applications.  Thus, characters in the reserved
                 set are protected from normalization and are therefore safe to be
                 used by scheme-specific and producer-specific algorithms for
                 delimiting data subcomponents within a URI.
              Yes. And this is the paragraph following:

              Code:
              URI producing applications should percent-encode data octets that
              correspond to characters in the reserved set unless these characters are
              specifically allowed by the URI scheme to represent data in that component
              Elsewhere, the path component allows for the + sign.

              So it is generally regarded that a + sign is expected to be encoded in the query component (e.g. "?zoom_query=a%2Bb" is a query for "a+b", while "?zoom_query=a+b" means "a b").

              But the "+" character in the path component (e.g. folder names) is treated literally.
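To illustrate the distinction, here is a minimal Python sketch using the standard library's `urllib.parse`. It only demonstrates the general convention described above and is not a description of how Zoom builds or decodes its links; the sample values are taken from this thread.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# Query component: form-style decoding turns '+' into a space,
# so a literal plus sign has to be sent as %2B.
query_plain = unquote_plus("a+b")        # "a b"
query_escaped = unquote_plus("a%2Bb")    # "a+b"

# Path component: plain percent-decoding leaves '+' untouched,
# and %2B decodes to the same literal '+'.
path_plain = unquote("/thb2009tl/Endress+Hauser/")
path_escaped = unquote("/thb2009tl/Endress%2BHauser/")

# Producing URLs: keep '+' literal in a path segment,
# but escape it inside a query value.
path_segment = quote("Endress+Hauser", safe="+")
query_value = quote_plus("a+b")

print(query_plain, query_escaped)
print(path_plain, path_escaped)
print(path_segment, query_value)
```

Note that `unquote` and `unquote_plus` differ only in how they treat "+", which is exactly the path-versus-query difference discussed here.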

              More references: http://stackoverflow.com/questions/1...and-plus-signs
              --Ray


              • #8
                Hello Ray,
                the files are on the way.

                Regards
                Schluej



                • #9
                  I just tested your files and the above URL still worked fine in all the abovementioned browsers. All browsers have default settings.

                  However, one thing I noticed is that the files you sent us include an old "search.cgi" file from 2011. This is definitely not from the latest build (the rest of the files were generated with V7 build 1015), so this mix of files would not work. I couldn't get your described behaviour with it, however.
                  --Ray


                  • #10
                    I just noticed you also have "search_linux.cgi", "search_win32.cgi", and "search_osx.cgi" which implies you created the index for FlyingAnt at one point. Is this correct?

                    You never mentioned FlyingAnt in the above thread, so this wasn't something we've investigated yet. So it seems like we've been on a wild goose chase with the wrong details.

                    UPDATE: Just tested and could reproduce the problem with FlyingAnt. It's a FlyingAnt bug in that it's not handling unescaped "+" characters in the URL. We'll open a ticket for this to be fixed in the next build of FlyingAnt.
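For readers wondering how a server can end up with this symptom: one plausible mechanism (an assumption for illustration only, not a confirmed description of FlyingAnt's internals) is applying form-style decoding to the path, which turns "+" into a space so the folder is no longer found. The function names in this Python sketch are hypothetical.

```python
from urllib.parse import unquote, unquote_plus, urlsplit

def resolve_path_buggy(url: str) -> str:
    """Hypothetical handler that wrongly form-decodes the path ('+' becomes ' ')."""
    return unquote_plus(urlsplit(url).path)

def resolve_path_fixed(url: str) -> str:
    """Hypothetical handler that only percent-decodes the path ('+' stays literal)."""
    return unquote(urlsplit(url).path)

url = ("http://127.0.0.1:8091/thb2009tl/Endress+Hauser/"
       "Promass-64-Betriebsanleitung.pdf#search=%22promass%22")
escaped = url.replace("+", "%2B")

print(resolve_path_buggy(url))      # folder name now contains a space
print(resolve_path_fixed(url))      # folder name keeps the '+'
print(resolve_path_buggy(escaped))  # %2B decodes to '+' either way
```

Under this assumption, the unescaped link would look for a folder named "Endress Hauser" and return a 404, while the %2B form resolves correctly, which matches the behaviour reported earlier in the thread.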
                    --Ray


                    • #11
                      Hello Ray,
                      sorry for leaving out this important information.
                      Do you have a release date for the new version?

                      Regards
                      Schluej



                      • #12
                        New build of FlyingAnt is now available with the problem with folder paths containing "+" characters fixed.
                        http://www.wrensoft.com/flyingant/
                        --Ray
