404 when I click on result

  • 404 when I click on result

    Hi all,
    When I click a result title to open the document, I get a 404 error.
    I checked my Apache configuration and I think it is set up correctly. I suspect an encoding problem: if I open the document through the Apache directory listing I can reach it, but the link from the search engine returns a 404.

    Here are the two links:

    This is the link with 404 error returned by the search engine:

    http://Linux-02/archivi/01-sistema%20qualit%E0%20/z_superati/GIANNI08/Chiara2008/Registrazioni/m4.2.b-%20elenco%20riviste.doc

    And this is the link generated from Apache on Directory Listing that works:

    http://Linux-02/archivi/01-sistema%20qualit%c3%a0%20/z_superati/GIANNI08/Chiara2008/Registrazioni/m4.2.b-%20elenco%20riviste.doc

    Could you help me find a solution?

    Thank you very much!

  • #2
    Offline indexing?

    I assume you are using Zoom in offline mode? You need to set your base URL to the root of your site, e.g. "http://domain.com".

    Try indexing online with the spider. If the spider can't reach the pages,
    then your server is not configured correctly.

    Can you reach your documents via a web browser?
    Are you attempting to index a web site?

    Maybe the guys who run this forum can assist better.


    • #3
      One more tip

      Spaces in names are bad.

      Name your directories and documents like so:

      this_file OK
      this file NOT OK


      • #4
        Spaces are converted to %20, which is not a problem (though it does make for ugly URLs, and underscores, i.e. the "_" character, as z00m_user suggests, are generally better).

        This is indeed an encoding problem. The accented character "à" is being encoded as "%E0" (the hexadecimal value of the character in ISO-8859-1/Windows-1252). However, Apache appears to be expecting it as "%C3%A0" (the hexadecimal values of the character's two bytes in UTF-8).

        Historically, the standards did not specify which character set should be used when percent-encoding such characters in a URL. It was common to use a single-byte charset such as ISO-8859-1 if the character exists in it (and in this case it does: "à" is 0xE0 in ISO-8859-1, although it is not part of 7-bit ASCII).

        However, it now seems common that non-reserved characters are converted to UTF-8 prior to "percent encoding" them. We will have to update Zoom to match the behaviour of various web servers accordingly (and hopefully not break compatibility with older servers).
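To make the difference concrete, here is a minimal shell sketch (not part of Zoom; the helper name `percent_encode_bytes` is made up for illustration) that percent-encodes the raw bytes of "à" in each charset. It assumes a POSIX shell with `od`, `tr` and `sed`, and writes the bytes as octal escapes so the result does not depend on the encoding of the script file.

```shell
# Hypothetical helper: percent-encode every byte read from stdin as %XX.
percent_encode_bytes() {
  od -An -v -tx1 | tr -d '[:space:]' | tr 'a-f' 'A-F' | sed 's/../%&/g'
}

# "à" is the single byte 0xE0 in ISO-8859-1 / Windows-1252 ...
latin1=$(printf '\340' | percent_encode_bytes)
# ... and the two bytes 0xC3 0xA0 in UTF-8.
utf8=$(printf '\303\240' | percent_encode_bytes)

echo "ISO-8859-1: $latin1"   # %E0
echo "UTF-8:      $utf8"     # %C3%A0
```

The first value matches the link produced by the search engine and the second matches the link from the Apache directory listing (Apache writes the hex digits in lowercase, which is equivalent).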

        In the meantime, if this is the only character in your index file which suffers from this problem, you can work around it with the "Rewrite links" feature ("Configure"->"Indexing options"): set "Find in URL" to %E0 and "Replace with" to %C3%A0.
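As a quick sanity check of that rule outside Zoom, the same find-and-replace can be applied to the failing link with `sed`. This only illustrates the substitution itself; it makes no claim about how Zoom applies "Rewrite links" internally.

```shell
# The link returned by the search engine (copied from the first post).
url='http://Linux-02/archivi/01-sistema%20qualit%E0%20/z_superati/GIANNI08/Chiara2008/Registrazioni/m4.2.b-%20elenco%20riviste.doc'

# Same rule as the workaround: find %E0, replace with %C3%A0.
fixed=$(printf '%s\n' "$url" | sed 's/%E0/%C3%A0/g')

echo "$fixed"
```

The result differs from the working Apache link only in the case of the hex digits (%C3%A0 versus %c3%a0), and hex digits in percent-encoding are case-insensitive.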

        If you have more than one character that exhibits this problem (look for any folder or filenames containing accented characters), then this will be trickier. You may need to rename the folders or files so that their names contain no accented characters.
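One way to look for such names on the Linux server is sketched below. The helper name `list_non_ascii_names` and the example path are assumptions for illustration, and the sketch assumes GNU or BSD `find`. It lists every file or folder whose own name contains a byte outside printable ASCII, which covers accented characters in either encoding.

```shell
# Hypothetical helper: list entries under a directory whose own name
# contains at least one byte outside printable ASCII (space through ~).
list_non_ascii_names() {
  # LC_ALL=C makes the pattern match byte by byte, whatever the encoding.
  LC_ALL=C find "$1" -name '*[! -~]*'
}

# Example (the path is an assumption; substitute your document root):
# list_non_ascii_names /path/to/archivi
```

If the list is short, a handful of rewrite rules or renames may be enough; if it is long, renaming by script is probably the more practical route.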

        We'll try to put this change in the next patch release (V6.0 build 1021).
        Last edited by Ray; Mar-31-2010, 12:06 AM.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine


        • #5
          We did some further investigation.

          It turns out that it is Microsoft's API function UrlEscape (along with UrlCanonicalize and its equivalents) which is performing this duty and encoding the character in a single-byte charset (Windows-1252) when it applies percent encoding.

          There is no ability to use a different charset with these functions (or any others that we are aware of) in Windows, except in Windows 7, where a new flag (URL_ESCAPE_AS_UTF8) has been introduced. We will add support for this in the next release if possible (but it would naturally still require Windows 7 to work).

          It is worth noting that non-alphanumeric characters in URLs have always been a nasty, problematic area, as URLs were never designed for them and these measures were all attempts at making the syntax do something it wasn't originally capable of. So it would be wise to consider renaming the folders and files if you wish to avoid this kind of trouble.
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine


          • #6
            Thank you Ray for your help and your detailed explanation.
            Unfortunately I have a huge number of documents (about 2TB) and I cannot rename them, because a very large number of them have special characters in their names.

            While waiting for the fix in the new build, I will try the "Rewrite links" option. Thank you for your help!


            • #7
              If these are just DOC files on a Linux server, it would be fairly easy to write a script that cycles through each file, checks the name and renames it if required.

              For example, here is a shell script to replace space characters with underscores for .doc files in the current folder.

              for FILE in *.doc ; do NEWFILE=$(printf '%s' "$FILE" | sed 's/ /_/g') ; [ "$FILE" = "$NEWFILE" ] && continue ; echo "$FILE will be renamed as $NEWFILE" ; mv -- "$FILE" "$NEWFILE" ; done


              Test it in a small folder first and back up your files before you start. It can be hard to reverse changes like this.
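For a nested tree like the one in the first post, a recursive variant with a dry-run mode may be safer. The following is only a sketch: the function name `rename_spaces` and the `DRY_RUN` variable are made up for illustration, it assumes GNU or BSD `find`, and it does not handle names containing newlines.

```shell
# Hypothetical helper: replace spaces with underscores in every file and
# folder name under "$1". With DRY_RUN=1 (the default) it only prints the plan.
rename_spaces() {
  # -depth lists a folder's contents before the folder itself, so children
  # are renamed before the path of their parent changes.
  find "$1" -mindepth 1 -depth -name '* *' | while IFS= read -r path; do
    dir=$(dirname "$path")
    base=$(basename "$path")
    new="$dir/$(printf '%s' "$base" | tr ' ' '_')"
    if [ "${DRY_RUN:-1}" = 1 ]; then
      printf 'would rename: %s -> %s\n' "$path" "$new"
    elif [ -e "$new" ]; then
      # Never overwrite an existing file or folder.
      printf 'skipped (target exists): %s\n' "$path" >&2
    else
      mv -- "$path" "$new"
    fi
  done
}

# Example (the path is an assumption):
# rename_spaces /path/to/archivi              # dry run, prints the plan
# DRY_RUN=0 rename_spaces /path/to/archivi    # actually renames
```

Review the dry-run output before running it for real, and remember that the site has to be re-indexed afterwards so the search results point at the new names.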
