Index file duplicated in the search results

  • Index file duplicated in the search results

    The index file is getting duplicated in the search results: once for the file itself, and once for the fact that it is the "default" file for the root directory (note the URL listings in the example results below):

    Search results for: future


    2 results found.


    1. Home Page
    ... . How about the future growth? ...
    URL: http://www.blah.com/

    2. Home Page
    ... . How about the future growth? ...
    URL: http://www.blah.com/index.html
    The HTML files are being passed through the PHP interpreter via an AddType directive in an .htaccess file. I'm not sure whether that makes any difference.

  • #2
    If the two pages are really identical then you can use the Duplicate page detection feature (on the "Scan options" tab in the configuration window) to remove one of them.

    -------
    David

    • #3
      Okay I'll give that a try...

      I didn't think to try that option because there is only a single file (index.html), yet somehow it is being duplicated in the index.

      • #4
        There is only one file on the disk, but if you look at just the URLs (as the spider does), there appear to be two files because there are two different URLs.

        ------
        David

        • #5
          If anyone is curious as to why this occurs:

          The nature of HTTP is that web browsers and spiders cannot tell whether "http://www.mysite.com/index.html" is the same page as "http://www.mysite.com/". Technically, they can be two totally different pages; it depends on what the web server is configured to do with each URL.

          Turning on "duplicate page detection" tells Zoom to examine each page after downloading it, determine whether it has seen the same page before, and discard it if it has.
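
          To make the idea concrete, here is a minimal sketch of content-based duplicate detection. It is not Zoom's actual implementation; it assumes pages are compared by a hash of the downloaded bytes, and the function name `dedupe_pages` and the sample page bodies are hypothetical.

```python
import hashlib


def dedupe_pages(pages):
    """Keep the first URL seen for each distinct page body.

    `pages` is an iterable of (url, body_bytes) pairs in crawl order.
    Returns (kept_urls, dropped), where each dropped entry is a
    (duplicate_url, original_url) pair.
    """
    seen = {}     # content digest -> first URL that produced it
    kept = []
    dropped = []
    for url, body in pages:
        digest = hashlib.sha256(body).hexdigest()
        if digest in seen:
            # Same bytes already indexed under another URL: discard.
            dropped.append((url, seen[digest]))
        else:
            seen[digest] = url
            kept.append(url)
    return kept, dropped


# Hypothetical crawl: one file on disk, reachable under two URLs.
body = b"<html><title>Home Page</title>How about the future growth?</html>"
crawled = [
    ("http://www.blah.com/", body),
    ("http://www.blah.com/index.html", body),
]
kept, dropped = dedupe_pages(crawled)
```

          With this input only the first URL survives, and the second is recorded as a duplicate of it. Note that the page still has to be downloaded before it can be discarded.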

          But to really prevent this from happening in the first place, you should use a consistent linking scheme in your web pages. Zoom's spider found two different pages because the page is referred to somewhere as "http://www.blah.com/index.html" and elsewhere as "http://www.blah.com/". This may be because you have two different links back to your homepage, one using the former address and the other using the latter. It can also be caused by your "start URL" in Zoom not matching the address used in your hypertext links to the same page. If these URLs are consistent across your site, Zoom will not come across multiple instances of the page at all, and duplicate page detection will not be required.
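
          One way to check a site for this kind of inconsistency is sketched below. It is only an illustration: it assumes the server's default document is index.html (that depends on the server configuration), and the names `DEFAULT_DOCS`, `directory_form` and `find_inconsistent_links` are hypothetical. Feed it the start URL together with the link targets found in your pages.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

# Assumption: file names the server serves for a bare directory URL.
DEFAULT_DOCS = ("index.html",)


def directory_form(url):
    """Rewrite .../index.html to .../ so both spellings compare equal."""
    parts = urlsplit(url)
    path = parts.path or "/"
    head, _, tail = path.rpartition("/")
    if tail in DEFAULT_DOCS:
        path = head + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def find_inconsistent_links(urls):
    """Return pages that are referenced under more than one spelling."""
    groups = defaultdict(set)
    for url in urls:
        groups[directory_form(url)].add(url)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


# Hypothetical input: the start URL plus links collected from the pages.
urls = [
    "http://www.blah.com/",            # start URL
    "http://www.blah.com/index.html",  # link used inside the HTML
    "http://www.blah.com/about.html",
]
conflicts = find_inconsistent_links(urls)
```

          Any entry in `conflicts` is a page that the spider would reach under two different URLs; making the start URL and the links agree on one spelling removes it.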
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine

          • #6
            Thanks for the clarification. That did it!

            My HTML only ever linked to the index file one way (as blah.com/index.html), but in Zoom I was telling it to start indexing at blah.com/. Updating the Zoom configuration to start indexing at blah.com/index.html fixed it.
