Index file duplicated in the search results
Anonymous
06-03-2005, 05:22 PM
The index file is getting duplicated in the search results: once for the file itself, and once for the fact that it is the "default" file for the root directory (note the URL listings in the example results below):
Search results for: future
2 results found.
1. Home Page
... . How about the future growth? ...
URL: http://www.blah.com/
2. Home Page
... . How about the future growth? ...
URL: http://www.blah.com/index.html
The HTML files are being passed through the PHP interpreter via an .htaccess file AddType command. I'm not sure if that makes any difference or not.
wrensoft
06-04-2005, 01:15 AM
If the two pages are really identical then you can use the Duplicate page detection feature (on the "Scan options" tab in the configuration window) to remove one of them.
-------
David
Anonymous
06-04-2005, 03:05 AM
Okay I'll give that a try...
I didn't think to try that option because there is only a single file (the index.html file) but somehow it is being duplicated in the index.
wrensoft
06-05-2005, 12:07 AM
There is only one file on the disk, but if you look at just the URLs (as the spider does), there appear to be two files because there are two different URLs.
------
David
If anyone is curious as to why this occurs:
The nature of HTTP is that web browsers and spiders cannot tell whether "http://www.mysite.com/index.html" is the same page as "http://www.mysite.com/". Technically, they can be two totally different pages; it depends on what the web server is configured to do with the URL.
Turning on "duplicate page detection" will ask Zoom to look at the files after it has downloaded them and determine if we've seen this page before, and discard it if we have.
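For anyone who wants to see the general idea, here is a minimal Python sketch of content-based duplicate detection. It is not Zoom's actual implementation (the thread does not describe how Zoom compares pages); it simply assumes that two pages count as duplicates when their downloaded bytes are identical. The function names and the sample page bodies are made up for illustration.

```python
import hashlib


def content_fingerprint(body: bytes) -> str:
    """Return a hex digest that identifies a page by its downloaded bytes."""
    return hashlib.sha256(body).hexdigest()


def drop_duplicate_pages(pages):
    """Keep the first URL seen for each distinct page body.

    `pages` is an iterable of (url, body_bytes) pairs in crawl order.
    Assumption: "duplicate" means byte-for-byte identical content.
    """
    seen = set()
    unique = []
    for url, body in pages:
        digest = content_fingerprint(body)
        if digest in seen:
            continue  # same content was already kept under another URL
        seen.add(digest)
        unique.append((url, body))
    return unique


# Hypothetical crawl result: one file on disk, reached through two URLs.
crawled = [
    ("http://www.blah.com/", b"<html>How about the future growth?</html>"),
    ("http://www.blah.com/index.html", b"<html>How about the future growth?</html>"),
]
kept = [url for url, _ in drop_duplicate_pages(crawled)]
```

With the two sample entries above, only the first URL is kept. Note that an exact hash like this would miss pages that differ by even one byte (a timestamp, for example), so treat it as an illustration of the principle only.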
But to really prevent this from happening in the first place, you should use a consistent linking scheme in your web pages. The reason that Zoom's spider found the two different pages is that the page is being referred to somewhere as "http://www.blah.com/index.html" and elsewhere as "http://www.blah.com/". This may be because you have two different links back to your homepage, one using the former address, and the other using the latter. This can also be caused by your "start URL" in Zoom not matching the address used in your hypertext links to the same page. If these URLs were consistent on your site, Zoom would not come across the multiple instances of the page at all, and duplicate page detection would not be required.
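If you want to check your own links for this kind of inconsistency, a rough Python sketch follows. It is only a checking aid and has nothing to do with Zoom itself. It assumes that your web server serves one of the listed default documents for a directory URL; that list is a guess and must match your actual server configuration, since (as noted above) the two URLs can legitimately be different pages. The function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: file names the web server serves for a bare directory URL.
DEFAULT_DOCUMENTS = ("index.html", "index.htm", "index.php")


def canonical_url(url: str) -> str:
    """Collapse '.../index.html' style URLs to the directory form '.../'."""
    parts = urlsplit(url)
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last.lower() in DEFAULT_DOCUMENTS:
        path = head + "/"
    # Scheme and host are case-insensitive; the fragment is dropped.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def inconsistent_links(links):
    """Group links by canonical form and report pages linked in more than one way."""
    groups = {}
    for link in links:
        groups.setdefault(canonical_url(link), set()).add(link)
    return {
        canon: sorted(variants)
        for canon, variants in groups.items()
        if len(variants) > 1
    }


# Hypothetical list: the start URL plus the links found in the site's pages.
links = [
    "http://www.blah.com/",
    "http://www.blah.com/index.html",
    "http://www.blah.com/about.html",
]
report = inconsistent_links(links)
```

Here `report` flags the homepage as being reachable under two spellings. Picking one of them and using it everywhere, including as the start URL, removes the duplicate at the source.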
Anonymous
06-06-2005, 02:04 PM
Thanks for the clarification. That did it!
I wasn't referring to the "index" file within my HTML via both methods (only the blah.com/index.html way), but in Zoom I was telling it to start indexing at blah.com/. Updating the Zoom file to start indexing at blah.com/index.html fixed it.