Hi guys
We are getting duplicates in our Zoom search results.
The URLs are almost exactly the same:
http://www.example.com/test.htm
and
http://www.example.com/test.htm/
When I compare the source code of both pages, they are identical except for the URL (as above) and the Facebook iframe, whose link differs only in whether the trailing slash appears, e.g.:
<iframe id="facebook_like" src="http://www.facebook.com/plugins/like.php?href=http%3A%2F%2Fwww.example.com/test.htm/" ></iframe>
So I wrapped the iframe in Zoom tags, e.g.:
<!--ZOOMSTOP-->
<!--ZOOMSTOPFOLLOW-->
<iframe id="facebook_like" src="http://www.facebook.com/plugins/like.php?href=http%3A%2F%2Fwww.example.com/test.htm/" ></iframe>
<!--ZOOMRESTARTFOLLOW-->
<!--ZOOMRESTART-->
Is this enough to make sure Zoom doesn't open the iframe content?
Shouldn't Zoom disregard the iframe content anyway, since it is on a different domain (facebook.com)?
What else can I do?
Several things of note:
1) Zoom would not be following that facebook.com link unless you:
(a) have multiple start points, one of which is facebook.com;
(b) have multiple base URLs for the start point, which allows facebook.com to be considered part of the same start point; or
(c) have set the spidering options for the start point to "Index page and follow internal and external links" (after clicking on "More"->"Edit").
If none of the above applies, then I'd suspect that another link somewhere on your site points to that URL, rather than the Facebook link.
2) Technically, the following are two very different URLs:
http://www.example.com/test.htm
http://www.example.com/test.htm/
The latter is, in fact, a directory named "test.htm". However, a web server can be configured to rewrite URLs and automatically look for a matching filename, ignoring the fact that a folder was actually requested. When this happens, the server compensates silently: the client (i.e. the browser or, in this case, the spider) is none the wiser and is given no clue that the two were treated as the same URL.
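To make the difference concrete, here is a minimal Python sketch (not anything Zoom does internally). It shows that the two URLs parse to different paths, and how a list of links could be cleaned up before publishing. The helper name `strip_file_trailing_slash` and its "last segment contains a dot" rule are assumptions made up for this illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_file_trailing_slash(url):
    """Hypothetical helper: drop a trailing slash when the last path
    segment looks like a file name (contains a dot), e.g. "test.htm/"."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/") and len(path) > 1:
        # Last non-empty segment of the path, e.g. "test.htm" or "docs".
        last_segment = path.rstrip("/").rsplit("/", 1)[-1]
        if "." in last_segment:
            path = path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))

a = "http://www.example.com/test.htm"
b = "http://www.example.com/test.htm/"

print(urlsplit(a).path)  # /test.htm
print(urlsplit(b).path)  # /test.htm/
print(strip_file_trailing_slash(b) == a)  # True
```

Real directory URLs such as http://www.example.com/docs/ are left alone by this rule, since "docs" contains no dot. A directory whose name does contain a dot would be mishandled, so treat it as a rough heuristic only.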
Having said that, what Zoom can do is look at the page content and decide if it is truly a duplicate page, and reject it if so. This setting can be found under "Configure"->"Scan options"->"Use CRC to skip files with identical content".
Note that the pages must be completely identical for this to work, so if a page contains anything dynamic (e.g. the current date and time printed at the top of the page, or advertising), it will not be recognized as identical.
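As a rough illustration of why a single changing byte defeats this kind of check, here is a small Python sketch using the standard `zlib.crc32` function. It is only an analogy: the exact checksum Zoom computes, and over which part of the page, is not shown here, and the sample pages and the `content_crc` helper are made up.

```python
import zlib

def content_crc(page_bytes):
    """Return an unsigned 32-bit CRC of the raw page content."""
    return zlib.crc32(page_bytes) & 0xFFFFFFFF

# Two fetches of the "same" page, one second apart (made-up content).
page_a = b"<html><body><p>Hello</p><!-- 12:00:01 --></body></html>"
page_b = b"<html><body><p>Hello</p><!-- 12:00:01 --></body></html>"
page_c = b"<html><body><p>Hello</p><!-- 12:00:02 --></body></html>"

print(content_crc(page_a) == content_crc(page_b))  # True: byte-identical
print(content_crc(page_a) == content_crc(page_c))  # False: one byte differs
```

The pages look the same in a browser, but the one-character timestamp change is enough to produce a different checksum, so the second copy would not be skipped.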
Thanks for the quick response, Ray.
"Use CRC" is selected and works; I have a test page set up to check this.
So there must be something that is not identical in these pages.
When I check them they look identical, but maybe they aren't identical at the time the indexing is run.
Any changing content is either inside ZOOMSTOP tags or is written to the page using JavaScript document.write commands.
Would it be OK if I PM you some links?
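In case it helps anyone checking the same thing, below is a minimal Python sketch for comparing two saved copies of the page source line by line, using the standard `difflib` module. The function name `source_diff` and the sample page content are made up for illustration. In practice the two strings would be the raw HTML saved from each URL, before any JavaScript runs.

```python
import difflib

def source_diff(source_a, source_b):
    """Return unified-diff lines between two page sources (empty if identical)."""
    return list(difflib.unified_diff(
        source_a.splitlines(), source_b.splitlines(),
        fromfile="test.htm", tofile="test.htm/", lineterm=""))

# Made-up sources standing in for the two saved pages.
page_a = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>"
page_b = "<html>\n<body>\n<p>Hello</p>\n<!-- served 12:00:02 -->\n</body>\n</html>"

for line in source_diff(page_a, page_b):
    print(line)
```

An empty result means the two sources are identical line for line. Any line starting with "+" or "-" points at the content that differs between the two copies.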