Duplicate Results


  • Duplicate Results

    Having recently installed Zoom on my company intranet, I've noticed a rather strange occurrence. Some pages are coming up in the results twice, in the following two situations:

    1) If a page is accessible with different casing in the link, it seems to get indexed twice.

    e.g. if a page is linked to as follows:
    /Staff_Handbook/Mobile_phones.aspx

    or

    /staff_handbook/mobile_phones.aspx

    then both entries will appear in the results. Shouldn't Zoom be able to work out that these two pages are actually the same page?

    2) Sometimes the default folder document (default.asp, default.aspx, index.html, etc.) will be indexed twice.

    e.g. "/staff_handbook/" and "/staff_handbook/default.aspx" will both be indexed as separate pages.


    Anyone know if I'm missing something here? I know that ideally all links would be in the same case, and every default document would be linked to using either "/" or "/default.aspx", but unfortunately, in an environment where less techie people are responsible for updating content, this isn't always easy to enforce.

  • #2
    Can't really blame Zoom; it follows the links and the server returns a page. It's been an issue with IIS since day one, whereas other web servers treat page names as case sensitive.

    Have you tried the "Duplicate page detection" option on the Configure | Scan Options tab?

    I copied the following from the ZoomSearch help.

    Duplicate page detection
    Checking this option enables the use of CRC-32 signatures to ensure that only pages with unique content are indexed. This is particularly useful for spidering websites with links to pages without a filename, for instance, to a directory (eg: http://mywebsite.com/). These links may otherwise be indexed twice if there is another link somewhere else on the website which points to the same place, but with the actual filename specified (such as http://mywebsite.com/index.html, http://mywebsite.com/home.htm, etc.). It is best to avoid this on your website and use a consistent linking method. However, you can also prevent this by turning on this option.
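To illustrate the idea in the help text, here is a minimal Python sketch of content-based duplicate detection with CRC-32 signatures. It is not Zoom's actual implementation; the function names and the `(url, content)` input format are assumptions made for the example.

```python
import zlib


def crc32_signature(content: bytes) -> int:
    """Return an unsigned 32-bit CRC of the page content."""
    return zlib.crc32(content) & 0xFFFFFFFF


def index_unique_pages(pages):
    """Keep only the first URL seen for each distinct content signature.

    `pages` is an iterable of (url, content_bytes) pairs (hypothetical format).
    """
    seen = {}      # signature -> first URL that produced it
    indexed = []   # URLs that would be added to the index
    for url, content in pages:
        sig = crc32_signature(content)
        if sig in seen:
            # Same content already indexed under another URL: skip it.
            continue
        seen[sig] = url
        indexed.append(url)
    return indexed
```

Under this approach, "/staff_handbook/" and "/staff_handbook/default.aspx" would collapse to one entry only if the server returns byte-identical content for both. CRC-32 is a short checksum, so two different pages could in principle share a signature, although that is unlikely on a small intranet.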
    Mark Gallagher



    • #3
      As sizbut recommended, the CRC-32 option should do the trick.

      Note that URL paths are supposed to be case sensitive, according to the HTTP standard, and this is true for almost every web server except IIS. Because the Windows file system is not case sensitive, IIS just returns the same file regardless of the case. On a Linux server, for example, the two URLs above would be treated as two different files.
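If you know the whole site is served by IIS, another way to think about the problem is to normalize URLs before comparing them. The Python sketch below is only an illustration of that idea, not a Zoom feature; `normalize_url` and the `DEFAULT_DOCUMENTS` list are hypothetical, and lowercasing paths would be wrong on a case-sensitive server.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed list of default documents; adjust to match the server's settings.
DEFAULT_DOCUMENTS = ("default.asp", "default.aspx", "index.html")


def normalize_url(url: str) -> str:
    """Lowercase the host and path, and strip a trailing default document."""
    parts = urlsplit(url)
    path = parts.path.lower()
    head, _, tail = path.rpartition("/")
    if tail in DEFAULT_DOCUMENTS:
        # "/staff_handbook/default.aspx" becomes "/staff_handbook/"
        path = head + "/"
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), path, parts.query, parts.fragment)
    )
```

With this, "/Staff_Handbook/Mobile_phones.aspx" and "/staff_handbook/mobile_phones.aspx" map to the same string, as do "/staff_handbook/" and "/staff_handbook/default.aspx".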
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine
