//index.php - Site indexed twice with double slashes

  • #1

    Hi.

    I just purchased Zoom Search Engine and absolutely love it!

    One thing I noticed, though, is that the indexer indexes all my pages twice: once as http://www.mydomain.com/index.php?id=xx and once as http://www.mydomain.com//index.php?id=xx (note the double slash). I am using the Etomite CMS for my website.

    On the one hand I'd like to know why the indexer scans my site twice, and on the other hand I'd prefer not to have my pages indexed as //index.php pages (for uniformity of the links).

    Any ideas?

    Thanks in advance!

  • #2
    Zoom follows all the links it can find. Somewhere on your site you probably have a bad link, maybe just a typo, and this bad link will have a double slash (//) in it. It might even be a bug in the Etomite CMS software.

    Once the bad link is encountered, it is followed, and lots of new links will then be generated by your site with double slashes. In the end your entire site will be indexed twice.

    We have even seen cases where the site gets into an infinite loop:
    First pass:
    http://www.mydomain.com/index.php
    2nd pass:
    http://www.mydomain.com//index.php
    3rd pass:
    http://www.mydomain.com///index.php
    ....
    10th pass:
    http://www.mydomain.com//////////index.php
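
    Standard URL resolution is what keeps this going: relative links resolve against the double-slash base path, so the crawler never escapes the duplicate copy of the site. Here is a minimal Python sketch of that behaviour (the URLs are the example ones from above, not real pages):

        from urllib.parse import urljoin

        # One bad absolute link lands the crawler on a double-slash URL.
        bad_page = "http://www.mydomain.com//index.php?id=2"

        # An ordinary relative link on that page resolves against the
        # "//" base path, so the doubled slash is preserved.
        print(urljoin(bad_page, "index.php?id=3"))
        # -> http://www.mydomain.com//index.php?id=3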

    There are two solutions:

    1) Turn on full logging in Zoom and find the bad link by looking through the log. (This is the best solution; see also the sketch after this list.)

    2) Hide the problem by adding
    .com//
    to your page skip list in Zoom.
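
    If you want to hunt the bad link down yourself, a small script can inspect a page's links directly. A rough Python sketch (find_double_slash_links and the naive href regex are my own, not part of Zoom, and the URL is just the example domain):

        import re
        import urllib.request
        from urllib.parse import urljoin, urlparse

        def find_double_slash_links(page_url):
            """Fetch one page and report any link whose path contains '//'."""
            html = urllib.request.urlopen(page_url).read().decode("utf-8", "replace")
            for href in re.findall(r'href="([^"]+)"', html, re.IGNORECASE):
                absolute = urljoin(page_url, href)
                if "//" in urlparse(absolute).path:
                    print(page_url, "-> suspect link:", href)

        find_double_slash_links("http://www.mydomain.com/index.php")

    Run it on whichever page the Zoom log shows as the first "//" hit, and the printed href is your bad link.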

    • #3
      Apparently, selecting "use CRC to skip files with identical content" also did the trick.

      Thanks anyway.

      • #4
        Yes, that would also fix the problem. I didn't suggest it because it is an inefficient solution: firstly, you are only hiding the problem rather than fixing it; secondly, the CRC filtering can only happen after the page is downloaded, so you are still downloading all pages twice.

        Filtering on the URL via the skip list is a more efficient way of hiding the problem.
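
        To make the efficiency difference concrete, here is a rough Python sketch of the two checks (not Zoom's actual code; should_index is a hypothetical helper, and the ".com//" test mirrors the skip-list entry suggested above):

            import urllib.request
            import zlib

            seen_crcs = set()

            def should_index(url):
                # Skip-list check: decided from the URL alone,
                # before anything is downloaded.
                if ".com//" in url:
                    return False

                # CRC check: the page body must be downloaded first,
                # so duplicates are skipped only after the download
                # cost has already been paid.
                body = urllib.request.urlopen(url).read()
                crc = zlib.crc32(body)
                if crc in seen_crcs:
                    return False
                seen_crcs.add(crc)
                return True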
