Spider not finding all broken links


  • Spider not finding all broken links

    Perhaps someone with a bit more experience can help me out.

    When I run the spider to index my intranet site, I get the following broken link output in the log (I've replaced a large chunk of characters with x's to preserve privacy):

    10:17:14 - Broken link found on page:
    http://xxx.xx.ca/opx/Pxx/Px_NorthXxx/xxxments/paxx%20xxund/paxx%20unddetails.htm

    10:17:21 - (Broken link URL is:
    http://xxxweb.xxx.xxx.xx.ca/opx/Pxx/Px_NorthXxx/common/Xxxulance.htm )

    I understand that the broken link URL is found on the page identified by the previous line in the log. No issues there: I fix the link on the server and all is good.

    However, when I run the spider again, the first broken link isn't found anymore (good), but the spider finds another broken link on the server. This new broken link is similar to the first (same name, same URL) - the only difference is that it is on a different page located in a different branch of the file tree.

    This happens every time I run the spider. It gets very tiresome fixing, running, fixing, running, when the spider should find and identify multiple occurrences of the same broken URL so I could fix them all at once.

    So, I guess my question is this. Why does the spider seem to stop after finding the first broken link, even when the same broken link URL exists somewhere else within the directory structure?

    Is there a setting I can change?

    Any help would be appreciated.

  • #2
    The obvious thing is to not create hundreds of broken pages in the first place.

    The second obvious thing would be to use page templates or a CMS, so that when things like links in menus get broken you only need to fix them once, in one place.

    But as this doesn't seem to apply in your case, I would suggest doing either of the following:

    1) A global search and replace across your HTML code (see the sketch after this list). There are many tools that can do this; Dreamweaver and UltraEdit are two we use.

    2) Putting in a server-side redirect. This is especially useful if external sites, which you don't control, are linking to your site. Again, it is a one-off change.
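
    As an illustration of option 1 (this is just a minimal sketch of the general approach, not a feature of any particular tool), a short script can rewrite one broken URL across a local copy of the site. The paths and URLs below are placeholders, not the actual ones from this thread:

        # Minimal sketch: replace one broken URL across every .htm/.html file
        # under a local copy of the site. SITE_ROOT, OLD_URL and NEW_URL are
        # placeholders and must be changed to match your own setup.
        from pathlib import Path

        SITE_ROOT = Path("/var/www/intranet")              # local copy of the site
        OLD_URL = "http://example.local/common/old.htm"    # broken link target
        NEW_URL = "http://example.local/common/fixed.htm"  # corrected target

        for page in SITE_ROOT.rglob("*.htm*"):
            text = page.read_text(encoding="utf-8", errors="ignore")
            if OLD_URL in text:
                page.write_text(text.replace(OLD_URL, NEW_URL), encoding="utf-8")
                print("Fixed", page)

    Run it against a backup first; a plain text replace like this does not understand HTML, it just swaps one string for another wherever it appears.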



    • #3
      Originally posted by wrensoft
      The obvious thing is to not create hundreds of broken pages in the first place
      Thanks for the idea; I don't think ANYONE would have ever thought of that <roll-eyes>

      We inherited this directory structure and don't have much choice at the moment but to use what has been handed to us.

      My original questions still stand. Is there a reason it stops after finding the first occurrence of a common broken link? Or is there a setting that I'm missing in the configuration set-up?



      • #4
        Originally posted by PilotJR70
        However, when I run the spider again, the first broken link isn't found anymore (good), but the spider finds another broken link on the server. This new broken link is similar to the first (same name, same URL) - the only difference is that it is on a different page located in a different branch of the file tree.
        ...

        So, I guess my question is this. Why does the spider seem to stop after finding the first broken link, even when the same broken link URL exists somewhere else within the directory structure?
        There are two main reasons:

        1) If the first broken link causes the spider to not crawl a sub-portion of the website, then it won't get to the branch containing the second broken link.

        For example, let's say you have a "/news/" section of the site which links to "/articles/index.html", but this link is broken, so the entire "/articles/" section of the site is not indexed. In that case, if there's a broken link somewhere in the /articles/ section, it won't be found until you fix that first link.

        2) Second, yes, Zoom will not reconsider a link once it has turned out to be broken the first time. It won't determine that it is a broken link again; it will just notice that it's a link it has already seen and bypass it. This saves indexing time, since it doesn't have to re-attempt the request every time a URL appears (some broken links actually lead to a timeout instead of a 404 error and can take up to a minute to complete).
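
        To illustrate (this is only a rough sketch of the general "seen URL" technique, not Zoom's actual code), a crawler that caches the result of every URL it has already checked will report a broken URL only against the first page that links to it. url_ok() below is a placeholder reachability check:

            # Rough sketch of a crawler's seen-URL cache (not Zoom's actual code).
            # Once a URL has been checked, later pages linking to it are skipped,
            # so a broken URL is only reported against the first page it was seen on.
            import urllib.error
            import urllib.request

            checked = {}  # url -> True if reachable, False if broken

            def url_ok(url):
                """Placeholder check: True if the URL responds without an error."""
                try:
                    urllib.request.urlopen(url, timeout=10)
                    return True
                except (urllib.error.URLError, ValueError):
                    return False

            def check_link(url, found_on_page):
                if url in checked:    # URL already attempted once: skip it entirely
                    return
                checked[url] = url_ok(url)
                if not checked[url]:  # only the first occurrence is ever reported
                    print("Broken link found on page:", found_on_page)
                    print("(Broken link URL is:", url, ")")

        Calling check_link() again with the same broken URL from a different page returns immediately, which is why only the first occurrence shows up in the log.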
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine

