Index links only on certain pages - no control over HTML


  • Index links only on certain pages - no control over HTML

    We've been using Zoom for years on our sites and it works great. The STOP and RESTART tags work very well for removing content we don't want indexed (menus, repetitive content, etc.).

    We are creating a new site that will index 3rd party content. It is our content, just hosted on another site that we will be linking to. The problem with this is that we don't have control over the HTML content. We can't put in the STOP and RESTART commands. Is there anything we can do otherwise?


    Also, the indexer finds all the pages on the site which is great. It will come across the Search Results pages and then index all the links from there. The problem is that it indexes the search listing page too. I don't want it to index that page, just follow the links. Since we don't have control over the content, I can't set this as a new start point with the "follow links" option. The main starting point just finds the search pages and goes through them.

    If I set this page to be a Skipped page, the indexer won't follow its links, so the linked pages don't get indexed at all.

    Thanks,
    Shawn

  • #2
    If you can add a robots meta tag to the page, then you could have:

    <meta name="robots" content="noindex" />

    And the page content will not be indexed, but the links will be followed. Likewise you can have "nofollow" to skip links, or combine the two "noindex,nofollow".

    You'll have to enable robots support under "Configure"->"Spider options".
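
To make the noindex/nofollow behaviour concrete, here is a minimal Python sketch of how a crawler might read the robots meta tag and decide whether to index a page and whether to follow its links. This is not Zoom's actual code; the class `RobotsMetaParser` and the function `robots_policy` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return (index_page, follow_links) for a page's HTML source.

    Hypothetical helper: a page is indexed unless it says "noindex",
    and its links are followed unless it says "nofollow".
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    index_page = "noindex" not in parser.directives
    follow_links = "nofollow" not in parser.directives
    return index_page, follow_links
```

With this sketch, a search listing page carrying `content="noindex"` would be left out of the index while its links are still queued for crawling.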

    The other way is to create start points with different indexing options like "Follow only" as you mentioned. I'm not quite sure why you can't do this according to your description of the situation. This should work even if you can't modify the content. Just make sure to specify the start point for these pages earlier than any other start points where it may overlap the same page. It will be skipped if it has already been indexed, and the start points are indexed sequentially.
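
The effect of start point order can be sketched with a toy crawler in Python. This is only a model under stated assumptions, not how Zoom is implemented: the function `crawl`, the `links` dictionary standing in for real page fetches, and the rule that "follow only" applies just to the start URL itself are all hypothetical.

```python
from collections import deque


def crawl(start_points, links):
    """Toy crawler illustrating sequential start points.

    start_points: list of (url, follow_only) pairs, processed in order.
    links: dict mapping a URL to the URLs it links to (stand-in for fetching).
    Assumption: follow_only affects only the start URL; pages reached
    through links are indexed normally.
    Returns the list of indexed URLs in the order they were indexed.
    """
    visited = set()
    indexed = []
    for start_url, follow_only in start_points:
        queue = deque([(start_url, follow_only)])
        while queue:
            url, skip_indexing = queue.popleft()
            if url in visited:
                # Already handled by an earlier start point or link.
                continue
            visited.add(url)
            if not skip_indexing:
                indexed.append(url)
            for target in links.get(url, []):
                queue.append((target, False))
    return indexed


# Hypothetical site: the home page links to a search listing page.
site = {"home": ["search", "about"], "search": ["doc1", "doc2"]}

# Follow-only start point listed first: "search" is never indexed.
listing_first = crawl([("search", True), ("home", False)], site)

# Follow-only start point listed last: "search" was already indexed via "home".
listing_last = crawl([("home", False), ("search", True)], site)
```

In this model, `listing_first` contains the documents but not the listing page, while `listing_last` contains the listing page because it was reached from the home page before its own start point was processed.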

    There is also the Content Filter ("Configure"->"Filtering") which can exclude pages based on any text found on the page. Note that this applies to the raw HTML source code, so you can specify any unique HTML code that will identify these pages.

    For example, a content filter entry of:
    -<title>My 3rd party site

    will filter out pages whose title tag begins with the text above. Note that the "minus" sign at the start of the line is important.
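
As a rough illustration of matching against the raw HTML source, the following Python sketch models only the minus-prefixed entries described above. It is not Zoom's implementation: the function `is_excluded` is a hypothetical name, and plain case-sensitive substring matching is an assumption.

```python
def is_excluded(raw_html, filter_entries):
    """Return True if any minus-prefixed entry occurs in the raw HTML.

    Assumptions: entries are matched as plain, case-sensitive substrings
    of the page source, and only entries starting with "-" are modelled.
    """
    for entry in filter_entries:
        # A lone "-" has no text to match, so it is ignored.
        if entry.startswith("-") and len(entry) > 1:
            if entry[1:] in raw_html:
                return True
    return False


filters = ["-<title>My 3rd party site"]
listing_page = "<html><head><title>My 3rd party site - Search</title></head></html>"
other_page = "<html><head><title>Product manual</title></head></html>"
```

Here `is_excluded(listing_page, filters)` is true because the title tag in the source starts with the filter text, while `other_page` passes through.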
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine
