Partly ignores robots.txt?


  • Partly ignores robots.txt?

    I can't seem to get Zoom Search to exclude specific files via robots.txt if they have typical query variables as part of the URL. Robots.txt works fine for directories or for a filename without variables. I'm trying to tune robots.txt for my forum (VB, the same one you use), and while the exclusion can be done in the Zoom Search configuration (which I've confirmed works), this might indicate a bug?

    For example, looking at the log:

    Zoom Search indexes this file (the actual line is much longer, but you get the idea; the thread editor doesn't allow the full line):
    http://www.mysite.com/forums/search.php?f=8

    The robots.txt file (confirmed uploaded with these entries; the forum is under /forums):
    User-agent: *
    Disallow: /forums/search.php
    Disallow: /forums/calendar.php
    ....

    The log shows that it reads the robots.txt file fine, and that it excludes other files and directories listed in robots.txt.

    I have also set the General configuration option "Reload all files (do not use cache)".

    Where it becomes really awful is watching it attempt to index the calendar in the VB forum, where I'm guessing it keeps following the links on the calendar page through prior and future months without end. It's the same problem as above: robots.txt excludes the calendar, but Zoom seems to process it when one or more variables are passed in the URL. I have to stop it manually, as I'm not sure how long it might run.

    Perhaps you will see the same issue with the Wrensoft site since you're using the same forum system.

  • #2
    We could not reproduce the problem you describe. A robots.txt disallow of "search.php" will exclude "search.php?f=8" or similar pages.
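    As an illustration of the prefix matching described above (a sketch using Python's standard urllib.robotparser, not Zoom's own code), the two Disallow lines from the original post also cover the same paths with a query string attached. The calendar.php parameters and the index.php URL below are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the original post (the "...." elision is left out).
ROBOTS_TXT = """\
User-agent: *
Disallow: /forums/search.php
Disallow: /forums/calendar.php
"""


def is_allowed(url: str, robots_txt: str = ROBOTS_TXT, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    urls = [
        "http://www.mysite.com/forums/search.php?f=8",
        # Hypothetical query parameters, for illustration only.
        "http://www.mysite.com/forums/calendar.php?month=1&year=2008",
        # Hypothetical page that no rule covers.
        "http://www.mysite.com/forums/index.php",
    ]
    for url in urls:
        print(url, "allowed" if is_allowed(url) else "disallowed")
```

    A Disallow value is matched as a path prefix, so a rule for /forums/search.php applies to /forums/search.php?f=8 as well.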

    However, in looking at this issue, we did discover a bug which might be the actual cause of your problem. Zoom is currently looking for the "robots.txt" file at the base URL, as opposed to the root of the domain.

    While there is no formal specification for the "robots.txt" file, the general consensus seems to be that it should be located in the root folder of the domain, that is:
    http://www.mysite.com/robots.txt

    If Zoom is given a start point at the root of the domain (e.g. you start spidering from http://www.mysite.com/index.html), then this is not a problem, and it uses the robots.txt file mentioned above. But if Zoom is given a start point that is one or more folders deep, it will mistakenly look for the robots.txt file in that folder.

    So for example, given a start point of http://www.mysite.com/forums/index.php

    It will currently look for (and use) the robots.txt file at:
    http://www.mysite.com/forums/robots.txt

    This is incorrect according to robotstxt.org (the original specs are somewhat more vague and ambiguous). It also means that Zoom is potentially not finding the "correct" robots.txt file, especially if you have multiple (invalid) robots.txt files located in your other folders.

    You mentioned that the log confirms a robots.txt file was found, but I wonder whether you actually have more than one robots.txt file, one of which is invalid and sits in a folder other than the root, and Zoom is using that one instead. This might explain why it is not behaving as you expect.
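    To make the difference concrete, here is a small sketch (Python standard library only, not Zoom's implementation) of the two lookups for the same start point. The function names are hypothetical and chosen for this example.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def robots_url_at_root(start_url: str) -> str:
    """robots.txt at the root of the domain, whatever the start folder is."""
    parts = urlsplit(start_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def robots_url_at_base(start_url: str) -> str:
    """robots.txt resolved relative to the start URL's folder (the behaviour described as a bug)."""
    return urljoin(start_url, "robots.txt")


if __name__ == "__main__":
    start = "http://www.mysite.com/forums/index.php"
    print(robots_url_at_root(start))  # http://www.mysite.com/robots.txt
    print(robots_url_at_base(start))  # http://www.mysite.com/forums/robots.txt
```

    For a start point at the root, such as http://www.mysite.com/index.html, both functions return the same URL, which is why the problem only shows up when spidering starts one or more folders deep.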

    We will address this issue in our next build (5.1.1014) and change it so that Zoom only picks up the robots.txt file at the root level of the domain.

    If you are sure that Zoom is using the correct robots.txt file in your scenario, and you still believe that it is failing to skip the specified disallows, can you provide us with the actual URL of your website so that we can investigate further?
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine
