Can not index sub-directory

  • Can not index sub-directory

    Zoom Search
    7.1 build 1002
    using PHP

    I have a problem: I cannot index ANYTHING beyond the first page of my site.
    The program works as it should (I think) when my files are in the root of the website!

    After much frustration, thinking I had set the configuration wrong,
    I decided to keep it simple:

    1st
    Create a website www.test.com
    add a directory called "search" to upload the results of indexing -- "www.test.com/search"

    2nd
    I put 5 PDFA files in the root of the www.test.com site.
    Select to ONLY index PDF extensions
    set start point to www.test.com
    Start indexing, finish, and auto-upload to the "search" directory,
    executed "search.php" and completed a text search properly -- (www.test.com/search/search.php)

    3rd
    I now create a directory called "cnl" -- "www.test.com/cnl/"
    I moved the 5 PDFA files from the root to the new directory "cnl"
    I change the configuration to start at "www.test.com/cnl/"
    The process FAILS with "could not download file http://test.com/cnl/ (Forbidden)"

    4th
    I also tried without the trailing slash "www.test.com/cnl" -- FAILED
    I have tried various permissions on the site, the directories, and files, all the way to "777" - no difference

    last....
    I move the files back to the root and change the start point to the root again and all works..????

    SO what am I missing????
    Anne

  • #2
    We replied to the email support ticket for this same question, so I will paste my response below.

    The key thing to understand is if you are using "Spider Mode", then the Zoom Indexer will only be able to see or access any URLs that are actually accessible over the web. In other words, they have to be accessible with your browser.

    "Spider Mode" works by crawling web pages -- similar to how you would load the "Start spider URL" page in your browser, click on each link that you can find, follow that to the next page, click all the links on that page, and so forth. This means that if your website is not yet set up so that you can do the same from your browser to find all the files you are expecting to index, then the indexer will not find those files.
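
    As a rough illustration of the crawling behaviour described above, here is a minimal Python sketch. It is not Zoom's actual code: the `SITE` dictionary, the file names and the `crawl` function are all hypothetical, and the "website" is just an in-memory map from each URL to the links on that page. It shows why a file that exists on the server but is not linked from anywhere is never found.

```python
from collections import deque

# Hypothetical site: each URL maps to the links found on that page.
SITE = {
    "http://www.test.com/": [
        "http://www.test.com/a.pdf",
        "http://www.test.com/b.pdf",
    ],
    "http://www.test.com/a.pdf": [],
    "http://www.test.com/b.pdf": [],
    # This file exists on the server, but no page links to it.
    "http://www.test.com/cnl/c.pdf": [],
}

def crawl(start_url, site):
    """Breadth-first crawl: return every URL reachable by following links."""
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for link in site.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

found = crawl("http://www.test.com/", SITE)
```

    In this sketch `found` contains the start page and the two linked PDF files, but not `cnl/c.pdf`, because nothing links to it.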

    Without seeing your actual web site setup (since you didn't give me a real URL, and you imply you've changed the folders and web pages since your testing), there are many possible reasons for what you are seeing.

    But I can guide you to better deduce what's happening and why.

    When you were getting the error that a particular URL returns "Could not download file... Forbidden", you can try that exact same URL in your browser and see what you get. You should see a very similar error from your browser. This means there is a web site configuration issue, or your URLs or links are simply incorrect.
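
    The same check can be scripted instead of done by hand in the browser. The following is a small sketch using only Python's standard library; the function names and the wording of the hints are our own assumptions, not anything produced by Zoom.

```python
import urllib.error
import urllib.request

def http_status(url, timeout=10):
    """Return the HTTP status code the server sends back for url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is on the exception.
        return err.code

def explain(status):
    """Map a status code to a short hint (wording is ours, not Zoom's)."""
    hints = {
        200: "OK: the URL is reachable, so a spider can fetch it.",
        403: "Forbidden: the server refuses this URL; for a folder URL this "
             "often means no index page and directory listing disabled.",
        404: "Not Found: the URL or the link pointing to it is wrong.",
    }
    return hints.get(
        status, "Unexpected status %d: check the server configuration." % status
    )
```

    For example, `explain(http_status("http://www.test.com/cnl/"))` would report what the server returns for the start URL from the original post.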

    I would suggest the following:

    1) Consider using OFFLINE MODE, which allows you to index the local folders of files you wish to index and avoids all issues with web site permissions, etc. Make sure to specify the correct Base URL so that the links will be correct when the index goes online. The Base URL would be the folder where you are uploading the files.

    For example, if you have all your PDF files locally at:
    C:\MyWebsite\MyPDFFiles\

    Then the above would be your Offline Mode "Start Folder" setting.

    And your Base URL would be the location you are uploading the files to, e.g. if it's the "cnl" subfolder, then your base URL could be:
    http://www.test.com/cnl/

    2) If you wish to persist with SPIDER MODE, make sure you have set up your website so that you can actually navigate to all the files you wish to index. Either by creating links and web pages, or by enabling "directory listing".
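
    To make the Offline Mode example in suggestion 1 concrete, here is a sketch of how a local file under the "Start Folder" would correspond to a link under the Base URL. The `local_to_url` helper and the file name `newsletter.pdf` are hypothetical and only illustrate the mapping; this is not how Zoom is implemented internally.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote

def local_to_url(local_path, start_folder, base_url):
    """Map a file under start_folder to the matching URL under base_url."""
    # Path of the file relative to the Start Folder, e.g. "newsletter.pdf".
    relative = PureWindowsPath(local_path).relative_to(PureWindowsPath(start_folder))
    # Use forward slashes and percent-encode characters such as spaces.
    return base_url.rstrip("/") + "/" + quote(relative.as_posix())

url = local_to_url(
    r"C:\MyWebsite\MyPDFFiles\newsletter.pdf",  # hypothetical file name
    r"C:\MyWebsite\MyPDFFiles",                 # Offline Mode "Start Folder"
    "http://www.test.com/cnl/",                 # Base URL
)
```

    Here `url` is `http://www.test.com/cnl/newsletter.pdf`. If the Base URL were left pointing at the root instead, the same file would be linked as `http://www.test.com/newsletter.pdf`, which would be a broken link once the files are uploaded to "cnl".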

    Hope that helps.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine

    • #3
      As I said before I do NOT have any problem with navigating to the files with a browser.
      Since our last communication I have set up a special simple test bed to illustrate my problem.
      I have sent the information to you in an email yesterday but I have not heard anything..
      The test bed is nothing more than an index.html page with an HTML menu linking to another HTML menu residing in a subdirectory. The second HTML menu, in the subdirectory, points to all the files. The browser has no problem finding the files, but Zoom fails.

      As to your comment about permissions.. I set all the permissions to 777 just to make sure.

      Also, I can take the menu in the sub directory ( the one that points to all the files) and move it to the root, change its name to index.html and zoom will now index the files. This is NOT a solution, just a test to show indexing will occur EXCEPT when the files are anywhere other than the root !

      To me that says everything is working EXCEPT the spider function. ???? But I am just guessing...
      Anne

      • #4
        My last reply back to you was as follows. Please check your spam box if you did not see it. I think this is the 2nd time we've sent an email to you and you did not appear to have received it.

        Subject: Re: INDEXING problems
        Date: Wed, 01 Jun 2016 17:54:53 +1000

        Hi Anne,

        I just directed my indexer to [your website here]

        It found all the PDF files under the "CNL" link, with the exception of a broken link or two (one of which goes to "327_Jan_05pdf" with a missing dot, and another to "044_Jul_80.pdf" which doesn't exist on the server). At time of writing, it has indexed 149+ PDF files on your server.

        One thing I noted is that you now require the crawler to follow a link to "/cnl/cnl.html", which means you have to make sure ".html" files are being indexed. If you only have ".pdf" files selected, then this link will not be followed.

        So make sure you have both ".html" and ".pdf" files in your extensions list and try again.
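
        The effect of the extensions list can be illustrated with a small sketch. This is a toy model under stated assumptions, not Zoom's implementation: the in-memory `SITE`, the PDF file names and the `crawl` function are hypothetical, and we assume the start URL is always fetched while every other link is only followed if its extension is in the list.

```python
from collections import deque
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical site: the root links to cnl.html, which links to the PDFs.
SITE = {
    "http://www.test.com/": ["http://www.test.com/cnl/cnl.html"],
    "http://www.test.com/cnl/cnl.html": [
        "http://www.test.com/cnl/001.pdf",
        "http://www.test.com/cnl/002.pdf",
    ],
    "http://www.test.com/cnl/001.pdf": [],
    "http://www.test.com/cnl/002.pdf": [],
}

def crawl(start_url, site, extensions):
    """Follow only links whose file extension is in `extensions`."""
    indexed = []
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url != start_url:
            indexed.append(url)
        for link in site.get(url, []):
            ext = splitext(urlparse(link).path)[1].lower()
            # Links with other extensions are skipped, so pages behind them
            # are never visited.
            if ext in extensions and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(indexed)

pdf_only = crawl("http://www.test.com/", SITE, {".pdf"})
html_and_pdf = crawl("http://www.test.com/", SITE, {".html", ".pdf"})
```

        With only ".pdf" in the list, `pdf_only` is empty because the link to "cnl.html" is never followed. With both extensions, `html_and_pdf` contains "cnl.html" and both PDF files.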

        If you are still having trouble, send us your .zcfg configuration file and we can take a look at what is wrong with your configuration.
        If you have sent a reply to this above email and we didn't receive it, let us know.

        Originally posted by Annie Sixgun:
        "As I said before I do NOT have any problem with navigating to the files with a browser."

        For clarity's sake, I will try to summarize the situation as it occurred. You have changed the site a number of times and in different ways, so the problem has changed as well, with each change you make.

        1) When I looked at the problem on May 30th, your test site only had 6 links to PDF files at the root level. I am 100% sure that there were no links to the "cnl" subfolder on this date. That was why Zoom could not find the "cnl" folder or attempt any of the files there.

        At this point, you could not browse to any of the CNL files and folders, so this was a valid problem at this date.

        2) On May 31st, you then changed the website and added links to "cnl/cnl.html" which had links to all the files. You told me it still "didn't work".

        3) On that same day, I tested the index here with default settings and it worked, picking up all the PDF files in the cnl folder. As I explained in the above email, one possibility why it didn't work for you, might be that you only have ".pdf" files selected for indexing. You need to have ".html" in your Scan Extensions list, for the spider to follow that "cnl.html" file which had the links to the PDF files. Please check this.

        I hope that's clearer now.

        It appears you have since removed the website and pages again, so make sure you can restore it to its previous state in order to proceed with working out what the problem is.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine
