Indexing a non-spidery website


  • Indexing a non-spidery website

    Assume I have 15,000 PDF files organized in 300 subdirectories at various levels under a common parent directory. I wish to use Zoom to search these files. Due to a complete reorganization of the website, none of the files are accessible via links in other documents, so traditional spidering will not work.

    I could download the entire content to my local machine and use offline indexing, but that sounds like a lot of work with potentially problematic results if I don't properly maintain relative URLs, and I'd have to repeat the process to update the index whenever I add new content to the website.

    Two possibilities I have wondered about: First, I have installed a webdisk on my local machine that allows access to the entire content of my website; is it at all possible to run offline indexing through the webdisk? (I realize the webdisk is a virtual, not physical, disk, so there is probably no way for that to work.) Second, is there a program that will comprehensively scan this part of my website and create a document listing the filenames with full URLs, which I could use as the starting page for spider indexing?

    Any other ideas? I have been circling Zoom for about the last three years, wanting to find the time to implement a search solution and the confidence that it will suit my needs.

    A side question: why doesn't Zoom have an online indexing option that simply scans a directory and its subdirectories?

    Thanks so much,
    Robin Miller (in the US)

  • #2
    Just as a backup, you should probably have a local copy of the files. What happens if the hosting company goes bankrupt overnight?

    An easy solution would be to turn on directory listing on your web site.
    If you are running the Apache web server, Google "apache directory listing".

    Otherwise there are also directory listing scripts, which will automatically create links to all the files. Again, try Googling "directory listing scripts". There are a lot of them around (a rough sketch follows below).

    I don't know anything about webdisk, but if it mounts the remote server as a drive letter, then it could potentially work as well.
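
    On Apache, directory listing is usually enabled with the Options +Indexes directive, either in the server configuration or in an .htaccess file. If you cannot (or would rather not) enable it, a listing script can generate equivalent pages. The following is only a rough Python sketch of the idea, run over a local mirror or a mounted webdisk copy of the tree; the DOCROOT path is a placeholder for wherever your tree actually lives:

        # gen_indexes.py: write an index.html into every directory of a tree,
        # linking to its files and subdirectories so a spider can recurse
        # through the whole structure, much like Apache's own listings.
        import os
        import html
        from urllib.parse import quote

        DOCROOT = "/mnt/webdisk/docs"  # placeholder: local or webdisk view of the tree

        for dirpath, dirnames, filenames in os.walk(DOCROOT):
            links = []
            for name in sorted(dirnames):
                # relative link into the subdirectory's own index page
                links.append('<a href="%s/index.html">%s/</a>' % (quote(name), html.escape(name)))
            for name in sorted(filenames):
                if name == "index.html":
                    continue  # don't let a listing link to itself
                links.append('<a href="%s">%s</a>' % (quote(name), html.escape(name)))
            with open(os.path.join(dirpath, "index.html"), "w", encoding="utf-8") as fh:
                fh.write("<html><body>\n%s\n</body></html>\n" % "<br>\n".join(links))

    After uploading the generated files (or running the script directly against the webdisk mount), pointing the spider at the top-level index.html lets it reach every file in the tree through ordinary links.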



    • #3

      Originally posted by wrensoft:
      Just as a backup, you should probably have a local copy of the files. What happens if the hosting company goes bankrupt overnight?
      Thank you!

      All the files are resident on my computer as well as on my backup service. However, they are not organized the same way as they are on my website, which I have set up in a very specific way to allow people to search a large number of categories. I could download a copy of everything on my website so as to maintain the website organization, but it seems like a lot of trouble, and I am not completely confident that I would get the URLs correct, so I am trying to find a way to use the spider indexing on the website.

      Originally posted by wrensoft:
      An easy solution would be to turn on directory listing on your web site. If you are running the Apache web server, Google "apache directory listing".

      Otherwise there are also directory listing scripts, which will automatically create links to all the files. Again, try Googling "directory listing scripts". There are a lot of them around.

      I've Googled both of these. I don't really understand it all yet, but maybe I will. If I can find a script that will create a list of all files in or under a specified parent directory, with the path included as part of each filename, then I can upload that list to the parent directory and use it as the starting point for the spider (a sketch of such a script appears below).

      My web hosting provider does run an Apache server, but enabling directory listings just makes directories visible, doesn't it? I don't see how that would help the spider indexing, because, as I understand it, the spider follows links in files; it doesn't access directory listings.

      Thanks again,
      Robin
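
      A rough sketch of the kind of script described above, assuming the tree is reachable locally (for example through the webdisk mount); LOCAL_ROOT and BASE_URL are placeholders, not details from this thread:

          # make_startpage.py: walk a local copy of the tree and emit a single
          # HTML page listing the full public URL of every file, suitable as
          # the start page for spider indexing.
          import os
          from urllib.parse import quote

          LOCAL_ROOT = "/mnt/webdisk/docs"          # placeholder: local view of the tree
          BASE_URL = "http://www.example.com/docs"  # placeholder: public URL of the same tree

          with open("startpage.html", "w", encoding="utf-8") as out:
              out.write("<html><body>\n")
              for dirpath, dirnames, filenames in os.walk(LOCAL_ROOT):
                  rel = os.path.relpath(dirpath, LOCAL_ROOT).replace(os.sep, "/")
                  prefix = BASE_URL if rel == "." else BASE_URL + "/" + quote(rel)
                  for name in sorted(filenames):
                      url = prefix + "/" + quote(name)
                      out.write('<a href="%s">%s</a><br>\n' % (url, url))
              out.write("</body></html>\n")

      Rerunning the script and re-uploading startpage.html after adding new files would keep the spider's starting point current.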



      • #4
        Originally posted by Robin:
        the spider follows links in files; it doesn't access directory listings.
        Ah, but that is the beauty of the solution: the listings are in fact HTML pages with normal HTML links in them.
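
        For example, this small Python sketch (the URL is a placeholder) fetches one listing page and prints every link on it, which is essentially what the spider does when it crawls a listing:

            # dump_links.py: fetch an Apache-style directory listing and print
            # the href of every anchor tag, i.e. the links a spider would follow.
            from html.parser import HTMLParser
            from urllib.request import urlopen

            class LinkDump(HTMLParser):
                def handle_starttag(self, tag, attrs):
                    if tag == "a":
                        for name, value in attrs:
                            if name == "href":
                                print(value)

            page = urlopen("http://www.example.com/docs/").read().decode("utf-8", "replace")
            LinkDump().feed(page)

        Note that Apache's generated listings also contain a "Parent Directory" link and column-sorting links (URLs with ?C=N;O=D style query strings), which you would normally configure the spider to skip.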
