  • Zoom on a Windows Intranet ??

    Greetings All !

    My workgroup at my day-job has a problem. We typically consult a Yahoo-like directory of links many, many times per day in our problem-solving work.

    This directory appears to the user to be a single page of expandable folders. It runs off of an Access database, where the links are stored. Now, while SOME of the links are URLs (especially links to files on the webserver that hosts this application)... many, many more are structured like this:

    file://servername/path/to/directory

    Now, I installed the trial version of Zoom onto our webserver. I tried running it, but it hits its file limit quickly, so I was never sure whether it would actually pick up paths like this. I also wondered whether it could handle directories. I mean, when we click on links like this, a Windows Explorer folder opens up in the main frame... showing subdirectories and files alike. Ideally, I'd like to get Zoom to scan everything in such directories, including subdirectories.

    Can it do something like this?

    I confess I set aside the trial version of Zoom a few weeks ago... and tried coming up with other solutions instead. I installed phpmysearch and tried pounding it into submission, but it too doesn't know what to do with paths that are structured like that... or like Windows UNC paths.

    One reason I set aside Zoom is that I couldn't confirm that it could do what we need it to do here without spending the money first. (I'm not just talking about the directory stuff here... I needed to prove to my team lead that it will scan Word docs, Excel spreadsheets, and PDFs.)

    Odds are, the money would come out of my own personal pocket. It's hard to explain, but this entire little project is limited to just our workgroup... and getting the company to pay for it will be very, very difficult.

    So, can Zoom handle paths like this? Is anybody out there currently using it on a corporate intranet?


    Thanks!!!

    -= Dave =-

  • #2
    UNC

    OK, I had totally forgotten that I had posted some of this back in early February. I guess our corporate firewall blocked this thread's "reply notification" from getting to me.

    So, I just learned from a reply to that previous post that, yes, if I use the Offline method and use Windows Universal Naming Convention paths (\\server\directory\etc), I should be able to do this. And it does appear that Zoom will "see" the directory contents appropriately.

    I've already written a script that dumps all the URLs from our Yahoo-style directory's database. I plan on adding some code to that script to have it convert all "file://" links to UNC links. I'll then simply tell Zoom to read that file.
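    For what it's worth, the conversion logic is just a prefix-and-slash swap. Here's a rough sketch of the idea (in Python; the exact link formats in our database may vary, so treat this as an illustration, not my actual script):

```python
def file_url_to_unc(url: str) -> str:
    r"""Convert a link like file://servername/path/to/directory
    into a UNC path like \\servername\path\to\directory.
    Anything that isn't a file:// link is passed through untouched."""
    if not url.startswith("file://"):
        return url  # leave http:// links and existing UNC paths alone
    rest = url[len("file://"):]            # "servername/path/to/directory"
    return "\\\\" + rest.replace("/", "\\")
```

    Then it's just a matter of looping over the dump file and writing the converted lines out to the text file that Zoom reads.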

    The only problem I can foresee if I use this method is that the search form itself will sometimes come back with this "url dump file" as an actual search result. And that's not my intention for that file. Not sure if there's a way I can make Zoom read the URLs out of it, but not index it?

    It's the end of my current work-shift. I'll have the next 5 days off, so I won't be able to test this for a while.

    I guess I'll try to come up with an extra $99 while I'm off-shift!

    I also need to read up on this concept of running Zoom as CGI. My searches with just some initial tests and only 1-2 dozen pages indexed are taking from 4 - 7 seconds.

    Thanks!

    -= Dave =-



    • #3
      Re: UNC

      Just to clarify for others reading this, here is your original post (and my response):
      http://wrensoft.com/forum/viewtopic.php?t=92

      Your description of the scenario has expanded since then, and is slightly more complicated. You mention that you are now indexing a list of URLs from an Access database in addition to the local folders that you wish to index.

      Zoom runs in either Offline mode or Spider mode. In offline mode, it would be able to index a folder structure and all files and subfolders under it (think of how an anti-virus application behaves when you ask it to scan a folder). However, it would not index files over the web (eg. http://www.somesite.com/index.html).

      Spider mode indexes files on websites, both local and remote servers. However, this does not index offline folders which are not being hosted by a web server (which was your initial requirement).

      I can see two possibilities:

      1) If you only need to index files on your local network, and your URLs are not remote websites, then use Offline mode. Note that this does not "crawl" a page of links, it simply takes a list of folders and scans them offline. If you have a local web server, you can just point it to the shared folder for the web pages (eg. "\\ourwebserver\Inetpub\wwwroot\site1\")

      2) Use spider mode, but you will have to set up your web server to host all the files and folders required. This means "web sharing" certain folders on the network, so that they would be accessible with a http://ourwebserver/path/to/directory type of URL.

      You should actually be able to test your setup fairly well with the Free Edition. You simply need to try smaller sets of files at any one time - instead of always using the root folder as the start point, try various locations and index sections of the site at a time.

      If you haven't already, take a look at our Users Guide, which explains spider and offline mode in more detail:
      http://www.wrensoft.com/zoom/usersguide.html

      Originally posted by DR4296
      The only problem I can foresee if I use this method is that the search form itself will sometimes come back with this "url dump file" as an actual search result. And that's not my intention for that file. Not sure if there's a way I can make Zoom read the URLs out of it, but not index it?
      This would only be necessary if you chose to use Spider mode. In which case, yes, you can specify a URL to "follow links" from, but not index. If it is a start point, you click on "More" in spider mode, "Edit", and then select "Follow links only". Alternatively, you can enclose the page's content in Zoom's exclusion tags so that it is skipped during indexing.

      In offline mode, this would not be necessary. Zoom does not crawl web pages in this mode, it only uses a list of directories which you enter into the Indexer manually (or import via a text file).

      Originally posted by DR4296
      I also need to read up on this concept of running Zoom as CGI. My searches with just some initial tests and only 1-2 dozen pages indexed are taking from 4 - 7 seconds.
      That's also unusual and we suspect there's something wrong. Which platform of the search script are you using at the moment? (PHP, ASP, Javascript?) How big (in filesize) are the files you are trying to index, and what types of files are they?
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine



      • #4
        Zoom at work

        Ray,

        Thanks for your reply.

        Let me save the speed question for last. First, back to how we need to use Zoom:

        Looks like I'm in a bit of a bind again, because, from what you're saying, I'd have to tell Zoom each and every folder to scan manually (in Offline mode) and it would therefore do zero link following... even if the links were internal to our network.

        Or, I could use Spider mode, but some of these servers on our internal network are not set up as web servers... (in other words, I CAN'T set up our webserver to host all of the files, due to the nature of our network here), so I can't pass a URL to Zoom in order to reference them.

        OK, back to the speed question. We're currently using it in PHP mode. I'll have to test it when I get back to work on Wednesday to see if it's still happening. I know I had a problem later in the day and had to cycle the webserver.

        As for the files I'm indexing, most are Word documents, so the free version isn't indexing those... which just leaves us with some HTML files. Nothing big.

        I should have more time to test this on Wednesday, but now I'm discouraged to learn that Offline mode alone won't work as a solution for me.

        Thanks!

        -= Dave =-



        • #5
          From my response to your original post:

          By selecting "Offline mode", you can specify a Start directory to index from. It will then index all files within that folder, as well as the content of all sub-folders. You can also specify multiple "start directories" by clicking on the "More" button.
          So no, offline mode does NOT follow links. You will need spider mode for that. However, you should not need to specify _every_ folder with offline mode, assuming some of your folders share a common root, and are not completely scattered everywhere.

          You said that you wanted to index the contents of folders with no webpages linking directly to documents (you simply get a "Windows Explorer window displaying the contents of the folder"). This would mean that if you actually relied on "following links", then this would NOT work for your site - since there are no links? So I'm a bit confused as to your seemingly conflicting requirements at the moment.

          The best I can decipher is that you have SOME files which you would like to index in a "spider mode" manner, with HTML links that should be followed, and SOME files which you would like to index in an "offline mode" manner. And you would want the spider to follow HTML links to these offline paths, where it would automatically switch to Offline mode and index "everything in this certain folder".

          If this is your requirement, then no, Zoom cannot do this, and I doubt there is much out there that would. It is unusual to require two different file-finding methods in the same session. There's also no reliable way to determine which mode to use automatically.

          However, I still think one of the two solutions that I posted above is possible. Especially the first one: using Offline mode and importing the list of folders. Once again, each folder you list will ensure that EVERY subfolder and file under it is indexed. You said that you had a directory link script which produced all the links already - you can modify it to output a simple list of directories as UNC paths in a text file, and import this file into Zoom.
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine



          • #6
            Offline vs. Online

            Originally posted by Ray

            You said that you wanted to index the contents of folders with no webpages linking directly to documents (you simply get a "Windows Explorer window displaying the contents of the folder"). This would mean that if you actually relied on "following links", then this would NOT work for your site - since there are no links? So I'm a bit confused as to your seemingly conflicting requirements at the moment.

            The best I can decipher is that you have SOME files which you would like to index in a "spider mode" manner, with HTML links that should be followed, and SOME files which you would like to index in an "offline mode" manner. And you would want the spider to follow HTML links to these offline paths, where it would automatically switch to Offline mode and index "everything in this certain folder".

            If this is your requirement, then no, Zoom cannot do this, and I doubt there is much out there that would. It is unusual to require two different file-finding methods in the same session. There's also no reliable way to determine which mode to use automatically.

            However, I still think one of the two solutions that I posted above is possible. Especially the first one: using Offline mode and importing the list of folders. Once again, each folder you list will ensure that EVERY subfolder and file under it is indexed. You said that you had a directory link script which produced all the links already - you can modify it to output a simple list of directories as UNC paths in a text file, and import this file into Zoom.

            Ray,

            You are mostly correct. Due to a lack of rules or any official system for adding content to this "master web page" of ours, all links within our system / database fall into two groups:

            1) Links to documents on Windows machines actually running webserver software (IIS). I'd estimate that 90% of these machines are internal to our intranet. I believe that 100% of these links are to exact documents. So, we'd need these links to be spidered, and any links within the pages (which will all be URLs, not UNC links) followed.

            2) Links to documents on Windows machines NOT running webserver software (in the form of UNC links or this weird "file://" style of links).
            I'd estimate that 30-40% of these links are to folders instead of specific single files. We'd need the contents of these folders and their subfolders indexed. And, preferably, we'd like to see that any URLs that are encountered within those documents are followed.

            I'm at my day-job now, so, hopefully, if things are slow here, I'll be able to experiment with Zoom today.

            I'm sorry this is such a convoluted pain-in-the-rear! The links of type #2 above seem to exist mostly because we contacted various groups outside of our area and asked them for support documentation for the software they are responsible for, plus some heavily-updated documents such as weekly on-call schedules. Most groups simply forwarded my predecessor a UNC link or a "file://" link to documentation they already had on their own workgroup's server. Since most of these "outside workgroups" are not running webservers, they could not send us HTTP URLs.

            And, of course, WE'RE running a webserver, but we can't ask them to change their processes and permanently move their documentation to our server.

            Thanks!

            -= Dave =-



            • #7
              Originally posted by Ray
              You said that you had a directory link script which produced all the links already - you can modify this to output a simple list of directories as UNC paths as a text file, and import this file into Zoom.

              Question: Is there some way I can quickly import this text file into Zoom for Offline mode scanning?


              -= Dave =-



              • #8
                UNC paths

                Well, I had had a bit of hope there for a few minutes.

                I decided to try converting as many URLs into UNC paths as possible. This was looking very good when I did an index.

                However, when Zoom's search script builds the output page, the URLs are all... just that... URLs. And therefore, since most of these servers are not running web server software, the links don't work.

                Bummer.

                -= Dave =-



                • #9
                  Getting there!

                  OK, I edited the search.php script and made a little alteration so that the URL gets reformatted:


                  if ($DisplayTitle == 1)
                  {
                      // Replaced 04/20/05: flip forward slashes to backslashes so the
                      // result link comes out as a UNC-style path. (str_replace is enough
                      // here, and avoids preg_replace's backslash-escaping pitfalls.)
                      //print "<a href=\"".rtrim($urls[$ipage])."\"" . $target . ">";
                      $thisurl = rtrim($urls[$ipage]);
                      $thisurl = str_replace("/", "\\", $thisurl);
                      print "<a href=\"" . $thisurl . "\"" . $target . ">";

                      if ($Highlighting == 1)
                          PrintHighlightDescription(rtrim($titles[$ipage]));
                      else
                          print rtrim($titles[$ipage]);
                      print "</a>";
                  }
                  else
                  {
                      // Replaced 04/20/05
                      //print "<a href=\"".rtrim($urls[$ipage])."\"" . $target . ">".rtrim($urls[$ipage])."</a>";
                      $thisurl = rtrim($urls[$ipage]);
                      $thisurl = str_replace("/", "\\", $thisurl);
                      print "<a href=\"" . $thisurl . "\"" . $target . ">" . $thisurl . "</a>";
                  }




                  And later...



                  if ($DisplayURL == 1)
                  {
                      if (strlen($info_str) > 0)
                          $info_str .= " - ";

                      // Replaced 04/20/05: same slash-to-backslash conversion as above.
                      //$info_str .= $STR_RESULT_URL . " ".rtrim($urls[$ipage]);
                      $thisurl = rtrim($urls[$ipage]);
                      $thisurl = str_replace("/", "\\", $thisurl);
                      $info_str .= " $thisurl ";
                  }


                  This gets the results formatted properly for us.

                  Of course, there are still a lot of regular old http:// style URLs that I still cannot index, but I think this may be a very good start.


                  -= Dave =-



                  • #10
                    performance

                    So, my PHP searches were taking anywhere from 5 to 10 seconds to complete.

                    I then decided to alter Zoom's configuration so as to use CGI mode. My CGI searches are all taking from 1-2 seconds.

                    I put some statements in the PHP version to try to determine where it's slowing down. It appears to be in two places: when it reads the template file (near the beginning of the script) and when it processes / draws the rows of results. In fact, I'm thinking it must be PHP's overall performance on this box that's so slow.

                    Not sure what to do about it. But I thought I'd mention it, because, while I could switch to the CGI version of the search script, I can't exactly alter its code so as to format the URLs the way I need them to be... like I can with the PHP version.


                    Thanks!

                    -= Dave =-



                    • #11
                      Originally posted by DR4296
                      Question: Is there some way I can quickly import this text file into Zoom for Offline mode scanning?
                      Yes, from the Offline mode tab, click on "More" and "Import".

                      Originally posted by DR4296
                      However, when Zoom's search script builds the output page, the URLs are all... just that... URLs. And therefore, since most of these servers are not running web server software, the links don't work.
                      The output URLs are dependent on the "Base URL" that you specify in the Offline mode tab. This allows you to change the way the "search result links" are formed. For example, you can have:

                      Start dir: \\server\myfiles\
                      Base URL: file://server/myfiles/

                      This will produce links in the form of: file://server/myfiles/mypage.html which should work fine without the need to modify the script or CGI.
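                      To spell out what happens with that pair of settings: each indexed file's path has the Start dir prefix stripped, its backslashes flipped to forward slashes, and the Base URL prepended. A sketch of the mapping (illustrative only, not Zoom's actual indexer code):

```python
def result_link(indexed_path: str, start_dir: str, base_url: str) -> str:
    """Illustrative sketch: how an Offline mode search-result link is
    formed from an indexed file's path, the Start dir, and the Base URL."""
    assert indexed_path.startswith(start_dir), "file must be under the Start dir"
    relative = indexed_path[len(start_dir):].replace("\\", "/")
    return base_url + relative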

                      I'm not sure why PHP is taking that long to return results, it depends on many things, such as server load, number of pages etc. But 5-10 seconds seems to indicate something else is wrong. You might want to zip up your search files and email them to us so we can verify this for you.
                      --Ray
                      Wrensoft Web Software
                      Sydney, Australia
                      Zoom Search Engine



                      • #12
                        Progress....

                        Originally posted by Ray
                        Originally posted by DR4296
                        Question: Is there some way I can quickly import this text file into Zoom for Offline mode scanning?
                        Yes, from the Offline mode tab, click on "More" and "Import".
                        Ray, there is no "More" button on the Offline tab, only on the Online tab.



                        Originally posted by Ray
                        The output URLs are dependent on the "Base URL" that you specify in the Offline mode tab. This allows you to change the way the "search result links" are formed. For example, you can have:

                        Start dir: \\server\myfiles\
                        Base URL: file://server/myfiles/

                        This will produce links in the form of: file://server/myfiles/mypage.html which should work fine without the need to modify the script or CGI.
                        Oh, okay, I understand now! That works great!

                        Originally posted by Ray
                        I'm not sure why PHP is taking that long to return results, it depends on many things, such as server load, number of pages etc. But 5-10 seconds seems to indicate something else is wrong. You might want to zip up your search files and email them to us so we can verify this for you.
                        I really think that's a problem on our server. I've got phpMyAdmin installed on that box too, and it's dreadfully slow. I just lived with it and didn't think about it much, until now.


                        One other angle I started exploring late yesterday: I have little experience with IIS, but if I understand what I'm reading, it may be possible to extend the "virtual server" on our box to include remote directories (on other in-house boxes) as "virtual directories". In other words, it sounds like I can make our webserver act as the webserver for those other boxes too, but isolate that to just certain directories on those boxes.

                        I'm researching now whether it's possible to take already-existing directories on those boxes and make them a part of this virtual server.

                        My thinking is this might be a way to reach those files with standard URL's. So then, theoretically, Online Mode would work to spider all of those links.

                        The one flaw I see with this plan is that... remember, some of those links are to directories... which Offline Mode reads nicely, but Online Mode won't know what to do with.

                        What a project! Wish the company was paying me extra for this!


                        Thanks!

                        -= Dave =-



                        • #13
                          Originally posted by DR4296
                          Ray, there is no "More" button on the Offline tab, only on the Online tab.
                          I think you mean there's no "Import" button (there is a "More" button on the Offline tab since Version 4.0). But sorry, you're right, there is no import for offline mode. I was thinking of spider mode for some reason.

                          And yes, extending the "virtual server" to include remote directories on the network is possible (this was what I meant as my second solution earlier on). You'll have to make sure user permissions are set up accordingly on those machines, however, so that the IIS account will have access to all these remote folders.

                          Originally posted by DR4296
                          The one flaw I see with this plan is that... remember, some of those links are to directories... which Offline Mode reads nicely, but Online Mode won't know what to do with.
                          Actually, that's easily solved. In IIS, right click on the folder, select "Properties", and enable "Directory browsing". This makes IIS automatically generate an HTML directory listing for each folder which does not have a default file (eg. index.html, default.html etc.). The spider will follow these links and index the entire contents of that folder and its sub-folders.
                          --Ray
                          Wrensoft Web Software
                          Sydney, Australia
                          Zoom Search Engine



                          • #14
                            Thanks!

                            Thanks Ray!

                            You were right... I meant the "Import" button, not the "More" button.


                            Right now, I'm waiting on our internal help desk here to get back to me. I've asked them to give me sign-ons that I can use in order to access those remote directories on our network.

                            I seem to be at a stopping-point until I hear back from them.


                            Thanks!

                            -= Dave =-
