Indexing multiple websites using Zoom


  • Indexing multiple websites using Zoom

    The website I index contains links to other websites, which I do not want to be indexed. These internal hyperlinks can be many in number and are not known in advance. How do I do that?
    Last edited by TanujaK; Feb-19-2007, 10:17 AM.

  • #2
    The default spider behaviour is to not follow links that lead to external web sites (this is to avoid the spider escaping and trying to index the entire internet). So you shouldn't need to do anything special.


    • #3
      If you did want to index multiple web sites, then you can add additional start points from the "More" button in the main window of Zoom.


      • #4
        Links to multiple regional websites

        Thanks for your reply. Our website operates as 27 regional websites, and I would like the search to be specific to each region. Hence I plan to index each of the regional websites separately. These websites have links to the internal websites of other regions, which I want to avoid. By default, however, indexing one regional website also indexes the linked pages of the other regional sites, so during a search the visitor gets results pointing to different regional websites as well, which is not correct. Can you please suggest how I can index only one particular region and not index the other regional websites that appear as hyperlinks?

        Thanks!


        • #5
          Can you give some examples as to what you mean by a "regional website" and what you are considering to be "internal" and "external"? I suspect there is some confusion in the terminology used here.

          When we refer to an "internal" link, we typically mean a link within a site, usually under the same domain name. In Zoom, we consider it to be any link that falls within the "Base URL" specified. So for example, if we were crawling the following URL:
          http://www.mysite.com/news/page1.html

          And the base URL was:
          http://www.mysite.com/

          Then any of the following links are internal:
          http://www.mysite.com/index.html
          http://www.mysite.com/blah/blah/dog.html
          http://www.mysite.com/news/page2.html

          However, the following are external:
          http://www.someothersite.com/index.html
          http://yahoo.com/
          ... etc.

          By default, Zoom will only index internal links and not follow external links. You can create multiple start points to index different sites.
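
As a rough illustration of the internal/external rule described above, here is a minimal Python sketch. It assumes a simple "same host, path under the base URL" comparison; Zoom's actual matching logic is not shown in this thread and may differ. The function name `is_internal` is made up for the example.

```python
from urllib.parse import urlsplit


def is_internal(link: str, base_url: str) -> bool:
    """Return True when link falls under base_url (assumed rule: same host, path prefix)."""
    link_parts = urlsplit(link)
    base_parts = urlsplit(base_url)
    # Host names are case-insensitive, so compare them in lower case.
    same_host = link_parts.netloc.lower() == base_parts.netloc.lower()
    # The link's path must start with the base URL's path.
    under_base_path = link_parts.path.startswith(base_parts.path)
    return same_host and under_base_path


BASE_URL = "http://www.mysite.com/"
for url in (
    "http://www.mysite.com/news/page2.html",
    "http://www.someothersite.com/index.html",
):
    print(url, "internal" if is_internal(url, BASE_URL) else "external")
```

Under this assumed rule, URLs that differ only in their query string (for example `http://www.mysite.com/?r=cn&l=cn`) still count as internal, because the host and path are unchanged.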

          If what you mean is that you have 27 websites (with 27 different domain names) to index, you should be able to add each as a different start point, as David mentioned above.

          I am not sure what you mean by "links to internal websites of other regions". Do you mean that you don't actually want to index the whole website of some of these domains, and only a page? There's an option for that as well - look up "Index page and follow internal and external links" in the Users Guide.
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine


          • #6
            Right! Let me explain. I have different regional websites with URLs such as:

            www.mysite.com?r=uk&l=en (Opens UK regional website)
            www.mysite.com?r=cn&l=cn (Opens Chinese regional website)
            www.mysite.com?r=tw&l=tw (Opens Taiwan regional website)
            etc.

            Each of the above URLs points to a different region based on its query string. I want to index each of the regional websites separately and have the search run separately for each of them. For example, when I index the UK regional website, it may contain links to the Chinese regional web pages as well. The base URL (www.mysite.com) for the Chinese regional website is the same as that of the UK site; only the query string differs. So in this case these are internal links.

            The problem I am facing is that when I index the UK regional website, the internal links (hyperlinks from the UK regional website to the Chinese or Taiwan regions) are also indexed. Then, when a visitor searches for a word in the UK region, the results may point to the Chinese website as well, which is incorrect.

            Thanks!


            • #7
              I see. I think for what you are trying to do, there are several better approaches:

              The first and probably most obvious solution is to create separate index files per region. This means having a separate ZCFG file for each region, and setting them up so that they only index their specific region (e.g. the UK ZCFG file would be set up to skip all pages containing "&l=cn", "&l=tw", etc.). You should also, of course, configure each to output to a different directory, so that you end up with a number of different search pages and sets of index files: one search function per region (e.g. www.mysite.com/search/en/search.php, www.mysite.com/search/cn/search.php, etc.).

              This is a recommended option if you never want your users to be able to search across multiple regions. The disadvantage to this method is that you need to manage multiple configurations, and be prepared to run the indexer per region - not necessarily too bad if you have it all scheduled and automated.

              However, if you would like the option to search across multiple regions (but also have the ability to restrict searches to a specific region most of the time), then you could achieve this by using the Categories option. This means that you can index your entire site (all regions) together and then, using the Categories feature, set up a category per region (e.g. have a category for "Chinese" with the pattern "?r=cn&l=cn"). You can then restrict searches to a specific region either via a dropdown menu that the user can select from (and that you can pre-select), or you could even hard-code the category in a custom HTML search form, so that the UK site would only have a search box that searches the UK category, for example.
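
To make the category idea concrete, here is a small Python sketch of how a URL pattern per region could map pages to categories and restrict results to one of them. It assumes plain substring matching and one category per page; Zoom's own pattern matching may behave differently (see the Users Guide). The table `CATEGORIES` and the functions `category_for` and `restrict_to` are hypothetical names for illustration only.

```python
from typing import List, Optional

# Hypothetical category table: name -> URL pattern, following the "?r=cn&l=cn" example.
CATEGORIES = {
    "UK": "?r=uk&l=en",
    "Chinese": "?r=cn&l=cn",
    "Taiwan": "?r=tw&l=tw",
}


def category_for(url: str) -> Optional[str]:
    """Return the first category whose pattern occurs in the URL, or None."""
    for name, pattern in CATEGORIES.items():
        if pattern in url:  # assumed rule: simple substring match
            return name
    return None


def restrict_to(urls: List[str], category: str) -> List[str]:
    """Keep only the URLs that fall into the given category."""
    return [url for url in urls if category_for(url) == category]


results = ["www.mysite.com?r=uk&l=en", "www.mysite.com?r=cn&l=cn"]
print(restrict_to(results, "UK"))
```

A hard-coded category on the UK search form would correspond to always calling `restrict_to(results, "UK")` in this sketch.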

              Now, the only problem with the latter solution is that you would naturally need to use UTF-8 for the search page, because the same search page is capable of serving up results in different languages. Zoom will convert the content scanned from each of the webpages, from their charset to the encoding selected for the index (on the "Languages" tab of the Config window).
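
The charset conversion mentioned above can be pictured with this short Python sketch: decode the page bytes using the page's own charset, then re-encode as UTF-8. This only illustrates the general idea and is not Zoom's implementation; the helper name `to_utf8` and the sample charsets are assumptions.

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Decode bytes from the page's charset and re-encode them as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


# The same two characters arrive as different bytes depending on the page charset...
gb_page = "中文".encode("gb2312")
big5_page = "中文".encode("big5")

# ...but end up identical once both are converted to the index encoding.
print(to_utf8(gb_page, "gb2312") == to_utf8(big5_page, "big5"))
```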

              For more information on the Categories option, see our Users Guide:
              http://www.wrensoft.com/zoom/usersguide.html
              Last edited by Ray; Feb-21-2007, 06:21 AM.
              --Ray


              • #8
                Thanks Ray for your suggestions.

                I have actually started implementing the first option you suggested on our website. Unfortunately, as I mentioned earlier, when I index one of the regions and store the output in a separate directory, the internal hyperlinks (those containing r=cn or r=tw) also get indexed. I want to stop this and index only the UK region pages (r=uk) and not the others, even though the UK regional website contains hyperlinks to the other regional websites. I hope that makes sense. How do I achieve this with Zoom? Is there a way to exclude internal websites from being indexed?


                • #9
                  Yes, please see the "Page and folder skip list", on the "Skip options" tab of the Configuration window.

                  Here you can specify what URLs should be skipped, so if you have a skip list like so:

                  r=cn
                  r=tw

                  Then any links containing these parameters will be excluded from the current index. Similarly, you can exclude the UK and Taiwan regions from your Chinese index by specifying a skip list in your Chinese configuration which skips "r=uk" and "r=tw".
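
With 27 regions, writing each skip list by hand gets tedious, so the lists could be generated. The Python sketch below builds the skip entries for one region and shows the "URL contains the entry" rule described above. The region codes are a hypothetical subset, the function names are made up, and substring matching is an assumption about how the skip list is applied.

```python
from typing import List

REGIONS = ["uk", "cn", "tw"]  # hypothetical subset of the 27 regions


def skip_list_for(region: str, all_regions: List[str] = REGIONS) -> List[str]:
    """Skip entries for one region's configuration: every other region's 'r=' parameter."""
    return [f"r={other}" for other in all_regions if other != region]


def should_skip(url: str, skip_list: List[str]) -> bool:
    """Assumed rule: skip a URL when it contains any skip-list entry as a substring."""
    return any(entry in url for entry in skip_list)


uk_skip = skip_list_for("uk")
print(uk_skip)
print(should_skip("www.mysite.com?r=cn&l=cn", uk_skip))
print(should_skip("www.mysite.com?r=uk&l=en", uk_skip))
```

One caveat of plain substring matching: an entry like "r=cn" would also match an unrelated parameter that happens to end in "r" (for example "color=cn"), so a more specific entry such as "?r=cn" may be safer if such URLs exist on the site.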

                  I hope that clears things up.
                  --Ray


                  • #10
                    Hi Ray:

                    Thanks for your help. This works.

                    With regard to scheduling and automatic indexing of sites, is there a way to schedule indexing of all the regions at once, or do we need to set the schedule for each of them separately?

                    Thanks!


                    • #11
                      You need to schedule each of the ZCFG files separately. So, if you had gone with the method of creating a separate set of index files per region (and having a number of different ZCFG files), then you would need to schedule each of them separately. Note that if you're using plugins, you should not schedule multiple copies of the Indexer to run simultaneously.

                      However, if you had gone with the method of using one set of index files to span multiple regions, then you would only need to schedule the one ZCFG file, and you would only need to index once.
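
For the multi-configuration case, one way to avoid overlapping runs is a single scheduled script that starts the indexing jobs one after another. The Python sketch below shows only that sequencing. The actual Indexer command line is not given in this thread, so the commands are left as placeholders to be filled in from the Users Guide, and `run_sequentially` is a hypothetical helper name.

```python
import subprocess
from typing import List


def run_sequentially(commands: List[List[str]]) -> List[int]:
    """Run each command to completion before starting the next; return the exit codes."""
    exit_codes = []
    for command in commands:
        completed = subprocess.run(command)  # blocks until this process exits
        exit_codes.append(completed.returncode)
    return exit_codes


# Placeholder only: substitute the real indexer command line from the Users Guide.
# commands = [["<indexer executable>", "<options>", f"{region}.zcfg"]
#             for region in ("uk", "cn", "tw")]
# print(run_sequentially(commands))
```

Because `subprocess.run` waits for each process to finish, no two jobs started by this script run at the same time.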
                      --Ray
