Index multiple domains and restrict indexing to certain folders unique to each site


  • Index multiple domains and restrict indexing to certain folders unique to each site

    Hi,

    I'm trying to build a vertical search engine but keep running into indexing problems. What I'd like to do is index about 100 unique domains but restrict the indexing to certain folders that are unique to each site.

    For example:

    domain1.com/news-events/
    domain2.com/news/
    domain3.com/events/
    domain4.com/calendar/
    etc.

    Every time I try to do this by adding multiple starting points and restricting the spider URL to the unique folder for each site, all I'm able to index is the first domain. All other domains are skipped because they're considered external sites that don't match the base URL.

    How would I configure the additional start points to effectively index these domains but only for the folders as listed above?

    Thanks,
    Cory

  • #2
    You can add additional domains (start points) via the "More" button on the "Start options" configuration window.

    Each start point gets its own base URL, so you should be able to enter the URLs more or less as you have them listed in your post above.

    Start point #1: http://www.domain1.com/news-events/index.html
    Base URL #1: http://www.domain1.com/news-events/

    Start point #2: http://www.domain2.com/news/index.html
    Base URL #2: http://www.domain2.com/news/
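
    In other words, the spider only follows links that fall under one of the listed base URLs; anything else is treated as an external site. Here is a rough sketch of that matching rule (plain Python for illustration only, not Zoom's actual code, using the placeholder URLs from your post):

    # Sketch of the base URL rule: a discovered link is only followed
    # if it starts with one of the configured base URLs.
    base_urls = [
        "http://www.domain1.com/news-events/",
        "http://www.domain2.com/news/",
    ]

    def should_follow(url):
        return any(url.startswith(base) for base in base_urls)

    print(should_follow("http://www.domain1.com/news-events/story.html"))  # True
    print(should_follow("http://www.domain3.com/events/"))  # False: looks external

    This is also why, with only one start point configured, the other 99 domains were skipped: every URL outside that single prefix fails the check.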




    • #3
      Hi,

      Thanks for your quick reply. I think I have two problems that I'm trying to solve, but I'll focus on the first one for now. The first problem I encounter is when the URL format of the pages I want to index differs from the starting point URL format.

      For example, with a spider URL of this:
      domain1.com/news-events/

      The base URL is often something like this:
      domain1.com/news/

      OK, so far so good. But when the news index breaks across multiple results pages like /news-events/?p=2, ?p=3, etc., I can't figure out the spidering options that will index all of the site's news pages within the /news/ folder -- and no other URLs. I've tried every spidering option, and none of them finds all the news pages. So far I'm only able to index the first 20 results, or however many news stories are shown on the main page.

      Please advise.

      Thanks,
      Cory



      • #4
        "For example, with a spider URL of this: domain1.com/news-events/"

        Strictly speaking, that is not a URL. You need the http:// (or https://) prefix.

        "The base URL is often something like this: domain1.com/news/"

        You control the base URL; it is whatever you need it to be. It normally makes sense for the base URL to be a subset of the start point URL, as in my example above.

        If the URLs you are trying to index all start with,
        http://domain1.com/news-events/
        then the base URL should be exactly that URL
        http://domain1.com/news-events/

        It might be better to give a real-life example rather than made-up URLs.



        • #5
          I've handled tech support before, so I know where you're coming from, but I think I can convey my challenge without revealing the actual URLs.

          The URLs I want to index do not start with:
          http://domain1.com/news-events/

          They start with:
          http://domain1.com/news/

          The page that contains the links to the /news/ URLs is:
          http://domain1.com/news-events/

          I'm able to spider that page just fine and index the 20 /news/ URLs on this page. But how do I spider URLs like these to catch all of the /news/ URLs?

          http://domain1.com/news-events/?p=2
          http://domain1.com/news-events/?p=3
          etc.

          Which spidering option should I use? I think I've tried them all with no luck. Or do I need an alternate setup altogether?

          Thanks,
          Cory



          • #6
            Start point: http://domain1.com/news-events/
            with the default setting, "Index page and follow internal links"
            Base URL: http://domain1.com/news-events/;http://domain1.com/news/
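
            The semicolon-separated base URL acts as a list of allowed prefixes, which is why both the paginated /news-events/?p=N pages and the /news/ story pages stay in scope. A quick illustration of the effect (plain Python, not Zoom's code, with the placeholder URLs from this thread):

            # A semicolon-separated base URL behaves like a list of allowed prefixes.
            allowed = "http://domain1.com/news-events/;http://domain1.com/news/".split(";")

            def in_scope(url):
                return any(url.startswith(prefix) for prefix in allowed)

            print(in_scope("http://domain1.com/news-events/?p=2"))  # True: paginated index page
            print(in_scope("http://domain1.com/news/big-story/"))   # True: news detail page
            print(in_scope("http://domain1.com/about/"))            # False: outside both prefixes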



            • #7
              Hi,

              Thanks for your help. This setup works like a charm, and I'm able to index all of the desired URLs.

              However, the problem is that other subfolders under /news/ are also being indexed, and I'd prefer to only index news detail pages.

              For example, I want to index URLs like this:
              http://domain1.com/news/big-news-happened-today/

              But not URLs like this:
              http://domain1.com/news/category/
              http://domain1.com/news/page2/

              I want to spider these pages to discover all links to news detail pages, but I don't want the category and pagination pages themselves indexed.

              How would I configure the settings to achieve this? I've tried to add words like "category" and "page" to the skip list, but then those pages aren't scanned at all, which results in many desired URLs not being indexed.

              Thanks,
              Cory



              • #8
                If you want to follow the links on a page, but not have the page itself indexed, then add this meta tag to the page:

                <meta name="robots" content="noindex">
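
                That's the standard robots meta convention: noindex keeps the page out of the search results while its links are still followed, unless nofollow is also present. A rough sketch of the logic a crawler applies (generic Python, not Zoom's implementation):

                from html.parser import HTMLParser

                class RobotsMeta(HTMLParser):
                    # Collects the robots directives from a page's meta tags.
                    noindex = False
                    nofollow = False
                    def handle_starttag(self, tag, attrs):
                        a = dict(attrs)
                        if tag == "meta" and a.get("name", "").lower() == "robots":
                            content = a.get("content", "").lower()
                            self.noindex = self.noindex or "noindex" in content
                            self.nofollow = self.nofollow or "nofollow" in content

                page = '<html><head><meta name="robots" content="noindex"></head></html>'
                parser = RobotsMeta()
                parser.feed(page)
                print("add to index:", not parser.noindex)   # False: kept out of results
                print("follow links:", not parser.nofollow)  # True: links still crawled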



                • #9
                  Hi,

                  I would do that if I owned the sites, but I don't. Again, this is a vertical search engine I'm trying to create that covers about 100 unique domains. I take it there isn't a way to exclude those subfolders?

                  Thanks,
                  Cory



                  • #10
                    There is no easy solution, given the other requirements.



                    • #11
                      I'm able to get Zoom to mostly do what I want to do, so I'm moving ahead with the project.

                      Another issue is that some of these third-party sites don't populate the title tag on their news pages. Hard to believe, but true. The lack of a title tag leaves the unhelpful phrase "No title" as the search results link title.

                      Is there any way to tell Zoom to use the H1 tag on the page as the search results link title instead?

                      Thanks,
                      Cory



                      • #12
                        No, sorry, there is no option to use the header tag as the title.

                        You could write them an e-mail suggesting that you could send more traffic to their site if their pages were better optimised for search engines. (It would help Google and all the other engines as well.)
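
                        In the meantime, if you ever post-process the crawled pages yourself, a title fallback along these lines is conceivable. This is purely a hypothetical sketch (Python), not anything Zoom supports:

                        from html.parser import HTMLParser

                        class TitleOrH1(HTMLParser):
                            # Captures the <title> text, or the first <h1> as a fallback.
                            def __init__(self):
                                super().__init__()
                                self.title, self.h1, self._in = "", "", None
                            def handle_starttag(self, tag, attrs):
                                if tag == "title" or (tag == "h1" and not self.h1):
                                    self._in = tag
                            def handle_data(self, data):
                                if self._in == "title":
                                    self.title += data
                                elif self._in == "h1":
                                    self.h1 += data
                            def handle_endtag(self, tag):
                                if tag == self._in:
                                    self._in = None

                        def display_title(html):
                            p = TitleOrH1()
                            p.feed(html)
                            return p.title.strip() or p.h1.strip() or "No title"

                        print(display_title("<html><body><h1>Big news happened today</h1></body></html>"))
                        # -> Big news happened today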
