  • 2 Problems with indexing with 100,000 spider start points

    Having a couple of issues with indexing at the moment.

    Running 5.1.1003 Enterprise Edition.

    Problem 1 - The program hangs when importing a CSV with 100,000 spider URLs.

    Problem 2 - I am doing an "Index single page only" for each URL in my list, but it only indexes the first one. The URLs are generated by PERLView for Remedy, which uses a script to generate the HTML pages.

    For example:
    http://thisistheurl/perlview.pl?00001
    http://thisistheurl/perlview.pl?00002
    http://thisistheurl/perlview.pl?00003

    In this example I have imported each of those URLs into the spider to index the single page, but it only scans the first one and then fails on the rest, saying "Additional start URL invalid or already scanned:..."

    Thanks

  • #2
    Problem 1:
    While Zoom can support 100,000s of pages, we have never tested it with 100,000+ spider start points. But it should work, in theory.

    Are you sure it hangs? The import may just be slower than you expect with this number of URLs. Did you leave it for 15-30 minutes to see if it recovers? Does your machine have enough RAM?

    A much better solution would be to use a single start point pointing to a sitemap page. The sitemap page would then list all available URLs.
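
    For example, a minimal sitemap page might look something like this (just a sketch using the placeholder URLs from your post; the file layout and link text are illustrative):

    <html>
    <body>
    <!-- one link per page to be indexed, up to 100,000+ entries -->
    <a href="http://thisistheurl/perlview.pl?00001">00001</a><br>
    <a href="http://thisistheurl/perlview.pl?00002">00002</a><br>
    <a href="http://thisistheurl/perlview.pl?00003">00003</a><br>
    </body>
    </html>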

    Problem 2:
    This is probably caused by page 1 having a link to page 2. So when the 2nd start point is hit, it seems as though it has already been processed (during indexing of the 1st start point). The suggested solution to problem 1 would also resolve this.



    • #3
      I will let it try to load longer...I only waited 15 min.

      I tried your other method as well, but I ran into a problem.

      This is a remote site (not my server) and there is no site map. I created a single HTML page with a link to every URL; however, to do this I then need to enable "follow internal and external links". This doesn't let me index just the single HTML page at each link. Instead it follows the links from my page plus the 1000s of links in each additional page.

      So in the end...what I need to accomplish is to index 100,000+ single HTML pages each with a unique URL. Is this possible?

      Thanks



      • #4
        Originally posted by MikeR View Post
        So in the end...what I need to accomplish is to index 100,000+ single HTML pages each with a unique URL. Is this possible?
        This is certainly possible.

        Originally posted by MikeR View Post
        This is a remote site (not my server) and there is no site map. I created a single HTML page with a link to every URL; however, to do this I then need to enable "follow internal and external links". This doesn't let me index just the single HTML page at each link. Instead it follows the links from my page plus the 1000s of links in each additional page.
        I'm not sure why you needed to use "follow internal and external links" as opposed to just "index and follow internal links". From your example above anyway, there wasn't any need for external links.

        However, if you actually do have URLs which are considered external to the base URL (eg. your list of URLs looks more like:
        http://thisistheurl/perlview.pl?00001
        http://thisistheurl/perlview.pl?00002
        http://thisistheurl/perlview.pl?00003
        http://thisisanotherurl/perlview.pl?00001
        http://thisisanotherurl/perlview.pl?00002
        ... etc.)

        then I can see why you would need to use the "internal and external" option.

        In such a case, you might want to review the description of this option in the Help file:

        Index page and follow all links – will index the contents of the page and follow internal and external links (but only up to one level of external links – eg. it will scan each external page linked from an internal page, but will not index external pages linked from external pages).
        Note the last line. This means that so long as your start point (or site map URL) is considered external to ALL your links, it will not follow any internal links on each of those pages. One easy way to do this would be to host your HTML page containing all the links in a subdirectory (so that the base URL is different to your other start points).
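
        For example (a hypothetical layout; the "sitemap" subdirectory name is just an illustration):

        Start point / base URL: http://thisistheurl/sitemap/links.html
        Pages linked from it: http://thisistheurl/perlview.pl?00001
        http://thisistheurl/perlview.pl?00002
        ... etc.

        Since the perlview.pl URLs fall outside the base URL http://thisistheurl/sitemap/, each one is treated as external: the page itself gets indexed, but the links it contains are not followed.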

        Originally posted by MikeR View Post
        Problem 2 - I am doing an "Index single page only" for each URL in my list, but it only indexes the first one. The URLs are generated by PERLView for Remedy, which uses a script to generate the HTML pages.

        For example:
        http://thisistheurl/perlview.pl?00001
        http://thisistheurl/perlview.pl?00002
        http://thisistheurl/perlview.pl?00003

        In this example I have imported each of those URLs into the spider to index the single page, but it only scans the first one and then fails on the rest, saying "Additional start URL invalid or already scanned:..."
        Just wanted to review this part of your original post as well - did you import that list of URLs as is? You need to specify the parameter ", INDEX_ONLY" after EACH URL in your imported text file if you want them to be indexed as single pages. See "Importing and Exporting additional start URLs" in the Users Guide for more information.
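
        Based on the above, each line of your imported text file would look something like this (placeholder URLs from your post; see the Users Guide for the exact syntax):

        http://thisistheurl/perlview.pl?00001, INDEX_ONLY
        http://thisistheurl/perlview.pl?00002, INDEX_ONLY
        http://thisistheurl/perlview.pl?00003, INDEX_ONLY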
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine



        • #5
          Originally posted by Ray View Post
          This is certainly possible.

          I'm not sure why you needed to use "follow internal and external links" as opposed to just "index and follow internal links". From your example above anyway, there wasn't any need for external links.
          Yes...my bad. Just internal works.

          Originally posted by Ray View Post
          Just wanted to review this part of your original post as well - did you import that list of URLs as is?
          No...sorry I shortened it for the purposes of the post.

          I have this working to do mostly what I want. However, the problem that still remains is that each file may contain a link to another file. For example, file 90000 may have a reference to 67000 as a hyperlink. Since the base URL matches the reference URL, it indexes both.

          So my question is... let's say file 67000 has a hyperlink to 90000, and 90000 has a hyperlink to 67000. It will index 67000 and 90000, but when it gets to 90000, will it index 90000 and 67000 again?



          • #6
            Zoom will not index the same URL more than once, so this would not happen unless your links are actually different (eg. they might have different parameters added to the end of the URL, like "perlview.pl?0001&add=2").

            If the URLs are different, but the page content is actually identical, you can skip the page from being indexed by either (a) adding the extra parameters into the skip page list, or (b) enabling the CRC duplicate page prevention option.
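
            For example, for option (a): if the duplicate URLs only differ by an extra parameter such as "&add=" (a hypothetical parameter, following the example above), you could add an entry like the following to the skip pages list to exclude any URL containing that text (this assumes the skip list matches partial URL text; check its description in the Help file if in doubt):

            &add=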
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine
