Index Skip Options

  • Index Skip Options

    I am having trouble getting this to work properly. I am using DNN 4.8 with a store module, and Zoom version 6 with the ASP index. I do not want to include the category pages (/cid/) in the search results. When I try to skip "/tabid/58/cid/" or just "/cid/", the entire index fails. I have successfully skipped other pages this way, but they use a different tabid.

    I have been manually editing the index to work around this, but I would like to automate the process.

    The first example is a category in my store.

    http://www.meinhard.com/Store/tabid/58/cid/1/Nebulizers.aspx

    The second is a product (/pid/) in my store. This is what I would like to include.

    http://www.meinhard.com/Store/tabid/58/pid/257/Meinhard-Type-A-Quartz-4-mLmin-20-psi.aspx

  • #2
    If you enter /tabid/58/cid/ in the skip list, this would exclude the first page but not the second.

    Can you describe in more detail what you mean by "the entire index fails"? Are there error messages, or are too many documents skipped? What is the actual problem?
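
To illustrate why that skip entry affects only the category URL, here is a minimal sketch. It assumes a skip-list entry matches when it occurs anywhere in the URL, and that the comparison ignores case; both are assumptions for the illustration, and `is_skipped` is a hypothetical helper, not part of Zoom.

```python
def is_skipped(url, skip_list):
    """Return True if any skip-list entry occurs anywhere in the URL.

    Assumption: substring matching, case-insensitive. This is a toy model,
    not Zoom's actual matching code.
    """
    lowered = url.lower()
    return any(entry.lower() in lowered for entry in skip_list)


skip_list = ["/tabid/58/cid/"]

# The two URLs from the original post.
category_url = "http://www.meinhard.com/Store/tabid/58/cid/1/Nebulizers.aspx"
product_url = (
    "http://www.meinhard.com/Store/tabid/58/pid/257/"
    "Meinhard-Type-A-Quartz-4-mLmin-20-psi.aspx"
)

print(is_skipped(category_url, skip_list))  # True: contains /tabid/58/cid/
print(is_skipped(product_url, skip_list))   # False: contains /pid/, not /cid/
```

Under these assumptions the product URL is never blocked by the entry itself, so if product pages go missing the cause has to lie elsewhere in the crawl.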



    • #3
      That is what I thought. But for some reason, if I skip /cid/, the indexing run takes about 2 seconds and misses all of the /pid/ pages.



      • #4
        Can you post or e-mail us your log?



        • #5
          10|02/27/09 15:12:23|Start indexing (spider mode) at Fri Feb 27 15:12:23 2009
          02|02/27/09 15:12:23|Maximum number of words: 90000
          02|02/27/09 15:12:23|Maximum number of files: 65000
          02|02/27/09 15:12:23|Will scan files with extensions
          02|02/27/09 15:12:23| .htm
          02|02/27/09 15:12:23| .html
          02|02/27/09 15:12:23| .txt
          02|02/27/09 15:12:23| .php
          02|02/27/09 15:12:23| .asp
          02|02/27/09 15:12:23| .cgi
          02|02/27/09 15:12:23| .aspx
          02|02/27/09 15:12:23| .pdf
          02|02/27/09 15:12:23| .docx
          02|02/27/09 15:12:23| .pptx
          02|02/27/09 15:12:23| .xlsx
          02|02/27/09 15:12:23|Spider from: http://www.meinhard.com/
          02|02/27/09 15:12:23|Web site URL: http://www.meinhard.com/
          02|02/27/09 15:12:23|Estimated RAM required during index process: 242691 KB
          04|02/27/09 15:12:23|Downloading robots.txt file found at http://www.meinhard.com/robots.txt
          02|02/27/09 15:12:23|Initiating HTTP session (thread #1) ...
          14|02/27/09 15:12:23|DL Thread #1, got URL (http://www.meinhard.com/) off queue
          04|02/27/09 15:12:23|Downloading file http://www.meinhard.com/
          14|02/27/09 15:12:23|Index Thread got ready buffer for http://www.meinhard.com/ (Content-type: HTML text)
          02|02/27/09 15:12:23|Initiating HTTP session (thread #3) ...
          02|02/27/09 15:12:23|Initiating HTTP session (thread #9) ...
          02|02/27/09 15:12:23|Initiating HTTP session (thread #7) ...
          02|02/27/09 15:12:23|Initiating HTTP session (thread #5) ...
          02|02/27/09 15:12:23|Initiating HTTP session (thread #4) ...
          11|02/27/09 15:12:23|Spidering for links on http://www.meinhard.com/
          02|02/27/09 15:12:23|Initiating HTTP session (thread #6) ...
          02|02/27/09 15:12:23|Initiating HTTP session (thread # ...
          02|02/27/09 15:12:23|Initiating HTTP session (thread #10) ...
          02|02/27/09 15:12:23|Initiating HTTP session (thread #2) ...
          11|02/27/09 15:12:23|Queued URL: http://www.meinhard.com/portals/0/default.aspx
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/portals/0/Pix/PortalSiteBanner.jpg (Blocked by extensions list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/images/spacer.gif (Blocked by robots.txt)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/58/cid/1/Nebulizers.aspx (Blocked by page skip list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/58/cid/3/ICP-Torches.aspx (Blocked by page skip list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/58/cid/2/Spray-Chambers.aspx (Blocked by page skip list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/58/cid/4/Peristaltic-Pumps.aspx (Blocked by page skip list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/58/cid/80/Sample-Introduction-Accessories.aspx (Blocked by page skip list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/58/cid/75/Sample-Preperation-Accessories.aspx (Blocked by page skip list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/58/cid/6/Pump-Tubing.aspx (Blocked by page skip list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/StoreLinksAdmin/tabid/58/cid/152/Sale-Items.aspx (Blocked by page skip list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/65/Default.aspx (Blocked by page skip list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/63/Default.aspx (Blocked by page skip list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/70/Default.aspx (Blocked by page skip list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/71/Default.aspx (Blocked by page skip list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/72/Default.aspx (Blocked by page skip list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/docs/pdf/MGP2008Cat2.pdf (Blocked by page skip list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/buttons/nebulizer1.jpg (Blocked by extensions list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/buttons/icp1.jpg (Blocked by extensions list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/buttons/spray1.jpg (Blocked by extensions list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/buttons/placeholder.jpg (Blocked by extensions list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/buttons/smplintroductionacc.jpg (Blocked by extensions list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/buttons/Peristaltic.jpg (Blocked by extensions list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/store/tabid/58/cid/75/Sample-Preperation-Accessories.aspx (Blocked by page skip list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/buttons/sampleprepacc.jpg (Blocked by extensions list)
          01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/QASISOnew.jpg (Blocked by extensions list)
          11|02/27/09 15:12:23|Queued URL: http://www.meinhard.com/Home/tabid/36/ctl/Privacy/Default.aspx
          00|02/27/09 15:12:23|Indexing http://www.meinhard.com/
          14|02/27/09 15:12:23|DL Thread #1, got URL (http://www.meinhard.com/portals/0/default.aspx) off queue
          14|02/27/09 15:12:23|DL Thread #3, got URL (http://www.meinhard.com/Home/tabid/36/ctl/Privacy/Default.aspx) off queue
          04|02/27/09 15:12:23|Downloading file http://www.meinhard.com/portals/0/default.aspx
          04|02/27/09 15:12:23|Downloading file http://www.meinhard.com/Home/tabid/36/ctl/Privacy/Default.aspx
          04|02/27/09 15:12:23|URL redirected to: http://www.meinhard.com/ [thread #1]
          01|02/27/09 15:12:23|Redirected file already scanned [thread #1]
          14|02/27/09 15:12:23|Index Thread got ready buffer for http://www.meinhard.com/Home/tabid/36/ctl/Privacy/Default.aspx (Content-type: HTML text)
          11|02/27/09 15:12:23|Spidering for links on http://www.meinhard.com/Home/tabid/36/ctl/Privacy/Default.aspx
          00|02/27/09 15:12:23|Indexing http://www.meinhard.com/Home/tabid/36/ctl/Privacy/Default.aspx
          03|02/27/09 15:12:23|All index files will be written to: C:\domains\localuser\armi\meinhard.com\search
          03|02/27/09 15:12:23|Writing index data for ASP search... (Please wait)
          03|02/27/09 15:12:23|Created pagedata data file (zoom_pagedata.zdat)
          03|02/27/09 15:12:23|Created pagetext data file (zoom_pagetext.zdat)
          03|02/27/09 15:12:23|Created pageinfo data file (zoom_pageinfo.zdat)
          03|02/27/09 15:12:23|Created dictionary data file (zoom_dictionary.zdat)
          03|02/27/09 15:12:23|Created wordmap data file (zoom_wordmap.zdat)
          03|02/27/09 15:12:23|Created script settings file (settings.asp)
          10|02/27/09 15:12:23|Indexing completed at Fri Feb 27 15:12:23 2009
          12|02/27/09 15:12:23|INDEX SUMMARY
          12|02/27/09 15:12:23|Files indexed: 2
          12|02/27/09 15:12:23|Files skipped: 25
          12|02/27/09 15:12:23|Files filtered: 0
          12|02/27/09 15:12:23|Files downloaded: 2
          12|02/27/09 15:12:23|Unique words found: 282
          12|02/27/09 15:12:23|Variant words found: 78
          12|02/27/09 15:12:23|Total words found: 335
          12|02/27/09 15:12:23|Avg. unique words per page: 141.00
          12|02/27/09 15:12:23|Avg. words per page: 167
          12|02/27/09 15:12:23|Start index time: 15:12:23 (2009/02/27)
          12|02/27/09 15:12:23|Elapsed index time: 00:00:00
          12|02/27/09 15:12:23|Peak physical memory used: 62 MB
          12|02/27/09 15:12:23|Peak virtual memory used: 119 MB
          12|02/27/09 15:12:23|Errors: 0
          12|02/27/09 15:12:23|URLs visited by spider: 3
          12|02/27/09 15:12:23|URLs in spider queue: 0
          12|02/27/09 15:12:23|Total bytes scanned/downloaded: 25594
          12|02/27/09 15:12:23|File extensions:
          12|02/27/09 15:12:23| .htm indexed: 0
          12|02/27/09 15:12:23| .html indexed: 0
          12|02/27/09 15:12:23| .txt indexed: 0
          12|02/27/09 15:12:23| .php indexed: 0
          12|02/27/09 15:12:23| .asp indexed: 0
          12|02/27/09 15:12:23| .cgi indexed: 0
          12|02/27/09 15:12:23| .aspx indexed: 1
          12|02/27/09 15:12:23| .pdf indexed: 0
          12|02/27/09 15:12:23| .docx indexed: 0
          12|02/27/09 15:12:23| .pptx indexed: 0
          12|02/27/09 15:12:23| .xlsx indexed: 0
          12|02/27/09 15:12:23| No extensions indexed: 1
          02|02/27/09 15:12:23|Cleaning up memory used for index data... please wait.
          02|02/27/09 15:12:23|Finished cleaning up memory.
          03|02/27/09 15:12:23|Copied search script to: C:\domains\localuser\armi\meinhard.com\search\sear



          • #6
            If I remove the /cid/ skip option, this is a summary of what the log looks like.


            03|02/27/09 15:20:29|Writing index data for ASP search... (Please wait)
            03|02/27/09 15:20:29|Created pagedata data file (zoom_pagedata.zdat)
            03|02/27/09 15:20:29|Created pagetext data file (zoom_pagetext.zdat)
            03|02/27/09 15:20:29|Created pageinfo data file (zoom_pageinfo.zdat)
            03|02/27/09 15:20:29|Created dictionary data file (zoom_dictionary.zdat)
            03|02/27/09 15:20:29|Created wordmap data file (zoom_wordmap.zdat)
            03|02/27/09 15:20:29|Created script settings file (settings.asp)
            10|02/27/09 15:20:29|Indexing completed at Fri Feb 27 15:20:29 2009
            12|02/27/09 15:20:29|INDEX SUMMARY
            12|02/27/09 15:20:29|Files indexed: 938
            12|02/27/09 15:20:29|Files skipped: 40
            12|02/27/09 15:20:29|Files filtered: 0
            12|02/27/09 15:20:29|Files downloaded: 938
            12|02/27/09 15:20:29|Unique words found: 5045
            12|02/27/09 15:20:29|Variant words found: 2646
            12|02/27/09 15:20:29|Total words found: 155556
            12|02/27/09 15:20:29|Avg. unique words per page: 5.38
            12|02/27/09 15:20:29|Avg. words per page: 165
            12|02/27/09 15:20:29|Start index time: 15:18:13 (2009/02/27)
            12|02/27/09 15:20:29|Elapsed index time: 00:02:16
            12|02/27/09 15:20:29|Peak physical memory used: 62 MB
            12|02/27/09 15:20:29|Peak virtual memory used: 127 MB
            12|02/27/09 15:20:29|Errors: 0
            12|02/27/09 15:20:29|URLs visited by spider: 939
            12|02/27/09 15:20:29|URLs in spider queue: 0
            12|02/27/09 15:20:29|Total bytes scanned/downloaded: 20110858
            12|02/27/09 15:20:29|File extensions:
            12|02/27/09 15:20:29| .htm indexed: 0
            12|02/27/09 15:20:29| .html indexed: 0
            12|02/27/09 15:20:29| .txt indexed: 0
            12|02/27/09 15:20:29| .ph



            • #7
              Just to clarify your original description of the problem: the above logs don't demonstrate that "the index fails" (which to us, implies a crash), but rather, that it does not find the other pages of your site and only indexes a few pages. This is a common spidering situation that is explained in the FAQ here:
              Q. I am indexing with spider mode but it is not finding all the pages on my web site

              You must understand that the spider needs to find links to locate other pages. If your Skip List causes a page containing important links to be excluded, then those links will not be crawled by the spider, and many other pages will be excluded.

              As your product pages (pid) are linked only from your category pages (cid), skipping the "cid" pages means that the spider will never reach the "pid" pages.

              What you should consider is:
              (a) If you have a page on the site (which isn't a "cid" page) that links to all the products. If so, specify this as the "Start Spider URL" or as an additional start point.
              (b) If not, specify each of your "cid" pages as an additional start point with the Spidering option of "Follow links only" so that they won't be indexed, but the links will be crawled.
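
The effect described above can be sketched with a toy breadth-first crawl. This is not Zoom's implementation: the link graph, the shortened paths, and the `crawl` function with its `skip` and `follow_only` parameters are all hypothetical, chosen only to mirror the site structure in this thread (home page links to "cid" pages, which link to "pid" pages).

```python
from collections import deque


def crawl(links, start_points, skip=(), follow_only=()):
    """Breadth-first crawl over a toy link graph; returns the indexed URLs.

    links:        dict mapping each URL to the URLs it links to.
    skip:         substrings; a matching URL is never downloaded.
    follow_only:  exact URLs that are scanned for links but not indexed.
    """
    queue = deque(start_points)
    seen = set(start_points)
    indexed = set()
    while queue:
        url = queue.popleft()
        if any(pattern in url for pattern in skip):
            continue  # never downloaded, so the links on this page are lost
        if url not in follow_only:
            indexed.add(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return indexed


# Hypothetical link graph modelled on the thread.
cid1 = "/Store/tabid/58/cid/1"
cid2 = "/Store/tabid/58/cid/2"
links = {
    "/": [cid1, cid2, "/Home/tabid/36/ctl/Privacy"],
    cid1: ["/Store/tabid/58/pid/257", "/Store/tabid/58/pid/258"],
    cid2: ["/Store/tabid/58/pid/300"],
}

# Skipping /cid/ cuts off every product page behind the category pages.
skipped_run = crawl(links, ["/"], skip=["/cid/"])
print(sorted(skipped_run))

# Option (b): category pages as extra start points, followed but not indexed.
follow_run = crawl(links, ["/", cid1, cid2], follow_only={cid1, cid2})
print(sorted(follow_run))
```

In the first run only the home page and the privacy page are indexed, which resembles the two-file result in the log above. In the second run the product pages are indexed and the category pages are not. The second run uses an empty skip list, so the sketch makes no claim about how a skip-list entry and a "Follow links only" start point interact in Zoom.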

              See the Users Guide for details on how to use additional start points.
              --Ray
              Wrensoft Web Software
              Sydney, Australia
              Zoom Search Engine
