Index Skip Options



gpatton
02-27-2009, 07:05 PM
I am having trouble getting the skip options to work properly. I am using DNN 4.8 with a store module, and Zoom version 6 with the ASP index. I do not want the category pages (/cid/) included in the search results. When I add "/tabid/58/cid/" or just "/cid/" to the skip list, the entire index fails. I have successfully skipped other pages this way, but they use a different tabid.

I have been manually editing the index to resolve this but I would like to automate this process.

The first example is a category in my store.

http://www.meinhard.com/Store/tabid/58/cid/1/Nebulizers.aspx

The second is a product (/pid/) in my store. This is what I would like to include.

http://www.meinhard.com/Store/tabid/58/pid/257/Meinhard-Type-A-Quartz-4-mLmin-20-psi.aspx

wrensoft
02-27-2009, 07:15 PM
If you enter /tabid/58/cid/ in the skip list, this would exclude the first page but not the second.

Can you describe in more detail what you mean by "the entire index fails"? Are there error messages, or are too many documents skipped? What is the actual problem?
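As a rough illustration of why that entry matches the category URL but not the product URL, here is a minimal Python sketch. It assumes a skip list entry is treated as a plain substring of the URL, and that the comparison is case-insensitive; both are assumptions made for illustration, and is_skipped is a hypothetical helper, not part of Zoom.

```python
# Hypothetical model of a page skip list: a URL is skipped when any
# skip entry appears inside it as a substring.
# Assumption: the comparison ignores case.
def is_skipped(url, skip_list):
    lowered = url.lower()
    return any(entry.lower() in lowered for entry in skip_list)


skip_list = ["/tabid/58/cid/"]

category = "http://www.meinhard.com/Store/tabid/58/cid/1/Nebulizers.aspx"
product = (
    "http://www.meinhard.com/Store/tabid/58/pid/257/"
    "Meinhard-Type-A-Quartz-4-mLmin-20-psi.aspx"
)

print(is_skipped(category, skip_list))  # True: contains /tabid/58/cid/
print(is_skipped(product, skip_list))   # False: contains /pid/, not /cid/
```

Under this model the product URL is never blocked by the skip entry itself, so if product pages go missing, the cause has to be somewhere else in the crawl.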

gpatton
02-27-2009, 07:26 PM
That is what I thought. But for some reason, if I skip /cid/, the index takes about 2 seconds and misses all of the /pid/ pages.

wrensoft
02-27-2009, 08:41 PM
Can you post or e-mail us your log?

gpatton
02-27-2009, 09:15 PM
10|02/27/09 15:12:23|Start indexing (spider mode) at Fri Feb 27 15:12:23 2009
02|02/27/09 15:12:23|Maximum number of words: 90000
02|02/27/09 15:12:23|Maximum number of files: 65000
02|02/27/09 15:12:23|Will scan files with extensions
02|02/27/09 15:12:23| .htm
02|02/27/09 15:12:23| .html
02|02/27/09 15:12:23| .txt
02|02/27/09 15:12:23| .php
02|02/27/09 15:12:23| .asp
02|02/27/09 15:12:23| .cgi
02|02/27/09 15:12:23| .aspx
02|02/27/09 15:12:23| .pdf
02|02/27/09 15:12:23| .docx
02|02/27/09 15:12:23| .pptx
02|02/27/09 15:12:23| .xlsx
02|02/27/09 15:12:23|Spider from: http://www.meinhard.com/
02|02/27/09 15:12:23|Web site URL: http://www.meinhard.com/
02|02/27/09 15:12:23|Estimated RAM required during index process: 242691 KB
04|02/27/09 15:12:23|Downloading robots.txt file found at http://www.meinhard.com/robots.txt
02|02/27/09 15:12:23|Initiating HTTP session (thread #1) ...
14|02/27/09 15:12:23|DL Thread #1, got URL (http://www.meinhard.com/) off queue
04|02/27/09 15:12:23|Downloading file http://www.meinhard.com/
14|02/27/09 15:12:23|Index Thread got ready buffer for http://www.meinhard.com/ (Content-type: HTML text)
02|02/27/09 15:12:23|Initiating HTTP session (thread #3) ...
02|02/27/09 15:12:23|Initiating HTTP session (thread #9) ...
02|02/27/09 15:12:23|Initiating HTTP session (thread #7) ...
02|02/27/09 15:12:23|Initiating HTTP session (thread #5) ...
02|02/27/09 15:12:23|Initiating HTTP session (thread #4) ...
11|02/27/09 15:12:23|Spidering for links on http://www.meinhard.com/
02|02/27/09 15:12:23|Initiating HTTP session (thread #6) ...
02|02/27/09 15:12:23|Initiating HTTP session (thread #8) ...
02|02/27/09 15:12:23|Initiating HTTP session (thread #10) ...
02|02/27/09 15:12:23|Initiating HTTP session (thread #2) ...
11|02/27/09 15:12:23|Queued URL: http://www.meinhard.com/portals/0/default.aspx
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/portals/0/Pix/PortalSiteBanner.jpg (Blocked by extensions list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/images/spacer.gif (Blocked by robots.txt)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/58/cid/1/Nebulizers.aspx (Blocked by page skip list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/58/cid/3/ICP-Torches.aspx (Blocked by page skip list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/58/cid/2/Spray-Chambers.aspx (Blocked by page skip list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/58/cid/4/Peristaltic-Pumps.aspx (Blocked by page skip list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/58/cid/80/Sample-Introduction-Accessories.aspx (Blocked by page skip list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/58/cid/75/Sample-Preperation-Accessories.aspx (Blocked by page skip list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/58/cid/6/Pump-Tubing.aspx (Blocked by page skip list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/StoreLinksAdmin/tabid/58/cid/152/Sale-Items.aspx (Blocked by page skip list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/65/Default.aspx (Blocked by page skip list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/63/Default.aspx (Blocked by page skip list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/70/Default.aspx (Blocked by page skip list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/71/Default.aspx (Blocked by page skip list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Store/tabid/72/Default.aspx (Blocked by page skip list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/docs/pdf/MGP2008Cat2.pdf (Blocked by page skip list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/buttons/nebulizer1.jpg (Blocked by extensions list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/buttons/icp1.jpg (Blocked by extensions list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/buttons/spray1.jpg (Blocked by extensions list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/buttons/placeholder.jpg (Blocked by extensions list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/buttons/smplintroductionacc.jpg (Blocked by extensions list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/buttons/Peristaltic.jpg (Blocked by extensions list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/store/tabid/58/cid/75/Sample-Preperation-Accessories.aspx (Blocked by page skip list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/buttons/sampleprepacc.jpg (Blocked by extensions list)
01|02/27/09 15:12:23|Skipping http://www.meinhard.com/Portals/0/images/QASISOnew.jpg (Blocked by extensions list)
11|02/27/09 15:12:23|Queued URL: http://www.meinhard.com/Home/tabid/36/ctl/Privacy/Default.aspx
00|02/27/09 15:12:23|Indexing http://www.meinhard.com/
14|02/27/09 15:12:23|DL Thread #1, got URL (http://www.meinhard.com/portals/0/default.aspx) off queue
14|02/27/09 15:12:23|DL Thread #3, got URL (http://www.meinhard.com/Home/tabid/36/ctl/Privacy/Default.aspx) off queue
04|02/27/09 15:12:23|Downloading file http://www.meinhard.com/portals/0/default.aspx
04|02/27/09 15:12:23|Downloading file http://www.meinhard.com/Home/tabid/36/ctl/Privacy/Default.aspx
04|02/27/09 15:12:23|URL redirected to: http://www.meinhard.com/ [thread #1]
01|02/27/09 15:12:23|Redirected file already scanned [thread #1]
14|02/27/09 15:12:23|Index Thread got ready buffer for http://www.meinhard.com/Home/tabid/36/ctl/Privacy/Default.aspx (Content-type: HTML text)
11|02/27/09 15:12:23|Spidering for links on http://www.meinhard.com/Home/tabid/36/ctl/Privacy/Default.aspx
00|02/27/09 15:12:23|Indexing http://www.meinhard.com/Home/tabid/36/ctl/Privacy/Default.aspx
03|02/27/09 15:12:23|All index files will be written to: C:\domains\localuser\armi\meinhard.com\search
03|02/27/09 15:12:23|Writing index data for ASP search... (Please wait)
03|02/27/09 15:12:23|Created pagedata data file (zoom_pagedata.zdat)
03|02/27/09 15:12:23|Created pagetext data file (zoom_pagetext.zdat)
03|02/27/09 15:12:23|Created pageinfo data file (zoom_pageinfo.zdat)
03|02/27/09 15:12:23|Created dictionary data file (zoom_dictionary.zdat)
03|02/27/09 15:12:23|Created wordmap data file (zoom_wordmap.zdat)
03|02/27/09 15:12:23|Created script settings file (settings.asp)
10|02/27/09 15:12:23|Indexing completed at Fri Feb 27 15:12:23 2009
12|02/27/09 15:12:23|INDEX SUMMARY
12|02/27/09 15:12:23|Files indexed: 2
12|02/27/09 15:12:23|Files skipped: 25
12|02/27/09 15:12:23|Files filtered: 0
12|02/27/09 15:12:23|Files downloaded: 2
12|02/27/09 15:12:23|Unique words found: 282
12|02/27/09 15:12:23|Variant words found: 78
12|02/27/09 15:12:23|Total words found: 335
12|02/27/09 15:12:23|Avg. unique words per page: 141.00
12|02/27/09 15:12:23|Avg. words per page: 167
12|02/27/09 15:12:23|Start index time: 15:12:23 (2009/02/27)
12|02/27/09 15:12:23|Elapsed index time: 00:00:00
12|02/27/09 15:12:23|Peak physical memory used: 62 MB
12|02/27/09 15:12:23|Peak virtual memory used: 119 MB
12|02/27/09 15:12:23|Errors: 0
12|02/27/09 15:12:23|URLs visited by spider: 3
12|02/27/09 15:12:23|URLs in spider queue: 0
12|02/27/09 15:12:23|Total bytes scanned/downloaded: 25594
12|02/27/09 15:12:23|File extensions:
12|02/27/09 15:12:23| .htm indexed: 0
12|02/27/09 15:12:23| .html indexed: 0
12|02/27/09 15:12:23| .txt indexed: 0
12|02/27/09 15:12:23| .php indexed: 0
12|02/27/09 15:12:23| .asp indexed: 0
12|02/27/09 15:12:23| .cgi indexed: 0
12|02/27/09 15:12:23| .aspx indexed: 1
12|02/27/09 15:12:23| .pdf indexed: 0
12|02/27/09 15:12:23| .docx indexed: 0
12|02/27/09 15:12:23| .pptx indexed: 0
12|02/27/09 15:12:23| .xlsx indexed: 0
12|02/27/09 15:12:23| No extensions indexed: 1
02|02/27/09 15:12:23|Cleaning up memory used for index data... please wait.
02|02/27/09 15:12:23|Finished cleaning up memory.
03|02/27/09 15:12:23|Copied search script to: C:\domains\localuser\armi\meinhard.com\search\sear

gpatton
02-27-2009, 09:25 PM
If I remove the /cid/ skip option, this is a summary of what the log looks like.


03|02/27/09 15:20:29|Writing index data for ASP search... (Please wait)
03|02/27/09 15:20:29|Created pagedata data file (zoom_pagedata.zdat)
03|02/27/09 15:20:29|Created pagetext data file (zoom_pagetext.zdat)
03|02/27/09 15:20:29|Created pageinfo data file (zoom_pageinfo.zdat)
03|02/27/09 15:20:29|Created dictionary data file (zoom_dictionary.zdat)
03|02/27/09 15:20:29|Created wordmap data file (zoom_wordmap.zdat)
03|02/27/09 15:20:29|Created script settings file (settings.asp)
10|02/27/09 15:20:29|Indexing completed at Fri Feb 27 15:20:29 2009
12|02/27/09 15:20:29|INDEX SUMMARY
12|02/27/09 15:20:29|Files indexed: 938
12|02/27/09 15:20:29|Files skipped: 40
12|02/27/09 15:20:29|Files filtered: 0
12|02/27/09 15:20:29|Files downloaded: 938
12|02/27/09 15:20:29|Unique words found: 5045
12|02/27/09 15:20:29|Variant words found: 2646
12|02/27/09 15:20:29|Total words found: 155556
12|02/27/09 15:20:29|Avg. unique words per page: 5.38
12|02/27/09 15:20:29|Avg. words per page: 165
12|02/27/09 15:20:29|Start index time: 15:18:13 (2009/02/27)
12|02/27/09 15:20:29|Elapsed index time: 00:02:16
12|02/27/09 15:20:29|Peak physical memory used: 62 MB
12|02/27/09 15:20:29|Peak virtual memory used: 127 MB
12|02/27/09 15:20:29|Errors: 0
12|02/27/09 15:20:29|URLs visited by spider: 939
12|02/27/09 15:20:29|URLs in spider queue: 0
12|02/27/09 15:20:29|Total bytes scanned/downloaded: 20110858
12|02/27/09 15:20:29|File extensions:
12|02/27/09 15:20:29| .htm indexed: 0
12|02/27/09 15:20:29| .html indexed: 0
12|02/27/09 15:20:29| .txt indexed: 0
12|02/27/09 15:20:29| .ph

Ray
03-01-2009, 11:34 PM
Just to clarify your original description of the problem: the above logs don't demonstrate that "the index fails" (which, to us, implies a crash), but rather that the spider does not find the other pages of your site and only indexes a few pages. This is a common spidering situation that is explained in the FAQ here:
Q. I am indexing with spider mode but it is not finding all the pages on my web site (http://www.wrensoft.com/zoom/support/faq_problems.html#spider_finding)

You must understand that the spider needs to find links to locate other pages. If your Skip List causes a page containing important links to be excluded, then those links will not be crawled by the spider, and many other pages will be excluded.

Since your product pages (pid) are linked only via your category pages (cid), skipping the "cid" pages means that the spider will never reach the "pid" pages.

What you should consider is:
(a) If you have a page on the site (which isn't a "cid" page) that links to all the products, specify it as the "Start Spider URL" or as an additional start point.
(b) If not, specify each of your "cid" pages as an additional start point with the spidering option "Follow links only", so that they won't be indexed but their links will be crawled.

See the Users Guide (http://www.wrensoft.com/zoom/usersguide.html) for details on how to use additional start points.
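To make the reasoning concrete, here is a small Python sketch of a link-following spider. It is a simplified model and not Zoom's actual implementation: the LINKS graph, the crawl function, and the shortened product URL are all hypothetical, the skip list is modelled as substring matching, and "Follow links only" is modelled as a set of URLs that are scanned for links but never indexed.

```python
from collections import deque

# Hypothetical link graph, loosely modelled on the site in this thread.
# The home page links to a category (cid) page; only the category page
# links to the product (pid) page.
HOME = "/"
PRIVACY = "/Home/tabid/36/ctl/Privacy/Default.aspx"
CATEGORY = "/Store/tabid/58/cid/1/Nebulizers.aspx"
PRODUCT = "/Store/tabid/58/pid/257/Product.aspx"

LINKS = {
    HOME: [CATEGORY, PRIVACY],
    PRIVACY: [],
    CATEGORY: [PRODUCT],
    PRODUCT: [],
}


def crawl(start_points, skip=(), follow_only=()):
    """Return the list of URLs that would be indexed.

    skip: substrings; a matching URL is neither indexed nor scanned for links.
    follow_only: exact URLs that are scanned for links but not indexed
                 (a simplified stand-in for "Follow links only").
    """
    indexed = []
    seen = set()
    queue = deque(start_points)
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        if any(pattern in url for pattern in skip):
            continue  # skipped page: its outgoing links are never discovered
        if url not in follow_only:
            indexed.append(url)
        queue.extend(LINKS.get(url, []))
    return indexed


# Skipping /cid/ cuts off the only path to the product page.
print(crawl([HOME], skip=["/cid/"]))

# Option (b): the category page is an extra start point that is followed
# but not indexed, so the product page is reached and indexed.
print(crawl([HOME, CATEGORY], follow_only={CATEGORY}))
```

In the first call only the home page and the privacy page are indexed, which mirrors the "Files indexed: 2" result in the log above. In the second call the product page is indexed while the category page stays out of the results.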