duplicate content detection is not correct



maryjili
09-16-2008, 05:31 PM
I checked all the CRC matches that were found, and their contents are only similar, not at all identical. Is this what duplicate content detection is meant to do?

maryjili
09-16-2008, 05:48 PM
The skipped files counter in the indexing status box always overcounts. When there are 25 skipped files in indexlog.txt, it says 56. When there are no skipped files, it is still above 0 and keeps growing as the indexing proceeds.

wrensoft
09-16-2008, 08:45 PM
The CRC option is for detecting and removing pages that have identical content but different URLs. (Not pages which might just be similar).
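
To illustrate the difference between identical and similar content, here is a minimal sketch of checksum-based duplicate detection. It is not the indexer's actual implementation: the function names, the sample URLs, and the use of Python's zlib.crc32 are assumptions made for illustration only. Two pages are flagged only when their content produces the same CRC, so a page that differs by even one character is treated as distinct.

```python
import zlib


def content_crc(text: str) -> int:
    """Return the CRC-32 checksum of the page content (hypothetical helper)."""
    return zlib.crc32(text.encode("utf-8"))


def find_duplicates(pages: dict[str, str]) -> dict[str, str]:
    """Map each duplicate URL to the first URL seen with the same checksum."""
    seen: dict[int, str] = {}        # checksum -> first URL with that content
    duplicates: dict[str, str] = {}  # duplicate URL -> original URL
    for url, content in pages.items():
        crc = content_crc(content)
        if crc in seen:
            duplicates[url] = seen[crc]
        else:
            seen[crc] = url
    return duplicates


# Hypothetical sample pages: two identical, one merely similar.
pages = {
    "/index.html": "Welcome to our site.",
    "/home.html": "Welcome to our site.",
    "/index2.html": "Welcome to our site!",
}

duplicates = find_duplicates(pages)
print(duplicates)  # {'/home.html': '/index.html'}
```

Only /home.html is reported, because its content matches /index.html byte for byte. /index2.html is similar but not identical, so its checksum differs and it is not flagged.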

Regarding the skipped page count: turn on verbose mode so that you get a full log before assuming the counter is wrong. There might be files skipped that you are not aware of.

maryjili
09-17-2008, 12:22 AM
Where is verbose mode located? Do you mean debug mode?

Ray
09-17-2008, 01:31 AM
It is a button on the main index window. Alongside "Start indexing", "Configure", "Exit", there is a button that says "Verbose is off" (when Verbose mode is off) and "Verbose is on" (when Verbose mode is on).