
Newbie problem: Duplicate results from PHP site


ricksterv4n1x8
06-29-2005, 01:39 AM
I have indexed a PHP site, but when I run a search I get a ton of results that all have the same content, each under a different URL. This happens even when I turn the duplicate pages function off.

Also, the URLs are super long and quite different from the ones I see when I step through the site manually. For example, the page http://www2.algosolutions.com/?page=nsc&sid=10 is also listed as http://www2.algosolutions.com/?page=nsc&sid=10&PHPSESSID=87755b9f6a829cd023ed3c4608d669d1
along with a whole bunch of others, like:
http://www2.algosolutions.com/?page=nsc&sid=10&PHPSESSID=ac9dd393f19d8e5708b20822c744e251

Thanks

Ray
06-29-2005, 03:17 AM
What you are seeing are simply PHP session IDs, which are being passed in the URL. It appears that your website changes how it creates its links depending on whether the client/browser has cookies enabled.

You can click on the Configuration window -> "Authentication" tab, and enable "Use cookies from Windows and IE". This would allow it to work the same way it does when you access the site from your browser.
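To illustrate why these URLs all refer to the same page, here is a minimal Python sketch that removes the PHPSESSID parameter from a URL. It is not part of the indexer and does not describe how the indexer works internally; the function name `strip_session_id` is made up for this example, and the URLs are the ones quoted earlier in the thread.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_session_id(url, param="PHPSESSID"):
    """Return the URL with the session-ID query parameter removed.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # Keep every query parameter except the session ID, preserving order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


urls = [
    "http://www2.algosolutions.com/?page=nsc&sid=10",
    "http://www2.algosolutions.com/?page=nsc&sid=10&PHPSESSID=87755b9f6a829cd023ed3c4608d669d1",
    "http://www2.algosolutions.com/?page=nsc&sid=10&PHPSESSID=ac9dd393f19d8e5708b20822c744e251",
]

# All three URLs collapse to the same address once the session ID is gone.
unique = {strip_session_id(u) for u in urls}
```

Once the session ID is dropped, the three addresses become one, which is why getting the site to stop adding the ID (by using cookies) removes the extra entries.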

ricksterv4n1x8
06-29-2005, 04:18 PM
Thanks. I have enabled "Use cookies from Windows and IE"; however, I still get search results that have the session ID in the URL, although it appears to be less of a problem. The result is that duplicates still show up, even with the duplicate page detection function checked. For example, here are two URLs that showed up in the search results for the same page:

http://www2.algosolutions.com/?page=nsc&sid=10

http://www2.algosolutions.com/?page=nsc&sid=10&PHPSESSID=3f3894b0fbb31fe6ddfdd2e17eba1a73

Is it possible to configure the indexing process so that it, for example, excludes URLs containing certain text, like "PHPSESSID"?

ricksterv4n1x8
06-29-2005, 05:44 PM
I checked the "Reload all files (do not use cache)" option under the General tab of the Configuration window. That seemed to work for a couple of indexing runs, but then, for some reason, the session ID URLs started to get included again. Sometimes it works, sometimes it doesn't. Am I doing things right?

Ray
06-30-2005, 01:03 AM
I think it is just the cache acting up. Checking both "Reload all files (do not use cache)" and "Use cookies from Windows and IE" should fix it in theory. It might be worth clearing your cache in IE to be sure.

The other problem is that, without seeing the actual source code of the backend, I can't be sure how your website decides whether to provide session IDs. You might want to check that there's no other scenario where session IDs are given instead (e.g. if there's a link to disable the use of cookies for this session and the spider follows it, or if the website changes behaviour for different browsers, etc.).

As for why they are not detected by the "duplicate page" function: the pages are not actually identical. If you view the source, you will see that the PHPSESSID page links back to the index.php page (and various other pages) with this session ID appended. The other page does not have this. They also each use different banner images.
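To make the point about non-identical pages concrete, here is a rough Python sketch. It assumes duplicate detection works by comparing a checksum of the page source, which is only a guess for illustration and not a description of the indexer's real algorithm. The `fingerprint` function, the regular expression, and the two HTML snippets are all invented for this example, and the pattern assumes the session ID is the last parameter in each link.

```python
import hashlib
import re

# Matches a PHPSESSID parameter together with the "?", "&" or "&amp;"
# in front of it. Assumes the session ID is the last parameter in the link.
SESSION_RE = re.compile(r"(?:\?|&amp;|&)PHPSESSID=[0-9A-Za-z,-]+")


def fingerprint(html, normalize=False):
    """Return a SHA-256 checksum of the page source.

    With normalize=True, session IDs are stripped from the source first.
    Hypothetical helper for illustration only.
    """
    if normalize:
        html = SESSION_RE.sub("", html)
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


# Two made-up versions of the same page: one fetched with cookies,
# one fetched with the session ID appended to its links.
plain = '<a href="index.php">Home</a>'
with_sid = '<a href="index.php?PHPSESSID=3f3894b0fbb31fe6ddfdd2e17eba1a73">Home</a>'

raw_match = fingerprint(plain) == fingerprint(with_sid)
normalized_match = fingerprint(plain, True) == fingerprint(with_sid, True)
```

Here `raw_match` is false because the link text differs, while `normalized_match` is true once the session ID is removed. Note that this kind of clean-up would not help with the different banner images, which also make the two pages differ.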