HTML index error (strange of course)


Liebel
03-13-2007, 02:37 PM
Hi again,
allow me one (hopefully last) stupid question.
I came across a "funny" HTML error message with some of our pages.
Interesting: the HTML is not perfect, but according to various HTML checkers it is OK. 1127 errors were reported for 52644 indexed files (see log).
There are several things on the HTML pages which I initially thought might be the reason for the problem, but they are OK on other pages.

...any idea what this might be?...
Thanks already...
Greetings ...

---------------------
14:17:40 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00758/IPI00758369.htm, page aborted
14:17:42 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00759/IPI00759894.htm, page aborted
14:17:43 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00759/IPI00759928.htm, page aborted
14:17:47 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00760/IPI00760082.htm, page aborted
14:17:50 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00118/IPI00118096.htm, page aborted
14:17:53 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00118/IPI00118238.htm, page aborted
14:17:53 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00118/IPI00118271.htm, page aborted
14:17:53 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00118/IPI00118296.htm, page aborted
14:17:53 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00118/IPI00118304.htm, page aborted
14:17:53 - [ERROR] Invalid HTML found while spidering http://harvester.fzk.de/harvester/mouse/IPI00118/IPI00118309.htm, page aborted
14:19:22 - Indexing completed at Tue Mar 13 14:19:22 2007
14:19:22 - INDEX SUMMARY
14:19:22 - Files indexed: 52644
14:19:22 - Files skipped: 4681074
14:19:22 - Files filtered: 0
14:19:22 - Files downloaded: 52644
14:19:22 - Unique words found: 1778195
14:19:22 - Total words found: 46694865
14:19:22 - Avg. unique words per page: 33
14:19:22 - Avg. words per page: 886
14:19:22 - Start index time: 13:54:59 (2007/03/13)
14:19:22 - Elapsed index time: 00:24:23
14:19:22 - Errors: 1127
14:19:22 - URLs visited by spider: 52644
14:19:22 - URLs in spider queue: 0
14:19:22 - Total bytes scanned/downloaded: 1793589489
14:19:22 - File extensions:
14:19:22 - .htm indexed: 52394
14:19:22 - .html indexed: 250
14:19:22 - Cleaning up memory used for index data... please wait.
14:19:22 - Finished cleaning up memory.

Ray
03-14-2007, 12:36 AM
What HTML checker did you use? The most common and reliable is the W3C Validator at:
http://validator.w3.org/

Here is one of your pages put through the validator:
http://validator.w3.org/check?uri=http%3A%2F%2Fharvester.fzk.de%2Fharvester%2Fmouse%2FIPI00758%2FIPI00758369.htm

It failed validation, and reported 161 errors.

Despite this, the actual problem that Zoom picked up was not mentioned in the report. The Zoom error message is actually caused by some extremely long URLs in the links on your page.

Below is one of your long links. (I kid you not, that is the actual full link you have on the page):

<A HREF="http://smart.embl-heidelberg.de/smart/show_motifs.pl?INCLUDE_SIGNALP=INCLUDE_SIGNALP&DO_PFAM=DO_PFAM&SEQUENCE=MVALSLKICVRHCNVVKTMQFEPSTAVYDACRVIRERVPEA QTGQASDYGLFLSDEDPRKGIWLEAGRTLDYYMLRNGDILEYKKKQRPQK IRMLDGSVKTVMVDDSKTVGELLVTICSRIGITNYEEYSLIQETIEEKKE EGTGTLKKDRTLLRDERKMEKLKAKLHTDDDLNWLDHSRTFREQGVDENE TLLLRRKFFYSDQNVDSRDPVQLNLLYVQARDDILNGSHPVSFEKACEFG GFQAQIQFGPHVEHKHKPGFLDLKEFLPKEYIKQRGAEKRIFQEHKNCGE MSEIEAKVKYVKLARSLRTYGVSFFLVKEKMKGKNKLVPRLLGITKDSVM RVDEKTKEVLQEWPLTTVKRWAASPKSFTLDFGEYQESYYSVQTTEGEQI SQLIAGYIDIILKKKQSKDRFGLEGDEESTMLEESVSPKKSTILQQQFNR TGKAEHGSVALPAVMRSGSSGPETFNVGSMPSPQQQVMVGQMHRGHMPPL TSAQQALMGTINTSMHAVQQAQDDLSELDSLPPLGQDMASRVWVQNKVDE SKHEIHSQVDAITAGTASVVNLTAGDPADTDYTAVGCAITTISSNLTEMS KGVKLLAALMDDDVGSGEDLLRAARTLAGAVSDLLKAVQPTSGEPRQTVL TAAGSIGQASGDLLRQIGENETDERFQDVLMSLAKAVANAAAMLVLKAKN VAQVAEDTVLQNRVIAAATQCALSTSQLVACAKVVSPTISSPVCQEQLIE AGKLVDRSVENCVRACQAATGDSELLKQVSAAASVVSQALHDLLQHVRQF ASRGEPIGRYDQATDTIMCVTESIFSSMGDAGEMVRQARVLAQATSDLVN AMRSDAEAEIDMENSKKLLAAAKLLADSTARMVEAAKGAAANPENEDQQQ RLREAAEGLRVATNAAAQNAIKKKIVNRLEVAAKQAAAAATQTIAASQNA AISNKNPSAQQQLVQSCKAVADHIPQLVQGVRGSQAQAEDLSAQLALIIS SQNFLQPGSKMVSSAKAAVPTVSDQAAAMQLSQCAKNLATSLAELRTASQ KAHEACGPMEIDSALNTVQTLKNELQDAKMAAAESQLKPLPGETLEKCAQ DLGSTSKGVGSSMAQLLTCAAQGNEHYTGVAARETAQALKTLAQAARGVA ASTNDPEAAHAMLDSARDVMEGSAMLIQEAKQALIAPGDTESQQRLAQVA KAVSHSLNNCVNCLPGQKDVDVALKSIGEASKKLLVDSLPPSTKPFQEAQ SELNQAAADLNQSAGEVVHATRGQSGELAAASGKFSDDFDEFLDAGIEMA GQAQTKEDQMQVIGNLKNISMASSKLLLAAKSLSVDPGAPNAKNLLAAAA RAVTESINQLIMLCTQQAPGQKECDNALRELETVKGMLENPNEPVSDLSY FDCIESVMENSKVLGESMAGISQNAKTGGNPKAQHTHDAITEAAQLMKEA VDDIMVTLNEAASEVGLVGGMVDAIAEAMSKLDEGTPPEPKGTFVDYQTT VVKYSKAIAVTAQEMMTKSVTNPEELGGLASQMTTDYGHLALQGQMAAAT AEPEEIGFQIRTRVQDLGHGCIFLVQKAGALQVCPTDSYTKRELIECARS VTEKVSLVLSALQAGNKGTQACITAATAVSGIIADLDTTIMFATAGTLNA ENGETFADHRENILKTAKALVEDTKLLVSGAASTPDKLAQAAQSSAATIT QLAEVVKLGAASLGSNDPETQVVLINAIKDVAKALSDLIGATKGAASKPA DDPSMYQLKGAAKVMVTNVTSLLKTVKAVEDEATRGTRALEATIEYIKQE LTVFQSKDIPEKTSSPEESIRMTKGITMATAKAVAAGNSCRQEDVIATAN 
LSRKAVSDMLIACKQASFYPDVSEEVRTRALRYGTECTLGYLDLLEHVLV ILQKPTPELKHQLAAFSKRVAGAVTELIQAAEAMKGTEWVDPEDPTVIAE TELLGAAASIEAAAKKLEQLKPRAKPKQADETLDFEEQILEAAKSIAAAT SALVKSASAAQRELVAQGKVGSIPANAADDGQWSQGLISAARMVAAATSS LCEAANASVQGHASEEKLISSAKQVAASTAQLLVACKVKADQDSEAMKRL QVMVTDAGGKILLLERAAGNAVKRASDNLVRAAQKAAFGKADDDDVVVKT KFVGGIAQIIAAQEEMLKKERELEEARKKLAQIRQQQYKFLPTELREDEG"> ACTIVATE: SMART analysis</A>

Now, I think you would agree that's a pretty long link ;) In fact, the URL is 2301 characters long. The maximum URL length in Internet Explorer is 2083 characters, a limit imposed by the Windows Internet API. Other web browsers and servers have somewhat different limits, but a practical limit is still enforced. That link is likely to fail in a variety of browsers and operating systems (I just checked, and it gets truncated if you try to enter that URL in Internet Explorer 7).
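
If you want to find which pages contain links like this before indexing, something along these lines might help. It is only a rough sketch using Python's standard html.parser module. The names LongLinkFinder and find_long_links are made up for this example, and the 2083 limit is the Internet Explorer figure quoted above, not a limit I know Zoom to use internally.

```python
from html.parser import HTMLParser

# Internet Explorer's URL limit as quoted in this thread (an assumption
# for this sketch; Zoom's own internal limit may differ).
MAX_URL_LENGTH = 2083


class LongLinkFinder(HTMLParser):
    """Collects the href values of <a> tags that exceed a length limit."""

    def __init__(self, limit=MAX_URL_LENGTH):
        super().__init__()
        self.limit = limit
        self.long_links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...>
        # arrives here as tag "a" with attribute "href".
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None and len(value) > self.limit:
                self.long_links.append(value)


def find_long_links(html_text, limit=MAX_URL_LENGTH):
    """Return every link URL in html_text that is longer than limit."""
    finder = LongLinkFinder(limit)
    finder.feed(html_text)
    finder.close()
    return finder.long_links
```

You would read each page's HTML into a string and call find_long_links(text); any page for which it returns a non-empty list is a candidate for the "Invalid HTML" error.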

At the moment, Zoom assumes that the page's HTML is broken due to the ridiculously long link, and skips indexing the page. We would recommend looking into why such a long link is needed, and changing your site to use shorter URLs.

wrensoft
03-14-2007, 01:46 AM
For a second opinion we also used the more in-depth 'CSE HTML Checker'.

It found 179 HTML errors and warnings on your first page.

Liebel
03-14-2007, 08:14 AM
Thanks Ray for the help...
(w3 validator detects everything ...a bit too much )

Unfortunately, the long links we use are "amino acid sequences" (proteins) that we have to pass to the corresponding bioinformatics server.

Cutting the link/protein? Let me see...ah....we would have no eyes (probably)
... :-) ...anyway...thanks for your help....we will find a way around the
"long link problem" ....

Danke and Greetings....

Ray
03-14-2007, 08:30 AM
I don't mean to suggest that you should truncate the data. But some data is not suitable to be sent via the URL, and this is one such case. It would make more sense if the database/backend did not depend on the sequence as the identifying parameter. Usually, a database would be designed to contain a shorter, internal ID# (unique to your database only) that you can use in the link instead.
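
To illustrate the internal ID idea, here is a minimal sketch. Everything in it is hypothetical: the SEQUENCES table, the "seq-0001" ID, the "id" parameter name and the two function names are made up for the example, and the stored sequence is only a shortened stand-in. It assumes you control the script that receives the link and can look the sequence up on the server side.

```python
from urllib.parse import urlencode

# Hypothetical lookup table: short internal ID -> full protein sequence.
# In practice this would be a database table; the sequence here is a
# shortened stand-in, not a real record.
SEQUENCES = {
    "seq-0001": "MVALSLKICVRHCNVVKTMQFEPSTAVYDACRVIRERVPEA",
}


def short_link(base_url, record_id):
    """Build a link that carries only the short ID, not the sequence."""
    if record_id not in SEQUENCES:
        raise KeyError(f"unknown record ID: {record_id}")
    return base_url + "?" + urlencode({"id": record_id})


def resolve_sequence(record_id):
    """Server side: look up the full sequence for a short ID."""
    return SEQUENCES[record_id]
```

The link stays short no matter how long the protein is, because the sequence itself never appears in the URL.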

And if large data needs to be sent between pages, typically you should use HTTP POST (e.g. forms) instead of HTTP GET (parameters in URL).
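
As a rough sketch of the difference, the snippet below puts the sequence in the request body, which is what an HTML form with method="post" does, so the URL itself stays short. The SEQUENCE parameter name is taken from the link quoted earlier; whether the receiving server accepts POST for this script is an assumption you would need to check. The function build_post_request is a made-up name, and it only builds the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(url, sequence):
    """Build (but do not send) a POST request carrying the sequence.

    The sequence travels in the request body as form data, so its
    length does not count against any URL length limit.
    """
    body = urlencode({"SEQUENCE": sequence}).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Passing the result to urllib.request.urlopen would actually send it. In a web page, the equivalent is a form with method="post" and the sequence in a hidden input field, instead of a plain link.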

As mentioned before, the existing website implementation is already broken for Internet Explorer and probably many other web clients. So there may well already be some missing eyes and noses as it is. :)