PDF's not ALL getting indexed

  • PDF's not ALL getting indexed

    Hi,

    I built an intranet and use Zoom to index it.
    I spider the intranet because it contains database items that need to be indexed.

    I have about 270 PDF files, all created here internally without security settings, and not all of them get indexed.
    I have set the limits in the Zoom configuration extremely high to get around this, but that does not solve the problem.

    All files are linked.
    I removed all the words I had in my skip list.

    It just seems like it won't scan my complete site.

    I indexed with Verbose mode on, but that didn't show anything that could clear things up.

    Could it be that the directory level is too deep?
    Some files are located at paths like:

    intranet/folder/folder/folder/folder/pdffile
    If i think as i thought, i will do as i did and if i do what i did i will think as i thought....

  • #2
    What are you seeing in the index log (with Verbose mode on) when it fails to index the files? Does it find the links at all? There are normally reasons given for why a link is not scanned (eg. "File not found", "Blocked by extensions list", "External site - blocked by base URL", etc.). Are there any error messages?

    Save your index log (click on "File" menu and select "Save index log to file") and send us a copy of this via e-mail (see our Contact Us page). We can take a closer look at the problem.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine


    • #3
      Hi,

      As requested, I have sent what you wanted to the info e-mail address.

      I don't see any errors that are relevant to the PDF files not being indexed.
      It just doesn't follow all the links which are present.

      There is one document that can't be found, as you will see in the index log file that I sent, but that is correct.


      [offtopic]
      I'm at GMT+1 and it's 10:22 am now. I figure you're 12 hours ahead of me?
      [/offtopic]

      PS:
      System specs:
      P4 2.4 GHz
      512 MB memory



      Edit:
      I took all the PDF and DOC files and put them in one subdirectory, had a PHP file list all the files in the index of the test site, and all the documents got indexed.

      In the intranet site (where it goes wrong), all the documents are properly linked, yet they don't get indexed.
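
The listing-page test described above can be reproduced without PHP. What follows is only a minimal sketch in Python, not the poster's actual script: the function name `build_listing` and the default extension list are assumptions made for illustration. It walks a directory and emits one plain `<a href>` link per PDF or DOC file, which is the kind of link a spider can follow.

```python
import html
import os
from urllib.parse import quote


def build_listing(root, extensions=(".pdf", ".doc")):
    """Return an HTML page with one plain link per matching file under root.

    `build_listing` and its defaults are hypothetical names for this sketch.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                # Use forward slashes so the path works as a relative URL.
                found.append(rel.replace(os.sep, "/"))

    items = [
        '<li><a href="%s">%s</a></li>' % (quote(rel), html.escape(rel))
        for rel in sorted(found)
    ]
    return "<html><body><ul>\n%s\n</ul></body></html>\n" % "\n".join(items)
```

Writing the returned string to a page inside the site's start directory gives the spider a single page that links every document directly, which helps separate "the files cannot be indexed" from "the links to the files are not being found".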


      A suggestion that just popped into my mind:
      a log file that keeps track of every web page it checks and, for every page, which links it followed to index.

      Also, I'm thinking of making a system where I use a JavaScript popup (not currently in use, if you wonder) to open PDF and Word documents from the intranet. Will it follow the JavaScript too?


      • #4
        We've had a look at your files and have sent back a reply via e-mail. We first recommend you upgrade to the latest build to make sure this is not an old bug:
        http://www.wrensoft.com/zoom/whatsnew.html

        "I took all the PDF and DOC files and put them in one subdirectory, had a PHP file list all the files in the index of the test site, and all the documents got indexed. In the intranet site (where it goes wrong), all the documents are properly linked, yet they don't get indexed."

        This would indicate that the problem is with the linking on the intranet site. It may be a number of things, such as invalid HTML (which Zoom cannot handle). Find a page where links are not found, and send us a copy of the HTML source of that page (i.e. open the page in a browser, select "View Source" and save the source code).

        "A suggestion that just popped into my mind: a log file that keeps track of every web page it checks and, for every page, which links it followed to index."

        This is what the index log already does. When you switch to "Single threaded downloading" in the configuration window and enable "Verbose mode", the messages are displayed in the order in which the spider scans files, with the links it finds while scanning each page listed after that page.

        Eg. all "queueing" and "skipping" messages after a "Scanning [pageA]" message, are the links that it found whilst scanning [pageA]. If a link is not mentioned, then it was not found or considered as a link.
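
A saved index log can be grouped per page along the lines described above. This is a rough sketch only: the "Scanning", "Queueing" and "Skipping" keywords come from the reply, but the exact line layout of the log is an assumption here, and the function name is hypothetical. It also only makes sense for a log produced with single threaded downloading, since otherwise the messages are not in page order.

```python
def group_links_by_page(log_lines):
    """Map each scanned page to the queueing/skipping messages that follow it.

    Assumes (not confirmed by the thread) that each message starts with the
    word "Scanning", "Queueing" or "Skipping", in any letter case.
    """
    pages = {}
    current = None
    for raw in log_lines:
        line = raw.strip()
        lower = line.lower()
        if lower.startswith("scanning"):
            # Everything after the keyword is treated as the page address.
            current = line[len("scanning"):].strip(" :")
            pages.setdefault(current, [])
        elif current is not None and lower.startswith(("queueing", "skipping")):
            pages[current].append(line)
    return pages


def pages_without_links(pages):
    """Return scanned pages for which no link message was logged."""
    return [page for page, links in pages.items() if not links]
```

Pages returned by `pages_without_links` are good candidates for the "View Source" check, because the spider scanned them but did not recognise any links on them.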

        "Also, I'm thinking of making a system where I use a JavaScript popup to open PDF and Word documents from the intranet. Will it follow the JavaScript too?"

        No. JavaScript links are not followed by most web spiders, as the target can be impossible to determine (a link may not be formed until the script is executed and the user interacts in some way).

        More information here:
        http://www.wrensoft.com/zoom/support...spider_finding
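
To illustrate why script-driven links are a problem, here is a small sketch of a link finder that only reads `href` attributes. It is not Zoom's parser; the class name and the sample page are made up for the example. Links whose target exists only inside script code end up with no URL the spider could request.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect <a> links, separating plain URLs from script-only links."""

    def __init__(self):
        super().__init__()
        self.crawlable = []    # hrefs that are ordinary URLs
        self.script_only = []  # links whose target is hidden in script code

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = (attr.get("href") or "").strip()
        if href and not href.startswith("#") and not href.lower().startswith("javascript:"):
            self.crawlable.append(href)
        else:
            self.script_only.append(attr.get("onclick") or href)


# Hypothetical page with one plain link and two JavaScript-driven links.
PAGE = """
<a href="docs/manual.pdf">Manual</a>
<a href="javascript:openDoc('docs/report.pdf')">Report</a>
<a href="#" onclick="window.open('docs/policy.doc'); return false;">Policy</a>
"""

finder = LinkFinder()
finder.feed(PAGE)
print(finder.crawlable)
print(finder.script_only)
```

Only the first link yields a URL that can be fetched directly. The other two would require running or interpreting the script to discover the document path.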
        --Ray


        • #5
          First of all...

          d*mn, I love this quick answering. You don't see that much...


          Since I downloaded the copy of Zoom I use just last week, I figured I had the latest build, but I noticed that there is already a newer version now. I will try that first.

          I will take a look at my generated code, and if I don't see anything strange in it, I will send the code to you.

          I guess I must be getting lost in all the info that's in the indexer log.
          I'll look into that and see what happens.

          I can work around the JavaScript problem with the <noscript> tag, no problem:
          <noscript> href link here </noscript>


          Edit:
          Hmmm... I'm getting a few red lines in my indexer reporting bad HTML :S
          I will look into the code then, and let you know.

          My code was a little bit sloppy :S
          I ran it through the W3C HTML validator and worked out the errors. (I saved the HTML code, uploaded it to my web server and checked it from there, since the file upload option of the W3C validator didn't work like it should.)
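
For a quick local check before going to the W3C validator, something like the following sketch can flag unclosed or mismatched tags. It is deliberately strict and is not a replacement for a real validator: HTML allows some end tags (for example `</p>` or `</li>`) to be omitted, and this check will report those too. The class and function names are made up for the example.

```python
from html.parser import HTMLParser

# Elements that never take an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Report end tags that do not match the most recently opened tag."""

    def __init__(self):
        super().__init__()
        self.stack = []     # (tag, line) for every tag still open
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
        else:
            self.problems.append("line %d: unexpected </%s>" % (self.getpos()[0], tag))

    def close(self):
        super().close()
        # Anything left on the stack was opened but never closed.
        for tag, line in self.stack:
            self.problems.append("line %d: <%s> never closed" % (line, tag))


def check_html(text):
    """Return a list of problem descriptions for the given HTML text."""
    checker = TagBalanceChecker()
    checker.feed(text)
    checker.close()
    return checker.problems
```

Calling `check_html` on the saved source of a page where links are not being found gives a first hint of where the markup goes wrong, for instance an `<a>` tag that is never closed.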

          Now I get a lot more files scanned.


          One reason was indeed that I already use JavaScript to open some links/documents,
          but the <noscript> tag works wonders.
          I still have a few pages with invalid HTML, so I'm working that out now.

          Thanks
