2007 .ppt not being indexed

  • 2007 .ppt not being indexed

    The file is not locked, but it is a 2007 .ppt file (yes, three years old).

    When are we going to be able to spider three-year-old files?

    Found the answer, not in the forum, but in support....

    Made a pptx file and ran the spider again, no error regarding the file in the log, but words in the pptx are not being shown in the results.

    Search results for: *.pptx

    2 results found.
    CQC-registration.pptx
    ... CQC-registration.pptx ...
    Terms matched: 1 - Score: 9 - 30 Sep 2010 - 4,243k - URL: http://www.****re.org.uk/intranet/presentations/CQC-registration.pptx
    Ivan-Wass.pptx
    ... Ivan-Wass.pptx ...
    Terms matched: 1 - Score: 9 - 30 Sep 2010 - 309k - URL: http://www.****re.org.uk/intranet/presentations/Ivan-Wass.pptx

    I converted the files to .odp, added the extension to the formats section, and spidered again.

    Search results for: *.odp

    2 results found.
    CQC-registration.odp
    ... CQC-registration.odp ...
    Terms matched: 1 - Score: 9 - 30 Sep 2010 - 4,296k - URL: http://www.****re.org.uk/intranet/presentations/CQC-registration.odp
    Ivan-Wass.odp
    ... Ivan-Wass.odp ...
    Terms matched: 1 - Score: 9 - 30 Sep 2010 - 386k - URL: http://www.****re.org.uk/intranet/presentations/Ivan-Wass.odp

    However, still no results for search text from these files.

    Here's a link to a small pptx file http://www.emcare.org.uk/Ivan-Wass.pptx

    When you say meta data, do you mean the Dublin Core meta data? Then no, that wasn't checked. I've checked it now and re-ran the indexer, with the pages now linking to the pptx files again.

    The plugin I've used is your all-in-one plugin, it seems to spider other ppt files easily.

    Yes, under "Configure"->"Scan options", ".pptx" is set up to be indexed as an "Office 2007 file" file type.
    Last edited by grahamtinley; Oct-01-2010, 04:53 AM. Reason: More info.

  • #2
    Do you have a file you can share with us, so we can look at it in more detail?

    Did you check the settings in the "indexing options" configuration window to make sure you are indexing page content and meta data?

    • #3
      You should also make sure you have the Office 2007 plugin installed, which provides support for indexing .docx, .pptx, and .xlsx file formats.

      Also confirm that under "Configure"->"Scan options", you have set up ".pptx" to be indexed as an "Office 2007 file" file type, and NOT as a "Binary (filename only)" file type.
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine

      • #4
        Bump

        • #5
          Just realized you've edited your original post to answer the questions we've asked.

          As general advice for web forums, post your answers as a new post (in the same thread, that is). First, we don't get notified when a post has been edited, and it doesn't show up as a new post. Second, it means we have to re-read your original post and "spot the difference" to find where you've added new information, which just wastes time.

          We'll have a look at the file given.
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine

          • #6
            Originally posted by grahamtinley:
            Made a pptx file and ran the spider again, no error regarding the file in the log, but words in the pptx are not being shown in the results.

            Search results for: *.pptx

            2 results found.
            CQC-registration.pptx
            ... CQC-registration.pptx ...
            Terms matched: 1 - Score: 9 - 30 Sep 2010 - 4,243k - URL: http://www.****re.org.uk/intranet/presentations/CQC-registration.pptx
            Ivan-Wass.pptx
            ... Ivan-Wass.pptx ...
            Terms matched: 1 - Score: 9 - 30 Sep 2010 - 309k - URL: http://www.****re.org.uk/intranet/presentations/Ivan-Wass.pptx
            We tested the specific file you provided, and the words in the PPTX file are indexed fine. Try searching for the words inside the file rather than the filename.

            Because you searched for "*.pptx", the context description shows a match in the filename (as that is the word matched). No words from within the document are shown because the match was not inside the document content.

            If you search for, say "safeguarding vulnerable groups", you'll find the words within the PPTX file will appear as part of the context description.

            Also, if you enable "Title of page" in the search results ("Configure"->"Results Layout"->Check "Title of page" box), then you will see the title link as "Vetting and Barring Scheme", rather than the filename looking like it's being repeated.
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine

            • #7
              Search results for: "safeguarding vulnerable groups"

              2 results found.
              Guidance-Compliance.pdf
              ... that they are registered with the Independent Safeguarding Authority: where they are undertaking a Safeguarding Vulnerable Groups Act 2006 regulated activity or controlled activity; and are required to ...
              Terms matched: 1 - Score: 530 - 2 May 2010 - 451k - URL: http://www.emcare.org.uk/intranet/csci/Guidance-Compliance.pdf
              may09.pdf
              ... and the Independent Safeguarding Authority will deliver the new scheme as laid down by the Safeguarding Vulnerable Groups Act 2006. The events will explain the scope of the scheme ...
              Terms matched: 1 - Score: 139 - 10 Jul 2009 - 45k - URL: http://www.emcare.org.uk/intranet/minutes/may09.pdf

              • #8
                Running the engine, I get this error message in the browser. You can see the search engine at:

                http://www.emcare.org.uk/zoom/search.php

                Fatal error: Allowed memory size of 8388608 bytes exhausted (tried to allocate 257 bytes) in /home/emcare/www/zoom/search.php on line 1540
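
                For readers decoding this message: the first number is the memory ceiling PHP enforces for the script (its `memory_limit` setting) and the second is the allocation that pushed it over. A minimal Python sketch that pulls both numbers out of the message; the function names here are illustrative only and are not part of Zoom or PHP:

```python
import re

# Pattern for PHP's "memory exhausted" fatal error text, as quoted in this thread.
PHP_MEMORY_ERROR = re.compile(
    r"Allowed memory size of (\d+) bytes exhausted \(tried to allocate (\d+) bytes\)"
)


def parse_php_memory_error(message):
    """Return (limit_bytes, requested_bytes), or None if the text does not match."""
    match = PHP_MEMORY_ERROR.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def to_mib(num_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


msg = (
    "Fatal error: Allowed memory size of 8388608 bytes exhausted "
    "(tried to allocate 257 bytes) in /home/emcare/www/zoom/search.php on line 1540"
)
limit, requested = parse_php_memory_error(msg)
print(f"limit = {to_mib(limit):g} MiB, failed request = {requested} bytes")
# prints: limit = 8 MiB, failed request = 257 bytes
```

                So the script was capped at 8 MiB and had already used nearly all of it when a 257-byte request failed.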

                • #9
                  OK....

                  Here's a new folder I've created with pptx files, try viewing this:
                  http://www.emcare.org.uk/dave/Spanglish.pptx

                  Now pick up some text and use this search engine:
                  http://www.emcare.org.uk/dave/search.php

                  • #10
                    Fatal error: Allowed memory size of 8388608 bytes exhausted
                    For this error see this FAQ
                    "Fatal error: Allowed memory size of 8388608 bytes exhausted..." or similar error message

                    • #11
                      Originally posted by grahamtinley:
                      OK....

                      Here's a new folder I've created with pptx files, try viewing this:
                      http://www.emcare.org.uk/dave/Spanglish.pptx

                      Now pick up some text and use this search engine:
                      http://www.emcare.org.uk/dave/search.php
                      That PPT file was not indexed. You should check the Index Log (the Log tab in the Indexer window) after indexing to see if some files are not indexing for a given reason. See "Skipped" messages, and Error/Warning messages.

                      If you are using Spider Mode (as implied in your original post), then you need to make sure there is a link to this file from one of the pages you are crawling. See this FAQ:
                      Q. I am indexing with spider mode but it is not finding all the pages on my web site
                      --Ray
                      Wrensoft Web Software
                      Sydney, Australia
                      Zoom Search Engine

                      • #12
                        Here's the log after running the engine again:

                        07:40:27 - Start indexing (spider mode) at Fri Oct 08 07:40:27 2010 (922103 bytes)
                        07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/icons/blank.gif (External site - does not match base URL)
                        07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/dave/?N=D (Blocked by extensions list)
                        07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/dave/?M=A (Blocked by extensions list)
                        07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/dave/?S=A (Blocked by extensions list)
                        07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/dave/?D=A (Blocked by extensions list)
                        07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/icons/back.gif (External site - does not match base URL)
                        07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/ (External site - does not match base URL)
                        07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/icons/unknown.gif (External site - does not match base URL)
                        07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/icons/text.gif (External site - does not match base URL)
                        07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/dave/zoom_dictionary.zdat (Blocked by extensions list)
                        07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/dave/zoom_pagedata.zdat (Blocked by extensions list)
                        07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/dave/zoom_pageinfo.zdat (Blocked by extensions list)
                        07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/dave/zoom_pagetext.zdat (Blocked by extensions list)
                        07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/dave/zoom_wordmap.zdat (Blocked by extensions list)
                        07:40:36 - [SKIPPED] Skipping http://www.wrensoft.com/zoom/ (External site - does not match base URL)
                        07:40:38 - Indexing completed at Fri Oct 08 07:40:38 2010



                        Errors and warnings both show zero.

                        • #13
                          It would help if you included the Scanned messages, or told us whether the "Spanglish.pptx" file showed up as being scanned in the index log.

                          If it did, did you upload the resulting index files? It is evident from the directory listing here:
                          http://www.emcare.org.uk/dave/

                          that your index files are all dated 7th Oct, while the index log above is from 8th Oct. So you most likely didn't upload the new index files.
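
                          For anyone who would rather not read a long index log by eye, here is a rough Python sketch that tallies the "[SKIPPED]" reasons and lists any line mentioning a given file. It assumes only the "[SKIPPED]" line layout shown in the log posted in this thread. The format of the "Scanned" lines is not shown here, so the file check is a plain case-insensitive substring search. The function names are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# 07:40:29 - [SKIPPED] Skipping http://host/path (Blocked by extensions list)
SKIP_LINE = re.compile(r"^\d{2}:\d{2}:\d{2} - \[SKIPPED\] Skipping (\S+) \((.+)\)$")


def summarize_skips(log_text):
    """Count how many URLs were skipped for each stated reason."""
    reasons = Counter()
    for line in log_text.splitlines():
        match = SKIP_LINE.match(line.strip())
        if match:
            reasons[match.group(2)] += 1
    return reasons


def lines_mentioning(log_text, filename):
    """Return every log line that mentions the filename (case-insensitive)."""
    needle = filename.lower()
    return [line.strip() for line in log_text.splitlines() if needle in line.lower()]


# A few lines copied from the log posted earlier in this thread.
sample = """
07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/icons/blank.gif (External site - does not match base URL)
07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/dave/?N=D (Blocked by extensions list)
07:40:29 - [SKIPPED] Skipping http://www.emcare.org.uk/dave/zoom_dictionary.zdat (Blocked by extensions list)
07:40:36 - [SKIPPED] Skipping http://www.wrensoft.com/zoom/ (External site - does not match base URL)
"""

for reason, count in summarize_skips(sample).items():
    print(f"{count} skipped: {reason}")
print("Lines mentioning Spanglish.pptx:", lines_mentioning(sample, "Spanglish.pptx"))
```

                          An empty list from the second call means the file never appears in the pasted log at all, neither as scanned nor as skipped.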
                          --Ray
                          Wrensoft Web Software
                          Sydney, Australia
                          Zoom Search Engine

                          • #14
                            OK, here's the index.html, linking to all of the pptx files in the same directory:
                            http://www.emcare.org.uk/dave/

                            No errors, no warnings, no skipped, all files have green backgrounds in the log to show they were spidered ok.

                            The engine still cannot find text....

                            • #15
                              Please read my last message. Maybe you missed it.

                              Originally posted by grahamtinley:
                              No errors, no warnings, no skipped, all files have green backgrounds in the log to show they were spidered ok.
                              Yes, and was one of the green messages "Spanglish.pptx"?

                              And did you upload the resulting index files? Because last I checked, you didn't.

                              You need to upload the index files after indexing.

                              UPDATE: Whoa, hang on, it's worse than that. The above is still a problem (you didn't upload the index files), but your web server is also misconfigured and is not serving the PPTX files correctly.

                              Go to that index page in your browser. Click on any of those PPTX file links and try to open it.

                              IE will think that they are ZIP files and won't recognize them as PPTX files.

                              The reason is that your server is currently serving them with an incorrect Content-Type. They are served as "text/plain", which is wrong.

                              If you are configuring and setting up this web server yourself, you need to fix up your Apache configuration.

                              If you have a hosting company do this for you, you need to show them this and tell them to fix your server configuration so that the Content-Type is correct for PPTX files.
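
                              To confirm what the server actually sends, one option is to request only the headers and compare the Content-Type against the registered media type for .pptx. The following is a minimal Python sketch using the standard library; the helper names are illustrative, and the URL in the usage note is the one from this thread.

```python
import urllib.request

# Registered media type for Office Open XML presentations (.pptx).
PPTX_MIME = "application/vnd.openxmlformats-officedocument.presentationml.presentation"


def media_type(content_type_header):
    """Drop parameters such as '; charset=UTF-8' and normalise the case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_pptx_type(content_type_header):
    """True if the header names the .pptx media type."""
    return media_type(content_type_header) == PPTX_MIME


def fetch_content_type(url, timeout=10):
    """Send a HEAD request and return the raw Content-Type header ('' if absent)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")
```

                              For example, `is_pptx_type(fetch_content_type("http://www.emcare.org.uk/dave/Spanglish.pptx"))` should return False while the server is still sending "text/plain". On Apache, the usual remedy is an `AddType` directive (mod_mime) mapping that media type to the `.pptx` extension, though the exact place to put it depends on how the host is set up.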
                              --Ray
                              Wrensoft Web Software
                              Sydney, Australia
                              Zoom Search Engine
