Plugin and spider - Not finding word documents


  • Plugin and spider - Not finding word documents

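    I start up Zoom Search from within my app:

    -----
    // Needs "using System.Diagnostics;" and a reference to System.Configuration.
    Process p = new Process();
    p.StartInfo.WorkingDirectory = MyPath;
    // Path to ZoomIndexer.exe, read from the application's config file.
    p.StartInfo.FileName = System.Configuration.ConfigurationManager.AppSettings["ZoomIndexer"];
    // Pass the Zoom config file with the -s switch.
    p.StartInfo.Arguments = " -s " + MyPath + "/zoom.zcfg";
    p.StartInfo.CreateNoWindow = true;
    p.StartInfo.UseShellExecute = false;
    p.Start();
    p.WaitForExit(); // Blocks until the indexer process exits.
    -----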

    Everything works perfectly!

    Now I also want to index ".doc" files, so I downloaded the plugin, unzipped it, and moved the word2txt.exe file to my /plugins folder.

    The .doc type now appears, and I enabled it in "Config/Scan Options".

    I looked in my "zoom.zcfg" file and saw that the ".doc" type had been added.

    I then ran my application and started "Zoom Indexer" as posted above.

    No Word documents are found or indexed (example: http://localhost/mydoc.doc).

    The size of the Word document does not exceed my settings.

    (My "zoom.zcfg" is local to the application, and ZoomIndexer.exe and the plugins are located under C:\Programmer\Zoom Search Engine 5.0.)

    Where am I going wrong?

    Regards
    Bo Hessner
    Last edited by hessner; Feb-13-2007, 08:52 PM.

  • #2
    I guess there are a few possibilities.

    1) The Zoom config file that you created with .DOC files enabled is not the config file you are using when you call Zoom from your process.

    2) There were no .DOC files available to index, either because there really are no .DOC files, or because the .DOC files that you have are not linked to and thus not found by the indexer.

    3) The Word documents were found, but there was a problem indexing them, e.g. they were password protected.

    You should turn on debug logging in Zoom and then look through the log to see whether the .DOC files were found, whether they were skipped, and whether there were any error messages.


    • #3
      Thanks, the debug logging put me back on track:

      "Can not write file C:\Programmer\Zoom Search Engine 5.0\zoom_plugin.in"

      I then granted "Full access" to "C:\Programmer\Zoom Search Engine 5.0" and started it again.

      Now zoom_plugin.in (what is it?) gets created with a size of 31 KB, and my ZoomIndexer.exe process just keeps running until I kill it.

      The error in the log file is gone.

      But the log file did not get as far as "Deleting presaved index data..."; it ended just before that line.

      The zoom.ini file is also touched.

      After the kill, my index is not updated.

      1. Why write zoom_plugin.in (what is it?) into "C:\Programmer\Zoom Search Engine 5.0", and can I get the system to write elsewhere?
      2. What's the "hanging" about?

      Kind regards
      Bo Hessner


      • #4
        Originally posted by hessner
        1. Why write zoom_plugin.in (what is it?) into "C:\Programmer\Zoom Search Engine 5.0", and can I get the system to write elsewhere?
        This is a temporary file required for the plugin operation. It is currently written to the folder where Zoom is installed. Unfortunately, this means that Zoom requires write access to that folder.

        With the increased security measures in place in Vista, and on more and more desktop machines, this has become more of an issue recently. We are planning to change this behaviour and write the temporary files to either the user-specified Output Directory or the corresponding User folders in Vista. We'll look into whether it's possible to include these changes in the upcoming build.

        Originally posted by hessner
        Now zoom_plugin.in (what is it?) gets created with a size of 31 KB, and my ZoomIndexer.exe process just keeps running until I kill it.
        ...
        2. What's the "hanging" about?
        I suspect there may be something special about your DOC file and it may have caused the word2txt.exe plugin to crash.

        Here's a common problem with invalid DOC files (files in another format that have been renamed to .DOC):
        http://www.wrensoft.com/zoom/support...l#word2txt_rtf

        If you can send us a copy of the DOC file in question, we can check whether there is a bug in the plugin.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine


        • #5
          Thanks for your answers.

          With the increased security measures in place in Vista, and on more and more desktop machines, this has become more of an issue recently. We are planning to change this behaviour and write the temporary files to either the user-specified Output Directory or the corresponding User folders in Vista. We'll look into whether it's possible to include these changes in the upcoming build.
          1. When do you plan to implement this? (I have a customer asking.)
          2. I have sent you the Word document.

          Regards
          Bo Hessner


          • #6
            We've implemented the changes regarding where the "zoom_plugin.in" file is written in the latest build. Zoom no longer requires write permission to the installation folder.

            You can download the latest version (V5.0 build 1004) from here:
            http://www.wrensoft.com/zoom/whatsnew.html

            We also had a look at the DOC file you sent us. We did not have any problems indexing this file - the plugin did not "hang" or exhibit any other issues.

            Please try the latest version and let us know if you continue to have this problem. If so, you may want to check the index log carefully to make sure that this was the actual file that it was "hanging" on - perhaps you can paste us an extract of your index log to indicate this.
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine


            • #7
              Thanks, that was fast!

              Removed requirement for ZoomIndexer.exe to have write permissions to the folder in which Zoom is installed. Zoom will now write temporary files (e.g. "zoom.ini" and "zoom_plugin.in", etc.) in the Windows assigned User folders (e.g. "Documents and Settings" on XP).
              I am running in console mode (from within an ASP.NET 2.0 application) and would like to control the output folder, because several instances could run at the same time. I predict your approach would produce a write conflict.

              I will download the latest build tonight and try it out.

              By the way, Word is not installed on the machine running Zoom, but that is not required, is it?

              I'll be back.


              • #8
                OK, I have now downloaded the latest Version 5 build (Professional).

                Here are my findings:

                1. A normal run, without .doc & .pdf plugins:

                02/14/07 21:49:03 - Start indexing (spider mode)
                02/14/07 21:49:04 - Broken link found on page: http://localhost/default.aspx?f=1&s=81
                02/14/07 21:49:04 - Could not download file: http://localhost/default.aspx?f=1&s=811 (File not found)
                02/14/07 21:49:04 - Could not download file: Plugin not installed for this file format (Image file)
                02/14/07 21:49:04 - Could not download file: Plugin not installed for this file format (Image file)
                02/14/07 21:49:04 - Broken link found on page: http://localhost/default.aspx?f=1&s=1156
                02/14/07 21:49:04 - Could not download file: http://localhost/forum (File not found)
                02/14/07 21:49:06 - Deleting presaved index data...
                02/14/07 21:49:06 - Deleting pageinfo data...
                02/14/07 21:49:06 - Deleting miscellaneous buffers...
                02/14/07 21:49:06 - Deleting URL history...
                02/14/07 21:49:06 - Deleting duplicate page history...
                02/14/07 21:49:06 - Writing out the dictionary...
                02/14/07 21:49:06 - Indexing completed
                02/14/07 21:49:06 - INDEX SUMMARY
                02/14/07 21:49:06 - Files indexed: 28
                02/14/07 21:49:06 - Files skipped: 222
                02/14/07 21:49:06 - Files filtered: 0
                02/14/07 21:49:06 - Files downloaded: 28
                02/14/07 21:49:06 - Unique words found: 1859
                02/14/07 21:49:06 - Total words found: 7541
                02/14/07 21:49:06 - Avg. unique words per page: 66
                02/14/07 21:49:06 - Avg. words per page: 269
                02/14/07 21:49:06 - Start index time: 21:49:03 (2007/02/14)
                02/14/07 21:49:06 - Elapsed index time: 00:00:03
                02/14/07 21:49:06 - Errors: 4
                02/14/07 21:49:06 - URLs visited by spider: 37
                02/14/07 21:49:06 - URLs in spider queue: 0
                02/14/07 21:49:06 - Total bytes scanned/downloaded: 618832
                02/14/07 21:49:06 - File extensions:
                02/14/07 21:49:06 - .aspx indexed: 28
                02/14/07 21:49:06 - No extensions indexed: 0
                02/14/07 21:49:06 - Deleting wordmap data...
                02/14/07 21:49:06 - Deleting presaved index data...
                02/14/07 21:49:06 - Deleting pageinfo data...
                02/14/07 21:49:06 - Deleting miscellaneous buffers...
                02/14/07 21:49:06 - Deleting URL history...
                02/14/07 21:49:06 - Deleting duplicate page history...
                All is well and I can search my index. Success.


                2. A run with the .doc plugin:

                02/14/07 21:57:37 - Start indexing (spider mode)
                02/14/07 21:57:39 - Broken link found on page: http://localhost/default.aspx?f=1&s=81
                02/14/07 21:57:39 - Could not download file: http://localhost/default.aspx?f=1&s=811 (File not found)
                02/14/07 21:57:39 - Could not download file: Plugin not installed for this file format (Image file)
                02/14/07 21:57:39 - Could not download file: Plugin not installed for this file format (Image file)
                02/14/07 21:57:39 - Broken link found on page: http://localhost/default.aspx?f=1&s=1156
                02/14/07 21:57:39 - Could not download file: http://localhost/forum (File not found)
                It hangs, and "ZoomIndexer.exe" appears twice in Task Manager, at 0% CPU and using 28924 KB and 980 KB of memory. After 45 minutes I killed it.


                3. A run with the .pdf plugin:

                02/14/07 21:45:49 - Start indexing (spider mode)
                02/14/07 21:45:51 - Broken link found on page: http://localhost/default.aspx?f=1&s=81
                02/14/07 21:45:51 - Could not download file: http://localhost/default.aspx?f=1&s=811 (File not found)
                02/14/07 21:45:52 - Could not download file: Plugin not installed for this file format (Image file)
                02/14/07 21:45:52 - Could not download file: Plugin not installed for this file format (Image file)
                02/14/07 21:45:52 - Broken link found on page: http://localhost/default.aspx?f=1&s=1156
                02/14/07 21:45:52 - Could not download file: http://localhost/forum (File not found)
                02/14/07 21:45:53 - Deleting presaved index data...
                02/14/07 21:45:53 - Deleting pageinfo data...
                02/14/07 21:45:53 - Deleting miscellaneous buffers...
                02/14/07 21:45:53 - Deleting URL history...
                02/14/07 21:45:53 - Deleting duplicate page history...
                02/14/07 21:45:53 - Writing out the dictionary...
                02/14/07 21:45:53 - Indexing completed
                02/14/07 21:45:53 - INDEX SUMMARY
                02/14/07 21:45:53 - Files indexed: 29
                02/14/07 21:45:53 - Files skipped: 220
                02/14/07 21:45:53 - Files filtered: 0
                02/14/07 21:45:53 - Files downloaded: 29
                02/14/07 21:45:53 - Unique words found: 1904
                02/14/07 21:45:53 - Total words found: 7910
                02/14/07 21:45:53 - Avg. unique words per page: 65
                02/14/07 21:45:53 - Avg. words per page: 272
                02/14/07 21:45:53 - Start index time: 21:45:49 (2007/02/14)
                02/14/07 21:45:53 - Elapsed index time: 00:00:04
                02/14/07 21:45:53 - Errors: 4
                02/14/07 21:45:53 - URLs visited by spider: 38
                02/14/07 21:45:53 - URLs in spider queue: 0
                02/14/07 21:45:53 - Total bytes scanned/downloaded: 679492
                02/14/07 21:45:53 - File extensions:
                02/14/07 21:45:53 - .aspx indexed: 28
                02/14/07 21:45:53 - .pdf indexed: 1
                02/14/07 21:45:53 - No extensions indexed: 0
                02/14/07 21:45:53 - Deleting wordmap data...
                02/14/07 21:45:53 - Deleting presaved index data...
                02/14/07 21:45:53 - Deleting pageinfo data...
                02/14/07 21:45:53 - Deleting miscellaneous buffers...
                02/14/07 21:45:53 - Deleting URL history...
                02/14/07 21:45:53 - Deleting duplicate page history...
                The ZoomIndexer.exe process ends normally, but an AcroRd32.exe process stays active in Task Manager. This means that PDF documents cannot be opened at all. When I then search my index, none of the words inside the PDF document appear in the result set.

                4. I still have an open question about the writing of "zoom_plugin.in". Is it a correct assumption that only one instance of ZoomIndexer can run as long as I cannot control the location of this file?

                5. Microsoft Word is not installed on this machine. Can this have anything to do with the ".doc" plugin trouble I have?

                Regards
                Bo Hessner
                Last edited by hessner; Feb-14-2007, 11:25 PM.


                • #9
                  Originally posted by hessner
                  2. A run with the .doc plugin:
                  02/14/07 21:57:37 - Start indexing (spider mode)
                  02/14/07 21:57:39 - Broken link found on page: http://localhost/default.aspx?f=1&s=81
                  02/14/07 21:57:39 - Could not download file: http://localhost/default.aspx?f=1&s=811 (File not found)
                  02/14/07 21:57:39 - Could not download file: Plugin not installed for this file format (Image file)
                  02/14/07 21:57:39 - Could not download file: Plugin not installed for this file format (Image file)
                  02/14/07 21:57:39 - Broken link found on page: http://localhost/default.aspx?f=1&s=1156
                  02/14/07 21:57:39 - Could not download file: http://localhost/forum (File not found)
                  It hangs, and "ZoomIndexer.exe" appears twice in Task Manager, at 0% CPU and using 28924 KB and 980 KB of memory. After 45 minutes I killed it.
                  It looks like you have turned off all logging messages except for Errors. This isn't giving us a very useful picture of what's really happening. I suspect the "hanging" part is occurring on a particular file with the message "Indexing DOC file ...." or "Processing DOC file ....", but we don't see this because these message types are turned off. It might even be something else like "Downloading file ..." and it's waiting for your server to respond.

                  You should turn on Verbose Mode to get a complete picture of what's happening. You can save the index log to disk if it gets too large to analyse from within Zoom (see the options on the "Indexing options" tab of the Configuration window).

                  BTW - a normal single instance of Zoom appears as two "ZoomIndexer.exe" processes in the Task Manager window. So what you see there is normal.

                  Originally posted by hessner
                  The ZoomIndexer.exe process ends normally, but an AcroRd32.exe process stays active in Task Manager. This means that PDF documents cannot be opened at all.
                  AcroRd32.exe is never called or used by Zoom. This is Adobe's Acrobat Reader. A well known trait of the Acrobat Reader for browsers is that it stays around in memory even after you've closed the browser window - or left the PDF page. This is not related to Zoom in any way.

                  Zoom does not use Acrobat Reader to process PDF files. It also does not use Microsoft Word to process DOC files.

                  Originally posted by hessner
                  When I then search my index, none of the words inside the PDF document appear in the result set.
                  There can be many possible reasons why this is so, but the quickest way for us to tell you is if you can provide us a copy of the PDF in question, and the words you're searching for.

                  Here is one of many possibilities from our FAQ:
                  Q. Why can't I find words from my scanned PDF files? (PDFs created from scanning in physical documents)

                  You should also confirm that the PDF file was actually indexed successfully. Turning on Verbose Mode messages should help.

                  Originally posted by hessner
                  4. I still have an open question about the writing of "zoom_plugin.in". Is it a correct assumption that only one instance of ZoomIndexer can run as long as I cannot control the location of this file?
                  Yes, only one instance of Zoom can be indexing on a machine at any one time if you are using plugins.

                  In retrospect, perhaps we should have written the temporary plugin files to the output directory after all, as this would have solved that problem. We'll consider it further.
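
For a caller that starts the indexer from a web application, one way to live with the one-instance limit is to serialize the runs. The sketch below is only an illustration and not part of Zoom. It uses a system-wide named Mutex from .NET; the names IndexerGate, TryRunExclusive and the mutex name Global\ZoomIndexerRun are made up for this example, and it assumes every caller on the machine goes through the same gate.

```csharp
using System;
using System.Threading;

public static class IndexerGate
{
    // Hypothetical name: any string shared by all callers on the machine works.
    private const string MutexName = @"Global\ZoomIndexerRun";

    // Runs indexingJob only if no other caller currently holds the gate.
    // Returns false if the gate was still busy after waiting for `wait`.
    public static bool TryRunExclusive(Action indexingJob, TimeSpan wait)
    {
        using (var gate = new Mutex(false, MutexName))
        {
            bool acquired;
            try
            {
                acquired = gate.WaitOne(wait);
            }
            catch (AbandonedMutexException)
            {
                // A previous holder exited without releasing; this thread owns the mutex now.
                acquired = true;
            }

            if (!acquired)
            {
                return false;
            }

            try
            {
                indexingJob();
                return true;
            }
            finally
            {
                // Must be released on the same thread that acquired it.
                gate.ReleaseMutex();
            }
        }
    }
}
```

A caller would pass the code that starts ZoomIndexer.exe and waits for it as indexingJob; a false return means another run still held the gate. In a web application running under several accounts, opening a Global\ mutex may need extra permissions, which this sketch does not handle.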

                  Originally posted by hessner
                  5. Microsoft Word is not installed on this machine. Can this have anything to do with the ".doc" plugin trouble I have?
                  No, as mentioned above, the word2txt plugin does not use or require Microsoft Word to process DOC files.
                  Last edited by Ray; Feb-15-2007, 12:00 AM.
                  --Ray
                  Wrensoft Web Software
                  Sydney, Australia
                  Zoom Search Engine


                  • #10
                    It looks like you have turned off all logging messages except for Errors. This isn't giving us a very useful picture of what's really happening
                    You are right, I will set debug mode to 1 and run all the tests again, tonight. My mistake.

                    In retrospect, perhaps we should have written the temporary plugin files to the output directory after all, as this would have solved that problem. We'll consider it further.
                    I would really appreciate it if you implemented this; otherwise I can't, in practice, use the plugins.

                    Kind regards
                    Bo Hessner


                    • #11
                      Ok, now I have switched debug on.

                      Here are my findings:

                      1. A run with the .doc plugin:

                      zoom.zcfg file:

                      __5_0
                      #STARTDIR:
                      #SPIDERURL:http://localhost/zoomSearch.aspx?f=1
                      #BASEURL:http://localhost/
                      #OUTDIR:C:\Inetpub\Jeresforening2CS\uploads\I1\Search
                      #SPIDERURLTYPE:0
                      #SPIDERURLUSELIMIT:0
                      #SPIDERURLLIMIT:0
                      #USE-CRC:1
                      #CURRENTMODE:1
                      #DLTHREADS:4
                      #NOCACHE:0
                      #BEEP-ON-FINISH:0
                      #OUTPUT:CGI
                      #OUTPUT_OS:0
                      #VERBOSE:1
                      #LOGOPTIONS:ERROR|WARNING|SUMMARY|BROKEN|
                      #LOGWRITETOFILE:1
                      #LOGWRITETOFILENAME:C:\Inetpub\Jeresforening2CS\Uploads\I1\Search\zoomindexer.log
                      #LOGDEBUGMODE:1
                      #SCAN_NOEXTENSION:1
                      #SCAN_FILELINKS:0
                      #SCAN_USELOCALDESCPATH:0
                      #SCAN_LOCALDESCPATH:
                      #INDEXOPTIONS:METADESC|CONTENT|TITLE|KEYWORDS|
                      #RESULTOPTIONS:NUMBER|TITLE|METADESC|CONTEXT|SCORE|URL|FILESIZE|
                      #USE-UTF8:0
                      #CODEPAGE:28591
                      #ZLANGFILE:danish.zlang
                      #SKIPUNDERSCORE:1
                      #MINWORDLEN:2
                      #FORMFORMAT:0
                      #HIGHLIGHTING:1
                      #GOTOHIGHLIGHT:1
                      #USEXML:0
                      #XMLTITLE:
                      #XMLDESC:
                      #XMLURL:
                      #XML_OPENSEARCH_DESCURL:
                      #LOGGING:0
                      #LOGGING_FILE:./logs/searchwords.log
                      #TIMING:1
                      #NOCHARSET:1
                      #DEFAULT_TO_AND:1
                      #CONTEXTSIZE:30
                      #EXACTPHRASE:500
                      #SEARCHASSUBSTRING:0
                      #NO_TOLOWER:0
                      #ZOOMINFO:0
                      #USEDATETIME:0
                      #WORDJOINCHARS:.-_'
                      #ZOOMIMAGE:0
                      #SPELLING:0
                      #SPELLINGWHENLESSTHAN:5
                      #LINKBACKURL:sammeside
                      #WIZARD_UPLOADREQD:0
                      #REPORTUSEDATES:0
                      #WORDWEIGHT_TITLE:0
                      #WORDWEIGHT_DESC:0
                      #WORDWEIGHT_KEYWORDS:0
                      #WORDWEIGHT_FILENAME:0
                      #WORDWEIGHT_HEADINGS:0
                      #WORDWEIGHT_LINKTEXT:0
                      #WORDWEIGHT_DENSITY:1
                      #WORDWEIGHT_SHORTURLS:1
                      #USE-AUTH:0
                      #USE-COOKIES:1
                      #BINUSEDESC:0
                      #PLUGIN_DESCFILES:
                      #PLUGIN_USEMETA:PDF|DOC|PPT|RTF|SWF|WPD|XLS|DJVU|IMAGE|MP3|DWF|
                      #PLUGIN_USETECHNICAL:MP3|IMAGE|DWF|
                      #PLUGIN_PDF_METHOD:0
                      #PLUGIN_PDF_HIGHLIGHT:1
                      #PLUGIN_IMG_MINFILESIZE:5
                      #MAXPAGES_LIMIT:1000
                      #MAXWORDS_LIMIT:50000
                      #MAXFILESIZE_LIMIT:1048576
                      #DESCLENGTH_LIMIT:150
                      #OPTIMIZE_SETTING:3
                      #EXTENSIONS_START
                      .aspx
                      .doc
                      #EXTENSIONS_END
                      #SKIPPAGES_START
                      #SKIPPAGES_END
                      #SKIPWORDS_START
                      og
                      eller
                      den
                      det
                      der

                      af
                      en
                      af
                      de
                      os
                      med
                      #SKIPWORDS_END
                      #USECATS:1
                      #USEDEFCATNAME:0
                      #SEARCHMULTICATS:1
                      #CATEGORIES_START
                      Misc.
                      UCat0

                      Alle|0|
                      UCat1

                      Nyhedsbrev|0|
                      UCat2

                      Medlem|0|
                      UCat3

                      Redaktør|0|
                      UCat4

                      Admins|0|
                      UCat5

                      Super|0|
                      #CATEGORIES_END
                      #RECOMMENDED_MAX:3
                      #USEFILTER:0
                      #FILTER_START
                      #FILTER_END
                      #SITEMAP_TXT:0
                      #SITEMAP_XML:0
                      #SITEMAP_UPLOAD:0
                      #SITEMAP_UPLOADPATH:
                      #SITEMAP_USEPAGEBOOST:1
                      #SITEMAP_BASEURL:http://www.mywebsite.com/

                      zoomindexer.log file:

                      02/15/07 17:42:09 - Config file loaded: C:\Inetpub\Jeresforening2CS\uploads\I1\Search/zoom.zcfg
                      02/15/07 17:42:11 - Start indexing (spider mode)
                      02/15/07 17:42:11 - Maximum number of words: 50000
                      02/15/07 17:42:11 - Maximum number of files: 1000
                      02/15/07 17:42:11 - Will scan files with extensions
                      02/15/07 17:42:11 - .aspx
                      02/15/07 17:42:11 - .doc
                      02/15/07 17:42:11 - Spider from: http://localhost/zoomSearch.aspx?f=1
                      02/15/07 17:42:11 - Web site URL: http://localhost/
                      02/15/07 17:42:11 - Estimated RAM required during index process: 36948 KB
                      02/15/07 17:42:11 - Initiating HTTP session (thread #1) ...
                      <REMOVED TO FIT IN THREAD>
                      02/15/07 17:42:14 - Indexing http://localhost/default.aspx?f=1&s=1155
                      02/15/07 17:42:14 - Index Thread got ready buffer for http://localhost/Uploads/D1/Hjemmeside.doc (Content-type: Word document)
                      02/15/07 17:42:14 - DL Thread #1, got URL (http://localhost/default.aspx?f=1&s=1162&FP5603=114) off queue
                      02/15/07 17:42:14 - Processing DOC file http://localhost/Uploads/D1/Hjemmeside.doc
                      02/15/07 17:42:14 - Downloading file http://localhost/default.aspx?f=1&s=1162&FP5603=114
                      02/15/07 17:42:14 - Could not download file: Plugin not installed for this file format (Image file)
                      02/15/07 17:42:14 - DL Thread #3, got URL (http://localhost/) off queue
                      02/15/07 17:42:14 - Downloading file http://localhost/
                      02/15/07 17:42:14 - URL redirected to: http://localhost/default.aspx?f=1&s=81 [thread #2]
                      02/15/07 17:42:15 - Redirected file already scanned [thread #2]
                      02/15/07 17:42:15 - URL redirected to: http://localhost/default.aspx?f=1&s=81 [thread #3]
                      02/15/07 17:42:15 - Redirected file already scanned [thread #3]
                      02/15/07 17:42:15 - DL Thread #2, got URL (http://localhost/default.aspx?f=1&s=590) off queue
                      02/15/07 17:42:15 - Downloading file http://localhost/default.aspx?f=1&s=590
                      It hangs here. I have tried adjusting the number of threads, and then it stops a little later or earlier, but always around the .DOC handling. If I remove the .DOC type, the log file continues to grow and everything ends successfully.


                      Regarding the PDF trouble, I will send you my PDF file in an e-mail.
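
When the indexer has to be killed, the last plugin line in the log is a quick pointer to the file it was working on. The sketch below is a hypothetical helper, not a Zoom tool. It relies only on the " - Processing DOC file <url>" and " - Indexing completed" lines that appear in the logs in this thread, and the class and method names are invented for this example.

```csharp
using System;
using System.Collections.Generic;

public static class IndexLogInspector
{
    private const string DocMarker = " - Processing DOC file ";
    private const string DoneMarker = " - Indexing completed";

    // Returns the URL from the last "Processing DOC file" line, or null if there is none.
    public static string LastDocBeingProcessed(IEnumerable<string> logLines)
    {
        string last = null;
        foreach (string line in logLines)
        {
            int at = line.IndexOf(DocMarker, StringComparison.Ordinal);
            if (at >= 0)
            {
                last = line.Substring(at + DocMarker.Length).Trim();
            }
        }
        return last;
    }

    // True if the log contains an "Indexing completed" line, i.e. the run reached its end.
    public static bool IndexingCompleted(IEnumerable<string> logLines)
    {
        foreach (string line in logLines)
        {
            if (line.IndexOf(DoneMarker, StringComparison.Ordinal) >= 0)
            {
                return true;
            }
        }
        return false;
    }
}
```

The lines could come from System.IO.File.ReadLines on the log path set in #LOGWRITETOFILENAME. With several download threads the last DOC line is only a hint, since other threads keep writing to the log after the plugin has stalled.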

                      Regards
                      Bo Hessner
                      Last edited by hessner; Feb-15-2007, 10:14 PM.


                      • #12
                        Ahhhh, I found the error.

                        Instead of continuing to start it in the background, I started ZoomIndexer from Windows and discovered the error.

                        The word2txt.exe file was blocked and marked as unsafe. When I removed the block, everything worked perfectly.

                        Regarding the problem with not finding documents in my search, I suspect it has to do with my use of categories to filter pages.

                        Now I will look into the categories/Word/PDF connections.

                        Regards
                        Bo Hessner


                        • #13
                          Windows does not block executables that do not access the internet (and the word2txt.exe plugin doesn't), so I suspect you actually have some form of anti-virus/anti-spyware software running. Can you tell us what exactly "blocked" the plugin?

                          And yes, we do recommend running the indexing manually (with the GUI) prior to scheduling the task, to ensure that your scheduled task is set up properly.
                          --Ray
                          Wrensoft Web Software
                          Sydney, Australia
                          Zoom Search Engine
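
A hidden process cannot show a prompt, so a background launch like the one in the first post can wait indefinitely. A minimal sketch of a bounded wait follows, using the same Process API as the first post. The class name IndexerLauncher, the timeout, and the assumption that ZoomIndexer accepts a quoted path after -s are made up for this example and are not confirmed by Wrensoft.

```csharp
using System;
using System.Diagnostics;

public static class IndexerLauncher
{
    // Builds the "-s <config>" argument string from the first post, quoting the
    // path so that a folder name containing spaces stays a single argument.
    public static string BuildArguments(string configPath)
    {
        return "-s \"" + configPath + "\"";
    }

    // Starts the indexer hidden and waits at most `timeout`.
    // Returns the exit code, or null if the process had to be killed.
    public static int? RunWithTimeout(string indexerExe, string configPath, TimeSpan timeout)
    {
        var info = new ProcessStartInfo
        {
            FileName = indexerExe,
            Arguments = BuildArguments(configPath),
            CreateNoWindow = true,
            UseShellExecute = false
        };

        using (Process p = Process.Start(info))
        {
            if (p.WaitForExit((int)timeout.TotalMilliseconds))
            {
                return p.ExitCode;
            }

            // Still running after the timeout: probably stuck, for example on a
            // prompt nobody can see. Kill it and let the caller report a failure.
            p.Kill();
            p.WaitForExit();
            return null;
        }
    }
}
```

A null result tells the calling application to report the run as failed rather than block the request; running the indexer once from the GUI, as recommended above, is still the way to find out why it stalled.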


                          • #14
                            OK, it turns out that what hessner was referring to was the XP SP2 warning for unsigned .exe files that were downloaded from the Internet. This shows a window stating "The publisher could not be verified. Are you sure you want to run this software?"

                            However, you normally should not see this on the plugin executable itself because the file downloaded from our website is a ZIP file (or the Installer package). We have tried unzipping the file using Windows, as well as WinZip and other alternatives, and in all cases we came across, the extracted .exe file did not inherit this attribute from the ZIP file. So the word2txt.exe was never "blocked" this way.

                            Perhaps you unzipped the file on a remote computer, and then downloaded the extracted plugin file to your local computer? In this case, yes, Windows would warn you when running the file for the first time. Or perhaps you are using a different program to unzip which somehow inherits these file properties.

                            While we could sign some of the plugins that we develop, we can't do this for all of them, since many of them are developed by third parties and/or are open source projects.
                            --Ray
                            Wrensoft Web Software
                            Sydney, Australia
                            Zoom Search Engine
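
For reference, the "downloaded from the Internet" mark that Windows XP SP2 and later attach to a file is normally stored in an NTFS alternate data stream named Zone.Identifier, with text such as [ZoneTransfer] followed by ZoneId=3 (3 is the Internet zone). The sketch below only parses that text. How the stream is read is left to the caller, and the class and method names are hypothetical.

```csharp
using System;

public static class ZoneMark
{
    // Parses the text of a Zone.Identifier stream and returns the ZoneId,
    // or null if no ZoneId entry is present.
    public static int? ParseZoneId(string streamText)
    {
        foreach (string raw in streamText.Split('\n'))
        {
            string line = raw.Trim();
            if (line.StartsWith("ZoneId=", StringComparison.OrdinalIgnoreCase))
            {
                int id;
                if (int.TryParse(line.Substring("ZoneId=".Length), out id))
                {
                    return id;
                }
            }
        }
        return null;
    }

    // Zone 3 is the Internet zone and zone 4 is Restricted sites.
    public static bool IsMarkedAsDownloaded(string streamText)
    {
        int? id = ParseZoneId(streamText);
        return id.HasValue && id.Value >= 3;
    }
}
```

The Unblock button in the file's Properties dialog removes the mark, which matches the "removed the block" step described earlier in the thread.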
