any limitations on file extensions?

  • any limitations on file extensions?

    Does Zoom have any limitations on the file extensions it can index? For example, could it index this file?

    http://www.nature.com/nature/journal/v444/n7121/otmi/444799a.otmi

  • #2
    Zoom supports a list of known file extensions for the file types it indexes. File extensions that it does not recognize (such as ".otmi" in this case) are indexed as "Unknown text". This means that Zoom treats the file the same as a ".txt" text file and ignores all formatting within it (so any XML or HTML tags would be indexed as text too). You can easily test this by adding ".otmi" to your Scan Extensions list.

    Note also that Zoom will index a file differently if the server sends a Content-Type header indicating that it is of a different format. For example, your server could return a "text/html" content-type, and Zoom would then filter out the XML tags.
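
    To make the behaviour described above concrete, here is a minimal Python sketch of that kind of type detection. It is not Zoom's code: the function name `classify`, the abbreviated lookup tables, and the choice to check the Content-Type header before the extension are all assumptions made for illustration.

```python
import os
from urllib.parse import urlparse

# Hypothetical, abbreviated tables. Zoom's real lists are longer.
HTML_EXTENSIONS = {".html", ".htm", ".php", ".asp"}
HTML_CONTENT_TYPES = {"text/html"}


def classify(url, content_type=None):
    """Return 'html' or 'unknown text' for a URL and an optional Content-Type."""
    # Assumption: a recognized Content-Type header wins over the extension.
    if content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        mime = content_type.split(";")[0].strip().lower()
        if mime in HTML_CONTENT_TYPES:
            return "html"

    # Otherwise fall back to the extension of the URL path (query string ignored).
    extension = os.path.splitext(urlparse(url).path)[1].lower()
    if extension in HTML_EXTENSIONS:
        return "html"

    # Anything unrecognized is treated as plain text, tags included.
    return "unknown text"


otmi_url = "http://www.nature.com/nature/journal/v444/n7121/otmi/444799a.otmi"
print(classify(otmi_url))                # unknown text
print(classify(otmi_url, "text/html"))   # html
```

    In this sketch the ".otmi" file is only treated as markup when the server says so, which mirrors the workaround described in this post.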

    One of the things we are considering for the next major release is the ability to specify the file type for such unrecognized files. For example, Zoom can already index HTML and XML files, but this is currently limited to files served with the correct Content-Type header, or with a file extension such as ".html", ".htm", ".php", ".asp", and so on. In the future, we could allow users to select the file type that a file extension should be associated with.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine


    • #3
      To add to this: I just noticed that we're not currently recognizing "text/xml" as a content-type, although for indexing purposes it is essentially the same as "text/html". We'll add this to the next public build (5.1.100.

      [Update: Version 6 of Zoom is much more flexible in this regard: you can assign any file type to any indexing method.]
      --Ray


      • #4
        Problem with indexing .xlsx files in a database

        Hello,
        I have a problem with indexing Excel/Office 2007 files stored in a database (MS SQL DB in combination with iGrafx software). The files are not addressed by name or path, but by a link to the database containing a unique object ID. Most of the files are still retrieved and indexed properly by Zoom. So far so good. But not all of the plugins are able to determine the proper file type. In particular, the Office 2007 plugin does not recognize .xlsx files as Excel 2007, and therefore cannot retrieve any useful content from them.
        I configured the .xlsx plugin to "retrieve internal meta information", but it still does not work.
        Other file types such as MS Word or PDF are indexed perfectly, including their meta info.
        Is this a bug, or did I miss something?

        The message in the log file reads:
        Index Thread got ready buffer for http://dnde_igrafx/webcentral/BMS_approved/?objid=1336 (Content-type: Unknown text)

        It would be great if anyone here could give me a hint on how to solve this.

        Thanks, Christian


        • #5
          The example URL that you gave doesn't have a file extension, so there is no information in the URL to allow Zoom to determine the file type.

          Instead, Zoom will look at the HTTP headers returned from the server: in particular the "Content-Type" field, and possibly other fields such as "Location" if they are present.

          As the URL you provided is a private URL, we can't test it from here, but you could check the HTTP header fields yourself. These are probably being set by the script that accesses your database.
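
          One way to check the header fields from a machine inside the intranet is a short script like the Python sketch below. The helper name `fetch_headers` is made up for this example, and it assumes the script behind the URL answers HEAD requests. If it does not, the same headers can be read from a normal GET response.

```python
import urllib.request


def fetch_headers(url, timeout=10):
    """Send a HEAD request and return the response headers as a dict."""
    # HEAD asks the server for the headers only, without the document body.
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())


# Example usage, run from a machine that can reach the server:
# headers = fetch_headers("http://dnde_igrafx/webcentral/BMS_approved/?objid=1336")
# print(headers.get("Content-Type"))
```

          If no Content-Type comes back for an .xlsx document, or only a generic one, that would be consistent with the "Unknown text" entry in the log.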


          • #6
            You might also want to refer to this page, which provides a list of recognized content-types for Office 2007 documents:
            http://www.wrensoft.com/forum/showthread.php?t=2834
            --Ray


            • #7
              Originally posted by wrensoft:
                The example URL that you gave doesn't have a file extension. So there is no information in the URL to allow Zoom to determine the file type...
                So instead Zoom will look at the HTTP headers returned from the server. In particular the "Content-Type" field and maybe other fields if they are present like the "location" field.

              The URL is created by a PHP script which I wrote myself. It points to the file in the database. Maybe I can solve the problem, but how can I set or retrieve the contents of the "Content-Type" HTTP header?

              Originally posted by wrensoft:
                As the URL you provided is a private URL we can't test it from here, but you could check the HTTP header fields. These are probably being set by the script that accesses your database.

              Yes, the URL is an internal one on our intranet, which is why you cannot access it from outside. How can I check the HTTP header fields?
              And how can I let Zoom know the correct content-type of the file when there is no file extension and the HTTP header does not provide it?
              Last edited by forestgreen; Mar-11-2010, 04:50 PM.


              • #8
                At the risk of stating the obvious: to set the HTTP header in PHP, use the PHP header() function.
                http://php.net/manual/en/function.header.php
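
                In PHP this would be a call along the lines of `header('Content-Type: ...');` made before any output is sent. To show the idea end to end, here is a small Python stand-in for such a script. It is only a sketch: `load_document` is a hypothetical placeholder for the database lookup, and the MIME type shown is the standard one for .xlsx files, which should be checked against the list of content-types Zoom recognizes.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Standard MIME type for Excel 2007 (.xlsx) workbooks.
XLSX_TYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def load_document(object_id):
    """Hypothetical database lookup; returns (mime_type, raw_bytes)."""
    return XLSX_TYPE, b"placeholder bytes"


class DocumentHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The URL carries no file extension, only an object ID.
        query = parse_qs(urlparse(self.path).query)
        object_id = query.get("objid", [""])[0]
        mime_type, body = load_document(object_id)

        # Headers must be sent before the body, as with header() in PHP.
        self.send_response(200)
        self.send_header("Content-Type", mime_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# Example usage (the port is arbitrary):
# HTTPServer(("127.0.0.1", 8000), DocumentHandler).serve_forever()
```

                With the type stored next to each object in the database, the script can send the matching header for every document, so the crawler does not need a file extension in the URL.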
