V5 development progress - Adding meta data to remote files


  • V5 development progress - Adding meta data to remote files

    This post continues the series of posts about what new features will be in V5 of Zoom (when it is finally released).

    If you are building an internet search engine, you will often be indexing PDF and DOC files that you don't control, because they are on someone else's web site.

    A lot of the time the authors of these documents don't know how to, or forget to, correctly set the document properties. Having invalid meta data means searches are not as accurate as they should be, and results that display the meta data appear to be wrong.

    This new feature will largely solve this problem by allowing the owner of the search engine to override the incorrect meta data in the document. New meta data is placed in .desc files.

    To enable this feature, click on "Configure"->"Scan Options" and check the "Use the offline folder for all plugin .desc files" option. Then specify or select the folder path where your .desc files are to be found.

    With this setup, you can now index external sites using Spider Mode, and the Indexer will look in the local directory for the .desc files for any plugin supported file formats (such as .pdf, .doc, etc.). This allows you to specify custom .desc files without having to host them on the remote web site.

    Each offline .desc file needs to include the full domain name and URL path in its filename. This is usually everything after the "http://" or "https://" prefix. The filename must also end in ".desc" (see examples below).

    However, since a number of characters that can appear in a URL are not valid in filenames, you must encode each of these characters as its hexadecimal code preceded by a "%" sign. This is similar to the encoding used for special characters in URLs. The following is a list of the characters in a URL which must be encoded.

    Character Encoded
    \ %5C
    / %2F
    : %3A
    * %2A
    ? %3F
    " %22
    < %3C
    > %3E
    | %7C

    For each of the above characters in a URL, substitute the Encoded form of the character when naming the .desc file for that URL.

    Here are some examples of URLs and their corresponding .desc filenames:

    URL: http://www.mysite.com/files/mydocument.pdf
    .desc filename: www.mysite.com%2Ffiles%2Fmydocument.pdf.desc

    URL: http://www.mysite.com/download.php?fileid=123
    .desc filename: www.mysite.com%2Fdownload.php%3Ffileid=123.desc
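
    If you have many URLs to name, the substitution can be scripted. The following is a minimal Python sketch of the rules described above, not part of Zoom itself. The function name desc_filename is made up for illustration, and it assumes that only the nine characters in the table need encoding and that any other character (including "%" and "=") is left unchanged.

```python
# Characters from the table above that are not valid in filenames,
# mapped to "%" followed by their two-digit uppercase hexadecimal code.
ENCODED = {ch: "%{:02X}".format(ord(ch)) for ch in '\\/:*?"<>|'}


def desc_filename(url: str) -> str:
    """Return the offline .desc filename for a URL (illustrative helper)."""
    # Drop the "http://" or "https://" prefix, if present.
    for prefix in ("http://", "https://"):
        if url.lower().startswith(prefix):
            url = url[len(prefix):]
            break
    # Encode the listed characters; leave every other character as-is.
    encoded = "".join(ENCODED.get(ch, ch) for ch in url)
    # The filename must end in ".desc".
    return encoded + ".desc"


print(desc_filename("http://www.mysite.com/files/mydocument.pdf"))
print(desc_filename("http://www.mysite.com/download.php?fileid=123"))
```

    Run on the two example URLs, this should print the same filenames as listed above.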

    Of course the preferred solution would be to create documents with correct meta data in the first place. But when this hasn't been done, local .desc files can provide more accurate searches and better looking results.

    ------
    David

  • #2
    Re: V5 development progress - Adding meta data to remote fil

    Originally posted by Wrensoft
    ...A lot of the time the authors of these documents don't know how to, or forget to, correctly set the document properties. Having invalid meta data means searches are not as accurate as they should be, and results that display the meta data appear to be wrong.

    ...Of course the preferred solution would be to create documents with correct meta data in the first place. But when this hasn't been done, local .desc files can provide more accurate searches and better looking results.

    ------
    David
    David, would you point me to a page or site on the internet that documents the proper procedure concerning the meta data of pdf's?

    Also, do you know how to prevent search engines like Google from putting pdf documents in cache or from producing HTML versions of the pdf documents? Apparently, protecting the pdf document from import isn't enough.

    Finally, I am eagerly awaiting the release of version 5. Any ideas as to the release date?

    • #3
      Re: V5 development progress - Adding meta data to remote fil

      Originally posted by Wrensoft
      ...
      A lot of the time the authors of these documents don't know how to, or forget to, correctly set the document properties. Having invalid meta data means searches are not as accurate as they should be, and results that display the meta data appear to be wrong.

      ...Of course the preferred solution would be to create documents with correct meta data in the first place. But when this hasn't been done, local .desc files can provide more accurate searches and better looking results.

      ------
      David
      Also, do you have any idea how to keep the search engines from including CSS files in their search results? I have a robots.txt file that excludes all of my "assets" folder. However, Yahoo doesn't seem to respect that and indexed my CSS files. Can you imagine?! I could never have imagined that they would do that! Do you have any suggestions to keep these files out of their results? I even individually declared them in the robots.txt file and they are still in Yahoo's results.

      • #4
        ...the proper procedure concerning the meta data of pdf's?
        If you are making PDF files from Word, then set the document properties in Word before converting to PDF.

        do you know how to prevent search engines like Google from putting pdf documents in cache
        Using this meta tag prevents HTML pages from being cached by Google.
        <meta name="googlebot" content="noarchive">
        But as for PDF files, I don't know the answer. Best you ask Google, it is their product after all.

        There is no fixed date for the V5 release. We keep adding new stuff, so the date keeps moving back. But we should have a near-complete beta in about 2 to 3 weeks, I hope.

        • #5
          Yahoo doesn't seem to respect that and indexed my CSS files. Can you imagine?!
          Yes, this is surprising. CSS files don't really contain any information that is worth searching for.

          Maybe there is a mistake in your robots.txt file. I did a quick (30 sec) search on the web but didn't find any useful information. I don't know why Yahoo would do this or how to stop it. You should ask Yahoo about this. It is their product after all. (We don't want to get into the situation where we provide free support for Yahoo and Google. They make enough money to provide support if they wanted to, and sometimes we are in competition with them.)
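
          One way to rule out a mistake in the file is to test the rules locally. The following is a rough Python sketch using the standard library's urllib.robotparser module. The robots.txt content, the example.com URLs and the helper name is_blocked are made-up examples, not taken from the poster's site. "Slurp" is used as the user agent because that is the name Yahoo's crawler goes by. This only checks how a standards-following parser reads the rules. It cannot tell you what Yahoo's crawler actually does.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text disallows url for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that excludes an "assets" folder for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /assets/
"""

print(is_blocked(ROBOTS_TXT, "Slurp", "http://www.example.com/assets/style.css"))
print(is_blocked(ROBOTS_TXT, "Slurp", "http://www.example.com/index.html"))
```

          If the first check does not report the CSS file as blocked when run against your own robots.txt content, the problem is likely in the file rather than in the crawler.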

          ------
          David

          • #6
            Metadata replacing the extension

            This new feature will largely solve this problem by allowing the owner of the search engine to override the incorrect meta data in the document. New meta data is placed in .desc files
            The most important metafile information is the type of document it is. That is, I have many Word or pdf files that have been renamed at the site and stripped of their extensions. Will there be a parameter in the .desc file that can tell Zoom which plugin to use on the file?
            Gabe Fineman
            Washington, DC
            -Gabe Fineman
            Washington, DC [still defranchised]

            • #7
              No. The document type is determined by the MIME type information returned by the server. So even if the document is incorrectly named, using the wrong file extension, it will be OK if the correct MIME type is returned.
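
              As a rough illustration of the idea (this is not Zoom's actual code), choosing a format from the server's Content-Type header rather than from the file extension could look like the Python sketch below. The table MIME_TO_FORMAT and the function format_from_content_type are hypothetical names, and only two well-known MIME types are listed.

```python
from typing import Optional

# Hypothetical lookup table: MIME type -> document format to use.
MIME_TO_FORMAT = {
    "application/pdf": "pdf",
    "application/msword": "doc",
}


def format_from_content_type(content_type: str) -> Optional[str]:
    """Pick a document format from a Content-Type header value.

    The file extension in the URL is ignored on purpose. Returns None
    when the MIME type is not in the table.
    """
    # Strip any parameters such as "; charset=..." and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return MIME_TO_FORMAT.get(mime)


# A file served without an extension is still recognised by its MIME type.
print(format_from_content_type("application/pdf"))
print(format_from_content_type("Application/MSWord; charset=binary"))
print(format_from_content_type("text/css"))
```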

              • #8
                Is there an estimated time to release of v5?

                Regards,
                Ian Tresman

                • #9
                  As stated above, there is no fixed date for the V5 release. We keep adding new stuff, so the date keeps moving back. But we should have a near-complete beta in about 1 week, I hope.
