Indexing .desc files while skipping their associated files

  • Indexing .desc files while skipping their associated files

    I have a bunch of rather large (approx. 2 MB) .pdf files that contain mostly mapping imagery. So, I have put the entire folder in my skip list.

    However, I wish to use .desc files to identify key features contained within the .pdf files. For instance, if one searches for "Bridge WB-414" it will return "Sheet037.pdf: Bridge WB-414 at Milepost 36.25".

    So my question is: How can I skip indexing the .pdf files within this folder, while enabling Zoom to index the .desc files therein?

  • #2
    If the PDF contains only images, then nothing would be indexed because the plugin would not be able to extract content from the images. In this case, you wouldn't need to skip the PDF files, and you can just let them index.

    However if you have graphing data, labels, etc. in text form surrounding the image (not part of the image), then this content may get extracted, in which case, you would want to skip the content within the PDF file and only use the .desc file's meta information. Unfortunately, the current version of Zoom does not provide a way to do this. You can disable indexing the content for all your files and only rely on titles, meta keywords and descriptions; but there's no way to do this ONLY for PDF files.

    You should try allowing the PDF files to index first to see what sort of extraneous content gets indexed, and how badly this affects your search data. You can also try setting the ZOOMPAGEBOOST to -5 in the .desc file for each of these PDFs, which would lower the importance of content found within the file.
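
    As a rough sketch of that last suggestion, a .desc file for one of the map PDFs might look like the following. The title and keyword text are made up for illustration (borrowed from the "Bridge WB-414" example above), the -5 value is only a starting point to experiment with, and the comment lines are annotations for this post that can be left out of the real file:

    ```html
    <!-- Hypothetical .desc contents for Sheet037.pdf; title and keywords are illustrative only -->
    <title>Sheet 037</title>
    <meta name="keywords" content="Bridge WB-414 at Milepost 36.25">
    <!-- A negative page boost lowers the overall weight of this PDF in the results -->
    <meta name="ZOOMPAGEBOOST" content="-5">
    ```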
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine


    • #3
      This is not much of a problem Raymond. As you indicated, I can have it scan the .pdfs. It will simply take longer, but it'll work ok.


      • #4
        .desc files

        I have made .desc files for each of my .pdfs, but the spider doesn't seem to be indexing the content of the .desc files.

        The .desc files include the following meta text:

        <meta name=”keywords” content=”WB-403 @ MP 30.69, WB-403A @ MP 30.73, WB-404 @ MP 31.06, WB-405 @ MP 31.22, WB-406 @ MP 31.55">

        (Those are bridge names at the specified milepost locations.)

        I can see the spider download and apparently read the .desc files, but when I search on, for instance, "WB-403" I get no hits from the associated pdf file.

        Any ideas?


        • #5
          Make sure you have:

          - Enabled "Use .desc files for plugin extensions" in the configuration window (under the "Scan Options" tab).
          - Selected "Meta keywords" under the "Indexing options" tab
          - Selected "Hyphens" under the "Indexing options" tab

          If you have checked the above and still have the problem, make sure you are using the latest build (4.0.1012) available from: http://www.wrensoft.com/zoom/whatsnew.html

          I also just noticed that you are using non-standard (and mismatched) quote characters in your meta tag, i.e. content=<slantedquote>....<normal-double-quote>. I think this would be ignored by the Indexer, so use normal (non-slanted/curled) quote characters. Check whether this is how the tag is defined in the actual file or if it's just a typo in your message.
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine


          • #6
            >- Enabled "Use .desc files for plugin extensions"<
            Done!
            >- Selected "Meta keywords" under the "Indexing options" tab<
            Done!
            >- Selected "Hyphens" under the "Indexing options" tab <
            Done!

            The following is the complete contents cut-and-pasted from one of the files:

            <title>Milepost 31.3; 200 Scale Ortho Mosaic</title>
            <meta name='keywords' content='WB-403 @ MP 30.69, WB-403A @ MP 30.73, WB-404 @ MP 31.06, WB-405 @ MP 31.22, WB-406 @ MP 31.55'>

            Note that after indexing and uploading, a search for "WB-403" does NOT return the .pdf file.


            • #7
              You should be using double quotes and not single quotes.

              So try this,
              <meta name="keywords" content="WB-403 @ MP 30.69, WB-403A @ MP 30.73, WB-404 @ MP 31.06, WB-405 @ MP 31.22, WB-406 @ MP 31.55">

              The HTML standard for the <meta> tag is here,
              http://www.w3.org/TR/html4/struct/global.html#h-7.4.4

              ------
              David


              • #8
                D'oh!

                Applying meta tags correctly as per your suggestion resulted in the following when searching for "WB-403A":

                1. Milepost 31.3; 200 Scale Ortho Mosaic [MP 31-40]
                ... WB-403 @MP 30.69, WB-403A @MP 30.73, WB-404 @MP 31.06, WB-405 @MP 31.22, WB-406 @MP 31.55 ...
                Terms matched: 1 - Score: 10 - 28 Oct 2004 - URL: http://mywebsite.com/31-40/mosaics/sheet031.pdf

                This is exactly the behavior that I was looking for. Searching on any of these structures by name will yield:
                • The link to the aerial image containing the structure
                • The location of the structure (that's the "@ MP xx.xx" part)
                • Documents relating to the structure


                Perfect!

                Thanks for the help.


                • #9
                  Boosting .desc meta data

                  It would be a very useful feature to allow the boosting of the .desc meta data.

                  For example:
                  For example, <meta name="ZOOMPAGEBOOST" keywords="5"> could boost the keywords contained within the .pdf's .desc file rather than the keywords contained within the .pdf file itself, so that in the case given above the first result returned would be from the .desc file.


                  • #10
                    Just so you know, since build 4.0.1010, Zoom also supports meta tags specified by single quote characters. If you are using an older build, you can download the latest from here:
                    http://www.wrensoft.com/zoom/whatsnew.html

                    Also, ZOOMPAGEBOOST does boost the keywords within the .desc file (as well as those within the PDF). However, the syntax should be:

                    <meta name="ZOOMPAGEBOOST" content="5">

                    Hope that helps.
                    --Ray
                    Wrensoft Web Software
                    Sydney, Australia
                    Zoom Search Engine


                    • #11
                      <meta name="ZOOMPAGEBOOST" content="5">

                      Changing the meta text from keywords="5" to content="5" increased the file's score from 10 to 15, moving it up from #23 to #16. The #1 result has a score of 380.

                      The desired performance is to elevate the .pdf file with the associated .desc file to the #1 position for the search terms equivalent to the .desc file's keywords.


                      • #12
                        You should consider why these other files are ranked higher for this keyword. For example - why do they appear so often on these pages (if they are not relevant)? Do you want them to appear at all in your search results?

                        There are a number of things you can do, depending on the situation:

                        a) Define a negative ZOOMPAGEBOOST (eg. "-5") on each of these pages. Note that this would decrease the overall importance of these pages, and not just for this particular keyword.

                        b) Add more references to the keyword (that you wish to appear higher in the results) in the PDF .desc file. eg. doubling the meta keywords content. You could also add it in a <title> or a meta description, and increase weighting for those elements in the configuration window.

                        c) If these other pages don't really contain any 'real' information for this keyword, and it is only mentioned for reasons such as navigation (eg. breadcrumbs) or features like "previous pages you visited:", you should consider excluding that part of the page from being indexed with a <!--ZOOMSTOP--> and <!--ZOOMRESTART--> tag. However, this means that these pages will not show up in your results for this particular keyword at all, rather than just being further down the list.
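
                        For option (c), here is a minimal sketch of a page whose navigation text is excluded from indexing using Zoom's <!--ZOOMSTOP--> and <!--ZOOMRESTART--> comment tags (assumed here to be the tags meant above). The page itself is hypothetical:

                        ```html
                        <html>
                        <head><title>Meeting minutes (hypothetical page)</title></head>
                        <body>
                        <!--ZOOMSTOP-->
                        <!-- Everything between the two tags is skipped by the indexer -->
                        <p>Previous pages you visited: WB-403A, WB-404</p>
                        <!--ZOOMRESTART-->
                        <p>Body text that should still be indexed.</p>
                        </body>
                        </html>
                        ```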

                        Another alternative might be to use the category feature. If you define a category for all the PDFs or Maps specifically, then you can have searches which exclude other files/pages on your website from the search.
                        --Ray
                        Wrensoft Web Software
                        Sydney, Australia
                        Zoom Search Engine


                        • #13
                          >You should consider why these other files are ranked higher for this keyword. For example - why do they appear so often on these pages (if they are not relevant)? Do you want them to appear at all in your search results? <

                          The other pages certainly are relevant. The reason that I want these specific pages to rank first has nothing to do with these other documents. The .pdfs that I want to rank in first place contain mapping info that shows the location of the structure. What I am trying to do is set it up so that if one searches on a structure (e.g. "WB-403A") the first result is invariably its location (e.g.: "WB-403A @MP 30.73") and a link to the map. (See above posting.)


                          > Define a negative ZOOMPAGEBOOST (eg. "-5") on each of these pages. Note that this would decrease the overall importance of these pages, and not just for this particular keyword. <
                          Can't do! As I say, these pages are relevant, and even if I wanted to, I can't edit these documents. These documents are reports, meeting minutes and other engineering documents that have been uploaded by various users.

                          >Add more references to the keyword (that you wish to appear higher in the results) in the PDF .desc file. eg. doubling the meta keywords content. You could also add it in a <title> or a meta description, and increase weighting for those elements in the configuration window. <
                          I can do some experimenting along these lines. Whatever scheme that I adopt, it must involve the .desc files (for the reason stated above). What I need is a means of boosting the .desc meta keywords by a factor of 40.
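
                          For instance, a first experiment might be a .desc file along these lines, with the keywords repeated and the same terms also placed in a meta description. The description text is made up for illustration, the comment line is only an annotation, and whether this gets anywhere near the boost I need is untested. The weighting for titles and descriptions would still have to be raised in the configuration window:

                          ```html
                          <!-- Experimental .desc contents; the description wording is hypothetical -->
                          <title>Milepost 31.3; 200 Scale Ortho Mosaic</title>
                          <meta name="description" content="Location map for bridges WB-403, WB-403A, WB-404, WB-405 and WB-406">
                          <meta name="keywords" content="WB-403 @ MP 30.69, WB-403A @ MP 30.73, WB-403 @ MP 30.69, WB-403A @ MP 30.73">
                          <meta name="ZOOMPAGEBOOST" content="5">
                          ```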

                          >If these other pages don't really contain any 'real' information for this keyword, and it is only mentioned for reasons such as navigation (eg. breadcrumbs) or features like "previous pages you visited:", you should consider excluding that part of the page from being indexed with a <!--ZOOMSTOP--> and <!--ZOOMRESTART--> tag. However, this means that these pages will not show up in your results for this particular keyword at all rather than just being further down the list. <
                          As explained above, this is not applicable.

                          >... use the category feature. If you define a category for all the PDFs or Maps specifically, then you can have searches which exclude other files/pages on your website from the search.<
                          I am already using categories extensively. I have 10 categories as it is: one for each of nine projects, and "general" content.

                          What I am doing as a workaround for this is to include the following "tip" on the search page:
                          To return just the milepost and a link to the ortho image, include the term "ortho". e.g.: wb-415 ortho


                          • #14
                            > Can't do! As I say, these pages are relevant, and even if I wanted to, I can't edit these documents. These documents are reports, meeting minutes and other engineering documents that have been uploaded by various users. <
                            I assume these are HTML or text files and not PDF or Word documents. Otherwise you can specify .desc files for them without having to modify the file itself (as you are doing with the PDFs).

                            I think your current method is quite good (a search tip with an additional keyword). Combining this with the boosting of keywords, titles and descriptions should help you achieve a higher ranking.
                            --Ray
                            Wrensoft Web Software
                            Sydney, Australia
                            Zoom Search Engine
