Excluding Files

  • Excluding Files

    Is there any way to entirely exclude large numbers of files without moving them to an excluded directory?

    Our website has documents that expire at some point in time (and even have the expires meta tag). There are too many to exclude through the normal program, so we have used the ZOOMSTOP and ZOOMRESTART tags within the expired documents, but there is still the possibility of the documents showing up in results based only on the filename.
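
    For reference, the expired pages are wrapped roughly like this (the heading and text below are just placeholders, not one of our actual documents):

        <!-- Tell the Zoom indexer to ignore everything between these two comments -->
        <!--ZOOMSTOP-->
        <h1>Position announcement (closed 01/15/2004)</h1>
        <p>This vacancy has expired and is no longer accepting applications.</p>
        <!--ZOOMRESTART-->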

    Thanks
    federalgovernmentjobs.us

  • #2
    The easiest way would be to do one of the following:

    1/ Remove the old pages from your site entirely

    OR

    2/ Leave the pages on your site, but remove the links to them so that they are no longer found by the spider.

    OR

    3/ Change the file names of the old files (e.g. add an underscore at the start of the file name) and then use Zoom to filter out the pages.
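
    For option 3, the idea would be something like this (the filename and skip entry are only an illustration, not a tested configuration):

        Rename:  vacancy-2004.html  ->  _vacancy-2004.html

        Then add a skip list entry such as "/_" in the Configuration window, so that any filename beginning with an underscore is skipped by the indexer.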

    ----
    David



    • #3
      Exclusion by tag?

      The earlier reply to this subject doesn't help me with my problem:

      I have to exclude certain files (of a different language, and the print version, which should also not be indexed), which all reside in one directory.

      Because I need different search indexes for each language, I would like to tag them with, for example, one meta keyword. It is not very handy to exclude them with ZOOMSTOP though, because it has to be entered at the top AND the end of the pages. It would be nice to have a tag to explicitly index a file and ignore all others(?).

      Are there any other tips to efficiently exclude files through, for example, a tag inside the HTML?

      Regards
      Ronald



      • #4
        If the print version and other language versions of your files are marked accordingly by their filenames, then you can use the Skip Pages option in the Configuration window.

        For example, your print version of the file "mydocument.html" may be named "mydocument_print.html", the French version may be named "fr_mydocument.html", etc. In such cases, you can add "_print.html" and "fr_" to your skip list to exclude all such files.

        Similarly, if you are spidering a dynamically generated website and use URL parameters such as "mydocument.php?print=1&lang=fr", you can specify these parameters as skip list entries (e.g. "print=1", etc.) - they will be matched against the URLs that are found.

        At the moment, there is no single tag to mark an entire HTML page to be skipped. You can use the ZOOMSTOP and ZOOMRESTART tag pair to mark portions (or the entirety) of a page to be skipped. But to skip entire files, we generally recommend using the skip list, or, if you are using Spider Mode, configuring the spider not to follow certain links in the first place (see chapter 2.1.4 in the Users Guide for the Advanced Spider URL options, and the ZOOMSTOPFOLLOW tag). These methods avoid the need for the Indexer to access each file only to find that it should be skipped, allowing for much faster indexing.
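
        For example, if a page links to its print and French versions, you can wrap those links so that the spider does not follow them. The filenames below are just placeholders; check chapter 2.1.4 of the Users Guide for the exact syntax of the tag pair:

            <!-- Links inside this block are not followed by the spider, but the text is still indexed -->
            <!--ZOOMSTOPFOLLOW-->
            <a href="mydocument_print.html">Print version</a>
            <a href="fr_mydocument.html">Version française</a>
            <!--ZOOMRESTARTFOLLOW-->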

        It's really worthwhile to make the effort to maintain a consistent and meaningful naming scheme (and folder structure) for your web pages, especially if there are multiple versions of the same page. This not only helps the indexing and searching procedure, but also makes future maintenance easier, since you can easily determine which files are related.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine



        • #5
          Exclusion

          Hi Ray,

          I cannot exclude files by the filename method, because the offline reader returns look-alike files like "index~1.htm" or "sale~483.htm" etc. from my dynamic pages.
          The only solution therefore seems to be the mentioned tags. Sigh...

          regards
          Ronald



          • #6
            Zoom should not return truncated filenames like "index~1.htm" in Offline Mode. I am also confused by your reference to them as "dynamic pages".

            Dynamically generated web pages (e.g. PHP, ASP, CFM files) are server-side scripts and need to be indexed using Spider Mode. However, since you said you are using Offline Mode and you are referring to ".htm" files, I'm not sure where these dynamic pages fit into the picture.

            Can you clarify what you are referring to by "dynamic pages", and also whether the "offline reader" is what we presume to be the Zoom Indexer in Offline Mode, or another piece of software?

            Also, can you give us some examples of the full URL/filepath to these files, and clarify whether you are actually using Zoom in Offline or Spider Mode?
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine



            • #7
              Sorry for the confusion and lack of clarity.

              My workflow is as follows:

              a. I have a ColdFusion-based catalogue (an online app).

              b. I spider all CF pages with the 3rd-party product "MetaProducts Offline Explorer Pro 3.8". This produces the truncated 8.3 filenames (I tried Joliet CD-R 64-character filenames instead yesterday, which also seems to work).

              c. I use Zoom Search Engine Pro to index the results of step b in JavaScript/Offline Mode.

              (The output is to be put on a CD-ROM for offline distribution, hence the JavaScript mode.)


              regards
              Ronald



              • #8
                I see what you mean now. We haven't come across this scenario/usage before. Did increasing the filename length to 64 characters help with your problem? I.e., can you now exclude some files by filename?
                --Ray
                Wrensoft Web Software
                Sydney, Australia
                Zoom Search Engine



                • #9
                  >> I.e., can you now exclude some files by filename?

                  Increasing the filename length helped, but I have some problems with duplicated web pages which have different URLs even though the content is the same.

                  It would be very helpful in that respect to have a tag to INCLUDE a page, rather than having to EXCLUDE everything else, in my case.

                  I'd better now build some specific URL parameters into every link, so I can use the exclusion option more efficiently for this project.
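
                  What I have in mind is something like this (the "noindex=1" parameter name is only my own working example, not an existing Zoom setting):

                      <!-- every link to a page that should stay out of the index carries a marker parameter -->
                      <a href="sale.cfm?id=483&noindex=1">Printable sale page</a>

                  The skip list would then need just a single entry, "noindex=1", matched against every URL that is found.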

                  regards
                  Ronald



                  • #10
                    Try using the "Duplicate page detection" option ("Use CRC-32") in the Indexer Configuration window, under the "Scan Options" tab.

                    This should prevent duplicate pages with the same content, but different URLs, from being indexed.

                    We are planning to add robots.txt and meta robots tag support in Version 5.0 (probably sometime next year). This will provide more per-page indexing options and may be useful for you.
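
                    For reference, these two mechanisms follow the standard web conventions (the paths below are placeholders only):

                        # robots.txt - placed in the web root, one Disallow line per excluded path
                        User-agent: *
                        Disallow: /print/
                        Disallow: /fr/

                        <!-- per-page alternative: a robots meta tag in the page's <head> -->
                        <meta name="robots" content="noindex">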

                    [Update 18/June/2007: Robots.txt support was added in V5.1]
                    --Ray
                    Wrensoft Web Software
                    Sydney, Australia
                    Zoom Search Engine

