Scan options, extra extensions


  • Scan options, extra extensions

    My multilingual website has files with the following file extensions: *.nl.html and *.en.html
    These files are generated by a CMS. For each language I have set up a search page and separate index files. With the Skip options I can filter out the .en.html files when indexing Dutch and the .nl.html files when indexing English. But I also have plain *.html files that I want to be able to include or exclude. Is it possible to change the Scan Extensions setting so that it accepts extensions containing more than one dot (e.g. .en.html or .nl.html)?

  • #2
    The Scan Extensions list is a list of the file formats that the indexer accepts, together with the way it processes each file type. In your case, both kinds of file are in HTML format.

    Was your post just a suggestion or was there some specific problem you were having with the skip list?



    • #3
      You should also realize that your file extensions are, in fact, still just ".html", and not ".en.html" or ".nl.html". All you have done by using multiple dots is create filenames with dots in them.

      That is, the file "index.en.html" would actually mean:

      A filename of "index.en"
      with an extension of ".html"
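To illustrate the point about where the extension starts, here is a minimal Python sketch. It only shows the usual "last dot wins" convention for splitting a filename; it is not how Zoom itself parses filenames, and the helper name `split_name` is made up for this example.

```python
import os.path


def split_name(filename):
    """Split a filename into (name, extension) at the LAST dot."""
    return os.path.splitext(filename)


for filename in ("index.en.html", "index.nl.html", "index.html"):
    name, ext = split_name(filename)
    print(f"{filename}: filename={name!r}, extension={ext!r}")
```

All three files end up with the same extension, which is why a single ".html" entry in Scan Extensions covers them.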

      Having realized that, it should be clearer that what you need to do is not change your Scan Extensions (which should only need to have a ".html" entry), but instead, you should be specifying entries like the following in the "Page skip list" on the "Skip Options" tab of the Configuration window:

      .en.html (in your Dutch config file)
      .nl.html (in your English config file)

      Hope that helps.
      --Ray
      Wrensoft Web Software
      Sydney, Australia
      Zoom Search Engine



      • #4
        That's exactly how I index the files now, but I want to be able to index only the ".en.html" files and exclude all plain ".html" files, and likewise for the ".nl.html" files. Setting up Scan Extensions with .en.html and without .html isn't possible at the moment, and adding .html to the skip list skips all files.
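The problem described in this post can be sketched in a few lines of Python. This assumes the skip list matches an entry anywhere in the URL (substring matching), which is consistent with the behaviour reported here but is an assumption about the indexer, not taken from its documentation. Both function names are hypothetical.

```python
def is_skipped(url, skip_list):
    """Assumed skip rule: skip the page if any entry occurs anywhere in the URL."""
    return any(entry in url for entry in skip_list)


def wanted_for_english(url):
    """The behaviour being asked for: accept only files ending in '.en.html'."""
    return url.endswith(".en.html")


pages = ["index.en.html", "index.nl.html", "index.html"]

# Skipping ".html" also hits the ".en.html" pages, because they contain ".html".
print([p for p in pages if not is_skipped(p, [".html"])])

# A suffix-based include rule would keep only the English pages.
print([p for p in pages if wanted_for_english(p)])
```

Under that assumption, a skip list alone cannot separate "index.html" from "index.en.html", which is what the suggestions in the following replies work around.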



        • #5
          Can you rename your xx.html files to be xx.htm files? This would allow differentiation.



          • #6
            No, the CMS doesn't allow it.



            • #7
              Can you put the noindex meta tag in each HTML file that you don't want indexed? E.g.
              <meta name="robots" content="noindex">
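If the CMS templates can emit that tag, a small script can help confirm which generated files actually carry it. The following is only a sketch using Python's standard `html.parser` module; the class and function names are made up for this example, and it checks for the tag in the same way a reader of the page source would, not necessarily the way the indexer does.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag with 'noindex' was seen."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        # content may hold several comma-separated directives, e.g. "noindex, follow"
        directives = [part.strip() for part in content.split(",")]
        if name == "robots" and "noindex" in directives:
            self.noindex = True


def has_noindex(html_text):
    """Return True if the HTML text contains a robots noindex meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.noindex


sample = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
print(has_noindex(sample))
```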



              • #8
                Another alternative, if your entire website is encoded in UTF-8, is to index the whole site as one and use the Categories feature to search within a specific region or language.
                --Ray
                Wrensoft Web Software
                Sydney, Australia
                Zoom Search Engine
