URLs with a full stop in them are not spidered

  • URLs with a full stop in them are not spidered

    I operate an article site with the Zoom script. It all works great, except for one thing: it doesn't spider pages whose URLs contain a full stop. It treats the full stop in the URL as the one that normally precedes a file extension.

    Let's say the title of an article is "This is the first part of the title. And this is the second part." The URL of that article will be
    "www.domainname.com/articles/This-is-the-first-part-of-the-title.-And-this-is-the-second-part". Zoom will not index the page because it regards the part after the full stop as a file extension, and of course it cannot find that extension in the config file.

    Is there a way to just disable the extension check, so that Zoom indexes all pages, regardless of the extension (or what it interprets to be an extension)?
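
To illustrate the behaviour described in this post, here is a minimal Python sketch of an extension check that takes everything after the last full stop in the final path segment. It is not Zoom's actual code; the function names, the `http://` scheme added to the example URL, and the `ALLOWED_EXTENSIONS` set are all assumptions made for the example.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical list of extensions a spider is configured to index.
ALLOWED_EXTENSIONS = {".html", ".htm", ".php"}


def guess_extension(url: str) -> str:
    """Return the text from the last dot in the final path segment, lowercased.

    Returns an empty string when the final segment contains no dot.
    """
    path = urlsplit(url).path
    last_segment = path.rsplit("/", 1)[-1]
    _, ext = posixpath.splitext(last_segment)
    return ext.lower()


def would_be_indexed(url: str) -> bool:
    """Assumed rule: index extensionless URLs and URLs with a listed extension."""
    ext = guess_extension(url)
    return ext == "" or ext in ALLOWED_EXTENSIONS


article = (
    "http://www.domainname.com/articles/"
    "This-is-the-first-part-of-the-title.-And-this-is-the-second-part"
)
print(guess_extension(article))   # the "extension" is the whole second sentence
print(would_be_indexed(article))  # False under the assumed rule
```

Under these assumptions, the part after the full stop is treated as an unknown extension, so the page is skipped even though it is ordinary HTML.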

  • #2
    What type of files are these (DOC, HTML, text, PDF, images)?

    Zoom often needs to know the file extension to know how to process the file. For example, .DOC files cannot be processed in the same way as .XML files.

    It would make more sense (to me) to add a file extension to indicate what type of file it is. It would be more standard.

    ------
    David


    • #3
      Originally posted by Wrensoft
      What type of files are these (DOC, HTML, text, PDF, images)?

      Zoom often needs to know the file extension to know how to process the file. For example, .DOC files cannot be processed in the same way as .XML files.

      It would make more sense (to me) to add a file extension to indicate what type of file it is. It would be more standard.

      ------
      David

      They're just normal HTML pages. For example:
      http://www.klantenservicekenniscentr...ing-en-passie.

      I'm sorry, it's a Dutch site. But do you see the full stop? When I remove it, Zoom finds the page. With it, Zoom doesn't. You can try here: http://www.klantenservicekenniscentr...earch/advanced

      The URL is generated by the article script (Interspire Article Live), based on the title the author gives it. I could instruct authors not to use full stops in titles, but I'd prefer a solution on the Zoom side. The problem is that Zoom thinks the full stop is extension-related. So it would be best to tell Zoom to ignore extensions, or simply to scan all extensions. Isn't that possible?


      • #4
        No, it is not possible to just scan all extensions. Zoom (like Windows and other operating systems) uses the file name extension to determine the type of the file.

        In Windows, if you take an HTML file and remove the .html at the end of the file name, then double-click on the file, Windows doesn't know what to do with the file any more.

        With HTTP there are MIME types, which can help determine the file type, if the server sets them correctly. What would be technically possible, with some additional development, would be to include a function that treats all files with unknown file name extensions as HTML files, or processes them only according to their MIME type.

        It would take some significant additional work to support this. But at the moment you are the only person asking for this feature, and we have other higher-priority features to work on.

        The workaround is to have a .htm at the end of your HTML files or to avoid using full stops in URLs.
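
As a rough illustration of the "avoid full stops in URLs" workaround on the CMS side, here is a minimal Python sketch that builds a URL slug from an article title. It is not Interspire Article Live's actual code; `slug_from_title` is a hypothetical helper, and restricting the slug to ASCII letters and digits is a simplification that would also drop accented characters.

```python
import re


def slug_from_title(title: str) -> str:
    """Build a URL slug that contains no full stops.

    Every run of characters other than ASCII letters and digits
    (spaces, full stops, commas, ...) becomes a single hyphen.
    """
    slug = re.sub(r"[^A-Za-z0-9]+", "-", title)
    # Drop hyphens left over from leading or trailing punctuation.
    return slug.strip("-")


title = "This is the first part of the title. And this is the second part."
print(slug_from_title(title))
```

With a slug like this, the final path segment contains no dot, so a spider that infers extensions from the last full stop sees an extensionless URL.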

        ------
        David


        • #5
          Thank you for the elaborate reply. I understand the problem with the script being a Windows program, and thus looking for known file types.

          I'll instruct my authors not to use full stops anymore; that's easiest.


          • #6
            I was just wondering what your CMS would do if you have an article title such as "This is a page about the file extension .JPG"?

            If it is not stripping the dots out of the URL name, I would expect that to cause all sorts of havoc, with various browsers and web clients believing they should treat that page as a JPEG file.

            To elaborate on what David posted, Zoom does index pages with no file extensions (there is a checkbox in the Scan Options tab of the Configuration window). This, however, refers to URLs which have no dots in the filename, such as the following:

            http://mysite.com/articles/my-test-page
            http://mysite.com/news/
            http://mysite.com/articles/myscript/...tpageetcetcetc

            However, when you have a dot in the filename part of the URL, then this will be considered a file extension.

            Zoom does currently look at the MIME type (when available) to determine the correct method of handling the file, but it also looks at the file extension to determine what should or should not be indexed. It is also used as a fall-back method to determine the file type when the MIME type is not available.
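
The "MIME type first, file extension as a fall-back" idea from this paragraph can be sketched in Python as follows. This is only an illustration of the general technique, not Zoom's implementation: `pick_handler` and both lookup tables are hypothetical, and the separate extension-based filter that decides what should or should not be indexed is not modelled here.

```python
import posixpath
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical lookup tables, for illustration only.
MIME_TO_HANDLER = {"text/html": "html", "application/pdf": "pdf", "text/plain": "text"}
EXT_TO_HANDLER = {".html": "html", ".htm": "html", ".pdf": "pdf", ".txt": "text"}


def pick_handler(url: str, content_type: Optional[str]) -> Optional[str]:
    """Choose a handler from the Content-Type header, else from the extension.

    Returns None when neither source identifies a known file type.
    """
    if content_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        mime = content_type.split(";", 1)[0].strip().lower()
        handler = MIME_TO_HANDLER.get(mime)
        if handler is not None:
            return handler
    # Fall back to whatever follows the last dot in the final path segment.
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    ext = posixpath.splitext(last_segment)[1].lower()
    return EXT_TO_HANDLER.get(ext)


url = "http://mysite.com/articles/First-part.-Second-part"
print(pick_handler(url, "text/html; charset=utf-8"))  # MIME type identifies HTML
print(pick_handler(url, None))                        # no MIME type, unknown extension
```

In this sketch, a correctly set `Content-Type` is enough to identify the dotted URL as HTML; without it, the bogus "extension" matches nothing and the type stays unknown.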

            I agree with David that it seems more reasonable to expect the CMS to be stripping out the dot from the topic title as it will lead to other problems (as suggested above).
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine


            • #7
              Originally posted by Ray
              I was just wondering what your CMS would do if you have an article title such as "This is a page about the file extension .JPG"?

              I tried it, and it just saves it with the extension. However, IE recognizes it as an HTML file and displays it correctly. So there is no browser problem with that (in IE anyway; I didn't try other browsers).
