Thread: URLs with a full stop in them are not spidered

  1. #1
    Join Date
    Mar 2006
    Posts
    4

    URLs with a full stop in them are not spidered

    I operate an article site with the Zoom script. It all works great, except for one thing: it doesn't spider pages whose URLs have a full stop in them. It treats the full stop in the URL as the one that normally comes before a file extension.

    Let's say the title of an article is "This is the first part of the title. And this is the second part." The URL of that article will be
    "www.domainname.com/articles/This-is-the-first-part-of-the-title.-And-this-is-the-second-part". Zoom will not index the page because it treats everything after the full stop as a file extension, and of course it cannot find that "extension" in the config file.

    Is there a way to just disable the extension check, so that Zoom indexes all pages, regardless of the extension (or what it interprets to be an extension)?
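
    To make the behaviour described above concrete, here is a minimal Python sketch of how a crawler might derive an "extension" from a URL by splitting the last path segment at its final full stop. This is not Zoom's code; the function name `apparent_extension` and the `http://` prefix added to the example URL are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def apparent_extension(url: str) -> str:
    """Return the part of the last path segment from its final dot onward.

    The result is lowercased and includes the dot; it is '' when the
    segment contains no dot (hypothetical helper, not Zoom's real logic).
    """
    path = urlsplit(url).path
    last_segment = path.rsplit("/", 1)[-1]
    _, ext = posixpath.splitext(last_segment)
    return ext.lower()


url = ("http://www.domainname.com/articles/"
       "This-is-the-first-part-of-the-title.-And-this-is-the-second-part")
print(apparent_extension(url))  # .-and-this-is-the-second-part
```

    A crawler that compares this value against a list of known extensions such as .htm or .html would find no match and skip the page, which fits the symptom in this post.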

  2. #2
    Join Date
    Dec 2004
    Location
    Sydney
    Posts
    4,283

    What type of files are these files (DOC, HTML, text, PDF, Images)?

    Zoom often needs to know the file extension to know how to process the file. For example, .DOC files cannot be processed in the same way as .XML files.

    It would make more sense (to me) to add a file extension to indicate what type of file it is. It would be more standard.

    ------
    David

  3. #3
    Join Date
    Mar 2006
    Posts
    4

    They're just normal html pages. For example
    http://www.klantenservicekenniscentr...ing-en-passie.

    I'm sorry, it's a Dutch site, but do you see the full stop? When I remove it, Zoom finds the page. With the full stop in place, it doesn't. You can try it here: http://www.klantenservicekenniscentr...earch/advanced

    The URL is generated by the article script (Interspire Article Live), based on the title the author gives it. I could instruct authors not to use full stops in titles, but I'd prefer a solution on the Zoom side. The problem is that Zoom thinks the full stop is extension-related. So it would be best to tell Zoom to ignore extensions, or simply to scan all extensions. Isn't that possible?

  4. #4
    Join Date
    Dec 2004
    Location
    Sydney
    Posts
    4,283

    No, it is not possible to just scan all extensions. Zoom (like Windows and other operating systems) uses the file name extension to determine the type of the file.

    In Windows, if you take an HTML file and remove the .html at the end of the file name, then double-click on the file, Windows doesn't know what to do with the file any more.

    With HTTP there are MIME types, which can help determine the file type, if the server sets them correctly. What would be technically possible, with some additional development, would be to include a function that treats all files with unknown file name extensions as HTML files, or processes them only according to their MIME type.
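
    As a rough illustration of that idea only (this feature does not exist in Zoom as described here), a Python sketch could choose a handler from the HTTP Content-Type header and treat a missing header as HTML. The function name `handler_for`, the handler labels, and the table of MIME types are all hypothetical.

```python
# Hypothetical mapping from MIME type to a handler label.
MIME_HANDLERS = {
    "text/html": "html",
    "application/xhtml+xml": "html",
    "text/plain": "text",
    "application/pdf": "pdf",
    "application/msword": "doc",
}


def handler_for(content_type_header, default="html"):
    """Choose a handler label from a Content-Type header value.

    Parameters such as '; charset=UTF-8' are ignored. A missing or empty
    header falls back to `default`; an unrecognised type returns 'skip'.
    """
    if not content_type_header:
        return default
    mime = content_type_header.split(";", 1)[0].strip().lower()
    return MIME_HANDLERS.get(mime, "skip")


print(handler_for("text/html; charset=UTF-8"))  # html
print(handler_for("image/jpeg"))                # skip
```

    The approach only helps when the server sends a correct Content-Type, which is the caveat raised in the paragraph above.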

    It would take significant additional work to support this. At the moment you are the only person asking for this feature, and we have other higher-priority features to work on.

    The workaround is to have a .htm at the end of your HTML files or to avoid using full stops in URLs.
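
    For the second workaround, the full stops could be removed at the point where the URL is generated from the title. The sketch below is in Python and is purely illustrative: it is not how Interspire Article Live builds its URLs, and `slugify_title` is a made-up name.

```python
import re


def slugify_title(title: str) -> str:
    """Build a URL slug from an article title without any full stops.

    Full stops are dropped, then each run of whitespace becomes a hyphen.
    """
    without_dots = title.replace(".", "")
    return re.sub(r"\s+", "-", without_dots.strip())


title = "This is the first part of the title. And this is the second part."
print(slugify_title(title))
# This-is-the-first-part-of-the-title-And-this-is-the-second-part
```

    With no dot left in the last path segment, the URL falls into the "no file extension" case rather than being read as an unknown extension.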

    ------
    David

  5. #5
    Join Date
    Mar 2006
    Posts
    4

    Thank you for the elaborate reply. I understand the problem with the script being a Windows program, and thus looking for known file types.

    I'll instruct my authors to not use full stops anymore, that's easiest.

  6. #6
    Join Date
    Dec 2004
    Location
    Sydney, Australia
    Posts
    3,768

    I was just wondering what your CMS would do if you have an article title such as "This is a page about the file extension .JPG"?

    If it is not stripping the dots out of the URL, I would expect that to cause all sorts of havoc, with various browsers and web clients believing they should treat that page as a JPEG file.

    To elaborate on what David posted, Zoom does index pages with no file extension (there is a checkbox in the Scan Options tab of the Configuration window). This, however, refers to URLs which have no dots in the filename, such as the following:

    http://mysite.com/articles/my-test-page
    http://mysite.com/news/
    http://mysite.com/articles/myscript/...tpageetcetcetc

    However, when you have a dot in the filename part of the URL, then this will be considered a file extension.

    Zoom does currently look at the MIME type (when available) to determine the correct method of handling the file, but it also looks at the file extension to determine what should or should not be indexed. It is also used as a fall-back method to determine the file type when the MIME type is not available.
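
    The two roles described above (the extension deciding whether a URL is indexed at all, and the MIME type deciding how it is handled, with the extension as a fall-back) can be sketched in Python roughly as follows. This is an interpretation of the description, not Zoom's actual implementation; `plan_indexing`, the extension list, the handler labels, and the choice to treat extensionless pages as HTML when no MIME type is available are assumptions.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical "extensions to index" list, mapped to handler labels.
EXTENSION_HANDLERS = {".htm": "html", ".html": "html", ".php": "html",
                      ".pdf": "pdf", ".doc": "doc"}
# Hypothetical MIME type table.
MIME_HANDLERS = {"text/html": "html", "text/plain": "text",
                 "application/pdf": "pdf", "application/msword": "doc"}


def plan_indexing(url, content_type=None, index_extensionless=True):
    """Return a handler label for the URL, or None if it would be skipped."""
    name = urlsplit(url).path.rsplit("/", 1)[-1]
    ext = posixpath.splitext(name)[1].lower()

    # Step 1: the extension decides whether the URL is indexed at all.
    if ext == "":
        if not index_extensionless:
            return None
    elif ext not in EXTENSION_HANDLERS:
        return None  # a dot followed by unknown text reads as an unknown extension

    # Step 2: prefer the MIME type when the server supplied one.
    if content_type:
        mime = content_type.split(";", 1)[0].strip().lower()
        return MIME_HANDLERS.get(mime)

    # Step 3: fall back to the extension; extensionless pages are assumed HTML.
    return EXTENSION_HANDLERS.get(ext, "html")


print(plan_indexing("http://mysite.com/articles/my-test-page", "text/html"))  # html
print(plan_indexing("http://mysite.com/articles/Part-one.-Part-two", "text/html"))  # None
```

    In this sketch the second URL is rejected in step 1, before the MIME type is ever consulted, which is why a correct text/html header alone would not get such a page indexed.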

    I agree with David that it seems more reasonable to expect the CMS to be stripping out the dot from the topic title as it will lead to other problems (as suggested above).
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine

  7. #7
    Join Date
    Mar 2006
    Posts
    4

    Quote Originally Posted by Ray
    I was just wondering what your CMS would do if you have an article title such as "This is a page about the file extension .JPG"?
    I tried it, and it just saves it with the extension. However, IE recognizes it as an HTML file, and displays it correctly. So, there is no browser problem with that (in IE anyway, I didn't try other ones).
