Confusion about word join characters


  • Confusion about word join characters

    Either there is a bug or I am confused about the word join characters. I am using the PHP configuration of the latest standard Zoom. If I leave the word join character boxes checked for apostrophes and hyphens, then index and search for Pete's and for Wal-Mart, neither is found, although both are present. If I uncheck the boxes for apostrophes and hyphens and run the searches, both Pete's and Wal-Mart are found (with the highlighting skipping the apostrophe and the hyphen, which is OK). I thought it should be the other way round: leaving the boxes checked for these characters would allow words containing them to be found, and unchecking the boxes would stop the indexing from recognizing them. Please explain. If I leave them unchecked, which seems to give the results I want, am I creating other problems I haven't recognized yet?
    Thanks
    whk

  • #2
    We could not reproduce any error with the word join character options for apostrophes and hyphens using V5.1.1001. In our tests, when they are enabled as word join characters, words containing them are both found and correctly highlighted.
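
    To illustrate what a word join character does, here is a minimal Python sketch. It is not Zoom's actual indexing code; the `tokenize` function and its `join_chars` parameter are hypothetical names used only for this example. It shows that a joined word is stored as one token, and that a character which merely looks like an apostrophe still splits the word.

```python
import re

def tokenize(text, join_chars=""):
    """Split text into lowercase words; characters in join_chars stay inside a word."""
    # Character class of ASCII letters, digits, and any configured join characters.
    word_class = r"[A-Za-z0-9" + re.escape(join_chars) + r"]+"
    return [w.lower() for w in re.findall(word_class, text)]

page = "Pete's is next to Wal-Mart"

# With apostrophe and hyphen as join characters, each name is a single token.
joined = tokenize(page, join_chars="'-")

# Without join characters, the names are split into separate tokens.
split = tokenize(page)

# A curly apostrophe (U+2019) is not the ASCII apostrophe, so it still splits the word.
curly = tokenize("Pete\u2019s", join_chars="'-")

print(joined)
print(split)
print(curly)
```

    Under this model, a search for "Pete's" with the join option enabled can only match a page that was indexed with the same plain apostrophe.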

    It is possible that this is related to the content you are indexing. For example, in PDF documents the apostrophe is often not the plain keyboard character: it may be a curly quote instead, or the text may actually be "Pete`s". PDF documents may also have words broken up by the layout (e.g. "Wal-" at the end of one line and "Mart" at the start of the next).
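
    One way to check which character a document really contains is to list the code points of anything that looks like an apostrophe. The following is a rough Python sketch; the `LOOKALIKES` list and the function name are assumptions for illustration, and the list is not exhaustive.

```python
import unicodedata

# Characters commonly rendered like an apostrophe (illustrative, not exhaustive):
# left/right single quotes, grave accent, acute accent, modifier letter apostrophe.
LOOKALIKES = "\u2018\u2019\u0060\u00b4\u02bc"

def find_apostrophe_lookalikes(text):
    """Return (offset, code point, Unicode name) for each look-alike character found."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in LOOKALIKES
    ]

for sample in ["Pete's", "Pete\u2019s", "Pete`s"]:
    print(repr(sample), find_apostrophe_lookalikes(sample))
```

    An empty result means the text uses only the plain ASCII apostrophe (U+0027).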

    Can you provide a URL to the page containing these words in question? If not, can you send us a copy of the document? If your search page is online, it may also be helpful if we can see it in action.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine


    • #3
      Ray: First, let me congratulate you and the other staff members on the way this forum is handled. Very quick and to the point courteous responses. Much appreciated.
      Next, the hyphen problem: I found that the program I am using to code HTML is putting <wbr> tags after hyphens. If I remove these then the hyphen word join feature works; removing them all manually would be a bit much and doing it automatically might do damage someplace. Be nice if Zoom would ignore them. I have a lot of phone numbers entered in an xxx-xxx-xxxx format. They have <wbr> tags after each hyphen and aren't found in a search. I fixed the Wal-Mart problem by removing an unneeded <wbr> tag.
      Lastly, the apostrophe problem. I had pdf problems with them earlier but now I am having problems in HTML files. If you wish, look at www.whknoth.com/cartmel/companion and search for Pete's. You won't find it though a search for Pete* will bring a result list that includes Pete's; (searching for Pete?? won't find it either). Apostrophes (and hyphens) were included in the word join character list when generating the index. If you search for Genuardi's you will find it. The apostrophe in Pete's differs from that in Genuardi's. I am not sure how the one in Genuardi's got there (Some of this material predates me) but the one in Pete's is from my keyboard so it would be nice if it could be found. Thanks for any help
      whk


      • #4
        Originally posted by whk
        Ray: First, let me congratulate you and the other staff members on the way this forum is handled. Very quick and to the point courteous responses. Much appreciated.
        Cheers! Thank you for the positive feedback. We try our best to help, and work hard to keep things useful around here so we're glad to hear that people notice the effort.

        Originally posted by whk
        Next, the hyphen problem: I found that the program I am using to code HTML is putting <wbr> tags after hyphens. If I remove these then the hyphen word join feature works; removing them all manually would be a bit much and doing it automatically might do damage someplace. Be nice if Zoom would ignore them. I have a lot of phone numbers entered in an xxx-xxx-xxxx format. They have <wbr> tags after each hyphen and aren't found in a search. I fixed the Wal-Mart problem by removing an unneeded <wbr> tag.
        Hmm, the <wbr> tag is actually not part of any HTML standard, and thus, technically invalid HTML. It is a proprietary tag which is only supported by some browsers (although it seems that most modern versions of IE and Netscape do support it). More information online if you Google for "wbr tag".

        We will consider adding support to treat <wbr> tags as non-word breaking in a future release (the tag is supposed to act as a "soft" word break, that is, if the browser has to break a word, it should break at the spot marked, but it is not supposed to break otherwise). I can imagine users who are more pedantic about standards compliance coming back to haunt us if we put this in place though ...
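
        In the meantime, a workaround is to strip the tags from a copy of the pages before indexing. Here is a minimal Python sketch of that idea; it is not a Zoom feature, the `strip_wbr` name is made up for this example, and it assumes the tags carry no attributes.

```python
import re

# Matches <wbr>, <WBR>, <wbr/>, and <wbr /> (tags without attributes only).
WBR_TAG = re.compile(r"<wbr\s*/?>", re.IGNORECASE)

def strip_wbr(html_text):
    """Remove <wbr> tags so hyphenated words and phone numbers stay contiguous."""
    return WBR_TAG.sub("", html_text)

print(strip_wbr("Call 555-<wbr>123-<wbr>4567 or visit Wal-<WBR />Mart"))
```

        Because the pattern only matches the complete tag, other markup is left alone, which limits the risk of an automatic replacement doing damage elsewhere. Running it on a copy keeps the original line-break hints for the browser.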

        Originally posted by whk
        Lastly, the apostrophe problem. I had pdf problems with them earlier but now I am having problems in HTML files. If you wish, look at www.whknoth.com/cartmel/companion and search for Pete's. You won't find it though a search for Pete* will bring a result list that includes Pete's; (searching for Pete?? won't find it either). Apostrophes (and hyphens) were included in the word join character list when generating the index. If you search for Genuardi's you will find it. The apostrophe in Pete's differs from that in Genuardi's. I am not sure how the one in Genuardi's got there (Some of this material predates me) but the one in Pete's is from my keyboard so it would be nice if it could be found.
        The problem, as you know, is on this page:
        http://www.whknoth.com/cartmel/companion/shopping.html

        If you right-click on the page and select "View source", you can see the HTML source code for the page and the cause of the problem is more evident.

        The problem is that the instance of "Pete's" that appears here is actually invalid. The apostrophe is a curly apostrophe (aka smart quote), which would be fine otherwise, but it should have been encoded as an HTML entity (i.e. it should appear as "Pete&rsquo;s" in the HTML). If you look at the HTML source code, however, it was not encoded; it was in fact inserted as a raw UTF-8 character.
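
        If the authoring application cannot be changed, the raw characters could be rewritten as entities after the fact. The following Python sketch shows one way to do that using the standard `html.entities` table; the `to_named_entities` function is a hypothetical helper, not part of Zoom or of any authoring tool, and it is meant for text content rather than whole HTML files.

```python
from html.entities import codepoint2name

def to_named_entities(text):
    """Replace each non-ASCII character with a named entity, or a decimal one if unnamed."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)  # plain ASCII (including markup characters) passes through
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")
        else:
            out.append(f"&#{code};")
    return "".join(out)

print(to_named_entities("Pete\u2019s"))
```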

        If this was generated by your web page authoring application, then I'll say that this is a bug in their program. You may also find that the problem might only occur if you copy+paste the curly quotes from something like MS Word (which commonly uses curly quotes) as opposed to typing it in manually.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine


        • #5
          Some Progress

          Thanks for your comments Ray. They are leading me in the right direction. I looked at the options in the HTML program I use and found that inserting <wbr> after hyphens is optional and can be turned off, so I will do that as time permits. (tried it on a test page and it works - can now search a phone number)
          The Character Set used is also optional with UTF-8 as default. I switched that to ISO 8859-1, because that seems to be the set you use. Now Pete's is encoded as Pete&#x2019;s, not as Pete&rsquo;s, which you said is correct. The Pete&#x2019;s format is NOT allowing Zoom to find Pete's. There are other character sets available but none that seem appropriate. I'll play with it more - do you have any ideas on this?
          whk
          Last edited by whk; Jun-20-2007, 02:56 PM. Reason: correct typing error


          • #6
            Originally posted by whk
            The Character Set used is also optional with UTF-8 as default. I switched that to ISO 8859-1, because that seems to be the set you use.
            Actually, Zoom supports both UTF-8 and iso-8859-1, as well as many other common charsets. The problem mentioned before, however, was the use of a character which is a legitimate UTF-8 character in any other text file or document, but would be considered "unsafe" to use in an HTML file. HTML specifies that certain characters need to be represented by what are known as HTML entities instead, such as "&rsquo;".

            Originally posted by whk
            Now Pete's is encoded as Pete&#x2019;s, not as Pete&rsquo;s which you said is correct.
            "&#x2019;" is actually a valid entity for the "right single quote" / apostrophe character. The entities can be written either in a named form ("&rsquo;"), which is the most common, or in decimal form ("&#8217;"), referring to the character's code point. A little rarer but equally valid is the hexadecimal form "&#x2019;".
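
            A quick way to confirm that the named, decimal, and hexadecimal forms refer to the same character is to decode them with Python's standard `html.unescape`. This is only a check on the entity forms themselves and says nothing about how Zoom handles them.

```python
import html

# Named, decimal, and hexadecimal references for the right single quotation mark.
forms = ["&rsquo;", "&#8217;", "&#x2019;"]
decoded = [html.unescape(form) for form in forms]

# All three decode to the same character, U+2019.
all_same = len(set(decoded)) == 1
print(all_same, [f"U+{ord(ch):04X}" for ch in decoded])
```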

            The reason this does not work is that Zoom does not, at this point, automatically convert "curly" quotes written as decimal or hexadecimal entities to their non-"curly" equivalent. Note that you are searching for "Pete's" (straight apostrophe) while it has actually been entered as "Pete’s" (curly apostrophe) on the webpage. We do this conversion automatically for the named entity form ("&rsquo;"), because people rarely type the curly apostrophe themselves (it is usually inserted by programs like MS Office and your webpage authoring app), but not yet for the numeric forms.
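
            The kind of conversion described above can be sketched in a few lines of Python. This is an illustration of the idea only, not Zoom's implementation; `normalize_quotes` and the mapping table are names invented for the example. It decodes every entity form first and then maps curly quotes to their straight equivalents.

```python
import html

# Map curly single and double quotes to their straight ASCII equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})

def normalize_quotes(markup_text):
    """Decode HTML entities, then replace curly quotes with straight ones."""
    return html.unescape(markup_text).translate(CURLY_TO_STRAIGHT)

for source in ["Pete&rsquo;s", "Pete&#8217;s", "Pete&#x2019;s", "Pete\u2019s"]:
    print(normalize_quotes(source))
```

            If both the indexed text and the search query pass through the same normalization, all four spellings become the same searchable word.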

            We will add it to the list of things to do for the next release, and this should mean that those quotes will then be searchable even in their numeric entity form.
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine


            • #7
              Ray, I'm having a similar problem to the one described at the beginning of this thread. I have an HTML document that has "1099-R Setup" as its title and first heading. Yet when I have hyphens selected as a join-word option, this document will not appear in the normal search results; I can only get it to appear as a Recommended Link. I've checked the doc code, and the hyphen in "1099-R" is just a normal hyphen character. (I even tried substituting a numeric entity reference for the hyphen and reindexed, but the results were the same.) The only way the document will show up as a regular hit is if I deselect the hyphen join-word option. Any ideas what might be wrong? I can send you the HTML file, screen shots of the search results, etc., if you like.

              Thanks,
              Robert Miles


              • #8
                Yes, e-mail us the HTML file in question, and tell us what you are entering into the search box exactly. We'll have a closer look.
                --Ray
                Wrensoft Web Software
                Sydney, Australia
                Zoom Search Engine


                • #9
                  Email sent

                  Email just sent, Raymond. I sent it from my personal account at work, so look for my name in the sending email address.

                  Thanks,
                  Robert Miles


                  • #10
                    We've had a look at the file, and we tried indexing it. We were able to search for "1099-R" without any problems, and the page was returned correctly. You didn't mention which script platform you are using, but we ended up trying it in PHP, ASP, JS, CGI and it all worked fine.

                    So there are a few things to check:
                    - Make sure you are using the latest build available on this page.
                    - Let us know if you have modified the search script in any way at all. Your changes may have broken functionality.
                    - Make sure the page was indexed. Does it show up in the natural results with any other keyword searches? Even though it appears for a recommended link, it might not actually have been indexed.

                    If you still can't find the problem, ZIP up the search files (including the search script, settings file, and all ZDAT files) and e-mail them to us.
                    --Ray
                    Wrensoft Web Software
                    Sydney, Australia
                    Zoom Search Engine


                    • #11
                      Answering your questions

                      Thanks for your reply, Ray! See below for responses to your questions:

                      Originally posted by Ray
                      You didn't mention which script platform you are using, but we ended up trying it in PHP, ASP, JS, CGI and it all worked fine
                      CGI.

                      Originally posted by Ray
                      - Make sure you are using the latest build available on this page.
                      We're on Version 5.1.1003. I'll try upgrading and see if that makes a difference.

                      Originally posted by Ray
                      - Let us know if you have modified the search script in any way at all. Your changes may have broken functionality.
                      I don't believe we have, but I'm confirming this with our site developer.

                      Originally posted by Ray
                      - Make sure the page was indexed. Does it show up in the natural results with any other keyword searches? Even though it appears for a recommended link, it might not actually have been indexed.
                      Yes, that's the other odd thing I forgot to mention -- if I search for the eID number at the top of the page (55805), the page appears at the top of the results (as a normal hit -- NOT as a Recommended Link).

                      Originally posted by Ray
                      If you still can't find the problem, ZIP up the search files (including the search script, settings file, and all ZDAT files) and e-mail them to us.
                      I'll try using the latest build, and if the problem persists and I confirm that we didn't change the search script, I'll do this.
