Partial match on the right side only


  • Partial match on the right side only

    Our language is very complex and I have to use partial matches to get anything. The problem is that the partial match option in Zoom works like:

    *word*

    not

    word*

    We have different endings for the same word, and using *word* returns many wrong words. Is there an option somewhere, or a workaround, for using word* as the default (without the user having to type the wildcard, of course)? If not, can I somehow add the * to the end of every word in a search, or do something else?

    Not to mention that searching for *word* cannot use any indexes, so it is very, very slow.

    Jerry

  • #2
    Our language is very complex
    I am assuming you are referring to the English language here.

    I would advise turning off the sub-string matching option in Zoom and turning on stemming.

    Stemming was designed to deal with this situation.


    • #3
      No, I am from Slovenia, and there is no Slovene stemmer ;-( So I need the substring match, but the substring should match from the beginning, like word*, not like *word*. Is it possible to add such an option? It would be useful for all languages that have no stemmer.

      I have tried enabling substring matching and using "word" as an exact match, but "word" still finds all *word* matches, which I think is wrong. "word" should find only word and nothing else.

      As it is, I am unable to find our similar words with different endings. With substring matching I get many wrong words (and it is slow); without it I only get exact words. I need:

      1. To be able to use word* automatically
      2. Or to be able to use "word" to find word only: give users the option to look for the exact word if it is in quotes, and for a substring match without quotes.

      I would prefer the first option. I guess I could hack the ASP search.asp, but I am using CGI, which is not "hackable".

      All languages without a stemmer will face this same problem. Usually a substring match is word*, not *word*, because we are mostly interested in different endings; a different beginning means a different word (and the search is much slower because the index cannot be used properly; at least in a database it is like that).

      Is there any chance of an option to choose whether a substring match may also vary at the beginning of the word or only at the end? Or is there any other way for me to give my users a good search experience?
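
The index argument in this post can be illustrated with a small Python sketch. It is not how Zoom's index works (the thread does not describe that); it only shows the general principle that a sorted word list lets a word* lookup jump straight to the matching range, while a *word* lookup has to examine every entry. The word list and function names are made up for illustration.

```python
from bisect import bisect_left

def prefix_matches(sorted_words, prefix):
    """Return every word that starts with prefix (the word* case).

    Binary search finds the first candidate; all words sharing the
    prefix sit next to each other in a sorted list, so we stop at the
    first word that no longer starts with it.
    """
    start = bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches

def substring_matches(words, fragment):
    """Return every word that contains fragment (the *word* case).

    Sorting does not help here, so every word is examined.
    """
    return [word for word in words if fragment in word]

# Hypothetical index contents, kept sorted.
index = sorted(["cat", "password", "sword", "word", "wording", "words"])

print(prefix_matches(index, "word"))     # ['word', 'wording', 'words']
print(substring_matches(index, "word"))  # ['password', 'sword', 'word', 'wording', 'words']
```

The substring version also shows the quality problem described above: "password" and "sword" are different words, not different endings of "word".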


      • #4
        The source code for the CGI is here if you wanted to customize / hack it.

        Yes, the stemmer is only supported for 16 languages at the moment.

        You could go back to doing exact word matches. This was the way the product used to be in the past (and so was Google for that matter) and it wasn't so bad.

        Another option would be to write a script in ASP/PHP that called the CGI, and got the CGI to return results in XML format. Then you can parse the results in your script to display what you want. Importantly you could also add on the '*' wild card to the end of search terms when you thought it was required. You then get the speed of the CGI with control over the input and output of the engine.
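
The "add on the '*' wildcard" step of such a wrapper might look roughly like the Python sketch below. The thread suggests ASP or PHP for the wrapper; Python is used here only to show the logic, and the function name is hypothetical. It assumes partial matching is disabled in Zoom, so the trailing wildcard is the only source of partial matches.

```python
def add_trailing_wildcard(query):
    """Append '*' to every whitespace-separated term of the query.

    Deliberately naive: it does not look at quotes or at wildcards the
    user typed, it simply turns each term into term*.
    """
    return " ".join(term + "*" for term in query.split())

print(add_trailing_wildcard("word"))       # word*
print(add_trailing_wildcard("red apple"))  # red* apple*
```

The wrapper would pass the rewritten query on to the CGI and handle the returned results itself.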


        • #5
          Unfortunately I don't have Visual Studio, so source is no good for me.

          About the second option: I'll have to look at how to get results from the CGI as XML; I guess this is an option somewhere. But then I'd need to make my own parser, I guess, to filter the substrings again and remove results that do not begin with the word.

          Would you at least consider adding this as an option? In the source it would not need much, I guess. Maybe for V7?

          But I think that an exact match "word" finding words, wording etc... is wrong, even when substring matching is on. An exact search has to be an exact search; otherwise searching for word or "word" is the same. I would think this could be a bug.


          • #6
            Exact matches are really no good, because one of our words has about 16 different endings, depending on how it is used.


            • #7
              If you had a wrapper script, you would not need to "filter substrings again" in the suggested scenario above. The idea is, you disable partial match and in your script, you take the query, add "*" to the end, before passing it to the CGI.

              Another way to do this is to use some JS to do it on the client side (add "*" to the end of zoom_query before submitting the form values).

              While we would like to add every request that comes in, it simply isn't feasible and we have to prioritize based on demand. You're the first person to ask for a partial match option that only works on the end of the word. So we simply can't prioritize it reasonably at this point (note that each option added means it has to be ported to each of our script platforms: PHP, ASP, JS, ASP.NET, CGI, as well as indexer configurations etc.).

              It also won't satisfy everybody because word* is unlikely to always be appropriate.
              --Ray
              Wrensoft Web Software
              Sydney, Australia
              Zoom Search Engine


              • #8
                English has 12 verb tenses, but we managed without a stemmer for a long time.

                The best solution would be to have a Slovene stemmer. Maybe you can use your knowledge of the language to help create one, or fund development.
                It needs to be Snowball compatible. See,
                http://snowball.tartarus.org/
                (or there might be one someone has done already we can just drop in).


                • #9
                  How did you manage with 12 verb tenses using exact match? Did the user need to enter the word in all 12 tenses to get all the results?


                  • #10
                    Originally posted by Ray
                    If you had a wrapper script, you would not need to "filter substrings again" in the suggested scenario above. The idea is, you disable partial match and in your script, you take the query, add "*" to the end, before passing it to the CGI.

                    Another way to do this is to use some JS to do it on the client side (add "*" to the end of zoom_query before submitting the form values).

                    While we would like to add every request that comes in, it simply isn't feasible and we have to prioritize based on demand. You're the first person to ask for a partial match option that only works on the end of the word. So we simply can't prioritize it reasonably at this point (note that each option added means it has to be ported to each of our script platforms: PHP, ASP, JS, ASP.NET, CGI, as well as indexer configurations etc.).

                    It also won't satisfy everybody because word* is unlikely to always be appropriate.
                    Although I am not quite happy with your answer, I understand your reasons. Maybe some other customers with this problem will post here, so that there are "more of us".

                    The second method, with JavaScript, could work. There is only one problem: the results page would show that I searched for word*, not word, and that would make users wonder what the * is.

                    About the first method, did you mean that before:

                    env.Item("QUERY_STRING") = Request.QueryString

                    I break the query string into parameters and add * to the search words? That would also work, but again, I would have a * at the end of the word in "results for search word: word*" that I couldn't get rid of...
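
Breaking the query string into parameters and adding * to the search words could be sketched as follows. This is a Python illustration of the logic rather than ASP code. `zoom_query` is the parameter name mentioned earlier in the thread; the `page` parameter in the usage line and both function names are hypothetical. The display helper only helps if the wrapper itself renders the heading (for example from XML results); it cannot change a heading that the CGI prints directly.

```python
from urllib.parse import parse_qsl, urlencode

def rewrite_query_string(query_string, param="zoom_query"):
    """Return the query string with '*' appended to each search term.

    Only the parameter named `param` is changed; every other
    parameter is passed through untouched, in its original order.
    """
    rewritten = []
    for name, value in parse_qsl(query_string, keep_blank_values=True):
        if name == param:
            value = " ".join(term + "*" for term in value.split())
        rewritten.append((name, value))
    return urlencode(rewritten)

def display_query(query):
    """Strip trailing '*' from each term before showing it to the user.

    Note that this also hides a wildcard the user typed deliberately.
    """
    return " ".join(term.rstrip("*") for term in query.split())

new_qs = rewrite_query_string("zoom_query=red+apple&page=2")
print(dict(parse_qsl(new_qs)))          # {'zoom_query': 'red* apple*', 'page': '2'}
print(display_query("red* apple*"))     # red apple
```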
                    Last edited by jerry2; May-25-2011, 11:09 AM.


                    • #11
                      But I believe this is a bug:

                      Searching for "word" also returns words. Although I have partial matching on, the " delimiter should find an EXACT match, but it doesn't. So words and "words" behave the same; partial matching is applied to the exact search as well. I don't think this is the desired behavior.


                      • #12
                        Wrapping a single word in quotation marks does not have any effect in Zoom. The quotation marks only work for "exact phrase" matches where it contains more than one word. See this page for the search syntax supported in Zoom:
                        http://www.wrensoft.com/zoom/support/searchtips.html

                        So it is not a bug. You might expect that behaviour however because of Google's syntax, and that's a fair assumption. However, Google doesn't support wildcards while we do, so it's never intended to be the same. We could look at it as a possible change for the future though.

                        And yes, adding "*" to the end of words will change the heading, but this would be good for users to note why certain unexpected words might be matching as a result of this assumption. For example, why someone searching for "cat" is getting "categories". I'm sure there are similar cases in Slovene. As noted before, a Slovene stemmer would be a better solution.
                        --Ray


                        • #13
                          Not only the heading, but also the form field; and when searching again, it would end up with two ** at the end. On top of that, it would need to avoid adding the * at the end of a quoted "word"...
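
Both concerns (a second * on a repeated search, and leaving a quoted "word" alone) can be handled in the rewriting step itself. The Python sketch below shows one possible approach; the function name and tokenizing rule are assumptions made for illustration, other Zoom search syntax is not handled, and the same logic could be ported to JavaScript or a server-side wrapper.

```python
import re

# A token is either a double-quoted run (kept whole) or a run of
# non-whitespace characters.
TOKEN = re.compile(r'"[^"]*"|\S+')

def add_wildcards_once(query):
    """Append '*' to plain terms only.

    Terms that already end in '*' and anything starting with a double
    quote are left unchanged, so running the function on its own
    output changes nothing.
    """
    rewritten = []
    for token in TOKEN.findall(query):
        if token.startswith('"') or token.endswith("*"):
            rewritten.append(token)
        else:
            rewritten.append(token + "*")
    return " ".join(rewritten)

print(add_wildcards_once('word'))                 # word*
print(add_wildcards_once('word*'))                # word*
print(add_wildcards_once('"exact phrase" word'))  # "exact phrase" word*
print(add_wildcards_once('"word" apple'))         # "word" apple*
```

With partial matching disabled, a term that is left without a wildcard would be matched as an exact word, which is close to the "quotes mean exact" behaviour asked for earlier in the thread.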

                          Your notes are valid. I still believe an end-only partial match is a good solution for all languages without stemmers, and Slovenian is one of them. We don't have any public stemmers available; I think our search engine company has a private one. Making a stemmer for the Slovenian language is... well, if it were fairly easy, it would have been done by now. We have not only singular and plural but also a dual ("dualismo"). In English you have one and many; we have one, two and many. Every word has a different ending depending on what is behind... We have so many exceptions, etc...

                          When I do a MySQL database search, I get quite good results with the word* solution. Of course *word* brings in whole new words, which is not good.

                          I didn't know that "word" is by design the same as word and that quotes only work for phrases. Why is that? You have an English stemmer, so searching for apple should also bring up apples, but what if a user wants to find EXACTLY one apple? I would say the user types "apple", so it isn't stemmed. So how can one search for a single apple if the database is set to use the stemmer? I think you cannot.

                          Of course I cannot tell you what to do; I can only offer some advice as a customer. I see great benefit in being consistent. If "" means exact search, that means exact search to me: one or more words, but exact. No substrings and no stemmers.

                          As it is, I have 2 bad solutions:

                          1. Give users substring search, which brings many bad results.
                          2. Add the * with jQuery, which is not a very elegant solution and leaves users wondering what the * is and how to disable it.
                          3. Use neither the stemmer nor substring matching, so the user has to search for a word 10 times. Also not good.

                          OK, that makes 3 solutions, but none is acceptable to the client.


                          • #14
                            Found this Slovene stemmer definition on the interwebby,
                            http://snowball.tartarus.org/archives/snowball-discuss/0670.html
                            http://snowball.tartarus.org/archives/snowball-discuss/att-0670/01-slo.proc
                            Looks a bit old and brief to be comprehensive. Hard to know not being a Slovene linguistics expert.

                            Just found a follow up thread
                            http://snowball.tartarus.org/archives/snowball-discuss/0672.html
                            that seems to indicate it could be better, "The script you have written could definitely be imporved by use of among constructions..."

                            And there was this one,
                            http://article.gmane.org/gmane.comp.search.snowball/700/match=slovene


                            • #15
                              There was also a research paper,
                              Popovic M and Willett P (1990) Processing of documents and queries in a Slovene language free text retrieval system. Literary and Linguistic Computing, 5: 182-190
