Frequently occurring words not found


  • Frequently occurring words not found

    Hi

    I've been a very happy user of Zoom 5.1 for several years, notably on two business magazine web sites that I run. I've seldom had any problems, so I don't often delve into the intricacies of the system, but I've suddenly noticed a major glitch. On both web sites, the search function fails to find words which I know appear VERY frequently. The search results page doesn't say there are no finds - it just displays the message "search results for" and the word in question, then nothing at all.

    You can try this on one of the sites, www.elogmag.com. Try looking for "delivery" or "deliveries" or "logistics". No finds at all. Yet as this magazine is all about stuff like home delivery, there must be literally hundreds of instances of these words. On the other web site, www.mlogmag.com, the system doesn't find "telematics", which is one of the main subjects of that publication.

    Is there some kind of limit to the number of instances of a word that Zoom will tolerate? I can't imagine there is, but I don't know what else could be wrong. With my current settings, the system happily finds hundreds of other words - just not these obvious ones (and quite possibly others that I haven't tried).

    Hopefully there's a setting I can alter when I do the indexing - but what? I'm sorry if I'm missing some obvious setting that I should know about. I've trawled through the configuration screens, but as an infrequent user, I'm not sure if I'm looking at the right things.

    Thanks so much for any help or direction here!

  • #2
    For more intensive searches (e.g. common words), I think your server might be killing the script before it can finish executing.

    The problem and its solution are described in this old post.

    • #3
      Hi

      Thank you for your very quick response to my query (on a weekend, too!). I followed the thread you mentioned, and I think it almost certainly explains our problem.

      We use a very large international shared hosting company. It is not easy to have an ongoing dialogue with them, and I think the likelihood of getting them to alter their Apache settings is probably slim to zero. But other aspects of their hosting are good, and we have a number of web sites hosted with them, so changing hosts is not an option to consider lightly.

      I see you advise trying a CGI script instead of PHP. I've never used this technology, so I would have to learn how it works from scratch. Presumably I would have to alter some of the coding in the web pages, would I? It sounds a complicated solution, but maybe it's not?

      I just uploaded your test script to the shared server (the script that looks for 9 million integers) and ran it, and it gave up when it reached "allocated memory for 195000 integers", which sounds a tiny number. Bearing this in mind, is it even worth bothering with the CGI alternative, or is the shortfall in capacity on the server simply too great?

      If there's no remedy but to change web hosts, it seems rather a drastic answer to making Zoom work as intended. Is there not a way of getting Zoom to work within this constraint - e.g. by halting the search after a certain time or a certain number of finds?

      Thanks for any further enlightenment.

      • #4
        Presumably I would have to alter some of the coding in the web pages, would I? It sounds a complicated solution, but maybe it's not?
        You would need to change any links that point to search.php to search.cgi. So that part should be easy. There should be no other HTML code changes required.

        The harder part is getting it installed and getting the file permissions correct. But if your server is running Apache, then you are probably hosting on Linux or Unix, which is an easier install than Windows. See this FAQ if you have a problem.
        http://www.wrensoft.com/zoom/support/faq_cgi.html

        The CGI is 5 times faster and uses maybe 1/3rd the RAM compared to PHP. So there is a good chance it will work even on a very resource restricted machine. (But you have to hope your hosting company hasn't entirely blocked the use of CGI).

        • #5
          Thanks for the further helpful information. Sounds as if I need to get my brain round the CGI option, and find out if it could work with our web host. I'll investigate this in the next day or two and report back to you. Currently crossing fingers!

          • #6
            Later that day ...

            Hi again

            I couldn't resist trying the CGI search option to see how it would pan out. Well, success! I've reindexed one of our sites and uploaded the files, and the system now DOES find words that it failed to find before (e.g. the word "delivery", which occurs 2,187 times). Hooray! So many thanks indeed for pointing me at a solution that works. (Currently it is only set up on certain trial pages.)

            However, I am now faced with a resultant problem. I was originally attracted to Zoom because it can use PHP, and I have customised the search results page by incorporating various include files and references to PHP variables. [To see how it currently looks, go to the home page of the site (www.elogmag.com), which still uses the PHP search process, and enter a search term that I know will be found: "solution".]

            With the CGI script, however, I don't see how I can do this. I realise I can painstakingly rebuild my search_template.html file and hard-code into it a lot of the components that are read into my current page with PHP, but the result will be a static page that can't read in specific files or references that are currently determined at runtime.

            Can you see an easy way round this? Is there some obvious way to incorporate PHP functionality into presentation of the search results, but still use CGI to do the actual work?

            If not, sadly it looks as though I will have regained full Zoom search functionality but lost a lot of the flexibility that the PHP version offered.

            Again, thank you for sticking with this query. It's a very helpful learning process.

            • #7
              You can use the same template file.

              If you want a PHP header or footer with CGI search results see this FAQ.

              • #8
                Once again thanks. This is extremely encouraging. It sounds as though I should be able to end up where I need to be with this. Thanks for pointing me to the thread about embedding CGI in scripts, which probably would have taken me ages to find on my own.

                I really will need a day or so to find the time to experiment properly with this, but I'll definitely get back to report the outcome.

                • #9
                  Hi again

                  First, thanks for your very quick responses throughout this thread, and second, sorry to keep coming back for more! It's really not my style to keep bothering support services and forums like this. The only reason I'm doing it now is that I feel slightly on the back foot. I thought Zoom was working correctly, and since realising it wasn't, I feel as if I've been desperately sucking in information to get back to where I thought I was already. If you see what I mean.

                  To recap, I've switched from PHP to CGI, and it works MUCH better. It's now finding words that it was missing before. But I keep coming up with new challenges, which is why I'm still here.

                  1. I wanted to incorporate the CGI search results in a PHP page. Your recommended way to read CGI scripts into PHP pages is to use the VIRTUAL command, but this doesn't seem to work for me, and I'm wondering if it's because PHP is possibly run in CGI mode on my shared hosting server, not as an Apache module. The search page just throws up the PHP error message saying the function is not available. Is there an easy way to tell if this is the problem, and if it is, would another PHP command work instead, such as EXEC?

                  2. I have set up the search page with basic formatting, and it seems to work fine, but when I search for words that occur very frequently, it seems to come up with the message "2187 results found", which can't be correct in every case. Do you know what is happening here?

                  3. If I select 100 finds per page, it will only display the first ten pages, even though it says there are 22 pages. Presumably this is because of a limit set somewhere in the configuration, but I can't work out which it is.

                  Thanks yet again for sticking with this and saving me endless trawling.

                  • #10
                    Originally posted by ontrack
                    To recap, I've switched from PHP to CGI, and it works MUCH better. It's now finding words that it was missing before.
                    Unless you are indexing over 65,000 pages (the PHP version will not allow you to index more than this), there would be no difference between the PHP and CGI versions in which words are indexed and which words are found.

                    You would only be seeing this behaviour if you have changed some other indexing configurations while changing the platform, OR you're using a really old build with a bug. What you describe isn't a known problem in the final V5.1 build that you can still download from here:
                    http://www.wrensoft.com/zoom/version5history.html

                    In case you don't realize, the current release is V6.

                    Originally posted by ontrack
                    1. I wanted to incorporate the CGI search results in a PHP page. Your recommended way to read CGI scripts into PHP pages is to use the VIRTUAL command, but this doesn't seem to work for me, and I'm wondering if it's because PHP is possibly run in CGI mode on my shared hosting server, not as an Apache module. The search page just throws up the PHP error message saying the function is not available. Is there an easy way to tell if this is the problem, and if it is, would another PHP command work instead, such as EXEC?
                    You might be able to use shell_exec() or similar. See the alternative example titled "PHP file on an IIS server" on this page:
                    http://www.wrensoft.com/zoom/support/faq_ssi.html
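
One way to tell which approach a server allows is to ask PHP itself. The following is only a rough sketch: virtual() is provided only when PHP runs as an Apache module, so a missing virtual() is consistent with PHP running in CGI mode. The helper name choose_embed_method is hypothetical and not part of Zoom.

```php
<?php
// Hypothetical helper: decide how a PHP page could pull in CGI output.
// $hasVirtual   - whether virtual() exists (only when PHP is an Apache module)
// $hasShellExec - whether shell_exec() appears to be available
function choose_embed_method(bool $hasVirtual, bool $hasShellExec): string
{
    if ($hasVirtual) {
        return 'virtual';
    }
    if ($hasShellExec) {
        return 'shell_exec';
    }
    return 'none';
}

// Report what this server offers. php_sapi_name() returns values such as
// "apache2handler" (module) or "cgi-fcgi" (CGI mode).
// Note: function_exists() is a rough check; on older PHP versions a function
// disabled by the host may still be reported as existing.
echo 'SAPI: ', php_sapi_name(), "\n";
echo 'Method: ', choose_embed_method(
    function_exists('virtual'),
    function_exists('shell_exec')
), "\n";
```

Uploading this as a small test page and loading it in a browser should show which route is worth trying first.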

                    Originally posted by ontrack
                    2. I have set up the search page with basic formatting, and it seems to work fine, but when I search for words that occur very frequently, it seems to come up with the message "2187 results found", which can't be correct in every case. Do you know what is happening here?
                    Can you elaborate on why this can't be correct? Do you have fewer than 2187 files indexed? Your description is too vague. Give us some examples, or preferably, show us the search page and the query in question.

                    Perhaps your index files are corrupted or perhaps you mixed files from different sessions? Perhaps you have substring matching enabled and a search for "cat" is matching "categories", "bobcat", "cathy", and other words you haven't considered?
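
To illustrate the substring point, here is a minimal sketch of how whole-word and substring matching differ. It is not Zoom's matching code, just the general idea; the function name matching_words and the word list are made up for the example.

```php
<?php
// Hypothetical illustration of whole-word vs. substring matching.
function matching_words(string $query, array $words, bool $substring): array
{
    $hits = [];
    foreach ($words as $word) {
        $isMatch = $substring
            ? stripos($word, $query) !== false   // query anywhere inside the word
            : strcasecmp($word, $query) === 0;   // word equals the query exactly
        if ($isMatch) {
            $hits[] = $word;
        }
    }
    return $hits;
}

$words = ['cat', 'categories', 'bobcat', 'cathy', 'dog'];
print_r(matching_words('cat', $words, false)); // only "cat"
print_r(matching_words('cat', $words, true));  // "cat", "categories", "bobcat", "cathy"
```

With substring matching on, a short query can match far more pages than expected, which inflates the result count.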

                    Originally posted by ontrack
                    3. If I select 100 finds per page, it will only display the first ten pages, even though it says there are 22 pages. Presumably this is because of a limit set somewhere in the configuration, but I can't work out which it is.
                    The default optimization setting (the slide bar on the "Limits" tab) restricts searches to a maximum of 1000 returned results in V5 (just like Google). You can increase the accuracy and the maximum number of results returned by dragging it towards "Slower (more accurate)".

                    Note that all pages were still taken into consideration in the search; there's just little point in returning the 1001st most relevant result - especially when excluding it provides room for optimization.

                    The end user is alerted to what's happening and the following message should have appeared: "Your search query contained too many common words to return the entire set of results available. Please try again with a more specific query for better results."
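
The arithmetic behind "22 pages reported, 10 pages shown" can be sketched as follows. This is an illustration using the figures from this thread, assuming the page count is computed from the total matches and the cap is applied to what is returned; the helper name page_counts is hypothetical.

```php
<?php
// Hypothetical helper: pages reported vs. pages actually reachable when the
// number of returned results is capped.
function page_counts(int $totalMatches, int $perPage, int $maxReturned): array
{
    $reported  = (int) ceil($totalMatches / $perPage);
    $reachable = (int) ceil(min($totalMatches, $maxReturned) / $perPage);
    return ['reported' => $reported, 'reachable' => $reachable];
}

// Figures from this thread: 2187 matches, 100 per page, cap of 1000.
print_r(page_counts(2187, 100, 1000)); // reported 22, reachable 10
```

Raising the cap to at least the number of matches makes the two figures agree.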
                    --Ray
                    Wrensoft Web Software
                    Sydney, Australia
                    Zoom Search Engine

                    • #11
                      Hi Ray

                      Thanks for yet more quick and helpful responses. I need to absorb them and experiment (I'm in the UK, so I guess there may be a delay in this exchange).

                      In the meantime, on your specific responses:

                      I've switched from PHP to CGI, and it works MUCH better...
                      Sorry - I didn't recap on the history here. I wasn't complaining that Zoom was getting the wrong number of words. This whole thread started because Zoom was failing to find ANY words when I typed in very common words. The script just gave up. Your colleague correctly identified this as a timeout problem on the shared hosting server that we use, so I switched to the CGI version, and it now does find those words. Problem solved!

                      it seems to come up with the message "2187 results found", which can't be correct in every case. ...
                      Sorry - I was trying to keep it brief! What I'm now finding is that when I search for certain words that happen to be very common on our web site (words that Zoom was not able to handle at all before), it now DOES find the references, but with several different common words, it always reports the same total number of finds.

                      The URL is www.elogmag.com, where I have now put a temporary search results page using the CGI option. Try these words: "delivery", "unattended", "logistics", "web", "online" and "software". In every case it reports "2187 results found. 219 pages of results."

                      I don't think there is a problem with the index files. I reindexed the site with the CGI option a couple of days ago, and I'm using all the new index files in a different directory from the old search.

                      I realise the quoted figure is not very important, but some visitors might notice eventually if it keeps coming up with the same figure. I just felt I should try to understand what's happening here.

                      If I select 100 finds per page, it will only display the first ten pages ...
                      Thanks for explaining this. I totally agree that usually no one wants the 1001st find! But our site is partly a magazine archive, where people doing the searching are sometimes following the history of a subject or concept for research, so in a few cases very old finds may be valid. To be honest, though, I probably need to get a life here! I just wanted to understand the mechanics of it, so I'm now happy with this.

                      I wanted to incorporate the CGI search results in a PHP page. Your recommended way to read CGI scripts into PHP pages is to use the VIRTUAL command, but this doesn't seem to work for me, and I'm wondering if it's because PHP is possibly run in CGI mode on my shared hosting server, not as an Apache module. The search page just throws up the PHP error message saying the function is not available. Is there an easy way to tell if this is the problem, and if it is, would another PHP command work instead, such as EXEC?
                      I didn't look at the IIS example, as the server we use runs Linux, but I guess it might also work on Linux. I'll experiment with SHELL_EXEC and report back.

                      Thanks for your continuing help with this.

                      Peter

                      • #12
                        Originally posted by ontrack
                        Sorry - I didn't recap on the history here. I wasn't complaining that Zoom was getting the wrong number of words. This whole thread started because Zoom was failing to find ANY words when I typed in very common words. The script just gave up. Your colleague correctly identified this as a timeout problem on the shared hosting server that we use, so I switched to the CGI version, and it now does find those words. Problem solved!
                        Oh I see, I missed that. If the PHP version was being terminated because it exceeded your server's timeout limit, and the faster CGI version is able to complete the search within the server timeout, then it would certainly work better. The CGI, by design and implementation, is much faster than the PHP script.

                        Originally posted by ontrack
                        The URL is www.elogmag.com, where I have now put a temporary search results page using the CGI option. Try these words: "delivery", "unattended", "logistics", "web", "online" and "software". In every case it reports "2187 results found. 219 pages of results."

                        I don't think there is a problem with the index files. I reindexed the site with the CGI option a couple of days ago, and I'm using all the new index files in a different directory from the old search.

                        I realise the quoted figure is not very important, but some visitors might notice eventually if it keeps coming up with the same figure. I just felt I should try to understand what's happening here.
                        I think it's just the case where you have 2187 files indexed, and those words appear in every file. I can't verify that easily from here however.

                        In some cases, I think you may have headers, footers, and other navigation components like menus etc, which appear on every page and add to the commonality of certain words. Take a look at this FAQ and you can exclude these sections from being indexed, which should help lower the number of words that appear on every page:
                        Q. How do I prevent parts of my webpage from being indexed (eg. exclude navigation menus, or page footers)?

                        Of course, if those words naturally appear in the content of every page indexed, then that won't eliminate the behaviour.
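
A small sketch of why repeated page furniture makes every query return the same count. The page texts and footer below are made up, and the helper pages_containing is hypothetical; it simply counts how many pages contain a word, with and without the shared footer.

```php
<?php
// Hypothetical illustration: count how many pages contain a word.
function pages_containing(string $word, array $pages): int
{
    $count = 0;
    foreach ($pages as $text) {
        if (stripos($text, $word) !== false) {
            $count++;
        }
    }
    return $count;
}

$footer = 'News on logistics, delivery and software.'; // made-up sign-off text
$bodies = [
    'Warehouse automation trends',
    'Home delivery lockers reviewed',
    'Fleet telematics update',
];

// Page body plus footer, as indexed when the footer is not excluded.
$withFooter = array_map(function ($body) use ($footer) {
    return $body . ' ' . $footer;
}, $bodies);

echo pages_containing('delivery', $withFooter), "\n"; // 3: every page matches
echo pages_containing('delivery', $bodies), "\n";     // 1: only the real article
```

Scaled up to 2187 indexed files, any word in the shared footer would match all 2187 of them.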

                        Originally posted by ontrack
                        Thanks for explaining this. I totally agree that usually no one wants the 1001st find! But our site is partly a magazine archive, where people doing the searching are sometimes following the history of a subject or concept for research, so in a few cases very old finds may be valid. To be honest, though, I probably need to get a life here! I just wanted to understand the mechanics of it, so I'm now happy with this.
                        Just to clarify in case it's not clear - the "old" files can still be found; the user just needs to come up with better search terms to bring them closer to the top of the results (e.g. add more words that are relevant only to the article they're looking for).
                        --Ray

                        • #13
                          In some cases, I think you may have headers, footers, and other navigation components like menus etc, which appear on every page and add to the commonality of certain words.
                          Bingo! This must be the explanation. In fact, I thought I had been very conscientious about this. My pages have always had ZOOMSTOP codes all over the page, especially on navigation menus, so I thought I had it covered. But I failed to use them on a piece of sign-off text in the footer (which appears on every page), and I also didn't use them on the meta keyword tags. I suppose I was thinking I only needed them for visible text that appears on the screen.

                          Sorry for bothering you about such a basic mistake!

                          You might be able to use shell_exec() or similar. See the alternative example titled "PHP file on an IIS server" on this page: http://www.wrensoft.com/zoom/support/faq_ssi.html
                          I'm now down to the last aspect of my original query! I would like to stick to the CGI version of the indexing if possible, now that I see the benefits, but I'm struggling slightly with my need to call the CGI code from a PHP page. As detailed before, I'm finding that the VIRTUAL function is not recognised (possibly because my web host has configured Apache to run PHP in CGI mode, not as a module, though I haven't absolutely verified this), so I wanted to try a different function or command.

                          You've suggested the IIS code mentioned on the page above, which uses SHELL_EXEC, but I can't get this to work, even after tweaking the code in various ways. I realise the Windows-type file path in your example code has to be different on a Linux system, but although I've tried it in numerous ways (e.g. using no path, since the CGI file is in the same directory as the script; using an absolute URL; and using localhost), in every case I just get a blank page. Can you give me any more clues about the exact code to use?

                          Meanwhile, I'd like to thank you and your colleagues again for resolving all my other issues.

                          • #14
                            A full path is likely necessary for shell_exec() on a Linux machine. It would most likely need to be a proper absolute path at the disk level (rather than a relative link within the website folder) for it to find the file. This means something more like "/usr/home/mysite/public_html/cgi-bin/search.cgi". You may need to check with your host (or use telnet or ftp) to confirm what the full path is.
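
As a rough sketch of what such a call might look like (this is not the code from the FAQ): CGI programs read the request from environment variables such as QUERY_STRING, so these are set before calling shell_exec() with the absolute path. The helper names run_cgi and strip_cgi_headers are hypothetical, and the sketch assumes the CGI prints a header block followed by a blank line before the HTML.

```php
<?php
// Hypothetical helper: drop the CGI response headers (everything up to the
// first blank line) so only the HTML body is embedded in the PHP page.
function strip_cgi_headers(string $output): string
{
    // Only strip when the output starts with a "Name: value" header line.
    if (!preg_match('/^[A-Za-z-]+:/', $output)) {
        return $output;
    }
    $parts = preg_split("/\r?\n\r?\n/", $output, 2);
    return count($parts) === 2 ? $parts[1] : $output;
}

// Hypothetical helper: run a CGI program from PHP with the visitor's query.
function run_cgi(string $cgiPath, string $queryString): string
{
    // CGI programs read the request from environment variables.
    putenv('REQUEST_METHOD=GET');
    putenv('QUERY_STRING=' . $queryString);
    $output = shell_exec(escapeshellarg($cgiPath));
    // shell_exec() does not return a string on failure or empty output.
    return is_string($output) ? strip_cgi_headers($output) : '';
}

// Example path from the post above; replace it with the real path on disk.
// echo run_cgi('/usr/home/mysite/public_html/cgi-bin/search.cgi',
//              $_SERVER['QUERY_STRING'] ?? '');
```

A blank page usually means shell_exec() returned nothing, which is what happens when the path does not resolve to an executable file.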
                            --Ray

                            • #15
                              A full path is likely necessary for shell_exec() on a Linux machine. It would most likely need to be a proper absolute path at the disk level (rather than a relative link within the website folder) for it to find the file. This means something more like "/usr/home/mysite/public_html/cgi-bin/search.cgi". You may need to check with your host (or use telnet or ftp) to confirm what the full path is.
                              Hi Raymond

                              Fantastic! That works! I checked on my shared web host's help section to find out how to reference the absolute path, and they say you have to access it with $DOCUMENT_ROOT. So I incorporated that in your IIS script for calling CGI routines, and it works. On our server, if I were using the example above, $DOCUMENT_ROOT would replace everything up to the slash before the directory where the CGI file resides.
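
For anyone following along, the path-building step might look roughly like this. It is a sketch only: it assumes the document root is available as $_SERVER['DOCUMENT_ROOT'] (the exact variable can differ between hosts), and the helper name cgi_disk_path and the fallback path are illustrative.

```php
<?php
// Hypothetical helper: join the server's document root and a site-relative
// path into an absolute path on disk.
function cgi_disk_path(string $documentRoot, string $relativePath): string
{
    return rtrim($documentRoot, '/') . '/' . ltrim($relativePath, '/');
}

// In a web request the root normally comes from $_SERVER['DOCUMENT_ROOT'].
// The fallback value and "cgi-bin/search.cgi" reuse the example path above.
$root = ($_SERVER['DOCUMENT_ROOT'] ?? '') ?: '/usr/home/mysite/public_html';
echo cgi_disk_path($root, 'cgi-bin/search.cgi'), "\n";
```

The resulting string is what gets passed to shell_exec() in place of a hard-coded path.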

                              It's now down to me to play around with the search template and get it how I want it. Thanks so much again for everyone's help in this thread. It's so nice to talk to people who have the answers!

                              Peter
