Mixed Case URL Variation in Skip List... Am I doing this right?

  • Mixed Case URL Variation in Skip List... Am I doing this right?

    Woohoo!

    Just got Zoom Search running at http://www.dadsworksheets.com and it's awesome! Thank you!

    I do seem to be having some trouble getting the indexer/spider to skip certain classes of URLs. In particular, it seems that buried on the site I have URLs that mention the mixed case form "DadsWorksheets.com" whereas everything elsewhere is all lower-case "dadsworksheets.com". What happens is we get search results mentioning the same page twice. For example, a search for 'Multiplication Worksheets' winds up returning the same page as two distinct search results, one with the mixed case URL variant and one with the lower case variant. You can see this right now by visiting the home page and trying the site search form on the upper-right part of the home page or just hit this link with a multiplication worksheet example.

    I've added "www.DadsWorksheets.com" to the 'Skip Options' and re-spidered the entire site, but no luck. The bit out of my config file looks like this where I'm skipping this /v1/ directory and also trying to skip the mixed case URLs...

    Code:
    #SKIPPAGES_START
    www.dadsworksheets.com/v1/
    www.DadsWorksheets.com
    #SKIPPAGES_END

    I also tried a rewrite rule, that looks like this in my config...

    Code:
    #REWRITELINKS:1
    #REWRITEFIND:DadsWorksheets
    #REWRITEWITH:dadsworksheets

    I even went as far as trying to add a 301 redirect in my Apache configuration to point the mixed case URL to the all-lower version, restarted Apache, re-spidered with Zoom, and still, same thing.

    I did install Zoom in a subdirectory, but have the search form in the site root. However, I've been conscientious about the location of the zcfg file being in the root as well and where it writes the index files. I verified there is only one set of index files on the site so it's not something dumb like spidering one set of files but reading a different one from the search settings in some temp directory or something. I think anyway.

    Running out of ideas and could use another set of eyes on this.

    Thanks for your help!

    Jim



  • #2
    Just as background: many web servers on the internet are based on Linux, which is case sensitive. So the skip list is as well.
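
    To illustrate why case matters here, a minimal Python sketch follows. It is not Zoom's actual matching code: the substring-match behaviour and the function names `is_skipped` and `normalize_host` are assumptions made for illustration. It compares a case-sensitive skip check against one that lowercases the host first.

```python
from urllib.parse import urlsplit, urlunsplit

def is_skipped(url, skip_list):
    """Case-sensitive substring match (assumed behaviour, for illustration only)."""
    return any(entry in url for entry in skip_list)

def normalize_host(url):
    """Lowercase only the scheme and host; the path is left untouched
    because paths can be case sensitive on the server."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))

skip_list = ["www.dadsworksheets.com/v1/"]
url = "http://www.DadsWorksheets.com/v1/old.html"

print(is_skipped(url, skip_list))                  # False: the host case differs
print(is_skipped(normalize_host(url), skip_list))  # True once the host is lowercased
```

    Only the host is lowercased because host names are case insensitive, while the path part of a URL may not be.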

    Ignoring Zoom for a moment, a couple of comments come to mind.

    1) It would be better if all your internal links were relative links (i.e. they didn't include "http://" or the domain name). I say this because you might one day want the site to run on a different domain, or you might want to run the site as HTTPS (this will be standard in a few years).

    2) Maybe you can do a search and replace on your HTML source code to replace all instances of www.DadsWorksheets.com with www.dadsworksheets.com. There are tools available to do this across multiple files very quickly.
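
    As a rough sketch of such a search and replace in Python (the function name `replace_in_tree`, the `*.html` pattern and the UTF-8 encoding are assumptions, so adjust them to your files and back up the site first):

```python
from pathlib import Path

OLD = "www.DadsWorksheets.com"
NEW = "www.dadsworksheets.com"

def replace_in_tree(root, old=OLD, new=NEW, pattern="*.html"):
    """Rewrite every matching file under root and return the paths that changed."""
    changed = []
    for path in Path(root).rglob(pattern):
        text = path.read_text(encoding="utf-8")
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(path)
    return changed

# Example call (the directory name is hypothetical):
# replace_in_tree("/var/www/html")
```

    Note that this replaces every occurrence in the file, including any place where the mixed case name is shown as visible text rather than used in a link.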

    Having said that, what you have done in the Skip List looks like it should have worked. I see you are using the Linux release of Zoom. Maybe there is a bug in the Linux release. Give us a couple of days and we'll see if we can see the same behaviour from here.


    • #3
      Okay, so here's an interesting twist.

      The site has pages like this printable graph paper page and if you look at the bottom of the actual graph paper, there's a div that has a URL in it for display purposes with that mixed case form, but it's not an actual link. I think what's going on is the spider is picking up the URL out of that <div> and thinking it's an <a> and crawling it. Which seems, at best, counterintuitive, and perhaps more likely a bug...


      • #4
        I don't think the link is coming from that text.

        I am doing a test index of your site, and Zoom came across this link:
        Code:
        http://www.dadsworksheets.com/worksheets.multiplication.html

        Now it may happen to be a broken link, but in any case, look at the HTTP header / response, either via your browser's Developer Tools or using an HTTP viewer by entering that URL here:
        http://www.rexswain.com/httpview.html

        You will see the following:

        Code:
        Receiving Header:
        HTTP/1.1·302·Found(CR)(LF)
        Date:·Fri,·20·Jan·2017·06:39:48·GMT(CR)(LF)
        Content-Type:·text/html;·charset=UTF-8(CR)(LF)
        Transfer-Encoding:·chunked(CR)(LF)
        Connection:·close(CR)(LF)
        Set-Cookie:·__cfduid=d92daf8908fd240963820b5839a8673571484894388;·expires=Sat,·20-Jan-18·06:39:48·GMT;·path=/;·domain=.dadsworksheets.com;·HttpOnly(CR)(LF)
        Expires:·Wed,·11·Jan·1984·05:00:00·GMT(CR)(LF)
        Cache-Control:·no-cache,·must-revalidate,·max-age=0(CR)(LF)
        Link:·<http://www.dadsworksheets.com/wp-json/>;·rel="https://api.w.org/"(CR)(LF)
        Status:·301·Moved·Permanently(CR)(LF)
        Location:·http://www.DadsWorksheets.com(CR)(LF)
        Server:·cloudflare-nginx(CR)(LF)
        CF-RAY:·32408806e2044fcf-DEN(CR)(LF)
        (CR)(LF)

        So it seems that the redirect set up on your server is pointing to the mixed case domain name (see the Status and Location lines in the header).
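
        If you would rather check this from a script than from a web-based viewer, here is a minimal Python sketch using only the standard library. The function names `redirect_target` and `has_mixed_case_host` are made up for illustration, and the use of a HEAD request is an assumption (some servers answer HEAD differently from GET).

```python
import http.client
from urllib.parse import urlsplit

def redirect_target(url):
    """Return (status, Location header) for url without following the redirect."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def has_mixed_case_host(location):
    """True if the host part of a Location value contains uppercase letters."""
    host = urlsplit(location).netloc
    return host != host.lower()

# Example (needs network access, so it is left commented out):
# status, location = redirect_target("http://www.dadsworksheets.com/worksheets.multiplication.html")
# print(status, location, has_mixed_case_host(location or ""))
```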

        In any case, this could be avoided by enabling the feature under "Configure"->"Scan options"->"Duplicate page detection"->"Use CRC to skip files with identical content".
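
        The idea behind that option can be sketched in a few lines of Python. This is only an illustration of CRC-based duplicate detection in general, not Zoom's implementation, and `find_duplicates` is a hypothetical name.

```python
import zlib

def find_duplicates(pages):
    """pages maps url -> page body (bytes).
    Returns a list of (duplicate_url, first_url_with_same_crc)."""
    seen = {}
    dupes = []
    for url, body in pages.items():
        crc = zlib.crc32(body)
        if crc in seen:
            dupes.append((url, seen[crc]))  # same checksum as an earlier page
        else:
            seen[crc] = url
    return dupes

pages = {
    "http://www.dadsworksheets.com/": b"<html>home</html>",
    "http://www.DadsWorksheets.com/": b"<html>home</html>",
    "http://www.dadsworksheets.com/v1/": b"<html>old</html>",
}
print(find_duplicates(pages))
```

        Two pages reached through different URLs are treated as the same page when their content checksums match. A CRC can collide in principle, and pages whose content differs even slightly will not be caught.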

        But it's probably best to fix the redirect for consistency and to avoid SEO problems.
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine


        • #5
          More info -- the broken link is on this page:
          http://www.dadsworksheets.com/category/rocket-math/

          For the link "Lots and lots of multiplication worksheets".

          It's generally not a good idea to use redirects when a page is not found. This makes it very difficult for you to actually find (and fix) your broken links. Most users get confused/frustrated with this, and also Google will penalise the redirection as it can seem like fake content with auto generated URLs.

          A proper 404 "file not found" error and a user-friendly 404 page are preferred. Zoom can also tell you where the broken links are if you filter your log messages for them.
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine


          • #6
            Thanks Ray... I'll look into it. Yes, all the 404s are redirecting to the home page right now and I know that's not great practice... Thanks for the nudge to get this cleaned up. If nothing else, I'll go figure out where that redirect is occurring and make sure the domain in the destination URL isn't mixed-case because that's clearly stupid...


            • #7
              Wow. Okay, still not working. Figured some things out, but broke it worse in the process.

              I think I had all of the problems we discussed above, which basically produced a set of index files that included the mixed case domains. But I fixed all of that, and also tried skipping the mixed case URL, CRC detection, everything. Even fixed up the bad link AND the 404 page so that it doesn't redirect to the mixed case. All of which should have corrected this, but still, I'd spider and re-search and same problem. Zoom spider log doesn't even mention the mixed case URL anymore so something had to be up.

              I looked, and the search output files had a modification date from several days ago. So that was the problem. I was getting new files like zoom_pagedata.tmp and similar ".tmp" files in the directory, but nothing was renamed to .zdat. So I went ahead and deleted the .zdat and .tmp files, then respidered everything.

              Now I've got another set of .tmp files, but no .zdat files. So that's weird. Nothing in the log to indicate an error.

              But also, I don't have the zoom_dictionary or zoom_wordmap in either tmp or zdat flavors being written anywhere. So just copying *.tmp to *.zdat doesn't work. Worse, my search is completely offline right now.

              Tried chmod 777 on the directory where I'm writing the files, also tried changing the output directory in the zcfg to point to a different temp directory, and always the same results (pageinfo, pagetext, pagedata but no dictionary or wordmap.)

              Please help!




              • #8
                Still fighting my way through this... I've tried a dozen times now to get a complete set of index files out, and all the indexer seems to write is the three tmp files. I still can't get a dictionary or wordmap file out, and I can't find any errors anywhere.

                Any suggestions? Is there something I can post here to help diagnose?

                Trying to use the search function on the site yields a "Zoom files missing" error right now, and I'm trying hard to not have to roll this off the site... I've got a search form in the header on every page...


                • #9
                  So, a complete removal and reinstall of Zoom seemed to get things running again. I left everything configured on the output side exactly the way it was default installed, so the files were generated into /tmp (including the dictionary and wordmap files this time) and I manually moved them to my web server's root directory... Search results back online and no duplicate URLs so I think I'm good to go. Thanks again for the pointers getting the redirects cleared up...


                  • #10
                    Glad to hear you got it fixed in the end. Sorry I was away for a few days.

                    Some pointers:
                    - There should have been log messages when your indexing sessions were finishing without .zdat files. Make sure you have enabled the display of ALL message types (including Warnings, Errors, File IO). Most likely it had trouble renaming the .tmp files to the .zdat files at completion because the web server was possibly holding the .zdat files open. Or you had multiple indexing sessions going that hadn't terminated properly. Check your "ps" process list in Linux. It's possible that something went wrong with an early session and an orphan "ZoomEngine" executable was still running, making it impossible for the current session to write to the files.
                    - Always check for indexing errors. Zoom shouldn't interfere with the .zdat files in the output folder until it is complete, so unless another application is holding onto the files, it should be fine. But certainly it is safer to have Zoom write to a temporary folder somewhere else and manually move the files over. You could at least do this for a little bit to suss out what the original cause of the problem was.
                    - On Windows, it is common for anti-virus applications to interfere with writing to the .zdat files. Since you're on Linux, this is a lot less common. But if you do have any programs like anti-virus / security applications, take a look at those.
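
                    As a quick way to spot an indexing run that never completed, here is a small Python sketch. It assumes each .tmp file is renamed to a .zdat file with the same base name (e.g. zoom_pagedata.tmp to zoom_pagedata.zdat), as the file names earlier in this thread suggest; `stale_tmp_files` is a hypothetical helper, not part of Zoom.

```python
from pathlib import Path

def stale_tmp_files(index_dir):
    """Return the names of .tmp files that have no .zdat counterpart,
    or whose .zdat counterpart is older than the .tmp file."""
    stale = []
    for tmp in Path(index_dir).glob("*.tmp"):
        zdat = tmp.with_suffix(".zdat")  # assumed naming: same base name
        if not zdat.exists() or zdat.stat().st_mtime < tmp.stat().st_mtime:
            stale.append(tmp.name)
    return sorted(stale)

# Example call (the directory is whatever output folder your zcfg points to):
# print(stale_tmp_files("/tmp"))
```

                    Any file names it returns point to a session that wrote its temporary files but did not finish the rename step.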
                    --Ray
                    Wrensoft Web Software
                    Sydney, Australia
                    Zoom Search Engine
