ASCII control characters appearing in the titles of the search results


  • ASCII control characters appearing in the titles of the search results

    Hi, I am using Spider mode to crawl a site which predominantly contains links to a lot of files of various types (PDFs, MS Office, Zip). However, after I have uploaded the DB and run a search, most of the titles in the search results contain ASCII control characters, such as NAK, ACK, SO, VT, HT etc. These are displayed as question marks, and/or the spaces between the words of the titles are removed. Any clues as to what might be causing this?

    Second question. At the moment, if I choose any of the options to index the content, I end up with a very large DB, which takes about 30 seconds to return the results. If I deselect the option to index content, the DB is obviously significantly smaller, resulting in a search which takes a fraction of a second to complete. As I see it, the indexing options are global. It would be useful if the indexing options were extended to cater for different file types. Is this something which is planned for a future release?

    Third question. When can we expect the next release?

    Many many thanks, Kind regards, Russ

  • #2
    1) My first guess regarding the strange characters you are seeing is that your web server is serving the file types in question with an incorrect MIME / Content-Type header. You can check this with your browser's Developer Tools (in both Chrome and Firefox): under the "Network" tab, there are options to view/copy the "Response Headers".

    If your web server is returning a Content-Type header of "text/plain" for a .PDF file, then Zoom will obey your web server and treat the file as a text file instead of a PDF file. In doing so, it will index a lot of garbage (thus your large database and slow search) instead of actual content.

    So first confirm if the above is happening, then you should look into configuring your web server to return the proper content-type headers for the file extensions necessary. Ask your web hosting company if this is new to you.

    2) I think the problem with the large database is due to the above, so fixing that would also fix the issue with your index size. Having said that, if you want to configure any file extension to not index the content, you can do so under "Configure"->"Scan options"->"Scan extensions". Here you can add/remove extensions and choose how they are treated (note, however, that the above-mentioned Content-Type header will take priority over the file type specified here). There is a file type here for "Binary file" which, by default, will only index the filename (and not the content). So for example, you can specify that all ".zip" files are indexed by filename only by giving them this file type.

    3) V7.1 is the current release. We don't have any set schedule for the next release. All minor version increments (e.g. V7.x) will be a free upgrade. Major version increments (e.g. V8.0) will be a free upgrade for anyone who purchased the software within 6 months of its release. So you can rest assured you won't be caught with an old version right after purchasing.
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine
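
The header check suggested in reply #2 could also be scripted rather than done by hand in the browser. The following is only a minimal sketch: the function name `check_content_type` and the `EXPECTED_TYPES` table are illustrative assumptions, not part of Zoom, and the table should be extended for the file types actually on your site.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed mapping of file extensions to the MIME types a correctly
# configured server would normally send (illustrative, not exhaustive).
EXPECTED_TYPES = {
    ".pdf": "application/pdf",
    ".zip": "application/zip",
    ".doc": "application/msword",
}


def check_content_type(url, content_type_header):
    """Compare a Content-Type header value with what the URL's extension implies.

    Returns a tuple (ok, expected, actual).
    """
    ext = posixpath.splitext(urlsplit(url).path)[1].lower()
    expected = EXPECTED_TYPES.get(ext)
    # Drop parameters such as "; charset=utf-8" before comparing.
    actual = content_type_header.split(";", 1)[0].strip().lower()
    ok = expected is None or actual == expected
    return ok, expected, actual
```

The header value can be copied from the browser's "Network" tab, or fetched with a HEAD request, for example `urllib.request.urlopen(urllib.request.Request(url, method="HEAD")).headers.get("Content-Type")`. A PDF reported as "text/plain" would come back with `ok` set to `False`.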



    • #3
      Hi Ray,

      Thanks for the prompt reply.


      1) To try and isolate the problem, I copied the site over to a local version of IIS running on Windows 7. I made sure that the MIME type for PDFs was correct, i.e. "application/pdf", and checked the HTTP response headers using Fiddler, and all seemed to be in order. However, I am still getting the problem. To elaborate a little bit, I am configuring the Spidering option for the Start URL as "Follow links only". Because of the performance hit that I am currently experiencing, I have selected the Indexing options "Title of page", "Meta description", "Meta keywords", "Filename" and "Link text". I am not indexing the "Page content". If I do a search after the DB has been built, I notice that random spaces in the filename have been replaced by any of the bottom 32 ASCII characters and sometimes by a question mark. The spaces in the full URL, and in the hyperlink behind the title, are intact, albeit replaced by Hex 20 or %20. I also looked in "zoom_pagedata.zdat" using Notepad++, where I could see that the indexing has replaced some of the spaces in the title. For example, the URL could be:

      http://localhost/international/Conte...dy%20(USA).pdf

      But the title following the bar symbol is:

      Twenty-TenFFandVTBrand?Case Study (USA).pdf

      where "FF", "VT" and "?" represent spaces that have been replaced by Form Feed, Vertical Tab and question mark characters. Is there anything else you can think of that would help me to resolve this issue?

      2) Accepted, but until I have fixed point 1), I won't know for sure.

      3) Noted, and thanks.

      Kind regards,

      Russ
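
To pin down exactly which of the bottom 32 ASCII characters ended up in a title, as described in post #3, a short script can report each one by position and name. This is a minimal sketch; `find_control_chars` is a hypothetical helper, and the title string is assumed to have already been read out of the index data by some other means.

```python
# Standard abbreviations for the 32 ASCII control characters (code points 0-31).
CONTROL_NAMES = [
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US",
]


def find_control_chars(title):
    """Return (index, name) pairs for every ASCII control character in title."""
    return [
        (i, CONTROL_NAMES[ord(ch)])
        for i, ch in enumerate(title)
        if ord(ch) < 32
    ]


# The title from the post, with FF and VT written as real control characters.
title = "Twenty-Ten\x0cand\x0bBrand?Case Study (USA).pdf"
print(find_control_chars(title))
```

The literal question mark is an ordinary printable character, so it is not reported; only the Form Feed and Vertical Tab are.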



      • #4
        Hi Ray,



        Thanks for the prompt reply. Sorry for the delay in responding. I am having problems with your forum rejecting my posts (getting "Missing Human Verification Information"), so here is a rerun of what I tried to send yesterday.



        1) I copied the site over to a local copy of IIS so that I could check MIME types and HTTP response headers. I can confirm that all MIME types are correct, e.g. 'application/pdf' is being used for PDF. I used Fiddler to monitor the session and confirm that the Content-Type in the HTTP response header is also set to 'application/pdf'. I then reindexed the local version of the site and copied the DB over to the local instance. Unfortunately, I still have the problem. Just to reiterate what I was saying in my earlier e-mail: some of the spaces in some of the titles of some (not all) of the search results are being replaced either by one of the bottom 32 ASCII characters or by a question mark. Because the ASCII control characters are non-printable, this means that some words appear run together, or are separated by a question mark. Checking the "zoom_pagedata.zdat" file in Notepad++, I can see that the hyperlink behind the title is unaffected, albeit with spaces replaced by Hex 20 or %20. It is just the title representing the filename that is affected. So for example, the following is an example URL behind a title:

        http://localhost/international/Conte...1%20Lookup.pdf

        But the title is:

        Product?USA2006FFtoVT2011 Lookup.pdf

        where ?, FF and VT represent question mark, Form Feed and Vertical Tab characters that have replaced some of the spaces in the title. This means that when rendered in HTML, the title is displayed as:

        Product?USA2006to2011 Lookup.pdf

        I am running Zoom Indexer in Spidering mode with an ASP script output. The Start URL has been set to "Follow links only" and for the Indexing options I have turned off "Page content" (purely because of the performance problems I am facing).

        2) Noted. This might be the case, but until point 1 has been fixed I am unable to confirm.

        3) Noted and thanks.

        Kind regards and many thanks,

        Russ
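
The comparison made in post #4, an intact URL versus a damaged title, can be reproduced with a few lines of code. This is only a sketch under stated assumptions: the URL below is hypothetical (the real one is truncated in the post), and the helper names `filename_from_url`, `visible_text` and `differing_positions` are made up for illustration.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Decode percent-escapes (%20 is a space) and return the last path segment."""
    return unquote(posixpath.basename(urlsplit(url).path))


def visible_text(title):
    """Drop ASCII control characters, approximating how they vanish on screen."""
    return "".join(ch for ch in title if ord(ch) >= 32 and ord(ch) != 127)


def differing_positions(expected, stored):
    """Return the indexes at which two strings differ (up to the shorter length)."""
    return [i for i, (a, b) in enumerate(zip(expected, stored)) if a != b]


# Hypothetical URL and the stored title with FF and VT as real characters.
url = "http://localhost/docs/Product%20USA2006%20to%202011%20Lookup.pdf"
stored = "Product?USA2006\x0cto\x0b2011 Lookup.pdf"

expected = filename_from_url(url)
print(expected)
print(visible_text(stored))
print(differing_positions(expected, stored))
```

Each position reported is a place where the decoded filename has a space but the stored title has something else, which matches the symptom of only some spaces being replaced.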




        • #5
          Those characters appearing in the Title are unusual; I can't say I've seen that before. What you should do is open one of these PDF files in Adobe Acrobat, look at the Properties (File->Document properties) and see what Title is stored within the PDF file. It is possible the PDF file was created with this title, perhaps by whatever program you used to generate the PDF file in the first place. For example, if you were printing to a "PDF Printer", this could explain the Form Feed and other unusual characters in question. But like I said, I've never seen this.

          If you can't figure this out, e-mail us the PDF file in question, along with the .zcfg file containing your indexer configuration, and we will try to reproduce it here and have a closer look.

          Another note -- if you are not indexing dynamic web pages (e.g. PHP, ASP, etc.) and only static documents like PDF and DOC files, then you might want to consider using Offline Mode. Your search result links will still point to "http://localhost/" if you set your Base URL correctly.
          --Ray
          Wrensoft Web Software
          Sydney, Australia
          Zoom Search Engine



          • #6
            Hi Ray,

            Just to follow up on this one. There was nothing wrong with the PDF files. I got around the problem by switching to Offline Mode, but I now have other issues related to indexing Office files. I have created a new post for this item.

            Many thanks,

            Russ



            • #7
              If the problem was avoided by switching to Offline Mode, this means it has something to do with how the files are served via your web site. For example, if your web site serves the PDF files via a download script (e.g. "download.php?mydoc=1234") then the script would determine what the filename is, and it would declare this in the HTTP header. So any funny characters in the filename are likely generated at that point.
              --Ray
              Wrensoft Web Software
              Sydney, Australia
              Zoom Search Engine
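
Reply #7 mentions a download script declaring the filename in the HTTP header. Assuming the header in question is Content-Disposition (the post does not name it), the declared filename can be extracted and checked for control characters as sketched below. The helper names are hypothetical; `email.message.Message.get_filename()` is part of the Python standard library.

```python
from email.message import Message


def filename_from_disposition(header_value):
    """Extract the filename parameter from a Content-Disposition header value."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    return msg.get_filename()


def has_control_chars(name):
    """True if the name contains any of the bottom 32 ASCII characters."""
    return any(ord(ch) < 32 for ch in name)


# Example header value as a download script might send it (illustrative only).
header = 'attachment; filename="Case Study (USA).pdf"'
name = filename_from_disposition(header)
print(name, has_control_chars(name))
```

If the filename taken from the header is clean, as Fiddler suggested in this thread, the corruption is more likely happening after the response has been received.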



              • #8
                I did check the headers and used Fiddler to look at the packets and all seemed to be in order. I will come back to this after you have looked at the other problems I have posted.

                Many thanks and kind regards,

                Russell



                • #9
                  For anyone experiencing issues similar to those described above: the issue was fixed by the 7.1.1011 release.

                  Thanks Ray,

                  Russ
