Indexing options for picture based pdf

  • Indexing options for picture based pdf

    Hi,

    I have Zoom Search 7 Enterprise Edition installed on a website that contains more than 10,000 PDF files. Nearly all of them are picture-based PDFs.

    When I used to index the website, Zoom Search would download all the PDFs to index them, which was not practical because of their huge total size. In addition, Zoom Search does not index PDF bookmarks, as Wrensoft advised me before.

    This led me to put the "zoom Stop" and "Zoom start" commands around each PDF link to stop Zoom Search from downloading the file. Now the books appear in the search results based on their name as given in the HTML, rather than the file name.


    My question is: these PDF files cover various topics that cannot be represented in the file name.
    Since the PDFs are picture based (OCR is not practical because of language constraints),
    is there any way around this, such as specifying keywords (somewhere) that let the search engine know that a certain book contains a certain word?

    Currently Zoom Search works perfectly for me as a "book finder", but only if you know the name of the book or part of it. However, I cannot find a way to add keywords so that the search engine can tell that a search query, although not mentioned explicitly in the book name in the HTML, is actually found inside a certain PDF.

    A sample page from the website:

    arabic.coptic-treasures.com/canon/canon.php

    Thank you very much.
    Last edited by atef; Oct-13-2014, 11:37 PM. Reason: adding url

  • #2
    Have a look at the .desc file functionality in the User's Guide (around page 97)
    http://www.wrensoft.com/zoom/usersguide.html
    It allows you to add text to a PDF file. Obviously you'll need to allow the indexing of PDF files for the .desc files to be picked up.

    I had a look at your site, and some of the PDF files seem fairly big. From an efficiency point of view, offline mode might be better than spider mode for indexing the files.


    • #3
      Thanks a lot,
      .desc by itself won't solve the problem, but combining it with offline mode might (because of the file sizes, as you mentioned).

      As for the .desc file, I can't find a spec for it. Is it a UTF-8 text file? Can it contain Arabic words?
      Also, the example in the manual is for a description.
      So to add keywords, should I use, for example:

      <meta name="keywords" content="keyword1, keyword2, keyword3">

      or what is the best way?

      **Edit**: Can this method work for MP3 and WMA files too?

      cheers


      • #4
        If you have selected UTF-8 in Zoom, then yes, you can use UTF-8.

        Yes, you can use .desc files for MP3 and WMA files.


        • #5
          Thanks again,

          But as for the proper format, is it:
          <meta name="keywords" content="keyword1, keyword2, keyword3">

          or
          <meta name="keywords" content="keyword1 keyword2 keyword3">

          or other?

          thanks


          • #6
            With or without commas. Either way is fine.
            --Ray
            Wrensoft Web Software
            Sydney, Australia
            Zoom Search Engine
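
For thousands of files, the .desc files discussed above could be generated by a script instead of by hand. The following is only a minimal sketch in Python, under two assumptions that should be checked against the User's Guide: that the .desc file sits next to the document and is named by appending ".desc" to the full file name (for example "book1.pdf.desc"), and that a single meta keywords line, as confirmed in this thread, is enough content. The function name `write_desc_files` and the keyword mapping are hypothetical, not part of Zoom.

```python
import html
from pathlib import Path


def write_desc_files(root, keywords_by_file):
    """Write a UTF-8 '<name>.desc' file next to each listed document.

    root: folder that holds the documents (PDF, MP3, WMA, ...).
    keywords_by_file: dict mapping a relative file name to a list of keywords.
    Assumption: Zoom picks up '<file name>.desc' beside the original file.
    """
    written = []
    for rel_name, keywords in keywords_by_file.items():
        target = Path(root) / rel_name
        # Append ".desc" to the whole file name, keeping the ".pdf" part.
        desc_path = target.with_name(target.name + ".desc")
        # Escape quotes and angle brackets so the attribute stays valid.
        content_attr = html.escape(", ".join(keywords), quote=True)
        line = '<meta name="keywords" content="{}">\n'.format(content_attr)
        # UTF-8 so that Arabic keywords survive; Zoom must be set to UTF-8 too.
        desc_path.write_text(line, encoding="utf-8")
        written.append(desc_path)
    return written
```

Calling `write_desc_files("books", {"book1.pdf": ["قانون", "canon law"]})` would create `books/book1.pdf.desc` containing one meta keywords line. Commas are used as the separator here, although per the reply above they are optional.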


            • #7
              The remaining question for me is:

              Does Zoom Search work with a WAMP server (PHP/Apache on Windows)?

              My website is PHP based, and files are served through a PHP download script.

              Offline indexing plus .desc files would solve all my problems. The only remaining problem is how to get Zoom Search to index the files offline when they are served through the PHP download script.


              My website is currently 90 GB and is expected to reach 1 TB in the near future. Indexing it online is not practical, which is why I excluded the PDF links and index the web pages only. But this means that people cannot search for topics inside the PDF books.


              • #8
                If you are using offline mode then (of course) PHP on your server won't be executed.

                Using the URL rewrite function is an option, however.
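
To make the rewrite idea concrete: in offline mode the indexer sees local file paths, while visitors must be sent to the download script. The Python sketch below only illustrates that mapping. It does not reproduce how Zoom's own rewrite option is configured, and the script name `download.php`, the `file` query parameter, the `example.com` host, and the function name are all hypothetical stand-ins for whatever the real site uses.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def local_path_to_download_url(local_path, local_root, script_url, param="file"):
    """Map a file under the local (WAMP) document folder to a download-script URL.

    Assumption: the download script takes the path relative to local_root
    in a single query parameter. Adjust to the real script's interface.
    """
    # Path of the file relative to the indexed folder, e.g. canon\book 1.pdf
    rel = PureWindowsPath(local_path).relative_to(PureWindowsPath(local_root))
    # Forward slashes and percent-encoding make the path safe inside a URL.
    return "{}?{}={}".format(script_url, param, quote(rel.as_posix()))
```

For example, `local_path_to_download_url(r"C:\wamp\www\books\canon\book 1.pdf", r"C:\wamp\www\books", "http://example.com/download.php")` returns `http://example.com/download.php?file=canon/book%201.pdf`. A find-and-replace style rewrite would need to produce the same kind of result for every indexed file.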
