PDFs and Max Unique Words

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • PDFs and Max Unique Words

    I have recently purchased Zoom Search V5 Pro.

    On my website I have uploaded 8000 OCR'd PDF files (eventually I will have 47,000 OCR'd PDF files). I have indexed them using the CGI feature.

    I am having problems with the Max Unique Words limit. After indexing only 2526 files I have hit the maximum of 500,000 unique words. The average unique words per page is 1285.

    I believe these results are due to the OCR'ing of the PDF documents.

    Is there any way to have the program skip (not index) words that are not found in a dictionary?

    jj

  • #2
    What sort of content do you have in these PDF documents? Are there large amounts of serial numbers, product codes, etc.? These can naturally push up the unique word count, as there can essentially be an infinite number of them.

    The dictionary is created from the words indexed, so it wouldn't make much sense to skip words that are not in the dictionary (then there would be no dictionary!). What you want is to limit the words indexed.

    One way is to add entries in the Word skip list (on the "Skip Options" tab of the configuration window) using the "*" wildcard character. For example, if the content in your PDF documents is mostly product codes in the form of "PCD12344", "PCD3591001", "PCDA191AzZ", ... etc. then you could potentially skip all of them with a skip word entry of "*PCD" (see the Help file for more information on how the wildcard character works here).

    If this isn't possible (the content that you want to skip varies too much for example) then you should click on the "Limits" tab, and check the option to "Limit words per file". From the Help file description of this feature:

    Limit words per file
    This allows you to specify the maximum number of words to index from each file. Once this limit is reached, the indexer will move on to indexing the next file. This can be useful if you are indexing a very large archive of content, and only consider the first 100 words on a page to be useful. Another example is when you are indexing PDF documents, which may contain many pages. Using this feature you can limit the indexing to the words on the first page (with an approximation of 600 words per page for example).
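    The effect of this limit can be sketched in a few lines of Python. This is purely illustrative (it is not Zoom's implementation), and the 600-word figure is just the per-page approximation mentioned above:

```python
# Illustrative sketch only -- not Zoom's implementation.
# A per-file word limit simply means: index the first N words and ignore the rest.
WORD_LIMIT = 600  # roughly one printed page, per the approximation above

def words_to_index(text, limit=WORD_LIMIT):
    """Return only the first `limit` whitespace-separated words of `text`."""
    return text.split()[:limit]

print(len(words_to_index("word " * 1000)))  # 600: everything past the limit is dropped
```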
    --Ray
    Wrensoft Web Software
    Sydney, Australia
    Zoom Search Engine



    • #3
      Ray
      Thank you for your very quick reply.

      However, none of the options you offer will work. The pages I am indexing are OCR'd PDF files of newspaper radio logs (jjonz.us/RadioLogs).

      Here is a sample file. http://www.jjonz.us/RadioLogs/pagesnfiles/logs_files_OCR/CDT/1940s/1940/40_01Jan/[c]40-01-15-(Mon)ocr.pdf

      or you could try indexing
      http://www.jjonz.us/RadioLogs/pagesnfiles/logs_files_OCR/CDT/1940s/1940/40_01Jan/
      This should result in 30 indexed files. All files are PDF. This will show you why the unique words increase so quickly.

      Thanks
      jj



      • #4
        The real problem is that your OCR has failed to pick up much meaningful text from the PDF files, due to the poor quality of the scans and/or original documents.

        For example, this file:
        http://www.jjonz.us/RadioLogs/pagesnfiles/logs_files_OCR/CDT/1940s/1940/40_01Jan/%5bc%5d40-01-01-(Mon)ocr.pdf

        ... is hard to make out as it is with the human eye, and the OCR program you are using struggles even more so. Here is an extract of what your OCR program has picked out from the above page (and stored within the text layer of your PDF document):

        e01i-2: ??tanthird Time.
        c. HIE AtiO 1-11.EQUENCIES.
        A-7,. A
        1
        W3IE1-1010 WJJD-11:30
        11:30 A. M. V.G-NQuIn Ryan. V.MAQReligion and 72}3MIlelen Trent. VINDJoe Alexander.
        New World.
        CBBMBilly and Betty. VMAQBeautlful Life.
        5:00 P.M.
        W WA E-120.:
        N., EN lo.;-0
        WSBCI210
        k,D7)
        IH. .

        This has created a lot of gibberish, which naturally leads to a high number of unique words.

        The effectiveness and accuracy of your search will be limited to what your OCR program has managed to retrieve from these scanned PDFs. If you hope to improve this, you will need to consider a better source (or more sophisticated OCR, although some of these PDFs will prove to be difficult/impossible for any program).

        If there is no alternative source for these documents, you must realize that the end solution will always be a compromise: it is simply impossible to index all of the content, and some of the less legible content will always be excluded.

        Given the above, it is fair to say that the "Limit words per file" option is a practical solution. No, it will not index all of the meaningful text on a page (it will stop after the first x words), but it will keep the total number of unique words to be searched within a reasonable bound.

        There is no functionality in Zoom to automatically determine what is a meaningful word, and what is not. Zoom builds its dictionary from the words given, and presumes that the content is legitimate to begin with.

        For what you wish to achieve, it would make more sense to find an OCR package which will only return recognized words from a dictionary, so that you can create PDF files with more meaningful text layers before indexing.
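        As a rough illustration of that idea, a post-processing step over the OCR output could drop any token whose alphabetic form is not in a word list. This is a sketch only; the tiny word list here is a stand-in for a real dictionary file or spellchecker:

```python
# Illustrative sketch only. KNOWN_WORDS is a tiny stand-in for a real word
# list (e.g. a system dictionary file or a spellchecker); the idea is to
# drop any OCR token whose alphabetic form is not a recognized word.
KNOWN_WORDS = {"religion", "and", "new", "world", "life", "time"}

def filter_recognized(text, dictionary=KNOWN_WORDS):
    """Keep only tokens whose letters form a dictionary word (case-insensitive)."""
    kept = []
    for token in text.split():
        word = "".join(ch for ch in token if ch.isalpha()).lower()
        if word in dictionary:
            kept.append(token)
    return " ".join(kept)

print(filter_recognized("W3IE1-1010 Religion and New World. k,D7) IH. ."))
# Religion and New World.
```

        Applied to the OCR extract above, only "Religion and New World." would survive; everything else would be discarded before indexing.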
        --Ray
        Wrensoft Web Software
        Sydney, Australia
        Zoom Search Engine

