Unique word limits

  • Unique word limits

    I recently bought Zoom Search to use as the search function for my company's website, which hosts a lot of PDF files. Is there any way to have the index go beyond 500,000 unique words while staying with the PHP script? Changing it to CGI would require a lot of extra setup just to get the search working.

    From looking at other similar topics, how can 500,000 unique words even be reached? This is for 300 PDFs of about 10-15 pages each, with the indexer averaging about 3,000 words a page. And this is one of the smaller projects.

    Another project has over 5,000 PDFs that I have to index.

  • #2
    Would it be possible to point us to one of the PDF files so we can take a look at it? The numbers you posted don't make sense for normal documents.

    There are around 50,000 words in common use in the English language. It is hard to imagine how you could find 3,000 new, previously unencountered words on every page of your PDF documents. Normally, once you have indexed 20 or so pages, only 10 to 20 new words are found per page, and the more you index, the slower the rate of increase. After 100 pages you might only find 3 or 4 new words per page.

    I assume you aren't confusing a straight word count with the unique word count?

    500K is a fixed limit. Performance with PHP is already getting pretty poor by that point in any case; usually you would want to switch to CGI just for the better performance.
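    If you want to double-check that you really are hitting 500K unique words, rather than 500K total words, you can run your own count over the extracted text. The following is only a rough sketch, not how Zoom itself tokenizes, and it assumes the pdftotext tool from poppler-utils is installed:

    import re, subprocess, sys

    unique = set()
    total = 0

    # Pass the PDFs on the command line, e.g.  python count_words.py *.pdf
    for path in sys.argv[1:]:
        # Extract plain text with pdftotext; "-" sends the output to stdout
        text = subprocess.run(["pdftotext", path, "-"],
                              capture_output=True, text=True).stdout
        # Crude tokenizer: runs of letters only, lower-cased
        words = re.findall(r"[a-z]+", text.lower())
        total += len(words)
        new = len(set(words) - unique)
        unique.update(words)
        print(f"{path}: {len(words)} words, {new} new unique, {len(unique)} unique so far")

    print(f"TOTAL: {total} words, {len(unique)} unique")

    If the "unique so far" figure keeps climbing by thousands per file, the problem is in the text being extracted, not in the indexer.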

    • #3
      Here are two sites we made a long time ago with similar PDFs. These two don't use Zoom Search, but mine will in the future.

      http://matawan.ididigital.com/index.cgi?level=2&pub_title=The%20Independent

      http://173.12.11.248/

      I agree it is odd; the unique word count maxed out at 500K after only 168 PDFs. Maybe the OCR job is bad? We do the OCR here ourselves with the best OCR software available, since we are a digital imaging company, so the documents usually come out looking good.

      • #4
        I had a look at the documents. There are so many issues it is hard to know where to start.

        Issue 1:
        The documents have very fine print, so there is a lot of text on each page.

        Issue 2:
        The quality of the scanned documents is poor, and this is not helped by the fine print. The resulting OCR is therefore poor as well, and the actual text coming out of the documents is often random garbage. Here is an actual example of the text extracted by Zoom.
        Vecliohl '1'owiuhlp $lft,,1l, 1-VcelioM
        IIRCII nllijlitly lunl Saliirduy, whim ll Town $2't,M>, Holitnlel $1«,B4, Ilowcll
        nklililcil nnd run Into a telegraph pule. $IW', Mniinliipnli $17.91, Miirlhtiro
        Tim necupiintii were not hurl. $l7,'1(i, Miilnwnu $2.1,fi(i, Mlildlctowii
        I'our permms were Blitjlilly hurt I'rl. $23.54i Mllhtnni' flfl.54, Nepliino
        T , , „ , , „ A ., „ „ , . dny when nn niiluniobllo. lirlunulng til ^1.15, Oienn $1(107, Kurllnu $22.11,
        Jamos B. NaRlo Woundeil In Ilatlle. '
        w^. Vuta WiaKticr of KntsiHl.iiris. wlallc Shrewsbury %U.%. Upper I'rrehold
        lying lo ilium a truck gnlng'in the $IMI, Wall %2VH>, Anbury -Park
        ll,,, $,1$.,1H, Attiintle llhihlunils $.1-1.11, Al
        Here is a second example, picked at random.
        widow Klltn Pftllon, of John Fallon of 1'lcnsmu i w John o
        Valley, tiled In Calvary Hospital, The Ili'onx, taut Saturday noon from a eaniplicntlnii nl IIIHCWS, IIKCIIY 74
        The body was taken to her Into houir, 45 Sccottil Street, Kcyport, Ihnt afternoon, mul the fimerul irtvIn St. Joneph'n Church '• •• A 10 o'clock, Hcv, ,•• I ollkintliiK, mid Inlees were nelll Tliesiiay ii""
        and a third example,
        iin Iif llm M M . , " tfrafil'ir, nhlrlt i*tfar tU<m
        ( ' i>nli>i1, altli'iniili |i«l until IIHU ll d l h i i f j l l l t l Itr lr I
        «l Jiilv, IN Hit yt-or W tnir | n | I I V Hum liiliiilfixl atiil flKfify. IhifiHKl r»(tli, mi>( fur iitaltlatftltiiii, alol ttlllmiil itiribti uf a I'll* I ll l hll *tfar tU<mt
        So many, if not most, of the words in the documents are just collections of random letters.

        Issue 3:
        The documents are laid out in narrow columns, and hyphenation at the line breaks is not being dealt with. So even when the OCR is OK, words are being split up. This can be seen in the example below, where words like "advertising" have been split in two (a possible pre-processing workaround is sketched at the end of this post).
        he candidates lor public office thisfall evidently are on a still hunt, op
        patently doing but little personal can
        vassing and using very little advertis
        ing space
        Issue 4:
        The OCR software seems to have sometimes lost track of the columns in the document entirely, so the text from one column runs straight into the next column. Combined with Issue 3, this means the text being indexed is even more scrambled than it should be.

        Issue 5:
        There are a lot of numbers and names of people and places in the documents, which inflates the unique word count as well.

        So I don't think there is any easy solution. Switching to the CGI will let you index more of these random words, but it is going to be a case of garbage in, garbage out, and at the moment a lot of garbage is being fed into the indexer.
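        Regarding Issue 3, if the hyphen is still present at the end of the line in the extracted text, a small pre-processing pass could rejoin the split words before the text is indexed. This is only a sketch, and it assumes you can get at the plain text between OCR and indexing; where the hyphen has already been lost, as in the example above, a dictionary-based check would be needed instead:

        import re, sys

        def rejoin_hyphenated(text):
            # "advertis-\ning space" -> "advertising\n space" (the line break is kept)
            return re.sub(r"(\w+)-[ \t]*\n[ \t]*(\w+)", r"\1\2\n", text)

        if __name__ == "__main__":
            raw = open(sys.argv[1], encoding="utf-8", errors="ignore").read()
            sys.stdout.write(rejoin_hyphenated(raw))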

        • #5
          Hmmm, in our OCR program, whenever it picks up unrecognized characters like the garbage you quoted, we can have that removed entirely so it never reaches the indexer. Nobody is going to find those garbage words when searching for keywords anyway, so if they weren't there to begin with it wouldn't be an issue.

          Thanks for the info. I'll have to look into different settings for our OCR, maybe something like: if it doesn't recognize a proper word, don't include it in the output at all.
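          If the OCR software can't do that filtering itself, a rough post-processing pass over the extracted text could drop tokens that don't look like real words before they reach the indexer. This is only a heuristic sketch and the thresholds are guesses:

          import re, sys

          def strip_garbage(text):
              # Keep only tokens that look like plain words: letters only, sane
              # length, at least one vowel. Single-letter words are dropped too,
              # which rarely matters for search.
              tokens = re.findall(r"[A-Za-z]+", text)
              kept = [t for t in tokens
                      if 2 <= len(t) <= 20 and re.search(r"[aeiouyAEIOUY]", t)]
              return " ".join(kept)

          if __name__ == "__main__":
              raw = open(sys.argv[1], encoding="utf-8", errors="ignore").read()
              print(strip_garbage(raw))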

          • #6
            Certainly removing all the garbage would help a lot, but you might find you have very little text left once the garbage is gone.

            In several of the documents I looked at, the initial scan seemed to be of very poor quality: poor contrast, poor resolution, and pages not lying flat on the scanner, so that only half the page was captured correctly. I am guessing the OCR software didn't have nearly enough data to work with.

            I also think the text might appear clearer in the PDF files if it wasn't converted back to just two colors. Scanning in full color or greyscale would surely make the PDFs more human-readable, and it might help the OCR as well.

            With the super fine text you are scanning, you are going to need a very high resolution scan to give the OCR software enough detail to work with.

            If you have the resources, someone should spell check the documents after you have optimized the OCR process. It is a big job, however.
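            To decide where to spend that effort first, it might be worth scoring each document by the fraction of its extracted words that appear in a word list, and fixing the worst ones first. A rough sketch, assuming pdftotext is installed and a word list such as /usr/share/dict/words is available (both are assumptions about your setup):

            import re, subprocess, sys

            # Any plain word list will do; /usr/share/dict/words is just an example
            WORDS = {w.strip().lower()
                     for w in open("/usr/share/dict/words", encoding="utf-8", errors="ignore")}

            def ocr_quality(pdf_path):
                text = subprocess.run(["pdftotext", pdf_path, "-"],
                                      capture_output=True, text=True).stdout
                tokens = re.findall(r"[A-Za-z]+", text.lower())
                return sum(t in WORDS for t in tokens) / len(tokens) if tokens else 0.0

            # Rank the PDFs given on the command line, worst OCR first
            scores = {path: ocr_quality(path) for path in sys.argv[1:]}
            for path, score in sorted(scores.items(), key=lambda item: item[1]):
                print(f"{score:.0%}  {path}")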
