PDF pdftotext Plug-in - using option to index 1st page

  • PDF pdftotext Plug-in - using option to index 1st page

    In the downloaded pdftotext.zip the pdftotext.txt file mentions using the "-l" option to specify the last page to index.
    I would like to implement this during the indexing to speed things up and also because the first page of the PDF files is the cover for the reports so it contains all the words needed for searching.
    I can do this from the command line, for example: "C:\pdftotext -l 1 203465.pdf test.txt". This will create a text file of the first page only.
    But how do I configure Zoom Search to use the -l 1 option on the fly?

    Thanks,
    Chris Bungart

  • #2
    There is no easy way to do this and we have not tested this option. The parameters are hard coded in Zoom.

    I guess if you were really desperate you could write an interface module to translate the command line parameters.

    But I don't think you are going to save a lot of indexing time in any case. You would still need to download the entire document even if you were only indexing the 1st page. Downloading is normally much slower than indexing.
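
The "interface module" idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not something tested with Zoom: it assumes the original converter has been renamed to a hypothetical `pdftotext_real.exe`, that this script has been packaged as an executable under the name the indexer calls, and that whatever arguments the indexer passes can simply be forwarded unchanged. The only pdftotext option used is `-l 1`, the one mentioned in the first post.

```python
import subprocess
import sys

# Hypothetical name for the original converter after renaming it;
# the wrapper would take over the name that the indexer calls.
REAL_PDFTOTEXT = "pdftotext_real.exe"

# Extra options to inject; "-l 1" makes page 1 the last page converted.
EXTRA_OPTIONS = ["-l", "1"]


def build_command(received_args, real_binary=REAL_PDFTOTEXT):
    """Return the command line to run the real converter with.

    pdftotext takes its options before the file names, so the extra
    options go first and the received arguments follow unchanged.
    """
    return [real_binary] + EXTRA_OPTIONS + list(received_args)


def main(argv):
    # argv[0] is the wrapper's own name; the rest came from the caller.
    command = build_command(argv[1:])
    # Run the real converter and hand its exit code back to the caller.
    return subprocess.call(command)

# Entry point when used as a script: sys.exit(main(sys.argv))
```

For example, `build_command(["203465.pdf", "test.txt"])` yields the same arguments as the manual command in the first post, with the renamed binary in front. If the hard-coded parameters already include a page option, the two may conflict, so the arguments actually received should be checked first.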

    ------
    David


    • #3
      >the first page of the PDF files is the cover for the reports so it contains all the words needed for searching <

      What about the content of the reports? You want to index the content so that users can find the words and phrases contained within the document. I think that you should reconsider your strategy.


      • #4
        David,
        If by writing an interface module you mean a modified version of the "pdftotext" plug-in, and if that involves programming, that's beyond my scope.

        Geoffrey,
        The first page is all that is relevant. The rest of the report is long pages of numbers, dimensions, generic geometric trims, graphs - nothing that's really specific or relevant to a search.

        These are big files, 95% of which is generated automatically - raw data. The first page option would be great for..., maybe in a future version.

        This is for an intranet - distributing engineering data (an effort for saving tons of paper and printer supplies).

        Thanks,
        Chris


        • #5
          Chris -

          Sounds like your .pdf files are similar to mine.

          (See my post above.)

          Regards -

          Geoffrey
