"View as HTML" uploaded documents

  • "View as HTML" uploaded documents

    Challenge: Grab text in an uploaded document and auto-post to an HTML template.

    Google and other online search services have an option to "View as HTML" the content of trawled documents (DOC, RTF, DOCX, PDF, etc.). Does Wrensoft plan to add this feature to the Zoom Search Engine results page? I know Zoom already does a nice job indexing various document formats.

    Here's my challenge. I'd like users to be able to upload a document to a dedicated site where the textual content is automatically parsed and posted to an HTML page. The goal is for the resulting HTML page to be found and indexed by internet web crawlers and bots (not just the Zoom Search Engine). It must be a no-brainer for the person who submits the document. No cut-and-paste. No monitoring of the resulting HTML pages. It all gets done automatically and reliably.

    I realize the problems inherent in some documents, which may have difficult-to-read text (custom encoding) or rasterized type. I'm just looking for ideas on how to get started with the task of parsing text and slapping it on an HTML page. Is there something off-the-shelf we can use? I would really like to be able to hook this into the Zoom Search Engine.
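To make the "slap it on an HTML page" step concrete, here is a minimal Python sketch. It assumes the text has already been pulled out of the uploaded document by some other tool, and it is not a Zoom feature. The template and the function name `text_to_html` are invented for illustration.

```python
import html
from string import Template

# Hypothetical bare-bones page template; a real site would use its own.
PAGE = Template("""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>$title</title>
</head>
<body>
<h1>$title</h1>
$body
</body>
</html>
""")


def text_to_html(title: str, text: str) -> str:
    """Wrap already-extracted plain text in a simple HTML page."""
    # Treat blank lines as paragraph breaks.
    paragraphs = text.replace("\r\n", "\n").split("\n\n")
    body = "\n".join(
        # Collapse internal whitespace, then escape &, <, > and quotes.
        f"<p>{html.escape(' '.join(p.split()))}</p>"
        for p in paragraphs
        if p.strip()
    )
    return PAGE.substitute(title=html.escape(title), body=body)
```

Escaping matters here because uploaded documents can contain characters such as `<` and `&` that would otherwise break the generated page.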

  • #2
    Since Google and all the major engines already index PDFs and other file formats (just as Zoom does), what is the purpose of making a new HTML version of the page?

    I don't think the document format has much impact on how Google ranks pages. Other factors, like incoming links, have a much greater impact. So I don't understand the goal of the project.

    • #3
      After chatting a bit more with my potential client, it appears he wants to build an intranet application and host it on a dedicated server. So Google, Live.com, Yahoo Search and other major internet search engines are out of the loop.

      The content of thousands of documents needs to be parsed and placed onto easily accessible HTML pages. The content will be indexed by the dedicated server, so intranet users can do keyword searches.

      So now, it looks like Wrensoft does have an opportunity here.

      • #4
        I still don't understand the goal. Why do you need the HTML page? Zoom can index the text in the PDF files, and the user can view PDF files.

        So why go to the trouble to duplicate all the content, especially when the HTML conversion will screw up all the formatting in PDF and similar files?

        • #5
          I am just telling you what the client said: They want to "parse the text" so it appears in plain HTML text format (no columns, pictures, etc. -- just paragraphs of copy). Maybe they want to be able to easily copy and paste sections of copy to new documents. Think of a researcher, student, lawyer or journalist. You know how difficult it can be sometimes to copy text from a PDF, especially as hard line returns are not stripped out. And Word docs are also sometimes problematic.
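The hard-line-return problem mentioned above can be handled with a small clean-up pass. This is only a sketch in Python, assuming the extracted text uses blank lines between paragraphs. The function name `unwrap_hard_returns` is made up, and the hyphen rule is a heuristic that can wrongly merge genuine hyphenated compounds that happen to fall at a line end.

```python
import re


def unwrap_hard_returns(text: str) -> str:
    """Join hard-wrapped lines into paragraphs separated by one blank line."""
    # Normalize Windows and old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Heuristic: rejoin a word split by a hyphen at a line end
    # when the next line starts with a lowercase letter.
    text = re.sub(r"(\w)-\n(?=[a-z])", r"\1", text)
    # Blank lines (possibly containing spaces) mark paragraph breaks.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse all remaining whitespace, including single newlines.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```

Text cleaned this way pastes into a new document as flowing paragraphs instead of short fixed-width lines.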

          Maybe they want to make the HTML text available for PDAs or webbooks or iPod touches or iPhones. Maybe they want to free themselves from dependence on Word or Adobe Acrobat.

          I know from experience that there are news services which do plain text captures of media articles. Like BurrellesLuce. And the results are sometimes mangled, as you suggested. What they do provide is the article in both plain HTML text as well as a raster image PDF of the original print article. The mangled text is a result of OCR.

          I keep asking the same questions you are and have not gotten a clear answer. I am just trying to explore the client request, as stated.

          Question: Does Zoom Search index Microsoft DOCX files?

          • #6
            Zoom does text extraction from PDFs and other documents. But the loss of formatting means that the document will at best be a solid block of text (i.e. no better than if you did a copy/paste directly from the PDF). Zoom also doesn't keep these converted files. It deletes them after the document is indexed.

            Yes, Zoom will index DOCX files.
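Since Zoom does not keep its converted files, anyone who wants the plain text for HTML pages would need a separate extraction step. For DOCX specifically, here is a rough Python sketch using only the standard library. It relies on the fact that a DOCX file is a ZIP archive whose main body lives in `word/document.xml`. It is not how Zoom does its extraction, and the function name `docx_paragraphs` is invented for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the non-empty paragraphs of a DOCX file as plain strings.

    `path` may be a file name or a binary file-like object.
    Headers, footers, footnotes and images are ignored, and paragraphs
    nested inside other paragraphs (e.g. text boxes) may appear twice.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(f"{W_NS}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

The result is a list of paragraph strings with all formatting lost, which matches the "solid block of text" caveat above.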
