Announcement

**vidbo** · Oct-09-2010, 09:17 PM

Video search! As in, the ability to search video, and play a little clip or whatnot on mouseover...

**6StringGeek** · Oct-10-2010, 06:28 PM

I would like to see jump to match and highlighting work in conjunction with stemming...unless it already does...?

If I search for "monster" and the results only happen to contain a page with the word "monsters", the jump to match highlight doesn't function. I know this issue has been addressed in older posts...so if the status is the same then just consider it a wish list request for the new version.

It would also be nice if it worked in the JS version. I have a need for that right now actually.
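
To illustrate the request (this is only a sketch, not how Zoom handles highlighting), a highlighter can compare word stems instead of exact strings. The `naive_stem` function here is a hypothetical stand-in for a real stemming algorithm:

```python
import re

def naive_stem(word: str) -> str:
    """Very crude stemmer: lowercase, then strip one common English suffix."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def highlight(text: str, query: str) -> str:
    """Wrap every word whose stem equals the query's stem in <mark> tags."""
    target = naive_stem(query)

    def mark(match: re.Match) -> str:
        word = match.group(0)
        return f"<mark>{word}</mark>" if naive_stem(word) == target else word

    return re.sub(r"[A-Za-z]+", mark, text)
```

With this, a search for "monster" would also highlight "monsters" on the result page, which is the behaviour being asked for.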

**TKY** · Oct-13-2010, 03:43 AM

I need Incremental indexing for offline mode.
Please consider.

**Ray** · Oct-13-2010, 05:46 AM

Originally posted by TKY

I need Incremental indexing for offline mode.
Please consider.

How many files are you indexing?

This is an unlikely requirement in Offline Mode because there is no bottleneck on network download speeds as with Spider Mode. You index as fast as you can read off the disk.

And to do incremental, Zoom has to re-read in all the data from the existing index files (and prepare more memory to index stuff to add to it), and that can take some time. Usually it works out to be pretty comparable to just indexing the actual files off the disk again, so often there's very little to gain in this situation.

UPDATE (20/Oct): Perhaps I need to word this more clearly. We actually already support the ability to perform "Incremental Add" in Offline Mode ("Indexing"->"Incremental indexing"->"Add list of new..." or "Add start points...") in the current version.

That means you can add new files to an existing index created in Offline Mode. What you can't do is "Incremental Update" in Offline Mode, because the benefit is negligible - it usually takes about as much time to check whether an offline file has changed (+ the overhead in reloading the existing index) as it does to re-index the files. This is different to the situation in Spider Mode.
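
For illustration only, the per-file check that an "Incremental Update" would need looks roughly like the sketch below. The `previous` manifest (path mapped to size and modification time) is an assumption made for the example and is not Zoom's index format:

```python
import os

def changed_files(paths, previous):
    """Return the paths whose (size, mtime) differ from the recorded manifest.

    `previous` maps path -> (size, mtime_ns); paths missing from it count as changed.
    """
    changed = []
    for path in paths:
        st = os.stat(path)  # one filesystem call per file, changed or not
        if previous.get(path) != (st.st_size, st.st_mtime_ns):
            changed.append(path)
    return changed
```

Every file still has to be visited on disk, and the existing index has to be reloaded first, which is the overhead described above.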

**teedoff** · Oct-13-2010, 03:57 PM

Originally posted by Vernon

The feature that would make Zoom a dream for content management driven web sites is the ability to read direct from database tables using SQL queries.

It would then make Zoom a contender against using Index Server for SQL Server.

Our need is to index all the content held in a database. This content is rendered into HTML when a particular page is requested by a user on the web site. Standard CMS stuff.

To get this content indexed at the moment, we load into Zoom's configuration a no-follow URL for every record in the database we want to index. When called, this URL renders the page into simple HTML (via the CMS) and then Zoom adds it to its index. The templates are simple: we strip out complex coding and images used to render the page when a visitor requests it from the web site.

It would be so much quicker if Zoom could define a database connection string and thereby issue an SQL query; the configuration would need an option to weight the fields. Alternatively (if this would make the process easier to add), the configuration could hold a reference to an HTML template so that an HTML page could be built before being parsed in the same way as now.
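
A minimal sketch of the template idea, assuming a query that returns `id`, `title` and `body` columns (these names, the template, and the use of SQLite are all hypothetical, chosen only to keep the example self-contained; this is not an existing Zoom feature):

```python
import sqlite3
from html import escape

# Hypothetical minimal template; a real one would come from the configuration.
TEMPLATE = (
    "<html><head><title>{title}</title></head>"
    "<body><h1>{title}</h1>{body}</body></html>"
)

def render_pages(conn, query):
    """Run `query` (expected to return id, title, body) and yield (id, html) pairs."""
    for row_id, title, body in conn.execute(query):
        # Escape the field values so stray markup in the data cannot break the page.
        yield row_id, TEMPLATE.format(title=escape(title), body=escape(body))
```

Each generated page could then be parsed exactly like a page fetched over HTTP, without the CMS having to serve a request per record.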

I don't know, but I suspect one of the limits on Zoom's processing speed is determined by the number of simultaneous HTTP threads that can be sustained, and creating the page more directly by an SQL query would improve the speed of chewing through 100,000 or so database records. It would also not add so much load to the site itself. Adjusting the SQL query used could easily accommodate new pages recently added etc.

If there is any chance of this feature, I'd be delighted to propose more detail for consideration. We don't use an open source CMS (we have built our own), but I can see that options to read WordPress, Joomla et al could be added.

We'd love to abandon all our straight database searches, remove any use of Index Server, and just rely on Zoom.

Thanks.

Vernon

I second that! Database indexing would be a very useful tool. My current site is written in ColdFusion, so I would like support for .cfm and .cfc files as well.

**David** · Oct-13-2010, 06:52 PM

I would like support for .cfm and .cfc files as well

They are already supported in the current release.

For database indexing Ray's comments above need to be addressed.

There is also this FAQ for how to do it in the current release.
Q. How can I use Zoom to index my site stored in a database (eg. SQL, Access, etc.)?

**teedoff** · Oct-13-2010, 07:39 PM

Ah well, maybe I didn't explain myself well enough. I see I can index a ColdFusion site, but unlike PHP, there is no search.cfm page automatically generated by Zoom. Thanks!

**Ray** · Oct-14-2010, 12:03 AM

Originally posted by teedoff

Ah well, maybe I didn't explain myself well enough. I see I can index a ColdFusion site, but unlike PHP, there is no search.cfm page automatically generated by Zoom. Thanks!

Please see this FAQ:
Q. How do I create a ColdFusion (CF) search page?

**Monday** · Oct-15-2010, 01:35 PM

Maybe this could be for a new plug-in but the possibility to OCR text out of an image and then be able to search on it could be pretty nice. May need to bring in a 3rd party OCR engine?

**David** · Oct-15-2010, 07:36 PM

OCR might be nice, but it might not make sense from an efficiency point of view.

OCR is slow, resource intensive and often needs human intervention to correct the text.

If you are indexing a site daily, then you don't want to repeat this OCR job for every file each day. It would take forever. I guess the text extracted could be saved into a file, but it isn't possible to write a file to a remote web server.

In my mind it would make more sense to OCR all the documents in advance with a 3rd party tool, e.g. turn them into PDFs with a text layer, or re-write the image with the text included as image meta data.

**Monday** · Oct-18-2010, 04:00 PM

OCR would be slow, but you already have an option to "use CRC skip files" so as long as they are not changing there would be no need to re-OCR them.

**David** · Oct-18-2010, 07:20 PM

The CRC function that is currently in the software is used to detect and remove identical files. It is not used to determine if a file is new and should be re-indexed during incremental indexing. The reason for this is that if you are doing incremental indexing you really need to avoid downloading the entire file again. This is the purpose of incremental indexing. But you would need the whole file if a CRC calculation were to be done.
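
As a rough sketch of CRC-based duplicate detection in general (not Zoom's actual code; `group_duplicates` and its input format are made up for the example), note that the checksum can only be computed once the complete content is in hand:

```python
import zlib
from collections import defaultdict

def group_duplicates(documents):
    """Group document names by the CRC-32 of their full content.

    `documents` maps name -> bytes. Only groups with more than one
    member (likely duplicates) are returned.
    """
    by_crc = defaultdict(list)
    for name, content in documents.items():
        # The whole body is required here; a partial download is not enough.
        by_crc[zlib.crc32(content)].append(name)
    return [names for names in by_crc.values() if len(names) > 1]
```

A matching CRC-32 only suggests identical content, since different files can collide, so a careful implementation would confirm with a byte-for-byte comparison.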

Also, how would the Zoom software know which files required OCR? You certainly wouldn't want to do this on every image file the spider encountered. For example, there is no point trying to OCR a typical photograph.

I agree we could, technically speaking, include an OCR package in the product, but it would be a fair amount of work, and it would probably do a worse job than the existing tools already out there.

**chrave43** · Nov-08-2010, 10:04 PM

Opposite of a skip list

In version 7 I would like the ability to have something similar to a skip list whereby Zoom would only include files in an index where the name matches a certain word. For example, I have an e-commerce site with 100 standalone pages and 4000 dynamically generated pages that are mysite.com/product-detail.asp?id=1 etc... I'd like to just index pages called product-detail.asp and ignore all other pages. Currently I have to skip 99 of the pages and then use content filtering to follow the links in the sitemap but then ignore them.
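
The behaviour being asked for amounts to a positive filter on the URL path. A minimal sketch, with `include_only` being a hypothetical helper and not a Zoom option:

```python
from urllib.parse import urlsplit

def include_only(urls, required):
    """Keep only the URLs whose path contains the `required` substring."""
    return [u for u in urls if required in urlsplit(u).path]
```

For example, `include_only(urls, "product-detail.asp")` would keep `http://mysite.com/product-detail.asp?id=1` and drop a page such as `http://mysite.com/about.asp`.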

**David** · Nov-08-2010, 10:15 PM

Opposite of a skip list

Some options are already in the existing software release.

- You can provide Zoom with a precise list of start point URLs.
- You can use the page and folder skip list to skip unwanted URLs.
- You can use a content filter to specify that every page must contain a certain word in the content before a page is indexed.
- You can use the robots.txt file to skip files.
- You can use the noindex robots meta tag on pages you don't want indexed.
- You can use the BaseURL to limit indexing to a certain base URL (e.g. a subfolder).

**Ray** · Nov-09-2010, 01:07 AM

I can understand how that might help your particular scenario, but we're a little concerned as to how user friendly the idea of an "Opposite skip list" is. It's a little hard to explain to some users (double negatives are always a bad idea), and it can lead to confused users using it incorrectly (and wondering why things don't work as they expected).

As mentioned above, the Content Filter can work as a POSITIVE filter, and that should address this. I gave more details in your other thread here.

What would you like to see in Zoom V7 - Please post your suggestions
