This is another short update on one aspect of the development of Zoom V5: a feature rather cryptically known as 'Charset by page'.
V5 of the indexer will use the charset specified in the page's meta tag or the HTTP header sent by the server when indexing files. Previously, the indexer always expected content to be delivered in the single charset specified in the Zoom configuration window (one charset per session).
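As a rough illustration of per-page charset detection, here is a minimal Python sketch. It is not Zoom's actual code: the function names, the fallback value, and the choice to check the HTTP header before the meta tag are all assumptions made for the example.

```python
import re
from email.message import Message

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:\-]+)""",
    re.IGNORECASE,
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def charset_from_meta(raw_html):
    """Return the charset declared in a meta tag near the top of the page, or None."""
    match = META_CHARSET_RE.search(raw_html[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def detect_charset(raw_html, content_type=None, fallback="windows-1252"):
    """Hypothetical precedence: HTTP header, then meta tag, then a fallback."""
    return (
        charset_from_header(content_type)
        or charset_from_meta(raw_html)
        or fallback
    )
```

When a file is read from disk there is no HTTP header, so `content_type` stays `None` and only the meta tag is consulted.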
This means that you can now index web pages (or websites) that employ different charsets or encodings. The indexed content is then converted to the encoding selected in the configuration window, and your search page uses that same encoding.
In V4.2 it was possible to have a set of index files that spanned multiple languages, but only if all the web sites used UTF-8 or otherwise shared the same character set. In V5 it will be possible to index, for example, some pages in UTF-8, some pages in English 1252, some pages in ISO-8859-5 Cyrillic, and have them all combined into the same set of index files.
So this is a significant enhancement in multi-language web site support.
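To make the conversion step concrete, here is a small Python sketch of the general idea: decode each page with its own charset, then re-encode everything to one common index encoding. The function name, the sample pages, and the choice of UTF-8 as the target are assumptions for illustration, not a description of how the Zoom indexer is implemented.

```python
def to_index_encoding(raw, source_charset, index_encoding="utf-8"):
    """Decode bytes using the page's own charset, then re-encode to the
    single encoding chosen for the index (hypothetical helper)."""
    text = raw.decode(source_charset, errors="replace")
    # Characters the target encoding cannot represent are replaced with "?".
    return text.encode(index_encoding, errors="replace")


# Three sample pages, each stored in a different charset.
pages = [
    ("привет".encode("iso-8859-5"), "iso-8859-5"),
    ("café".encode("windows-1252"), "windows-1252"),
    ("日本語".encode("utf-8"), "utf-8"),
]

# After conversion, every page is in the same encoding and can share one index.
converted = [to_index_encoding(raw, charset) for raw, charset in pages]
```

In this sketch the target encoding has to be able to represent every character you want searchable. UTF-8 covers all of the samples above, whereas a single-byte target such as ISO-8859-5 would lose the accented "é".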
Note: We offer free upgrades for 6 months after a purchase, so if you purchase V4 now, there will be a free upgrade to V5 when it becomes available.
This all sounds excellent! This all would apply to crawling the pages offline too, I would assume, correct? You said, "V5 of the indexer will use the charset specified in the page's meta tag or the HTTP header sent by the server when indexing files." Will this feature be dependent on crawling the site off the server?
This will apply to files scanned in Offline Mode as well. So yes, you will be able to scan pages of varying charset/encoding (as specified by their meta tags) in offline mode, and have the content correctly indexed and searchable.
Wrensoft Web Software
Zoom Search Engine