New features in V6
After almost 2 years of development we are really pleased to make available the V6 beta release for public testing. [Update, 17/Dec/2008: We are out of beta testing and the final V6 release is now available]
This major new release has dozens of new features and hundreds of smaller changes, a large proportion of which were implemented as a result of feedback from the user community. We estimate that there are now over 100,000 web sites and CDs using Zoom Search.
We are really proud of this new release. Working on it has been like creating a work of art, and we sincerely hope you'll enjoy using it as much as we enjoyed creating it.
Here is a summary of the major new features:
- New User Interface: We have given the UI a complete overhaul, replacing it with what we hope is a more intuitive and flexible design. Here are some screenshots of the new interface. The new layout also gives us much more space to add new configuration options.
- Log window with real-time filtering: You can now filter index log messages on the fly which should be helpful when tracking down error messages, skip reasons, and broken links.
- New Search Ranking Algorithm: The ranking algorithm and index structure have gone through some major improvements. This should produce much more relevant results than before and also allow you greater control of your search results. This includes preference given to words that are found closer together on the same page, and more.
- Faster and more accurate exact phrase searches: Search times for exact phrase matches can improve by anywhere from 5% to 60%, depending on the query and content. You can also now perform exact phrase matching on synonyms, accent insensitive words and stem-matched words (see stemming below).
- Stemming: This allows searches to match words which are similar to, or derivatives of, each other. This addresses an often-requested feature: matching plural forms of words to their singular forms. When this feature is enabled, a search for the word "boat" will match "boats" and "boating". More details on stemming here.
- ASP.NET Server Control: V6 will feature a native ASP.NET platform option. This provides performance similar to the CGI version and is best suited to integrating with ASP.NET web sites.
- Improved highlighting in context descriptions: Words that match because of the stemming, synonym, or accent insensitivity options can now be highlighted in context descriptions.
- Office 2007 plugin: You can now index and search Office 2007 file formats (e.g. DOCX, PPTX, XLSX).
- Improved Vista support: V6 is designed with Vista in mind, and features better compatibility with UAC (User Account Control), folder permissions, etc. Note that we fully support Vista in the existing V5 of the software, but these V6 changes remove a few Vista quirks.
- Advanced Template options: While your existing search templates from V5 will work in V6, we now also support advanced options so you can customize the appearance of your search results without having to modify the scripting. This includes the ability to repeat or omit certain elements (such as the "Results: 1 2 3 ... " links)
- Custom Meta Fields: Specify arbitrary meta fields to be indexed and made searchable. For example, on a real estate website you can index and search by "Number of bedrooms", "Suburb", "Price", "Property type", etc. This is a big feature, as it effectively means you can build simple custom databases with a multi-criteria search using Zoom without actually having a database. See the Fruits-R-Us demo site for an example of this in action.
- User-specifiable file types: You can specify how each file extension will be handled. For example, you can specify that a .JPZ file be treated as a jpeg file.
- New and improved "Jump to highlighting" script which will be more compatible with other scripts and also exclude highlighting within ZOOMSTOP sections. This can be used to avoid highlighting some sections of your page like the navigation menu.
- MHT plugin for indexing Internet Explorer's MHT web pages.
- Support for optional ZOOMTITLE and ZOOMDESCRIPTION to allow each web page to have custom meta data that is used only by Zoom, and not by other search engines like Google. This is useful for SEO work where you need to optimise pages in different ways for different engines.
- Configurable truncate title length option for super long page titles.
- Support for <!--ZOOM_SHOW_QUERY--> tag to insert the search terms into the title of the search results page (or elsewhere on the page)
- Option to "Open all plugin file formats in a new window" so that you can have HTML files open in the same window, and PDF files open in a new window. In V5, documents either all open in the same window or all open in a new window.
- Spider image maps. The Zoom spider will now crawl image map links.
- New spider option: FOLLOW_ALL (follow all pages to one level for a start point without indexing the start page).
- Boosting/weighting of each start point, so you can boost/deboost entire domains.
- Thumbnail/images for Recommended Links.
- Checks for changes made to configuration, and prompts user to save config before quitting if changes have been made. This helps avoid accidentally losing changes.
- Check Thumbnail Exists: Option to check that a thumbnail image exists on the web server before using the link. This avoids broken links to images that don't exist.
- New, improved method of CRC duplicate page detection: the CRC comparison is now made after stripping out HTML and ZOOMSTOP sections. This means that a page with ads excluded using ZOOMSTOP will now be recognized as a duplicate, despite having different dynamic ads on the page.
- Zoom will now reload the last ZCFG configuration file used by default.
- New PDF scanning method ("Scan text by text layer") to allow for more flexibility when indexing some PDF layouts.
- Toggle option to switch behaviour of accent/diacritic insensitivity to use digraphs or otherwise (e.g. "ö" = "oe", etc.)
- Improved compatibility and tolerance of antivirus software and the Windows Indexing Service. Zoom will now deal with cases where the Windows Indexing Service or other 3rd party software (like antivirus software) is locking Zoom's files.
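To illustrate the CRC duplicate detection described in the list above, here is a minimal Python sketch. It is not Zoom's actual code. It assumes excluded sections are delimited by <!--ZOOMSTOP--> and <!--ZOOMRESTART--> comment markers, strips tags with a simple regular expression, and uses zlib.crc32 as the checksum. The function name content_crc and the sample pages are made up for illustration.

```python
import re
import zlib

# Assumed marker syntax for excluded sections (an assumption of this sketch).
ZOOMSTOP_SECTION = re.compile(
    r"<!--\s*ZOOMSTOP\s*-->.*?<!--\s*ZOOMRESTART\s*-->",
    re.DOTALL | re.IGNORECASE,
)
# Naive tag stripper; a real indexer would use a proper HTML parser.
HTML_TAG = re.compile(r"<[^>]+>")


def content_crc(html):
    """Return a CRC32 of the page text with excluded sections and markup removed."""
    without_excluded = ZOOMSTOP_SECTION.sub(" ", html)
    text_only = HTML_TAG.sub(" ", without_excluded)
    normalized = " ".join(text_only.split())  # collapse runs of whitespace
    return zlib.crc32(normalized.encode("utf-8"))


# Hypothetical pages: same content, different ads inside the excluded section.
page_a = ("<html><body><p>Boat hire prices</p>"
          "<!--ZOOMSTOP--><div>Ad: cheap flights</div><!--ZOOMRESTART-->"
          "</body></html>")
page_b = ("<html><body><p>Boat hire prices</p>"
          "<!--ZOOMSTOP--><div>Ad: car rental</div><!--ZOOMRESTART-->"
          "</body></html>")
page_c = "<html><body><p>Boat sale prices</p></body></html>"

is_duplicate = content_crc(page_a) == content_crc(page_b)
is_different = content_crc(page_a) != content_crc(page_c)
```

Here page_a and page_b compare as duplicates because the only difference between them sits inside the excluded section, while page_c has different body text and gets a different checksum.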
More new features:
- Category results summary (x results found for category A, y results for category B, etc)
- Automatic login for cookie-based authentication: Zoom can attempt to log in to cookie-based login pages (like PHP pages). It will attempt to mimic the form parameters and send an HTTP POST with your login details. This means it can now log in to websites which are not protected with HTTP authentication (which Zoom already supported).
- ZIP file indexing: You can now index the content of ZIP files. Zoom will actually extract the files within ZIP archives and index each one individually.
- Wildcard support for Skip Pages and Disallow: entries in robots.txt files.
- Wildcards for Recommended Links (so you can specify one recommended link that will match multiple search words and queries)
- Option to truncate URL displayed in search results
- More Custom Meta Field options including: "Money" and "Multi-select" data types, and "Partial text matching" search method.
- New weighting option: "Body content" - allows you to raise or lower the preference given to content found within the <body> ... </body> tags of a page. This means you can decrease the weight of the main content of a page, and effectively increase weighting for text found in headings, titles, etc. more than the current weight settings allow.
- Improved "Content Density" weighting to exclude mark up code, which should make this more effective.
- A new PHP script to generate web site search statistics on the server in real time. (Previously stats reports needed to be generated offline)
- Significantly improved the "Additional start URL" window in handling a large number of additional start points. For those of you who are indexing 10,000+ domains, this should be a godsend.
- New Status window features Progress for each thread, and also System Information such as CPU load, memory load, and physical and virtual memory information. Screenshot below:
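As a rough illustration of how wildcard entries such as those for Skip Pages could be matched against URLs, here is a small Python sketch. The exact wildcard rules Zoom uses are not described in this post, so the sketch assumes that "*" matches any run of characters, "?" matches a single character, and a pattern may match anywhere in the URL. The function names and example URLs are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a simple wildcard pattern into a compiled regular expression."""
    # Escape everything first, then turn the escaped wildcards back into regex syntax.
    escaped = re.escape(pattern)
    escaped = escaped.replace(r"\*", ".*").replace(r"\?", ".")
    return re.compile(escaped)


def should_skip(url, skip_patterns):
    """Return True if any skip pattern matches somewhere in the URL."""
    return any(wildcard_to_regex(p).search(url) for p in skip_patterns)


# Hypothetical skip list: all PDFs under /private/, and any print_N.html page.
skip_patterns = ["/private/*.pdf", "print_?.html"]

skipped = should_skip("http://example.com/private/2008/report.pdf", skip_patterns)
kept = should_skip("http://example.com/public/index.html", skip_patterns)
```

Escaping the pattern before expanding the wildcards means characters such as "." in ".pdf" are matched literally and not treated as regex syntax.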
... and of course, more improvements, bug fixes, and performance optimizations than we can fit here.
And a few more in the latest beta release...
PHP and ASP capacity
We have also increased the maximum unique words limit for PHP and ASP from 300,000 to 500,000. This was made possible by the optimizations and improvements we have made for these two scripts in V6.
Search Statistics PHP Script
There is a new script which you can use to generate live statistics on your web server. This script is only available for PHP, and does not generate graphical charts like the "Statistics Report" tool in the Indexer. It will, however, provide concise, up-to-date statistics on your server without needing to download the log file. See the Help file for more information (under the "Advanced Options" chapter).
Other new improvements include:
- Faster offline indexing with large folders
- Improved error message reporting
- Option to index "param" tags in the form of:
<param name="Proprietary.Data" value="Serial#12344451">
- Index ZIP files found within other ZIP files (and the files contained within the recursive ZIP files).
- Maximum plugin password length increased from 20 to 40 characters
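The nested ZIP indexing mentioned in the list above amounts to a recursive walk over archive members. The following Python sketch shows the general idea using the standard zipfile module. It is not how Zoom itself is implemented, and the "!/" separator used to label nested paths, the helper names, and the sample files are all illustrative assumptions.

```python
import io
import zipfile


def iter_zip_members(data, prefix=""):
    """Yield (path, content) for every file in a ZIP, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = prefix + info.filename
            content = archive.read(info)
            if info.filename.lower().endswith(".zip"):
                # Recurse into the nested archive instead of yielding it as one file.
                yield from iter_zip_members(content, prefix=path + "!/")
            else:
                yield path, content


def make_zip(files):
    """Build an in-memory ZIP from a {name: bytes} mapping (test helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


# Hypothetical archive containing a text file and another ZIP.
inner = make_zip({"notes.txt": b"inner document"})
outer = make_zip({"readme.txt": b"outer document", "nested.zip": inner})

members = dict(iter_zip_members(outer))
```

Each leaf file is yielded individually, which mirrors the idea of indexing every document inside an archive as its own entry.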
And one more feature before the final release.
Native 64bit version of the indexer & higher CGI capacity
The Zoom indexer will now be available in both native 32bit and 64bit executables. The 32bit software is fully compatible with, and could always run on, 64bit versions of Windows, but it was limited to using only 2GB of RAM, regardless of how much RAM was actually installed in the machine. This was a Windows 32bit limitation. The native 64bit release of Zoom in V6 allows an almost unlimited amount of RAM to be used, if the RAM is physically available in the machine and you are running a 64bit O/S.
There is no functional difference between the 32bit and 64bit releases. They have an identical set of features. The only difference is in how much RAM they can use. Being able to access more RAM means a higher potential capacity. But as the capacity of the 32bit release was around a million pages, very few people will need to use the 64bit release. So the 64bit release will be made available as part of the Enterprise edition.
Removing the RAM bottleneck in the indexer doesn't, by itself, allow for significantly improved capacity, however. There were other limits in the CGI script and in the index file format, which was effectively 32bit in nature. For example, the index had 32bit file pointers, and on old versions of Linux it was not possible to handle index files larger than 2GB because of operating system limits.
So we have systematically restructured the index file format, and it should now support files of around a terabyte in size, at least in theory. (In practice, things start to get impractical once you get into tens of gigabytes.)
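For readers wondering where a 2GB ceiling comes from, the short Python sketch below shows the arithmetic. It assumes file offsets are stored as signed 32bit integers, which is consistent with the 2GB figure above, but the actual layout of the Zoom index files is not described here and the function names are purely illustrative.

```python
import struct

# Largest value a signed 32bit integer can hold: just under 2 GiB.
MAX_SIGNED_32 = 2**31 - 1


def pack_offset(offset, width_bits=32):
    """Serialize a file offset as a little-endian signed integer of the given width."""
    fmt = "<i" if width_bits == 32 else "<q"
    return struct.pack(fmt, offset)


def fits_in_32bit(offset):
    """Return True if a non-negative offset can be stored in a signed 32bit field."""
    if offset < 0:
        return False
    try:
        pack_offset(offset, 32)
        return True
    except struct.error:
        return False


last_addressable = fits_in_32bit(MAX_SIGNED_32)   # last byte below the 2 GiB mark
first_too_large = fits_in_32bit(MAX_SIGNED_32 + 1)  # one byte past it
wide_offset = pack_offset(3 * 2**30, 64)            # a 3 GiB offset in 8 bytes
```

An offset one byte past the 32bit maximum cannot be stored at all, whereas a wider field holds a 3 GiB offset without trouble.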
Long-time users might be wondering about the CGIs themselves. Are they also 64bit? The answer is no. The CGIs remain as 32bit executables. They don't use enough RAM to justify making a 64bit release, and the 32bit CGIs remain compatible with 64bit Windows and Linux. So there is no benefit to changing them.
So what does this mean for capacity? Capacity in the V4 and V5 releases was effectively limited by O/S and file system limits. Now with V6 the limits are related to how much RAM you have installed and how fast your hardware is. Faster hardware allows a larger capacity, as large data sets can still be indexed and searched in a reasonable time frame. Over the next few weeks we will be posting some benchmarks, but initial in-house testing has shown that search times of a few seconds are still possible with data sets of over 2 million pages on a single machine.
V6.0 is now available, 17/Dec/2008.
So this thread can now be closed. But please feel free to open up new V6 threads as issues arise.