File sizes for Binary files



ecoca
01-08-2009, 08:04 PM
I have some directories with PDF files, and I need to display their filenames and file sizes (I don't need the PDF files' content). To do this, I set the option to treat PDF files as binary ones, but the program did not index the sizes (0k is displayed). Is something wrong with my settings?

wrensoft
01-08-2009, 11:32 PM
It might be better to treat them as PDF files, but instead turn off indexing of the document content.

ecoca
01-09-2009, 04:26 AM
I don't see any option to do this (to treat PDF files as PDF files without scanning their contents). Could you guide me? (I bought the latest version.)

wrensoft
01-09-2009, 12:57 PM
In the scan options panel, add the extension .PDF
In the indexing options panel, turn off page content.

ecoca
01-09-2009, 01:09 PM
I'm sorry, but this is not the solution. The program is still opening EVERY PDF file using the PDF plugin. I do not want the program to open the PDF files (they are huge) because the indexing takes an extremely long time to complete.

wrensoft
01-09-2009, 02:44 PM
From your first post I thought the problem was just a display problem, not an indexing speed issue.

It is not always possible to get the file size from an HTTP server before actually downloading the file, and we didn't implement any feature to index files without downloading them.

ecoca
01-09-2009, 03:22 PM
My bad - I forgot to mention that the indexing is done locally !

ecoca
01-09-2009, 03:30 PM
To restate the problem and make everything clear:

I have many folders with many PDF, DOC, EXE, ZIP and CHM files (some files are more than 100 MBytes long). I need to search only filenames within all these directories. The indexing is done locally. I have to display in the search result filenames and sizes. There is no need to search within the files.

I'm using the latest 6.0 professional version.

Can anyone tell me which settings I need in Zoom to obtain the desired result?

Thanks for help !

wrensoft
01-09-2009, 05:05 PM
While it should be possible to do what you are suggesting in V6 (treat .DOC and .PDF files as binary files, like .EXE files), it is not a scenario that we have actually tested.

We can have a look at this next week.

ecoca
01-09-2009, 05:22 PM
This was the only reason I bought the program: to do this very simple task. I hope you will solve the problem in the next version. Thanks!

P.S. If I may, a suggestion for future versions: you could let users choose what to index (title, content, header, meta tags, etc.) separately for every file type, directly in Scan options on the Configuration menu. For example, one might index the content of DOC files but only the filename and size of everything else.

Ray
01-13-2009, 04:27 AM
I have many folders with many PDF, DOC, EXE, ZIP and CHM files (some files are more than 100 MBytes long). I need to search only filenames within all these directories. The indexing is done locally. I have to display in the search result filenames and sizes. There is no need to search within the files.

To do this, you should click on "Scan Options" on the Configure tab and add each of these extensions (remove them first if they are already there) making sure to specify each one with a "File type" of "Binary (Filename only)".

This will allow these files to be indexed and displayed just by their filenames and filesizes.
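
As a rough illustration of what filename-only indexing involves (this is not Zoom's actual implementation), a minimal Python sketch can collect names and sizes from file system metadata alone, without ever opening a file. The function name `index_filenames` and the default extension list are hypothetical, chosen only for this example:

```python
import os

def index_filenames(root, extensions=(".pdf", ".doc", ".exe", ".zip", ".chm")):
    """Return sorted (relative_path, size_in_bytes) pairs for matching files.

    Hypothetical helper: only file system metadata is read, so even very
    large files cost a single stat call each.
    """
    suffixes = tuple(ext.lower() for ext in extensions)
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(suffixes):
                full_path = os.path.join(dirpath, name)
                size = os.path.getsize(full_path)  # metadata only, no content read
                entries.append((os.path.relpath(full_path, root), size))
    return sorted(entries)
```

Because nothing is parsed, the time taken depends on the number of files rather than on their size, which is the behaviour wanted for 100 MB documents.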

ecoca
01-13-2009, 03:34 PM
This is the configuration I tried the first time: starting with a blank cfg and adding only .PDF files as binary ones. It is not working, sorry. Have you tried this with the latest version of Zoom (1005)?

Ray
01-13-2009, 11:19 PM
You're right, it's reporting all binary files to have a filesize of 1kb.

We'll fix this in the next release.

ecoca
01-15-2009, 10:47 AM
Is it possible to estimate a date for this fix?

Ray
01-16-2009, 06:17 AM
It should be available by early next week. We are just accumulating a number of fixes together.

ecoca
01-19-2009, 06:53 AM
The problem was only partially solved in the 1006 release. For example, Zoom reports one file as 11,353,346k, but the file is only 11,353,346 bytes: the number is a byte count shown with a "k" unit. All file sizes are wrong in this way.

I hope you will correct this in the next release!
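
To make the unit mix-up concrete, here is a minimal Python sketch of converting a byte count to kilobytes before appending the "k" suffix. The function name `format_size_kb`, the use of 1 k = 1024 bytes, and rounding up are all assumptions for illustration; how Zoom's scripts actually round is not stated in this thread:

```python
def format_size_kb(size_bytes):
    """Format a byte count as whole kilobytes (assuming 1 k = 1024 bytes).

    Rounding up is an assumption, chosen so a tiny non-empty file
    shows 1k rather than 0k.
    """
    kilobytes = -(-size_bytes // 1024)  # ceiling division on integers
    return f"{kilobytes}k"

print(format_size_kb(11_353_346))  # prints "11088k"
```

Under these assumptions, a script that skips the division would show the raw byte count with a "k" after it, which matches the symptom described above.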

Ray
01-19-2009, 10:46 PM
ecoca - that bug was fixed in V6.0.1005, and I've just double checked against every platform (PHP, ASP, JS, CGI) with the latest build (V6.0.1006) and I do not see the problem.

Are you sure you're not using an old copy of the search script with the new index files? Please make sure to update ALL necessary files when you re-index, which includes "search.php" (if you are using the PHP version).

If you still see the problem, ZIP up your search files and send them to us (http://www.wrensoft.com/contactus.html). But I'm betting it's because you're using old files.

ecoca
01-20-2009, 09:16 AM
I'm very sorry, but you are WRONG (don't bet, you will lose)! The bug was only partially corrected in the 1006 release (with release 1005 the file lengths were 0k OR 1k). I installed the latest version and deleted all old files (scripts and databases) before launching a new indexing operation. Below are the results obtained (with release 1007) with both the PHP and JS scripts. You can see the differences (the lengths in JS are now correct):

PHP:

1. Article - RFID - A Basic Introduction to RFID Technology and its use in the supply chain.pdf
Terms matched: 2 - Score: 47 - 13 Feb 2008 - 877,560k
2. Article - Reuters Business Insights Pharmaceutical Anti Counterfeiting Strategies RFID.pdf
Terms matched: 2 - Score: 31 - 12 Jan 2009 - 870,700k
3. Article - RFID Applications Impacts and Country Initiatives OECD 2008.pdf
Terms matched: 2 - Score: 31 - 10 Jan 2009 - 323,773k
4. Article_RFID_Security_and_Privacy_A_Survey_2005.pdf
Terms matched: 2 - Score: 31 - 8 Jan 2009 - 642,641k

JS:

1. Article - RFID - A Basic Introduction to RFID Technology and its use in the supply chain.pdf
Terms matched: 2 - Score: 47 - 13 Feb 2008 - 857k
2. Article - Effects of metallic plate size on the performance of microstrip patch-type tag antennas for passive RFID.pdf
Terms matched: 2 - Score: 31 - 23 Dec 2007 - 393k
3. Article - EPCglobal2 UHF RFID Protocol V109 12-2005.pdf
Terms matched: 2 - Score: 31 - 12 Oct 2007 - 4035k
4. Article - Quazi-Fractal Antennas For Rfid Systems Operating In The Ism 2.45 Ghz And 5.8 Ghz Bands.pdf
Terms matched: 2 - Score: 31 - 16 Feb 2008 - 220k

Also, with the latest release (1007), the search results differ between the PHP and JS versions (same files, same settings): for the same search word, the results are not the same. Is this normal?

Ray
01-20-2009, 10:28 PM
I'm gonna have to stick with my bet :) but feel free to prove me wrong and send us your files (link in my previous post).

Check whether you have configured Zoom to use a custom copy of the search script (on the "Advanced" panel of the Configure tab, there's an option to "Specify my own path for the script source code"). If you have this selected, you would be using your own copy of the script placed elsewhere, and not the latest one that came included with the new update.

Second thing to check - are you uploading your files with Zoom or a third party FTP client? If you are doing the latter, are you uploading and overwriting the "search.php" file after re-indexing?

ecoca
01-21-2009, 03:51 AM
1. That "Custom ..." option is not checked in the Advanced options tab. In fact, only "Do not show ..." is checked in that menu.

2. Everything is done locally: the files are stored locally, the search scripts are local, and there is no HTTP or FTP activity (the indexing is done on the server; only the client search is remote).

3. The search.php file is dated 19.01.2009, 14:44.

You, and any user, may see the differences between the two result versions using any files not necessary mine ...

Why are the results with PHP and JS different with all the files and the settings the same ?

Ray
01-21-2009, 04:30 AM
You, and any user, may see the differences between the two result versions using any files not necessary mine ...

I've said before that I've tried to reproduce this several times now. I also repeated the test with my last post using the latest build. So I've done this 3 times for PHP, and twice for ASP, CGI, and JS. The point is that I cannot reproduce the problem and that is why I asked you for your files.

If you do not want to send me the files then I can't look at the problem no matter how much you insist.


Why are the results with PHP and JS different with all the files and the settings the same ?

This is normal. Search results with the same score are not guaranteed to be ranked the same on different platforms. The different platforms have different sorting algorithms, and we used what was most efficient and practical for that particular scripting platform.
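
A small Python sketch (not taken from any of the Zoom scripts; the file names and scores are made up) shows why tied scores can come out in a different order, and how a secondary sort key would make the order repeatable:

```python
# Hypothetical search results as (filename, score) pairs
results = [
    ("b.pdf", 31),
    ("a.pdf", 47),
    ("c.pdf", 31),
    ("a2.pdf", 31),
]

def rank_by_score(items):
    # Sort by score only: tied items keep whatever order the input had
    return sorted(items, key=lambda item: -item[1])

def rank_deterministic(items):
    # Sort by score (descending), then filename, so ties always resolve the same way
    return sorted(items, key=lambda item: (-item[1], item[0]))

# Same data in a different starting order gives a different ranking for ties
print(rank_by_score(results) == rank_by_score(reversed(results)))            # False
print(rank_deterministic(results) == rank_deterministic(reversed(results)))  # True
```

When only the score is compared, the order of tied results depends on the sorting routine and the order the entries arrive in, both of which can differ between platforms.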

ecoca
01-21-2009, 01:15 PM
I cannot send you hundreds of megabytes over Internet and some files are not for public release.

Please post on this forum the search results you obtained (only 2 or 3 results) with PHP and JS on the same files, so that I can compare them with the results I obtained and posted here. Thanks!

P.S. This is a public forum: can anyone test the program by setting some PDF files as "binary", indexing some of them locally (without their content), choosing PHP, and running the code on a web server? Are the file sizes reported correctly?

wrensoft
01-21-2009, 05:39 PM
I cannot send you hundreds of megabytes over Internet and some files are not for public release

I don't see the problem. Just index 1 file from your collection.

We have already looked at this numerous times now and don't plan to spend any more time on it unless you want to co-operate and send us some of the details we asked for.

ecoca
01-21-2009, 05:48 PM
Please tell me what I should send you and how (by e-mail?).

Later edit:

No need to investigate further. My fault, sorry! In the search folder on the server I had created a file called index.php, which was in fact a copy of search.php (the version from 19dec2008). Everything is working correctly now in this respect!

Thanks !
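
For anyone who runs into the same thing, a leftover copy of an old script can be spotted by comparing file contents. The following Python sketch is only an illustration: the helper names `file_digest` and `find_stale_copies` are hypothetical, and it assumes every copy is supposed to be byte-for-byte identical to the freshly generated search.php:

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def find_stale_copies(reference, candidates):
    """Return the candidate paths whose contents differ from the reference script."""
    expected = file_digest(reference)
    return [str(path) for path in candidates if file_digest(path) != expected]
```

For example, `find_stale_copies("search.php", ["index.php"])` would return `["index.php"]` if index.php was copied from an older release.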