Checking for Duplicates - Efficiency

jimerb · August 23, 2009, 1:39am

I have a need to check for duplicate large video files.

The problem is that they often have different names but the same size.

The most accurate check is to compare their MD5 checksums.

What doesn't make sense to me is that the MD5 comparison doesn't seem to check file sizes first, which would save it from wasting time reading files. If the file sizes don't match, then it doesn't need to compute the MD5 checksums. Right?

What am I missing?

Whenever I do an MD5 search it seems to take forever. It should only do the MD5 search on files with the same size.

The radio buttons in the "Comparison Method" section of the find duplicates dialog should be changed to check boxes. Or better yet, MD5 should always do a file size check first.

What do you think?

Steve · August 23, 2009, 2:14am

Files can have the same size but different contents. I'm not sure how the MD5 check works internally, but to me it would make sense to check the file size first. Perhaps we're missing something obvious though.

jimerb · August 23, 2009, 2:36am

Yes, I agree. But if the file sizes don't match, you're not going to get an MD5 match. So instead of reading huge files to find out they don't match, just look at file size first.

It's a quick way to weed out non-matches so the scan would run much faster.

Right?

What am I missing?

Leo · August 23, 2009, 4:11am

Unless you're using a really old version, or have turned on the column which displays MD5s for all files (and overridden the "max MD5 size" setting), Opus should only be calculating MD5s when at least two files have the same size.

i.e. Opus should already be doing what you're suggesting.
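
For anyone who wants to see the idea outside of Opus, here is a minimal Python sketch of the size-first approach. It is not Opus's actual code, just an illustration; the names `md5_of_file` and `find_duplicates` are made up for this example. Files are grouped by size, and only groups with at least two files are ever read and hashed.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group paths with identical MD5s, hashing only files that share a size."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size: cannot have a duplicate, so it is never read
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[md5_of_file(path)].append(path)
        duplicates.extend(group for group in by_hash.values() if len(group) > 1)
    return duplicates
```

Note that when many large files do share a size (as with the video files in this thread), each of them still has to be read in full, so the size check alone does not make that case fast.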

jimerb · October 20, 2009, 1:09pm

One comparison method that would be a welcome improvement is checking by file size only (ignoring the name).

The reason is that I often have duplicate LARGE movie files with different names. If I select MD5, then Opus has to read the entire file to make sure it is identical. This is time-consuming.

However, I know that if the size is the same and viewing the first few seconds of the clip shows the same thing, then I have an identical file.

Does this make sense? Is there a way I could achieve the same thing now? MD5 is taking forever to compare a bunch of movie files that have a lot of duplicates.
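
A rough Python sketch of that "same size plus same beginning" check, for illustration only. It is not an Opus feature; the function name `likely_duplicates` and the `head_bytes` parameter are hypothetical. It reads only the first part of each file, so it is fast, but it is a heuristic: files in the same group can still differ later on.

```python
import hashlib
import os
from collections import defaultdict


def likely_duplicates(paths, head_bytes=1024 * 1024):
    """Group files that share a size and the same first head_bytes bytes.

    Heuristic only: a match here does not prove the files are identical.
    """
    groups = defaultdict(list)
    for path in paths:
        size = os.path.getsize(path)
        with open(path, "rb") as handle:
            head = handle.read(head_bytes)  # only the start of the file is read
        groups[(size, hashlib.md5(head).hexdigest())].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

Anything this reports could then be confirmed with a full MD5 comparison before deleting, which limits the slow full reads to the likely matches.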

jimerb · February 5, 2011, 5:55am

Has this ever been improved?

Leo · February 5, 2011, 11:08am

No. (Was it requested?)

It doesn't make sense to me, FWIW. Lots of files can have the same size without being duplicates. If you have to view the first few seconds of the files to check them, why not just use flat-view and/or do a Find on the folder so you see all files below it, then sort by size?