Speeding Up Find Duplicates by MD5?

ouwen · May 23, 2016, 4:09am

Hello,

I've got several external drives that I've been using for archival (videos, photos, documents). Each of these is many TB as it's many years worth of data. I've been using Directory Opus' "Find Duplicates" (MD5 Checksum) but it's literally an all day thing to search for these, and I find that I don't really have time to spend 2-3 days non-stop just hunting down duplicates and organizing files.

Instead of relying on Opus to take the MD5 of all files each time, what about having either one file per drive, or even more ideally, one file per folder that would cache MD5 hash, last updated timestamp, filename. This could be updated via Opus or via an outside script... Doesn't seem necessary to take the hash of every file every time to find duplicates.

Is there a way to feed cached MD5 sums into the Find Duplicates panel right now?
Have you considered any options for caching MD5 sums that would speed up finding duplicates?
Any integration with locate's updatedb or any other 3rd party standard for building a database of files would seem to be a big bonus as other software may also understand that format.

Thoughts?

Ryan

Jon · May 23, 2016, 4:18am

The whole point of a checksum is it provides greater assurance that a file has (or hasn't) been modified than the timestamp and size do. If Opus cached checksums then it would need some way of knowing when the checksum needed to be recalculated - and the only way would be the size and/or timestamp. If you're going to rely on the size and/or timestamp like that you may as well just use them for the duplicates search in the first place and skip the checksum altogether.

Deipotent · May 23, 2016, 11:40pm

A while back I think I might have suggested a way to speed up MD5/SHA Duplicate search by first comparing, say, the first 100 bytes, or pick 10-100 bytes throughout each file and only calculate the checksum if this first check matches. I haven't done any test to see if this would speed it up, but my guess is that it probably would, possibly substantially.

cryodream · May 24, 2016, 5:15pm

How about using HashSourceCodes – OpenSubtitles.org instead of MD5, or better still - use 2 step verification.

Opensubs hash is extremely fast, especially for large files. I use it constantly. Because it does not use the whole file, it may give false positives for duplicates.

That's why the perfect solution would be an optional 2 step verification:

scan for duplicates using opensubs hash
optionally verify the found duplicates using MD5

You could search terabytes of data for duplicates in minutes. Especially, the less duplicates would be found, the faster it would finish (that's if you would use the double verification using MD5).
At least me personally, I mostly want to know if I have duplicates. Using MD5 to scan terabytes of files just to make sure that I don't have any duplicate files is stupid...

Leo · May 24, 2016, 5:34pm

Opus already only calculates an MD5 if two files have the exact size. So it already does a 2 step verification (you can think of the file size as a very simple hash that is even quicker because it takes zero time).

jimerb · May 28, 2016, 1:10pm

If you stress test this feature by giving it the worst scenario you can significantly optimize performance where it needs it the most.

Think of many, many large videos on diverse network shares.

Some videos start off with the same leading clip (intro for example) and then they start to deviate from each other after about 1 minute. They also can be rendered to compress to a certain size (for example compress/render video to give me a 2GB file) To have to pass a HUGE file just to see if the file is different, is slow and time consuming.

A more flexible and speedy way would be to have a slider for MB to check. So depending on your situation you have:

A file size check
MD5 check on the first X megabytes (x provided via the dialog)
Optionally do a full MD5 check if the X megabyte check gets a positive.

This way you could set a very fast check on the majority of files and only resort to a full MD5 if you want to guarantee uniqueness. This would go lightning fast no matter what because you are always doing brief checks by reading only the first x megabytes.

Leo · May 28, 2016, 1:38pm

They aren't likely to have exactly the same size as each other, down to the last byte. Unless we are talking about uncompressed audio and completely uncompressed video (not even quick/lossless compression/RLE but literally storing each frame as an uncompressed bitmap), identical metadata, and identical durations.

These scenarios seem artificial to me.

jimerb · June 4, 2016, 10:53pm

I just compared 360 video files of about 200mb to 1268 files of the same size. 38 duplicates took 30 minutes to find. The problem is that the sizes were the same but the names were different so it had to read all of those big files to ensure they were dups.

If the MD5 was sampling the video instead of ingesting the whole thing it would take seconds.

(For example, if you grabbed the first 5mb of video and hashed it and the last 5MB of video and hashed it, you could provide a way of sacrificing a guarantee of accuracy for a tremendous speed increase.

I would envision this as a new option for those doing serious work with large files.

Leo · June 5, 2016, 4:30pm

How are you ending up with different video files that are exactly the same size as each other? Are you using uncompressed audio and video codecs? Can you explain this scenario in more detail so we can understand what's happening, and be sure we're diagnosing it correctly?

jimerb · June 5, 2016, 7:49pm

These are downloads of the exact same file but the names are different so you can't use the filename / size option. Just the MD5 option will do the trick. The compression of these is the same typically. It's just the names are different.

It's easy to see if you manually look at the clips that they are the same but today Opus has to ingest the entire huge files 2x just to make sure they are the same.

I would be willing to sacrifice a guarantee of accuracy to make the speed increase by 100 fold or more. You literally could check thousands of files quickly if you simply compared portions of the big file.

Think about it if you were doing it manually: You would start the two videos and check the first few seconds, and then you would skip to the end and check the last few seconds. That's how a human would do it. A program could do that in a split second.

I personally think this would be a killer big feature for opus in an upcoming release. Compare thousands of large files that have different names in seconds.

Leo · June 5, 2016, 9:28pm

I understand now, thank you.

I thought the aim was to avoid calculating full MD5s for files that had identical sizes but different content, but now I understand that the files are exactly the same and the request is about only checking parts of them to get a less accurate but faster confirmation that they are the same.

Leo · June 8, 2016, 9:59am

Stay tuned for 12.0.7 which will implement some of these ideas.