Tips for Deleting A Very Large Collection of Duplicate Files

Jimbo575 · March 10, 2020, 10:44am

Hi everyone,

I am migrating data to a new backup hard drive and wish to reduce unnecessary files being transferred across.

The data consists of:

Size: 2.16 TB
Files: 2.64 million
Folders: 371,000
File Types: Windows Executables (.exe), Word documents (*.docx), Adobe PDFs and text (.txt) files

I have tried running a duplicate file search (using the MD5 checksum matching option) on the 4 main folders which contain the above data, Opus Version 64-bit 12.19.6 Beta became unstable since almost 500,000 files were found totaling over 900 GB. RAM usage by Opus reached over 5 GB. The duplicate file search in-progress dialog box disappeared while the file list continued to grow and could not be scrolled through or sorted.

While I could perform a duplicate search on each of the 4 main folders individually, there is an overlap between them and thus searching all of them would give the best result.

Should I reduce the files returned by limiting them to 5 MB and above? Any other suggestions to effectively process and reduce the size of this data would be much appreciated.

Thanks in advance for your time.

My system specification:

CPU: Intel Core i9 Extreme 7980XE (18 cores, 36 threads)
RAM: 64 GB DDR4 (4 modules in quad channel mode)
Hard Disk (with the duplicate files): 5 TB WD Black (7200 RPM, 128 MB Buffer)
OS Disk: 800 GB Corsair Neutron NX500 PCI Express (NVMe) SSD
OS: Windows 10 Pro for Workstations 64-bit Version 1909 (Build 18363.693)

Leo · March 10, 2020, 11:10am

2.6 million files will definitely put a strain on things.

I would try removing duplicates on sub-folders first, to reduce the total number of files, and then see if things work better on the overall set with a smaller number of items.

You could also use the filter option in the Dupe finder to do executables, word documents and PDFs separately.

Jimbo575 · March 10, 2020, 1:38pm

Thanks Leo, I'll follow your suggestions and respond.

I'm afraid I'm away from the system for the rest of the week but will be back at the weekend.

Jimbo575 · March 24, 2020, 9:27pm

Hi Leo and everyone,

Sorry for not responding sooner.

The data clean-up took a lot longer than I thought but also turned out better than I had initially planned. There were far more file types than I thought e.g. ZIP, ISO (Linux installers), binary files (large blocks of data to be processed programmatically, I don’t mean *.elf, *.exe or *.dmg files), CSV and small disc images (2 – 3 GB in size) from Acronis backups in there too.

Many thanks Leo; I followed your suggestion of processing the subfolders first. I then used the filter in the duplicate finder to find only .exe files larger than 100 MB, 50 MB, 10 MB and deleted many results. My methodology and the order I carried it out are provided below.

Opus performed brilliantly when the data set was smaller. I was disappointed however that Opus doesn’t scale well to such large data sizes it’s a 64-bit application on a powerful system. 5 GB of RAM out of my 64 GB shouldn’t have been issue with original larger list. My CPU usage was extremely low too. I understand that the internal data structures Opus likely uses to represent such collections of files were under pressure, but I would have thought they should have scaled better.

The final result which is still a huge collection of files is “small” enough for me. I’ll try to continue to reduce its size over time. I probably spent about 8 hours on and off (over about 2 weeks) running these searches and manually processing the results.

Separately where can I submit a feature request for the duplicate file checker, please? Its quite easy to explain my request but I imagine it would not be trivial to implement it. It would have saved me from having to manually review and process the results. With this feature I could simply click once and all duplicates would be removed from the results.

Thanks again for your time.

The final result was:

Size: 1.01 TB
Files: 2.05 million
Folders: 294,000

Carried out on sub-folders first as suggested and then all folders:
Search for duplicates of type .exe of sizes greater than 100 MB
Search for duplicates of type .exe of sizes greater than 50 MB
Search for duplicates of type .exe of sizes greater than 10 MB
Search for files (of any type) greater than 1 GB
Search for files (of any type) greater than 500 MB
Search for files (of any type) greater than 100 MB
Search for files (of any type) greater than 50 MB
Search for duplicates of any file type of sizes greater than 100 MB
Search for duplicates of any file type of sizes greater than 50 MB
Search for duplicates of any file type of sizes greater than 10 MB

rcoleman1943 · March 25, 2020, 7:04pm

The response to that question is always to make the request here on the forum.