I am migrating data to a new backup hard drive and want to avoid transferring unnecessary files across.
The data consists of:
Size: 2.16 TB
Files: 2.64 million
Folders: 371,000
File Types: Windows executables (.exe), Word documents (.docx), Adobe PDFs (.pdf) and text files (.txt)
I tried running a duplicate file search (using the MD5 checksum matching option) on the 4 main folders that contain the above data. Opus 12.19.6 Beta (64-bit) became unstable once almost 500,000 files had been found, totalling over 900 GB. Opus's RAM usage exceeded 5 GB, the duplicate search's in-progress dialog box disappeared, and the file list continued to grow but could no longer be scrolled through or sorted.
While I could perform a duplicate search on each of the 4 main folders individually, there is overlap between them, so searching them all together would give the best result.
Should I reduce the number of files returned by limiting the search to files of 5 MB and above? Any other suggestions for effectively processing and reducing the size of this data would be much appreciated.
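On the 5 MB idea: a size floor helps a lot, and so does grouping files by size before hashing, since a file with a unique size can never be a duplicate. Outside Opus, the general technique looks roughly like this (a minimal Python sketch only; the folder names and the 5 MB threshold are placeholders, not anything Opus-specific):

```python
import hashlib
import os
from collections import defaultdict

MIN_SIZE = 5 * 1024 * 1024  # assumed 5 MB floor from the question above


def md5_of(path, chunk=1 << 20):
    """Hash a file in 1 MB chunks so even huge files never sit in RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()


def find_duplicates(roots):
    # Pass 1: group files by size; unique sizes can be discarded immediately.
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(path)
                except OSError:
                    continue
                if size >= MIN_SIZE:
                    by_size[size].append(path)

    # Pass 2: only hash files that share their size with at least one other file.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_hash[(size, md5_of(path))].append(path)

    return {key: paths for key, paths in by_hash.items() if len(paths) > 1}


if __name__ == "__main__":
    # Hypothetical folder names standing in for the 4 main folders.
    roots = [r"D:\Main1", r"D:\Main2", r"D:\Main3", r"D:\Main4"]
    for (size, digest), paths in find_duplicates(roots).items():
        print(f"{size:>14,} bytes  md5={digest}")
        for p in paths:
            print("   ", p)
```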
2.6 million files will definitely put a strain on things.
I would try removing duplicates on sub-folders first, to reduce the total number of files, and then see if things work better on the overall set with a smaller number of items.
You could also use the filter option in the Dupe finder to do executables, Word documents and PDFs separately.
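To illustrate why splitting by type helps (just a sketch of the general idea, not how Opus implements it; the extensions and folder name are examples only), each pass then carries only a fraction of the 2.6 million files:

```python
import os

# Example per-type passes, mirroring the "do each file type separately" idea:
# each pass only considers files with one set of extensions, so the result
# list for any single pass stays comparatively small.
PASSES = [(".exe",), (".docx",), (".pdf", ".txt")]


def files_of_type(root, extensions):
    """Yield paths under root whose extension is in the given set."""
    exts = {e.lower() for e in extensions}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            if os.path.splitext(name)[1].lower() in exts:
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    root = r"D:\Main1"  # hypothetical folder name
    for extensions in PASSES:
        count = sum(1 for _ in files_of_type(root, extensions))
        print(f"{extensions}: {count} candidate files in this pass")
```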
The data clean-up took a lot longer than I thought, but also turned out better than I had initially planned. There were far more file types in there than I expected, e.g. ZIP archives, ISOs (Linux installers), binary files (large blocks of data to be processed programmatically; I don't mean .elf, .exe or .dmg files), CSVs, and small disc images (2–3 GB each) from Acronis backups.
Many thanks Leo; I followed your suggestion of processing the sub-folders first. I then used the filter in the duplicate finder to find only .exe files larger than 100 MB, then 50 MB, then 10 MB, and deleted many of the results. My methodology, in the order I carried it out, is provided below.
Opus performed brilliantly once the data set was smaller. I was disappointed, however, that Opus didn't scale well to such a large data set, given that it's a 64-bit application running on a powerful system. 5 GB of RAM out of my 64 GB shouldn't have been an issue with the original, larger list, and my CPU usage was extremely low too. I understand that the internal data structures Opus likely uses to represent such collections of files were under pressure, but I would have thought they would scale better.
The final result, which is still a huge collection of files, is "small" enough for me. I'll try to continue reducing its size over time. I probably spent about 8 hours on and off (over about 2 weeks) running these searches and manually processing the results.
Separately, where can I submit a feature request for the duplicate file checker, please? It's quite easy to explain my request, but I imagine it would not be trivial to implement. It would have saved me from having to manually review and process the results: with this feature I could simply click once and all duplicate copies would be removed from the results.
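To make the request concrete: given the duplicate groups the finder already produces, I'd like a single action that keeps one copy per group and marks everything else for removal. In rough Python terms it would behave something like this (purely illustrative; the group structure and the keep-the-first rule are my own assumptions, not an Opus API):

```python
from typing import Dict, List


def plan_removals(groups: Dict[str, List[str]], keep: int = 1) -> List[str]:
    """Given duplicate groups (e.g. keyed by MD5), keep `keep` copies in each
    group and return every other path as a single deletion list."""
    to_delete = []
    for paths in groups.values():
        # Ordering rule is arbitrary here; it could equally be by date,
        # folder priority, etc.
        ordered = sorted(paths)
        to_delete.extend(ordered[keep:])
    return to_delete


if __name__ == "__main__":
    # Hypothetical groups, as a duplicate search might report them.
    groups = {
        "abc123": [r"D:\Main1\setup.exe", r"D:\Main3\old\setup.exe"],
        "def456": [r"D:\Main2\report.pdf", r"D:\Main4\report.pdf",
                   r"D:\Main4\copy\report.pdf"],
    }
    for path in plan_removals(groups):
        print("would remove:", path)
```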
Thanks again for your time.
The final result was:
Size: 1.01 TB
Files: 2.05 million
Folders: 294,000
Carried out on the sub-folders first, as suggested, and then on all folders (a sketch of the plain size sweeps follows this list):
Search for duplicate .exe files larger than 100 MB
Search for duplicate .exe files larger than 50 MB
Search for duplicate .exe files larger than 10 MB
Search for files (of any type) larger than 1 GB
Search for files (of any type) larger than 500 MB
Search for files (of any type) larger than 100 MB
Search for files (of any type) larger than 50 MB
Search for duplicate files (of any type) larger than 100 MB
Search for duplicate files (of any type) larger than 50 MB
Search for duplicate files (of any type) larger than 10 MB
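For completeness, the "files (of any type) larger than …" passes are just size sweeps at descending thresholds; outside Opus they amount to something like this (hypothetical folder name, thresholds matching the list above):

```python
import os

MB = 1024 ** 2


def files_larger_than(root, threshold):
    """Return (size, path) pairs for files of at least `threshold` bytes, largest first."""
    hits = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue
            if size >= threshold:
                hits.append((size, path))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    root = r"D:\Main1"  # hypothetical folder name
    for threshold in (1024 * MB, 500 * MB, 100 * MB, 50 * MB):
        hits = files_larger_than(root, threshold)
        print(f">= {threshold // MB} MB: {len(hits)} files")
        for size, path in hits[:10]:  # show the ten largest per pass
            print(f"    {size // MB:>8} MB  {path}")
```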