I have a question about Find Duplicate Files. I have 50,000 files in each of two directory trees to compare, and I use the MD5 checksum approach. Of course that takes a long time, and it then comes up with the list in a Duplicate Files folder. All good so far.
Question: Is there a way to compare two very large directory trees (>7,000 subdirectories in each), using the first one as the "reference" directory, and search for duplicates ONLY in the second tree, never within the first reference directory itself? I don't want to find duplicates in the reference tree even though there are some. The problem I have is that my ebooks or music might deliberately appear more than once in the reference directory tree, and I would like to keep them there as duplicates; I am only looking for duplicates in the second directory tree. I can't rely on filenames, as I MUST use the MD5 checksum mode because the new files have odd names. I only want duplicates that appear in the second directory tree.
After a search I normally "sort" by folder, and then reselect files for deletion. This works reasonably well to achieve part of this task (because it puts the reference directory tree as the first choice and marks the second for deletion), and I have been using this to date. HOWEVER, it will also find the duplicates inside the reference folder that I want to keep, and mark them for deletion too. Not a problem for a hundred files in the results, but for 2,000 or more duplicates, I can't go through each file and unselect just those few - the task is too big for me...
Any ideas or workarounds?
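To illustrate what I mean, here is the logic I'm after sketched as a standalone Python script. This is purely my own illustration of the intent, not an Opus feature; the function names and paths are made up:

```python
import hashlib
import os

def md5_of(path, chunk=1 << 20):
    """MD5 of a file's content, read in chunks so large files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def hashes_in(tree):
    """Map every file under 'tree' to its MD5 (content only, names ignored)."""
    result = {}
    for root, _dirs, files in os.walk(tree):
        for name in files:
            path = os.path.join(root, name)
            result[path] = md5_of(path)
    return result

def duplicates_in_second(reference, second):
    """Files under 'second' whose content also exists under 'reference'.
    Duplicates *within* 'reference' are never reported."""
    ref_hashes = set(hashes_in(reference).values())
    return [p for p, h in hashes_in(second).items() if h in ref_hashes]
```

The key point is the one-way comparison: the reference tree only contributes a set of checksums, so copies inside it are never listed as candidates for deletion.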
Can you ungroup the duplicates collection, sort by path, and select only those files you want deleted (based on path) and Delete?
I don't know if it will be a problem, but one possible issue with that approach is that there may be two copies of a file in the second directory without any in the main "reference" directory. You'd then delete both copies.
However, you could avoid that by first doing a duplicates search in the second directory only, and deleting all but one of every file there (which the duplicate finder will do easily enough). Now you know there are no duplicates in the second directory by itself.
Then you can do the duplicates search across the two directories and, as MrC says, ungroup the results and delete anything from the second directory that was found to be a duplicate. (You now know those have to be duplicates of things in the reference directory because you just ensured there are no duplicates in the second directory by itself.)
Then, anything which is left in the second directory must be a new file that needs to be moved to the "reference" directory.
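For anyone who wants to see the three steps spelled out, the same strategy can be sketched as a standalone Python script. This is only an illustration under my own assumptions (the helper names and paths are invented, and the deletes are real, so don't run it on live data without a backup):

```python
import hashlib
import os
import shutil

def md5_of(path, chunk=1 << 20):
    """MD5 of a file's content, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def walk_files(tree):
    """Yield every file path under 'tree'."""
    for root, _dirs, files in os.walk(tree):
        for name in files:
            yield os.path.join(root, name)

def dedupe_within(tree):
    """Step 1: keep only one copy of each distinct content inside 'tree'."""
    seen = set()
    for path in sorted(walk_files(tree)):
        h = md5_of(path)
        if h in seen:
            os.remove(path)
        else:
            seen.add(h)

def remove_cross_duplicates(reference, second):
    """Step 2: delete anything in 'second' whose content exists in 'reference'.
    Duplicates inside 'reference' itself are untouched."""
    ref_hashes = {md5_of(p) for p in walk_files(reference)}
    for path in list(walk_files(second)):
        if md5_of(path) in ref_hashes:
            os.remove(path)

def move_new_files(second, reference):
    """Step 3: whatever is left in 'second' is new, so move it to 'reference'."""
    for path in list(walk_files(second)):
        shutil.move(path, os.path.join(reference, os.path.basename(path)))
```

The ordering matters: step 1 guarantees that after step 2, nothing surviving in the second tree can be a duplicate of anything, which is what makes step 3 safe.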
Yes, that absolutely could be an issue. I was assuming the first dir was always the master, and the second could be pruned without concern. Good workaround solution.
Thanks a lot to both of you. This strategy works very well for this task. I am very pleased!
I was quite puzzled at first as to exactly how to ungroup the duplicates collection. Turns out, it is under a right-click. Puzzled, because why would I think to right-click on a single file or its heading when there were over 6,000 in the list this time? That's why I had never seen or heard of that feature. I suggest it could be improved if there is ever time...
After many years, still the power of DOPUS continues to amaze and delight me...