Duplicate file finder

I have a folder of images that I have sorted into subfolders based on what the images are of, when they were taken, and so on.
I have another folder of unsorted/new images. All new and unsorted images start in this folder and are then moved to the sorted folders.
Due to a failure in my workflow, I have images in the unsorted folder that are also in the sorted folders.

What I would like to do is identify all the duplicates and delete the copies from the unsorted folder.
Any thoughts on how I can achieve this? I have thousands of images and probably hundreds of duplicates, so auto-selecting the ones in the unsorted folder would be ideal.
I don't mind if this requires another tool.

thanks

Not sure if it helps, but if you are in your sorted folder, you could try:

Select *.jpg
Select SOURCETODEST
Set Dest=Toggle

So all images present in that folder would be selected in the "unsorted" folder, being potential duplicates (they should be). Of course, there is also the duplicates function, but I'm not too familiar with that. You may want to change the "Select" part to "Select all", or "Select *.(jpg|tiff)", or similar.

Opus has a duplicate file finder built-in:

Opus 11: https://www.gpsoft.com.au/help/opus11/index.html#!Documents/Duplicates.htm

Opus 12: https://www.gpsoft.com.au/help/opus12/index.html#!Documents/Duplicates.htm

I've often had the same need... and the challenges I have with Opus' duplicate file finder for this purpose are:

  1. Depending on the exact folder structures being searched, 'delete mode' doesn't select the files I would actually prefer to delete. But I don't know how to suggest a practical enhancement to make this better. Maybe allowing users to specify a folder path or paths that are 'protected' from delete selection, or something? I often have MULTIPLE folders that I wouldn't mind deleting FROM, whereas there is usually one particular folder I always want to KEEP the files in (my variant of the 'sorted' folder).
  2. I've performed a dupe search only to then realize I won't have enough time to sort through everything (made more time-consuming when I have to change the delete selections in the dupe results collection, per the observation above)... and been a bit bummed that there's no way to save the delete selection state in the dupe results collection.

So now I have more or less abandoned the 'delete mode' option: because of the first observation above, simply using the 'Delete' button after the find completes can sometimes delete files from my 'sorted' folder that I actually want to keep... and since the selections aren't saved in any way inside the collection, coming back to work on things 'later' won't be helped by using delete mode anyway.

So, I generally:

  • do an initial dupe search just inside the 'unsorted' folders, since I sometimes have dupes there, and delete all but one copy... for the reason described in the next step.
  • do a second dupe search between both the 'sorted' and 'unsorted' folders, having already gotten rid of the multiple dupes in the 'unsorted' folder above. With this second search I don't run the risk of deleting my only copy of a file that was duplicated only within the 'unsorted' folder and had no dupe in my 'sorted' folder. I can now confidently re-sort the list or do an advanced selection based on location / path in order to get at the files I want to delete.

A bit cumbersome of a workflow, but it usually gets me the results I'm after faster than other methods. It all depends on how many dupes you in fact end up with and have to clean up...

Thanks for the replies.
@abr, that might help; thanks for reminding me of the function. If I need to code my own solution, it might come in handy.

@steje, what you have described matches my experience. Thanks for sharing your workflow. I will use that if I can't find another, more automated solution.

@leo, I am aware that Dopus has a duplicate finder, but thanks for sharing. However, I don't believe the duplicate finder in Dopus can easily satisfy my requirements.
Given two folders, I would like to identify all the files in folder B that are also in folder A, including sub-folders of both A and B. I want to select only the files in folder B so those can be easily deleted.

Can Dopus do this?
Does anyone know of another tool that can assist in performing this task?

Thanks

What about Flat View (no folders) for both, then select a -> b (source to dest) & delete?

Hi @tbone. How do you do that selection a -> b?

I found it under menu Edit -> Select Other -> Select Source to Destination.
I didn't know about this feature. Interesting idea; it will be worth playing with.

This only matches on file name; ideally I would like something like the duplicate file finder that can look at file size and hash.
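For what it's worth, since another tool is acceptable, the kind of size-plus-hash check I mean can be sketched in a few lines of Python. The folder paths below are just placeholders, and this is only a rough outline rather than a finished tool:

```python
import hashlib
from pathlib import Path

# Hypothetical paths; adjust to the real sorted/unsorted folders.
SORTED = Path(r"D:\Photos\Sorted")
UNSORTED = Path(r"D:\Photos\Unsorted")

def file_hash(path, chunk_size=1 << 20):
    """Hash file contents in chunks so large images aren't read into memory at once."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Index the sorted tree (including sub-folders) by (size, content hash).
sorted_index = {(p.stat().st_size, file_hash(p)) for p in SORTED.rglob("*") if p.is_file()}

# Any unsorted file whose size and hash appear in the sorted index is a duplicate;
# only unsorted copies are ever listed, so the sorted tree is never touched.
for p in UNSORTED.rglob("*"):
    if p.is_file() and (p.stat().st_size, file_hash(p)) in sorted_index:
        print(p)  # review the list first; replace with p.unlink() to actually delete
```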

Ok, then what about this?: Compare tab contents in dual display mode
There is a version that also checks for size; doing the hash thing only takes two more lines to add, I guess.

That's a great script. I suspect I can get it to do what I need; it will take a bit of messing around. It would be nice to be able to use the built-in duplicate file functionality, though.

Perhaps a feature suggestion then:
Currently the duplicate finder is great for finding any duplicates in a given set of folders. Something it does not do so well is identify all the items in folder A that are also in folder B, allowing the user to easily select and delete the items from folder B that are in folder A.
If a file is in folder B twice but not in folder A, it should not be identified.

A scenario where this is useful: you back up your images from a phone and sort them into folders, but forget to remove the images from the phone. Some time later you back up your phone images again and want to delete all of the already-sorted items, so you are only left with the new ones.

Regards

In that scenario, why do you want to keep duplicates in the backup folder?

I don't, but I would like to deal with them separately. When looking for duplicates in the sorted folder, I need to check each duplicate and decide which one I want to remove. The current duplicate implementation is good for that.

When looking for items in the unsorted folder that are also in the sorted folder, I know that they can all be deleted without any individual thought.

Does that answer your question?

You can influence which files are selected for deletion automatically by sorting by location. If the destination has already been de-duplicated then you don't even have to look through the list of files as long as it's sorted in the right direction, as you know the files marked for deletion will all be in the source (as the destination can't have duplicates). (Either way, you'll only have one copy of each file in the end.)

Another duplicate problem, similar to that one. I have two folders, with duplicates between the folders and also other duplicates that exist only inside one folder. What can I do to remove all duplicates but keep the 'originals' in both folders? If you group them by location you could select files which are duplicated only within one folder (so when you delete, you remove both the duplicate and the 'original'). Of course, after selecting by location you can group them by 'duplicates' and manually check whether you selected a single file from each group or both files in the same folder, but this is a bit tedious if you have more files.

The solution would be selecting all duplicates in one folder which also exist in the other, but NOT duplicates that exist only inside one folder. How can this be done so as to prevent deleting all versions of a file that live in just one folder?
It would be great to have an automatic option to select files in one folder when a copy exists in another folder, when comparing two or more folders. The script above will also do the job, but who can add those 'two lines' to compare files by hash (MD5) only (or additionally, as an option, by name and dates)?

(Of course, you can first search for and remove duplicates within one folder, do the same in the other, and then be sure that any remaining duplicates must be between the folders. But a method that prevents deletion of within-folder duplicates when searching two folders would be faster and safer; without it, with lots of files and searches it could quite easily happen that you forget to first search for duplicates inside each folder, and then end up selecting and deleting every copy of such a file.)
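In case it helps anyone, the MD5 part on its own is small; a rough standalone Python illustration (not the actual Opus script, just the idea, with made-up file names) would be:

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file's contents, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Two files count as duplicates only if their content digests match,
# regardless of name or dates, e.g.:
# md5_of(r"FolderA\img_001.jpg") == md5_of(r"FolderB\renamed_copy.jpg")
```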

Maybe a concrete example of what you're trying to avoid would help us understand. If you only want to end up with one copy of each file (which it sounds like is the aim, if you're considering de-duplicating source, then dest, then source+dest) then that's what you'll end up with if you just de-duplicate source+dest in one go.

Are you worried that the destination folder will have duplicates you don't know about, and that you'll keep one copy and delete the other when you wanted to do it the other way around? If so then a separate de-duplication of the destination is indeed needed (to avoid having to carefully look through the list of candidates for deletion). It's still not really needed for the source folder since you want to delete duplicates from that anyway. So at most you need to run a check on the dest, and then source+dest. But once the dest has been done once, you shouldn't need to do it again, unless something else is dumping files into the destination which could be duplicates. (But if something is doing that, you want to find out and delete them, if your aim is to delete the duplicates, surely? So that extra stage is either unavoidable, if you want what it guarantees, or irrelevant, if you don't. It seems to me at least.)

OK, you have already made the option I needed; I had missed the 'delete mode' checkbox. This is what I wanted and what resolves my problem (protection of the first file in each duplicates group). Thanks :slight_smile:

(But of course a hash-comparison option in the compare-two-folders script from above would still be welcome.)

I tried using some of the suggestions without any success. I have heaps of files that are duplicates for valid reasons. When I run the duplicate finder against my documents it finds lots, and many of them need to stay: different projects that share the same files, among many other reasons.
Unfortunately, none of the suggestions worked in a way that would make me comfortable performing the delete. The duplicate finder finds so many duplicates that I feel the need to go through them line by line, picking what should be kept. This is time consuming, hence I am looking for ways to speed it up.

Being able to identify files in folder A that are also in folder B would make this much faster and safer. It would be great as a first pass; after doing it I would still need to do some manual processing, but much of it would be done.
I have looked for an alternative application that can help speed this up, but most of them work similarly to Dopus (mostly with a worse UI than Dopus), finding any duplicates in a specific folder or set of folders. I only found one that would find files in folder A that are not in folder B; however, it threw a dialog for each match, which would also take a long time to process.

Does anyone know of a duplicate file finder that would suit my scenario?

Is this a feature that could be added to dopus?

I'm not clear why the Opus Duplicate Finder doesn't do what you want?

Yes, sorry if I have not been very clear.

I need to identify files that are in folder A and also in folder B, including sub-folders, but not files that are only in folder A or folder B multiple times. I know that any file in both folder A and folder B can be removed from folder A safely; I only want to delete files from folder A. The Dopus duplicate finder treats both folders as equals: if a file is in folder A or folder B multiple times, but not in both folders, it will currently also appear in the list.

Sorry if this is presumptuous, but I would see this working as follows:
In the Duplicate Finder panel, if there are multiple folders in the 'Find in' area, add an additional checkbox called something like "Identify only across folders" or "Exclude duplicates in same base folder". Checking this option would not identify duplicates that have the same base folder as defined in the 'Find in' area. I only suggest this to better explain the use case.
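To make the suggested option concrete, the filtering rule I have in mind could be sketched roughly like this in Python (the base folders are placeholders, and this is only meant to illustrate the behaviour, not how Opus would implement it):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Hypothetical 'Find in' base folders.
BASES = [Path(r"D:\Photos\Sorted"), Path(r"D:\Photos\Unsorted")]

def content_hash(path, chunk_size=1 << 20):
    """Return a content digest, read in chunks to handle large files."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Group every file under the base folders by content hash.
groups = defaultdict(list)
for base in BASES:
    for p in base.rglob("*"):
        if p.is_file():
            groups[content_hash(p)].append((base, p))

# "Exclude duplicates in same base folder": only report groups whose
# members come from more than one base folder.
for digest, members in groups.items():
    if len(members) > 1 and len({base for base, _ in members}) > 1:
        print(digest, [str(p) for _, p in members])
```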