Define "identical"

When moving files, if I choose Skip Identical, it uses a different definition of "identical" than the one I have in my head.
I would like it to skip files with the same hash. It seems to skip based on attributes instead, perhaps file creation date, last modified date, or something else. For me, I don't care if the attributes are different; if the data contained is the same, then I don't need it.

(If it matters, the files seem to have identical names and file sizes, but are not considered identical by whatever makes those decisions.)

Is there a way to define what identical means when moving files?

It's defined in the manual:

There's very little point comparing the file contents when copying files. It would take just as long (or longer) to read the contents of both files as it would to simply overwrite it.

As a professional photo organizer, I do a lot of duplicate removal. Yes, there are more considerations when de-duping photos compared to other types of files.
If there are several criteria you need to compare, and you can't make mistakes, I recommend Duplicate Cleaner Pro by Digital Volcano. It has a few different modes depending on what kind of files you are working with.

That isn't what I asked. My question was, "Is there a way to define what identical means when moving files?" Unless you are saying that I can change the manual and that will change how my installation works? I'm not sure how I would do that. Your attachment looks like it is from a website.

I'm not copying files; I am moving them.

I don't think we are talking about the same thing. Is there any additional information I can provide to clarify what I want to do?

Move/copy, it's the same thing. If you have to read both files to find out if they're the same or not, you may as well just replace the old ones.

One deletes the source; the other doesn't.

Are you saying that if I choose "replace" instead of "skip" it will do a hash check comparison? Or are you saying that my files aren't important enough for me to make the effort?

Unless your solution verifies the data is the same, it seems like I might be better off using "rename new" and then de-duping them afterwards.

Say it does a hash check:

  • The files are different. What do you do? You'll move the source and replace the destination. You end up with one copy of the new file.
  • The files are the same. What do you do? You delete the original without moving it. You end up with one copy of the new file.

Now say it doesn't do a hash check, and you simply say "yes" to overwrite the original file. You end up with one copy of the new file.

The outcome is the same. There's no point doing the hash check if the eventual outcome is that you always end up keeping the new file; you may as well just overwrite in the first instance.
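
If it helps to see the same argument spelled out, here is a rough sketch in Python of what a hash-checked move could look like (the helper and the paths are hypothetical; this is not how Opus implements it internally):

```python
import hashlib
import os
import shutil


def file_hash(path, chunk_size=1 << 20):
    """Hash a file's contents in 1 MB chunks (the whole file gets read)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def move_with_hash_check(src, dst):
    """Move src onto dst, skipping the copy when the contents already match."""
    if os.path.exists(dst) and file_hash(src) == file_hash(dst):
        # Contents identical: drop the source, keep the existing destination.
        os.remove(src)
    else:
        # Contents differ (or there is no destination yet): replace it.
        if os.path.exists(dst):
            os.remove(dst)
        shutil.move(src, dst)
```

Whichever branch runs, you finish with a single file at the destination holding the new data, which is exactly where a plain overwrite lands you without reading either file first.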

I keep them both, then copy the parent folder over instead.

Well, thanks. It seems like the answer to my question is no.

That's not what you originally said:

My answer was based on that. Anyway, to be explicit, no, you can't define what identical means.


I do appreciate you taking the time to help me out. I see that the answer is no. I suspected that, but it is good to have official confirmation.

The following is just for clarification. No need to respond unless you want to clarify something.

I fully admit that I have no idea what a hash check would actually even check. Maybe metadata like the creation date and modified date affect the hash, and so what I was asking for was pointless. If that is how it works, then I could see why there would be confusion.

Correct. If the actual data contained in the file is exactly the same, then I don't need two copies. I believe my answer was consistent with my original statement, since in your first example you said, "The files are different." Because you state they are different in that example, I keep both. In your second example you say the files are the same, and I didn't disagree with that one.

Generating a hash requires reading the entire file and performing a CPU-intensive algorithm on all of the data, for both the source and destination files. It is not a trivial operation in terms of time.
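
As a rough illustration (a Python sketch with hypothetical file names): answering "are these two identical?" by hash means streaming every byte of both files through the digest before anything is moved.

```python
import hashlib
import time


def sha256_of(path, chunk_size=1 << 20):
    """Stream the whole file through SHA-256; every byte has to be read."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


start = time.perf_counter()
# Both the source and the destination must be read in full just to decide.
identical = sha256_of("source.bin") == sha256_of("destination.bin")
print(f"identical: {identical} (decided in {time.perf_counter() - start:.1f}s)")
```

For a pair of large files, that is roughly as much I/O as the copy itself would have involved.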

Thanks for the reply. I searched, and it seems that a hash check would not be affected by the file attributes, since the hash is created from the binary data.
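
For example, a quick test along the lines of what I read (a Python sketch with made-up file names) shows that changing a file's timestamps doesn't change its hash, because only the bytes go into the digest:

```python
import hashlib
import os
import shutil


def sha256_of(path):
    """Hash only the file's bytes; attributes never enter the digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


shutil.copyfile("photo.jpg", "photo_copy.jpg")  # copy the contents only
os.utime("photo_copy.jpg", (0, 0))              # give the copy 1970 timestamps

# The hashes still match, even though the attributes now differ.
print(sha256_of("photo.jpg") == sha256_of("photo_copy.jpg"))  # True
```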

You are correct; the hash option in Opus for finding duplicates seems slower than other duplicate-finding software I use. I think I have used the feature in Opus twice. Speed isn't usually a problem for me, though, as I can just let it run in the background.