I record all of my live video chats for nostalgia reasons, and once, while doing a mass file download from one of my cloud drives, the process aborted. I retried the mass download to another location, and in fact several attempts aborted (I later found out my ISP was having issues) before one finally completed successfully. I also did a few mass downloads from other cloud drive accounts I have, each to a different location. I now had several directories, each with hundreds of files, coming from several cloud drives. I wanted to delete only the identical files, so I did a mass move from the many directories to one new location. Whenever a conflict occurred I chose Skip Identical rather than Skip All, because I thought Skip Identical was the more rigorous test, checking not only the filename but also the file size and timestamp. Several moves and deletes later I caught one conflict that had the same filename but a different file size and timestamp. Once I started checking conflicts manually, I saw several others.
I had deleted 22 GB of files that I thought were IDENTICAL, and I have now realized that they might just have shared a filename while having different file sizes or timestamps - meaning they were, in reality, different chat sessions. I now have no idea what I have lost forever, and I am not happy.
First I did a search on "sample", and the search came back with many hits. I then chose to move all of the found files to a new directory using the Skip Identical option, thinking that all the files left behind would be identical and could be deleted. Here is an example of me moving the files after the find search finished and choosing Skip Identical: the file on the left was NOT moved into the new directory, and would have been deleted, despite having a different file size and timestamp from the existing file and merely sharing the same filename.
I've seen many posts about the term 'identical', and about the differences between the replace dialog and unattended mode. Normally, files with the same date, time and size are the same; I agree with that.
But there are files that are identical, with the same size but different date/time, such as .pst (Outlook) and .xls* (Excel) files. The reason is that Microsoft updates the date/time when closing these files, even if no changes were made. So if you have an archive open in Outlook, it will get a new timestamp after Outlook closes.
In that case, when merging old backups into one place (which is what I'm currently doing), the big untouched .pst files (each several GB in size) get copied with a new name. Later I can find these duplicates with DOpus (thanks for that feature), but it would take less time and space if those files were detected as identical (same hash, but different date) and skipped during the copy.
I would find it useful to extend the definition of 'identical' to cover files with different date/time but the same size and the same hash. Two files with different sizes will always have different hashes, so only same-size files need hashing at all. Git uses SHA-1, and DOpus uses MD5, which should be fine too.
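To make the proposed test concrete, here is a minimal sketch of that extended 'identical' check (cheap size comparison first, then an MD5 hash, ignoring timestamps). This is just an illustration of the idea, not DOpus's actual implementation; the function name and chunk size are my own choices:

```python
import hashlib
import os

def files_identical(path_a, path_b, chunk_size=1 << 20):
    """Treat two files as identical if they have the same size and
    the same MD5 hash, ignoring date/time stamps entirely."""
    # Different sizes can never produce the same content, so this
    # cheap check avoids reading either file in the common case.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    digests = []
    for path in (path_a, path_b):
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            # Read in chunks so multi-GB .pst files don't fill RAM.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                md5.update(chunk)
        digests.append(md5.hexdigest())
    return digests[0] == digests[1]
```

With this definition, two .pst files with identical bytes but different timestamps would be skipped instead of copied under a new name.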
Hashing is costly, so perhaps you could add a threshold value below which no hash is computed, or a list of file extensions for which same-size files get hashed.
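Those two knobs could look something like this sketch. Both settings are hypothetical (they are my suggestion, not existing DOpus options), and the example values are arbitrary:

```python
import os

# Hypothetical settings, not real DOpus preferences: a size threshold
# below which hashing is skipped, and an opt-in list of extensions
# known to get a new timestamp on close without content changes.
HASH_THRESHOLD = 64 * 1024 * 1024          # only hash files >= 64 MB
HASH_EXTENSIONS = {".pst", ".xls", ".xlsx"}

def should_hash(path, size):
    """Decide whether a same-size conflict is worth the cost of hashing."""
    _, ext = os.path.splitext(path)
    return size >= HASH_THRESHOLD and ext.lower() in HASH_EXTENSIONS
```

Small files and unlisted extensions would fall back to the existing date/time/size test, so the extra cost is only paid where it pays off.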
I have only been using DOpus for a year, and I find it very useful, although I use only small parts of it. And THIS would be an improvement for me.
Thanks for the quick answer. I'm sure hashing takes no longer than copying for bigger files - but I have no proof yet. Hashing 4 KB files may be unnecessary, but if they are essentially the same, why copy them at all? Git works only with hashes for exactly that reason: less space.
Copying a file needs additional time to allocate space at the destination, and normal storage has faster read access than write access - even NVMe drives do. The write speed depends on the block size, the scattering of free blocks on HDDs, and so on.
If you want, I can try to prove it. But reducing traffic by not copying unwanted stuff is worth all the time spent on hashing. I think hashes are only necessary for same-size files (above a threshold size), and the user could opt in to that extra time (see the filter ideas above).
You would have to hash both files. That means reading the entirety of both source and destination files. It would only be faster if the destination is very slow at writing compared to how fast it can read. Even in that situation, you would still have to copy the file if the hashes turned out different. I don't think it would save you any time in any normal situation.
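The full-read cost is easy to see next to a direct comparison: a byte-for-byte compare can stop at the first differing chunk, whereas hashing must always read both files to the end before it can say anything. A rough sketch of the early-exit comparison (my own illustration, not DOpus code):

```python
import os

def same_content(path_a, path_b, chunk_size=1 << 20):
    """Byte-for-byte comparison that stops at the first difference.
    Hashing, by contrast, always reads both files in full."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False   # early exit: skip the rest of both files
            if not chunk_a:
                return True    # both files exhausted and equal
```

Python's standard library offers the same thing as `filecmp.cmp(a, b, shallow=False)`. Note this still reads both files completely when they really are identical, which is the case the feature request cares about, so it doesn't escape the cost described above.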
So I give up here - I merge many old backups into one place and never overwrite files. But then I have dozens of useless copies that I have to identify in a second step, copies that could have been ignored by redefining "identical" from date/time/size to hash when the dates differ. Which I still believe is the truer definition.
That may be a very special workflow, and I never had it before. I will continue the old way then - sorry for the idea!