Import external duplicate collection

apocalypse · May 15, 2020, 1:55pm

I've written a simple duplicate finder that enumerates two given directories with all their subfolders and files and computes the CRC32 of every file. It also looks inside archives and reads the CRC32 of each contained file from the archive headers, without extracting anything. Finally, it adds all those hashes and paths to duplicate dictionaries.
So far so good.
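
For readers who want to picture the scan described above, here is a minimal Python sketch. It is not the actual program: it assumes ZIP archives only (read with the standard `zipfile` module), it does not hash the archive files themselves, and the names `file_crc32`, `scan` and `duplicates` are made up for illustration.

```python
import os
import zlib
import zipfile
from collections import defaultdict


def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC32 of a file's contents, reading it in chunks."""
    crc = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def scan(roots):
    """Map CRC32 -> list of (container_path, member_name or None)."""
    index = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                if zipfile.is_zipfile(path):
                    # Assumption: archives are only inspected, not hashed.
                    # The CRC comes from the archive headers; nothing is extracted.
                    with zipfile.ZipFile(path) as zf:
                        for info in zf.infolist():
                            if not info.is_dir():
                                index[info.CRC].append((path, info.filename))
                else:
                    index[file_crc32(path)].append((path, None))
    return index


def duplicates(index):
    """Keep only the CRC32 values seen in more than one place."""
    return {crc: locs for crc, locs in index.items() if len(locs) > 1}
```

Calling `duplicates(scan([dir_a, dir_b]))` would give one entry per CRC32 that occurs at least twice, whether as a loose file or inside an archive.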
It then outputs in one of two modes:

Duplicate groups of the archives that contain files with matching CRC32.
Or the duplicate files themselves, as they are, both outside and inside those archives.
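
The two output modes could be sketched roughly as follows. This is only an illustration under assumptions: the input is a dictionary mapping a CRC32 to a list of `(container, member)` pairs, where `member` is `None` for a loose file; the first mode keeps a group only when at least two distinct archives are involved; and the `archive/member` path format is hypothetical.

```python
def archive_groups(dupes):
    """Mode 1: for each CRC32, the archives that hold a matching file."""
    groups = {}
    for crc, locs in dupes.items():
        archives = sorted({c for c, member in locs if member is not None})
        if len(archives) > 1:  # assumption: a group needs two or more archives
            groups[crc] = archives
    return groups


def file_groups(dupes):
    """Mode 2: for each CRC32, every matching file, loose or archived."""
    return {
        crc: [c if member is None else f"{c}/{member}" for c, member in locs]
        for crc, locs in dupes.items()
    }
```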

We can import normal file lists into an Opus collection with the dopusrt /col command, but as I see it, duplicate collections are a bit different.

From what I have tested, Opus refuses to read a new collection file (.col) placed in /dopusdata/Collections/ at runtime.
It also refuses to reload changes made to an existing duplicate collection.
I can generate the entire XML, but having to restart Opus to read the collection is a pain.

Importing as a normal collection and grouping afterwards wouldn't work: filenames can differ, and files of equal size can still have different CRC32 values. Adding a calculated CRC column would extract each archived file on the fly to compute its checksum, which is a no-go.

My request is for a method or command to refresh and re-read the collection files located in the /dopusdata/Collections directory without having to restart Opus.

Thank you.

Leo · May 15, 2020, 9:52pm

Duplicate collections are a bit more complicated than normal ones, and we don't currently have an official way to import them (nor export them beyond a simple list of files). (Other than what you're already doing, which is effectively copying the raw config files and restarting.)

I don't think it would be a good idea to have external code generate the raw XML, since the details of that could change on our side in the future. But maybe a way to import duplicate collections could be added. It needs some thought to work out how the input and output data would look and which fields would need to be included.

Jon · May 16, 2020, 10:04pm

In the next update we'll add a proper way to import a "duplicates-style" collection using dopusrt /col.

apocalypse · May 16, 2020, 10:06pm

Thanks a lot, both of you. Looking forward to it.

apocalypse · May 24, 2020, 2:53pm

I've tested the duplicate collection import in the new system and I like it so far.
I'll post the file-and-archive duplicate scanner that uses it on GitHub once I clean up the code.

However, a few new issues in Opus came up while using the new functionality.

There is no preset folder format for duplicate collections, so we have to either apply a preset to a static duplicate collection name or apply one every time we generate or navigate to a new one.
When the collection is used to diff archives and you delete many files from the same archive from within the duplicate collection, rather than from within the archive itself, the delete is not executed once for all of those files. When browsing the archive directly, a delete generates a list and runs WinRAR only once. From the duplicate collection, the files are handled sequentially, each one as a standalone delete followed by recreating the archive. That takes far more time, CPU cycles and disk writes, depending on the size and compression of the archive and the number of files being deleted.
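
To illustrate why batching matters here, the sketch below groups the selected files by archive and rewrites each archive once. It says nothing about how Opus or WinRAR work internally; it assumes ZIP archives handled with Python's `zipfile` module (which has no in-place delete, so the archive is rewritten without the unwanted members), and `group_by_archive` and `delete_members` are hypothetical helper names.

```python
import os
import tempfile
import zipfile
from collections import defaultdict


def group_by_archive(targets):
    """Turn (archive, member) pairs into {archive: set of members}."""
    grouped = defaultdict(set)
    for archive, member in targets:
        grouped[archive].add(member)
    return grouped


def delete_members(archive, members):
    """Rewrite the archive a single time, leaving out the given members."""
    folder = os.path.dirname(archive) or "."
    fd, tmp = tempfile.mkstemp(suffix=".zip", dir=folder)
    os.close(fd)
    with zipfile.ZipFile(archive) as src, zipfile.ZipFile(tmp, "w") as dst:
        for info in src.infolist():
            if info.filename not in members:
                # Reusing the ZipInfo keeps each entry's compression method.
                dst.writestr(info, src.read(info))
    os.replace(tmp, archive)


def delete_batched(targets):
    """One rewrite per archive, however many files are selected in it."""
    for archive, members in group_by_archive(targets).items():
        delete_members(archive, members)
```

Deleting N files from one archive this way costs one rewrite, whereas handling them one by one costs N rewrites of nearly the whole archive.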