Duplicate Music File Finder

JeffReeder · July 21, 2012, 6:30am

I wanted to know what would be involved in somehow enhancing (or replicating) the Duplicate File Finder feature. I would like to do one of the following:

Link into the existing DFF feature so I can alter the "duplicate criteria" logic to use ID3 tags for duplicate detection, or
Write a full-on plug-in that uses the tag SDK support so I can write my own DFF that will scan multiple (huge) folders of MP3s and compare their ID3 tags for duplicate detection, and auto-select the lower-quality files for deletion.

I have about six folders (150+ GB) of MP3s with thousands of duplicates, but many of the duplicates are different-quality copies of the same songs from the same CDs. I want to automate the process of auto-tagging the lower-quality files rather than having to manually tag all of them one at a time, which could take days.
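As a rough illustration of the selection rule described above (group files whose tags match, keep the best copy, mark the rest), here is a minimal Python sketch. It is not Opus code: the `Track` record and `select_lower_quality` function are hypothetical names, the tags are assumed to have been read already, and bitrate is assumed to be the measure of quality.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    """Hypothetical record holding tags that were already read from a file."""
    path: str
    artist: str
    album: str
    title: str
    bitrate_kbps: int


def select_lower_quality(tracks):
    """Return the paths of every duplicate except the highest-bitrate copy."""
    groups = defaultdict(list)
    for track in tracks:
        # Assumption: two files are duplicates when artist, album and title
        # match, ignoring case.
        key = (track.artist.casefold(), track.album.casefold(), track.title.casefold())
        groups[key].append(track)

    selected = []
    for group in groups.values():
        if len(group) < 2:
            continue  # no duplicate for this song
        best = max(group, key=lambda t: t.bitrate_kbps)
        selected.extend(t.path for t in group if t is not best)
    return selected
```

On a tie in bitrate, this sketch keeps whichever copy was seen first; a real tool would probably want a further tie-breaker such as file date or folder priority.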

Anybody know how capable DOpus' SDK / plug-in architecture is to accomplish this?

Ideally I'd love to be able to hook into the Duplicate detection algorithm with a custom callback to determine duplicate status. If this kind of feature existed it could greatly expand the DFF capabilities with duplicate detector plugins based on file types, which could be a very useful enhancement.

Leo · July 21, 2012, 7:35am

At the moment there are only two types of plugins, Viewers (which can include extracting metadata and thumbnails as well as actually viewing files) and Virtual File Systems (for archives and similar).

We don't have a plugin API for extending the Duplicate Finder, but we could consider adding one if you have a proposed interface that you think would let you accomplish what you need.

Are you planning to do the ID3 tag parsing yourself or would you want Opus to pass you the metadata as well as the list of files?

JeffReeder · July 21, 2012, 11:20am

Before getting into any API concepts, I think the first thing to address is the UI itself - how would such a feature be integrated into the existing DOpus DFF UI? Currently you have a "Comparison method" of:

Filename only
Filename & size
MD5 checksum

I think there's room for a fourth option, which I envision as a combo box - labeled something like "Data Type", with entries like Music, Movies, Databases, or whatnot. The names populated in this combo box could be derived from DFF plugins with an exposed UI_Name property, along with a GetProcAddress() address and an HMODULE handle to the plugin DLL in question - it's not hard to build an array of that info after parsing the relevant plugin DLLs (with a suitable scanning mechanism).

A DFF-specific plugin could also export a function to define the file extensions that it cares to work with (e.g., ".mp3|mp4|ape|wmv", etc.). Only those extensions would be considered when that particular "Data Type" is selected for duplicate file finding.

As for tag processing, a plugin can surely add its own tag processing, but I don't think that makes a lot of sense from a continuity perspective. DOpus has a very large amount of tag decoding capability and if a DFF-style plugin used different criteria than what DOpus leveraged, it might lead to confusion. As such, it might be beneficial to expose a DVPFileInfoMusic-style object to the file comparator callback, or at the very least some kind of XML-like hierarchical data structure that contains the tags in question. By doing so the plugin will utilize the same decoding logic as DOpus, and can be expanded upon if and when DOpus adds more tag support (via an XML-like concept - not via a static data structure).

As for a DFF API, I'll have to think on this a bit now that I've had a chance to look at the SDK .H headers, but at first blush I'd think it would be advantageous to be able to add a handler into a DFF "Custom Filter Types" handler array (to populate the "Data Types" combo box), as well as a callback mechanism for comparing two files, which would require a file information object to be passed for each file in question, along with any metadata. This could be a list of files, or a simple dual-file comparison callback - depending on how the internal DFF code is structured.
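To make the shape of that idea concrete, here is a minimal sketch of a plugin record with a display name, an extension filter and a dual-file comparison callback. It is written in Python for brevity, whereas a real plugin would be a DLL exporting C functions. Every name here (`DuplicatePlugin`, `ui_name`, `extensions`, `is_duplicate`, `find_duplicate_pairs`) is hypothetical, and nothing below exists in the Opus SDK.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Mapping, Tuple

# Assumption: the host hands each plugin the decoded tags as a simple mapping.
Metadata = Mapping[str, str]


@dataclass(frozen=True)
class DuplicatePlugin:
    """Hypothetical description of one entry in the "Data Type" combo box."""
    ui_name: str                      # label shown to the user, e.g. "Music"
    extensions: FrozenSet[str]        # lower-case extensions without the dot
    is_duplicate: Callable[[Metadata, Metadata], bool]  # dual-file callback

    def handles(self, filename: str) -> bool:
        ext = os.path.splitext(filename)[1].lstrip(".").lower()
        return ext in self.extensions


def find_duplicate_pairs(plugin: DuplicatePlugin,
                         files: Dict[str, Metadata]) -> List[Tuple[str, str]]:
    """Host-side loop: filter by extension, then ask the plugin about each pair."""
    candidates = [(name, meta) for name, meta in files.items() if plugin.handles(name)]
    pairs = []
    for i, (name_a, meta_a) in enumerate(candidates):
        for name_b, meta_b in candidates[i + 1:]:
            if plugin.is_duplicate(meta_a, meta_b):
                pairs.append((name_a, name_b))
    return pairs
```

The pairwise loop is quadratic in the number of files, which would be slow on a very large collection. A list-based callback, where the plugin receives all candidate files at once and returns groups, would let the plugin hash or sort on its own key instead.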

In addition, it would be necessary to hook into the "Select" button so you could auto-tag all files in the DFF list based on various plugin-defined criteria (e.g., "Select lower quality, identical song titles"). The specific capabilities here are plugin-specific. This could require turning the "Select" button into a combo-box button like you see in the Toolbars.

It might also be necessary to make the "Operations" toolbar, or other similar toolbars.

There could also be a need to tap into the context menu when right-clicking on files.

The commercial software market for duplicate file finders is growing. DOpus could add this capability by some (theoretically) straightforward enhancements to its existing API infrastructure to handle things like duplicate music file removal (my example), removal of text files that are, say, 99% similar but older (another plugin), photos that might have the same filename but different dimensions (and are older), and other concepts I haven't thought of.

By offering extended plugin architectures, Third Party Developers could theoretically add a number of plugins for DOpus that GPSoft doesn't have the bandwidth to write. Lord knows, there's a lot of engineers out there with good ideas, and Directory Opus provides a rich framework for adding vertical market add-ons. It just requires some code exposure (and a bit of work).

Anything that adds value to DOpus' marketability has my vote!

Jeff Reeder

Leo · July 21, 2012, 11:31am

Thanks, lots to think about there.

One issue is that we've put a lot of effort into the Viewer and VFS plugin APIs, but hardly anyone has actually used them. (Ignoring my own plugins, there's a handful of third-party viewer plugins and zero third-party VFS plugins.) I'd be surprised if a lot of developers made duplicate-finder plugins. But if we can provide a simple enough API then it might make sense.

On the other hand, it might make sense for us to implement the functionality ourselves, if Opus is going to be doing the tag parsing etc. itself anyway. Exposing, documenting and maintaining a plugin API for just "are these tag strings the same, and which files should be selected for deletion" is probably more work than implementing that logic ourselves, at least if the requirements are well-defined.

Deciding that two music files are identical from their tags is a fuzzy problem, of course. (Unless the tags are very well-maintained, with no punctuation differences etc.)
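A minimal sketch of what such a fuzzy comparison could look like, using only Python's standard library: normalise away case, punctuation and spacing, then accept near-matches above a similarity threshold. The function names and the 0.9 threshold are assumptions chosen for illustration, not anything Opus implements.

```python
import re
from difflib import SequenceMatcher


def normalize_tag(value: str) -> str:
    """Lower-case a tag, drop punctuation and collapse runs of whitespace."""
    value = value.casefold()
    value = re.sub(r"[^\w\s]|_", "", value)   # remove punctuation and underscores
    return re.sub(r"\s+", " ", value).strip()


def tags_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two tag strings as the same if they are nearly identical.

    The threshold is an arbitrary assumption; a lower value catches more
    misspellings but also produces more false positives.
    """
    na, nb = normalize_tag(a), normalize_tag(b)
    if na == nb:
        return True
    return SequenceMatcher(None, na, nb).ratio() >= threshold
```

With these definitions, "Don't Stop Me Now" and "dont stop me now!" normalise to the same string, while a one-letter misspelling such as "Bohemian Rapsody" is still caught by the similarity ratio.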

andersonnnunes · March 17, 2017, 12:16am

I also wondered if Opus could be used to find duplicated music files. My goal is a bit different from the OP's, as I want to prune recordings repeated on multiple albums; it is not a matter of pruning the lower-quality ones from the same album.

I just finished a sync of my music collection's tags with MusicBrainz's database. The title names should be all standardized - except for those odd ones that need to be fixed (for these a fuzzy comparison would be the way to go but they will eventually be fixed on the central database and the correct tags will flow down to mine).

So I was thinking that searching for duplicate title names would be enough, as I could manually compare the albums they are from and select one file to keep. The comparison criteria are going to be subjective, so they won't lend themselves to well-defined requirements.

The duplicate finder tool does not support metadata tags, only filename, size and hash. Now that Opus has full scripting support, why not add support for custom fields as the search criteria? It would probably not be as powerful as a plugin, but would be enough for some use cases, like mine.

Maybe the search criteria could be based on the available columns (including those defined by script add-ins), and the deletion criteria could be specified as a new event tied somehow to the column's attributes?
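To sketch what column-based criteria might mean in practice: the user picks one or more columns, and files with identical values in all of them are grouped for manual review. The Python below is only an illustration. The `duplicate_groups` function and the row layout (a dict per file with a `path` key plus column values) are hypothetical and not part of the Opus scripting interface.

```python
from collections import defaultdict


def duplicate_groups(rows, columns):
    """Group file paths whose values match in every chosen column.

    rows    -- one dict per file: a "path" key plus column values
    columns -- names of the columns to compare, e.g. ["title", "artist"]
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(str(row.get(col, "")).strip().casefold() for col in columns)
        if not all(key):
            continue  # skip files with an empty value, so untagged files don't all match
        groups[key].append(row["path"])
    # Only groups with more than one file are duplicates.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Nothing is selected for deletion here; each returned group is simply a set of candidates to compare by hand, which suits criteria that are subjective.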

In case it would be too much duplicated effort to maintain both plugin and script interfaces to the duplicate finder, I guess the plugin route is the way to go, even if more complicated for most users.

I don't have many duplicates and they don't bother me that much, so finding them is not that much of a priority. I could also find them in some other way. So that feature is not a high priority for me.