Find Duplicates - Different Names - Same Size - No MD5 hashing

About:

I needed to find and eliminate duplicates within a folder of ~10k files.
Files with different names could have the same file size and be duplicates. Most of them were also HUGE, so the chance of two unrelated files having exactly the same size (a false duplicate) is practically nonexistent.

Opus's own duplicate finder can only match files with mismatched filenames by doing an MD5 content compare, which would take forever, so I rolled a simplistic script function.

My plan was to enumerate any duplicates and add them to a coll://collection.
Even disregarding the total lack of optimisation in the enumerating phase, everything went pretty fast until I hit a wall with the adding-to-collection part.

Attempt #1:

cmd.RunCommand("Copy COPYTOCOLL=member FILE TO coll://collection")

This ran inside the loop. It took 3 minutes to populate the collection with the 1300 duplicates found.

I might completely suck at reading docs but I couldn't find a way to manipulate a specific collection within a script as an object directly without using the DOpus.Command interface.
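
One thing that might be worth trying before giving up on the Command interface, as an untested sketch: queue every file on the Command object first and run Copy only once, instead of once per file. It assumes `cmd` is an Opus Command object (such as clickData.func.command) and that its ClearFiles and AddFile methods plus a single Copy COPYTOCOLL call behave as the scripting docs describe. `addAllToCollection` is a made-up helper name, and whether this is actually faster than the per-file loop is an assumption, not something measured here.

```javascript
// Queue all paths on the Command object, then run Copy once for the batch.
// "cmd" is assumed to be an Opus Command object; addAllToCollection is a
// hypothetical helper name, not part of the Opus API.
function addAllToCollection(cmd, paths, collectionName) {
	// Drop any files already attached to the command (e.g. the current selection)
	cmd.ClearFiles();

	for (var i = 0; i < paths.length; i++) {
		cmd.AddFile(paths[i]);
	}

	// One command for the whole batch instead of one per file
	if (paths.length > 0) {
		cmd.RunCommand("Copy COPYTOCOLL=member TO \"coll://" + collectionName + "/\"");
	}
	return paths.length;
}
```

Inside OnClick this would be called as addAllToCollection(cmd, listOfPaths, "DuplicateSizes") after the grouping loop.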

Attempt #2:

So next I thought of writing the found dupes to a text file and importing that file as a collection with a single command, since adding to the collection was the bottleneck.
Writing the list was simple enough, but then I hit another wall: I cannot invoke

 /col import /clear "tempfile"

from within Opus, as it is not an internal command but an external dopusrt one.

Finally:

Using another ActiveX scripting object (WScript.Shell), I got the dopusrt path from the registry and ran it as an external command. Now the entire thing is near-instant.
As a bonus, it tells Opus to navigate to the new collection and sort it by size.
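
For anyone adapting this, here is the command-line assembly pulled out into a small helper, as a sketch only. The switches are exactly the ones used in the script in this post; `quoteArg` and `buildImportCommand` are made-up helper names, and the paths in the usage note are placeholders. The collection name is left unquoted as in the original script, so names containing spaces are untested.

```javascript
// Wrap a value in double quotes for a Windows command line.
// (Hypothetical helper; assumes the value itself contains no double quotes.)
function quoteArg(value) {
	return "\"" + value + "\"";
}

// Build the dopusrt command line that imports a list file into a collection.
// The switches are the ones used by the script in this post.
function buildImportCommand(dopusrtPath, collectionName, listFile) {
	return quoteArg(dopusrtPath) +
		" /col import /clear /create /nocheck " +
		collectionName + " " + quoteArg(listFile);
}
```

In the script, the result would be passed straight to shell.run, for example buildImportCommand(dopusPath, "DuplicateSizes", collectionDump).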

Finally as I'm mostly a C# guy - any ideas on optimizing the dumb duplicate finding logic are welcome.
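
As a starting point for that, here is the grouping step isolated from the Opus and file-system calls, written as a plain function so it can be tried outside Opus. It is a sketch under assumptions: `files` is a hypothetical plain array of { path, size } records (in the real script these values come from the tab's file enumerator), and `findSizeDuplicates` is a made-up name. It is still a single pass over the files, so it is more of a tidy-up than a speed-up.

```javascript
// Group file records by size and return only the groups with 2+ members.
// "files" is a hypothetical array of { path: string, size: number } records.
function findSizeDuplicates(files) {
	var bySize = {};   // size (as a string key) -> array of paths
	var i, key;

	for (i = 0; i < files.length; i++) {
		key = String(files[i].size);
		if (!Object.prototype.hasOwnProperty.call(bySize, key)) {
			bySize[key] = [];
		}
		bySize[key].push(files[i].path);
	}

	// Keep only sizes that are shared by more than one file
	var groups = [];
	for (key in bySize) {
		if (Object.prototype.hasOwnProperty.call(bySize, key) && bySize[key].length > 1) {
			groups.push(bySize[key]);
		}
	}
	return groups;
}
```

Each returned group is an array of paths sharing one size, so writing the list file becomes a simple nested loop over the result.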

Have fun with it.

Get the Script:

Raw Javascript
function OnClick(clickData)
{
	DOpus.ClearOutput();
	var cmd = clickData.func.command;
	var enumFiles = new Enumerator(clickData.func.sourcetab.files);
	enumFiles.moveFirst();

	DOpus.Output("Enumerating files in: " + String(clickData.func.sourcetab.path));

	// Map from file size (as a string) to an array of full paths with that size
	var x = {};

	while (!enumFiles.atEnd()) {
		var file = enumFiles.item();
		var index = String(file.size);
		var filename = String(file.realpath);

		// Group every file under its size in bytes
		if (x.hasOwnProperty(index)) {
			x[index].push(filename);
		} else {
			x[index] = [filename];
		}

		enumFiles.moveNext();
	}

	var fso = new ActiveXObject("Scripting.FileSystemObject");
	var tfolder = fso.GetSpecialFolder(2); //TemporaryFolder = 2
	var tname = fso.GetTempName();
	var tfile = tfolder.CreateTextFile(tname, true, true); //Overwrite flag & make it unicode

	var duplicateGroups = 0;
	
	//Write all files from groups with more than one member to the temp list file
	for (var k in x) {
		if(x[k].length > 1) {
			duplicateGroups++;
			for(var i = 0; i < x[k].length; i++) {
				//cmd.RunCommand("Copy COPYTOCOLL=member \"" +x[k][i]+ "\" TO \"coll://DuplicateSizes/\""); // <-- awfully slow
				tfile.writeline(x[k][i])
			}
			
		}
	}

	tfile.close();
	var collectionDump = tfolder + "\\" + tname;
	DOpus.Output("Dumped " +duplicateGroups+" duplicate groups to: " + collectionDump);

	//Shell object to get dopus path from registry and run the dopusrt command
	var shell = new ActiveXObject("WScript.shell");

	//Find dopus in registry
	var dopusPath = shell.RegRead("HKLM\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\App Paths\\DOpus.exe\\Path") + "\\dopusrt.exe";
	DOpus.Output("Dopusrt found at: " + dopusPath);

	//Run dopusrt command
	shell.run("\"" + dopusPath + "\" /col import /clear /create /nocheck DuplicateSizes \"" + collectionDump + "\"")
	DOpus.Output("Importing to collection: DuplicateSizes");

	//Navigate to the collection
	cmd.RunCommand("GO path=coll://DuplicateSizes/");

	//Sort by Size
	cmd.RunCommand("SET SORTBY=size");
}
Button Code
<?xml version="1.0"?>
<button backcol="none" display="both" label_pos="right" textcol="none">
	<label>File Size Duplicates</label>
	<tip>Dump all files in current view - having size duplicates to a Collection</tip>
	<icon1>#dupepane</icon1>
	<function type="script">
		<instruction>@script JScript</instruction>
		<instruction>function OnClick(clickData)</instruction>
		<instruction>{</instruction>
		<instruction>	DOpus.ClearOutput();</instruction>
		<instruction>	var cmd = clickData.func.command;</instruction>
		<instruction>	var enumFiles = new Enumerator(clickData.func.sourcetab.files);</instruction>
		<instruction>	enumFiles.moveFirst();</instruction>
		<instruction />
		<instruction>	DOpus.Output(&quot;Enumerating files in: &quot; + String(clickData.func.sourcetab.path));</instruction>
		<instruction />
		<instruction>	// Map from file size (as a string) to an array of full paths with that size</instruction>
		<instruction>	var x = {};</instruction>
		<instruction />
		<instruction>	while (!enumFiles.atEnd()) {</instruction>
		<instruction>		var file = enumFiles.item();</instruction>
		<instruction>		var index = String(file.size);</instruction>
		<instruction>		var filename = String(file.realpath);</instruction>
		<instruction />
		<instruction>		// Group every file under its size in bytes</instruction>
		<instruction>		if (x.hasOwnProperty(index)) {</instruction>
		<instruction>			x[index].push(filename);</instruction>
		<instruction>		} else {</instruction>
		<instruction>			x[index] = [filename];</instruction>
		<instruction>		}</instruction>
		<instruction />
		<instruction>		enumFiles.moveNext();</instruction>
		<instruction>	}</instruction>
		<instruction />
		<instruction>	var fso = new ActiveXObject(&quot;Scripting.FileSystemObject&quot;);</instruction>
		<instruction>	var tfolder = fso.GetSpecialFolder(2); //TemporaryFolder = 2</instruction>
		<instruction>	var tname = fso.GetTempName();</instruction>
		<instruction>	var tfile = tfolder.CreateTextFile(tname, true, true); //Overwrite flag &amp; make it unicode</instruction>
		<instruction />
		<instruction>	var duplicateGroups = 0;</instruction>
		<instruction>	</instruction>
		<instruction>	//Write all files from groups with more than one member to the temp list file</instruction>
		<instruction>	for (var k in x) {</instruction>
		<instruction>		if(x[k].length &gt; 1) {</instruction>
		<instruction>			duplicateGroups++;</instruction>
		<instruction>			for(var i = 0; i &lt; x[k].length; i++) {</instruction>
		<instruction>				//cmd.RunCommand(&quot;Copy COPYTOCOLL=member \&quot;&quot; +x[k][i]+ &quot;\&quot; TO \&quot;coll://DuplicateSizes/\&quot;&quot;); // &lt;-- awfully slow</instruction>
		<instruction>				tfile.writeline(x[k][i])</instruction>
		<instruction>			}</instruction>
		<instruction>			</instruction>
		<instruction>		}</instruction>
		<instruction>	}</instruction>
		<instruction />
		<instruction>	tfile.close();</instruction>
		<instruction>	var collectionDump = tfolder + &quot;\\&quot; + tname;</instruction>
		<instruction>	DOpus.Output(&quot;Dumped &quot; +duplicateGroups+&quot; duplicate groups to: &quot; + collectionDump);</instruction>
		<instruction />
		<instruction>	//Shell object to get dopus path from registry and run the dopusrt command</instruction>
		<instruction>	var shell = new ActiveXObject(&quot;WScript.shell&quot;);</instruction>
		<instruction />
		<instruction>	//Find dopus in registry</instruction>
		<instruction>	var dopusPath = shell.RegRead(&quot;HKLM\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\App Paths\\DOpus.exe\\Path&quot;) + &quot;\\dopusrt.exe&quot;;</instruction>
		<instruction>	DOpus.Output(&quot;Dopusrt found at: &quot; + dopusPath);</instruction>
		<instruction />
		<instruction>	//Run dopusrt command</instruction>
		<instruction>	shell.run(&quot;\&quot;&quot; + dopusPath + &quot;\&quot; /col import /clear /create /nocheck DuplicateSizes \&quot;&quot; + collectionDump + &quot;\&quot;&quot;)</instruction>
		<instruction>	DOpus.Output(&quot;Importing to collection: DuplicateSizes&quot;);</instruction>
		<instruction />
		<instruction>	//Navigate to the collection</instruction>
		<instruction>	cmd.RunCommand(&quot;GO path=coll://DuplicateSizes/&quot;);</instruction>
		<instruction />
		<instruction>	//Sort by Size</instruction>
		<instruction>	cmd.RunCommand(&quot;SET SORTBY=size&quot;);</instruction>
		<instruction>}</instruction>
	</function>
</button>

You could set the MD5 percentage slider really low so that it only considers a small part of the data and runs faster.

(Also note that, with the built-in duplicate finder, MD5s are only calculated when there are at least two files with the same size.)

With MD5 at 1%, it took 3 minutes to find 10 duplicates out of 1000+, versus ~1 sec for the script above.
It finished a little later and had only found 22 duplicates, which made no sense at first.
It turned out that most of the duplicates were slightly different in content but had the same file size to the last byte.
That works for me, as I know they are dupes. It might just be metadata differences.
I suppose it's just a different use case altogether.

In the next update we'll add a "size only" option for the duplicate finder.

Yes, that would be nice indeed!
Most duplicate file finders compare hashes and have no option to skip them and check only sizes (or sizes plus date-time stamps) instead. In some cases, though, hashes may be 'too strict'.

For example: one might have 2 PDF files that have identical contents and identical sizes but, for one reason or another, different hashes.

For me, as a user, it is clear the files are duplicates, but a duplicate file finder is of no use in this scenario.
Files are either duplicates or they are not. These tools do not have an option for, let's say, "Possible duplicates" based on size or size+date.

Not sure how the current MD5 optimisations work. But being able to only calculate the MD5 for files with the same size would be a nice way of optimising the MD5 search.

This does not resolve @opw62's scenario, where files have the same size but different MD5s, though I don't think I have ever run into that myself.

That already happens automatically.

I should have guessed that was already covered. Is it only size, or size and date?

Reading @apocalypse's original post, which said the hash compare was too slow: will the size compare actually speed things up?

  • Size compare will say it's a dupe if the size is the same (it's fast).
  • MD5 will say it's a dupe only if the size is the same and the MD5, which it then calculates and compares, also matches (slow for files with the same size).

@apocalypse mentioned that the files are so large there will be no false duplicates. I would assume, then, that almost none of the files would be exactly the same size, so almost no files would even need an MD5.
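
The two-stage idea in the bullets above can be sketched roughly as follows. This is an illustration of the general technique, not how the built-in duplicate finder is implemented. `hashOf` is a hypothetical callback that returns a content hash for a path (MD5 or anything else), `files` is a hypothetical array of { path, size } records, and `findConfirmedDuplicates` is a made-up name. The point is that the expensive hash is only ever requested for files whose size matches at least one other file.

```javascript
// Two-stage duplicate check: group by size first, then hash only inside
// groups with 2+ members. hashOf(path) is a hypothetical callback that
// returns a content hash string for the file at "path".
function findConfirmedDuplicates(files, hashOf) {
	var has = Object.prototype.hasOwnProperty;
	var bySize = {};
	var i, key;

	// Stage 1: cheap grouping by size
	for (i = 0; i < files.length; i++) {
		key = String(files[i].size);
		if (!has.call(bySize, key)) {
			bySize[key] = [];
		}
		bySize[key].push(files[i].path);
	}

	// Stage 2: hash only files that share their size with another file
	var confirmed = [];
	for (key in bySize) {
		if (!has.call(bySize, key) || bySize[key].length < 2) {
			continue;   // unique size: never hashed
		}
		var byHash = {};
		var paths = bySize[key];
		for (i = 0; i < paths.length; i++) {
			var h = String(hashOf(paths[i]));
			if (!has.call(byHash, h)) {
				byHash[h] = [];
			}
			byHash[h].push(paths[i]);
		}
		for (var hk in byHash) {
			if (has.call(byHash, hk) && byHash[hk].length > 1) {
				confirmed.push(byHash[hk]);
			}
		}
	}
	return confirmed;
}
```

When nearly every file has a unique size, stage 2 does almost no work. When many same-sized files differ only slightly, as reported earlier in the thread, every one of them still has to be hashed, which is where a size-only mode wins.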

Can you please explain, for a newbie like me, how I can use this script? Where do I put the JavaScript code, where do I make the button, and does the button trigger the JavaScript code?

Thanks for any clues.

Easiest method would be with the button code:

  1. Copy XML button code from above.
  2. Right click on your Dopus toolbar where you want to add the button, and click Customize.
  3. Right click again while in customize mode and you should have a Paste option in the context menu (if you copied the XML code properly).
  4. Pasting adds it as a button.

If you'd rather make your own button:

  1. Right click on your Dopus toolbar where you want to add the button, and click Customize.
  2. Right click again and select New > Button.
  3. Right click on button and edit it.
  4. Click Advanced next to OK
  5. Select Script Function from Function dropdown
  6. Paste the JS code
  7. Name your button, label, hotkey etc
  8. OK on all dialogs and that's it.

Good luck.

Edit: I'll make sure to bookmark Leo's exhaustive post and link that if needed from now on.

Apocalypse's reply (above this post) should have what you need, but there's also this guide which covers all the different ways a command or script might be shared on the forum, which may be useful later on:
