How to apply rename script to Unicode zero-width characters?

Hi Leo, I know this is an old thread. However...

DOpus makes my life immeasurably easier. THANK YOU. However, I've become so used to wrangling the files on our 43TB SAN with the ease provided by your efforts that when things don't go as planned, I get frustrated.

I routinely encounter external files with the Unicode character U+0097 in file names (the C1 control character "End of Guarded Area") and U+0096 ("Start of Guarded Area"). It took me forever just to figure out why my scripted renames were failing, then even longer to identify the offending character [see note 1 below]. I finally discovered the pain of invisible (zero-width) Unicode characters, but now I'm stuck on how to fix these filenames efficiently using DOpus renaming. I encounter dozens of these files daily.
I wish simply fixing the filenaming problem at the source was an option. It isn't.

To answer your question: the invisible characters, which DOpus is (I think) ignoring or not displaying, cause a problem because I can't get a regex, in either scripts or the rename dialog, to reference the invisible character and simply delete it. My rename scripts use the regex syntax \x{0097}, as shown here:

' Remove zero-length spaces
regex.Pattern = "\x{0097}"
strNameOnly = regex.Replace(strNameOnly, "")

(FWIW, I also tried \u{0097})

The regex behavior I anticipated is that the zero-length invisible UTF-8 control character will be replaced with Null (that is, deleted). Nope. That particular regex in the script does nothing although the rest of the script works as expected. I have also tried a GUI rename in DOpus using:

RenameType: Regular Expressions Find and Replace
Oldname: \u{0097}
Newname:

I get an "Error in Search Pattern. Trailing Backslash" (I also tried it without the escape; then it does nothing when run).

I have banged my head on this for weeks and have digested more than I ever wanted to know about Unicode non-printing characters, including several very complex explanations on StackOverflow, SuperUser, and the Notepad++ forums, among many others, all of which in the end only served to confuse me (e.g., "How do I find this character(by unicode search) in notepad++ ﻁ (\uFEC1 and only that character)" on Super User, and "Visualization for zero-width characters" on Notepad++ Community). Many of the posts mention in great detail that this is a known and persistent headache, but the solutions I have tried have failed. I'm not a programmer and so the Perl/Bash/etc. scripts I've seen elsewhere are pretty far above my paygrade. I am pinning my hopes on DOpus, once again, to come to the rescue.

At this point I have only been able to weed out the offending files by running the rename script on a batch of files I am cleaning up, visually hunting for "failed" renames, copying the filenames using DOpus (Copy -> Other -> Short Names Only), looking for the offending control characters, and renaming the files manually. A big burden for an otherwise easy task, but at least I'm halfway there.

Can you help? Can anyone who has solved this? Surely I am not alone. There must be a solution I'm overlooking.

Scott

NOTE 1 (for those that are struggling with this issue too)

Unicode control characters can appear in a filename but are not normally displayed by the Windows and macOS operating systems. DOpus also ignores these characters in both display and filename handling.

To locate the offending character I used the terrific text wrangling program Notepad++. I used DOpus to copy the actual filenames, then pasted that list into a blank Notepad++ text document. I then selected View->Show Symbol->Show All Symbols to display the (normally) invisible UTF-8 control characters.

Thus, the filenames normally displayed in DOpus appear as:
[image]

but are actually:
[image]

The "SPA" and the "EPA" characters are where my issue lies.
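For anyone who would rather script the hunt than eyeball it in Notepad++, here is a minimal sketch in plain JavaScript. It is not a DOpus script: `findInvisibleChars` is a hypothetical helper name, and the set of code points it flags (C0 and C1 controls, plus a few zero-width characters) is an assumption about what counts as "invisible", not an exhaustive list.

```javascript
// Sketch: report invisible control/format characters in a filename.
// findInvisibleChars is a hypothetical helper, not part of DOpus.
function findInvisibleChars(name) {
  var found = [];
  for (var i = 0; i < name.length; i++) {
    var code = name.charCodeAt(i);
    var isC0 = code < 0x20 || code === 0x7f;   // C0 controls and DEL
    var isC1 = code >= 0x80 && code <= 0x9f;   // C1 controls, incl. U+0096 and U+0097
    var isZeroWidth =
      (code >= 0x200b && code <= 0x200f) ||    // ZWSP, ZWNJ, ZWJ, LRM, RLM
      code === 0xfeff;                         // zero-width no-break space / BOM
    if (isC0 || isC1 || isZeroWidth) {
      var hex = code.toString(16).toUpperCase();
      while (hex.length < 4) hex = "0" + hex;  // pad to U+XXXX form
      found.push({ index: i, codePoint: "U+" + hex });
    }
  }
  return found;
}

// Example: a name with U+0097 hidden between two spaces.
console.log(findInvisibleChars("Example \u0097 File.txt"));
```

Each hit gives the position in the name and the code point, which is the value needed for the regex escape.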

(I've moved this to a new thread as it doesn't really have anything to do with the decade old one it was added to, other than involving similar Unicode characters.)

Can you provide some example filenames for us to try with? Maybe via a zip file or similar, assuming it preserves the unusual characters.

Without that, it's difficult to test things, since they don't occur in any normal situation.

Where are they coming from, anyway? It seems like something that should be fixed at the source. But it should also be possible to fix things via the Opus rename tool, if needed.

In fact, I would expect the "Make Safe Name" regex in Various simple rename presets to remove the unwanted characters already, without requiring scripting.

Hi Leo, here is an example rename script. The boilerplate was long ago borrowed from a rename script I found in the DOpus support forums (Titlecase.orp?), but I (unfortunately) deleted the attribution comments because I never anticipated sharing it publicly. My apologies to the author.

<?xml version="1.0" encoding="UTF-8"?>
<rename_preset case="none" script="yes" type="normal" version="12">
	<from>*</from>
	<to>*</to>
	<script>@script VBScript
option explicit

	
Function Rename_GetNewName ( strFileName, strFullPath, fIsFolder, strOldName, ByRef strNewName )

	Dim regex
	Dim strWordArray
	Dim strExtension
	Dim strNameOnly
	
	'Create a RegExp object. See http://msdn2.microsoft.com/en-us/library/ms974570.aspx
	Set regex = new RegExp
	regex.IgnoreCase = True ' Case-insensitive matching.
	regex.Global = True ' All matches will be replaced, not just the first match.

	'If we're renaming a file then remove the extension from the end and save it for later.
	if fIsFolder or 0 = InStr(strFileName,".") then
		strExtension = ""
		strNameOnly = strFileName
	else
		strExtension = Right(strFileName, Len(strFileName)-(InStrRev(strFileName,".")-1))
		strNameOnly = Left(strFileName, InStrRev(strFileName,".")-1)
	end if

'----------------------
' Renaming
'----------------------
  
	regex.Pattern = "\sVer\."
	strNameOnly = regex.Replace(strNameOnly, " v.")
	
	regex.Pattern = "\[VS\]"
	strNameOnly = regex.Replace(strNameOnly, "")

	regex.Pattern = "\sAnd\s"
	strNameOnly = regex.Replace(strNameOnly, " and ")

' Delete #
	regex.Pattern = "\s#\s"
	strNameOnly = regex.Replace(strNameOnly, " ")

	regex.Pattern = "\s?#"
	strNameOnly = regex.Replace(strNameOnly, " ")
	
' Delete double spaces
	regex.Pattern = "\s?\s"
	strNameOnly = regex.Replace(strNameOnly, " ")
	
  'Delete Unicode Characters
	regex.Pattern = "\s?\u{0096}"
	strNameOnly = regex.Replace(strNameOnly, " ")
	
	regex.Pattern = "\s?\u{0097}"
	strNameOnly = regex.Replace(strNameOnly, " ")
    
  '-------------------
  ' Finish
  '-------------------

  ' Rejoin the name and extension (to lower case) and we're finished
  strNewName = strNameOnly &amp; LCase(strExtension)
               
End Function
</script>
</rename_preset>

The above script is attached, along with three text files whose filenames contain the Unicode characters. I can't paste the filenames here since they won't display properly, but here's a screenshot using Notepad++:

[image]

And the files I've referenced.

Example File 2017-12-14.txt (30 Bytes)
Example File 2017-12-14.txt (30 Bytes)
Example File 2017-12-14.txt (30 Bytes)
Clean Filenames.orp (1.9 KB)

I'm also including a zip of the above files in case the Unicode gets munged via attachment.

Unicode Examples.zip (1.5 KB)

Let me know if I can provide anything else to help you or others resolve this.

Using \u0097 seems to match that character in a regex.

Regex syntax used in Opus is the ECMAScript one described here (until Microsoft break all their documentation URLs again, probably later today and then again tomorrow :) ): "Regular Expressions (C++)" on Microsoft Learn
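To see why the braced spelling fails in an ECMAScript-style engine, here is a small sketch in plain JavaScript. It illustrates the dialect only; it is an assumption that Opus and the VBScript RegExp object treat the braced form the same way (the rename dialog reported an error for it rather than silently not matching).

```javascript
// Sketch: compare the two escape spellings in JavaScript's regex engine.
var epa = "\u0097"; // U+0097, End of Guarded Area

var fourDigit = new RegExp("\\u0097");  // \uXXXX form: matches the character itself
var braced = new RegExp("\\u{0097}");   // braced form, without the "u" flag

console.log(fourDigit.test(epa)); // true
console.log(braced.test(epa));    // false
```

Without the `u` flag, JavaScript reads `\u{0097}` as the letter "u" followed by the quantifier `{0097}`, that is, "u" repeated 97 times, so it never matches the control character.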

That works!

I'll elaborate to wrap this up in case anyone else finds this information useful:

Using Leo's syntax, I modified my regex snippet in the script to:

	'Delete Unicode Characters
	regex.Pattern = "\s?\u0097|\s?\u0096"
	strNameOnly = regex.Replace(strNameOnly, "")

Note that the \s? in the code above is specific to my filenames, which contain a space-Unicode-space sequence (it appears in Windows as two spaces because the Unicode character is invisible). Removing one space together with the character leaves a single space behind. If you are just matching a Unicode character, simply use the \uXXXX format.
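The effect of that pattern can be checked outside Opus with a short JavaScript sketch. `stripGuardedAreaChars` is a hypothetical name used only for illustration, and the sample names are made up; the pattern itself is the one from the script above.

```javascript
// Sketch: the same pattern applied with JavaScript's String.replace.
// stripGuardedAreaChars is a hypothetical helper name.
function stripGuardedAreaChars(name) {
  // Remove U+0097 or U+0096, plus one whitespace character before it if present.
  return name.replace(/\s?\u0097|\s?\u0096/g, "");
}

// space + U+0097 + space collapses to a single space
console.log(stripGuardedAreaChars("Example \u0097 File") === "Example File"); // true
```

Because the optional `\s?` consumes only the space before the control character, the space after it survives, so the visible "double space" becomes one.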

If you are using Notepad++ to find these hidden Unicode characters as I did, the regex syntax for its Find/Replace is slightly different: \x{0097}.

Leo to the rescue again. Thank you so much. I wish I had given up sooner and posted my struggle weeks ago! It would have saved me hours of frustration. I also added a short blurb on the Various simple rename presets page for reference.

Scott
