Perlscript & Unicode problem [Solved]

I'm seeing a problem with either perlscript or DOpus - I can't tell which from my end.

It appears some Unicode is not coming through unmodified. The screenshot shows that the Latin small letter t with cedilla (U+0163, or 0xC5 0xA3 in UTF-8) is coming into the perlscript as ASCII small t (0x74).

This is causing DOpus to want to rename an unchanged file.
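For anyone who wants to see the difference in bytes, here is a minimal sketch using only the core Encode module. The helper name utf8_hex and the variable names are made up for illustration; nothing here touches DOpus or Windows Script Host.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: UTF-8 encode a string and format the bytes as hex.
sub utf8_hex {
    my ($string) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', encode('UTF-8', $string);
}

# Written as an escape so the example does not depend on how this
# source file itself is encoded.
my $on_disk   = "\x{0163}";   # small t with cedilla, as in the file name
my $in_script = "t";          # what the perlscript actually sees

printf "on disk:   %s\n", utf8_hex($on_disk);     # C5 A3
printf "in script: %s\n", utf8_hex($in_script);   # 74
print "names differ, so a rename is proposed\n" if $on_disk ne $in_script;
```

Because the two strings compare unequal, any script that hands the name straight back will look as if it changed the file name.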

Can you confirm on your end that DOpus is passing the perl dll the correct value?


I've figured out that the problem is with Windows Script Host and ActivePerl, not DOpus. I have a drop handler program that runs outside of DOpus and shows the issue with this particular Unicode character (C5 A3).


It seems that most unicode goes through fine - I don't know much about unicode, so hopefully this is a non-issue. I discovered this while I was adding a unicode to ascii converter for my Dynamic Renamer, and this one pesky test file always wanted to be renamed.

It works fine with VBScript:


And with JScript:


Since Opus does not know which language it is calling, and calls all languages identically via Windows Scripting Host, this must be a problem with ActivePerl.

I installed ActivePerl to check if it went wrong on my machine, in case there was anything machine- or locale-specific going on, and the perlscript example still went wrong.

(Aside: I can't believe the ActivePerl installer still defaults to C:\Perl or C:\Perl64 instead of under Program Files. At least it allows you to change it now, unlike older versions.)

Thanks. I checked the other two script languages as well and found the same. I'll contact ActivePerl.

Edit: I just posted a question in their forum.

I've created a small test script that shows the problem. If I run the script under Cygwin's perl, it works fine. But with ActivePerl it fails. Hopefully there is just an environment variable I need to set to allow it to operate correctly with UTF-8 characters. (I spent the whole of yesterday learning about Unicode, etc.)

The good news is that I ran my script against 20k music tracks with plenty of file names containing diacritics, and not one triggered the issue (meaning, none of my file names have 2+ byte Unicode characters).

If there is no simple setting for ActivePerl, then it should probably be noted that ActivePerl usage is not safe as a Rename scripting language for users with 2+byte Unicode characters in their filenames. (Which really is a shame, because it is easy to code in a few simple lines what bloats to 100+ for simple tasks in other languages.)

I'm following up on this, as I'd like to get this solved. I spent a little time on this yesterday, and have made some progress.

Here's the situation. OLE Automation transfers strings encoded as UTF-16. But perl uses an internal encoding of UTF-8, so it is necessary to convert between the two. The Win32::OLE module can help with this. The default codepage it uses is ANSI (CP_ACP), and this is why certain characters get mapped to their look-alike equivalents (e.g. ţ ==> t).
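A rough way to see which characters are at risk, assuming the ANSI codepage on the machine is Windows-1252 (that is my assumption, it depends on the system locale). The sub name fits_cp1252 is made up. Note that Encode refuses to encode an unrepresentable character here, whereas the Windows conversion substitutes a look-alike, so this only shows which characters cannot survive the trip, not what they turn into.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if $char can be encoded in Windows-1252
# without loss. Assumes CP_ACP is 1252 on the machine in question.
sub fits_cp1252 {
    my ($char) = @_;
    my $copy = $char;    # encode() may modify its input when a CHECK value is given
    return eval { encode('cp1252', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

printf "U+00E9 fits: %d\n", fits_cp1252("\x{00E9}");   # e with acute: 1
printf "U+0163 fits: %d\n", fits_cp1252("\x{0163}");   # t with cedilla: 0
```

Under that assumption, common Western European diacritics go through the ANSI conversion untouched, while a character outside the codepage such as ţ does not.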

What is required in your perl script is to change the codepage option in the Win32::OLE module:

use Win32::OLE qw(CP_UTF8);   # CP_UTF8 is exported on request
Win32::OLE->Option(CP => CP_UTF8);

This now passes in the ţ correctly as UTF-8 (encoded as C5 A3).

However, there is still a problem that I have not been able to resolve. DOpus now shows this character using ISO-8859-1 (aka latin1), and it appears as the incorrect (mojibake) two-character sequence Å£.
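The mojibake itself is easy to reproduce outside DOpus. This is only a sketch of the misreading with the core Encode module, with variable names of my own choosing; it says nothing about where in the chain the misreading actually happens.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $utf8_bytes = encode('UTF-8', "\x{0163}");        # the two bytes C5 A3
my $misread    = decode('ISO-8859-1', $utf8_bytes);  # each byte becomes one character

# $misread is now U+00C5 (A with ring above) followed by U+00A3 (pound sign).
my @code_points = map { sprintf 'U+%04X', ord } split //, $misread;
print "@code_points\n";   # U+00C5 U+00A3

# Reversing the mistake recovers the original character.
my $recovered = decode('UTF-8', encode('ISO-8859-1', $misread));
print "recovered ok\n" if $recovered eq "\x{0163}";
```

So a two-character result in place of one ţ is a sign that UTF-8 bytes are being treated as one character per byte somewhere along the way.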

It seems no matter how I encode the character, I cannot get DOpus to display the correct single ţ (small t with cedilla).

I'm wondering what DOpus is actually receiving from the return of Rename_GetNewName2(). Is DOpus receiving a Unicode string (BSTR) and expecting the encoding of U+0163? (I've tried sending that with and without a BOM.) Or is it expecting latin1 of C5A3, the same thing my script receives? This part is a black box to me. Would someone be able to take a look at the octet sequence in the debugger to verify this?

Internally it's all BSTRs (variants of type VT_BSTR) which are UTF-16.

Yeah, that's what I suspected it would have to be. However, here's where I'm entirely unclear. How is DOpus dispatching, or getting perl to call, Rename_GetNewName2()? What is the interface there? It must be passing in parameters and reading them back.

We call IDispatch::Invoke to invoke the method. All strings are passed and received as VT_BSTR. If there is conversion going wrong it's happening inside the scripting engine because Opus only deals in BSTRs.
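To make the two representations concrete, here is a small sketch comparing the same character in UTF-16 and in UTF-8. It assumes little-endian UTF-16, which is what a BSTR holds on x86/x64 Windows; the helper name hex_bytes is made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: format a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

my $t_cedilla = "\x{0163}";

# One 16-bit code unit on the OLE side, two bytes of UTF-8 on the perl side.
printf "UTF-16LE: %s\n", hex_bytes(encode('UTF-16LE', $t_cedilla));   # 63 01
printf "UTF-8:    %s\n", hex_bytes(encode('UTF-8',    $t_cedilla));   # C5 A3
```

Any byte dump taken in a debugger on the Opus side should therefore show 63 01 for this character, not C5 A3.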

SOLVED!

Sheesh, what a pain in the rear, mostly due to very poor ActiveState Perl documentation with respect to the Win32::OLE module. Jon, your hints above helped, as they made clear what to look for. I had to read the 5500-line source code for the Win32::OLE XS module, and from there figured out what I needed.

In order to provide a VT_BSTR back to DOpus via a perl script, you have to use:

use Win32::OLE::Variant qw(Variant VT_BSTR);
...
my $ret = ...;   # your calculated return for Rename_GetNewName2()
...
return Variant(VT_BSTR, $ret);   # hand the result back as an explicit BSTR

I'll update my Dynamic Renamer with this, and this will complete the last remaining issue I've wanted to squash. It should also serve as an example of what has to be done for perl code in general.