Script file-encoding support

Zoc · May 19, 2017, 12:40am

Hello
To save me from exhaustively testing every file format for compatibility, what are the officially supported script formats?

Most of the scripts I downloaded were in UCS-2 Little Endian with BOM, which is not supported by git.
(Sadly...?) git has turned into something of a standard for sharing code... I'm not so keen on using it, but it does simplify some aspects of code sharing.

So... does it support UTF-8 scripts? Without BOM?
Are there any disadvantages to using it in place of UCS-2 Little Endian with BOM?

Thank you in advance

Leo · May 19, 2017, 1:21am

Most of the scripts on the forum are pure ASCII. At least the ones I have written are, as are the templates Opus normally creates by default.

Zoc · May 19, 2017, 3:40am

Thanks @Leo

Do you know if DOpus supports UTF-8 without BOM for script files?

Leo · May 19, 2017, 7:32am

Try it and see, but BOMs are standard on Windows as there is no other way to properly detect encoding.

I'd also be surprised if Git has issues with BOMs on files. They're very common in the Windows world.

Zoc · May 21, 2017, 7:20pm

I did a quick test with the following code:

function OnInit(initData)
{
	initData.name = "Test";
	initData.version = "1.0";
	initData.copyright = "(c) 2017 Zoc";
//	initData.url = "https://resource.dopus.com/viewforum.php?f=35";
	initData.desc = "Script Test";
	initData.default_enable = true;
	initData.min_version = "12.0";
	
	DOpus.output("ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ");
}

(The string was taken from this link)

UTF-8 BOM, UCS-2 LE BOM, UCS-2 BE BOM did output the string ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ correctly.

ASCII had the output ????????????????????????????? as expected

UTF-8 (without BOM) had the output áš á›‡áš»á›«á›’á›¦áš¦á›«áš áš±áš©áš áš¢áš±á›«áš á›áš±ášªá›«áš·á›–áš»áš¹á›¦á›šáš³áš¢á›—

I thought it would be a good idea to report this here, in case someone needs the same answer
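
The results above are consistent with the encoding being chosen from the byte-order mark at the start of the file. As a rough illustration only, here is a minimal sketch of a conventional BOM check. `detectBom` is a hypothetical helper name, and this is not Opus's actual detection code, which is not shown in this thread.

```javascript
// Hypothetical helper: guess a file's encoding from its first bytes.
// Accepts an array of byte values or a Uint8Array.
// Illustrative only; it does not reflect how Opus itself detects encodings.
function detectBom(bytes) {
  // UTF-8 BOM: EF BB BF
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return "utf-8";
  }
  // UTF-16 little-endian BOM: FF FE
  if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return "utf-16le";
  }
  // UTF-16 big-endian BOM: FE FF
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return "utf-16be";
  }
  // No BOM: the reader has to guess or fall back to a default code page.
  return null;
}

console.log(detectBom([0xEF, 0xBB, 0xBF, 0x66])); // utf-8
console.log(detectBom([0xFF, 0xFE, 0x66, 0x00])); // utf-16le
console.log(detectBom([0x66, 0x75, 0x6E]));       // null
```

With no BOM the function returns `null`, which matches the case above where a UTF-8 file without a BOM was displayed as if each byte were a separate character.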

tbone · May 24, 2017, 9:09am

May I ask, what is UCS2? Is that another name for UTF16?

If I remember correctly, whenever I had DO scripts which were UTF-8 with BOM, they would error on line #1.
This is not directly related to DO I guess, since the Windows Script Host also does not like this encoding.
I'm a bit surprised that you don't have this problem, hu?! o)

Leo · May 24, 2017, 9:13am

tbone · May 24, 2017, 4:01pm

Thank you. o)
Your screenshot, including the number of search results, suggests I should have searched myself. I did now and read it all, but it was just a side question; I actually wanted to know why Zoc had no problems using UTF-8. o)

I find it weird that there are scripts in UCS-2 for download in this forum. Who would choose to save in this encoding? I never encountered it before. Maybe it's been detected wrongly. This can happen for UTF-16 files (that's what the article told me). My uploaded scripts are always UTF-16, because of missing support for UTF-8 in "Scripting.FileSystemObject", which is used by ScriptWizard to add the MD5 hash and things.

Back to UTF-8:
I just tried using UTF-8 encoding for DO12 scripts again; surprisingly, it seems to work for me too now.
Did you change something here with regard to the supported encodings?

The Windows Script Host still does not like the UTF-8 BOM.

D:\ (17:48:49)
>cscript.exe "..\Command.Generic_GoRegistry.js"
Microsoft (R) Windows Script Host Version 5.8
Copyright (C) Microsoft Corporation. All rights reserved.

..\Command.Generic_GoRegistry.js(1, 1) Microsoft JScript runtime error: 'ï»¿' is undefined
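
The 'ï»¿' in that error message is what the three UTF-8 BOM bytes (EF BB BF) look like when each byte is read as a separate character. A minimal sketch in Node.js, assuming the host read the file with a single-byte Windows code page; `latin1` stands in for that code page here, since these three bytes map to the same characters in Latin-1 and Windows-1252.

```javascript
// The three bytes of the UTF-8 byte-order mark.
const utf8Bom = Buffer.from([0xEF, 0xBB, 0xBF]);

// Decoded as UTF-8, the bytes form a single code point, U+FEFF.
const asUtf8 = utf8Bom.toString("utf8");

// Decoded one byte per character, they become three visible characters.
// "latin1" is an assumption standing in for the Windows ANSI code page.
const asAnsi = utf8Bom.toString("latin1");

console.log(asUtf8.length, asUtf8.charCodeAt(0).toString(16)); // 1 feff
console.log(asAnsi); // ï»¿
```

A script engine that sees those three characters at line 1, column 1 would try to treat them as an identifier, which would explain the "is undefined" runtime error.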

Zoc · May 24, 2017, 6:31pm

I used Notepad++ to open the files, and apparently I was lucky with them: they didn't contain characters outside the supported range (Notepad++ doesn't support those), so it detected them as UCS-2 LE BOM and I could read and edit them without issues.

Using Sublime Text 3, it correctly detected the scripts I downloaded to be in UTF-16 LE BOM.

The history of Unicode is a bit messy haha - sorry for the confusion!

When I was searching for a way to download files supported by the latest DOpus, I remember stumbling upon a forum topic where someone (I think it was @Leo?) said he had added support for the new version of the Microsoft JScript engine, but didn't have access to the newer functions. Maybe this change allowed correct detection of UTF-8 BOM files?

Since I found out that UTF-8 BOM files are supported - and since that file format is supported by GitHub too - I started to make some utilities to aid me in script development (and @tbone, I'm using your script as a guinea pig, ofc haha - I hope you don't mind).

Leo · August 31, 2018, 4:05am

Following up on this, since I found the thread looking for something else:

Support for characters outside the BMP range (i.e. support for UTF-16 surrogate pairs and 4-byte UTF-8 sequences) is being added for config files (including toolbars) and script add-ins.

The config file change is already in 12.9.2 beta, while the script add-in change is coming in 12.9.3 beta.

Scripts that use UTF-8 (whether BMP or not) should still have a BOM (byte order mark) at the start to indicate they are UTF-8. That hasn't changed. We've just added support for decoding 4-byte sequences. (Which were very rare, at least in western languages, but are increasingly common with things like emoji.)

Prior to the change you would have seen "Trunk ? Branch" in the name column for the selected script.
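
To illustrate the difference between BMP and non-BMP characters, here is a small Node.js sketch. The helper names `utf8Hex` and `utf16Units` are hypothetical, and the two sample characters are only examples: a rune from the test string earlier in the thread, and an emoji.

```javascript
// Hypothetical helper: hex dump of a string's UTF-8 bytes.
function utf8Hex(s) {
  return Buffer.from(s, "utf8").toString("hex");
}

// Hypothetical helper: the string's UTF-16 code units, as hex.
function utf16Units(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i).toString(16));
  }
  return units;
}

const rune = "\u16A0";     // U+16A0, inside the BMP
const emoji = "\u{1F600}"; // U+1F600, outside the BMP

console.log(utf16Units(rune), utf8Hex(rune));   // [ '16a0' ] e19aa0
console.log(utf16Units(emoji), utf8Hex(emoji)); // [ 'd83d', 'de00' ] f09f9880
```

The BMP character fits in one UTF-16 code unit and three UTF-8 bytes, while the emoji needs a surrogate pair in UTF-16 and a 4-byte sequence in UTF-8. A decoder that handles only the first case cannot represent the second, which would fit the "?" described above.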

Also, tangentially related, here's a command for adding/removing/toggling BOMs on files, in case it's useful: EditBOM - Command to add, remove or toggle UTF-8 byte-order-mark
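
For anyone who wants the same effect outside Opus, the add/remove/toggle idea can be sketched on a byte buffer in Node.js. This is not the EditBOM command's implementation, which is not shown in this thread; all function names below are hypothetical.

```javascript
// The UTF-8 byte-order mark: EF BB BF.
const UTF8_BOM = Buffer.from([0xEF, 0xBB, 0xBF]);

// True if the buffer starts with the UTF-8 BOM.
function hasUtf8Bom(buf) {
  return buf.length >= 3 && buf.subarray(0, 3).equals(UTF8_BOM);
}

// Returns the content with a BOM, adding one only if it is missing.
function addBom(buf) {
  return hasUtf8Bom(buf) ? buf : Buffer.concat([UTF8_BOM, buf]);
}

// Returns the content without a BOM, stripping one only if it is present.
function removeBom(buf) {
  return hasUtf8Bom(buf) ? buf.subarray(3) : buf;
}

// Adds the BOM if it is missing, removes it if it is present.
function toggleBom(buf) {
  return hasUtf8Bom(buf) ? removeBom(buf) : addBom(buf);
}

const script = Buffer.from("function OnInit(initData) {}", "utf8");
const withBom = addBom(script);
console.log(hasUtf8Bom(script), hasUtf8Bom(withBom)); // false true
console.log(removeBom(withBom).equals(script));       // true
```

To apply this to a file, read it into a buffer with `fs.readFileSync(path)`, pass it through one of the functions, and write the result back with `fs.writeFileSync`. The sketch only handles the UTF-8 BOM and leaves UTF-16 files untouched.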