
Script to convert text files (e.g. srt) into UTF-8, without BOM

Is there a script to convert text files (e.g. srt) into UTF-8 without a BOM? For the time being I am using either UTFCast Professional for batch changes or Notepad++ for single-file conversion. In both cases many clicks are involved, and it would be extremely helpful to have a Dopus script button that converts txt/srt/etc. files into UTF-8 without a BOM.

Thank you.

What are you converting from? What format/encoding is the text before conversion?

Georgios means converting from UTF-8+BOM to UTF-8.
There is a difference between UTF-8 and UTF-8 with BOM.
Some applications have problems with the BOM variant. Stripping the BOM, and thus saving the text file as plain UTF-8, solves those problems.
I myself must regularly open text files in Notepad++, change the encoding and then save the file again. Georgios is asking for a script to do this in batch. Even for a single file, a button click is faster than opening and saving in a text editor.
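
To make the difference concrete: the UTF-8 BOM is just the three bytes EF BB BF at the very start of the file. Below is a minimal Python sketch of detecting and stripping it on raw bytes, for use outside Opus. The function names `has_utf8_bom` and `strip_utf8_bom` are made up for this illustration and are not part of any tool mentioned in this thread.

```python
import codecs

# codecs.BOM_UTF8 is the three-byte sequence EF BB BF.


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM; anything else is returned unchanged."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


# Hypothetical sample: the start of an .srt file saved as UTF-8 with BOM.
with_bom = codecs.BOM_UTF8 + "1\r\n00:00:01,000 --> 00:00:02,000\r\nHello\r\n".encode("utf-8")
without_bom = strip_utf8_bom(with_bom)
```

Because this works on bytes rather than decoded text, everything after the BOM, including the line endings, is left exactly as it was.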

Here is a button which will do that for the selected files:


EDIT: 31/Aug/2018


Test JScript.dcf (7.0 KB) -- OLD, see edit just above this line.

See: How to use buttons and scripts from this forum.

If it removes the BOM from a file, that file will be deselected. Files that don't start with a UTF-8 BOM will be skipped and left selected.

The script only does minimal error checking, and does not create a backup of the old file before overwriting it, so you might want to create backups first or improve the script if you are going to use it on important data that isn't already backed up.

Here is the script code contained in the .dcf above, if you just want to look at how it works without downloading the .dcf:

OLD script code
function OnClick(clickData)
{
	var tab = clickData.func.sourcetab;
	var cmd = clickData.func.command;
	cmd.deselect = false;
	var vecDeselect = DOpus.Create.Vector();
	var blobBOM = DOpus.Create.Blob(0xEF,0xBB,0xBF);
	var blobFile = DOpus.Create.Blob();

	for (var eSel = new Enumerator(clickData.func.sourcetab.selected_files); !eSel.atEnd(); eSel.moveNext())
	{
		var item = eSel.item();
		var file = item.Open("r", tab);

		if (file.error == 0 &&
			file.Read(blobFile, 3) == 3 &&
			file.error == 0 &&
			blobBOM.Compare(blobFile) == 0)
		{
			blobFile.Free();
			file.Read(blobFile);
			if (file.error == 0)
			{
				file.Close();
				file = item.Open("wt", tab);
				if (file.error == 0)
				{
					file.Write(blobFile);
					file.Close();
					vecDeselect.push_back(item);
				}
			}
		}
	}

	if (vecDeselect.size > 0)
	{
		cmd.ClearFiles();
		cmd.AddFiles(vecDeselect);
		cmd.RunCommand("Select DESELECT FROMSCRIPT");
	}
}

Not only from UTF-8 with BOM. The original file(s) may be UTF-8 with BOM, in which case only stripping the BOM is needed, but they may also be ANSI, ISO, OEM, etc. UTFCast Professional and Notepad++ both do the job fine, but with too many clicks. I am always in favor of executing file-related commands within Dopus whenever applicable. It is THE file manager, after all!

Thank you.

There is generally no reliable way to automatically tell those encodings apart.

If you have (or can find) tools which do the guesswork well enough, and if they have command-line interfaces, you can run them from Opus buttons to automate things.
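
To illustrate why this is guesswork, here is a small Python sketch (not Opus script) that tries a list of candidate encodings in order. The function name `guess_decode` and the candidate list are assumptions chosen for the example. The key problem is that single-byte code pages accept almost any byte sequence, so "it decoded without an error" does not mean "it was the right encoding".

```python
def guess_decode(data: bytes, candidates=("utf-8-sig", "cp1252", "latin-1")):
    """Try each candidate encoding in order.

    Returns (text, encoding) for the first candidate that decodes without
    an error. That is a guess, not a detection: a later candidate might
    have been the encoding the file was really written in.
    """
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# The same byte is valid in several code pages, but means different things.
raw = "é".encode("cp1252")           # a single byte, 0xE9
as_ansi = raw.decode("cp1252")       # decodes fine
as_oem = raw.decode("cp437")         # also decodes fine, to a different character
```

Strict UTF-8 is the one case that can be checked fairly well, because random ANSI/OEM text rarely forms valid UTF-8 sequences. That is why it is tried first here; everything after it is a fallback the user has to pick.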

Georgios wrote: "Not only from UTF-8-BoM. [...] but may be ANSI, ISO, OEM, etc."

Ah, new wishes.

UTFCast Professional has command-line support. You can use that.

Also Windows has built-in support for converting to UTF-8 via PowerShell:
Get-Content .\test.txt | Set-Content -Encoding utf8 test-utf8.txt

or with variables

$yourfile = "C:\temp\test.csv"
$outputfile = "C:\temp\test-out.csv"
Get-Content -Path $yourfile | Out-File $outputfile -Encoding utf8

The only downside is that, in Windows PowerShell, this generates a UTF-8 file with a BOM.
But with some scripting and use of variables you could make it a two-step conversion, with Leo's script as the last step.
I don't know the nitty-gritty of PowerShell, DO variables and scripting, so I can't help you further.
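
For comparison, when the source encoding is known the two steps can be done in one pass. This is a rough Python sketch, assuming the caller supplies the source encoding (e.g. "cp1252"); the names `to_utf8_no_bom` and `convert_file` are invented for this example.

```python
from pathlib import Path


def to_utf8_no_bom(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8 without a BOM."""
    text = data.decode(source_encoding)
    # If the input carried a BOM that the decoder did not consume
    # (e.g. UTF-8 with BOM decoded as "utf-8"), it appears as U+FEFF; drop it.
    if text.startswith("\ufeff"):
        text = text[1:]
    # Python's "utf-8" codec never writes a BOM on encode.
    return text.encode("utf-8")


def convert_file(src: Path, dst: Path, source_encoding: str) -> None:
    """Convert src to UTF-8 without BOM and write the result to dst."""
    # Reading and writing bytes avoids any newline translation.
    dst.write_bytes(to_utf8_no_bom(src.read_bytes(), source_encoding))
```

Working on bytes and decoding explicitly leaves line endings as they were, which the pipeline-based PowerShell one-liners do not guarantee.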

Thanks Leo!
To focus on the actual RemoveUTF8BOM functionality and make it easier to reuse elsewhere, I restructured your JScript code a bit. The logic is otherwise the same.

function RemoveUTF8BOM( doItem, tab) {
	var bBOM	= DOpus.Create.Blob(0xEF,0xBB,0xBF);
	var bFile	= DOpus.Create.Blob();
	tab		= tab || null;

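	// Open for reading; give up unless the file starts with the UTF-8 BOM.
	var f = doItem.Open("r", tab); if (f.error != 0) return false;
	if (f.Read(bFile, 3) != 3 || bBOM.Compare(bFile) != 0) { f.Close(); return false; }
	bFile.Free();

	// Read everything after the BOM, then close the file whether or not the read worked.
	f.Read(bFile); var readError = f.error; f.Close();
	if (readError != 0) return false;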

	f = doItem.Open("wt", tab); if (f.error != 0) return false;

	f.Write(bFile); f.Close(); bFile.Free();
	return true;
}

function OnClick(data) {
	var f = data.func, tab = f.sourcetab, selFiles = tab.selected_files;
	var cmd = f.command; cmd.deselect = false; var vecDeselect = DOpus.Create.Vector();

	for (var eSel = new Enumerator(selFiles); !eSel.atEnd(); eSel.moveNext()) {
		var bomRemoved = RemoveUTF8BOM(eSel.item(), tab);
		if (bomRemoved)	vecDeselect.push_back(eSel.item());
	}

	if (!vecDeselect.size) return;
	cmd.ClearFiles();
	cmd.AddFiles(vecDeselect);
	cmd.RunCommand("Select DESELECT FROMSCRIPT");
}

Here is a PowerShell version (it needs to be saved as an external file). For safety reasons it uses a temporary file while copying the file contents. If the targetPath parameter is given, it does not overwrite the source file containing the BOM.

param (
    $filePath   = "D:\UTF8BOMTest3.txt",
    $targetPath = $filePath
)

$erroractionpreference = 'stop';
$file   = gi -path $filePath;
$buf    = new-object System.Byte[] 3;
$reader = $file.OpenRead(); 
$read   = $reader.Read($buf,0,3); 
if ($read -eq 3 -and $buf[0] -eq 239 -and $buf[1] -eq 187 -and $buf[2] -eq 191) { 
    $tempFile = [System.IO.Path]::GetTempFileName();
    $writer = [System.IO.File]::OpenWrite($tempFile);
    $reader.CopyTo($writer);
    $writer.Dispose();
    $reader.Dispose();
    move-item -path $tempFile -destination $targetPath -force;
    exit 0;
} else {
    $reader.Dispose();
    exit 1;
}

trap { exit 20; }

PS: When will inline powershell scripts be supported? o)
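
The temporary-file approach used by the PowerShell script above can be sketched in Python as well, for anyone who wants to run it outside DO. This is only an illustration under the same assumptions (skip files without a BOM, optionally write to a different target); the name `remove_utf8_bom` is made up here.

```python
import codecs
import os
import shutil
import tempfile


def remove_utf8_bom(path, target=None) -> bool:
    """Strip a leading UTF-8 BOM from path, writing to target (default: in place).

    Returns True if a BOM was found and removed, False if the file had none.
    """
    target = target or path
    with open(path, "rb") as src:
        if src.read(3) != codecs.BOM_UTF8:
            return False
        # Write the remainder to a temp file in the target's directory,
        # so the final os.replace does not have to cross volumes.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(target)))
        try:
            with os.fdopen(fd, "wb") as dst:
                shutil.copyfileobj(src, dst)
        except BaseException:
            os.unlink(tmp)
            raise
    # The source is closed by now, so replacing it in place also works on Windows.
    os.replace(tmp, target)
    return True
```

The original file is only replaced after the BOM-less copy has been written completely, so a failure halfway through leaves the original untouched.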

I took Leo's code and mixed it into a new script add-in for easier use. Thanks again Leo! o)

ConvertEx:

After installing the add-in, you can use the provided button-menu or create a new button with the following command to remove the UTF-8 BOM from files:

ConvertEx REMOVEUTF8BOM DESELECT=success NODESELECT

PS: I played some more with PowerShell to get something smaller for the UTF-8 BOM removal, but I always ended up with about the same amount of code as the DO-specific JScript, so there's no real benefit in a PowerShell version (unless you need to run that code outside of DO). The 3-liner PowerShell versions out there always seem to mess around with the line endings (e.g. adding extra line breaks at the end), which is not an option if you ask me.

I've posted a newer version of my script above to the Buttons/Scripts area.

You can use this for buttons or commands which add, remove or toggle the UTF-8 BOM at the start of the selected file(s) (or a file specified on the command line):