Metadata BOM removal

I’d like to talk about the BOM in the room. :slight_smile:

In the course of working with image metadata via both Directory Opus and ExifTool, it’s become apparent that some EXIF fields written by DOpus contain a BOM (UTF byte order mark, or zero-width non-breaking space, U+FEFF, which is 0xEF 0xBB 0xBF in UTF-8) as the first character. As far as I have seen, this BOM is not included in the field value string when displaying/editing metadata via DOpus’ Metadata Pane or Set Metadata dialog, but it can be found in JSON data exported from image files by ExifTool.

Under normal circumstances, this BOM probably doesn’t bother anybody else, and it mostly didn’t bother me, either, until my recent efforts to automate a number of routine (for me) metadata operations that I had previously performed manually. For example, I often prepend and/or append new strings to existing metadata fields such as Description, Subject, Title and Comment. I use ExifTool for this, via custom DOpus commands like this:

@set descprfx={dlgstring|Enter string to prepend to Description}$
ExifTool "-EXIF:ImageDescription<{$descprfx}EXIF:ImageDescription" .

As written above, this command results in the BOM ending up between the new and existing strings, which often causes problems for me later on. I’ve now managed to amend my command line to remove the BOM, if there is one, during the prepend operation:

ExifTool "-EXIF:ImageDescription<{$descprfx}{EXIF:ImageDescription;m/^(?:\xEF\xBB\xBF)?(.*)/; $_=$1}" .
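For anyone who wants to see what that expression does outside ExifTool, here is a minimal Python sketch of the same idea. It assumes the field value is handled as raw UTF-8 bytes; the function name `prepend_without_bom` and the sample strings are made up for illustration. The `re.DOTALL` flag is an addition in this sketch so that multi-line values are kept whole (the Perl pattern above has no `/s` modifier, so its `.*` stops at the first newline).

```python
import re

BOM_UTF8 = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def prepend_without_bom(prefix: bytes, existing: bytes) -> bytes:
    """Return prefix + existing, dropping one leading UTF-8 BOM from existing."""
    # Optional BOM at the start, then capture everything after it.
    # re.DOTALL lets "." match newlines too (an assumption added for this sketch).
    match = re.match(rb"(?:\xef\xbb\xbf)?(.*)", existing, re.DOTALL)
    return prefix + match.group(1)


# Hypothetical field value as written with a leading BOM.
old_value = BOM_UTF8 + "Existing description".encode("utf-8")

naive = b"New: " + old_value                      # BOM ends up between the two strings
fixed = prepend_without_bom(b"New: ", old_value)  # BOM removed before joining

print(naive)
print(fixed)
```

Values that never had a BOM pass through unchanged, because the BOM group in the pattern is optional.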

Post-BOM removal, there seem to be no adverse effects on the display/editability of the Description field in the Metadata Pane or Set Metadata dialog. Is the BOM actually necessary? Can a future update to DOpus’ metadata capabilities maybe do away with it? I’m not absolutely sure, but I think there are other metadata fields written by DOpus which do not contain a BOM. I assume the code for dealing with metadata is not strictly DOpus’, but the support library I see looks to me like it’s a customized version of Exiv, credited to GPSoftware, and not updated since 2014.

On the other hand, if the BOM does serve some purpose for DOpus that’s eluding me, am I potentially setting myself up for other complications down the road by removing it? I’m sure I could easily revise my code to restore it to the beginning of the prepended string, but would rather just leave it out if all else is equal.

Contents:

“DOpus_Meta_BOM\”
“2020-09-12 08;56;35 - MAZE - DOpus Meta BOM.png” (57,940) [800 x 340 x 24]
“DOpus_Meta_BOM.png” (4,191) [1 x 1 x 1]
“DOpus_Meta_BOM.png.json” (1,582)
“DOpus_Meta_BOM.png.txt” (370)
“DOpus_Meta_BOM_Original.png” (4,049) [1 x 1 x 1]
“DOpus_Meta_BOM_Original.png.json” (1,441)

2020-09-12 10;51;05 - MAZE - DOpus Meta BOM.7z (55.3 KB)

That was meant to say: “I think there are other UTF-encoded metadata fields written by DOpus which do not contain a BOM.”

BOMs remove ambiguity regarding which 8-bit codepage is in use, and I think are written by other tools as well. (AFAIK, we did not invent doing that for this type of metadata.)

It might make more sense to ask for ExifTool to be able to handle BOMs properly (or it may already have such an option; I don't know it well enough to know for sure).

@mazeckenrode

Another idea is to cut out the middleman and do it all in Opus.

A small script can gather the present metadata field you are interested in and display it. If you wish to change the field, you can do so. You can then compare the values, and if there is a change you can overwrite the old field with the new.

I have been using this method for years and it seems BOM proof :grinning:

@Leo

From the admins at ExifTool, for what it’s worth:

– “I’ve never seen a BOM in any image I’ve collected from the web, nor have any of the tools I’ve tested ever used a BOM.”

– “I’d have to agree.”

@auden

I appreciate the suggestion, but what I try to automate and accomplish in a single ExifTool command line is generally much more complicated and sweeping than the few short examples I’ve given here. Also, ExifTool is being actively developed, with support for reading and writing many, many tags and tag types, at least some of which I use, that as far as I can tell aren’t supported by DOpus, or in a few cases are supported but not in accordance with the official tag specifications. I do want to maintain DOpus’ ability to display the major tags it’s capable of displaying, though, which is why I use DOpus to write those tags first, before subsequent manipulations by ExifTool.

Maybe they are right. I'll do some more research to try and work out when/why we (or a library we're using) started doing it, in case that brings anything to light.

Opus has done this for a long time, though, and without a BOM every tool is left guessing what the encoding is (some tools use UTF8, some UCS-2, some ASCII or local codepages). I find the idea of avoiding explicitness and going with error-prone guessing a curious one, but if that really is the standard (or if there is some other good way to indicate encoding, which works even when metadata is edited by multiple programs) then we aren't exactly in a position to change that. :slight_smile:
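To illustrate the guessing problem described above, here is a small Python sketch. The sample text and the choice of Windows-1252 as the "wrong guess" are assumptions picked for the demonstration, and `has_utf8_bom` is a hypothetical helper, not something taken from Opus or ExifTool.

```python
import codecs

# Bytes as a UTF-8 writer would store them, with no encoding marker.
raw = "café".encode("utf-8")

as_utf8 = raw.decode("utf-8")     # reader that assumes UTF-8
as_cp1252 = raw.decode("cp1252")  # reader that guesses a Windows codepage

print(as_utf8)
print(as_cp1252)


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


print(has_utf8_bom(raw))
print(has_utf8_bom(codecs.BOM_UTF8 + raw))
```

The same bytes yield two different strings depending on the reader's guess, while a leading BOM gives the reader something concrete to check.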

@Leo

In case it’s any help, you may find ExifTool’s FAQ entry on metadata character encoding of interest. In particular:

“ExifTool writes Unicode in native EXIF byte ordering by default, but the byte order may be specified by setting the ExifUnicodeByteOrder tag (see the Extra Tags documentation).”

…and…

“The value of the IPTC:CodedCharacterSet tag determines how the internal IPTC string values are interpreted.”

“Note that unless CodedCharacterSet is UTF-8, applications have no reliable way to determine the IPTC character encoding. For this reason, it is recommended that CodedCharacterSet be set to ‘UTF8’ when creating new IPTC.”

The next Opus beta will change things so that we no longer add the BOM.

Note that this will only affect fields which are added/edited; existing fields are left as-is if they aren't changed.

@Leo

Understood, and thanks for the heads up. Out of curiosity, is that the only change, as far as BOM and metadata encoding go, or have you opted for some other method of encoding indication?

The other change is we're assuming EXIF data is always UTF-8, with or without a BOM on the front, since there is no facility within EXIF to indicate encoding and UTF-8 is the only reasonable thing you can assume.
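As a rough illustration of that read/write policy (not how Opus implements it, which isn't shown in this thread), Python's standard `utf-8-sig` codec decodes UTF-8 and skips a single leading BOM if one is present. The function names below are hypothetical.

```python
def decode_exif_text(data: bytes) -> str:
    """Treat the bytes as UTF-8, ignoring one leading BOM if present."""
    # "utf-8-sig" behaves like "utf-8" on input without a BOM.
    return data.decode("utf-8-sig")


def encode_exif_text(text: str) -> bytes:
    """Write plain UTF-8; the "utf-8" codec never adds a BOM."""
    return text.encode("utf-8")


with_bom = b"\xef\xbb\xbfTitle"  # field written by an older version
without_bom = b"Title"           # field written without a BOM

print(decode_exif_text(with_bom) == decode_exif_text(without_bom))
print(encode_exif_text("Title"))
```

Both old and new fields decode to the same string, so only the writer needs to change.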