Viewer Pane fails to show UTF-8 encoding for textfiles

AKA-Mythos · July 9, 2012, 8:04am

We have several text files in different character encodings (e.g. UTF-8 without a signature, i.e. without a BOM).

Unfortunately, DOpus doesn't detect the correct charset and shows the wrong symbols for all special characters (such as German umlauts).

Jon · July 9, 2012, 8:12am

Without a BOM how is Opus meant to know what the encoding is?
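For reference, checking for a BOM amounts to comparing the first few bytes of the file against a short list of known signatures. The following is only a minimal sketch in Python, not how Opus does it; the name `sniff_bom` is made up for illustration.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so UTF-32 has to be tested before UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else None.

    Hypothetical helper for illustration only.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None
```

A file without any of these signatures returns `None`, which is exactly the case the thread is about: the bytes alone no longer say which charset they are in.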

AKA-Mythos · July 9, 2012, 9:29am

UTF files without a BOM are really a pain, but some applications save or export files that way. Still, text editors (e.g. EmEditor), merge tools and similar programs handle these files very well.

I guess that they have some kind of auto-detection mechanism?

Does DOpus use a charset-handling library that also provides this kind of auto-detection?

Leo · July 9, 2012, 9:36am

It doesn't, and there is no reliable way to detect the character set without a BOM or other indicator.

Anything which tries to guess the character set will get it wrong and show gibberish for some files.

I think we'd be more likely to add an option which says "any text file without a BOM should be treated as UTF-8," since UTF-8 is very common these days. But that would, of course, also go wrong with various other character sets and prevent people using their OEM character set in particular.

UTF-8 files should really begin with a BOM, at least on Windows.
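The ambiguity described above can be shown with a few bytes. This is a small Python sketch under assumed inputs (the sample word and the three candidate charsets are chosen for illustration): the same BOM-less bytes decode without error under several charsets, and text saved in a legacy code page is rejected by a strict UTF-8 decoder.

```python
def decodes_as(data: bytes, encodings):
    """Map each encoding that accepts data without error to the decoded text.

    Hypothetical helper for illustration only.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # this charset rejects the bytes outright
    return results


candidates = ["utf-8", "cp1252", "cp437"]

utf8_bytes = "Grüße".encode("utf-8")      # BOM-less UTF-8
legacy_bytes = "Grüße".encode("cp1252")   # Windows ANSI code page

# Only the charset names are printed, to keep the output ASCII-safe.
print(sorted(decodes_as(utf8_bytes, candidates)))    # ['cp1252', 'cp437', 'utf-8']
print(sorted(decodes_as(legacy_bytes, candidates)))  # ['cp1252', 'cp437']
```

In the first case all three charsets "succeed", but only UTF-8 yields the intended word; the other two produce gibberish. In the second case a blanket "treat as UTF-8" rule fails, which is the downside mentioned for ANSI and OEM files.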

AKA-Mythos · July 9, 2012, 10:46am

Maybe it could be a list of encodings, so that the user can choose which default charset should be used for all text files without a BOM.
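A user-selectable default could look roughly like this in Python. It is a sketch only: `decode_text` and its `default_encoding` parameter are hypothetical names standing in for such a setting, not an existing Opus option.

```python
import codecs


def decode_text(data: bytes, default_encoding: str = "utf-8") -> str:
    """Decode by BOM when present, otherwise by the user-chosen default.

    Hypothetical helper; default_encoding plays the role of the setting.
    """
    # UTF-32 is tested before UTF-16 because their LE signatures overlap.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return data.decode("utf-32", errors="replace")     # codec consumes the BOM
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig", errors="replace")  # strips the signature
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16", errors="replace")     # codec consumes the BOM
    # No signature: use the configured default. errors="replace" shows
    # U+FFFD for invalid bytes instead of failing to display the file.
    return data.decode(default_encoding, errors="replace")
```

With the default left at UTF-8, BOM-less UTF-8 files display correctly; someone who mostly works with ANSI files could set `default_encoding="cp1252"` instead.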

100% agree ...

AKA-Mythos · July 11, 2012, 7:58am

Anyway, this one isn't as important (to me) as the introduction of a URL handler for the viewer pane that makes http(s):// and mailto: links clickable.

BenjaminW · September 3, 2015, 11:50am

[quote="leo"]It doesn't, and there is no reliable way to detect the character set without a BOM or other indicator.

Anything which tries to guess the character set will get it wrong and show gibberish for some files.[/quote]

Check this out: userguide.icu-project.org/conversion/detection
That's a robust library for determining the character set heuristically.
Even if the detection fails, you could provide a way to choose the character set of the current file and update the view accordingly.
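The "guess first, let the user override" idea can be sketched in Python as follows. This is deliberately not ICU: it swaps in a much cruder heuristic (strict UTF-8 validation with a fixed fallback) from the standard library, and the names `guess_encoding` and `view_text` as well as the `cp1252` fallback are assumptions made for illustration.

```python
from typing import Optional, Tuple


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Crude guess: UTF-8 if the bytes are valid UTF-8, else the fallback.

    Stand-in for a real detector; the fallback charset is an assumption.
    """
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


def view_text(data: bytes, override: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used); a user-chosen override beats the guess."""
    encoding = override or guess_encoding(data)
    return data.decode(encoding, errors="replace"), encoding
```

When the guess is wrong, for example an OEM file shown as `cp1252`, calling `view_text(data, override="cp437")` re-decodes the same bytes with the charset the user picked.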