We have several text files with different charset encodings (e.g. UTF-8 without signature).
Unfortunately, DOpus doesn't detect the correct charset and shows the wrong symbols for special characters (like German umlauts).
Without a BOM how is Opus meant to know what the encoding is?
UTF files without a BOM are really a pain, but some applications save or export files that way. Anyway, these files are handled very well by text editors (e.g. EmEditor), merge tools, etc.
I guess that they have some kind of auto-detection mechanism?
Does DOpus use a library for handling different charsets that provides such auto-detection too?
It doesn't, and there is no reliable way to detect the character set without a BOM or other indicator.
Anything which tries to guess the character set will get it wrong and show gibberish for some files.
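To illustrate why guessing can go wrong, here is a small Python sketch (the sample string is just an assumption for the demo). The same bytes decode without any error under several charsets, so nothing in the file itself says which one was intended:

```python
# Sample text with German umlauts (chosen only for illustration).
text = "Grüße"

# Case 1: a BOM-less UTF-8 file read as Windows-1252.
utf8_bytes = text.encode("utf-8")
mojibake = utf8_bytes.decode("cp1252")   # decodes fine, but shows wrong symbols

# Case 2: a Windows-1252 file read as the OEM code page 437.
ansi_bytes = text.encode("cp1252")
as_oem = ansi_bytes.decode("cp437")      # also decodes fine, also wrong

print(repr(mojibake))
print(repr(as_oem))
```

Neither decode raises an exception, which is why a viewer cannot tell from the bytes alone that it picked the wrong charset.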
I think we'd be more likely to add an option which says "any text file without a BOM should be treated as UTF-8," since UTF-8 is very common these days. But that would, of course, also go wrong with various other character sets and prevent people using their OEM character set in particular.
UTF-8 files should really begin with a BOM, at least on Windows.
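A rough Python sketch of what such an option could look like: honour a BOM if there is one, otherwise fall back to a configurable default. This is not how Opus works internally; `decode_text` and `default_encoding` are hypothetical names used only for this illustration.

```python
import codecs

# Known BOMs. The UTF-32 ones come first, because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_text(data: bytes, default_encoding: str = "utf-8") -> str:
    """Decode using the BOM if present, else the user-chosen default."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            # Strip the BOM and decode the rest with the matching codec.
            return data[len(bom):].decode(name)
    # No BOM: use the configured default; undecodable bytes become U+FFFD.
    return data.decode(default_encoding, errors="replace")
```

With the default set to UTF-8, a BOM-less Windows-1252 file would show replacement characters for its umlauts, which is exactly the trade-off described above; passing `default_encoding="cp1252"` (or an OEM code page) flips which files come out wrong.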
Maybe it could be a list of encodings, so that the user can choose which default charset should be used for all text files without a BOM.
100% agree ...
Anyway, this one isn't as important (to me) as the introduction of a URL handler for the viewer pane that makes http(s):// and mailto: links clickable.
[quote="leo"]It doesn't, and there is no reliable way to detect the character set without a BOM or other indicator.
Anything which tries to guess the character set will get it wrong and show gibberish for some files.[/quote]
Check this out: userguide.icu-project.org/conversion/detection
That's a robust library to heuristically determine the character set.
Even if the detection fails, you could supply a way to choose the character set of the current file and update the view accordingly.
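As a very simplified sketch of the idea (this is not ICU's detector, just a standard-library stand-in): try strict UTF-8 first, since random non-UTF-8 text rarely validates as UTF-8, then fall back to a default, and let the user override the result. `guess_and_decode`, `fallback` and `override` are hypothetical names for this example.

```python
from typing import Optional, Tuple


def guess_and_decode(
    data: bytes,
    fallback: str = "cp1252",
    override: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (decoded text, charset used)."""
    # A manual choice by the user always wins over the guess.
    if override is not None:
        return data.decode(override, errors="replace"), override
    try:
        # Strict decode: raises if the bytes are not well-formed UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the configured fallback charset.
        return data.decode(fallback, errors="replace"), fallback


text, used = guess_and_decode("Grüße".encode("cp1252"))
print(used, text)
```

The guess can still be wrong (for example, it cannot tell one single-byte code page from another), which is why the `override` parameter is there: the viewer would simply re-decode the same bytes with whatever charset the user picks.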