German "Umlaute" => ÄÖÜ are not recognized/found during search

Pyradur · August 16, 2016, 10:48am

If I use the search-console at the bottom (Multipanel ?) for searching a file with a German "Umlaut" in the filename like öäü, no file will be found. For instance: Büromöbel.doc

Also, no file that contains the searched "Umlaut" in its contents will be found.
For instance, the text in the file is like:
"Heute ist ein schöner Tag". If I search for files with "ö" in it, no file will be found.

abr · August 16, 2016, 11:01am

I didn't test it in the search panel, but if you use F3, it works (note that F3 only searches folders below your current location). As a workaround you could try using

FIND IN "C:" "D:" "E:" QUERY {dlgstring|search everywhere}

given that you have indexed your drives. Also, you may want to adjust the drive letters to your needs.

Jon · August 16, 2016, 11:03am

This works when I try it and I'm sure one of our many German users would have complained about it by now if it really didn't work.

Please post a screenshot showing how you have the Find window configured.

tbone · August 16, 2016, 1:41pm

Can only speak for DO12 now, but there may indeed be something broken, or expectations have changed over time.

I found out that for UTF8-NOBOM files, searching for umlauts does not work, you need to insert "??" for each umlaut (and enable wildcards of course). I can see why that is difficult, maybe it was always like this and it just took years to notice. For a UTF8-BOM file, things are different. Here it works as expected, did not test other encodings, but it seems the umlauts are two bytes when comparing? Can this ever work for non-unicode files? Does a translation happen before comparison?

If the encoding of files is not clear, to me it also seems as if you'd need to search through these files twice (at least).
First time with ANSI and default codepage, second time assuming UTF8, is that nonsense? What do you think?
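
To illustrate the byte-level difference described above, here is a minimal Python sketch (not Opus code). It assumes that "ANSI" means Windows-1252 (cp1252), which depends on the system locale. It shows why an umlaut occupies two bytes in UTF-8, which would be consistent with needing "??" per umlaut in a byte-oriented wildcard search.

```python
# Illustration only: how "ö" is stored under two encodings.
# Assumption: "ANSI" = Windows-1252 (cp1252); other locales use other code pages.
text = "schöner"

ansi_bytes = text.encode("cp1252")  # one byte per umlaut
utf8_bytes = text.encode("utf-8")   # two bytes per umlaut


def contains(haystack: bytes, needle: str, encoding: str) -> bool:
    """Return True if needle, encoded with `encoding`, occurs in haystack."""
    return needle.encode(encoding) in haystack


print(ansi_bytes)                           # b'sch\xf6ner'
print(utf8_bytes)                           # b'sch\xc3\xb6ner'
print(contains(utf8_bytes, "ö", "cp1252"))  # False: wrong encoding assumed
print(contains(utf8_bytes, "ö", "utf-8"))   # True
```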

Leo · August 16, 2016, 1:51pm

OP is talking about file names not file contents, which are a very different (and more complicated) topic.

Re-reading, we're talking about both, which will complicate things hugely. Let's stick to filenames for now.

Pyradur · August 16, 2016, 2:37pm

Sorry - I was wrong about umlauts not being recognized in filenames. It only affects file contents.

And I suppose user "tbone" is right, because:

I searched in *.php files, which are files without BOM - and no umlauts were found.
I converted one *.php file to UTF-8 with BOM - and the umlauts were found.

A solution other than ?? would be nice...

Jon · August 16, 2016, 6:18pm

If you use the advanced find, there's an "Assume UTF-8 without BOM" option for the Contains clause.

Pyradur · August 16, 2016, 9:17pm

Hey, you are right

Thanks a lot.

For my better understanding, and in case I need to remember it in the future, I've attached your description as a graphic.

me54899 · March 21, 2022, 8:09am

Hello, I just had the same problem - I could not find a text file with a German word that has an Umlaut in its contents (which is super-frequent of course). If one has ALL kinds of txt files (with all kinds of formatting, e.g. UTF-8 with/without BOM, ANSI, ...), will the option "Assume UTF-8 without BOM" always work to find a file by its (umlaut-containing) contents, or could one be forced to run two searches? Or is the only disadvantage reduced search speed? If it is just speed, it would be great if there were a global advanced option to always assume UTF-8 w/o BOM in order to allow users to use Simple Search, which has advantages in everyday use. OR give that option in simple search.... but that would defeat the philosophy of simple...

Leo · March 21, 2022, 11:34am

Did you try the option to see if it makes a difference on the files that aren't being found? That will tell us if the issue is one of utf8 text encoding or something else, which is a good starting point.

me54899 · March 21, 2022, 12:09pm

When I do an advanced search for "ö" with "assume ..." checked, it will find 43, and unchecked 80. I didn't look at the results in extreme detail, but there were files within the 43 that did not show up in the 80.
So I would have to always search like this:
contains ööööö, (x) assume
OR contains ööööö, ( ) assume

Leo · March 21, 2022, 12:40pm

If some files need it on and some need it off then you've probably got files in two incompatible encodings, and the only solution is to run a find that does the search both ways (which you can do via the Advanced tab in the find panel).

(Or you could convert the files using a legacy DOS/Windows encoding over to UTF8, if it's safe to re-save them.)
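
Outside of Opus, the "search both ways" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the legacy encoding is taken to be cp1252, the search is case-sensitive, UTF-16 files are not handled, and the function names `encoded_variants`, `file_contains` and `find_files` are made up for this example.

```python
from pathlib import Path


def encoded_variants(needle: str, encodings) -> set:
    """Encode the search term in every candidate encoding that can represent it."""
    variants = set()
    for enc in encodings:
        try:
            variants.add(needle.encode(enc))
        except UnicodeEncodeError:
            pass  # this encoding cannot represent the term; skip it
    return variants


def file_contains(path: Path, needle: str, encodings=("utf-8", "cp1252")) -> bool:
    """Return True if the file holds the term in any of the candidate encodings."""
    data = path.read_bytes()
    return any(variant in data for variant in encoded_variants(needle, encodings))


def find_files(root: Path, pattern: str, needle: str) -> list:
    """Search root recursively; each matching file is listed once."""
    return sorted(p for p in root.rglob(pattern)
                  if p.is_file() and file_contains(p, needle))
```

For example, `find_files(Path("C:/Docs"), "*.txt", "möglich")` (the folder is hypothetical) would list files containing the word in either encoding. Like any search that guesses the encoding, it can report false matches in files whose bytes happen to coincide.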

me54899 · March 21, 2022, 10:06pm

Thank you.

For non-English languages, I bet a very substantial number of users are affected and will not find files as desired via the Simple Search. I think there should be a default that will always search both ways.

As to myself, I can of course use Advanced Search when I look for words that contain Umlauts or ß (which is very frequent), but I seem to fail at setting up the filter:

I did

simple search
name matching txt ()WC, ()any, (x) partial
contains ö ()WC, ()case

RESULT 47 files

2.1 Advanced
YES And Name Match txt ()WC, ()REG, ()BOM, ()whole, ()case
AND Contains Match ö ()WC, ()REG, (x/0)BOM, ()whole, ()case

RESULT: () BOM: 0 files; (x) BOM: 0

2.2 Advanced
YES And Name Match txt (x)WC, ()REG, ()BOM, ()whole, ()case
AND Contains Match ö ()WC, ()REG, (x/0)BOM, ()whole, ()case

RESULT: () BOM: 0 files; (x) BOM: 0

2.3.1 Advanced
YES And Name Match *txt (x)WC, ()REG, ()BOM, ()whole, ()case
AND Contains Match ö ()WC, ()REG, (x/0)BOM, ()whole, ()case

RESULT: () BOM 47 files; (x) BOM: 12

2.3.2 Advanced
YES And Name Match *txt (x)WC, ()REG, ()BOM, ()whole, ()case
AND Contains Match ö (x)WC, ()REG, (x/0)BOM, ()whole, ()case

RESULT: () BOM 47 files; (x) BOM: 12

2.3.3 Advanced
YES And Name Match *txt (x)WC, ()REG, ()BOM, ()whole, ()case
AND Contains Match ö (x)WC, ()REG, (x/0)BOM, ()whole, ()case

RESULT: ()BOM 47 files: (x) BOM: 11 (SIC)

The 12th, missing file is a file that ALSO shows up in the 47 files results.

2.3.4 Advanced
ö*

RESULT 47; BOM: SOMETIMES SHOWS 11, SOMETIMES 12, at random!!!! My God.... I can reproduce this behavior

---> ????

I wanted to combine the searches:

YES And Name Match *txt (x) WC
YES AND Subclause Match
---YES __ Contains Match ö (x) WC () BOM
---YES OR Contains Match ö (x) WC (x) BOM

RESULT: 11 or 12, at random.... !!

I would have expected 11(12) + 47 = 58, but I guess I don't get it....
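
On the expected count only (this does not explain the 11/12 fluctuation): if the two Contains clauses were combined with OR as intended, a file matching both would be counted once, so the total would be at most 58 rather than exactly 58. A minimal Python sketch with made-up file names:

```python
# Hypothetical result sets; the file names are invented for illustration.
without_bom_option = {"a.txt", "b.txt", "c.txt"}
with_bom_option = {"c.txt", "d.txt"}  # c.txt matches both ways

combined = without_bom_option | with_bom_option  # OR = set union

print(len(without_bom_option) + len(with_bom_option))  # 5
print(len(combined))                                   # 4: c.txt is counted once
```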

Jon · March 21, 2022, 10:17pm

The problem is that without a BOM it isn't possible to be certain that the character is actually the one you're looking for. Throwing up incorrect matches would also confuse people.
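
The ambiguity can be shown with a short Python sketch (illustration only, assuming cp1252 as the legacy "ANSI" code page): the same bytes are valid text in both encodings, so without a BOM nothing in the file says which reading is intended.

```python
# The same bytes, read two ways. Assumption: legacy encoding = cp1252.
raw = "möglich".encode("utf-8")  # b'm\xc3\xb6glich'

as_utf8 = raw.decode("utf-8")    # 'möglich'
as_ansi = raw.decode("cp1252")   # 'mÃ¶glich' (also a valid decoding)

print(as_utf8)
print(as_ansi)

# An ANSI file that literally contains "Ã¶" holds the same bytes as a UTF-8 "ö",
# so a search that assumes UTF-8 would report it as a match.
print("Ã¶".encode("cp1252") == "ö".encode("utf-8"))  # True
```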

me54899 · March 21, 2022, 11:08pm

I am going to try to convert most of my text files then. They just contain personal stuff, I can't even think of situations where UTF8 is needed -- what format would you recommend as general purpose txt format / BOM/ w/o BOM? I know that's off topic, but still relevant in the context of this question and DOpus users reading here. I don't want to lose information when converting though. I am aware there are some config files that require certain formatting. --- Of course I would not batch convert C:, but I still also might be keeping config files in the Document folders as reference etc.

Is an example of a converter; for me it is only affordable as a one-time thing (3-month subscription).

I still don't understand the fluctuation in results (11/12)?

Thank you!


Jon · March 22, 2022, 12:12am

UTF-8 with BOM is best, in my opinion, since it's explicit what encoding the file uses.
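
As a small illustration of what "explicit" means here, the Python sketch below writes a file with a UTF-8 BOM and checks for it. The helper names `write_with_bom` and `has_utf8_bom` are invented for this example; `utf-8-sig` and `codecs.BOM_UTF8` are part of the Python standard library.

```python
import codecs
from pathlib import Path


def write_with_bom(path: Path, text: str) -> None:
    """Write text as UTF-8 with a leading BOM (bytes EF BB BF)."""
    path.write_text(text, encoding="utf-8-sig")


def has_utf8_bom(path: Path) -> bool:
    """Return True if the file starts with the UTF-8 byte order mark."""
    with path.open("rb") as f:
        return f.read(3) == codecs.BOM_UTF8
```

A program that sees the three BOM bytes at the start of a file can decode the rest as UTF-8 without guessing.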

me54899 · March 22, 2022, 6:30am

(I might sound a bit stressed below, please understand, I was working all night.)

Editor in Windows 11 does not allow any registry changes pertaining to encoding, and defaults to UTF-8 without BOM.
Previously, users accumulated lots of ANSI-encoded txt files because that used to be the default up until a certain build of Win 10.
In French and German, words with é or ä make up 10-15% of all words.
This means the average user will no longer be able to find their text files.

There is no decent batch converter that will convert all files to UTF-8 (the one that I found has weird behavior and is not free), and even if there were, one would have to look at every single file because there might be a reason for the encoding (could cause data loss or require a prompt, or other reasons).
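
For what it's worth, a per-file conversion can be sketched in Python. This is a rough sketch, not a vetted tool, and it rests on loud assumptions: files without a BOM that are not valid UTF-8 are treated as cp1252, UTF-16 files are left alone, and a `.bak` copy is kept. The names `guess_encoding` and `convert_to_utf8_bom` are invented for this example.

```python
import codecs
import shutil
from pathlib import Path

LEGACY = "cp1252"  # assumption: the "ANSI" code page of the files in question


def guess_encoding(data: bytes) -> str:
    """Heuristic guess: BOM first, then strict UTF-8, else the legacy code page."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return LEGACY


def convert_to_utf8_bom(path: Path, backup: bool = True) -> str:
    """Re-save one file as UTF-8 with BOM; return the encoding that was guessed."""
    data = path.read_bytes()
    source = guess_encoding(data)
    if source in ("utf-8-sig", "utf-16"):
        return source  # already marked by a BOM; leave untouched
    text = data.decode(source)  # raises if the bytes are invalid in LEGACY
    if backup:
        shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_bytes(codecs.BOM_UTF8 + text.encode("utf-8"))
    return source
```

Note that pure-ASCII files are valid UTF-8 and would also get a BOM, which can break config files or scripts that expect none, so running this over a hand-picked folder is safer than over a whole drive.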

Furthermore, after further reading, I don't want to have with-BOM UTF-8, but rather ANSI, or if need be UTF-8 without-BOM. I will quit using Editor and only use Akelpad to make sure I will only have ANSI in the future, but I will have to spend a lot of time now looking for all UTF-8 files and convert them to ANSI. Because otherwise I cannot rely on the search function.

I am not going to do an Advanced boolean search just to find a text file. It's an everyday task. Also, as I described above, I could not get the OR to work when I wanted the results to show up in one results window instead of doing two Advanced searches.

I also did more testing, and the results number fluctuation issue continues to happen all the time, I search for files, and I get 102, then 108, then 106 and so forth. I think this behavior is limited to Advanced Search.

me54899 · March 22, 2022, 7:09am

The fluctuation bug is independent of the BOM option.

lxp · March 22, 2022, 7:17am

That's an odd combination

I think I once used CpConverter for this job.

me54899 · March 22, 2022, 7:57am

Can anyone reproduce the fluctuation bug in advanced search? I got it jumping like a crazy lottery, showing random files: 1 file, 7 files, 2 files, 4 files, 8 files etc.... just click search and it does a lottery?
and name match ,*txt, use wildcard
and contains match, möglich, use wildcards
möglich= search word
not mentioned = not selected