yisqiu
October 8, 2008, 8:24am
1
I use Opus to search the contents of UTF-8 files.
English characters work fine, but when I search for Chinese characters Opus finds nothing.
The attached file (vs_rank.php) is the one I am searching.
vs_rank.rar (132 Bytes)
Leo
October 8, 2008, 8:49am
2
That file is missing the UTF8 BOM at the start of the file.
If you load it into Notepad and then re-save it as either UTF8 or UTF16 then Opus will find it, since Notepad writes the BOM.
Without the BOM the file's type is ambiguous and programs have to guess/assume which encoding is used. Sometimes they'll guess/assume incorrectly.
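As a rough illustration of what "has a BOM" means, here is a minimal Python sketch that checks the first bytes of a file for the UTF-8 and UTF-16 byte order marks. The function name `detect_bom` is made up for this example, and this is not how Opus itself is implemented; it only shows the idea.

```python
import codecs

# BOM byte sequences and the encoding each one signals.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM: the encoding has to be guessed or assumed
```

To check a file, read its first few bytes in binary mode, e.g. `detect_bom(open(path, "rb").read(4))`. A result of `None` is the ambiguous case described above.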
yisqiu
October 8, 2008, 8:53am
3
But EmEditor (Ctrl+Shift+F) can find the characters in the same file.
Leo
October 8, 2008, 9:29am
4
EmEditor probably assumes UTF8 when there is no BOM, then. Or maybe it looks at the file and tries to guess the format (like Notepad does). Neither of those things is 100% reliable and it's not a good idea to depend on them.
Forcing programs to guess the encoding, due to there being no BOM, can go wrong. Here's a famous example: en.wikipedia.org/wiki/Bush_hid_the_facts
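To show why guessing is unreliable, here is a small Python sketch of one common heuristic: treat the file as UTF-8 if its bytes happen to decode as UTF-8. I don't know what EmEditor or Notepad actually do internally; `looks_like_utf8` is a hypothetical helper for illustration only.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same two Chinese characters stored in two different encodings.
utf8_bytes = "中文".encode("utf-8")
gbk_bytes = "中文".encode("gbk")

# The UTF-8 bytes pass the check and the GBK bytes fail it, so the
# heuristic works here. But the UTF-8 bytes also decode without error
# as ISO-8859-1, producing different (wrong) characters, so a program
# that assumes a legacy code page gets no warning that it guessed wrong.
misread = utf8_bytes.decode("latin-1")
```

A file containing only ASCII passes the check no matter which of these encodings it was saved in, which is part of why the result is only a guess.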
If you open the file in the Opus viewer (or text-file thumbnails) then you'll see what's happening. The first file is the original and the second and third are the result of loading that into Notepad and then re-saving it as UTF8 and UTF16:
Leo
October 8, 2008, 9:34am
5
Actually, adding a BOM to a UTF8 PHP file may not be a good idea, which means you can't really win here (unless PHP can accept UTF16 which it probably can't):
While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may be used to mark text as UTF-8. It only identifies a file as UTF-8 and does not state anything about byte order. Many Windows programs (including Windows Notepad) add BOM's to UTF-8 files. However in Unix-like systems (which make heavy use of text files for file formats as well as for inter-process communication) this practice is not recommended, as it will interfere with correct processing of important codes such as the hash-bang at the start of an interpreted script. It may also interfere with source for programming languages that don't recognise it. For example, gcc reports stray characters at the beginning of a source file, and in PHP, if output buffering is disabled, it has the subtle effect of causing the page to start being sent to the browser, preventing custom headers from being specified by the PHP script. The UTF-8 representation of the BOM is the byte sequence EF BB BF, which appears as the ISO-8859-1 characters ï»¿ in most text editors and web browsers not prepared to handle UTF-8.
From: en.wikipedia.org/wiki/Byte_Order_Mark
(Assuming that is still current and the makers of PHP haven't fixed the problem.)
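To make the quoted passage concrete: the BOM is just three ordinary bytes sitting in front of the `<?php` tag, which is why PHP can end up sending them to the browser as output. The Python sketch below shows how those bytes look when misread as ISO-8859-1 and how they could be removed again. `strip_utf8_bom` is a hypothetical helper name, not part of PHP or Opus.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# The three BOM bytes read as ISO-8859-1 are the characters
# U+00EF, U+00BB, U+00BF, i.e. the "ï»¿" mentioned in the quote.
bom_as_latin1 = codecs.BOM_UTF8.decode("latin-1")

# A PHP file saved with a BOM, and the same file with the BOM removed.
with_bom = codecs.BOM_UTF8 + b"<?php echo 'hi';"
without_bom = strip_utf8_bom(with_bom)
```

Stripping the BOM makes the file safe for PHP again, but it also puts the file back into the ambiguous state that stops Opus from finding the Chinese text, so it is a trade-off either way.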
yisqiu
October 8, 2008, 1:23pm
6
Thanks.
But there are lots of files like this, and I can't re-save them all by hand. Is there a good way to do the conversion?