Byte-order Mark
Always prefix a Unicode plain text file with a
byte-order mark. Because Unicode plain text is a sequence of 16-bit codes, it is sensitive to
the byte ordering used when the text was written.
A byte-order mark is not a control character that selects the byte order of
the text; it simply tells an application receiving the file which byte order
was used to write it.
Ideally, all Unicode text would follow only one set of byte-ordering rules.
This is not possible, however, because microprocessors differ in the position of
the least significant byte: Intel® and MIPS® processors have the least significant byte first; Motorola processors (and
byte-reversed Unicode files) have it last. With only a single set of byte-ordering rules,
users of one type of microprocessor would be forced to swap the byte order every
time a plain text file is read from or written to, even if the file is never
transferred to another system based on a different microprocessor.
The preferred place to specify byte order is in a file header, but text files
do not have headers. Therefore, Unicode has defined a character (0xFEFF) and a
noncharacter (0xFFFE) as byte-order marks. They are mirror byte-images of each
other.
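The mirror-image relationship can be checked directly. The following is a minimal sketch, not part of any Windows API: it writes 0xFEFF in both byte orders with Python's standard struct module and then reads the bytes back with the opposite order. The helper name read_first_code is hypothetical and used only for illustration.

```python
import struct

BOM = 0xFEFF

# The same 16-bit code written with each byte order.
little = struct.pack("<H", BOM)  # least significant byte first
big = struct.pack(">H", BOM)     # least significant byte last


def read_first_code(data: bytes, byte_order: str) -> int:
    """Read the first 16-bit code of data; byte_order is "<" or ">"."""
    return struct.unpack(byte_order + "H", data[:2])[0]


# Reading with the matching order gives 0xFEFF back; reading with the
# opposite order gives the byte-swapped value 0xFFFE.
matching = read_first_code(little, "<")
reversed_read = read_first_code(little, ">")
```

Because 0xFFFE is a noncharacter, a reader that sees it as the first code can conclude that the file was written with the other byte order.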
Since the sequence 0xFEFF is exceedingly rare at the outset of regular
non-Unicode text files, it can serve as an implicit marker or signature to identify
the file as a Unicode file. Applications written to read both Unicode and
non-Unicode text files should use the presence of this sequence as a near-certain
indicator that the file is a Unicode file. (Compare this technique to using the
MS-DOS EOF marker to terminate text files.)
When an application finds 0xFEFF at the beginning of a text file, it typically
processes the file as though it were a Unicode file, although it may also
perform further heuristic checks to verify that this is true. Such a check could be
as simple as a test of whether the variation in the low-order bytes is much
higher than the variation in the high-order bytes. For example, if ASCII text is
converted to Unicode text, every second byte is zero. Also, checking both for
the linefeed and carriage-return characters (0x000A and 0x000D) and for even or
odd file size can provide a strong indicator of the nature of the file.
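The checks described above might be combined as in the sketch below. It is an illustration under stated assumptions rather than a definitive detector: the reader is assumed to use least-significant-byte-first order, the function names read_codes and probably_unicode are hypothetical, and "variation" is approximated crudely by counting distinct byte values. Text in scripts whose high-order bytes vary widely can defeat this test.

```python
import struct

BOM = 0xFEFF


def read_codes(data: bytes) -> list[int]:
    """Read data as 16-bit codes, least significant byte first."""
    count = len(data) // 2
    return list(struct.unpack(f"<{count}H", data[: count * 2]))


def probably_unicode(data: bytes) -> bool:
    """Guess whether data is a Unicode plain text file (heuristic only)."""
    # An odd size cannot be a whole number of 16-bit codes.
    if len(data) < 2 or len(data) % 2 != 0:
        return False
    codes = read_codes(data)
    # The file must start with the byte-order mark.
    if codes[0] != BOM:
        return False
    body = codes[1:]
    if not body:
        return True
    # Crude variation test: the low-order bytes should take at least as
    # many distinct values as the high-order bytes.
    low_bytes = {code & 0xFF for code in body}
    high_bytes = {code >> 8 for code in body}
    return len(low_bytes) >= len(high_bytes)
```

For example, a file holding the mark followed by "Hello" and a carriage-return/linefeed pair passes the test, while the same text stored as single bytes does not.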
When an application finds 0xFFFE at the beginning of a text file, it
interprets it to mean the file is a byte-reversed Unicode file. The application can
either swap the order of the bytes or alert the user that an error has occurred.
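The first option, swapping the bytes, can be sketched as follows. This is a minimal illustration that assumes the reader uses least-significant-byte-first order; swap_bytes and normalize_byte_order are hypothetical helper names, not Windows functions.

```python
def swap_bytes(data: bytes) -> bytes:
    """Exchange the two bytes of every 16-bit code in data."""
    if len(data) % 2 != 0:
        raise ValueError("data must contain a whole number of 16-bit codes")
    swapped = bytearray(len(data))
    swapped[0::2] = data[1::2]
    swapped[1::2] = data[0::2]
    return bytes(swapped)


def normalize_byte_order(data: bytes) -> bytes:
    """Swap the bytes if the first code reads as 0xFFFE; otherwise return data unchanged."""
    first_code = int.from_bytes(data[:2], "little")
    if first_code == 0xFFFE:
        return swap_bytes(data)
    return data
```

After normalization, the first code reads as 0xFEFF and the rest of the file can be processed as ordinary Unicode text. An application that prefers the second option would report an error instead of calling swap_bytes.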
The Unicode byte-order mark character is not found in any code page, so it
disappears if data is converted to ANSI. Unlike other Unicode characters, it is
not replaced by a default character when it is converted. If a byte-order mark is
found in the middle of a file, it is not interpreted as a Unicode character
and has no effect on text output.
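The conversion behavior described above can be imitated outside Windows. The sketch below is only an approximation and does not call the Windows conversion functions: it assumes code page 1252 as the ANSI code page, uses "?" as the default character through Python's errors="replace" handling, and the function name to_ansi is hypothetical.

```python
def to_ansi(text: str, code_page: str = "cp1252") -> bytes:
    """Convert text to a code page, dropping byte-order marks.

    Characters missing from the code page become "?", but U+FEFF is
    removed beforehand so that it disappears instead of being replaced.
    """
    without_bom = text.replace("\ufeff", "")
    return without_bom.encode(code_page, errors="replace")
```

With this sketch, a leading or embedded byte-order mark leaves no trace in the output, while any other character that the code page lacks is replaced by the default character.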
The Unicode value 0xFFFF is illegal in plain text files and cannot be passed
between Windows functions. The value 0xFFFF is reserved for an application's
private use.