What does “The .NET framework uses the UTF-16 encoding standard by default” mean?

Problem :

My study guide (for the 70-536 exam) says this twice in the text and encoding chapter, which comes right after the I/O chapter.

All the examples so far are to do with simple file access using FileStream and StreamWriter.

It also says stuff like “If you don’t know what encoding to use when you create a file, don’t specify one and .NET will use UTF16” and “Specify different encodings using Stream constructor overloads”.

Never mind the fact that the actual overloads are on the StreamWriter class but hey, whatever.

I am looking at StreamWriter right now in Reflector and I am certain I can see that the default is actually UTF8NoBOM.

But none of this is listed in the errata. It’s an old book (I checked the errata of both editions), so if it was wrong I would have thought someone would have picked up on it…

Makes me think maybe I didn’t understand it.

So…..any ideas what it is talking about? Some other place where there is a default?

It’s just totally confused me.

Solution :

“UTF-16” is an annoying term, as it has two meanings which are easily confused.

The first meaning is a series of 16-bit code units. Most of these correspond directly to the Unicode character of the same number; characters outside the Basic Multilingual Plane (U+10000 upwards) are stored as two 16-bit code units, each one a surrogate (a high surrogate followed by a low surrogate).

Many languages use UTF-16 in this sense for internal storage purposes, including as a native string type. This is the usual source of phrases like “.NET (or Java) uses UTF-16 as its default encoding”. .NET accesses the elements of such a UTF-16 string 16 bits at a time (i.e., at the implementation level, as a uint16).
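As a rough illustration of the code-unit point, here is a minimal sketch showing that a character outside the BMP takes up two elements of a .NET string. The class name `Utf16CodeUnits` and the helper `Describe` are made up for this example; `char.ConvertFromUtf32`, `char.IsHighSurrogate` and `char.IsLowSurrogate` are standard framework methods.

```csharp
using System;

public static class Utf16CodeUnits
{
    // Hypothetical helper: formats each 16-bit code unit of a string as four hex digits.
    public static string Describe(string s)
    {
        string[] units = new string[s.Length];
        for (int i = 0; i < s.Length; i++)
        {
            units[i] = ((int)s[i]).ToString("X4");
        }
        return string.Join(" ", units);
    }

    public static void Main()
    {
        // U+1F600 lies outside the Basic Multilingual Plane.
        string smile = char.ConvertFromUtf32(0x1F600);

        Console.WriteLine(smile.Length);                    // 2 code units, one character
        Console.WriteLine(Describe(smile));                 // D83D DE00
        Console.WriteLine(char.IsHighSurrogate(smile[0]));  // True
        Console.WriteLine(char.IsLowSurrogate(smile[1]));   // True
    }
}
```

Note that `Length` counts code units, not characters, which is exactly the sense of “UTF-16” described above.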

The next thing to consider is the encoding of such a UTF-16 string into linear bytes, for storage in a file or network stream. As always when you store larger numbers into bytes, there are two possible encodings: little-endian or big-endian. So you can use “UTF-16LE”, the little-endian encoding of UTF-16 into bytes, or “UTF-16BE”, the big-endian encoding.
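To make the two byte orders concrete, the following sketch encodes the same one-character string both ways. It assumes the standard `Encoding.Unicode` (UTF-16LE) and `Encoding.BigEndianUnicode` (UTF-16BE) properties; the class name `ByteOrderDemo` and the `Hex` helper are just illustrative.

```csharp
using System;
using System.Text;

public static class ByteOrderDemo
{
    // Hypothetical helper: renders bytes as hex pairs separated by dashes.
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    public static void Main()
    {
        // 'A' is U+0041; GetBytes does not add a byte order mark.
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes("A")));          // 41-00 (little-endian)
        Console.WriteLine(Hex(Encoding.BigEndianUnicode.GetBytes("A"))); // 00-41 (big-endian)
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes("A")));             // 41
    }
}
```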

(“UTF-16LE” is the more commonly used. Just to add to the confusion, Windows gives it the deeply misleading and ambiguous encoding name “Unicode”. In reality it is almost always better to use UTF-8 for file storage and network streams than either of UTF-16LE/BE.)

But if you don’t know whether a bunch of bytes contains “UTF-16LE” or “UTF-16BE”, you can use the trick of looking at the first code point to work it out. This code point, the Byte Order Mark (BOM, U+FEFF), is only valid when read one way around (byte-swapped, it becomes U+FFFE, which is not a valid character), so you can’t mistake one encoding for the other.

This approach, of not caring what byte order you have but using a BOM to signal it, is usually referred to under the encoding name… “UTF-16”.
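The BOM trick can be sketched as follows. This is only an illustration: `BomSniffer` and `DetectByteOrder` are hypothetical names, and a real detector would also have to consider the UTF-32LE BOM, which begins with the same two bytes as the UTF-16LE one. `Encoding.GetPreamble` is the standard way to ask a .NET encoding for its BOM bytes.

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Hypothetical helper: guesses the UTF-16 byte order from the first two bytes.
    // Simplification: FF FE 00 00 (the UTF-32LE BOM) is not distinguished here.
    public static string DetectByteOrder(byte[] data)
    {
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE) return "UTF-16LE";
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF) return "UTF-16BE";
        return "unknown";
    }

    public static void Main()
    {
        // U+FEFF serialised in each byte order.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));          // FF-FE
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble())); // FE-FF

        Console.WriteLine(DetectByteOrder(new byte[] { 0xFF, 0xFE, 0x41, 0x00 })); // UTF-16LE
        Console.WriteLine(DetectByteOrder(new byte[] { 0xFE, 0xFF, 0x00, 0x41 })); // UTF-16BE
        Console.WriteLine(DetectByteOrder(new byte[] { 0x41, 0x00 }));             // unknown
    }
}
```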

So, when someone says “UTF-16”, you can’t tell whether they mean a sequence of 16-bit code units, or a sequence of bytes in unspecified order that will decode to one.

(“UTF-32” has the same problem.)

“If you don’t know what encoding to use when you create a file, don’t specify one and .NET will use UTF16”

If that’s the actual direct quote, it is a lie. Constructing a StreamWriter without an encoding argument is explicitly specified to give you UTF-8.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky

Test it. Write the string “abcd” to a file. If it uses UTF-8, the file will have a size of 4 bytes. Under UTF-16, it’ll be 8 bytes (plus perhaps 2 more for the BOM).
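A minimal sketch of that test, assuming a temp file is an acceptable place to write. `DefaultEncodingCheck` and `Measure` are names invented for this example; the `StreamWriter` constructors used are the standard ones.

```csharp
using System;
using System.IO;
using System.Text;

public static class DefaultEncodingCheck
{
    // Hypothetical helper: writes "abcd" with the writer produced by `open`
    // and returns the resulting file size in bytes.
    public static long Measure(Func<string, StreamWriter> open)
    {
        string path = Path.GetTempFileName();
        try
        {
            using (StreamWriter writer = open(path))
            {
                writer.Write("abcd");
            }
            return new FileInfo(path).Length;
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        // No encoding argument: the documented default is UTF-8 without a BOM.
        Console.WriteLine(Measure(p => new StreamWriter(p)));

        // Explicit UTF-16LE ("Unicode" in .NET naming), which also writes a BOM.
        Console.WriteLine(Measure(p => new StreamWriter(p, false, Encoding.Unicode)));
    }
}
```

The first figure should come out as 4 and the second as 10 (8 bytes of UTF-16LE plus a 2-byte BOM), which is consistent with the default being UTF-8 rather than UTF-16.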

UTF-16 is the encoding .NET uses for strings in memory (string variables, for example); it is not the default for writing files.

I had this problem with the static System.IO.File class.

I wanted to write a string that contained UTF-16 XML to file.

First, I used

using(StreamWriter writer = File.CreateText(xmlFilePathTarget))

But because it wrote the string as UTF-8, IE would not open it and displayed the error:

The XML page cannot be displayed
Cannot view XML input using style sheet. Please correct the error and then click the Refresh button, or try again later.

Switch from current encoding to specified encoding not supported. Error processing resource ‘file:///C:/Documents and Setti…

Thanks largely to this article, I found the solution was to explicitly use the StreamWriter constructor:

using(StreamWriter writer = new StreamWriter(xmlFilePathTarget, false, Encoding.Unicode))
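For completeness, here is a hedged sketch of that fix as a small self-contained program. The class `Utf16XmlWriter`, the method `WriteAndReadBack`, the sample XML string and the use of a temp file in place of `xmlFilePathTarget` are all assumptions made for illustration; only the `StreamWriter(path, false, Encoding.Unicode)` call comes from the answer above.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf16XmlWriter
{
    // Hypothetical helper: writes the XML as UTF-16LE and returns the raw bytes on disk.
    public static byte[] WriteAndReadBack(string xml)
    {
        string path = Path.GetTempFileName(); // stand-in for xmlFilePathTarget
        try
        {
            using (StreamWriter writer = new StreamWriter(path, false, Encoding.Unicode))
            {
                writer.Write(xml);
            }
            return File.ReadAllBytes(path);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        // The declaration claims utf-16, so the bytes on disk need to agree with it.
        string xml = "<?xml version=\"1.0\" encoding=\"utf-16\"?><root/>";
        byte[] bytes = WriteAndReadBack(xml);

        // BOM (FF FE) followed by '<' as a little-endian code unit (3C 00).
        Console.WriteLine(BitConverter.ToString(bytes, 0, 4)); // FF-FE-3C-00
    }
}
```

The point of the design is simply that the encoding named in the XML declaration and the encoding of the bytes written must match; the other option would be to change the declaration to UTF-8 and keep the default writer.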
