November 9, 2009

Why won't my objects serialize properly and what the heck are byte order marks anyway?

Okay, I had an issue with serialization today and couldn't for the life of me figure out what was going on. Up until now I've had no problems with the following code, but today it gave me a headache that took a while to fix. A bit of background: I originally developed this code on a bilingual laptop running Windows Vista on a 64-bit CPU, and today I plugged it into a project running on a 32-bit Canadian English Windows XP machine - not that I suspect that should make a difference.

using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

public static class Serializer
{
    private static Encoding encoding = Encoding.UTF8;

    public static string Serialize<T>(T data)
    {
        using (var memoryStream = new MemoryStream())
        {
            var xmlTextWriter = new XmlTextWriter(memoryStream, encoding);
            new XmlSerializer(typeof(T)).Serialize(xmlTextWriter, data);
            xmlTextWriter.Flush();
            // ToArray() returns only the bytes written; GetBuffer() also returns unused capacity
            return encoding.GetString(memoryStream.ToArray());
        }
    }

    public static T Deserialize<T>(string xmlData)
    {
        using (MemoryStream memoryStream = new MemoryStream(encoding.GetBytes(xmlData)))
            return (T)new XmlSerializer(typeof(T)).Deserialize(memoryStream);
    }
}

Up until today, my objects serialized okay. Today, however, I noticed in my current project that the XML the serializer emits is prefixed with a '?', so my XML strings looked like this:

?<?xml version="1.0"?>
<DummyClass xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
            xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Property1>Hello</Property1>
  <Property2>World</Property2>
</DummyClass>

Now either this is normal and I've just never noticed it, or it has simply never caused deserialization issues before. Either way, today it's a problem. My deserializer complains that the format of the XML is incorrect, and I suspect it's right: my serializer seems to be prepending a stray character that shouldn't be there. So what is this mysterious question mark?

It turns out that it's actually the Byte Order Mark (BOM), a short sequence of bytes at the start of a text stream that identifies its Unicode encoding and byte order. Further information can be found at http://www.unicode.org/faq/utf_bom.html#BOM but to summarize, the marks are as follows:

Bytes        Encoding Format
0x0000FEFF   UTF-32 Big Endian
0xFFFE0000   UTF-32 Little Endian
0xFEFF       UTF-16 Big Endian
0xFFFE       UTF-16 Little Endian
0xEFBBBF     UTF-8
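To check these byte sequences against what .NET actually emits, the following is a minimal sketch that formats each encoding's preamble as hex. The class and method names (BomTable, PreambleHex, Rows) are made up for illustration and are not part of the Serializer class in this post; Encoding.GetPreamble() is the framework call doing the real work.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class BomTable
{
    // Formats an encoding's preamble (its byte order mark) as hex, e.g. "EF BB BF".
    // An encoding with no preamble yields an empty string.
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble()).Replace("-", " ");
    }

    // Returns one line per encoding listed in the table above.
    public static string[] Rows()
    {
        return new[]
        {
            "UTF-32 Big Endian:    " + PreambleHex(new UTF32Encoding(true, true)),
            "UTF-32 Little Endian: " + PreambleHex(new UTF32Encoding(false, true)),
            "UTF-16 Big Endian:    " + PreambleHex(Encoding.BigEndianUnicode),
            "UTF-16 Little Endian: " + PreambleHex(Encoding.Unicode),
            "UTF-8:                " + PreambleHex(Encoding.UTF8),
        };
    }
}
```

Calling BomTable.PreambleHex(Encoding.UTF8) gives "EF BB BF", which is the three-byte mark that turned up at the front of my XML.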

So this mysterious question mark appears to be the UTF-8 Byte Order Mark (BOM) 0xEFBBBF, which is being displayed as a question mark and is what's causing my deserializer to crash. Well, now that I know what it is, I can figure out a way around it. After a little digging, it seems that instead of using Encoding.UTF8, I can create a new UTF8Encoding that doesn't emit the identifying mark:

new UTF8Encoding(bool encoderShouldEmitUTF8Identifier)

instead of using

Encoding.UTF8
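To see the difference between the two, here is a rough sketch that writes a tiny document through XmlTextWriter with a given encoding and then decodes the bytes back into a string, the same way my serializer does. BomDemo, WriteAndDecode and StartsWithBom are hypothetical names used only for this illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical demo class for illustration only.
public static class BomDemo
{
    // Writes a tiny XML document through XmlTextWriter with the given encoding
    // and decodes the resulting bytes back into a string with the same encoding.
    public static string WriteAndDecode(Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            var writer = new XmlTextWriter(stream, encoding);
            writer.WriteStartDocument();
            writer.WriteElementString("Greeting", "Hello");
            writer.WriteEndDocument();
            writer.Flush();
            return encoding.GetString(stream.ToArray());
        }
    }

    // True when the string starts with U+FEFF, the character a decoded BOM becomes.
    public static bool StartsWithBom(string text)
    {
        return text.Length > 0 && text[0] == '\uFEFF';
    }
}
```

With Encoding.UTF8, StartsWithBom(WriteAndDecode(Encoding.UTF8)) is true, because GetString does not strip the mark. With new UTF8Encoding(false) it is false and the string begins with the XML declaration.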

So I've changed my serialization code to the following, which is more flexible: unless an encoding is passed in, it defaults to UTF-8 without the UTF-8 identifier.

using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

public static class Serializer
{
    /* set the default encoding in a lazy fashion so that 
     * it's not loaded unless it's needed */
    private static Encoding _enc;
    public static Encoding DefaultEncoding
    {
        get
        {
            /* false = do not emit the UTF-8 identifier (BOM) */
            if (_enc == null) _enc = new UTF8Encoding(false);
            return _enc;
        }
    }

    /* Serialize using the default encoding */
    public static string Serialize<T>(T data)
    {
        return Serialize<T>(data, null);
    }

    /* Serialize using the specified encoding */
    public static string Serialize<T>(T data, Encoding encoding)
    {
        encoding = encoding ?? DefaultEncoding;

        using (var memoryStream = new MemoryStream())
        {
            var xmlTextWriter = new XmlTextWriter(memoryStream, encoding);
            new XmlSerializer(typeof(T)).Serialize(xmlTextWriter, data);
            xmlTextWriter.Flush();
            /* ToArray() returns only the bytes written;
             * GetBuffer() also returns unused capacity */
            return encoding.GetString(memoryStream.ToArray());
        }
    }

    /* Deserialize using the default encoding */
    public static T Deserialize<T>(string xmlData)
    {
        return Deserialize<T>(xmlData, null);
    }

    /* Deserialize using the specified encoding */
    public static T Deserialize<T>(string xmlData, Encoding encoding)
    {
        encoding = encoding ?? DefaultEncoding;

        using (MemoryStream memoryStream = new MemoryStream(encoding.GetBytes(xmlData)))
            return (T)new XmlSerializer(typeof(T)).Deserialize(memoryStream);
    }
}

This makes for a more flexible and robust utility library, because instead of relying on a framework default, the encoding is now specified explicitly.
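As a quick sanity check, here is a self-contained round-trip sketch using the same approach (XmlTextWriter over a MemoryStream with a BOM-less UTF8Encoding). The DummyClass shape is guessed from the XML sample above, and RoundTripCheck with its ToXml, FromXml and RoundTrips methods are hypothetical names for this illustration rather than part of my library.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Assumed shape, inferred from the XML sample earlier in the post.
public class DummyClass
{
    public string Property1 { get; set; }
    public string Property2 { get; set; }
}

// Hypothetical check class for illustration only.
public static class RoundTripCheck
{
    // UTF-8 without the identifier (BOM).
    private static readonly Encoding NoBomUtf8 = new UTF8Encoding(false);

    // Serializes data to an XML string via a MemoryStream.
    public static string ToXml<T>(T data)
    {
        using (var stream = new MemoryStream())
        {
            var writer = new XmlTextWriter(stream, NoBomUtf8);
            new XmlSerializer(typeof(T)).Serialize(writer, data);
            writer.Flush();
            return NoBomUtf8.GetString(stream.ToArray());
        }
    }

    // Deserializes an XML string back into an instance of T.
    public static T FromXml<T>(string xml)
    {
        using (var stream = new MemoryStream(NoBomUtf8.GetBytes(xml)))
            return (T)new XmlSerializer(typeof(T)).Deserialize(stream);
    }

    // True when the XML starts with the declaration (no stray character)
    // and the deserialized copy carries the original property values.
    public static bool RoundTrips()
    {
        var original = new DummyClass { Property1 = "Hello", Property2 = "World" };
        string xml = ToXml(original);
        DummyClass copy = FromXml<DummyClass>(xml);
        return xml.StartsWith("<?xml", StringComparison.Ordinal)
            && copy.Property1 == "Hello"
            && copy.Property2 == "World";
    }
}
```

If RoundTripCheck.RoundTrips() returns true, the serialized string starts with the XML declaration rather than a stray mark, and the object survives the trip intact.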

Further reading regarding document encoding can be found on the MSDN website at http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx
