Drop Boms Not Bombs

Posted by feydr | Posted in Uncategorized | Posted on 12-05-2009


So I come in Monday morning one fine week after a long drunken hiatus to find a ‘badhand’ in my inbox. A badhand is a feature I implemented on one of the sites we are developing so that users can inform me that my parser did not correctly produce the expected xml. This saves me time and the user frustration.

Except this badhand was a lot like some others I had received: it had a special 2-byte character that looked like ‘‘ in vim. I knew right away that something was wrong and that somewhere, somehow, my unlucky user had formatted this text in a text editor such as WordPad, even though he defiantly shouted back that he had not (he uses a Mac, after all).

So I started my quest and eventually found out that this special character is called a ‘BOM’, or ‘byte order mark’, which tells a text editor which byte order (and so which encoding) to use for Unicode text. Well, after further searching I found the ‘best answer’ on Stack Overflow (as usual):

There’s no such thing as a UTF-8 BOM. A UTF-8 file is in a predefined byte order already, adding a BOM prefix is completely pointless. Applications that produce a UTF-8-encoded file with a U+FEFF character at the beginning are wrong.
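Whatever you make of that answer, the marks themselves are just short fixed byte sequences at the start of the data. Here is a minimal sketch of sniffing for them. BOMS and detect_bom are names I made up for illustration, and UTF-32 is left out on purpose because its little-endian mark starts with the same two bytes as UTF-16LE.

```ruby
# Byte order marks for the common Unicode encodings, as raw byte values.
# BOMS and detect_bom are hypothetical names for this sketch.
BOMS = {
  'UTF-8'    => [0xEF, 0xBB, 0xBF],
  'UTF-16BE' => [0xFE, 0xFF],
  'UTF-16LE' => [0xFF, 0xFE]
}

# Return the name of the encoding whose BOM starts the data, or nil.
def detect_bom(data)
  lead = data.unpack('C3') # first three bytes as integers (nil if missing)
  BOMS.each do |name, bom|
    return name if lead[0, bom.length] == bom
  end
  nil
end
```

This only looks at the very start of the data, which is the only place a BOM is supposed to live.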

Hrm… I thought for a second here, because I already take care of special UTF-8 characters in my parser; among them are variations of the em-dash, euro, and yen. Also, apparently the BOM is only supposed to appear as the first 2-3 bytes at the top of a text file. My sample files had them on almost every other line.
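A quick way to confirm that the marks really are scattered through a file, and not just sitting at the top, is to list the byte offsets where they occur. This is a rough sketch that assumes Ruby 1.9 or later (for force_encoding) and the little-endian mark 0xff 0xfe; bom_offsets is a hypothetical helper name.

```ruby
# Find the byte offset of every 0xff 0xfe pair in a chunk of raw data.
# bom_offsets is a hypothetical helper name for this sketch.
def bom_offsets(data)
  bom = [0xFF, 0xFE].pack('C2')             # the two raw BOM bytes
  raw = data.dup.force_encoding('BINARY')   # search bytes, not characters
  offsets = []
  pos = 0
  while (idx = raw.index(bom, pos))
    offsets << idx
    pos = idx + bom.length
  end
  offsets
end
```

Feed it the raw bytes of a file, for example from File.open(path, 'rb') { |f| f.read }. A result like [0] is a well-behaved file; anything longer means marks in the middle. One caveat: in real UTF-16 text the same two bytes can turn up by chance across two neighbouring characters, so hits at odd offsets deserve a second look.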

Shit! I had been screwed by the powers that be! My lean mean parser was about to pick up an ugly method that would slow it down considerably.
It was time to ride the BOM.


[Image: Slim Pickens riding the bomb]

At first I thought: OK, all I need to do is scan for the BOM (which would be simple: 0xfe 0xff, or 0xff 0xfe when the file is little-endian) and remove it. Then I remembered that most of the files that come through my webserver are passed through ruby’s File.open method, which is NOT a Unicode-safe library call! Shit! For a language that was made in Japan you’d think ruby would have decent Unicode support. Say what you will, but ruby (1.8, at least) treats ALL strings as 1-byte char star arrays, with no notion of multi-byte Unicode characters. This put me in a pickle. IRC was of no help; I’ve learnt that when you don’t receive an answer on IRC it’s usually because a) you are asking dumb questions or b) no one knows the answer.

I decided I must pull out IRB and just start hacking until something comes.

This is what I came up with:

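  # translate unicode to latin
  # and drop the BOM
  def unicodeTolatin(byteray)

    # flip each raw byte to its own UTF-8 character (0xff becomes U+00FF)
    unistring = byteray.each_byte.map do |p|
      [p].pack('U')
    end

    # drop any null bytes (the high bytes of ASCII characters in UTF-16)
    blah = unistring.reject { |o| o == "\000" }

    # convert the array of characters over to a string
    # (join rather than to_s, which only concatenates on ruby 1.8)
    newstr = blah.join

    # drop our bom: bytes 0xff 0xfe, which are now U+00FF U+00FE
    newstr.gsub([0xff, 0xfe].pack('U*'), '')
  end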
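
On Ruby 1.9 or later, where strings carry an encoding, the same cleanup could probably be handed to String#encode instead of walking the bytes by hand. This is only a sketch under the assumption that the input is valid little-endian UTF-16; utf16le_to_utf8 is a hypothetical name, and invalid input will raise an encoding error unless you add replacement options.

```ruby
# Decode UTF-16LE bytes to UTF-8 and drop every U+FEFF, wherever it appears.
# utf16le_to_utf8 is a hypothetical name for this sketch.
def utf16le_to_utf8(bytes)
  # Tag a copy of the raw bytes as UTF-16LE, then transcode to UTF-8.
  text = bytes.dup.force_encoding('UTF-16LE').encode('UTF-8')
  # A decoded BOM is the single character U+FEFF; remove all of them.
  text.delete("\uFEFF")
end
```

Unlike the byte-by-byte version, this keeps characters outside Latin-1 intact, since nothing is thrown away except the marks themselves.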

If any of you ever have to struggle with the fucking bom — remember DROP BOMS NOT BOMBS!
