ᏰᎤᎷ  Ⴕჩვ  ⲂⲀⲊⲊ

Posted by feydr | Posted in Uncategorized | Posted on 28-08-2009


So this Monday I found out that one of our datasets was not being parsed anymore — it was the most important one…FUCK!

After much cursing and much bullshitting I admitted that I was being sent UTF-16LE encoded text and my program was set to receive UTF8.

My first “fix” was to look for the BOM (byte order mark) that usually comes with such encodings and convert to UTF-8 based upon that. I’ve discussed my love of Unicode and the BOM before. The short story is that it is put at the top of text files to tell other programs how the text is encoded, since there are only about 3,000-8,000 different languages in the world. The BOM is NOT always in use, but UTF-16 files usually have one. In UTF-8 files, which are strongly encouraged, BOMs can be present but are not mandatory. In UTF-16LE (think Windows) the two-byte signature usually looks like this:

me@mhu: od -h testfile | more
## producing this output
0000000 feff 0046 0075 006c 006c 0020 0054 0069
## ...

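Note that the 0000000 is just the offset of the bytes, not the actual bytes. In our case the signature is 2 bytes, FF and FE. od -h prints 16-bit words in the machine's byte order, so on a little-endian box the pair shows up swapped as feff.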
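As a rough illustration of that first “fix”, here is a minimal sketch of sniffing the BOM from the leading bytes. The class and method names (BomSniffer, charsetFromBom) are made up for this example and are not from my actual code. It only knows about UTF-8 and UTF-16, so a UTF-32LE file (whose BOM also starts with FF FE) would be misreported.

```java
class BomSniffer {
    // Returns the charset name implied by a leading BOM, or null if no
    // known BOM is present. UTF-32 is deliberately ignored in this sketch.
    static String charsetFromBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        return null;
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Same leading bytes as the od dump: FF FE, then 'F' and 'u' in UTF-16LE.
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x46, 0x00, 0x75, 0x00};
        System.out.println(charsetFromBom(sample)); // prints UTF-16LE
    }
}
```

A null return just means "no BOM seen", which says nothing about the actual encoding.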

This, however, was not going to work as the data that I received did not always include the BOM. Why? The data was being sent from multiple locations that were issuing HTTP requests to my server, which collected and parsed it via an ANTLR-based grammar.
On the machines that were issuing the HTTP requests there was a file watcher that sent chunks of data to the HTTP server — as the file grew more chunks were sent but only the first chunk ever had the BOM in it.

Knowing I could not immediately send the encoding along with the data or re-encode on the client machine I opted to do this on the server side. My friend over at SoftwareBloat.com suggested I just look for the null bytes. The result ended up looking at the first 10 bytes of a dataset — if it included the null byte at least 2 times this was a fairly good guess that we were dealing with UTF-16LE encoding. I did not bother checking for UTF-32 LE/BE or BE in general as I’d very much like to meet the person who is running a MIPS or RISC processor and using our service — although with the rise of netbooks utilizing ARM processors and such there may come a day when we have to support this.

My detection and conversion code looks like this:

      // guess encoding; if it looks like UTF-16LE then
      // convert to UTF-8 first
      // (needs java.io.FileInputStream and java.io.FileOutputStream)
      try {
        String path = args[args.length-1];
        FileInputStream fis = new FileInputStream(path);
        byte[] contents = new byte[fis.available()];
        fis.read(contents, 0, contents.length);
        // close before the same file is reopened for writing
        fis.close();
        byte[] real = null;

        int found = 0;

        // if found a BOM then skip out of here... we just need to convert it
        if ( (contents.length >= 2) && (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
          found = 3;
          real = contents;

        // no BOM detected but still could be UTF-16
        } else {

          // count null bytes in the first 10 bytes (fewer if the input is short)
          for(int cnt=0; cnt < Math.min(10, contents.length); cnt++) {
            if(contents[cnt] == (byte)0x00) { found++; }
          }

          // tack on BOM and copy over new array (once, outside the loop)
          real = new byte[contents.length+2];
          real[0] = (byte)0xFF;
          real[1] = (byte)0xFE;
          System.arraycopy(contents, 0, real, 2, contents.length);

        }

        if(found >= 2) {
          // the "UTF-16" decoder reads the BOM to pick the byte order
          String asString = new String(real, "UTF-16");
          byte[] newBytes = asString.getBytes("UTF-8");
          FileOutputStream fos = new FileOutputStream(path);
          fos.write(newBytes);
          fos.close();
        }

      } catch(Exception e) {
        e.printStackTrace();
      }
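If you want to play with the heuristic without touching files, here is a small self-contained sketch of the same idea working on byte arrays. The names (Utf16Guess, looksLikeUtf16Le, toUtf8) are invented for this example. One deliberate difference from the code above: instead of tacking a BOM onto the data, it skips any BOM and decodes explicitly as UTF-16LE, which I'm assuming is an acceptable swap since the end result is the same.

```java
import java.nio.charset.StandardCharsets;

class Utf16Guess {
    // True if data starts with the UTF-16LE BOM (FF FE).
    static boolean hasLeBom(byte[] data) {
        return data.length >= 2 && data[0] == (byte) 0xFF && data[1] == (byte) 0xFE;
    }

    // Heuristic from the post: a BOM, or at least 2 null bytes
    // within the first 10 bytes, is treated as UTF-16LE.
    static boolean looksLikeUtf16Le(byte[] data) {
        if (hasLeBom(data)) return true;
        int nulls = 0;
        for (int i = 0; i < Math.min(10, data.length); i++) {
            if (data[i] == 0) nulls++;
        }
        return nulls >= 2;
    }

    // Returns UTF-8 bytes; input that does not look like UTF-16LE is returned as-is.
    static byte[] toUtf8(byte[] data) {
        if (!looksLikeUtf16Le(data)) return data;
        int offset = hasLeBom(data) ? 2 : 0; // skip the BOM so it is not decoded as text
        String text = new String(data, offset, data.length - offset, StandardCharsets.UTF_16LE);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A BOM-less chunk, like the later chunks sent by the file watcher.
        byte[] chunk = "Full Ti".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(new String(toUtf8(chunk), StandardCharsets.UTF_8)); // prints Full Ti
    }
}
```

It remains a guess: mostly-ASCII text in UTF-16LE has a null in every other byte, but a chunk of non-Latin text with few nulls near the start would slip past it.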

END NOTES:
* I found out when writing this article that my WordPress installation could not handle my characters well. The fix is to edit your wp-config.php and comment out the two lines defining DB_CHARSET and DB_COLLATE. It should look like this:

//define('DB_CHARSET', 'utf8');
//define('DB_COLLATE', '');

* My favorite website for finding Unicode characters with associated information is FileFormat.info. Just replace the ’2c8a’ hex with whatever code you want to know about. Also, if you are looking at a character in vim, the command “:ascii” will give you its decimal, hex and octal values.

* The official soundtrack for this particular problem comes from our friend, dumbfounded all the way out in Los Angeles, CA.
