HBase to the rescue

Posted by feydr | Posted in Uncategorized | Posted on 10-08-2009


Beginnings (circa 1 year ago)
I had finally grabbed a schema dump of how their ‘wonderful’ database worked. I was so elated that we would not have to re-invent the wheel. This dump was actually a fairly nice looking pdf describing in detail their 60-column-wide, 60+ table design. To say the least my mouth dropped and I shit a brick, 'cause this was for ONE USER on a desktop and we needed it to scale to hundreds of thousands if not millions of users. They were using postgresql and could do 20 hands/second. We knew we had some problems ahead of us.

The first problem we noticed right away was that our parsing was not up to speed. After about a year of dicking around in different languages (c++, java, ruby) we settled on java with antlr and are now pushing over 800 hands/second with no summarizing, 300 hands/second with summarizing and 80 hands/second when stuffing rows into mysql. This clued us into the fact that if we wanted to go faster without having to ‘scale up’ we’d need other alternatives.

Of course what use is the speed of our database if we can’t use it after 100 players start hitting our site every day? One user alone could generate over 200,000 rows using the competition’s schema within one day, sometimes within an hour! This was the real concern as we slowly realized that one database to rule them all would never cut it. It’s true we talked about sharding and the like for, oh say, a couple of minutes. We spent more time on the phone with TerraCotta and Vertica than we’d like to admit.
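To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes every user is as heavy as the 200,000 rows/day worst case above, which is pessimistic, and the constant and method names are made up for illustration.

```ruby
# Back-of-the-envelope row growth using the 200,000 rows/user/day figure.
# Assumption: every user is a worst-case user. Names here are illustrative only.
ROWS_PER_USER_PER_DAY = 200_000

def rows_per_day(users)
  users * ROWS_PER_USER_PER_DAY
end

[100, 100_000, 1_000_000].each do |users|
  puts "#{users} users -> #{rows_per_day(users)} rows/day"
end
```

Even the 100 player case lands at 20 million rows a day, which is why a single box was never going to cut it.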

Ever since early spring I have been on a key-value storage kick, yet those are only front line defenses — you need something in the background that can kick some ass — that little piece of enterprise software is HBase.

HBase?! What the fuck is that?
HBase is, in short, awesome! Awesome like alligators artfully eating elephants awesome. HBase is the ‘database’ layer of a typical Hadoop stack. It sits on top of HDFS, which is a distributed filesystem. Its main competition is Hypertable, written in C++ (which claims to be faster, yet my benches speak otherwise), and BigTable.

So what is it really? Well, to quote Michael Stack, it structures data as tables of ‘column-oriented rows’ which can scale to billions of rows and millions of columns with thousands of versions — can your mysql do that? There are no joins and no transactions. For you paranoid ACID heads out there the lack of transactions is not worthy of an argument — row updates are atomic.
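If ‘column-oriented rows’ with versions sounds abstract, here is a toy sketch in plain Ruby. It is not HBase code: ToyTable and its methods are hypothetical names, and it only mimics the shape of the data (row key, then 'family:qualifier' column, then timestamp, then value).

```ruby
# Toy model of an HBase-style table: row key -> column -> timestamp -> value.
# ToyTable and its methods are hypothetical names, not part of any HBase API.
class ToyTable
  def initialize
    # Missing rows and columns spring into existence on first write.
    @rows = Hash.new { |rows, key| rows[key] = Hash.new { |cols, col| cols[col] = {} } }
  end

  # Store one cell version; columns are named 'family:qualifier'.
  def put(row, column, value, timestamp)
    @rows[row][column][timestamp] = value
    self
  end

  # Return the newest version of a cell, or nil if nothing is stored there.
  def get(row, column)
    return nil unless @rows.key?(row) && @rows[row].key?(column)
    cell_versions = @rows[row][column]
    cell_versions[cell_versions.keys.max]
  end

  # Return every stored version of a cell, newest first.
  def versions(row, column)
    return [] unless @rows.key?(row) && @rows[row].key?(column)
    @rows[row][column].sort.reverse.map { |_ts, value| value }
  end
end

table = ToyTable.new
table.put('myveryfirstrow', 'col1:mine', 'elf spice', 1)
table.put('myveryfirstrow', 'col1:mine', 'more elf spice', 2)
puts table.get('myveryfirstrow', 'col1:mine')
```

A plain get hands back the newest version, while the older ones stay around until you ask for them. The real thing keeps all of this sorted and spread across HDFS.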

There is some bullshit tossed around regarding HBase — the whole concept of living on ‘commodity hardware’ is a bit of a joke considering they suggest 4-8 cores with 6-8 gigs of ram to get started. To be fair though if you are involved in large projects a production server housing oracle or even mysql can easily start out with 32 gigs of ram — so it’s not that much of a joke.

I have not done extensive benchmarking with ab yet, but it appears that it will serve our needs directly from our ruby thrift code on a production site. When the time comes we plan on implementing front-side key-value caching.
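For what front-side key-value caching could look like, here is a minimal read-through sketch. Everything in it is an assumption for illustration: ReadThroughCache is a made-up class, the Hash stands in for a real key-value store, and the block stands in for an HBase read over thrift.

```ruby
# Read-through cache sketch. ReadThroughCache is a hypothetical name; the Hash
# stands in for a real key-value store and the block stands in for an HBase read.
class ReadThroughCache
  attr_reader :misses

  def initialize(&loader)
    @loader = loader
    @store = {}
    @misses = 0
  end

  # Serve from the cache when possible; otherwise hit the backend once and remember.
  def fetch(key)
    return @store[key] if @store.key?(key)
    @misses += 1
    @store[key] = @loader.call(key)
  end

  # Drop a key after the backing row changes so the next read is fresh.
  def invalidate(key)
    @store.delete(key)
  end
end

backend = { 'myveryfirstrow' => 'elf spice' } # pretend this is HBase
cache = ReadThroughCache.new { |key| backend[key] }
cache.fetch('myveryfirstrow')
cache.fetch('myveryfirstrow')
puts cache.misses
```

The second fetch never touches the backend, which is the whole point: hot rows get served from the front line and HBase only sees the misses.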

Installing and using it is not a pain but it is not straightforward either.

Let’s get started
First we need to install hbase proper:

git clone http://git.apache.org/hbase.git/
# of course change this to wherever your jvm sits
export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.14/jre/
cd hbase; ant

Thrift is a popular framework for cross-language services. This allows us to access hbase from languages like ruby, haskell and ocaml. It usually requires some libs that are not available on a virgin install, the most notable of which are the boost libs.

sudo apt-get install autoconf libtool libboost-dev g++ \
sun-java6-jdk ant flex bison pkg-config libevent-dev \
ruby-dev zlib1g-dev
sudo ln -s /usr/include/boost/ /usr/local/include/boost-1_34_1

Grab thrift:

wget -O thrift.tgz "http://gitweb.thrift-rpc.org/?p=thrift.git;a=snapshot;h=HEAD;sf=tgz"
tar -xzf thrift.tgz
cd thrift; ./bootstrap.sh; ./configure; make; sudo make install

Since ruby is my language of choice for fast development, let’s install the thrift gem:
(I really have no clue why you need mongrel for this. I haven’t dug in too deep.)

gem install mongrel echoe --no-ri --no-rdoc
cd ~/thrift/lib/rb
rake gem
sudo gem install pkg/thrift-0.1.0.gem --no-ri --no-rdoc

Let’s check out our shell:

$ bin/hbase shell
 
# create a table
>create 'treasureChest', 'col1', 'col2'
 
# stuff 2 columns into a row into the table
>put 'treasureChest', 'myveryfirstrow', 'col1:notmine', 'the queens underwear'
>put 'treasureChest', 'myveryfirstrow', 'col1:mine', 'elf spice'
 
# scan the whole table (our one row comes back)
>scan 'treasureChest'
 
# explicitly request the notmine stuff
>get 'treasureChest', 'myveryfirstrow', {COLUMNS => 'col1:notmine'}
 
# disable/drop the table
>disable 'treasureChest'
>drop 'treasureChest'

My Patches:
As of 0.20, HBase could no longer scan across start/stop timestamps through the thrift interface, which doesn’t even compare to what the native java stuff can do. This, as it turns out, is incredibly useful to have. So I spent ~2 hours going through what was most assuredly Proper Java using Eclipse for Enterprise Software Development(tm).

I could straight shoot someone if I saw one more getter/setter pair. If you think I’m full of shit please read this (http://www.pragprog.com/articles/tell-dont-ask). I mean, reading through these 1000 line classes is really bad for your eyes and it is BAD PROGRAMMING.

The end result was a patched thrift server that supports scanning on start/stop timestamps. I provide the ruby thrift code for convenience — you still have to re-compile with ant to get your thrift server working.

git clone git://github.com/feydr/hrb.git
 
# to apply the patch cp the patches to: $HBASE_HOME/src/java/org/apache/hadoop/hbase/thrift
 
patch <ThriftServer.java.patch
patch <Hbase.thrift.patch
 
# then apply this last one in: $HBASE_HOME/src/java/org/apache/hadoop/hbase/thrift/generated
patch <Hbase.java.patch
 
# then just do a:
export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.14/jre/
ant clean
ant
 
cd hrb/hrb-ng; gem build hrb-ng.gemspec
sudo gem install hrb-ng --no-ri --no-rdoc

Run it:

require 'rubygems'
require 'hrb-ng'
 
transport = Thrift::BufferedTransport.new(Thrift::Socket.new('127.0.0.1', 9090))
protocol = Thrift::BinaryProtocol.new(transport)
client = Apache::Hadoop::Hbase::Thrift::Hbase::Client.new(protocol)
transport.open()
 
scanner = client.scannerOpenWithStopStartTs("mytable", "myrow.0", "myrow.1", ["mycol:"], 1249677001723, 1249677009731)
blah = client.scannerGet(scanner)
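Those two big numbers are HBase timestamps, which are milliseconds since the Unix epoch. Here is a small sketch for going back and forth between them and Ruby Time objects. The helper names to_hbase_ts and from_hbase_ts are mine, not part of thrift or hrb-ng.

```ruby
# HBase cell timestamps are milliseconds since the Unix epoch.
# to_hbase_ts and from_hbase_ts are hypothetical helper names.
def to_hbase_ts(time)
  # Integer math avoids float rounding on the millisecond part.
  time.to_i * 1000 + time.usec / 1000
end

def from_hbase_ts(millis)
  # Time.at takes whole seconds plus microseconds.
  Time.at(millis / 1000, (millis % 1000) * 1000).utc
end

start_ts = 1249677001723
stop_ts  = 1249677009731
puts from_hbase_ts(start_ts)
puts stop_ts - start_ts
```

So the window in the example call is only about eight seconds wide, sometime in August 2009.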

Note: This patch is probably only necessary until the thrift interface is re-written. This code exists in the native java client but NOT in thrift as of yet.

If you look at the source of scannerOpenWithStopStartTs you’ll see all I did was copy an existing scanner method and set the start/stop timestamps from the passed args; everything else is thankfully already in place.
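If it helps to see what a row-range scan with a timestamp window boils down to, here is a toy in-memory version. This is not the patch itself: Cell and scan_with_ts are hypothetical names, and treating both the row range and the timestamp range as half-open (start included, stop excluded) is an assumption of this sketch.

```ruby
# Toy version of a row-range scan restricted to a timestamp window.
# Cell and scan_with_ts are hypothetical names; the half-open ranges
# ([start, stop) for rows and timestamps) are an assumption of this sketch.
Cell = Struct.new(:row, :column, :timestamp, :value)

def scan_with_ts(cells, start_row, stop_row, families, min_ts, max_ts)
  matches = cells.select do |cell|
    cell.row >= start_row && cell.row < stop_row &&
      families.any? { |family| cell.column.start_with?(family) } &&
      cell.timestamp >= min_ts && cell.timestamp < max_ts
  end
  # Rows and columns in order, newest version first within a column.
  matches.sort_by { |cell| [cell.row, cell.column, -cell.timestamp] }
end

cells = [
  Cell.new('myrow.0', 'mycol:a', 1249677001000, 'too old'),
  Cell.new('myrow.0', 'mycol:a', 1249677005000, 'in window'),
  Cell.new('myrow.0', 'other:a', 1249677005000, 'wrong family'),
  Cell.new('myrow.1', 'mycol:a', 1249677005000, 'past the stop row')
]

hits = scan_with_ts(cells, 'myrow.0', 'myrow.1', ['mycol:'], 1249677001723, 1249677009731)
hits.each { |cell| puts "#{cell.row} #{cell.column} #{cell.value}" }
```

Only the cell that sits inside the row range, the column family and the timestamp window survives, which is exactly why this kind of scan is so handy for time-sliced data.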

Generating new ruby Thrift Code:

mkdir ~/hrb
thrift --gen rb -o ~/hrb /opt/hbase2/src/java/org/apache/hadoop/hbase/thrift/Hbase.thrift

And remember kids — “All your HBase are belong to us!”

  • Matt
    >> I could straight shoot someone if I saw one more getter/setter pair. If you think I’m full of shit please read this (http://www.pragprog.com/articles/tell-dont-ask)

    That article speaks specifically about retaining encapsulation, and is COMPLETELY THE OPPOSITE of what you were trying to state.