Redis: Another Hash db FTW

Posted by feydr | Posted in Uncategorized | Posted on 24-05-2009

View Comments

So I have to admit — I kinda have a hardon for key-value stores, also known as hash dbs. Why? Cause they are hashes which is the data strucuture I’ll use right before deciding to make something a class and they are incredibly fast.

Relational databases are about fucking retarded when it comes to storing data. The whole idea here is to persist data structures in a manner that is efficient right? If you had the option of storing already ordered data in an un-ordered fashion or storing it in the order it is already in what choice would you choose? The more acute observers amongst you would probably say O(1) versus O(n) is the right choice and you’d be correct.

Sometimes mysql just does not cut it when it comes to storing data. Actually, let’s not kid ourselves, mysql is awfully slow. If it weren’t for the need of persistence mysql probably wouldn’t be one of the inherent staples of web applications. I don’t know about you but I can usually max out my mysql with around 10-11k tx/second IF I have upped the ram limit.

So how fast is redis? Well, for list operations I was doing 48k tx/second — very noice!
Want to install it? Lets!
(notice: the rake task auto dls and installs the redis c for you — why?? cause ruby hackers are hot like that)

git clone git://github.com/ezmobius/redis-rb.git
rake redis:install
# start it up
redis-server

now let’s check out a simple SET/GET

cyn0n@spicetrader:~/work/$ telnet 127.0.0.1 6379
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
SET foo 3
bar
+OK
GET foo
$3
bar
^]
telnet> quit

This is your pretty stereotypical hash storage. We tell the server that we are giving it a variable of 3 bytes. We terminate that line with ‘\r\n’ then we give it the data which is ‘bar’ and terminate it again with ‘\r\n’.
The server gives us a response saying we are good to go.
Now we tell the server we want our variable back. The server responds with a ‘$’ stating that data is to follow up to the next newline/carriage return. Then it proceeds to give us our requested data. Very straightforward.

Benchmarking
Ahh, this is what everyone comes for huh? So my benching wasn’t as awesome as tokyo cabinet was (however, keep in mind that redis comes with a built-in networking layer so it’s a lot faster than tokyo tyrant is)

Here they are using the included benchmarking boilerplate from ezra:

                            user     system      total        real
set                    33.300000   2.980000  36.280000 ( 37.641193)
set (pipelined)        14.390000   1.350000  15.740000 ( 26.550251)
push+trim              36.520000   2.520000  39.040000 ( 40.420952)
push+trim (pipelined)   0.410000   0.040000   0.450000 ( 10.630631)

which as you can see result in the following:

# sets
set                           606 tx/sec
set (pipelined)               1428 tx/sec
# lists
push+trim                     555 tx/sec
push+trim (pipelined)         48780 tx/sec

obviously sets being unordered do not have O(1) as our lists do, however it makes them useful for things such as storing state

Features of Redis

  • keys can store diff. data type rather than just strings
  • master-slave replication (can move databases with a simple cp)
  • persists data asynchronously versus tokyo cabinet that does this synchronously and memcache that does not at all
  • sets are tables of hash keys
  • can perform numerous server-side operations on sets such as set-intersections, unions, subtractions, and additions
  • push head/tail O(1), range of lists
  • list trims to implement circular buffers (ala rrdtool)

Another useful thing is that most (all?) primative operations are atomic such as the PUSH/POP/INRC/DECRs. What does this mean? This means that there is NO NEED for locking algorithims satisfying the A in ACID compliance and one main reason for the speed that is witnessed. There is more than one way to skin a cat and redis having dropped the requirement of having to lock is in my opinion an excellent choice!

Unlike tokyo cabinet, redis comes with a built-in networking layer which is much much faster than tokyo tyrant (the tokyo cabinet networking layer). I assume that the creators of tokyo cabinet were under the impression that you’d just embed the c lib into whatever program you wanted as the tyrant seems to have come as an afterthought.

Since data is written synchronously concerns of data loss pop up but since there is a chance you might lose a couple of keys here or there, however with the master/slave replication this defends against this pretty well.

If you look at redis you’ll see that another performance boost comes from the fact that certain objs are put into a free list and then recycled later on when needed negating the need to allocate/de-allocate system resources which ties up cpu — very smart indeed.

bullshit
Memory usage is another point of contention when it comes to concerns about hash databases. While this is a valid concern it really amounts to a bullshit argument by someone who is taking facts out of context. Yes, it is true that you are unable to store more data at any given time in the store as you have ram. This is shocking to some people surprisingly, however it is not uncommon for a bottom-shelf production server to have 16gigs of ram to start out with and that’s only one server — NOT partitioning. If it’s a simple app that you wrote for yourself on a development box I am going to guess you don’t need more than the 2 or 3 gig anyways — if you do I’d suggest looking into getting real hardware instead of your laptop before you start bitching about performance.

As an after-note to this if you are not going to heed this advice do yourself a favor and perform this:

echo 1 > /proc/sys/vm/overcommit_memory

What this does is tell linux to allow you to use more memory then you actually have. Whenever redis saves, it forks the process over using twice the amount of memory. Although, apparently this never really happens and the memory is shared. Anyways, if you find yourself doing this quite a lot this is a warning to upgrade your hardware — as if the sluggish connections wouldn’t have been a clue to begin with.

More Reading
If you wish to do more research on redis ezra gave a lightning talk at the MountainWest Ruby Conf
here and antirez has done a pretty damn good job of documenting his code. His comments are also very good considering if he had wanted to do he could’ve named all of his vars and written his comments in Italian. (on that note I know the Italian hacker community is known for it’s hackerspaces and badass hacker conferences)

Drop the relational and enjoy your scalable, speed-demon database!

Drop Boms Not Bombs

Posted by feydr | Posted in Uncategorized | Posted on 12-05-2009

View Comments

So I come in Monday morning one fine week after a long drunken hiatus to find a ‘badhand’ in my inbox. A badhand is a feature I implemented on one of the sites we are developing so that users can inform me that my parser did not correctly produce the expected xml. This saves me time and the user frustration.

Dragon Book Except, this badhand was a lot like some others I had received — it had a special 2-byte character that looked like ‘‘ in vim. I knew right away that something was wrong and somewhere somehow my unlucky user had formatted this text into a text-editor such as wordpad even though he defiantly shouted back that he had not — he uses a Mac after all.

So I started my quest and found out that eventually this special character was called a ‘BOM’ or a ‘byte order mark’ indicating to a text editor what encoding to use for unicode. Well, after further searching I found the ‘best answer’ on stackoverflow (as usual):

There’s no such thing as a UTF-8 BOM. A UTF-8 file is in a predefined byte order already, adding a BOM prefix is completely pointless. Applications that produce a UTF-8-encoded file with a U+FEFF character at the beginning are wrong.

Hrm… I thought for a second here cause I already take care of special utf8 characters in my parser — among them are variations of the em-dash, euro, and yen. Also, apparently the BOM is only used as the first 2-3 bytes at the top of a text file. I had text in my sample files that were on almost every other line.

Shit! I had been screwed by the powers that be! My lean mean parser was about to pick up an ugly method that would slow it down considerably.
It was time to ride the BOM.


300px-slim-pickens_riding-the-bomb_enh-lores

At first I thought — ok all I need to do was scan for the BOM (which would be simple — 0xfe 0xff) and remove. Then I remembered — most of my files that come through my webserver are being passed through ruby’s File.open method — NOT an unicode safe library call! Shit! For a language that was made in Japan you’d think ruby has decent unicode support — say what you will — ruby treats ALL strings as 1byte char star arrays — multi-byte unicode is not allowed. This threw me into a pickle. IRC was of no help — usually I’ve learnt that when you don’t receive an answer on IRC it’s because a) you are asking dumb questions or b) no one knows the answer to.

I decided I must pull out IRB and just start hacking until something comes.

This is what I came up with:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
  # translate unicode to latin
  # and drop the BOM
  def unicodeTolatin(byteray)
 
    #flip to unicode
    unistring = byteray.each_byte.map do |p|
      [p].pack('U')
    end
 
    # drop any null bytes
    blah = []
    unistring.each do |o|
      if !o.eql? "\000" then
        blah << o
      end
    end
 
    #convert array byte array over to string
    newstr = blah.to_s
 
    # drop our bom
    newstr = newstr.gsub(/\303\277\303\276/, '')
  end

If any of you ever have to struggle with the fucking bom — remember DROP BOMS NOT BOMBS!