Share

libketama - a consistent hashing algo for memcache clients

10 Apr 2007, 17:36

We wrote ketama to replace how our memcached clients mapped keys to servers. Previously, clients mapped keys->servers like this:

server = serverlist[hash(key)%serverlist.length];

This meant that whenever we added or removed servers from the pool, everything hashed to different servers, which effectively wiped the entire cache. We add (and sometimes remove) servers from the memcached pool often enough to warrant writing this - if your memcached pool never changes, you can probably stop reading now :)

Ketama is an implementation of a consistent hashing algorithm, meaning you can add or remove servers from the memcached pool without causing a complete remap of all keys.

Here's how it works:

* Take your list of servers (eg: 1.2.3.4:11211, 5.6.7.8:11211, 9.8.7.6:11211)
* Hash each server string to several (100-200) unsigned ints
* Conceptually, these numbers are placed on a circle called the continuum. (imagine a clock face that goes from 0 to 2^32)
* Each number links to the server it was hashed from, so servers appear at several points on the continuum, by each of the numbers they hashed to.
* To map a key->server, hash your key to a single unsigned int, and find the next biggest number on the continuum. The server linked to that number is the correct server for that key.
* If you hash your key to a value near 2^32 and there are no points on the continuum greater than your hash, return the first server in the continuum.

If you then add or remove a server from the list, only a small proportion of keys end up mapping to different servers.


The majority of the code is a C library (libketama) and a php4 extension that wraps it. I've also included a class from our Java client. (Java Collections makes it rather easy). We use a single-server memcache client wrapped with a native php class to make it multi-server capable, so we just replaced the hashing method with a ketama_find_server call. (should be easy enough to plug this into libmemcache if need be)

http://static.last.fm/rj/ketama.tar.bz2
http://static.last.fm/ketama/ketama-0.1.1.tar.bz2
svn://svn.audioscrobbler.net/misc/ketama/

We've been using this in production for all our php installs and java services at Last.fm for around 10 days now. We deployed it just in time to smooth over moving loads of webservers between datacenters.

Comments

  • Russ wrote:
    10 Apr 2007, 18:18

    View Profile | Leave Russ a shout

  • RJ wrote:
    10 Apr 2007, 18:28
    I should also reference some perl code (that i think came from Brad) which i've lost all trace of :/

    View Profile | Leave RJ a shout

  • kryton02 wrote:
    11 Apr 2007, 00:24
    I've never heard of consistent hashing before.

    What a fantastic idea

    View Profile | Leave kryton02 a shout

  • muesli wrote:
    11 Apr 2007, 05:05
    Please note: I just updated the tarball, so first of all please redownload it from here: http://static.last.fm/ketama/ketama-0.1.1.tar.bz2

    Installation
    ============

    * libketama (the general purpose C library)

    $ cd libketama
    $ make
    $ su -c make install

    This will compile libketama and install it to the default prefix /usr/local.
    You can change the prefix by editing the PREFIX variable in 'Makefile'.

    * php_ketama (PHP4 extension that wraps libketama and therefore depends on it)

    $ cd php-4.4.x/ext
    $ ln -s /your/ketama/php_ketama ketama
    $ cd ..
    $ rm -Rf autom4te.cache
    $ ./buildconf --force
    $ ./configure --all_your_configure_options --with-ketama[=/your/ketama/prefix]
    $ make
    $ su -c make install

    Don't forget you might have to restart your httpd!

    cheers,
    muesli

    View Profile | Leave muesli a shout

  • brokenbeat wrote:
    13 Apr 2007, 10:14
    that looks like an algorithm i used (also from bit torrent) for distributed file parts called CHORD.

    the files are taken to a circular hash just like this one, then each client has a part of the hash and the files placed on that part, he can search forward or jump fixed sizes looking for files (1/2, 1/4 hash, etc).

    http://pdos.csail.mit.edu/chord/

    but in this case all the list is known and kept at the same place :)

    View Profile | Leave brokenbeat a shout

  • sickdm wrote:
    15 Apr 2007, 23:40
    Kickass, you hackers! :)

    -Anthony @ hype machine

    View Profile | Leave sickdm a shout

  • rschuil wrote:
    20 Jul 2007, 09:42
    First of all: great work! I've been playing with Chord and Kademlia in the past and loved to see your library for memcached. I suggest that you make it a project on SF or so, so people can join to continue it's development.

    Ketama is also nice to use in combination with MySQL slaves. Sure, you can do load balancing using e.g. LVS, but as the number of slaves grow, your query cache efficiency drops (query cache entries are not shared between mysql slaves. It gets less and less likely that a specific query was already in that slave's cache.)
    You could hash the SELECT query to find your preferred MySQL slave to connect to. All requests for the same query would be routed to the same MySQL slave, thus resulting in a more efficient use of the query cache.
    If a slave dies, it will pick the next greater value from the continuum and find another slave to connect to.
    Just an idea :)

    View Profile | Leave rschuil a shout

  • RJ wrote:
    20 Jul 2007, 11:06
    Yup - you could use a consistent hash like ketama to improve various things. We also considered using it to balance internal webservice requests.

    I will stick it on SF when i get a chance. In the meantime, i will happily review and commit patches against our public SVN of the project: svn://svn.audioscrobbler.net/misc/ketama/

    View Profile | Leave RJ a shout

  • rschuil wrote:
    23 Jul 2007, 09:40
    Just wondering: it looks to me that the extension currently returns only a single node. What if the node is unreachable? How can you failover to another node?
    I guess that you'd have to increment server->point by one and call ketama_get_server() again to find the next node on the continuum?

    View Profile | Leave rschuil a shout

  • RJ wrote:
    23 Jul 2007, 10:30
    Yeah that's the right approach, but not implemented - we don't do failover between nodes because of the possibility of stale cached data, we would just hit the DB if a node is unresponsive.
    We have a monitoring daemon that checks all the nodes and pushes out updated ketama.servers files to all the webnodes should a memcache server go down.

    View Profile | Leave RJ a shout

  • rschuil wrote:
    23 Jul 2007, 10:59
    we don't do failover between nodes because of the possibility of stale cached data
    If you push out an updated ketama.servers, it would have the same effect and thus the same problem, wouldn't it?
    It would be nice if ketama_get_servers would return e.g. three candidate nodes instead of one. If one doesn't respond, simply try the next one and the next one. If all three nodes fail to connect, hit the db.

    View Profile | Leave rschuil a shout

  • RJ wrote:
    23 Jul 2007, 11:13
    If you are centrally controlling when to add and remove servers from the pool, you can do a flush_all on servers that you re-add, to avoid the stale data issue.
    It should be easy enough to pass get_servers a parameter indicating the number of candidates you want, and have it return an array.

    View Profile | Leave RJ a shout

  • ludvigericson wrote:
    25 Jul 2007, 15:56
    Hello RJ, just would like you to know this,
    I've got a fix for the bug you mention in ketama/TODO:
    * Invalid sever definition file causes ketama_roll to crash.

    And we're developing a Python extension wrapped around libketama. You'll be hearing from us when it's ready for release, along with aforementioned bug-fix.

    Thank you for a great library.

    View Profile | Leave ludvigericson a shout

  • RJ wrote:
    25 Jul 2007, 16:01
    Great - if you have patches against the svn feel free to send them over. When your python lib is done i will gladly add it to the ketama page http://www.audioscrobbler.net/development/ketama/


    Currently top of my ketama wishlist is adding ketama support to http://pecl.php.net/package/memcache - hopefully i'll get a chance to look at that in the next few weeks.

    View Profile | Leave RJ a shout

  • rschuil wrote:
    27 Jul 2007, 08:16
    Hi RJ, I have trouble doing a checkout from your subversion repository. Using svn:// my client claims that your server does not respond, http:// doesn't seem to be supported by your server. Could you please verify that it is working externally?

    View Profile | Leave rschuil a shout

  • rschuil wrote:
    27 Jul 2007, 08:21
    Grmbl. From a different location the checkout does work - port 3690 must be blocked by the firewall here.

    View Profile | Leave rschuil a shout

  • rschuil wrote:
    27 Jul 2007, 13:49
    FYI: Just compiled libketama with FLV-1 hash instead of MD5. It is about twice as fast now.
    I'll contribute my patch to the repository as soon as I'm done testing

    View Profile | Leave rschuil a shout

  • rschuil wrote:
    27 Jul 2007, 15:03
    I analyzed the output of ketama_test for both hash functions:

    Number of duplicate keys
    MD5: 141
    FNV-1a: 4

    Execution time
    MD5: 0m2.990s
    FNV-1a: 0m1.290s

    Distribution among servers is the same, if not slightly better.

    View Profile | Leave rschuil a shout

  • RJ wrote:
    27 Jul 2007, 15:21
    Cool :)
    We would have a hard time changing, as we'd need it working and tested in java too.. It could certainly be made an option in libketama tho.

    View Profile | Leave RJ a shout

  • ludvigericson wrote:
    29 Jul 2007, 18:09
    I'd say LOOKUP3 looks like a more proper algorithm to use, MD5 is of course a bit shaky since it's made for _secure_ hashing in mind, not diversity & speed.

    Anyway, I'd say it's better if the Java class could utilize libketama.so as well, so as to tie it all down to one place.

    View Profile | Leave ludvigericson a shout

  • RJ wrote:
    1 Aug 2007, 11:28
    Yeah i just used md5 because it was convenient..
    I've put the patch from rschuil into a patches folder in SVN (svn.audioscrobbler.net) for now.

    My initial testing shows it's quite a bit faster - a 3.7s test reduced to 2.0s when using FNV. Distribution looks about the same.

    Will post an update once i have tidied up the lib and merged in a couple more changes i have lying around.

    View Profile | Leave RJ a shout

  • RJ wrote:
    3 Aug 2007, 19:16
    public svn updated, license changed to BSD, some style and other patches from blogg.se folks - python patch imminent, still testing fnv one. off to the pub ;)

    View Profile | Leave RJ a shout

  • psz wrote:
    17 Oct 2007, 12:25
    How does libketama relate to consistent hashing introduced recently in pecl/memcache?

    View Profile | Leave psz a shout

  • RJ wrote:
    17 Oct 2007, 15:36
    similar, but ketama uses md5 instead of crc32. I've not yet had a chance to test the new branch of pecl memcache cvs.

    there is a python and java implementation of the ketama algorithm in our SVN. So compatibility with non-php languages is the only reason to use it over the pecl branch really - at least until it is integrated into a memcache extension.

    View Profile | Leave RJ a shout

  • ludvigericson wrote:
    17 May 2008, 11:44
    Though, I don't see why you'd use CRC32 for something that isn't data integrity checking. CRC32 is designed to give distant checksums only for two *similar* inputs, I'll try to explain better:

    When f is CRC32,
    f(AAA) = 9839781 (not true, example)
    f(AAB) = 2387523 ( ^ )
    But,
    f(ZZZ) = 9839782 (same)

    What I'm trying to say is that CRCs are designed to detect small changes. MD5 isn't ideal either, but it's designed to have good distribution all-over. CRC32 isn't.

    Using a checksum as hashing is just backwards, IMHO. Again, MD5 isn't optimal either, but it's a better choice.

    View Profile | Leave ludvigericson a shout

See all 32 comments
Leave a comment. Log in to Last.fm or sign up (it’s free).