The “Mainline Kademlia” protocol

Kademlia is a description of the network interactions and RPC calls that can form a distributed hash table. Today, Kademlia-based DHTs are one of the most common forms of distributed storage. However, since Kademlia does not specify an application-level protocol for making those calls, it is implemented on top of existing application protocols, which makes many of the rings incompatible with one another.

I was interested specifically in the Kademlia ring used by the Mainline BitTorrent client. In practice this storage acts as a distributed source of information about torrents, so that new peers can join a swarm even when no tracker is present. The ring is used by µTorrent, the official bittorrent.com client, and the Vuze client with an optional plugin. (Vuze has a native DHT which is incompatible with other systems.)

Unsurprisingly, the protocol for this DHT is not well advertised. The reason is two-fold: first, the amount of space is limited, and it is hard for clients to tell valid data from invalid, so you don’t want to invite arbitrary data into the DHT. Secondly, the questionable legality of many torrents means that it is detrimental for a third party to have an easy means of monitoring the swarm.

The Protocol

This DHT is described in BitTorrent BEP 5. There are three remote procedure calls that need to be implemented for the protocol, and a fourth (ping) that can be used to help with reliability.
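Every message in the protocol is a bencoded dictionary sent as a single UDP datagram (the framing is known as KRPC), so a minimal bencoder is enough to construct any of the four calls. The sketch below is a reduced Python encoder for illustration, not a complete implementation (decoding is omitted):

```python
# Minimal bencode encoder. Bencoding has four types:
# integers, byte strings, lists, and dictionaries.

def bencode(value) -> bytes:
    if isinstance(value, int):
        return b"i%de" % value                      # e.g. i42e
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)       # e.g. 4:spam
    if isinstance(value, str):
        return bencode(value.encode())
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # dictionary keys must be byte strings in sorted order
        items = sorted((k if isinstance(k, bytes) else k.encode(), v)
                       for k, v in value.items())
        return b"d" + b"".join(bencode(k) + bencode(v)
                               for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))
```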

ping

ping is used simply to check that another host in your routing table is still alive. It is sent out periodically, and in fact quite frequently, to keep your view of the DHT ring mostly correct. The ping carries your own ID on the ring, and the response carries the remote host’s ID.
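A ping exchange is small enough to show end to end. The sketch below reuses the bencode helper from the previous snippet; the transaction ID and timeout are arbitrary choices, and a real client would bdecode the reply rather than return it raw.

```python
import os
import socket

MY_ID = os.urandom(20)  # our (random) 20-byte position on the ring

def ping(host: str, port: int) -> bytes:
    """Send a ping query and return the raw bencoded response."""
    query = {
        b"t": b"aa",           # transaction ID, echoed in the response
        b"y": b"q",            # message type: query
        b"q": b"ping",         # method name
        b"a": {b"id": MY_ID},  # arguments: just our own ID
    }
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(5)
    sock.sendto(bencode(query), (host, port))
    response, _ = sock.recvfrom(2048)
    # the remote host's ID sits at response[b"r"][b"id"] once decoded
    return response

# e.g. ping("router.bittorrent.com", 6881)  # a well-known bootstrap node
```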

find_node

This is used to discover new hosts in the DHT by querying known nodes. You specify a point on the ring, and the host will return the 8 closest hosts to that point that it knows of. You must send this request recursively to find the closest possible node to any given point, since that mirrors the procedure new nodes use to join the ring.
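The hosts come back packed into the response’s "nodes" value as consecutive 26-byte entries: a 20-byte ID, a 4-byte IPv4 address, and a 2-byte big-endian port. A rough sketch of the recursive walk, assuming a send_query helper (not shown) that bencodes a query, sends it, and bdecodes the reply, plus MY_ID from the ping sketch:

```python
import socket
import struct

def parse_nodes(packed: bytes):
    """Split a packed 'nodes' value into (node_id, ip, port) tuples."""
    nodes = []
    for off in range(0, len(packed), 26):
        node_id = packed[off:off + 20]
        ip = socket.inet_ntoa(packed[off + 20:off + 24])
        (port,) = struct.unpack("!H", packed[off + 24:off + 26])
        nodes.append((node_id, ip, port))
    return nodes

def find_closest(target: bytes, start_ip: str, start_port: int):
    """Walk the ring toward the node closest to `target`."""
    best = None                      # (distance, node_id, ip, port)
    queue = [(start_ip, start_port)]
    while queue:
        ip, port = queue.pop()
        reply = send_query(ip, port, b"find_node",          # assumed helper
                           {b"id": MY_ID, b"target": target})
        for node_id, n_ip, n_port in parse_nodes(reply[b"r"][b"nodes"]):
            # Kademlia distance is the XOR of the two 160-bit IDs
            dist = int.from_bytes(node_id, "big") ^ int.from_bytes(target, "big")
            if best is None or dist < best[0]:
                best = (dist, node_id, n_ip, n_port)
                queue.append((n_ip, n_port))   # only recurse on improvement
    return best
```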

get_peers

This is the get request of the protocol. You ask a host for the data stored against a 20-byte key. If it has data for that key it will return it, along with a token which you can then use to modify that data.
Data is stored in multiples of 26 bytes: the first 20 bytes represent a position on the ring, the next four are the IPv4 address of that node, and the final two are the port. The claim is that these nodes may be active in the swarm identified by the requested 20-byte key.
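A get_peers query carries your own ID and the 20-byte key (named info_hash in the protocol), and the response carries either stored data or more nodes to try, plus the write token. A sketch under the same assumptions as before, reusing parse_nodes from the find_node sketch:

```python
def get_peers(ip: str, port: int, key: bytes):
    """Ask one host for the data stored against a 20-byte key."""
    reply = send_query(ip, port, b"get_peers",              # assumed helper
                       {b"id": MY_ID, b"info_hash": key})
    r = reply[b"r"]
    token = r[b"token"]                  # needed later to modify the data
    if b"values" in r:
        return token, r[b"values"]       # the stored entries for this key
    return token, parse_nodes(r[b"nodes"])  # no data; closer nodes to try
```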

announce_peer

Using the token from a previous get_peers request, you can add your own 26 bytes of data to the table.
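The announce carries the key, the port you want recorded, and the token proving you recently performed a get_peers against this host. A sketch:

```python
def announce_peer(ip: str, port: int, key: bytes, my_port: int, token: bytes):
    """Add our own entry to the table stored against `key`."""
    return send_query(ip, port, b"announce_peer",           # assumed helper
                      {b"id": MY_ID,
                       b"info_hash": key,
                       b"port": my_port,   # recorded alongside our IP
                       b"token": token})   # from an earlier get_peers
```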

Uses

It turns out that you can use this data to store messages of around 1MB. To do this, you choose a key for the message, find the closest node to that key, and then execute a string of alternating get_peers and announce_peer requests, using the port number as an ordering index for your messages and the 20-byte ID fields to store the actual data. The primary problem is that recovery is not guaranteed: each get_peers request returns only a limited number of peers, and depending on the method used, data will either be lost (all but the last few messages are dropped from the DHT) or returned in random order.
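The size limit falls out of the arithmetic: the two-byte port field gives 2^16 = 65,536 ordered slots of 20 bytes each, around 1.3MB in total. A rough sketch of the scheme as described, building on the helpers above; whether a given client accepts announces whose ID changes on every request is an assumption this approach leans on:

```python
def store_message(ip: str, port: int, key: bytes, message: bytes):
    """Write `message` against `key`, 20 bytes per announce."""
    for seq, off in enumerate(range(0, len(message), 20)):
        chunk = message[off:off + 20].ljust(20, b"\x00")
        # each get_peers refreshes our write token for the key
        token, _ = get_peers(ip, port, key)
        send_query(ip, port, b"announce_peer",              # assumed helper
                   {b"id": chunk,        # data smuggled into the ID field
                    b"info_hash": key,
                    b"port": seq,        # ordering index, not a real port
                    b"token": token})
```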

A more sophisticated storage method could get around this problem. My first inclination would be to set up a system which jumps keys fairly frequently, storing at most 4 to 8 messages under each key along with a pointer to the next key. This method would also remove the size limitation of the naive implementation, but would be more susceptible to data loss.
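One hypothetical shape for that design, reserving the final 20-byte slot under each key as the pointer to the next key; the SHA-1 chain used to derive successive keys is an invented detail for illustration, and store_message is the sketch above:

```python
import hashlib

MESSAGES_PER_KEY = 8  # stay well under what one get_peers returns

def store_chained(ip: str, port: int, first_key: bytes, message: bytes):
    """Spread a long message over a chain of keys."""
    span = 20 * MESSAGES_PER_KEY
    key = first_key
    for off in range(0, len(message), span):
        next_key = hashlib.sha1(key).digest()  # invented: deterministic next hop
        # the final 20-byte slot under each key points at the next key
        store_message(ip, port, key, message[off:off + span] + next_key)
        key = next_key
```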