Gossip-based Failure Detectors — 0.8.0 released
We have just released version 0.8.0 of our distlib C++ library that implements a number of popular distributed systems algorithms.
This concludes a first iteration of the “SWIM Gossip Protocol” for failure detection: the implementation is now fully thread-safe and can scale up to many servers with minimal process overhead.
The gossip_example.cpp
toy server shows how a client application can use the library to implement an embedded (or external) failure detector for a large fleet of servers:
auto detector = std::make_shared(
port,
kDefaultUpdateIntervalSec,
kDefaultGracePeriodSec,
ping_timeout_msec,
ping_interval_sec
);
std::for_each(seedNames.begin(), seedNames.end(),
[&](const std::string &name) {
...
detector->AddNeighbor(server);
}
);
...
detector->InitAllBackgroundThreads();
and this is pretty much it.
The detector is now running in background and both reporting the server’s health as well as recording (and gossiping about) the other “neighbors.”
A few of them can be started like this:
$ ./bin/gossip_detector_example --port=8085 \
--seeds=gondor:8086,gondor:8087
and after a little while, all will be reporting about the others’ health:
if any of these becomes unresponsive, it will be reported as such:
the other servers will eventually become aware of the failure via gossip, without necessarily having to directly ping this one server; conversely, if the unresponsive server recovers within a configurable “grace period” (or comes back online after a reboot) it will be re-admitted in the “healthy” list.
Thanks to the use of a Gossip-based protocol, the number of server-to-server communications (and request/response overhead) is O(1)
instead of O(n²)
as with traditional heartbeat protocols and it does not depend on a central master controller (a major Single Point of Failure itself).
However, the time bound on the spread of the information is upper bound to be O(log k)
, where k
is the size of the “gossip ring.”
See the paper for more details and a rigorous analysis of the protocol.
The distlib
library is build and installed as a SHARED
(.so
) library and also contains a number of other utilities, including a Consistent Hash; Merkle Trees; and various utilities.
We are looking for folks willing to contribute to the code (the library is released under the Apache 2 license).
More details can be found on the Github Project Page and I plan to start a series of blog posts describing its implementation.
Originally published at codetrips.com on September 8, 2017.