Published: 2011-06-21

Tagged: availability, fault-tolerance, nosql, probability theory, reliability

My forthcoming talk at Jazoon 2011 features a brief discussion on reliability (or fault tolerance if you like) of a cluster of machines. I expand on that in this post.

Reliability and Fault-tolerance of NoSQL Systems

Reading this posts requires some basic knowledge in probability theory. Consult "Grinstead and Snell's Introduction to Probability" e.g.; it is available online.

Assumptions

We start out by making some assumptions on the reliability of hardware. The given values are examples as I don't have any real values to share. This discussion is about how to arrive at a certain result.

	Commodity Hardware	Rackmount Server	SAN
expected failure	1 day in 1 year	1 day in 5 years	1 day in 50 years
probability of failure p_f	0.00273973	0.00054794	0.00005479
reliability p_r	0.99726	0.999452	0.999945

There are two possible outcomes when evaluating one particular day in the lifetime of a unit: it either fails, or it does not. We denote the probability of failure with p_f, and the probability that the unit stands with p_r. Since we only have these two events, the relation p_r = 1 − p_f must be true. Think of p_r as the probability that the systems runs or equivalently as the reliability.

To given an example we consider the case of one expected failure in one year. We derive p_f = 1/365 = 0.00273973 and the remaining numbers follow accordingly.

Fault-tolerance of a traditional system

Our imaginary traditional/vertial system is composed of one SAN, one rackmount database, and one rackmount compute-server. The system as a whole is available if and only if all three components are running. This translates into

p_r^(vertical) = p_r^(SAN) ⋅ (p_r^(rackmount))² = 0.999452 .

Expressed in other terms, the systems fails every 2.38 years in expectation.

Fault-tolerance of a sharted system

Let us consider a sharted system as depicted in the following figure.

shart architechture

We compute the reliability of one shard first. A shard is a system of 3 units of commodity hardware. The shard fails if and only if all three systems fail. Hence, we can compute the reliability by

p_r^(shard) = 1 − (p_f^(Commodity))³.

To compute the total reliabilty for n shards we have to employ

p_r^{(shard − storage)} = (p_r^(shard))ⁿ .

Even if we assume that there are 100 such shards we arrive with an expected failure only every 1332 years in expectation!

Thus, the weak link in such a composition is not the shards but the server which maintain the data distribution (named sharder in the figure) and the access points (named router in the figure) of the composition. They act as a single point of failure. If one wishes to compute the reliability, we can proceed as with the vertical system by replacing the SAN with our sharded storage. The numbers will be roughly the same as the single points of failure dominate the result. The reliability will improve largely if they are backed up with a fail-over system as indicated in the figure.

Fault-tolerance of a nothing shared architecture

We start out from a different model when considering nothing shared architectures. We assume that there is a number of r systems required to finish the computation in a timely manner. We also assume that once one system fails the data and computation is seamlessly overtaken by another system. This is a reasonable assumption given that the size-distribution of the values behaves well. The question is how many systems n are required to achieve a certain reliability p_r^(total). The result is given by the following equation

$$ p_r^{(total)} = \sum_{k=r}^n {n \choose k} \; p^{n-k}_{f} \; p_{r}^k $$

where p_f (respectively p_r) is the probability of failure (respectively reliability) of a unit.

Let us try to understand the given equation intuitively. The basic idea is to count and add all the cases that yield exactly r running systems. Let us consider the corner case r = n first. Then the equation simplifies to p_r^(total) = p_r^k which is obviously correct.

For some k with r ≤ k ≤ n, consider the case that exactly k systems are running. The probability that the first k units are running and the remaining n − k units are not running is exactly p_r^k ⋅ p_f^n − k. There are n choose k configurations that yield exactly k running systems. Finally, we sum over all k where r ≤ k ≤ n.

Numbers

Let us consider some numbers. We assume that we need at least 100 commodity type machines. The following table shows the reliability of the corresponding nothing shared architecture.

Number of Units	100	101	102	103	104	105
Reliability	0.760067	0.968304	0.997115	0.999799	0.999988	0.999999
expected failure in years	0.011	0.086	0.949	13.6	241	5077

Exactly 100 machines yield a terrible reliability. However, an overhead of just 5% gives a failure of more than every 5000 years in expectation.