Published: 2011-06-21
Tagged: availability, fault-tolerance, nosql, probability theory, reliability

My forthcoming talk at Jazoon 2011 features a brief discussion on reliability (or fault tolerance if you like) of a cluster of machines. I expand on that in this post.

Reliability and Fault-tolerance of NoSQL Systems

Reading this posts requires some basic knowledge in probability theory. Consult "Grinstead and Snell's Introduction to Probability" e.g.; it is available online.

Assumptions

We start out by making some assumptions on the reliability of hardware. The given values are examples as I don't have any real values to share. This discussion is about how to arrive at a certain result.

Commodity Hardware Rackmount Server SAN
expected failure 1 day in 1 year 1 day in 5 years 1 day in 50 years
probability of failure pf 0.00273973 0.00054794 0.00005479
reliability pr 0.99726 0.999452 0.999945

There are two possible outcomes when evaluating one particular day in the lifetime of a unit: it either fails, or it does not. We denote the probability of failure with pf, and the probability that the unit stands with pr. Since we only have these two events, the relation pr = 1 − pf must be true. Think of pr as the probability that the systems runs or equivalently as the reliability.

To given an example we consider the case of one expected failure in one year. We derive pf = 1/365 = 0.00273973 and the remaining numbers follow accordingly.

Fault-tolerance of a traditional system

Our imaginary traditional/vertial system is composed of one SAN, one rackmount database, and one rackmount compute-server. The system as a whole is available if and only if all three components are running. This translates into


pr(vertical) = pr(SAN) ⋅ (pr(rackmount))2 = 0.999452 .

Expressed in other terms, the systems fails every 2.38 years in expectation.

Fault-tolerance of a sharted system

Let us consider a sharted system as depicted in the following figure.

shart architechture

We compute the reliability of one shard first. A shard is a system of 3 units of commodity hardware. The shard fails if and only if all three systems fail. Hence, we can compute the reliability by


pr(shard) = 1 − (pf(Commodity))3.

To compute the total reliabilty for n shards we have to employ


pr(shard − storage) = (pr(shard))n .

Even if we assume that there are 100 such shards we arrive with an expected failure only every 1332 years in expectation!

Thus, the weak link in such a composition is not the shards but the server which maintain the data distribution (named sharder in the figure) and the access points (named router in the figure) of the composition. They act as a single point of failure. If one wishes to compute the reliability, we can proceed as with the vertical system by replacing the SAN with our sharded storage. The numbers will be roughly the same as the single points of failure dominate the result. The reliability will improve largely if they are backed up with a fail-over system as indicated in the figure.

Fault-tolerance of a nothing shared architecture

We start out from a different model when considering nothing shared architectures. We assume that there is a number of r systems required to finish the computation in a timely manner. We also assume that once one system fails the data and computation is seamlessly overtaken by another system. This is a reasonable assumption given that the size-distribution of the values behaves well. The question is how many systems n are required to achieve a certain reliability pr(total). The result is given by the following equation


$$ p_r^{(total)} = \sum_{k=r}^n {n \choose k} \; p^{n-k}_{f} \; p_{r}^k $$

where pf (respectively pr) is the probability of failure (respectively reliability) of a unit.

Let us try to understand the given equation intuitively. The basic idea is to count and add all the cases that yield exactly r running systems. Let us consider the corner case r = n first. Then the equation simplifies to pr(total) = prk which is obviously correct.

For some k with r ≤ k ≤ n, consider the case that exactly k systems are running. The probability that the first k units are running and the remaining n − k units are not running is exactly prk ⋅ pfn − k. There are n choose k configurations that yield exactly k running systems. Finally, we sum over all k where r ≤ k ≤ n.

Numbers

Let us consider some numbers. We assume that we need at least 100 commodity type machines. The following table shows the reliability of the corresponding nothing shared architecture.

Number of Units 100 101 102 103 104 105
Reliability 0.760067 0.968304 0.997115 0.999799 0.999988 0.999999
expected failure in years 0.011 0.086 0.949 13.6 241 5077

Exactly 100 machines yield a terrible reliability. However, an overhead of just 5% gives a failure of more than every 5000 years in expectation.