Consensus, async, Ae : What does it all mean and where does it matter?

This may be the issue unless it’s changed recently (quite possible!) because I believe it is incorrect. Joining means you become an Adult and start doing useful work (storing chunks and responding to messages). Joining isn’t joining a queue, everyone waiting to join is effectively in an unordered queue.


The dichotomy between crash and malice has been mentioned in numerous papers. No one bothers going into further detail between the two because it’s not meaningful to do so. Consensus / byzantine agreement algorithms solve the problem of unpredictable behavior by participants in a decentralized network, where unpredictable behavior ranges from simple crashes all the way to sophisticated malice. Put differently, the 1/3 threshold is already inclusive of simple crash considerations.

This says nothing about your group size (elder size for safe), which of course should be experimented with through tests.

But you’re saying that the group size influences the byzantine threshold. It appears so given the examples you’ve chosen. But in actuality, it doesn’t from a mathematical standpoint. That’s why no one provides for a different byzantine threshold depending on the network size.



I was wondering about that distinction too. Maybe a ‘crash’ can be detected as a specific subset of behaviours such as dropping offline, but exclude running the wrong software which is definitely malice. Crashing nodes are then treated differently than byzantine nodes, which once detected are immediately booted.


Would you like to link some of them? Even one, so that we can see the details?

I have a hunch that unpredictable behaviour can be responded to differently if it is unpredictable activity or unpredictable passivity.

I think it’s kind of a mistake to even speak about “crash” or “malice”. Because to get to get it right we have to stay in the level of another node and not resort to a viewpoint that sees anything from above.

Identity, meaning “bad” or “benevolent” is after the fact construction, that is actually not relevant at all. All we care is if the action is appropriate or not. And non-action is slightly, but meaningfully different than action.


I am specifically saying the byzantine threshold is constant and the byzantine node count is constant from 3f+1 to 4f+1.

The CFT count is not constant. it goes from 0->f as you go from 3f+1 to 4f+1 as a population count.

We don’t :wink:

Perhaps this makes it clearer.

Consensus algorithms exist to prevent forks. That’s it. They look like they provide order, and they do, but their purpose is to prevent forks. (doublespend)

If you can agree to this then byzantine nodes are clearly nodes running code in a manner that will create forks.

Crashed nodes are by definition, unable to create forks.

Therein lies the difference between CFT and BFT


I tried to make it pretty at one point…


Here’s the seminal paper that provided proof of the 1/3 byzantine fault threshold:

You should read the whole thing. But you can find the definition of byzantine fault on the second page (874), e.g.,

Byzantine faults. Having briefly discussed our communication model, [we] must now mention our fault model. Processors become faulty when they deviate from their prescribed programs. "Crashing (i.e., ceasing all activities) is a benign way for a processor in a network to become faulty. The faulty behavior considered in this paper is instead much more adversarial: faulty processors may deviate from their prescribed programs in an arbitrary fashion, perform an arbitrary amount of computation, and Even coordinate their disrupting actions. Modeling software and hardware faults as malicious behavior may not be unreasonable when dealing with the interaction of very complex programs, but it is actually very meaningful, and even mandatory, if there are people behind their computers.

I wish I had more time to say more. I’m sure the right thing will get implemented eventually. My goal for saying something is to hopefully help minimize wasting time on obviously wrong paths. Don’t listen only to the positive-sounding but ultimately wrong voices.

The point in using that expression is to convey the meaning that for some matters, it is better to rely on well established proofs instead of rolling one’s own. And yea, no need to bring up innovation. One still needs to work within the confines of certain established axioms when innovating. Breaking an axiom requires a great level of proof that hasn’t been provided here.

I know :slight_smile:

This is wrong. See above. Crashing their node(s) would be one of the simpler ways for a malicious participant to disrupt a network. Crashes without malice have the same effect so should be treated the same.


The final point, how can crashing nodes cause a fork?


I appreciate that.

Is your problem calling faults either BFT or CFT? There is an argument there as we see above (crashed nodes cannot commit faults via bad messages). Also a node not sending a message for any reason is only an issue when your 2 CFT nodes don’t and then you have another crash (or miss messages) only when N=3f+1. Even in this case the fault is limited to a stall if the number of non participating nodes means we have less than 2f+1 participants left.

There is confusion also in imagining all honest nodes get agreement as it assumes comms work, which we know they don’t always. This is where Ae kicks in for us.

So lets call it ft (fault tolerance) to cover cft plus bft

I know that paper and it shows many angles and additions like common coins(ft<1/4 or even =1/3 and so on), but most focus on ft <1/3 of N.

So N == 7 ft==2
N == 8 ft==3
N ==9 ft==4

As <1/3 of 7 8 or 9 is 2 we cover bft as per papers, we still have 2f+1 (5) as agreement.

Can you point out in any paper where the above is not correct? This is why I say it has been missed with the focus on supermajority meaning >2/3 regardless of N. That’s wrong, it’s clearly wrong from the simple math above. The focus should always be on the bft figure (<1/3) and 2f+1 as deciding number.


Please hurry! There’s an emergency on the forum with all this politics etc. directing energies of good and helpful people in meaningless debates. They, we, need something else to do. You know how youngster are, they should not be left alone, loitering and causing trouble. :yum:


I was actually concurrently making the same point in our internal channels yesterday.

My thought was that since we can’t subjectively discern benign faults from malign faults, the assumed FT_{total} = FT_{malign} + FT_{benign} = 2N/3-2, i.e. FT=4 with N=9, could not be achieved. If every fault could be a byzantine fault, then we could still only have FT <=N/3-1.

However… if crashes cannot cause forks, then maybe we can have FT=4 if at least 2 of the faults are crashes.

But, as Mark says

There was an additional note added about collusion.

The tolerance to faults depends on the level of collusion between the faulty nodes.
3 nodes trying 3 different forks won’t be a problem.

But the level of collusion also strikes me as at least sometimes problematic to discern.


I do not see that crashing (non-malicious) nodes can cause a fork at all. They can at most cause the network to halt because it cannot reach supermajority for a vote. as @dirvine says above, iirc.

Given this, collusion is only relevant for malicious (BFT) nodes. (Who cares if 3 nodes “collude” to go offline together?) So for calculations, i think we must assume that all bft nodes are colluding all the time. In this sense then, there is no need to discern because we always assume the worst.


Yes, I can’t see it either (but also can’t say that I can definitely exclude the possibility). So, the question was if it can be definitely concluded to be the case (perhaps best to explicitly state in our system).

If so, then I am currently out of major objections to 4f+1, and its suggested gain of extra CFT on top of the unchanged BFT.

I agree btw that the worst case must be assumed, of all malice being in collusion.
But, while saying that, I can envision clearly discernable different forks attempted, in which case exceeding the BFT is possible, while not being bust by it.


The increased tolerance in such a case is fragile, because it could be only a matter of time before the malicious actors realise a way to collude and profit. Being malicious to start with is the basic problem. Finding ways to collude (and profit) are to be expected.

But that brings in the probabilistic view of number of byzantine nodes.

The assumption is that no more than N/3-1 nodes will be malicious.

But I take that to be a global assumption.

A (long held) thought:

The assumed and expected number of malicious nodes cannot(?) be expected to be uniformly distributed.
With a normal distribution, there will be sections where both lower and higher than this ratio are malicious. With many sections, it is not unlikely to exceed N/3-1 in some section (one of the tails).

All that is needed then is for those to realise this and to start colluding.

I don’t know if the reasoning holds, but I’ve thought it interesting to try get to the bottom with at some point.


A local majority imho is very likely at some point in the life of a real network… If we manage to get a a section checked by another randomly selected section then we should be okay… But maybe that is an issue for an optimization when we have our first stable network…? =D


I should say @Bogard I really enjoyed this interjection. This is what makes us strong. I Am not sure if you have completed the critique here, but it has been really helpful in checking all angles under all those stones. We MUST do this kind of critique and continually check and recheck.

Especially in this case as I belive the industry has feked up choosing 3f+1 as population sizes. I think some have realised, but not said so or realised the impact.

This is a great one to get many involved as it has some math, but not heavy duty statistics or calculus etc. I think this has been engaging and should help everyone understand the depths we must look at for this network. We need commodity computers being the keystone here and it means we need to squeeze every wee benefit we can from logic. It’s way too easy to get complex and say you must buy a huge computer setup to make this work. That v bad in my eyes. We need to get the best from what we have and this helps.

An important distinction is being made here as we show we don’t need a supermajority in agreement, but instead a straight majority. It buys more safety, liveness and termination likelihood.


The answer to this question is not relevant for our purpose. Let’s take a step back. The primary concern I was addressing about the main post is whether consensus is needed (yes it is). Once we settle that consensus is needed, then what’s the number of deviant nodes that can be tolerated by the network? That number is f, with N being the total number of nodes in the system in N = 3f+1. That equation describes a BFT world and you can’t add CFT allowance on top of it as was being proposed because it’s already inclusive of it, i.e., CFT is a subset of the superset that is BFT. A simpler way to think about it is that BFT is a galaxy and CFT is a solar system within the BFT galaxy.

In the CFT subset, because a single entity is in control of all the nodes, there is clarity that crashes won’t lead to forks as the controlling entity will resolve the fork given its control over all the nodes to begin with. But in the BFT superset, that assumption can’t be made because for instance crashes may be just the initial volley in an attack cascade that will result in a fork. Even a mild crash case can result in “forks” because a stalled network can mean having to build on a different branch depending on the severity of the crashes. But the CFT case is better protected from take-over attacks because only one entity is in charge anyway (in a practical sense).

This is exactly what I’ve been saying. It’s great that we’re in agreement.

I appreciate that @dirvine. Only mean to help, so I’m glad it did.


This is the way we’ve seen it always so far (afaik).

I think @dirvine has been exploring a possible nuance miss in there.

I’m not sure myself if there is one or not. It’s intriguing for sure.

But about this…:

…I must say that I think that is formulated with great clarity and flexibility of imagination, and it resonates with my thinking.


Can we get a refresher on the requirement for 2/3 super majority during consensus?

For example, consider 12 elders with 4 colluding nodes and 3 faulty/failed nodes. After loss of the 3 failed nodes we end up with 5 good nodes vs. 4 colluding nodes. A 5/9 = 0.555 affirmative vote is still a simple valid/correct majority… isn’t that good enough? To require a 66% super majority after removal of the failed nodes allows the attack nodes to disrupt consensus, no? Please offer some clarity on this.


Good question

With N==12, we still only have a tolerance of <N/3 i.e. 3 BFT nodes.
We still need 2f+1 (7) as decision maker size (this is the key we should always focus on)

Therefore in a broadcast consensus algorithm with N==12 we can lose 3BFT nodes and 2CFT nodes before trouble.

Your question then has us look at N larger than 9 to see what happens (interesting)

If we look at various ranges for f

f min N (3f+1) max N MAX N as f
2 7 9 4f+1
3 10 12 4f
4 13 15 4f-1
5 16 18 4f-2
6 19 21 4f-3
7 22 24 4f-4
8 25 27 4f-5
9 28 30 4f-6
10 31 33 4f-7
11 34

We can see the pattern. As the min N grows by (3) each time it constrains the max N which is not 4f+1 when you get past f==2. The Max N is 3f+1 + 2 :slight_smile:

So this table the MAX N continues 4f-8 4f-9 and so on. It’s an interesting one to plot out and look at why? Min size is 3f+1, decisions are always 2f+1 and bft is always <1/3. These are the constraints. So the Max N is constrained by the next 3f+1.

It’s 5:30am here. So some math guru can add more explanations on the bounded by <N/3 population min size (i.e. bounded by steps of 3).

Great question @jpell I hope this helps. It’s the strictly less than 1/3 that catches us out so many times (hard to write unless expressed as a fraction)