Consensus, async, Ae : What does it all mean and where does it matter?

Very detailed summary , great work! :dizzy: :100:


I used to think that, but it’s not true. The byzantine resistance is constant, but the crash resistance improves.

This is the key. Allowed failures not constant. If you have 3f+1 group size, then your tolerance is all byzantine nodes. This tolerance (BFT) is fixed at having <1/3 nodes allowed to be byzantine.

So with a group size of 7 8 or 9 that number for byzantine fails is fixed at 2. However
At group size 7 you can not tolerate any more fails
At group size 8 you can tolerate another crash
At group size 9 you can tolerate two crashes on top of byzantine fails.

If you dig into this we say we get agreement at 2f+1 nodes (5) and that is constant. We tolerate <1/3 (2) byzantine and that is constant (i.e. in any vote the byzantine nodes are in the minority and honest nodes are in the majority).

So look deeper here. At groups size of 7 8 or 9 the byzantine tolerance is fixed at 2 (<1/3). So it is fixed, BUT the CFT numbers go from 0 at 7, 1 at 8 and 2 at 9.

So with N==9 you can have 2 BFT failures and an additional 2 CFT failures.

It is not documented anywhere I can find and I believe this is lost on many Engineers, but it happens to be accurate.

at N==4f+1 you still have 2f+1 to agree, you till have <1/3 byzantine tolerance. But, you have an additional f CFT nodes.

If you still wonder then think of this. Set N==9, set agreement == 2f+1 and tolerate <1/3 byzantine nodes. Now look at what number of honest voting nodes are available to you, it’s 3f+1 nodes, yes you assume 3f+1 are honest. That is the current state of art. So now then ask yourself, given we have 2 byzantine guys, how many honest nodes need to vote (5) and how many honest nodes do we have (7).

The math is accurate, but the application of this has been missed by way too many projects, in my opinion.

Maybe referring to the extra f (f==2) as possible CFT makes it feel weird, but it happens to be true.However you can only add these when you have N==4f+1 instead of 3f+1 (which most set N to).

I feel part of the problem that has hidden this from us is the focus on >2/3 nodes needing to agree to prevent byzantine failures and this op shows that is in fact false when we set the group size at 4f+1 (somebody could make themselves famous writing this in a paper). We actually only need >50% of nodes to agree IFF we set the group size at 4f+1 and not 3f+1. The number of nodes needing to agree is a constant 2f+1 to ensure the byzantine fails are in the minority (we don’t need to worry about CFT fails spoiling a vote :wink: )

Note, 4f+1 is a max value here. If you go beyond that then your f value changes and you start again.


But only if the network has need of a new node, so the elders still have to accept your new node.


Good point. But then, is it easier to get in anybody that can sign, and kick them out as they misbehave (which has to happen anyway), or having a long messy queue of nodes waiting to be on the VIP list…it just feels to me like membership is proving to be harder than it should to implement, at least at the early stages of the network.
Economy, storage, etc, that’s a different kettle of fish, dealt with aferwards…


I wish I had the chops to weigh in. I understand the innovation at least but nothing beyond unfortunately but I will say, it sounds like SafeLabs will hit the ground running!

The out of the box thinking and experimentation is so refreshing.


I’m not clear on how your proposal removes the need to agree on membership. Elders still either have to choose (and agree) or if not, then all waiting nodes are added as soon as they apply, which creates the possibility of the network being overwhelmed by bad nodes in a disruptive Sybil attack. Search ‘Google attack’ on the forum.


Part 2 of this deep dive will go into the business logic of membership and then how it all ties together.

I am writing and rewriting it to get clarity. I suspect I will attack it in a few steps.

  1. A stable size single-section network (i.e. a section that tries to maintain it’s size under churn)
  2. A growing network (introduce splits)
  3. A growing network that limits growth to when it’s needed and how.

These all introduce interesting elements.

  1. Attacks the join queue and selection of nodes
  2. Attacks the splitting logic (and tries to make it natural)
  3. Attacks the limiting logic.

Each introduces a new event type to consider deeply.


So glad you’re doing this :+1:


I suppose what I’m taking about is a clear distinction between joining and membership.

Looks like David might clarify that in Part 2 ( :pray: ), and maybe that’s already the way it is, but I have the impression the joining process could be simpler. Does it have to involve BFT? Really just need to be ordered (CRDT) If all I’m saying is ‘I want to joyn’, why should I be vetted on that? I’m not holding data nor voting (not and Adult nor an Elder yet), can’t do much damage.

You might then say, what’s the point in joining then, and getting the network to do work assigning me age, address etc? (how much work is it, and can I do that myself?)
One is more philosophical I suppose, as in “I’m assuming you’ll be good, now prove me wrong” rather than “Prove me you’re not bad, and I’ll reward you”…but don’t really want to go there.
The other one is more practical (and maybe egoistic); would this speed up kickstarting a network?


I will do :wink:


This may be the issue unless it’s changed recently (quite possible!) because I believe it is incorrect. Joining means you become an Adult and start doing useful work (storing chunks and responding to messages). Joining isn’t joining a queue, everyone waiting to join is effectively in an unordered queue.


The dichotomy between crash and malice has been mentioned in numerous papers. No one bothers going into further detail between the two because it’s not meaningful to do so. Consensus / byzantine agreement algorithms solve the problem of unpredictable behavior by participants in a decentralized network, where unpredictable behavior ranges from simple crashes all the way to sophisticated malice. Put differently, the 1/3 threshold is already inclusive of simple crash considerations.

This says nothing about your group size (elder size for safe), which of course should be experimented with through tests.

But you’re saying that the group size influences the byzantine threshold. It appears so given the examples you’ve chosen. But in actuality, it doesn’t from a mathematical standpoint. That’s why no one provides for a different byzantine threshold depending on the network size.



I was wondering about that distinction too. Maybe a ‘crash’ can be detected as a specific subset of behaviours such as dropping offline, but exclude running the wrong software which is definitely malice. Crashing nodes are then treated differently than byzantine nodes, which once detected are immediately booted.


Would you like to link some of them? Even one, so that we can see the details?

I have a hunch that unpredictable behaviour can be responded to differently if it is unpredictable activity or unpredictable passivity.

I think it’s kind of a mistake to even speak about “crash” or “malice”. Because to get to get it right we have to stay in the level of another node and not resort to a viewpoint that sees anything from above.

Identity, meaning “bad” or “benevolent” is after the fact construction, that is actually not relevant at all. All we care is if the action is appropriate or not. And non-action is slightly, but meaningfully different than action.


I am specifically saying the byzantine threshold is constant and the byzantine node count is constant from 3f+1 to 4f+1.

The CFT count is not constant. it goes from 0->f as you go from 3f+1 to 4f+1 as a population count.

We don’t :wink:

Perhaps this makes it clearer.

Consensus algorithms exist to prevent forks. That’s it. They look like they provide order, and they do, but their purpose is to prevent forks. (doublespend)

If you can agree to this then byzantine nodes are clearly nodes running code in a manner that will create forks.

Crashed nodes are by definition, unable to create forks.

Therein lies the difference between CFT and BFT


I tried to make it pretty at one point…


Here’s the seminal paper that provided proof of the 1/3 byzantine fault threshold:

You should read the whole thing. But you can find the definition of byzantine fault on the second page (874), e.g.,

Byzantine faults. Having briefly discussed our communication model, [we] must now mention our fault model. Processors become faulty when they deviate from their prescribed programs. "Crashing (i.e., ceasing all activities) is a benign way for a processor in a network to become faulty. The faulty behavior considered in this paper is instead much more adversarial: faulty processors may deviate from their prescribed programs in an arbitrary fashion, perform an arbitrary amount of computation, and Even coordinate their disrupting actions. Modeling software and hardware faults as malicious behavior may not be unreasonable when dealing with the interaction of very complex programs, but it is actually very meaningful, and even mandatory, if there are people behind their computers.

I wish I had more time to say more. I’m sure the right thing will get implemented eventually. My goal for saying something is to hopefully help minimize wasting time on obviously wrong paths. Don’t listen only to the positive-sounding but ultimately wrong voices.

The point in using that expression is to convey the meaning that for some matters, it is better to rely on well established proofs instead of rolling one’s own. And yea, no need to bring up innovation. One still needs to work within the confines of certain established axioms when innovating. Breaking an axiom requires a great level of proof that hasn’t been provided here.

I know :slight_smile:

This is wrong. See above. Crashing their node(s) would be one of the simpler ways for a malicious participant to disrupt a network. Crashes without malice have the same effect so should be treated the same.


The final point, how can crashing nodes cause a fork?


I appreciate that.

Is your problem calling faults either BFT or CFT? There is an argument there as we see above (crashed nodes cannot commit faults via bad messages). Also a node not sending a message for any reason is only an issue when your 2 CFT nodes don’t and then you have another crash (or miss messages) only when N=3f+1. Even in this case the fault is limited to a stall if the number of non participating nodes means we have less than 2f+1 participants left.

There is confusion also in imagining all honest nodes get agreement as it assumes comms work, which we know they don’t always. This is where Ae kicks in for us.

So lets call it ft (fault tolerance) to cover cft plus bft

I know that paper and it shows many angles and additions like common coins(ft<1/4 or even =1/3 and so on), but most focus on ft <1/3 of N.

So N == 7 ft==2
N == 8 ft==3
N ==9 ft==4

As <1/3 of 7 8 or 9 is 2 we cover bft as per papers, we still have 2f+1 (5) as agreement.

Can you point out in any paper where the above is not correct? This is why I say it has been missed with the focus on supermajority meaning >2/3 regardless of N. That’s wrong, it’s clearly wrong from the simple math above. The focus should always be on the bft figure (<1/3) and 2f+1 as deciding number.


Please hurry! There’s an emergency on the forum with all this politics etc. directing energies of good and helpful people in meaningless debates. They, we, need something else to do. You know how youngster are, they should not be left alone, loitering and causing trouble. :yum: