Update 17 November, 2022

Byzantine fault tolerance, based on the Byzantine Generals problem, is a must for decentralised networks where nodes need to come to agreement even if some are malicious. @dirvine goes through what BFT is, and crucially what it misses.

General progress

@davidrusu Has been working on decoupling AE from the join messages and has made some good progress there, the join code is quite a bit simpler now.

@joshuef is looking at joining nodes too, and has found some anomalies. When more than a third of responses from elders have been received, the joining node asks again. With five elders, having received two responses, the node retries and gets two more replies. But at that stage it receives a response for the first try, which causes a stall/ghost node situation.

He has also successfully removed more locks, meaning that adults can now write directly to the filesystem without any intermediate stages. An important simplification.

And in another bit of streamlining, @bzee has refactored part of qp2p into a more Quinn-like API where there is no separate data structure needed for incoming connections or streams.

Byzantine Fault Tolerance

Byzantine generals problem in real-world terms

In the Byzantine Generals Problem, an army has the city of Byzantium under siege. The generals commanding these forces need to decide the time to attack. If all the generals lead their forces to attack the city at the same time, victory is assured, but if they attack at different times, they lose. The generals have no direct communication with one another and the message carriers may be enemy spies, may be killed by the defending forces or the messages themselves might have been changed. Plus, of course, some of the generals themselves might be traitors. How do the generals ensure their forces all attack at the same time?

This is the problem that decentralised networks must overcome.

Applying the analogy to decentralised systems we get:

  • Loyal generals ==> honest nodes
  • Traitorous generals ==> faulty nodes
  • Enemy Army we are sending messages through ==> unreliable asynchronous network which may drop and delay messages

This story is a digestible cover for some fairly heavy maths in which it turns out that so long as no more than a third of the generals are treacherous, the honest generals can coordinate their plans successfully – Byzantine Fault tolerance (BFT).

We have multiple nodes, some of which can be dishonest and some can suffer comms failures between them.

The classical Byzantine generals problem is one of local state synchronisation in spite of traitorous generals attempting to diverge states. To broaden it out a bit, both ATTACK and RETREAT would be correct (valid) answers… as long as all loyal forces do the same.

For our purposes, the classical Byzantine generals problem is a constrained view of BFT. It’s focused purely on consensus and so it needs to deal with more types of faults.

There are other notions of BFT we can look at instead to handle problems with membership.

Here the issue is that we have a fork. When adults and elders leave or join the nodes can have different views of the section. Though this fork will eventually be resolved, the decisions made during the split view, e.g. a DBC spend, are hard to handle. Thus we are trying to avoid this split-view through consensus.

One option Mostafa and @Davidrusu are working through is a broadcast protocol called Verifiable Consistent Broadcast (VCBC).

BFT Broadcast Protocols

A broadcast protocol tries to disseminate a message from a node to the network. The BFT version tries to ensure that all nodes agree on the message this node had sent:

On the left we have an honest node 1 broadcasting msg to 2,3,4,5. On the right we have a faulty node 1 broadcasting a to 2,3 and b to 4,5.

image

A BFT broadcast protocol protects against the fork in the second case by ensuring all honest nodes either agree to the same message from node 1 or they all refuse to accept any message from node 1.

The way this is done in VCBC is through a 3 phase protocol:

  1. Node 1 broadcasts the message it wants everyone to see
  2. All nodes check if this is the first time Node 1 has sent a message, if so responds with a signature share over the message: sigma_j(m)
  3. Node 1 waits until they receive >2/3rd of the signature shares to aggregate into a full signature, then broadcasts the full signature to the network convincing everyone that they all have the same message.

Let’s look at how this algorithm breaks down.

VCBC crucially relies on more than 2/3rd’s of the network following the protocol faithfully. We can ask ourselves: “What damage can be done if 1/3rd or more of the nodes are faulty?”

If more than 1/3rd but fewer than 2/3rds + 1 are faulty, then the worst they can do is stall the network by refusing to contribute signature shares.

If more than 2/3rds are faulty, then the faulty nodes can make signatures over any message they like.

However, this fact only holds true if the BFT nodes collude. i.e. if they are not the same attacker then they don’t matter as much. They will not create a fork.

So a critical part of BFT avoidance is the word COLLUDING.

A great mind experiment is solving the BFT problem by having MANY attackers. In this way, the single colluding attacker is diluted by other attackers and honest nodes.

What does this mean in real life?

If we consider the mind of an attacker. What would he prefer?

  1. The network is still running
  2. His enemy has taken control of the network

Clearly, the answer is the network is still running honestly. So unless an attacker has majority control (2f+1, where f = the number of bad, colluding nodes) then he cannot take over the network.

So if we have N==9 and attack nodes==9 but colluding nodes==0, then we have no issues?

So 100% of nodes being attackers is OK.

It is only when we have colluding bad nodes we have issues. There are two ways/thresholds in play here. If we consider a colluding attacker then the thresholds are:

  1. The attacker controls >=1/3 nodes
  2. The attacker controls >=2f+1 nodes.

1 Is a vandalism attack where he can stall decisions and the network.
2 Is a takeover of the section at least.

So perspective matters here. The more attackers we have, the more secure our network! We can reduce an attacker’s influence by having many attackers and/or many honest nodes.

So what is the BFT attack on a broadcast protocol

When we discuss a BFT attack, we mean a colluding attacker. Individual attackers are fine.

We can tolerate fewer than 1/3rd colluding bad nodes. Whereas there may be many bad nodes, we are protected as long as no colluding group exceeds 1/3 of decision makers (to stall) or 2f+1 to create a fork (takeover).


Useful Links

Feel free to reply below with links to translations of this dev update and moderators will add them here:

:russia: Russian ; :germany: German ; :spain: Spanish ; :france: French; :bulgaria: Bulgarian

As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!

55 Likes

First. Woooooo!!! :partying_face:

Now to read!

18 Likes

You say you’re first… but can we trust you?..:thinking:

18 Likes

I don’t see that the conclusion follows the premises.

Some attackers are doing so just to disrupt, destroy, vandalize. An attacker with these motivations would be perfectly happy if anyone takes over the network, or if they could simply cause a stall.

edit: though I’m just quibbling with phrasing, not saying anything new. :wink:

17 Likes

The difference there is exploit of the resources on network or the network itself… different cases but it’s corruption of data that is likely more interesting wrt consensus… once network is robust in size the other falls away… resources being notionally a great attractor after a time.

11 Likes

Thanks so much to the entire Maidsafe team for all of your hard work! :racehorse:

12 Likes

Niiice :slight_smile: Still in the weeds, but solid progress being made. All else comes from the basic mechanics being correct.

10 Likes

The network is back! Got a live network up and running today! :boom:

Much hard work gone in from team Maidsafe since that was last possible, super exciting!

minions-yay

25 Likes

Whens the next testnet @josh :partying_face:

14 Likes

I am as eager as anyone to jump in.
I don’t think that the necessary changes have made it to main yet, I used a branch.
Then there is the difficulty of participation without a release, but lets see what tomorrow may bring after a bit more poking. …soon?

17 Likes

13 Likes

Is this already being integrated or work-in-progress?

8 Likes

Looks like finishing touches to me.

11 Likes

Image by Stable Diffusion

Great work team! Seems many thread-masters are working the tapestry at once. While I’m reading along in the other thread on BFT, the theoretical stuff is over my head - I’d need months of effort to absorb that … so will simply nod my head accordingly as I trust the experts at Maidsafe.

Thanks to all contributors.

Cheers

12 Likes

Thx 4 the update Maidsafe devs

Is there a way to make it financially costly for an attacker to stall or fork?

Keep hacking super ants

9 Likes

Yes, node age.

13 Likes

What happend to AT2?

6 Likes

We still use that, but refer to is as AT2 or BRB (Byzantine Reliable Broadcast). It’s how DBCs are implemented. It’s a very useful and safe pattern.

12 Likes

4 posts were merged into an existing topic: Consensus, async, Ae : What does it all mean and where does it matter?