Consensus 28/32 or 8/8?

No

Only takes 2 misbehaving nodes to upset things. Also not all nodes respond quick enough.

There is a balance between opposing forces when choosing the size of the group. One end is security and failure resilience and the other is too many increases processing/bandwidth/response time.

failure resilience has to do with one node not responding due to a momentary hicup in comms. one going off line during consensus. Smaller groups can only allow one and there is more chance that 1+1 in eight will suffer momentary comms problem where 4 in 32 would be rare to have 5 having comms failure at the same time. There are also other issues.

On the other extreme of say 10,000 in a group it would be falling over itself in processing and comms load, taking too long to come to consensus. Plus a lot of other issues.

So the trick is to find a balance that maximises speed, security and reliability. It seems the dev team has chosen the 28/32 as providing the maximum.

1 Like

I agree that the 100% consensus requirement would open a DOS vector, but there is another point about the lesser consensus that I havenā€™t seen mentioned, and I think is another reason the numbers are chosen (only assuming, though):

The network is always subject to reconfiguring due to churn. Requiring a 100% consensus would be extremely difficult to achieve, as a reconfig mid process would require a restart, over and over potentially. With a lesser figure, those 4 out of 32 may not even be voting against the consensus. Itā€™s just that the group can consider the answer confirmed as soon as 28 agreements are polled. ā€œOops, three nodes didnā€™t respond (turned off? bandwidth issues?) and one said something stupid. Oh, well. Weā€™ve got 28 confirms; good enough. Done.ā€

If not for the margin, the continually changing network configuration would be a regular stuck point in consensus achievement. Seems to me, anyway.

5 Likes

Also the address_space_tool in our archive repos (common) was written to analyse group and quorum size from a security perspective as well as churn. This was run over several months and used the robotic scientist math package to analyse the results as well as us doing a bunch of math (there is a probability equation I posted that worked, but was unproven).

The quorum chosen provided an attack of over 80% only possibly under very unlikely circumstances (so very high security). Anything over 50% means an attacker can be the only person adding nodes and there cannot be more than 1 attacker. So it becomes self defeating if there are several attackers, or several good folk adding nodes.

The difference between quorum and close group allows network reconfiguration errors (there always will be during churn) as well as fixing a very important issue of asymmetric closeness where a node ā€œthinksā€ it is close to another, but the other does not think so. This is where the binary tree is not fully balanced, which is more true the younger the network is and we want the network to restart from zero nodes in case of worldwide power outages even, so we keep this in place even for large balanced networks.

We still have to solve data republishing in full network outage, but itā€™s not a worry right now and will be lined with archive nodes when they come into play,

There are a few more reasons, but itā€™s not an arbitrary choice for sure :slight_smile:

15 Likes

I think I have been able to find some formulas to compute the bad nodes percentage needed to control the network.

I started from these:
Total groups = C(Nodes, Size)
Bad groups = āˆ‘ C(BadNodes, p+Quorum-1) * C(Nodes-BadNodes, MinGoodSize-p)
where:

  • C is the binomial coefficient
  • Nodes = Total number of nodes
  • BadNodes = Number of nodes controlled by the attacker
  • Quorum/Size is the consensus
  • MinGoodSize = Size-Quorum+1
  • p = index from 1 to MinGoodSize

For example with a 28/32 consensus:
MinGoodSize = 5
Index = p = {1;2;3;4;5}
Attackerā€™s nodes = p+Quorum-1 = {28;29;30;31;32} choosen from BadNodes nodes
Honest nodes = MinGoodSize-p = {4;3;2;1;0} choosen from (Nodes - BadNodes) nodes

I get the following result:

It seems that to control the majority of groups, the attacker needs to controls 85% of the nodes which is compatible with your simulations. I am not sure of my formulas, but maybe this is an indication that they are not entirely wrong.

Here are the correponding data:

The bad group percentage in red indicates that the attacker controls at least one group. It appears that to take control of one single group the attacker needs to control 65% of nodes. The formula I used is:
BadGroups/Groups*Nodes/Size>=1, meaning the probability of a bad group multiplied by the number of groups (the network nodes partitionned in groups of Size nodes). I am not sure of that one too.

Some other results:

(though the ratio is the same, 7/8 is really less secure than 28/32)

(and a first bad group and 20% of nodes, it is clear that GROUP_SIZE and QUORUM constants in routing/src/types.rs will have to corrected before final delivery).

Here the link to the source Excel file so that anyone can test other scenarios. Input cells to modify are on a orange background.
Consensus simulation

8 Likes

Another simulation I forgot (8/8 is in the title of OP):

In a way it is better than 28/32 because the attacker needs 92% of nodes to control half of network, but in another way it is worse because the first group can be controlled with only 42% of nodes.

As conclusion, here are the percentages of nodes needed to control one group and then half of the network for both consensus cited by OP:

____________________________________________________________________
| Consensus  | Control of 1 group | Control of half of the network |
____________________________________________________________________
| 8/8        | 42%                | 92%                            |
____________________________________________________________________
| 28/32      | 65%                | 85%                            |
____________________________________________________________________

I dont know the damages that can be created by only one bad group. In another thread it was stated that they were minimal, but I am not convinced by this.

3 Likes

The problem with the simulations as I see it is that the network is trying to reach consensus as fast as possible. If you get a 7/8 then it does not reach consensus without doing some form of re-gerrymandering of the groups.

So you donā€™t need to control the group to cause network misbehavior, you only need to prevent it from doing itā€™s job - that means 5/32 or 1/8 depending on which model you are testing.

2 Likes

You are right. My simulations are intended to work for an attacker wanting to control the network to get some advantages from it but without disrupting it. This implies that bad nodes behave correctly when they donā€™t control a group in order to remain unnoticed by other nodes.

I will try designing other simulations where an attacker wants to destroy the network.

1 Like

In fact another simulation program is not needed. The current one can be used with a 5/32 consensus and this is probably what you were suggesting:

At first the result may seem frightening: the attacker needs only 15% of nodes to disrupt half of the network.

But at this level the bad nodes are a minority and they will be expelled from their respective groups as soon as they behave incorrectly and things will go back to normal rapidly.

Well, I hope so.

The real question ought to be how normal should dissent be? Because if 0% dissent is normal, kicking out odd voters will work ā€“ I suspect that there are many legitimate reasons why a node might be out of sync with the rest of the network, and if that is the case, kicking them out might not be the right answerā€¦

does the network know they are behaving incorrectly since there is no 28/32 majority ā€¦?

I do think there are times when non-consensus ought to be correct. If I simultaneously transmit valid signatures for ā€œGive this SafeCoin X to Bobā€ and ā€œGive this SafeCoin X to Aliceā€ The authenticating nodes may recieve one command or the other first due to latency, # hops etcā€¦ Dissent isnā€™t always wrong. Sometimes the network ought to be confused because somebody is trying to confuse the network.

1 Like

hmmm - i would have expected kicking someone out is only possible by a 28/32 majority ā€¦ cause if it isnā€™t you effectively donā€™t need a 28/32 majority to do what you want inside the network ā€¦ you just kick everybody out as long as it takes till you have the 28/32 nodes you need after you reached 17/32 ā€¦

oh or are groups completely re-structured after a kick ā€¦?

1 Like

This is a critical issue, to cause a vandalism attack, nodes need to stay OK and behave then simultaneously all do the wrong thing, as they do the group may stall a message as they expel the bad nodes. The time taken and costs to perform such an attack do increase over time. So this is a good thing. These attacks cannot damage data but can cause a blip, itā€™s a lot of effort for such, but folks will try it for sure. So it will be interesting to see it.

We have had a few doctorate students also doing simulation type work like this and there is a very interesting effect. When you run a computational simulation these figures rapidly alter when switched from linear to XOR addressing. It catches everyone by surprise. So there is a very noticeable difference.

As you see if nodes misbehave they get removed, but even getting close to be able to disrupt a group is very difficult. i.e. If a group does not do as it should, all the reporting node needs do is send the same message to a wider group and immediately it can collapse inwards to kill the bad nodes. We have not implemented that logic though, but for sure itā€™s a facet the network can achieve easily.

You can think of it like this, send an action to group (A) then if result is not in consensus (routing knows) then the same message can go to each member of A but sent to the group of that name. This allows the network to increase a group to X groups where X is number of nodes in a group (32). This provides the X groups with the ability to kick out each nodes that is in error. To aid this each node that sends a message back signs it, this signed message is crypto proof of error and that is enough to deliver to the nodes Manager group to disconnect them.

It does get a bit more complex, but this is the way the network works in XOR space. It allows nodes to be quickly identified and removed. Hope that makes sense, I will write more soon.

8 Likes

Yes immediately on any such event the network reconfigures in that area.

1 Like

Iā€™m not replying too much, itā€™s because I need time to figure out all those formulas and so on. And today I was busy but I willing to work on the formula with loto mathematic which is almost the same how the network work and itā€™s what @tfa do.

The only problem I have is this so far.

For other calcul I need to get my head on it. I need to do the formula myself to confirm if itā€™s true or not before speculating more. Tomorrow Iā€™ll try to sort it out and give my own result.

BTW @tfa great job, Iā€™ll give you a heart even if I still not aprove your formula.

2 Likes

My formulas need effectively some more explanations. Let me try to do so:

C(Nodes, Size) is the number of ways to choose Size elements from a set of Nodes elements. As there are Size nodes in a group and Nodes nodes in the network, this is the total number of all combinations of nodes to constitute a group. So this is total number of all possible groups. Letā€™s call it Groups.

This was the easy part. The hard part is to compute the total number of all possible bad groups. Weā€™ll take a 28/32 consensus as an example.

Letā€™s begin with a group of 28 bad nodes. There are C(BadNodes,28) ways to choose 28 bad nodes from a set of BadNodes nodes. For each of these partial groups there are C(Nodes-BadNodes,4) ways to choose the remaining (honest) nodes in the group. Thus there are C(BadNodes,28)*C(Nodes-BadNodes,4) possible groups with exactly 28 bad nodes.

Similarly there are C(BadNodes,29)*C(Nodes-BadNodes,3) possible groups with exactly 29 bad nodes ā€¦ and C(BadNodes,32)*C(Nodes-BadNodes,0) (= C(BadNodes,32)) possible groups with exactly 32 bad nodes.

When we sum all of them we get all the possible groups with 28, 29 ā€¦ and 32 bad nodes. That is the total numbers of groups having a quorum of bad nodes. Letā€™s call it BadGroups. Formally It is equal to āˆ‘C(BadNodes,27+p)*C(Nodes-BadNodes,5-p)
where p = 1 ā€¦ 5 (because 27+p = 28 ā€¦ 32 and 5-p = 4 ā€¦ 0).

More generally:

  • 5 is equal to Size-Quorum+1 (in my previous post I called it MinGoodSize but now I think that BlockingMinority is a better name)
  • 27 is equal to Quorum ā€“ 1

Which leads to the general formula:
*BadGroups=āˆ‘C(BadNodes,Quorum-1+p)C(Nodes-BadNodes,BlockingMinority-p) where BlockingMinority=Size-Quorum+1 and p = 1 ā€¦ BlockingMinority

The ratio BadGroups/Groups is the probability of a group to be a bad one. This is then the portion of the network that the attacker controls.

One last thing: The number of groups in the network is roughly Nodes/Size. If we multiply it by the ratio of bad groups we get the number of bad groups. So the attacker will gain control of his first group when BadGroups/Groups*Nodes/Size >=1

This is a purely static analysis where the bad nodes doesnā€™t try to disrupt the network (as I said earlier) and that doesnā€™t take into account dynamic reactions of the network. But I think it gives a rough idea of the order of magnitude of the number of nodes needed to control the network.

2 Likes

Then we have a problem.

An attacker can program nodes that conform to the group consensus when they are below the blocking minority in a group (less than 5 nodes). But as soon as they reach this threshold they can do whatever they want without risking being kicked out. For example they can contradict every decisions of the group. That will literally freeze the group.

Here are the raw data for the 5/32 simulations shown earlier.

This means for example:

  • 15% of nodes are enough to freeze half the network
  • 25% of nodes are enough to freeze 93% of the network.

I donā€™t understand. As indicated by @dirvine the groups are reconfigured. Does it mean the problem is solved? I donā€™t see how?

1 Like

that is absolutely true - that would be a problem ā€¦ but someone being able to freeze points of the network temporarily is at least better than someone being able to print safecoins ^^

and since groups change often not needing 28/32 to kick someone out is reasonable and prevents freezing of the network =)
(interesting then would be how likely at least 17/32 would be)

very cool here would be some kind of octave/matlab-simulation where nodes appear and log-off, groups get reconfigured ā€¦ sadly i have to study very intense right now and i donā€™t know the algorithms (XOR-groups-stuff) ā€¦ if not something like that is already uploaded or someone else does it this would have to wait till my exams have finished ā€¦

also extremely cool would be if there was simply no way to determine which nodes are in your group (without paying SCs) ā€¦ so this kind of attack would cost ā€¦

1 Like

This is a facet of XOR as soon as a node is not a node it is replaced immediately. This is the poke your finger in the sea approach I mentioned.

If a group can provide a majority decision but not a consensus decision, then the group can be automatically wider, by sending messages to each nodes close group with a signed authority stating they did the wrong thing.

This way a majority is enough to report, but not enough to make a decision and you have a gap of majority but not consensus, this is an area of reporting, but not acting on the original message, just removing bad nodes.

This is exactly the kind of rule the security sprint will put in place after feature complete :wink: There is a lot more though. We also have message direction and checking a node is allowed mathematically to have forwarded a message. This adds a lot more to the math (direction is more triangular than linear, see kademlia paper for more on this but pretty hidden like old nodes poisoning the network (which we donā€™t suffer from))), but is very powerful. I promise I will write some of this up, but at the moment I am with the guys working on NAT traversal and connection management in routing. So I will get back up there soon, then one day on vault logic and computation, then it gets very powerful.,

Anyway I am being a noise in this thread (sorry about that), I should let it continue as is, but getting a theoretical model and then running simulations against the xor network with the rules we have in place will be amazing.

4 Likes

Perhaps a ā€œpartial blind voteā€ can work here. If the bad node is waiting for a threshold of 27/32 before it votes, then lets remove the ā€œ/32ā€ part. In other words, the group is only counting how many said ā€œyesā€ and not counting who said ā€œnoā€ or doesnā€™t respond.

Without the ability to know if 4 nodes already said ā€œnoā€, the bad node runs the risk of saying ā€œno,ā€ exposing itself to the group as a dissenter/vandal, and may still fail at disrupting the Network.

What if the bad node delays, waiting for 27 vote opportunity to occur?
I believe their non-reply/delay would be seen as being offline and they would be replaced.

I have no idea if this would workā€¦