SAFE Network Dev Update - April 4, 2019

Nice concise update!

The PARSEC 2.0 paper sounds really interesting. I’m sure I won’t understand much of it, but I’m looking forward to its release down the track.

Also, a quick glance at the new node ageing design was a mind-bender for me - still, it’s nice to see the effort the team is putting into it.

It’s frustrating that quinn is causing so many problems - the cost of being on the bleeding edge. It’s good, though, that the quinn devs are responsive, and hopefully this will all be in the rear-view mirror soon.

Thanks for the update :wink:

11 Likes

So I decided to do a simple spreadsheet on this for the worst case of N/3 bad nodes.

You will have problems if the section sizes are small and there are N/3 nodes in each section acting badly (comms issues, bad actors, etc.). For instance, if every section has 42 nodes and a message travels 10 hops, then you can expect on average one in 500K messages to fail. It is much worse if sections are smaller (at 24 nodes, 1 in 656 messages will fail on average).

On the other hand, once all the sections reach 63 nodes, on average only one in a billion (10^9) messages will fail.

From the spreadsheet below you can see that capping the number of copies of the message is both viable and desirable. If you cap the number at 21, then expect a worst case of 1 in a billion on average. Capped at 24, it is 1 in 28 billion in the worst case. Limiting the maximum number of copies of the message will prevent a very large section from creating unnecessary work for the network as a whole.

Here is the spreadsheet to give you an idea.
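
(In case anyone wants to play with the numbers without the spreadsheet, here is a rough Python sketch of the model I used - per-hop failure taken as (fraction bad)^copies over 10 hops. It’s only my approximation, not network code.)

```python
# Rough model behind the spreadsheet (an approximation, not network code):
# a hop is lost if all m = ceil(N/3) relaying nodes happen to be bad, taken
# here as (bad_fraction)^m, and a message crosses 10 hops.
from math import ceil

def one_in(section_size, bad_fraction=1/3, hops=10, copies=None):
    m = copies if copies is not None else ceil(section_size / 3)
    per_hop = bad_fraction ** m        # chance a single hop loses the message
    per_message = hops * per_hop       # roughly the chance the whole relay fails
    return 1 / per_message             # "one in X" messages fail

print(round(one_in(42)))               # ~478K  ("one in 500K")
print(round(one_in(24)))               # ~656
print(round(one_in(63)))               # ~1 billion
print(round(one_in(63, copies=24)))    # ~28 billion (copies capped at 24)
```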

Obviously we do not expect to have N/3 bad nodes in every section, since if even one section had N/3 bad nodes then the probability of some section exceeding N/3 bad nodes in the near future would be extremely high.

20 Likes

Thought I’d give a comparison to show how things improve with fewer bad nodes per section on average.

I changed the bad nodes to N/4, i.e. 25% bad instead of the 33.33% above.

Note

  • for 42 nodes it is 1 in 26 million, up from 1 in half a million, i.e. a 52x improvement.
  • for 24 nodes it is 1 in 6554, up from 1 in 656, i.e. a 10x improvement.
  • for 63 nodes it is 1 in 440 billion, up from 1 in 1 billion, i.e. a 440x improvement.
  • limiting the copies of the message sent to 17 would give 1 in a billion
  • limiting the copies of the message sent to 21 would give 1 in 440 billion

Thus limiting the number of copies sent to 21 in this case should be OK.

More importantly, there should also be a lower limit on the number of copies sent, so that even if the section is small we do not get an unacceptable worst-case scenario. In other words, have minimum and maximum numbers of copies, set by mathematically determined upper/lower bounds based on the acceptable message loss for the network, and let the section choose the number of copies sent, based on its size, within those bounds.
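
(Only a sketch under the same rough model as above, but to illustrate what choosing those bounds could look like:)

```python
# Sketch: the minimum number of copies that keeps expected message loss below a
# chosen target, under the same rough (bad_fraction^copies per hop) model.
from math import ceil, log

def min_copies(target_loss, bad_fraction, hops=10):
    # need hops * bad_fraction**copies <= target_loss
    return ceil(log(target_loss / hops) / log(bad_fraction))

print(min_copies(1e-9, bad_fraction=1/4))   # 17 copies at 25% bad nodes
print(min_copies(1e-9, bad_fraction=1/3))   # 21 copies at 33% bad nodes
```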

15 Likes

Think of it as a periodic heartbeat reaching consensus. That’s my understanding at least, but I’m sure @Jean-Philippe can correct me if I’m wrong :slight_smile:

5 Likes

Thank you for your review.
I’ll try to answer some of your questions.

As @jonhaggblad mentioned, this is indeed a regular heartbeat.
Because Adults do not really do much work in Fleming, the most straightforward unit of work is how long they have been in the section. This will be replaced once actual work is being done.

The timeout, i.e. whether a certain amount of time has passed, is what triggers a vote for ParsecPurgeCandidate. (This is a different timeout from the one used to simulate a unit of work.)
So if the candidate has not successfully completed its resource proof within a specific amount of time, each elder will vote for ParsecPurgeCandidate, and if that reaches consensus before ParsecOnline, then the candidate will be removed, as we consider it to have failed the resource proof.
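
(As a very rough illustration of that rule - just a toy Python sketch based on the description above, not our actual routing/PARSEC code, and the timeout value is made up:)

```python
# Toy illustration only (not the actual routing/PARSEC code): the rule each
# elder applies to a joining candidate, as described above. The timeout value
# is hypothetical, for the example.
from dataclasses import dataclass

RESOURCE_PROOF_TIMEOUT = 300.0   # seconds - made-up value

@dataclass
class Candidate:
    joined_at: float
    completed_resource_proof: bool = False

def elder_vote(now, candidate):
    """What this elder votes for the candidate at time `now` (None = keep waiting)."""
    if candidate.completed_resource_proof:
        return "ParsecOnline"
    if now - candidate.joined_at > RESOURCE_PROOF_TIMEOUT:
        # Proof not completed in time: vote to purge. If this reaches consensus
        # before ParsecOnline, the candidate is removed.
        return "ParsecPurgeCandidate"
    return None

print(elder_vote(100.0, Candidate(joined_at=0.0)))   # None - still within the window
print(elder_vote(400.0, Candidate(joined_at=0.0)))   # ParsecPurgeCandidate
```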

We currently have a resource proof implementation in our code, and we do not expect to change it for Fleming. It tests computation and bandwidth.

I hope this clarifies things a bit.
As mentioned, we are working on clarifying it further.

13 Likes

Great update as always, Maidsafe devs.

I noticed a snowflake in your Tweetstorm 12/ cryptocurrency…
Ideally you want to call SAFEcoin money instead of cryptocurrency, because it’s literally a store of value (family pictures, company secrets, censorship-resistant data, etc.) and maintains its purchasing power in its economy. Money gives it a different tone, compared to the over 2,100 cryptocurrencies screaming that they’re sound money. :sweat_smile:

@mav :vulcan_salute: :clap::clap::clap:

:stuck_out_tongue:

4 Likes

7 Likes

Hmhmm - how does one comment on this request for comment? (on GitHub through issues, or in the dev forum?) @bochaco, because the human-eye-friendly but non-standard encoding in particular is something I would try to avoid tbh (especially since we found out that the Python and JS implementations seem to differ)… Yes, of course it works when using the libs for all eternity and for everything, but if someone wants a faster native implementation (without callbacks and lots of inter-process interaction) for such a simple task (or if someone goes for a native implementation of the SAFE protocol at some point), using this might turn out to be not that handy… Or you need to specify exactly what encoding (binary to characters) you use there, so anybody will be able to reproduce it…

5 Likes

Please feel free to comment in either of those places @riddim; in my experience you may find a bigger audience in the forum (not that I’m agreeing or in favour of this, it’s just what seems to happen usually).
As for the issue we found, it’s like any other spec: there may be bugs in any implementation out there, but what should drive anyone through development is the spec. So, depending on what else we figure out throughout development, we could add more details to our own RFC to prevent as many of these situations as possible.

7 Likes

The summary states “This validation though, is a spot check and also best effort. It is not guaranteed to be accurate over time and this consideration must be clear to users of the crate.”

What happens if a malicious node joins the network, passes the initial test, and then slowly diminishes its resources to slow the network down?

5 Likes

Well, it’s going to get relocated (node ageing), and then it could potentially fail the resource proof done by the new section and get booted out. Also, it joins as an adult, not an elder, with all the node-ageing particulars that implies.

10 Likes

Thanks for the quick reply!

4 Likes

Nice work with the spreadsheet @neo, though I noticed that some of your numbers are a bit off - we actually did calculate very similar things :wink:

Let’s denote the number of faulty nodes per section by f, the number of all nodes in a section by N, and the number of relaying nodes as m (equal to N/3 rounded up). Also note that we expect f to be strictly less than N/3, so strictly less than m, but we can also extend the calculations to it being equal to m or greater.

First thing is that you assume the probability of failure in a single hop (i.e. the probability that all relaying nodes will be faulty) to be (f/N)^m. That’s incorrect. Note that the probability should be 0 if m > f (impossible that all relaying nodes will be faulty if there are more relaying nodes than faulty ones), but your equation gives a different result.

This is because you need to take into account that you aren’t randomly “drawing” nodes out of the section independently - every next draw is dependent on the previous ones. The correct expression turns out to be (f!(N-m)!)/(N!(f-m)!). If m > f, the denominator breaks down (you get a factorial of a negative number), but the result should be 0, as I said above.

The second thing is the probability of message failure with 10 hops. You calculate it using the expected number of sections to fail, but there is a more accurate way of calculating that.

Note that message relay fails if at least one relay section fails. The probability of this is 1 - [prob. of all sections succeeding]. Thus, it is: p = 1 - (1 - (f!(N-m)!)/(N!(f-m)!))^10 (prob. of success to the 10th power is the prob of all hops succeeding). Then you can calculate the expected number of retries until failure as 1/p.

Try inputting that in your spreadsheet :wink: You will note that if m > f (so if N/3 > f), then the prob. of failure is 0, and all messages are certain to succeed. Only for f ≥ N/3 do you get a nonzero probability of failure.

This is why we think this will allow us to drop acks and timeouts - as a failure due to a node being faulty is impossible if f < N/3 (other failures might be possible, like incorrect handling of relaying during a merge or split, but that is a separate topic).
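
(For anyone who wants to check the numbers quickly, here is a small Python sketch of the two formulas above - the per-hop term is just C(f, m) / C(N, m). It’s only for playing with values, not the routing code.)

```python
# A sketch for checking the numbers only (not the routing code).
# Per-hop failure = f!(N-m)! / (N!(f-m)!) = C(f, m) / C(N, m);
# math.comb conveniently returns 0 when m > f.
from math import ceil, comb

def hop_failure(N, f):
    m = ceil(N / 3)                     # number of relaying nodes
    return comb(f, m) / comb(N, m)      # P(all m relays drawn are faulty)

def message_failure(N, f, hops=10):
    return 1 - (1 - hop_failure(N, f)) ** hops   # P(at least one hop fails)

for f in (13, 14, 20):                  # N = 42, so N/3 = 14
    p = message_failure(42, f)
    print(f, p, 1 / p if p > 0 else "never fails")
```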

17 Likes

Got me there. I should have had my thinking cap on - and it’s been a lot of years since I did probability theory at uni. Thanks for the correction. I realise now I was also using a single route.

But now that I think about it more, I have a question. Do the copies of the message follow the same route, that is, the same hop sections? Or do they follow different routes through different sections?


If the same route

Then if the originating section has 66 nodes, you are sending out 22 copies of the message, and if any of the following sections has significantly more nodes (e.g. 120), then it is possible that section has more than 22 bad nodes, and at that point the messages stop.


If different routes, then all copies hit (mostly) different sections, and it comes down to the chance that each route is bad.


Now how does the section send the messages? Does it tell N/3 nodes of the section to send one copy each out?


It’s late here; I will digest your maths tomorrow and update my Mickey Mouse spreadsheet.

11 Likes

Yes, that is the idea. At every hop, every relaying node chooses the neighbour section closest to the destination (which should be common to all the relaying nodes), and within that neighbour, it chooses the N/3 nodes closest to the destination.

First, the idea is to use the target hop’s size as N. So in this case, each of the 22 nodes from one section would send 40 messages, to 40 nodes from the other section.
Second, when node ageing lands, we will only count the elders towards N, and the number of elders is going to be constant. So every section will always have 10 or so elders.

It’s N/3 to N/3, so every relaying node from one section sends a message to every relaying node from the other section. This means the number of messages per hop is quadratic in the number of relaying nodes, but with N on the order of 10, it shouldn’t be a problem.
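
(A quick sketch of the per-hop message count under this scheme, just to illustrate - not actual code:)

```python
# Sketch of the per-hop message count in the N/3-to-N/3 scheme: every relaying
# node in the sending section messages every relaying node in the target section.
from math import ceil

def messages_per_hop(sending_section_size, target_section_size):
    senders = ceil(sending_section_size / 3)
    targets = ceil(target_section_size / 3)
    return senders * targets

print(messages_per_hop(66, 120))   # 22 x 40 = 880 messages for this hop
print(messages_per_hop(10, 10))    # 4 x 4 = 16 with elder-only sections of ~10
```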

If it were just a single message sent by each of them, it would be almost like the routes from before, but in parallel, and those had a much larger probability of failure.

11 Likes

It’s so awesome that @AndreasF is back at it! Everyone at @maidsafe is going down in history, and they are heroes on this forum.

14 Likes

Is the message relay being implemented simultaneously with node ageing, or will they be done one after the other?

According to David, message relay is already done.

2 Likes

They are changing it - see the section on routing in the update above.

1 Like

In my quoted post I asked about the finished/unfinished parts with regard to Fleming. But yeah, maybe they changed the plan in between.

2 Likes