Update 29 September, 2022

We have simplified the code and eliminated many glitches by removing all unnecessary asynchronous multi-threaded processes in the node code. At the same time, some issues with slow communications remain, which we think may be caused by Quic itself, so we’re digging into that.

Outside of the bug bashing, the new synchronous sn_sdkg code is now ready for integration. @anselme walks us through that this week.

General progress

Qi_ma is looking at how elders check a node’s node_age when they are deciding which to relocate to a new section, as we are seeing some anomalous behaviour there, including ‘zombie’ nodes being able to join the network.

@ChrisO is experimenting with Quinn, which is a Rust implementation of the Quic protocol for establishing and maintaining connections between peers. Unfortunately, Quic is something of a black box, and we think some of the connectivity issues we are experiencing may be down to the way it works, specifically that communications are often slow, which causes problems when processes time out. Chris has a Quinn sandbox set up and is seeing what happens when we fire different types of messages at it. At the same time, @bzee is looking at the structure of the qp2p communications to confirm we are using Quinn as efficiently as possible. The key issue is receiving concurrent streams asynchronously while allowing waits on responses (returning a watcher, to watch for response messages on the same channel).

@roland is working on fuzz tests for the new sn_sdkg crate, which @anselme describes in detail below.

sn_sdkg Integration

Distributed Key Generation (aka DKG) is the way section elders generate the section key in a secure way that keeps the section key secret. At the end of DKG, each elder knows only their own secret key share, so nobody ever sees the entire section secret key. This mechanism limits the damage that potentially bad elders can do: it’s how Safe Network can ensure that as long as fewer than 5 of the 7 elders are bad, they can’t sign anything with section authority. Section authority is required to change data, promote or demote nodes and to mint tokens, so this is very important.
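The threshold idea above can be written down as a small check. This is only a sketch: the constants ELDER_COUNT and SHARES_NEEDED and both function names are illustrative assumptions taken from the "5 of 7" figure, not values or APIs from the sn_sdkg crate.

```rust
/// Assumed section size and signing threshold, taken from the "5 of 7"
/// figure in the text. Neither constant comes from the real crates.
const ELDER_COUNT: usize = 7;
const SHARES_NEEDED: usize = 5;

/// Returns true if `colluding` elders hold enough key shares between them
/// to produce a signature with section authority.
fn can_sign_as_section(colluding: usize) -> bool {
    // A group can never hold more shares than there are elders.
    colluding.min(ELDER_COUNT) >= SHARES_NEEDED
}

/// Largest number of bad elders that still cannot sign on their own.
fn max_tolerated_bad_elders() -> usize {
    SHARES_NEEDED - 1
}
```

With these assumed numbers, four colluding elders cannot sign but five can, which is why no single elder ever holding the whole key matters so much.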

Recently, we’ve been working on a new DKG that is more resilient to packet loss and that doesn’t use timers, so it can’t fail because of slow network traffic and timeouts. This post describes how this new DKG works. For this implementation, we use the synchronous distributed key generation sn_sdkg crate, which is based on poanetwork’s Synchronous Key Generation algorithm in their hbbft crate.

How DKG works

DKG is triggered by the elders when they notice that the oldest members are not the elders, or when a section splits and they need to choose elder candidates. As they notice this, the current elders ask the candidates to start a new DKG session with a DkgStart message, so they can generate the next section key.

The first step in our DKG is generating temporary BLS keys, which are used for encryption in the DKG process. Every node on the Safe Network has an ed25519 key, but although those keys are great for signatures, we can’t safely do encryption with them. We need another way.

Since our nodes don’t have BLS keys (elders have a BLS keyshare but not a simple BLS key), we generate a one-time key just for this DKG session and discard it after use. However, we need the other nodes to trust this BLS key because it’s brand new, so before anything else happens, each candidate broadcasts its newly generated BLS public key in a message that contains their signature (made with their trusted ed25519 key) over the new one-time BLS public key that they will use for this DKG session.
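A rough sketch of that key announcement step, for readers who prefer code. Everything here is hypothetical: NodeId, KeyAnnouncement and SessionKeys are made-up names, keys and signatures are opaque bytes, and the real ed25519 check is passed in as a `verify` function rather than implemented.

```rust
use std::collections::BTreeMap;

/// Hypothetical identifier for a candidate (not the real node type).
type NodeId = u8;

/// A candidate's one-time BLS public key plus a signature over it made
/// with the candidate's long-term ed25519 key. Both are opaque bytes here.
#[derive(Clone, Debug, PartialEq)]
struct KeyAnnouncement {
    sender: NodeId,
    bls_public_key: Vec<u8>,
    ed25519_signature: Vec<u8>,
}

/// Collects the announced one-time keys for a single DKG session.
struct SessionKeys {
    candidates: Vec<NodeId>,
    accepted: BTreeMap<NodeId, Vec<u8>>,
}

impl SessionKeys {
    fn new(candidates: Vec<NodeId>) -> Self {
        Self { candidates, accepted: BTreeMap::new() }
    }

    /// Stores the key only if the sender is a candidate and `verify`
    /// (a stand-in for real ed25519 verification) accepts the signature.
    fn handle<F>(&mut self, msg: KeyAnnouncement, verify: F) -> bool
    where
        F: Fn(NodeId, &[u8], &[u8]) -> bool,
    {
        let known = self.candidates.contains(&msg.sender);
        if known && verify(msg.sender, &msg.bls_public_key, &msg.ed25519_signature) {
            self.accepted.insert(msg.sender, msg.bls_public_key);
            true
        } else {
            false
        }
    }

    /// Voting can start once every candidate's key has been accepted.
    fn ready_to_vote(&self) -> bool {
        self.candidates.iter().all(|id| self.accepted.contains_key(id))
    }
}
```

The point of the sketch is the ordering: a brand new key is never trusted by itself, only through the signature from a key the section already knows.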

Once the candidates have all the BLS public keys for this DKG session, they can start voting. Voting has 3 stages:

  • Parts: every node submits a Part that will be used for the final key generation. It contains encrypted data that will be used for generating the key shares.
  • Acks: nodes check the Parts and submit their Acks (acknowledgements) over the Parts. These Acks will also be used for the key generation.
  • AllAcks: everyone makes sure that they all have the same set of Acks and Parts by sending their version of the sets. This last part is there to make sure that the candidates end up generating the same section key!

Once voting is finished, candidates can generate their secret key shares along with the new section public key from the Parts and Acks.
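The three stages can be pictured as a small state machine. This is a simplified sketch, not the sn_sdkg implementation: it assumes one vote per candidate per stage, leaves out the vote payloads and signatures entirely, and the names VoteKind, Stage and VoteTracker are invented for illustration.

```rust
use std::collections::BTreeSet;

/// Hypothetical identifier for a candidate.
type NodeId = u8;

/// The kind of vote a candidate sends (payloads omitted in this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
enum VoteKind { Part, Ack, AllAcks }

/// Where a candidate currently is in the voting round.
#[derive(Debug, PartialEq)]
enum Stage { Parts, Acks, AllAcks, Terminated }

/// Tracks which candidates we have seen a vote from, per stage.
struct VoteTracker {
    candidates: BTreeSet<NodeId>,
    parts: BTreeSet<NodeId>,
    acks: BTreeSet<NodeId>,
    all_acks: BTreeSet<NodeId>,
}

impl VoteTracker {
    fn new(candidates: &[NodeId]) -> Self {
        Self {
            candidates: candidates.iter().copied().collect(),
            parts: BTreeSet::new(),
            acks: BTreeSet::new(),
            all_acks: BTreeSet::new(),
        }
    }

    /// Records a vote; votes from non-candidates are ignored.
    fn record(&mut self, kind: VoteKind, from: NodeId) {
        if !self.candidates.contains(&from) {
            return;
        }
        match kind {
            VoteKind::Part => self.parts.insert(from),
            VoteKind::Ack => self.acks.insert(from),
            VoteKind::AllAcks => self.all_acks.insert(from),
        };
    }

    /// A stage is complete only when every candidate has voted in it.
    fn stage(&self) -> Stage {
        if self.parts != self.candidates {
            Stage::Parts
        } else if self.acks != self.candidates {
            Stage::Acks
        } else if self.all_acks != self.candidates {
            Stage::AllAcks
        } else {
            Stage::Terminated
        }
    }
}
```

Note that nothing in the tracker depends on a clock: the stage advances only when votes arrive, which is the property that lets the new DKG do without timers.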

Gossip and eventual termination

On a network, messages can be lost and that can lead to situations where some candidates are missing votes and some are waiting for a response to votes that never arrived. To counter this problem, we have gossip! Every now and then if a node hasn’t received any new DKG messages when it is expecting some, it will send out all its votes to the others. This has two purposes:

  • one is to inform the others of the votes, and get them up to speed with votes they might have missed
  • the other is to show the other participants that our node is missing votes, so others can respond in turn with their votes and help us catch up with them

Indeed, if a node receives a gossip message that is missing information, it will respond with its knowledge. This happens even after termination (completion of the voting round), because sometimes, when a node terminates (and thus stops gossiping because it is not expecting any more votes), it will still receive gossip from other nodes that didn’t make it there yet. In that case, the knowledgeable node will respond to this gossip with their knowledge so the other nodes can also reach termination. Eventually, through this process, every candidate reaches termination.
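Here is a minimal sketch of that gossip rule, assuming votes can be compared as plain values. The Vote alias and the Knowledge type are hypothetical stand-ins; real votes are signed messages and the real crate's types will differ.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a signed DKG vote: (voter id, vote kind).
type Vote = (u8, &'static str);

/// Everything one node knows about the current DKG session.
struct Knowledge {
    votes: BTreeSet<Vote>,
}

impl Knowledge {
    /// Handles gossip from another node. We always merge in what they sent.
    /// If they were missing any vote we already had, we return our full
    /// (merged) set so the caller can send it back and help them catch up.
    fn handle_gossip(&mut self, theirs: &BTreeSet<Vote>) -> Option<BTreeSet<Vote>> {
        // Check before merging: do we hold a vote the sender does not?
        let sender_is_behind = self.votes.difference(theirs).next().is_some();
        self.votes.extend(theirs.iter().copied());
        if sender_is_behind {
            Some(self.votes.clone())
        } else {
            None
        }
    }
}
```

A terminated node can keep running exactly this handler: it no longer gossips on its own, but it still answers anyone whose gossip shows they are behind.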

Concurrent DKGs

In this implementation, we embrace concurrent DKGs. Sometimes, right after DKG is triggered, a new node joins the section and appears to be a better elder candidate because its node age is very high. In this case, the current set of best elder candidates changes, and the current elders issue another DkgStart message to the new candidates.

The previous DKG session is not stopped; instead, it’s now a race between the two! We want elders to be very reliable nodes. In a way, the intensive DKG process is a test to check that these candidates are indeed fit to be elders. If multiple DKGs terminate at the same time, that’s fine: Handover Consensus will make sure the current elders pick just one winner. DKG sessions that didn’t win the race might or might not terminate, but it doesn’t really matter, as they will eventually be stripped out when the nodes realise that they lost.
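The bookkeeping for racing sessions might look something like the sketch below. It is an illustration only: SessionId, SessionState and DkgSessions are invented names, and the choice of winner is deliberately left outside the code because that is the job of Handover Consensus, which is not modelled here.

```rust
use std::collections::BTreeMap;

/// Hypothetical identifier for a DKG session.
type SessionId = u64;

#[derive(Debug, PartialEq)]
enum SessionState { Running, Terminated }

/// All DKG sessions a node is currently taking part in.
#[derive(Default)]
struct DkgSessions {
    sessions: BTreeMap<SessionId, SessionState>,
}

impl DkgSessions {
    /// A new DkgStart adds a session; earlier sessions keep running.
    fn start(&mut self, id: SessionId) {
        self.sessions.entry(id).or_insert(SessionState::Running);
    }

    /// Marks a known session as having completed its voting round.
    fn mark_terminated(&mut self, id: SessionId) {
        if let Some(state) = self.sessions.get_mut(&id) {
            *state = SessionState::Terminated;
        }
    }

    /// Sessions that finished and could be put forward for handover.
    fn terminated(&self) -> Vec<SessionId> {
        self.sessions
            .iter()
            .filter(|(_, state)| **state == SessionState::Terminated)
            .map(|(id, _)| *id)
            .collect()
    }

    /// Once the elders have agreed on a winner (decided elsewhere),
    /// every other session is dropped, finished or not.
    fn on_winner_chosen(&mut self, winner: SessionId) {
        self.sessions.retain(|id, _| *id == winner);
    }
}
```

The design choice worth noticing is that `start` never cancels anything: losing sessions are only cleaned up after a winner is known.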

Conclusion

In short, the new DKG focuses on being very resilient to message loss, removing the need for timers and making sure everyone reaches termination eventually without possible timeouts. It also makes concurrent DKGs a feature to select the best candidates in a race to termination between DKGs.


Useful Links

Feel free to reply below with links to translations of this dev update and moderators will add them here:

:russia: Russian ; :germany: German ; :spain: Spanish ; :france: French; :bulgaria: Bulgarian

As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!

48 Likes

First think that’s a hat trick!!

Thanks to all the team now to read

22 Likes

It’s safe to be second now

@Neik would have been utterly disconsolate if he didnt make his hat trick

Gave me time to read it as well
Thanks to all for the hard work and yet another fine explainer.

If all works as planned, then we have a very robust piece of code that accomplishes what was previously thought impossible.

21 Likes

first is the worst, second is the best, third is one with the hairy chest.

EDIT: Now to read :slight_smile:

18 Likes

Is it written somewhere how many bugs you have already worked on?

6 Likes

Yes look at MaidSafe · GitHub
Check each of the repositories and look under the “Issues” tab to get started

11 Likes

Thanks team! It’s getting closer!

11 Likes

Thanks so much to the entire Maidsafe team for all of your hard work! :racehorse:

7 Likes

Has all this bug squashing and optimisation this year led to any improvements in the stability of test nets?

Community test nets don’t seem any more stable to me (feel free to correct me, I never participated so that’s just an outsider’s impression), but what about internal ones?

Also, I’m curious if there are any specific metrics measuring network stability?

6 Likes

Nice update Maid team, thanks.

Is the team using a fork of Quic/Quinn? Just wondering if updates to its codebase are being accounted/controlled.

Cheers

6 Likes

Yes, it’s a large complex machine which works when all the bits work. Fixing broken bits always helps. It’s never bad or wrong to fix any bug you find. We need them all gone.

No, we use vanilla quinn, that is why we think qp2p has introduced an error in how quinn/quic works. I hope so anyway, as we need to process thousands of messages per second and currently we are way behind that. My feeling is this is a major issue we should crack very soon.

26 Likes

It sounds like this messaging bottleneck has been an issue for a long time and we’re getting close to finding the cause. :pray: :slight_smile:

11 Likes

Thx 4 the update and all your hardwork Maidsafe devs

Love these readups, keep em coming

Although it will take some time to digest this info

Keep hacking super ants

10 Likes

The more I look at coding development the more I realize math is so crucial. Good job @maidsafe

10 Likes

Thank you for the heavy work team MaidSafe! I add the translations in the first post :dragon:


Privacy. Security. Freedom

8 Likes