Update 20 October, 2022


This week @joshuef brings some news on progress with comms (communications between nodes). As patient readers will know, this has been a blocker for a while now, but we see a definite light at the end of this particular tunnel.

General update

A quick round-up of what the team are up to.

@anselme has been investigating a rogue key attack which affects some other implementations of BLS and has found ours is vulnerable too. He has submitted a fix to the blsttc crate. He’s also looking into a bug that’s preventing clients from joining the network in tests.

@chriso is working on chunk and register signing, which is where data (or its Xorname, in the case of a chunk) is signed by the elders to make it valid on the network.

Related to this, @bochaco is making changes to the client API for commands and chunks, and debugging the errors those changes throw up on CI.

Mostafa is busy on test cases for our consensus mechanism, and @bzee is continuing to poke at qp2p, where we believe there’s a deep-seated flaw that’s affecting connectivity.

Comms progress

One thing we’re working on is removing our connection reduction code and relying on the underlying quinn code to drop connections after a shorter timeout. This way we’re assuming less, and no longer second-guessing what’s open and when.

In our testing this has had a positive impact on client tests, reducing the likelihood that our connections are closed part way through.

Bi streams

In parallel, we’re also moving to streamline client/node comms using bi-directional streams. This means that we remove some state management complexity and will just wait on a response from elders. Previously (a long time ago in node-land), we’d used this kind of stream to communicate with clients, but managing the response stream was a complicated nightmare. Now the node is much simpler (due to the work done over the last year and a half or so), and this is much more manageable.
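
As a rough illustration of the request/response shape described above (not the actual sn_client, sn_node or qp2p code), here is a minimal std-only sketch. A standard library channel stands in for a QUIC bi-directional stream, and the `Request`, `Response`, `BiStream`, `spawn_elder` and `query` names are all hypothetical. The point is only that each request carries its own reply path, so the client keeps no table of pending messages and just waits.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical message types, for illustration only.
#[derive(Debug, PartialEq)]
enum Request {
    GetChunk(u64),
}

#[derive(Debug, PartialEq)]
enum Response {
    Chunk(Vec<u8>),
    NotFound,
}

/// Stand-in for one bi-directional stream: a request paired with its own reply path.
struct BiStream {
    request: Request,
    reply_to: mpsc::Sender<Response>,
}

/// Spawns a stand-in "elder" that answers each request on the stream it arrived on.
fn spawn_elder() -> mpsc::Sender<BiStream> {
    let (tx, rx) = mpsc::channel::<BiStream>();
    thread::spawn(move || {
        for stream in rx {
            // Assumption: the elder only holds the chunk named 1.
            let response = match stream.request {
                Request::GetChunk(1) => Response::Chunk(vec![0xAB]),
                Request::GetChunk(_) => Response::NotFound,
            };
            // If the client has gone away there is nobody to answer, so ignore the error.
            let _ = stream.reply_to.send(response);
        }
    });
    tx
}

/// Client side: open a stream, send the request, then simply wait for the reply.
fn query(elder: &mpsc::Sender<BiStream>, request: Request) -> Option<Response> {
    let (reply_to, reply_rx) = mpsc::channel();
    elder.send(BiStream { request, reply_to }).ok()?;
    reply_rx.recv().ok()
}
```

Calling `query(&elder, Request::GetChunk(1))` blocks until the reply arrives on that same stream; there is no separate response channel to match up afterwards.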

Reduced retries

We’ve also been looking to remove anything that has been covering up these issues (such as our comms abstractions as mentioned above), and also client layer retries. We now have ACK (acknowledgement) messages sent back to the client (they were introduced a few months back). These tell us when a command has been seen by elders. But we were still just querying until a chunk was returned. @bochaco has been looking to be stricter here… not allowing retries and just saying “we’ve seen the ACK… so why are we not seeing success first time?”. This has exposed some errors in file read/write. (It looks like deserialising the storage commands can take longer than deserialising queries… so even if the queries are sent afterwards, we’re processing them first).
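
To make that suspected ordering problem concrete, here is a small deterministic sketch. It is not taken from the node code: the `Msg`, `Incoming`, `process` and `example_inbox` names are hypothetical, and the arrival times and decode costs are made-up numbers. It only shows how a query sent after a store can still be handled first if messages are processed as soon as they finish deserialising.

```rust
use std::collections::HashSet;

/// Hypothetical message kinds, for illustration only.
enum Msg {
    Store(u64),
    Query(u64),
}

/// A received message with made-up timing figures (arbitrary time units).
struct Incoming {
    msg: Msg,
    arrived_at: u32,
    decode_cost: u32,
}

/// Processes the inbox and returns the outcome of each query (true = chunk found).
/// With `by_decode_completion` set, messages are handled in the order they finish
/// deserialising rather than the order they arrived.
fn process(mut inbox: Vec<Incoming>, by_decode_completion: bool) -> Vec<bool> {
    if by_decode_completion {
        inbox.sort_by_key(|m| m.arrived_at + m.decode_cost);
    } else {
        inbox.sort_by_key(|m| m.arrived_at);
    }
    let mut stored = HashSet::new();
    let mut results = Vec::new();
    for m in inbox {
        match m.msg {
            Msg::Store(name) => {
                stored.insert(name);
            }
            Msg::Query(name) => results.push(stored.contains(&name)),
        }
    }
    results
}

/// A store that is slow to decode, followed by a query that is quick to decode.
fn example_inbox() -> Vec<Incoming> {
    vec![
        Incoming { msg: Msg::Store(7), arrived_at: 0, decode_cost: 5 },
        Incoming { msg: Msg::Query(7), arrived_at: 1, decode_cost: 1 },
    ]
}
```

In arrival order the query finds the chunk. In decode-completion order the query is ready at time 2 and the store at time 5, so the query misses, which is the kind of first-time failure that retries would have hidden.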

Reduced node processing loads

In an effort to regulate node command handling more properly, we had previously added code that was meant to organise incoming messages and order them for processing. We’ve since seen that this was not having the impact we were looking for at all (in fact it just caused all messages to be processed one after the other, regardless of priority).

Removing this code has allowed a lot of node simplification. Most significantly, we’ve been able to move node process handling off-thread in a lock-free way (after our single-threaded push removed the vast majority of locking code).
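
One common way to get this kind of off-thread handling without explicit locks is to let a single thread own the state and feed it messages over a channel. The sketch below shows that general pattern only; it is an assumption about the shape, not the node's actual implementation, and `Cmd` and `spawn_handler` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// Hypothetical commands, for illustration only.
enum Cmd {
    Store { name: u64, data: Vec<u8> },
    Get { name: u64, reply: mpsc::Sender<Option<Vec<u8>>> },
}

/// Spawns a handler thread and returns a sender for commands plus a join handle
/// that yields the number of stored chunks once every sender has been dropped.
fn spawn_handler() -> (mpsc::Sender<Cmd>, thread::JoinHandle<usize>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        // The map is owned by this thread alone, so no Mutex or RwLock wraps it.
        let mut chunks: HashMap<u64, Vec<u8>> = HashMap::new();
        for cmd in rx {
            match cmd {
                Cmd::Store { name, data } => {
                    chunks.insert(name, data);
                }
                Cmd::Get { name, reply } => {
                    // Ignore the error if the requester has already gone away.
                    let _ = reply.send(chunks.get(&name).cloned());
                }
            }
        }
        chunks.len()
    });
    (tx, handle)
}
```

The thread receiving from the network only has to send a `Cmd` and move on, so slow command handling does not hold up incoming messages.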

Impact

Previously, we could track messages coming in, being queued for processing, being handled and then queued for sending. Under load, this process could take considerable time. Sometimes it was relatively fast, sub-second. Sometimes, with all the command processing and messaging IO… it could take 20 seconds.

Now we’re seeing messages routinely coming in to nodes, being processed immediately and messages going out sub-millisecond.

This is much closer to what we’d have expected from our comms previously, and feels very much like the right direction.

Next steps

We’re looking to fully incorporate the bi-directional streams into clients, removing a lot of state management there. And we’ll also be aiming to have this same flow in elder->adult comms where responses are required, which should further simplify things there too. It should actually allow our ACK messages to be reflective of adult storage, too, rather than just elders saying they’ve seen a message, which should also help with failed verifications.


Useful Links

Feel free to reply below with links to translations of this dev update and moderators will add them here:

:russia: Russian ; :germany: German ; :spain: Spanish ; :france: French; :bulgaria: Bulgarian

As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!

51 Likes

I can’t possibly be first… again!?

20 Likes

There he is! Get him lads!

17 Likes

So many changes…
How do they impact stability?
(average amount of hours network stays alive)

14 Likes

Itching to see too, so I built yesterday but only had 4 nodes join out of 15.
May try some branches if impatience gets the better of me.

12 Likes

Hopefully get a scottefc86 friendly testnet/comnet soon :joy:

12 Likes

wow! sounds like code reductions are continuing …

Ten years ago 1,000,000+ lines of code; but with diligent non-stop engineering, two years from now, the complete and awesome SN is an amazing …

BASH one-liner!

:rofl: :smile:

Seriously though, keep hacking ants! Thanks to all for the great work :wink:

13 Likes

Does anyone know how to look up the current number of lines of code?

8 Likes

Full repo, inc client etc.

--------------------------------------------------------------------------------
 Language             Files        Lines        Blank      Comment         Code
--------------------------------------------------------------------------------
 Rust                   234        66804         7928         9453        49423
 Markdown                33        27301         7550            0        19751
 Bourne Shell            16         1686          233           90         1363
 Toml                    12          634           59            5          570
 Makefile                 1          196           23           11          162
 PowerShell               1           80            4            0           76
 Plain Text               7           68            1            0           67
 Python                   1           70            6            5           59
--------------------------------------------------------------------------------
 Total                  305        96839        15804         9564        71471
--------------------------------------------------------------------------------
21 Likes

Node only

--------------------------------------------------------------------------------
 Language             Files        Lines        Blank      Comment         Code
--------------------------------------------------------------------------------
 Rust                    69        20067         2412         2258        15397
 Markdown                 7         7964         2147            0         5817
 Toml                     1          121           10            1          110
 Plain Text               2           14            0            0           14
--------------------------------------------------------------------------------
 Total                   79        28166         4569         2259        21338
--------------------------------------------------------------------------------
18 Likes

Sir, Hat off to you.

12 Likes

gogogo team! it’s really getting close!

12 Likes

Thanks for the update, devs one and all.
It would appear at a quick glance that the tactic of swatting bugs as and when they appear seems to be working.
Those wishing to gaze at multi-coloured shapes on a chart may be less enthused. Also showing that simplifying code has both short and long term gains - Bi streams in this instance for the longer term gain. We see a benefit in reduced node processing loads as well.

Is this where OTLP was useful? Whatever, sounds like a major speed-up

Looks like there have been several major gains recently that will have benefits in the next steps outlined.

I am really unsure as to what more information could be provided with or without JIRA. Thanks to all involved.
And also thanks to the folk who were NOT directly involved in the code: Jim Collinson and Heather Burns (and the finance guy, whose name I have forgotten, sorry) are beavering away, making sure that Maidsafe/SAFE can best make use of these coding advances.

Thank you to every one of you.

EDIT: the title pic – is that looking south down Loch Etive onto the back of Cruachan?

12 Likes

I don’t respond a lot, as with many things I don’t fully understand the impact when it comes to coding, or whether it is good or bad. But the thing I really appreciate is the dedication this team puts into it! And for many years already!! Not only working on the code, but also getting everything in place legally so nothing can stop the launch.

Keep up the good work!

18 Likes

How is this comparable to some well known projects? How many lines would you expect in an average app or program?

3 Likes

Which, considering it has been like that for a while now, raises the question:

Is the code being tested in (semi) real scenarios, such as multiple DO droplets, or only on local machines? And do these numbers refer to the former or latter scenario?

In any case, great work as always @maidsafe team!

9 Likes

Sounds like qp2p has been the source of many hassles. Amazing to see the code refactoring paying dividends though. As always, great effort team!

10 Likes

Thanks so much to the entire Maidsafe team for all of your hard work! :racehorse:

6 Likes

Looks like Heather is no longer part of the team, finished in Sept 2022.

Her long-awaited new book on privacy came out just a couple of days ago:

10 Likes

Thanks @mav, I’d missed that.

5 Likes