Update 7 July, 2022


Credit to Spatium for the 4 title images you will see used over the coming weeks :heart_eyes:

Some promising news on the bug-hunting front as the team have tracked down a nasty little critter that was causing deadlocks during data replication under churn. For the technically-minded, the bug in question was a reference to a read-locked node that persisted even after it should have been dropped, and tracking it down was made possible by the ongoing work to strip out multithreading and simplify the code. As such, we thought it would be a good time to bring folks up to speed on the work we’re doing in sn_node in this and other regards. @joshuef takes the controls this week.
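For those who like to see it in code, here’s a minimal Rust sketch of the general pattern behind that sort of bug: a read-lock guard kept alive longer than intended, so a later write on the same data can never proceed. The types and names here are illustrative only, not the actual sn_node structures.

```rust
use std::sync::RwLock;

// Hypothetical stand-in for node state; not the real sn_node types.
struct NodeState {
    peers: Vec<String>,
}

fn main() {
    let node = RwLock::new(NodeState { peers: vec!["peer-a".into()] });

    // Correct: the read guard lives only for this block, so the read lock is
    // released before anyone tries to write.
    let first_peer = {
        let guard = node.read().unwrap();
        guard.peers.first().cloned()
    };
    println!("first peer: {first_peer:?}");

    // If the guard above were instead stashed somewhere long-lived (or held
    // across an await point in async code), this write() would block forever,
    // waiting for a read lock that is never released - a deadlock.
    node.write().unwrap().peers.push("peer-b".into());
    println!("peers now: {}", node.read().unwrap().peers.len());
}
```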

General progress

@yogesh is up and away with refactoring the codebase to get rid of SledDB. As mentioned in previous updates, we were on course to replace SledDB (used for storing registers) as it had write limitations and was no longer being actively maintained. We therefore benchmarked a few alternatives over the previous weeks, each with its own pros and cons. Based on that analysis, the team decided to go with the in-house disk storage implementation that we already employ for storing chunks. This is a keep-it-simple implementation with no bells and whistles that performs on par with the other alternatives, while also freeing us from another external dependency that might not be well maintained. Currently it only supports storing chunks, so it needs to be revamped to support storing registers, and that work is well underway.
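As a rough illustration of the keep-it-simple idea (this is not sn_node’s actual API; the type, key format and file layout below are assumptions), a disk store of this sort boils down to one file per content address:

```rust
use std::{fs, io, path::PathBuf};

// Hypothetical content-addressed disk store: one file per 32-byte key.
struct DiskStore {
    root: PathBuf,
}

impl DiskStore {
    fn new(root: impl Into<PathBuf>) -> io::Result<Self> {
        let root = root.into();
        fs::create_dir_all(&root)?;
        Ok(Self { root })
    }

    fn path_for(&self, key: &[u8; 32]) -> PathBuf {
        // One file per address; a real store would likely shard by key prefix.
        self.root.join(hex(key))
    }

    fn put(&self, key: &[u8; 32], value: &[u8]) -> io::Result<()> {
        fs::write(self.path_for(key), value)
    }

    fn get(&self, key: &[u8; 32]) -> io::Result<Vec<u8>> {
        fs::read(self.path_for(key))
    }
}

fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

fn main() -> io::Result<()> {
    let store = DiskStore::new("./register_store")?;
    let key = [7u8; 32];
    store.put(&key, b"register entry")?;
    println!("read back: {:?}", String::from_utf8_lossy(&store.get(&key)?));
    Ok(())
}
```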

@qi_ma has been looking at the churn tests, which have been failing and running slowly, in part - we think - because of the bug described in the introduction (and below). There have also been spurious messages sent to clients, which could well be part of the same issue.

Also this week, @Heather_Burns made a triumphant post-plague return to the UK Parliament, where she represented MaidSafe at a roundtable of small tech businesses and start-ups who stand to be collateral damage in the government’s determination to regulate the internet around Facebook. Heather reports that the MPs she met with are some of the few who actually ‘get this’ and want to prevent that from happening, so hopefully our message was heard loud and clear. Inadvertently, she may have also destroyed the government. Whoops.

State of the Node

Many folk may well be wondering what the state of the node is now, and why we’ve been taking on certain tasks.

Here’s a small rundown of the state of nodes:

Fewer Locks

Over the last few weeks, we’ve moved the node to a single-threaded setup. This hasn’t had much impact on performance either way (although it did improve things a wee bit), but the aim here was simplifying the code base. With only one thread to worry about (by default… we may still spawn others as needed), we don’t need as many checks and balances throughout the code to let it operate concurrently.
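For the curious, here’s a minimal sketch of what a single-threaded async setup can look like with Tokio (assuming tokio 1.x with the “rt” feature; the real sn_node wiring is more involved). With everything on one thread, shared state can live behind a plain Rc/RefCell instead of a lock:

```rust
use std::{cell::RefCell, rc::Rc};
use tokio::{runtime::Builder, task::LocalSet};

fn main() {
    // A current-thread runtime: all tasks run on this one thread.
    let rt = Builder::new_current_thread().build().unwrap();
    let local = LocalSet::new();

    // Shared state on a single thread needs no RwLock, just Rc + RefCell.
    let msg_count = Rc::new(RefCell::new(0u32));

    local.block_on(&rt, async {
        let counter = Rc::clone(&msg_count);
        // spawn_local keeps the task on this thread, so !Send types are fine.
        tokio::task::spawn_local(async move {
            *counter.borrow_mut() += 1;
        })
        .await
        .unwrap();
    });

    println!("messages handled: {}", msg_count.borrow());
}
```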

The clearest example of this is the ability to remove RwLocks (read-write locks) from the codebase. These wee structures allow code to wait until a given piece of data is not being modified on another thread before we edit it. Neat! Yes. But also dangerous. Many of the recent bugs and troubles we’ve sorted in sn_node have come about because of these waits going on indefinitely (a situation we refer to as a deadlock).

It’s here that the move to a single-threaded sn_node really shines, as we can remove the vast majority of these locks from the codebase, and with them we drop an entire class of bugs (and a really painful class to debug, too!). So not only is the code cleaner, clearer and more sane, it should be less error-prone to boot!

We’re ~80% of the way through this now. We’ve removed many locks over internal node structures and replaced them with one higher-level lock, which is much easier to reason about (although the transition did turn up another deadlock!). A good example of these simplicity improvements translating into speed can be seen in some of our Benchmarks.
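To make that concrete, here’s a hypothetical before/after sketch (made-up types, not the real node structs): many per-field locks collapse into one coarse lock, leaving a single obvious lock order and far fewer ways to deadlock.

```rust
use std::sync::{Arc, RwLock};

// Before: a lock per field. Any code path that takes two of these in
// different orders is a potential deadlock.
#[allow(dead_code)]
struct NodeBefore {
    peers: RwLock<Vec<String>>,
    section_key: RwLock<Option<[u8; 32]>>,
}

// After: plain fields behind one higher-level lock.
#[derive(Default)]
struct NodeState {
    peers: Vec<String>,
    section_key: Option<[u8; 32]>,
}

struct Node {
    state: Arc<RwLock<NodeState>>,
}

impl Node {
    fn handle_join(&self, peer: String) {
        // One acquisition covers everything this handler touches: there is
        // nothing nested to deadlock against.
        let mut state = self.state.write().unwrap();
        state.peers.push(peer);
        state.section_key.get_or_insert([0u8; 32]);
    }
}

fn main() {
    let node = Node { state: Arc::new(RwLock::new(NodeState::default())) };
    node.handle_join("peer-a".into());
    println!("peers: {}", node.state.read().unwrap().peers.len());
}
```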

This is a big win.

Testnets

Another area where we’ve been working to improve things, and will continue to do so, is testnets and debugging. Some testnet errors have been the result of failed infrastructure: for example, a DigitalOcean droplet starting up, but the node on it failing to start properly for some reason. Recent changes have made node restarts more stable and cleaned up the background looping code, all of which should help in situations such as this.

We’re also looking to improve the logging situation, with clearer Cmd logs output separately from the more verbose run-logs. These should be easier for folk to parse and, hopefully, we’ll be able to hook them into the ElasticSearch instance we have for internal testnets.
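As a rough sketch of the kind of split we mean (using the tracing crates; the “cmd” target, file names and levels below are assumptions, not the actual sn_node log wiring), two subscriber layers can route concise Cmd events and the verbose run-log to separate files:

```rust
// Assumed dependencies: tracing 0.1, tracing-subscriber 0.3, tracing-appender 0.2.
use tracing::Level;
use tracing_appender::rolling;
use tracing_subscriber::{
    filter::{LevelFilter, Targets},
    fmt,
    layer::SubscriberExt,
    util::SubscriberInitExt,
    Layer,
};

fn main() {
    // Two plain (non-rolling) log files under ./logs.
    let cmd_file = rolling::never("./logs", "cmds.log");
    let run_file = rolling::never("./logs", "node.log");

    // Layer 1: only events emitted under the hypothetical "cmd" target.
    let cmd_layer = fmt::layer()
        .with_writer(cmd_file)
        .with_ansi(false)
        .with_filter(Targets::new().with_target("cmd", Level::INFO));

    // Layer 2: the full, verbose run-log.
    let run_layer = fmt::layer()
        .with_writer(run_file)
        .with_ansi(false)
        .with_filter(LevelFilter::TRACE);

    tracing_subscriber::registry().with(cmd_layer).with(run_layer).init();

    // Goes to cmds.log (and node.log); plain trace output goes only to node.log.
    tracing::info!(target: "cmd", cmd = "HandleMsg", "dispatched");
    tracing::trace!("low-level detail");
}
```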

Membership

Membership continues to improve, although it has been held up by the single-threaded switch. We’re looking to close the gap between what nodes understand and vote upon, and what is shared within a SectionAuthorityProvider (SAP). This should also improve stability.

Dysfunction

Another class of bugs we’re now tackling is somewhat more ‘meta’: nodes voting other nodes as dysfunctional (and so offline). Doing this well is tricky (when is a node dysfunctional versus just having a temporary bad time?). That pain is acutely felt on continuous integration (CI), where the machines are less powerful, so we can sometimes see nodes being voted offline, which can break a test cycle…

To that end, we’ve recently expanded the test suites in sn_dysfunction, and we’re looking to keep expanding and improving them, working towards reproducible situations where we can be sure that offline votes occur only because of genuinely bad nodes.
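To give a flavour of the idea (the names and thresholds below are made up, not the sn_dysfunction API), dysfunction tracking boils down to counting recent issues per node and only flagging a node when it is a clear outlier against its peers, so a node merely having a temporary bad time isn’t voted off:

```rust
use std::collections::HashMap;

#[derive(Default)]
struct DysfunctionTracker {
    // Count of recent issues (failed probes, missed replies, ...) per node name.
    issues: HashMap<String, u32>,
}

impl DysfunctionTracker {
    fn note_issue(&mut self, node: &str) {
        *self.issues.entry(node.to_string()).or_insert(0) += 1;
    }

    /// Flag a node only if its issue count is well above the average of all
    /// tracked nodes (here: more than 2x), not on any single failure.
    fn dysfunctional_nodes(&self) -> Vec<&str> {
        if self.issues.is_empty() {
            return Vec::new();
        }
        let avg = self.issues.values().sum::<u32>() as f64 / self.issues.len() as f64;
        self.issues
            .iter()
            .filter(|(_, &count)| count as f64 > 2.0 * avg.max(1.0))
            .map(|(name, _)| name.as_str())
            .collect()
    }
}

fn main() {
    let mut tracker = DysfunctionTracker::default();
    for _ in 0..2 {
        tracker.note_issue("node-a"); // the odd hiccup: not flagged
        tracker.note_issue("node-b");
    }
    for _ in 0..20 {
        tracker.note_issue("node-c"); // a persistent offender: flagged
    }
    println!("flagged: {:?}", tracker.dysfunctional_nodes());
}
```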

Memory and CPU

With the single-thread changes, the recent reworking of data republishing, and some simplifications in the node codebase, sn_node generally runs really well, averaging ~130MB of memory for an elder (on a Mac, local testnet) and ~70MB for adults.

Right now, any large spikes are clear red flags for us, and that’s a really nice place to be, as it makes the cause of such issues much easier to spot.

Data

The recent discovery of sled bugs has put a dampener on our confidence in data stability, but we’re working on removing sled right now. This change should hopefully simplify and unify data storage in sn_node, so behaviour should be more consistent across all data types.

And SooOOoooo…

While it may not always feel like things are progressing, given that the community doesn’t see everything being worked on and improved day to day, things are definitely going in the right direction. Every testnet we spin up is (in general) more stable, and if a bug does appear, all these recent changes, plus the ElasticSearch server tracking the droplets’ stats, make it much easier to see where issues lie - and it should only get easier from here!


Useful Links

Feel free to reply below with links to translations of this dev update and moderators will add them here:

:russia: Russian ; :germany: German ; :spain: Spanish ; :france: French; :bulgaria: Bulgarian

As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!

62 Likes

One two three first.!!

Now to read. Really looking forward to this one :slight_smile:
Well done to all the team for all the hard work.

16 Likes

Second :thinking: and a whole 8 minutes late!

15 Likes

Thanks so much to the entire Maidsafe team for all of your hard work! :racehorse:

There’s been some more buying of eMAID on the DEX. Here is the link to the latest activity MaidSafeCoin (eMAID) Token Tracker | Etherscan. :racehorse:

14 Likes

Yeeehaaaa finally

9 Likes

Great work again, thank you!

As a non-techie I wonder…

… how can you even have different nodes, when you’re spinning them up in DO? Is it just because of this, or something else too:

Is there a way to eliminate that class of problems altogether?

4 Likes

From your post it may seem that single-threaded code is better.
No. It is generally worse. But it is easier to write.
Thinking about multithreading when the network can’t survive for several hours could be seen as premature optimization.
But once the network gains more stability, it will be time to think more closely about multithreading again.
(I hope no one comes up with a multiprocess solution - I hate those)

9 Likes

DO nodes are mostly the same, though some can fail for any reason, whereas nodes run from homes are all different. Also, some nodes fill faster than others, etc.

We are working on this too.

13 Likes

lol :rofl:

Coupled with bug hunting, that sounds like a good week’s work all round. :+1:

13 Likes

Probably the least of your worries, but do you think the issues with being able to join as a node will be resolved for the next public testnet?

7 Likes

It should be that you can join if the network needs a node, but only if :wink:

14 Likes

Thanks for this amazing update! :clap:
You guys are getting real close now. Keep it up!

12 Likes

Fantastic update Maid team - one of the best yet IMO, thanks.

Looking like testnets in a month or two might reach full stability! Looking forward to that day when it goes up and stays up with no problems.

Cheers

10 Likes

Progress is very apparent after a 3 week absence! Lord help me… I’m drowning. :man_swimming:

15 Likes

I’m loving the switch to single threaded. I’ve written some complex threaded applications before, but when the node is designed to be horizontally scalable and independent, why not take advantage of that? The tests so far use a minimal amount of CPU, possibly a bunch of storage, and potentially a ton of bandwidth, which was always going to be the concern alongside latency. So many mind-bending bugs no longer there… A NAS with fast storage and a fast connection can take a few nodes, and a Raspberry Pi or the like can run one or two just fine alongside a simpler code base. Easy to deploy nodes in droplets. Win-win as I see it.

18 Likes

I agree, great work! Oh man, we are getting closer and closer! Cheers

6 Likes

Thx 4 the upload Maidsafe devs

:clap: :clap: :clap: great job team

An elegant solution to the cutoff of SledDB

We’re so close to SAFENET

Keep hacking super ants

10 Likes

Excellent update. Great job team Maidsafe :+1::+1:

7 Likes

Thank you for the heavy work, team MaidSafe! I’ve added the translations to the first post :dragon:


Privacy. Security. Freedom

14 Likes