Update 10 February, 2022

As most here will doubtless know, adults are nodes that store data and serve it on demand. But what if they start acting childish, refusing to store or give up data, or at least doing so more slowly than expected? For the sake of the network we need to demote or eject such wayward nodes, but before we do so we must redistribute the data they’re holding. We also need to provide meaningful error messages to clients and other nodes when their attempts to store data fail. That’s what we’re delving into this week.

General progress

@bochaco has been working on the safe shell. If you type safe into the console (once safe_network is installed, of course), you enter the shell, meaning you don’t need to type safe before each subsequent command. With all the recent CLI updates this aspect has been left behind a bit, so he’s been putting that right, as well as attending to some tidying and refactoring in the node code in preparation for the upcoming membership changes.

In the DBC labs, @danda is working on integrating Ring CT into the network, including making DBCs more user-friendly for use with the two kinds of keys: a long-lived base owner key for interacting with third parties, such as for donations, and a derived one-time-use key for interacting with mints and the spentbook. He’s also working on test features that can be turned on or off for debugging and optimisation, and has reduced the number of calls required to iterate over the spentbook.

On data replication duties, @yogesh has made progress on a pull model where adults will be told what data they should be holding and will start to pull data from the network automatically to ensure the right number of copies are held for redundancy. More on that below.
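
At its core, the pull model reduces to a set difference: the chunks an adult has been told to hold, minus the chunks it already has. The minimal Rust sketch below shows only that step. The u64 address type and the function name chunks_to_pull are hypothetical simplifications for illustration and are not the safe_network API. The real network uses 256-bit XorNames, and the actual fetch would be a network request.

```rust
use std::collections::BTreeSet;

/// Hypothetical chunk address. The real network uses 256-bit XorNames;
/// a u64 keeps the sketch short.
type ChunkAddress = u64;

/// Pull model sketch: the elders tell an adult which chunks it should hold,
/// and the adult works out which of those it is missing and must fetch.
fn chunks_to_pull(
    expected: &BTreeSet<ChunkAddress>,
    held: &BTreeSet<ChunkAddress>,
) -> Vec<ChunkAddress> {
    // Everything the adult should have but does not, in address order.
    expected.difference(held).copied().collect()
}

fn main() {
    // Addresses the elders say this adult is responsible for.
    let expected = BTreeSet::from([0x10, 0x20, 0x30]);
    // Addresses actually present in local storage.
    let held = BTreeSet::from([0x20]);

    for addr in chunks_to_pull(&expected, &held) {
        // In a real node this would be a fetch from another adult holding a copy.
        println!("pull chunk {:#x}", addr);
    }
}
```

Because the adult computes what it is missing, the elders only have to publish the expected set rather than push every chunk themselves.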

And @joshuef and @Qi_ma have been looking at client connection issues thrown up by the playground and the comnet. We may have squashed one CPU-intensive bug (at least, we can no longer reproduce it at the moment), so we’ll be looking to verify that in an upcoming playground.

Preemptive data replication and adult errors

Properly functioning adults are the backbone of the network, and it’s imperative that should an adult start to misbehave it is replaced and the data it holds smoothly relocated. This is called preemptive data replication and is detailed in PR #976.

Liveness checks

Elders need to ensure that adults are performing properly. They perform regular liveness checks in which the performance of a node is compared with that of its 3 nearest neighbours. If the pending operations count at a node is 5 times higher than at its neighbours, it will be demoted and its data redistributed. To prepare for this eventuality, once the pending ops count of a node is 2.5 times higher than its neighbours’ (these parameters will be optimised during testing), preemptive replication starts, with elders currently initiating this replication.
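
To make the two thresholds concrete, here is a rough Rust sketch of the comparison an elder might make. Only the 2.5x and 5x ratios and the comparison against nearest neighbours come from the text above. The following details are assumptions for illustration:

* Comparing against the neighbours' average.
* The floor of one pending op, which stops an idle neighbourhood from making any node look infinitely slow.
* All of the names.

```rust
/// Outcome of one liveness check for a single adult (hypothetical names).
#[derive(Debug, PartialEq)]
enum Liveness {
    Healthy,
    /// Start copying this adult's data elsewhere before it is demoted.
    ReplicatePreemptively,
    /// Demote the adult and redistribute its data.
    Demote,
}

// Ratios quoted in the update; still to be tuned during testing.
const REPLICATE_RATIO: f64 = 2.5;
const DEMOTE_RATIO: f64 = 5.0;

/// Compare an adult's pending-operation count with the average of its
/// nearest neighbours (the update uses 3). Averaging is an assumption.
fn check_liveness(node_pending: u64, neighbour_pending: &[u64]) -> Liveness {
    if neighbour_pending.is_empty() {
        return Liveness::Healthy; // nothing to compare against
    }
    let avg = neighbour_pending.iter().sum::<u64>() as f64 / neighbour_pending.len() as f64;
    // Assumed floor of 1.0 so an idle neighbourhood doesn't inflate the ratio.
    let baseline = avg.max(1.0);
    let ratio = node_pending as f64 / baseline;
    if ratio >= DEMOTE_RATIO {
        Liveness::Demote
    } else if ratio >= REPLICATE_RATIO {
        Liveness::ReplicatePreemptively
    } else {
        Liveness::Healthy
    }
}

fn main() {
    let neighbours = [10, 12, 8]; // average of 10 pending ops
    println!("{:?}", check_liveness(9, &neighbours)); // Healthy
    println!("{:?}", check_liveness(30, &neighbours)); // ReplicatePreemptively
    println!("{:?}", check_liveness(55, &neighbours)); // Demote
}
```

Because the measure is relative, a slow node only stands out if it is slow compared with the nodes around it, not against a fixed timeout.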

When there is churn in a section (nodes leaving and joining) we need to make sure that data is replicated and distributed to the newly promoted nodes. When an adult is full, it also needs to tell the elders to store the chunk at another adult.
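
One way to picture both cases (churn and a full adult) is as a single placement function that elders re-run. This Rust sketch assumes that chunks go to the non-full adults closest to the chunk’s address by XOR distance. The 64-bit names, the Adult struct and select_holders are hypothetical simplifications, not the network’s actual types.

```rust
/// Hypothetical 64-bit node and chunk names; the real network uses 256-bit XorNames.
type Name = u64;

/// What the elders know about one adult in the section (hypothetical shape).
struct Adult {
    name: Name,
    /// Set once the adult has reported that it is full.
    full: bool,
}

/// Choose the `copies` adults closest to `chunk` by XOR distance, skipping
/// any adult that has reported it is full. Re-running this after churn
/// picks up newly promoted adults that are closer to the chunk.
fn select_holders(chunk: Name, adults: &[Adult], copies: usize) -> Vec<Name> {
    let mut candidates: Vec<&Adult> = adults.iter().filter(|a| !a.full).collect();
    // XOR distance: a smaller value means "closer" in the address space.
    candidates.sort_by_key(|a| a.name ^ chunk);
    candidates.into_iter().take(copies).map(|a| a.name).collect()
}

fn main() {
    let adults = [
        Adult { name: 0b0001, full: false },
        Adult { name: 0b0010, full: false },
        Adult { name: 0b1001, full: true }, // closest to the chunk, but full
        Adult { name: 0b1100, full: false },
    ];
    // For chunk 0b1000 the distances are 9, 10, 1 (skipped as full) and 4.
    println!("{:?}", select_holders(0b1000, &adults, 2)); // [12, 1]
}
```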

All of this requires some self-awareness by the adult node as to how full it is. Checking space is quite resource-intensive, so we only do it in steps of approximately 10% of the available space.
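
The stepped check might look something like the Rust sketch below. It keeps a cheap running total of bytes written and only signals that the expensive measurement should run when usage enters a new 10% band. Several details are assumptions:

* Measuring against total capacity rather than remaining space.
* The SpaceTracker name.
* Tracking an in-memory estimate rather than querying the disk.

```rust
/// Hypothetical tracker for how full an adult's storage is. Measuring real
/// disk usage is expensive, so the sketch only reports when usage crosses
/// into a new 10% band of capacity.
struct SpaceTracker {
    capacity: u64,
    used: u64,
    /// Highest 10% band (0..=10) already reported.
    last_band: u64,
}

impl SpaceTracker {
    fn new(capacity: u64) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        SpaceTracker { capacity, used: 0, last_band: 0 }
    }

    /// Record `bytes` written. Returns `Some(percent)` when usage enters a new
    /// 10% band, which is when the expensive check (and perhaps a heads-up to
    /// the elders) would run; otherwise `None`.
    fn record_write(&mut self, bytes: u64) -> Option<u64> {
        self.used = self.used.saturating_add(bytes);
        // u128 avoids overflow when multiplying large byte counts.
        let band = (self.used.min(self.capacity) as u128 * 10 / self.capacity as u128) as u64;
        if band > self.last_band {
            self.last_band = band;
            Some(band * 10)
        } else {
            None
        }
    }
}

fn main() {
    let mut tracker = SpaceTracker::new(1_000);
    for write in [50, 60, 50, 300, 600] {
        match tracker.record_write(write) {
            Some(pct) => println!("crossed {}% full, run space check", pct),
            None => println!("no check needed"),
        }
    }
}
```

The same band crossings are a natural point for an adult to warn the elders that it is filling up, before it ever has to answer with NodeFull.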

Adult errors

We need to generate errors to advise clients, and the system as a whole, when data is not being stored as it should be. This can happen for a variety of reasons. These errors will be made part of the network protocol, with which all nodes must comply if they are to stay in the network.

Below is a list of errors that can arise at an adult node during PUT/GET operations (not counting AE and DKG errors) and the responses we are working on.

CouldNotStoreData - the adult errored during storage, due to the adult’s storage mechanism. This is the adult’s fault. Possible causes are a failure to create directories, problems with the file system or with the database used to store registers, corrupted registers, or incorrect file paths.

DataError - the node did not store the data because of a data error. This is either the client’s fault or the result of the message having been corrupted. Either way (we cannot know which), this error should be returned to the client.

NodeFull - the node is full! An error message is returned to the elder requesting the storage. We could possibly penalise adults that have not informed us beforehand that their storage levels are getting low.
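
The three cases above could be captured as a small Rust enum that records who is to blame and where each error is reported. This is a hypothetical mirror of the list for illustration. The real safe_network error types and fields may differ, and reporting CouldNotStoreData to the elder rather than the client is an assumption of the sketch.

```rust
/// Who the update says is responsible for each kind of failed store.
#[derive(Debug, PartialEq)]
enum Blame {
    /// The adult's own storage mechanism failed.
    Adult,
    /// Bad data from the client, or a message corrupted in transit; we cannot tell which.
    ClientOrMessage,
    /// Not misbehaviour as such: the adult has simply run out of space.
    Capacity,
}

/// Hypothetical mirror of the adult errors listed above.
#[derive(Debug)]
enum AdultError {
    CouldNotStoreData(String),
    DataError(String),
    NodeFull,
}

impl AdultError {
    fn blame(&self) -> Blame {
        match self {
            AdultError::CouldNotStoreData(_) => Blame::Adult,
            AdultError::DataError(_) => Blame::ClientOrMessage,
            AdultError::NodeFull => Blame::Capacity,
        }
    }

    /// DataError goes back to the client; NodeFull goes to the requesting elder.
    /// Sending CouldNotStoreData to the elder is an assumption of this sketch.
    fn forward_to_client(&self) -> bool {
        matches!(self, AdultError::DataError(_))
    }

    fn describe(&self) -> String {
        match self {
            AdultError::CouldNotStoreData(why) => format!("CouldNotStoreData: {}", why),
            AdultError::DataError(why) => format!("DataError: {}", why),
            AdultError::NodeFull => "NodeFull".to_string(),
        }
    }
}

fn main() {
    let errors = [
        AdultError::CouldNotStoreData("failed to create directory".to_string()),
        AdultError::DataError("chunk failed validation".to_string()),
        AdultError::NodeFull,
    ];
    for e in &errors {
        let to = if e.forward_to_client() { "client" } else { "elder" };
        println!("{} -> blame {:?}, report to {}", e.describe(), e.blame(), to);
    }
}
```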

Error spam

As well as informing clients, we can also use these errors as signals that something has gone wrong. At the same time, we need to avoid overwhelming the elders with too much messaging back and forth.

In handling these errors we need to ensure we are not opening up new attack vectors, allowing malicious users to knowingly perform illegal operations to DDoS the network by generating masses of error messages. As a future measure, it is possible that we could blacklist clients observed to be behaving in this way.
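
Blacklisting is only floated above as a possible future measure, so this Rust sketch is purely illustrative. It counts the errors each client triggers within a fixed time window and blacklists any client that exceeds a limit. The ClientId type, the ErrorSpamGuard name and the fixed-window counting scheme are all hypothetical, and the window and limit values are arbitrary.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical client identifier.
type ClientId = u32;

/// Counts error-generating requests per client within a fixed time window
/// and blacklists clients that exceed a limit.
struct ErrorSpamGuard {
    window_secs: u64,
    max_errors: u32,
    /// client -> (start of current window, errors seen in that window)
    counts: HashMap<ClientId, (u64, u32)>,
    blacklist: HashSet<ClientId>,
}

impl ErrorSpamGuard {
    fn new(window_secs: u64, max_errors: u32) -> Self {
        ErrorSpamGuard { window_secs, max_errors, counts: HashMap::new(), blacklist: HashSet::new() }
    }

    /// Record that `client` triggered an error at time `now` (in seconds).
    /// Returns true if the client is blacklisted after this error.
    fn record_error(&mut self, client: ClientId, now: u64) -> bool {
        if self.blacklist.contains(&client) {
            return true;
        }
        let entry = self.counts.entry(client).or_insert((now, 0));
        if now.saturating_sub(entry.0) >= self.window_secs {
            *entry = (now, 0); // the old window has expired; start a new one
        }
        entry.1 += 1;
        if entry.1 > self.max_errors {
            self.blacklist.insert(client);
            self.counts.remove(&client);
            true
        } else {
            false
        }
    }

    fn is_blacklisted(&self, client: ClientId) -> bool {
        self.blacklist.contains(&client)
    }
}

fn main() {
    let mut guard = ErrorSpamGuard::new(60, 3);
    for t in 0..5 {
        if guard.record_error(7, t) {
            println!("client 7 blacklisted at t={}", t);
            break;
        }
    }
    println!("client 7 blacklisted: {}", guard.is_blacklisted(7));
}
```

A fixed window is the simplest scheme to reason about. A sliding window or token bucket would smooth out bursts at window boundaries, at the cost of a little more state per client.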


Useful Links

Feel free to reply below with links to translations of this dev update and moderators will add them here:

:russia: Russian ; :germany: German ; :spain: Spanish ; :france: French; :bulgaria: Bulgarian

As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!

First to read!

Second!

Must be 10 chars

Third is lucky :slight_smile:

Thanks for all the effort that has gone into this, looking forward to testing the various aspects as soon as feasible.

First to ask, when playground?

Could this mean that someone running a simple RasPi at home could be booted off because it’s compared to a bunch of professional server rigs? Or is that limit of 5 times higher sufficient to account for differences in computing/network performance?

Awesome!! Best news of the day. :beers:

Yes, this was my big concern with the last change from the original idea of rewarding the fastest node to deliver.
Storing data is almost worthless if retrieval is going to be too slow.

It could, iff many, many folk used server rigs, though I doubt that will be the case. It will be interesting, as there has to be some synchronicity, i.e. we cannot treat a slow response as OK no matter how slow, if you see what I mean? In pure async a node could reply at infinity minus 1 second and be OK, but that is nuts: what’s to stop all nodes doing that, so that the network does not move at all? So there have to be limits, and instead of time-based limits, we think limits relative to neighbours are more natural.

IIRC there is work ongoing to determine what other useful work a lower spec adult can do for the network other than storing and serving chunks.

“From each according to their means, to each according to their needs” Socialism, baby :slight_smile:

That does make a lot of sense @dirvine, thanks. In a dev update some time ago it was already mentioned how difficult it is to work with timeouts/limits in such an asynchronous world.
What would happen if very few of the weaker nodes refuse to upgrade but the network requires new adults because it’s growing? Will the rewards increase and act as an incentive to upgrade/for other nodes to join?

oh nice, that sounds like a great idea.

We are still working on it really. The first step was to get some mechanism for measuring relative usefulness in place. It’s hard to say what the impact on users will be, though.

Thanks so much to the entire Maidsafe team for all of your hard work! :racehorse:

Thank you for your enthusiasm, @maidsafe team … Very thx.

Great work, I feel :+1:

Maybe we can learn from what other projects running decentralized networks are doing and from the similar issues they have run into, like Helium.com, the people’s wireless network. In less than two years, they have reached 566K “hotspots”, basically all RasPi-based miners. They have now worked a rule into the protocol: when more than 2 hotspots are in one area, the reward ratio for the work performed is lessened. https://explorer.helium.com/hotspots

Thx 4 the update Maidsafe devs.

An upcoming playground is always good news, can’t wait. Great that Comnet is also contributing to finding bugs.

Keep hacking super ants

re: “Error Spam”: the method I’ve seen in the past for regulating “error spam”, as you call it (systems re-sending the same info/alert/alarm messages), is commonly found in APM (Application Performance Monitoring) tools such as Dynatrace, AppDynamics and New Relic, and in some F/W systems/services. It is generally referred to as “false positive” handling. That is, some filtering mechanism is linked to a regulating rule set, or these days to some AI/ML logic somewhere in the network. That filter/discard/log-locally function is triggered by an observing “Manager of Managers” support system, which provides advice to the CRUD mechanism doing the filtering (likely the node in this case), so the client doesn’t get overwhelmed with error spam or false positives. Usually the client can adjust how frequently this happens. Hopefully this helps… great work, keep going!