Update 28 October, 2021

Bugs can be hard to find, harder to eliminate and sometimes even harder to explain. In these updates we try to lay out the latest news on the progress we’re making and our plans for next steps, but in some ways that’s the easy bit. Like saying we’re making steady progress up a certain creek without saying how far away our destination is, how many paddles we have at our disposal, and ignoring the crocodiles, rapids, and other unpleasantness that lies in the way. The hard bit is explaining those bugs without getting lost in the weeds. It’s a dirty job, but in the interest of providing much-needed context, someone has to sift through the logs. @joshuef drew the short straw.

General progress

The API and CLI code has now been merged into the main Safe Network repo, though there is no new release just yet as some CLI tests are failing. The release process also needs to be adjusted to take the additions to this repo into account. @Chriso is on the case.

Also tantalisingly close is the removal of the Connection Pool from qp2p, with that functionality moving into Safe Network where we can fine-tune it. The Connection Pool kept client connections open, but in a way that was hard to refine and configure as we wanted. Removing it simplifies qp2p and removes a lot of edge cases - and almost certainly a lot of bugs.

Meanwhile @Joshuef flushed away a huge blocker this week, managing to reduce message load in some circumstances (between well-behaved nodes) from ~65,000 messages down to ~500, all being well.

@bochaco and @yogesh have been digging into how sections keep a record of each other, how this process can be made more efficient, and where and in what format this information is stored.

And @Lionel.faber has been looking at prioritising message types. Some messages are more important than others. BLS DKG messages, which handle authorisation, should be given top priority. Nothing important should happen without agreement by the elders. Freeing the channels for these messages will speed everything up. At the other end of the spectrum, queries, data commands and error messages can happily wait their turn without affecting performance.

Bugs

I don’t think anyone ever claimed Safe to be simple. It’s not. But it’s not impossibly complex either. We have the parts laid out as folk have seen in various testnets. And since the last one (which, we know, feels a while ago), we’ve been hammering away trying to make everything more stable.

The bugs behind the instability are often touched on in updates, but in quite a techy fashion. So here we wanted to give a bit more of a general overview. Something a bit more accessible to folk who don’t like diving about in a text editor for hours at a time.

You have your classic bugs

2+2=5

Or dropped messages between nodes (your post doesn’t arrive).

Or a connection issue, where most of what you want arrives. But the screw you need did not get through. (And now you need to try and make that happen again, so you can see why that screw doesn’t arrive.)

Race conditions, where an issue only arises if some code or program completes faster than another part of the system. (So perhaps you only see it if your horse LuckyProblems comes in just before OtherwiseWeWork and just after ThisAlreadyHappened; any other combination goes along fine.)

Loops. Things keep happening because they trigger themselves again at the end. Possibly forever. They’ll often cause everything to hang or straight up crash because they keep eating the program’s resources.

Hangs. Also known as deadlocks. These bugs are the Catch-22 of the bug world. You can continue only if number=5, but number can only be set to 5 once you continue. This is obviously a symptom of a classic bug, but it also often walks hand in hand with a race condition, so you don’t notice it until it’s too late (and now you aren’t really sure why this would be happening… :thinking:).

Then you have some more Safe specifics

Which are often just symptoms of the above…

Message amplification. This is when we might expect to get 5 messages through to our storing nodes, but instead get 500. Which in turn causes another 15,000 to come back. There’s normally a bug in there (2+2=5) when we see this, or it can be that the system isn’t doing what we thought it would, so we need to rethink the design. (We recently had AE-retries naively sent to all elders. To compound this, the next set of retries would therefore be sent from all elders… to all elders. :chart_with_upwards_trend:)

Sometimes we get a lack of throughput. Messages aren’t dropped. But things are slow. Why!? Sometimes a combination of all of the above.

At the moment, after some refactoring we have too much throughput. Now this isn’t an issue by itself, but it can often expose various other issues… (take your pick from any of the bug types mentioned in this post!)

Forks! Forks in the path of our section knowledge (who came before us… who beget who)… if nodes don’t agree for some reason (a buggy reason), well then we can perhaps have two sets of valid knowledge, but don’t know which is actually relevant to our current situation.

Data not found. An obvious one… but why? Well, any of the above could lead to the data not actually being PUT in the first place. So good luck finding that which does not exist!

No split! We need splits to keep the network healthy (to split up the workload more easily and maintain resistance to hacks, for example). Not splitting might be a bug in the DKG algorithm (Distributed Key Generation… Or how we give our elders their authority).

Choosing the wrong target. Sometimes the messaging system seems to work and the parcel is delivered. But we’ve actually sent it to the wrong person (or sent it to a whole neighbourhood/section!?).

Being too excited! Sometimes we do something as soon as we can. But the network, on its necessary route to eventual consistency, isn’t actually ready yet. (Imagine you PUT a chunk, but it hasn’t all been stored yet, and you already try to GET it.) It can seem like there’s a bug. But actually, if you try again in a few seconds, maybe it’s all there and fine. You thought you had a bug, but you were just too keen.

SooOOooo

So. That’s a rough wee rundown of various things we can see and come across in the system. They can happen per node, per client, or per section… And only sometimes, or only on a Tuesday on an obscure Linux build. And when you see the problem, it may be hiding behind 3 or 4 different bug types before you get to the root of the issue.

All of which we’re looking at in a system of 45 nodes and multiple clients (on average at the moment during internal testing).

Safe isn’t so complex when you think about it, conceptually at least (share data across computers). But it also isn’t as simple as it can be, yet, which is why we’re still chipping away at issues, refactoring things (making them simpler) as well as implementing new features (and sometimes they are aimed squarely at helping to debug).

Removing unnecessary code and complexity helps to get us to something simpler, which, alongside solving your classic bugs in the system, is often one of the most important ways to solve bugs. Less code, less problems. :bulb:

We are getting there! It doesn’t always feel fast, but it always feels like we are pushing forwards (even when we sometimes need to go backwards a little bit).


Useful Links

Feel free to reply below with links to translations of this dev update and moderators will add them here:

:russia: Russian ; :germany: German ; :spain: Spanish ; :france: French; :bulgaria: Bulgarian

As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!

68 Likes

first in a long time!

21 Likes

damn that was fast!

15 Likes

Not first. Close, but interest is obviously peaking!

Really nice explanation @joshuef of bugs and I think some help understanding why they can be hard to track down and then even fix.

I think simplicity is even more important in Safe Network than other systems because it’s a multi node network, where any node can break or disappear, try to be malevolent, or just be fast one minute, slow the next etc. Even simple nodes create complexity in operation when they are joined and working in combination, and then you have users/clients injecting requests at random intervals, valid or invalid.

Imagine a football ground full of people all talking to each other, passing around messages on bits of paper which must get to the right person somewhere in the ground, who then does some operation which ends in them sending another piece of paper somewhere else and so on.

Something isn’t working! How do you debug that football ground?!

22 Likes

2nd 4th in a row

13 Likes

So many people want to know what bugs are and how many etc. so we pulled together an example of life in the Debugging the impossible network :smiley: So many times these expose, oh crap why did we not change this when we did Ae/BRB etc. and others it’s just plain errors. In a network like this, it has to be complex behaviours and as @joshuef says from the simplest looking of code. It is just like ants and the sophistication from apparent complexity. Imagine debugging an ant colony and your head will be in the right place. Mostly we have ants that act more like spiders or something and we need to adjust them to be ants again :slight_smile:

A huge benefit though is knowledge sharing: as bugs are discussed, people can see where their understanding was incorrect. Those parts mean there is code that is behaving wrongly and we need to alter it slightly. Even today there was a find that surprised us in how we had nodes joining: they were spiders and not ants, but tomorrow they will be ants again :slight_smile: @yogesh and @qi_ma are altering that part right now.

34 Likes

Talk of parcels has me wondering if logging could be actioned only alongside a certain action, if there is an id, but that’s extra work that might glue up everything… still if it moves like treacle but in the same way?..

btw does that above suggest all the code is now there within GitHub - maidsafe/safe_network: The Safe Network Core. API message definitions, routing and nodes, client core api., just in the wrong order?.. If it’s the sum of all things, that would be interesting to look at.

:hammer: :+1:

8 Likes

This is an excellent update which should help allay some of the very understandable frustration. And written in a non-techy style - or at least as non-techy as this could be. Certainly opened my eyes to some of the issues and I consider myself reasonably clued on this.
As always thanks to all the team for what you are doing, the world will appreciate you for it very soon now. Keep at it, folks :slight_smile:

20 Likes

Ensure everyone is playing the same game… the same as .gov shoring up the boundaries is more important than their worrying about the detail of what people are doing.

Consistency and logic will fix most problems… keeping it simple is key, over managing processes adds complexity.

Would be a simpler world if we could just trust everyone to do it right! :smiley:

12 Likes

If that were the case, we wouldn’t need Safe in the first place. :wink:

12 Likes

You’d still want privacy to defend against competition; and security of authority; and freedom… the uses of Safe go beyond extremes and cover off so much of what is human activity.

I’ve been thinking a lot on this recently… the normal internet gives a bit of freedom, but privacy and security are broken; the mix of those three interests brings all sorts of different application interests.

It’s going to be a lot of fun once the devs have found that on switch.

8 Likes

Not telling anyone how to do their job but I’ve heard this is an effective way to eliminate bugs

Looking forward to the next streamline testnet. Keep at it team :clap:t2:

16 Likes

Striving for eventual consistency

EDIT Just noticed there is a joke in here about my current recovery from gastro-enteritis, but it’s teatime…

5 Likes

I thought they were using a scalpel to remove two legs from each spider?

3 Likes

Impossible network with mysterious Bugs! :crazy_face:

12 Likes

Scalpels are for wimps :joy:

5 Likes

So you are saying you’re up shit creek but don’t worry, you have multiple paddles, right?

3 Likes

Thanks so much to the entire Maidsafe team for all of your hard work! :racehorse:

11 Likes

I have a suggestion.

Can the team model the network after the human brain? Which is often broken, beyond repair, it’s root issues being hidden deep behind many layers of delusion; and when unable to get to the root layer, years of pain surface, causing manic episodes, childish greed, untruthfulness, and narcissism; and yet can still be high functioning, and really quite successful.

These kinds of brains can also run large networks of corporations, or become presidents.

If only network programming could be as broken, and still work so splendidly. :wink:

Great bug examples! Thanks for all of your bug crushing and humble sanity team!

8 Likes