In this week’s update we’re delving into everyone’s least favourite topic – bugs. Bugs are not an easy topic to pin down because in a complex, multifaceted, cutting-edge system like Safe it can be hard to know where one bug starts and another ends. Is this bug in our code or a third-party library, or perhaps in how they fit together? Is that really a bug fix or just an optimisation waiting to happen?
That said, it is something we get asked about a fair bit, and since a major point of the testnets is to enable us to track down and exterminate the wee blighters, we should at least have a go at describing the situation. So this week we’ll have a look at one of the things that really has to be 100% in place before Fleming – eliminating data loss – and the fixes we’re putting in place to get there.
@Chriso has been working away on ARM builds for aarch64, not helped by the fact that his new Raspberry Pi arrived with a duff microSD card. A replacement arrived on Tuesday, so there shouldn’t be too long to wait now. Thanks to @folaht, @stout77 and others for trying out the existing builds.
We’ve been making progress on anonymous transactions using Pedersen commitments and blinded values to hide amounts. More on those at a later date!
@JimCollinson has been looking at rejigging the UX flows based on the evolution of the network to include DBCs, prepaid uploads, multisig transactions and CRDT-enabled online/offline capabilities.
Speaking of no-longer-the-new-boy Anselme, here’s a bit from him:
Coming from a systems programming specialisation (UNIX), I’ve hopped from field to field going from blockchain to back-end programming, data engineering and machine learning. I’m passionate about computer viruses, systems programming, software security, decentralisation, machine learning and programming languages.
I like to think of code as poetry. Despite that, I want to code for a greater goal, for things that matter. That’s why I’m here at MaidSafe.
It goes without saying that data loss is an absolute no-no for Safe (data vanishing from the permaweb would be a particularly bad look). Data chunks stored on Safe nodes all around the world must be able to survive churn events, local outages and Byzantine actors. Those of you who have helped out with testing will know we’re not there yet with eliminating data loss.
So what are the bugs that lead data to disappear over time? As we hinted at above, it is likely many things overlapping.
One of the bugs – fixed this week – happened on section split, where Adults trying to replicate data would sometimes communicate with what they thought was another Adult but which on split had been promoted to an Elder - and so ignored the message. Testing suggests that this one may have gone to meet its maker. Good riddance. This type of bug should be much rarer when Anti-Entropy is fully implemented. And rarer still now we have a reliable test for this.
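To make that failure mode a bit more concrete, here is a toy Python sketch. It is not taken from the Safe codebase: the `Node`, `replicate` and `replicate_with_refresh` names are hypothetical, and the "refresh the view before sending" step is only one illustrative way to avoid the problem, not a description of the actual fix or of how Anti-Entropy works. It simply shows how a sender with a stale picture of the section can end up with fewer copies of a chunk than it intended.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    ADULT = "adult"
    ELDER = "elder"


@dataclass
class Node:
    """Hypothetical stand-in for a section member; not a real Safe type."""
    name: str
    role: Role
    chunks: set = field(default_factory=set)

    def handle_replicate(self, chunk: str) -> bool:
        """Store the chunk if we are an Adult; an Elder ignores the message."""
        if self.role is not Role.ADULT:
            return False
        self.chunks.add(chunk)
        return True


def replicate(chunk, believed_adults, nodes, copies):
    """Send the chunk to the first `copies` names in the sender's view.

    Returns the names of the nodes that actually stored it.
    """
    stored = []
    for name in believed_adults[:copies]:
        if nodes[name].handle_replicate(chunk):
            stored.append(name)
    return stored


def replicate_with_refresh(chunk, nodes, copies):
    """Rebuild the view from current roles first, so promoted nodes are skipped."""
    current_adults = sorted(n.name for n in nodes.values() if n.role is Role.ADULT)
    return replicate(chunk, current_adults, nodes, copies)


nodes = {name: Node(name, Role.ADULT) for name in ["a", "b", "c", "d"]}
stale_view = sorted(nodes)       # the sender's view, taken before the split
nodes["a"].role = Role.ELDER     # on split, "a" is promoted to Elder

stale_result = replicate("chunk-1", stale_view, nodes, copies=2)
fresh_result = replicate_with_refresh("chunk-2", nodes, copies=2)
print(stale_result)  # ['b']
print(fresh_result)  # ['b', 'c']
```

With the stale view, one of the two intended copies is silently dropped because the message lands on a node that is now an Elder. With an up-to-date view, both copies end up on Adults.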
Other fixes to tackle data loss issues have included changing the routing flow between different Adults in a section and between Adults and Elders to ensure that chunks are actually where we think they are and have not been lost due to faulty logic.
We’ve also seen that the out-of-memory issue was likely a cause of data loss. We think we’re on top of that one now, but a few questions remain. There may be more.
As you can hopefully see, these bugs are a tricky thing to nail down. They can exist in multiple places, overlap, and one can trigger another. They can be hard to reproduce, requiring specific network conditions or messages crossing over at specific times.
We’re getting there though, and wouldn’t have a chance of tracking them down without all the people who have stepped up to help with the testing. Even when they are short-lived, the testnets give the team invaluable information, so thanks once again for all the debugging so far!