Credit to Spatium for the 4 title images you will see used over the coming weeks
Some promising news on the bug-hunting front, as the team have tracked down a nasty little critter that was causing deadlocks during data replication under churn. For the technically minded, the bug in question was a reference to a read-locked node that persisted even after the node had been dropped, and tracking it down was made possible by the ongoing work to strip out multithreading and simplify the code. So we thought it would be a good time to bring folks up to speed on the work we’re doing in `sn_node` in this and other regards. @joshuef takes the controls this week.
@yogesh is up and away with refactoring the codebase to get rid of SledDB. As mentioned in previous updates, we were on course to replace SledDB (used for storing registers) as it had write limitations and was no longer being actively maintained. So over the previous weeks we benchmarked a few alternatives, each with its own pros and cons. The outcome of that analysis was a decision to go with the in-house disk storage implementation we already employ for storing chunks. This is a keep-it-simple implementation with no bells and whistles that performs on par with the alternatives, while also freeing us from another external dependency that might not be well maintained. Currently it only supports storing chunks, so it needs to be revamped to support storing registers too, and that work is well underway.
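To give a feel for what “keep-it-simple disk storage” means, here is a minimal sketch of the idea: each piece of data lives in its own file, named after its address. All names here (`DiskStore`, `put`, `get`) are illustrative, not the actual `sn_node` API.

```rust
use std::fs;
use std::path::PathBuf;

// A toy address-to-file store: one file per stored item, no database engine.
struct DiskStore {
    root: PathBuf,
}

impl DiskStore {
    fn new(root: impl Into<PathBuf>) -> std::io::Result<Self> {
        let root = root.into();
        fs::create_dir_all(&root)?;
        Ok(Self { root })
    }

    // Hex-encode the address so the file name is filesystem-safe.
    fn path_for(&self, addr: &[u8]) -> PathBuf {
        let name: String = addr.iter().map(|b| format!("{b:02x}")).collect();
        self.root.join(name)
    }

    fn put(&self, addr: &[u8], bytes: &[u8]) -> std::io::Result<()> {
        fs::write(self.path_for(addr), bytes)
    }

    fn get(&self, addr: &[u8]) -> std::io::Result<Vec<u8>> {
        fs::read(self.path_for(addr))
    }
}

fn main() -> std::io::Result<()> {
    let store = DiskStore::new(std::env::temp_dir().join("chunk_store_demo"))?;
    store.put(b"\x01\x02", b"hello chunk")?;
    assert_eq!(store.get(b"\x01\x02")?, b"hello chunk");
    println!("round-trip ok");
    Ok(())
}
```

Because the filesystem does the heavy lifting, extending such a store from chunks to registers is mostly a matter of how the values are serialised, which is why this route beats carrying a whole embedded database as a dependency.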
@qi_ma has been looking at the churn tests, which have been failing and running slowly, in part (we think) because of the bug described in the introduction (and below). There have also been spurious messages sent to clients, which could well be part of the same issue.
Also this week, @Heather_Burns made a triumphant post-plague return to the UK Parliament, where she represented MaidSafe at a roundtable of small tech businesses and start-ups who stand to be collateral damage in the government’s determination to regulate the internet around Facebook. Heather reports that the MPs she met with are some of the few who actually ‘get this’ and want to prevent that from happening, so hopefully our message was heard loud and clear. Inadvertently, she may have also destroyed the government. Whoops.
Many folk may well be wondering what the state of the node is now, and why we’ve been taking on certain tasks.
Here’s a small rundown of the state of nodes:
In the last weeks, we’ve moved the node to a single-threaded setup. This ostensibly hasn’t had much impact on performance (although it did improve things a wee bit), but the aim here was about simplifying the code base. With only one thread to worry about (by default… we may still spawn others as needed), we don’t need to have as many checks and balances throughout the code to enable it to operate in a concurrent fashion.
The clearest example of this is the ability to remove `RwLock`s (read-write locks) from the codebase. These wee structures allow code to wait until a given piece of data is not being modified on another thread before we edit it. Neat! Yes. But also dangerous. Many of the recent bugs and troubles we’ve sorted in `sn_node` have come about due to these waits going on indefinitely (a situation we refer to as a deadlock).
It’s here that the move to a single-threaded `sn_node` really shines, as we can remove the vast majority of these locks from the codebase, and with them we drop an entire class of bugs (really painful-to-debug bugs, too!). So not only is the code cleaner, clearer and more sane, it should be less error-prone to boot!
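To show the class of bug in question, here is a small sketch with `std::sync::RwLock` (field names are made up for illustration). A read guard that lives longer than intended blocks every writer; a blocking `write()` at that point would hang forever, which is exactly the deadlock shape described above. We use `try_write()` so the conflict can be observed without actually hanging:

```rust
use std::sync::RwLock;

// Toy stand-in for node state held behind a lock.
struct Node {
    peers: RwLock<Vec<String>>,
}

// Demonstrates the hazard: while a read guard is alive, no writer can
// proceed. A blocking write() on this same thread would deadlock, since
// std's RwLock is not reentrant; try_write() reports the conflict instead.
fn writer_blocked_while_reading(node: &Node) -> bool {
    let _guard = node.peers.read().unwrap();
    node.peers.try_write().is_err()
}

fn main() {
    let node = Node {
        peers: RwLock::new(vec!["peer-a".into()]),
    };
    // The writer is blocked for as long as the read guard lives...
    assert!(writer_blocked_while_reading(&node));
    // ...and free again once every guard has been dropped.
    assert!(node.peers.try_write().is_ok());
    println!("no deadlock: guards were dropped in time");
}
```

With a single-threaded node there is no second thread to wait on, so guards like this can simply become plain borrows checked at compile time, and the whole wait-forever failure mode disappears.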
We’re ~80% of the way through this now. Having removed many locks over internal node structures and replaced them with one higher-level lock, the code is much easier to reason about (although the transition did turn up another deadlock!). A good example of the improvements in simplicity translating to speed can be seen in some of our benchmarks.
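The “one higher-level lock” idea can be sketched like this (state and field names are hypothetical): instead of each field carrying its own lock, one coarse lock guards the whole state, so there is only ever one lock to take and no lock ordering to get wrong.

```rust
use std::sync::{Arc, Mutex};

// Previously each field might have sat behind its own RwLock; here a
// single lock guards everything at once.
#[derive(Default)]
struct NodeState {
    peers: Vec<String>,
    chunks_stored: usize,
}

fn main() {
    let state = Arc::new(Mutex::new(NodeState::default()));
    {
        // One guard gives consistent access to every field. With only one
        // lock, the "thread A takes lock 1 then 2, thread B takes 2 then 1"
        // deadlock pattern cannot occur.
        let mut s = state.lock().unwrap();
        s.peers.push("peer-a".into());
        s.chunks_stored += 1;
    } // guard dropped here, before anything else can want the lock
    assert_eq!(state.lock().unwrap().chunks_stored, 1);
    println!("state updated under a single lock");
}
```

The trade-off is less fine-grained concurrency, but in a (by default) single-threaded node that concurrency was buying little anyway, which is why the swap simplifies things at no real cost.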
This is a big win.
Another area where we’ve been working to improve things, and will continue to do so, is testnets and debugging. Some testnet errors have been the result of failed infrastructure: for example, a DigitalOcean droplet started, but the node on it failed to come up properly for some reason. Recent changes have made node restarts more stable and cleaned up background looping code, all of which should help in situations such as this.
We’re also looking to improve the logging situation, with clearer `Cmd` logs output separately from the more verbose run-logs. These should be easier for folk to parse and, hopefully, we’ll be able to hook them into the ElasticSearch instance we have for internal testnets.
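The gist of separating the two log streams can be sketched in a few lines. This is a purely illustrative std-only toy (the real setup will use a proper logging framework); the point is just that each line carries a target, and `cmd`-targeted lines are routed to their own sink rather than interleaved with the verbose run-log:

```rust
// Toy log router: "cmd" lines go to one sink, everything else to the run-log.
#[derive(Default)]
struct Logs {
    cmd: Vec<String>,
    run: Vec<String>,
}

impl Logs {
    fn log(&mut self, target: &str, msg: &str) {
        let line = format!("[{target}] {msg}");
        // Routing by target keeps command handling readable on its own.
        if target == "cmd" {
            self.cmd.push(line);
        } else {
            self.run.push(line);
        }
    }
}

fn main() {
    let mut logs = Logs::default();
    logs.log("cmd", "HandleMsg: StoreChunk");
    logs.log("run", "connection pool refreshed");
    assert_eq!(logs.cmd.len(), 1);
    assert_eq!(logs.run.len(), 1);
    println!("cmd log: {}", logs.cmd[0]);
}
```

A separate, terse `Cmd` stream is also exactly the shape of data that is easy to ship to ElasticSearch for querying across droplets.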
Membership continues to improve, although it has been held up by the single-threaded switch. We’re looking to close the gap between what nodes understand and vote upon, and what is shared within a `SectionAuthorityProvider` (SAP). This should also improve stability.
Another class of bugs we’re now tackling is somewhat more ‘meta’: nodes voting other nodes as dysfunctional (and so offline). Doing this well is tricky (when is a node dysfunctional versus just having a temporary bad time?). That pain is acutely felt in continuous integration (CI), where the machines are less powerful, so we can sometimes see nodes being voted offline, which can break a test cycle…
To that end, we’ve recently expanded the test suites in `sn_dysfunction` and are looking to keep expanding and improving them, to get towards reproducible situations where we can be confident that offline votes are occurring only because of genuinely bad nodes.
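One common way to separate “dysfunctional” from “having a temporary bad time” is to compare each node against its peers rather than against a fixed threshold. Here is a hedged sketch of that idea; the function names and the outlier factor are made up for illustration and are not `sn_dysfunction`’s actual API:

```rust
use std::collections::HashMap;

// Flag a node only when its count of unanswered probes is an outlier
// versus the section average, so a generally slow section doesn't vote
// everyone offline at once.
fn dysfunctional(missed: &HashMap<&str, u32>, factor: f64) -> Vec<String> {
    let avg = missed.values().sum::<u32>() as f64 / missed.len() as f64;
    let mut out: Vec<String> = missed
        .iter()
        .filter(|(_, &m)| m as f64 > avg * factor)
        .map(|(name, _)| name.to_string())
        .collect();
    out.sort(); // deterministic order for reporting
    out
}

fn main() {
    let missed = HashMap::from([("a", 1), ("b", 2), ("c", 20)]);
    // Average is ~7.7 missed probes; only "c" exceeds twice that.
    assert_eq!(dysfunctional(&missed, 2.0), vec!["c".to_string()]);
    println!("outliers found: c");
}
```

A relative test like this also behaves better on underpowered CI machines: if every node is equally slow, the average rises with them and no one gets voted offline spuriously.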
With the single-threaded changes, the recent reworking of data republishing, and some simplifications in the node codebase, `sn_node` generally runs really well, averaging ~130MB memory usage for an elder (on a Mac; local testnet) and ~70MB for adults.
Right now, any large spikes are clear red flags for us, and that’s a really nice place to be, as it makes the cause of such issues much easier to spot.
The recent discovery of `sled` bugs has put a dampener on the feelings of stability in data, but we’re working on removing it right now. This change should hopefully simplify and unify the data storage in `sn_node`, so behaviour should be more consistent across all data types.
While it may not always feel to the community like things are progressing, given that not everyone sees what’s being worked on and improved day to day, things are definitely going in the right direction. Every testnet we spin up is (in general) more stable, and if a bug does appear, all these recent changes, plus the ElasticSearch server tracking droplets’ stats, make it much easier to see where issues lie, and it should only get easier from here!
Feel free to reply below with links to translations of this dev update and moderators will add them here:
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!