Happy New Year everyone! We’re back in the saddle again and raring to go.
Many thanks to @josh for setting up the recent community testnet, and to everyone who took part. Some of the reported anomalies mirror what we’ve seen in our own test results, and there were a few surprises too, including the max CPU spikes, which we’re looking into now. Cheers guys!
Just a quick one this week to let you know what we’re working on right now. Happy to say we have already tied up a few loose ends and are very much ready to roll.
Over time, we’ve considered various ways to calculate free space on the network. Recently, we have been traversing the few db directories and summing their sizes to calculate used space. Led by @anselme, we have now simplified data storage by replacing the advanced database with a simpler straight-to-disk process that uses a binary tree directory structure. This required us to replace the directory traversal with a process that counts every byte as it’s written and keeps a running total of used space, which is much quicker and more scalable now that we will have a deep tree of dirs instead of one or two. It will also greatly simplify having a node with chunks rejoin the network, because we don’t need to measure its storage every time, and it simplifies section splits too.
Anselme is also looking at how to easily calculate the storage space freed up when private registers (mutable data) are deleted.
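To illustrate the running-total idea, here is a minimal Rust sketch. The `UsedSpace` type, its method names and the fixed `max_capacity` are all hypothetical and are not taken from the actual node code. It only shows how a counter updated on each write and delete can replace a directory traversal.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical running total of bytes stored on disk. It is updated on every
/// write and delete, so no directory traversal is needed to know used space.
pub struct UsedSpace {
    max_capacity: u64,
    used: AtomicU64,
}

impl UsedSpace {
    pub fn new(max_capacity: u64) -> Self {
        Self { max_capacity, used: AtomicU64::new(0) }
    }

    /// Reserve `size` bytes before writing. Returns false and reserves
    /// nothing if the write would exceed the configured capacity.
    pub fn try_increase(&self, size: u64) -> bool {
        self.used
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |current| {
                current
                    .checked_add(size)
                    .filter(|total| *total <= self.max_capacity)
            })
            .is_ok()
    }

    /// Release `size` bytes after data is deleted (saturating at zero).
    pub fn decrease(&self, size: u64) {
        let _ = self
            .used
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |current| {
                Some(current.saturating_sub(size))
            });
    }

    /// Bytes currently counted as used.
    pub fn used(&self) -> u64 {
        self.used.load(Ordering::SeqCst)
    }

    /// Bytes still available under the configured capacity.
    pub fn free(&self) -> u64 {
        self.max_capacity.saturating_sub(self.used())
    }
}
```

With a counter like this, a write path would call `try_increase` with the chunk length before touching the disk, and a delete path would call `decrease` with the length of whatever was removed, so reading `used()` or `free()` costs the same however deep the directory tree gets.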
Talking of mutable data, we’ve been considering what sort of charges should be attached to the register data type. Pay-per-change would be very clunky, and we want to separate edits from the DBC process. Currently we’re thinking of charging a multiple of the price of an immutable chunk (blob) PUT for a register PUT, and allowing unlimited edits by the data owner (and parties they choose to share with) thereafter.
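The pricing idea currently under discussion can be written down in a few lines. This is only a sketch of the arithmetic: the function names are hypothetical, and the multiplier is left as a parameter because no value has been decided.

```rust
/// Cost of a register PUT as a multiple of the current chunk PUT price.
/// Both inputs are in the same (unspecified) token unit. Returns `None`
/// if the multiplication would overflow.
pub fn register_put_cost(chunk_put_cost: u64, multiplier: u64) -> Option<u64> {
    chunk_put_cost.checked_mul(multiplier)
}

/// Total paid over the lifetime of a register under this model: the one-off
/// PUT charge, with every later edit free, so `edits` does not affect it.
pub fn register_lifetime_cost(
    chunk_put_cost: u64,
    multiplier: u64,
    _edits: u64,
) -> Option<u64> {
    register_put_cost(chunk_put_cost, multiplier)
}
```

The point of the second function is that the number of edits drops out entirely, which is what keeps edits separate from the payment process.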
@bochaco has continued to refine the membership process, i.e. how we maintain the correct number of elders in a section and how we add new adult nodes to the section when required. When a new node requests to join it kickstarts a process that includes AE messages and the resource proof test, and once the elders agree to accept it, a message is sent back to the joining node. This flow is now integrated with the sn_membership crate - at least for adults being promoted to elders and elders leaving. However, the sn_membership process currently assumes that all nodes involved are voting members (i.e. elders) so it excludes promoting an adult to an elder. @bochaco and @davidrusu are working on this one now. We’ll explain more in a future update.
Meanwhile @lionel.faber has been looking at speeding up the CI/CD process with self-hosted GitHub runners on AWS. Native virtual machines in GitHub Actions can be rather slow - especially for Windows - which has proved to be a bottleneck in testing. By hosting the service ourselves on AWS we can use more powerful VMs to complete our workflows more quickly. Lionel is also documenting the Distributed Key Generation (DKG) process we use for agreement.
Staying on testing for a moment, sometimes tests can be too rigorous. How so? Well, tests are designed to catch every error as it happens, whereas a fault-tolerant network with CRDTs may be able to work around these glitches and eventually arrive at a guaranteed consistent state. So it can be wasteful to build tests that will catch everything - but it’s a tricky balancing act!
A case in point is a missing data error, like the one you may have seen on the testnet. Is the data really missing, or has it just not arrived yet? Perhaps it will show up later, or maybe it is there but there’s been another error.
In this scenario messaging between actors is nuanced too, and this is another topic of discussion. When a chunk has been successfully PUT, the client should be (optionally) told it has been stored, and thanks to CRDTs this should be 100% sure. If the PUT has apparently ‘failed’, however, this is not 100% certain because of asynchronicity and other factors, so the client needs options on how to proceed.
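One way to picture the asymmetry between a confirmed store and an apparent failure is a small decision function on the client side. Everything here is an assumption for illustration: the `PutOutcome` and `NextStep` types, the retry budget, and the idea of checking for the data before re-sending are not a description of the actual client API.

```rust
/// Hypothetical result of a chunk PUT as seen by the client.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PutOutcome {
    /// The network confirmed the chunk is stored: treated as certain.
    Stored,
    /// No confirmation arrived: the chunk may or may not be stored.
    Unconfirmed,
}

/// Hypothetical options open to the client after a PUT attempt.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NextStep {
    /// Nothing more to do.
    Done,
    /// Ask whether the data has arrived in the meantime, re-sending if not.
    QueryThenResend,
    /// Retry budget exhausted: surface the uncertainty to the caller.
    ReportUnconfirmed,
}

/// Choose what to do next. An unconfirmed PUT is never treated as a definite
/// failure, only as "not known to be stored yet".
pub fn next_step(outcome: PutOutcome, attempts_made: u32, max_attempts: u32) -> NextStep {
    match outcome {
        PutOutcome::Stored => NextStep::Done,
        PutOutcome::Unconfirmed if attempts_made < max_attempts => NextStep::QueryThenResend,
        PutOutcome::Unconfirmed => NextStep::ReportUnconfirmed,
    }
}
```

The design choice worth noting is that there is no `Failed` variant: the sketch only ever reports that a store could not be confirmed, leaving the final decision to the caller.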
@Qi_ma has been looking at the logs to track down the missing data error, and @yogesh is looking into the reason for the floods of AE messages that sometimes overwhelm communications between nodes. Are they related? We should know soon. Meanwhile, @joshuef has been looking at a bug that could cause issues in nodes that are being hammered by clients, causing the node’s memory to spike and occasionally crashing it. Right now we’re just capping the number of concurrent client messages at an arbitrary limit, but we’ll look to make this dynamic based on the node’s load as we progress.
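A fixed cap on concurrent client messages can be sketched with a counter and a guard object. This is an illustration using only the Rust standard library, not the node's actual implementation. `ClientMsgLimiter` and `Permit` are hypothetical names, and the limit is whatever the caller passes in.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical limiter that caps how many client messages are handled at once.
pub struct ClientMsgLimiter {
    max: usize,
    in_flight: Arc<AtomicUsize>,
}

/// Held while one client message is being processed. Dropping it frees the slot.
pub struct Permit {
    in_flight: Arc<AtomicUsize>,
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.in_flight.fetch_sub(1, Ordering::SeqCst);
    }
}

impl ClientMsgLimiter {
    pub fn new(max: usize) -> Self {
        Self { max, in_flight: Arc::new(AtomicUsize::new(0)) }
    }

    /// Take a slot if one is free. Returns `None` when the cap is reached,
    /// in which case the caller would reject or delay the message.
    pub fn try_acquire(&self) -> Option<Permit> {
        self.in_flight
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| {
                if n < self.max { Some(n + 1) } else { None }
            })
            .ok()
            .map(|_| Permit { in_flight: Arc::clone(&self.in_flight) })
    }

    /// Number of client messages currently being handled.
    pub fn in_flight(&self) -> usize {
        self.in_flight.load(Ordering::SeqCst)
    }
}
```

Making the cap dynamic, as planned, would amount to replacing the fixed `max` with a value derived from the node's current load. The sketch leaves that part out.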
Every step is another step closer.