You set up a test network, you upload some data, and something goes wrong. What happened, and where exactly did it occur? Tracking down where problems arise is a tricky challenge in any network, and particularly in decentralised networks where each node acts independently. This week @davidrusu walks us through statemaps, a diagnostic tool that shows us exactly which state each node is in at any point in time. It's a god's-eye view of the network which will undoubtedly make bug squashing a lot easier.
Thanks as always to everyone experimenting with local networks and comnets. We’re reasonably convinced that some of the issues with uploading large files are at the API layer, and we’re looking at that now.
@bochaco continues to refine the error reporting process to provide more meaningful messages to clients.
@anselme is looking at AE and gossip and how one can be a fallback for the other in case of communication failures.
On documentation, @jimcollinson is finalising the main whitepaper required by the Swiss authority FINMA. It’s an overview rather than a technical deep dive, so probably nothing new for most folks here, but ticking off those legal checkboxes ready for launch nonetheless.
@Chriso and @bochaco are tidying up what happens when a DBC is submitted for reissue. While checking that process, they found that some spent proofs were signed with a section key unknown to the section processing the reissue request.
In a highly concurrent system, it can be very difficult to see what’s going on. Nodes move through states incredibly quickly and trying to correlate messages across nodes can feel like you’re trying to recover a shredded document.
Statemaps let us recreate a partial picture of what happened in a network after the fact. They've been a very useful tool in understanding where nodes are spending their time.
We’ve instrumented the sn_node code base to log when it enters a state and again when it leaves a state. We can then process those logs to generate a statemap like the one below:
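To make the idea concrete, here's a minimal sketch of that kind of instrumentation. All names here (`NodeState`, `state_log_line`, the log format) are illustrative, not the actual sn_node code: each node emits one log line when it enters a state and another when it leaves, tagged with a timestamp and node id so the lines can later be correlated.

```rust
// Hedged sketch of state enter/exit instrumentation (hypothetical names,
// not the real sn_node code). Each transition emits a timestamped log line
// that a post-processor can later pair up into statemap intervals.

use std::fmt;
use std::time::{SystemTime, UNIX_EPOCH};

#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeState {
    Idle,
    Membership,
    Dkg,
    Handover,
    AntiEntropy,
}

impl fmt::Display for NodeState {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:?}", self)
    }
}

/// Format one statemap-friendly log line: timestamp, node id, event, state.
fn state_log_line(node: &str, event: &str, state: NodeState) -> String {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before epoch")
        .as_nanos();
    format!("{nanos} node={node} {event}_state={state}")
}

fn main() {
    // A node entering then leaving DKG would emit a pair of lines like:
    println!("{}", state_log_line("node-7", "enter", NodeState::Dkg));
    println!("{}", state_log_line("node-7", "exit", NodeState::Dkg));
}
```

The key property is that enter and exit lines carry the same node id and state name, so the processing step can match them into intervals.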
Each row corresponds to a node, with time on the x axis. The rectangles on each row correspond to the state that node was in during that time interval.
Each state is assigned a colour:
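The processing step that turns those paired log events into rows of rectangles can be sketched roughly like this. This is an assumption about the shape of the pipeline, not the actual tooling in the repo: matching enter/exit pairs per (node, state) yields one interval per rectangle.

```rust
// Illustrative sketch (not the real statemap tooling): pair up enter/exit
// log events into closed per-node intervals — the "rectangles" drawn on
// each row of the statemap.

use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct Interval {
    node: String,
    state: String,
    start_ns: u128,
    end_ns: u128,
}

/// Events are (timestamp, node, state, is_enter). An exit event closes the
/// matching open enter event for the same node and state.
fn intervals(events: &[(u128, &str, &str, bool)]) -> Vec<Interval> {
    let mut open: HashMap<(&str, &str), u128> = HashMap::new();
    let mut out = Vec::new();
    for &(ts, node, state, is_enter) in events {
        if is_enter {
            open.insert((node, state), ts);
        } else if let Some(start_ns) = open.remove(&(node, state)) {
            out.push(Interval {
                node: node.to_string(),
                state: state.to_string(),
                start_ns,
                end_ns: ts,
            });
        }
    }
    out
}

fn main() {
    let events = [
        (100, "node-1", "Membership", true),
        (250, "node-1", "Membership", false),
        (250, "node-1", "Dkg", true),
        (900, "node-1", "Dkg", false),
    ];
    for iv in intervals(&events) {
        println!("{} in {} for {}ns", iv.node, iv.state, iv.end_ns - iv.start_ns);
    }
}
```

Each interval then only needs the state's assigned colour to become a rectangle on that node's row.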
By analysing this statemap we can begin to understand what happened. Let's label the various phases and talk through them.
- We can see the map starts with 6 Elders voting on membership (salmon)
- After membership completes, they immediately kick off DKG (orange). This should be a hint that there will be a change of Elders.
- Meanwhile, we see that a 7th node comes online. It receives an AntiEntropy (light blue) update letting it know that it's been accepted into the network, and then joins in on the DKG (orange). This suggests the new node is being promoted to Elder, and that this is why the original 6 Elders started DKG.
- Now we see DKG has stalled. This is because DKG requires total participation to complete: the existing 6 nodes have all contributed their parts, but they need the 7th node to put in its share to complete the section key.
- Eventually the 7th node catches up and DKG completes. The next step is for the old 6 Elders to verify that the new section key is valid and to Handover (dark blue) control to the new 7 Elders.
- After Handover completes, we see a burst of Anti-Entropy being sent out, presumably with the new SAP, showing that the new elders have taken control of the section.
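The stall in the walkthrough above comes from DKG's all-or-nothing completion rule. Here's a toy sketch of that rule (the types and names are hypothetical, not the real DKG implementation): a session only yields a section key once every expected participant has contributed a share.

```rust
// Toy model of why DKG stalls without total participation (hypothetical
// types, not the real sn_node DKG code): the session is complete only
// when every expected participant has contributed a share.

use std::collections::BTreeSet;

struct DkgSession {
    expected: BTreeSet<String>, // all nodes that must contribute
    received: BTreeSet<String>, // shares seen so far
}

impl DkgSession {
    fn contribute(&mut self, node: &str) {
        if self.expected.contains(node) {
            self.received.insert(node.to_string());
        }
    }

    /// DKG completes only once every expected participant has contributed.
    fn is_complete(&self) -> bool {
        self.received == self.expected
    }
}

fn main() {
    let expected: BTreeSet<String> = (1..=7).map(|i| format!("node-{i}")).collect();
    let mut session = DkgSession { expected, received: BTreeSet::new() };

    // The first 6 nodes contribute: DKG stalls, waiting on node-7.
    for i in 1..=6 {
        session.contribute(&format!("node-{i}"));
    }
    assert!(!session.is_complete());

    // Once node-7 catches up and contributes, the section key can complete.
    session.contribute("node-7");
    assert!(session.is_complete());
}
```

This mirrors what the statemap shows: six nodes sitting in the DKG state until the seventh's share arrives.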
We've developed a bit of tooling around these statemaps; the Safe Network README has instructions for generating your own.
To ease development, we've also configured CI to automatically generate and upload statemaps for each PR.
We hope you find these maps enlightening, happy spelunking!
Feel free to reply below with links to translations of this dev update and moderators will add them here:
As an open source project, we’re always looking for feedback, comments and community contributions - so don’t be shy, join in and let’s create the Safe Network together!