Launch of a community safe network


#202

Nah that wasnt what I was trying to indicate.

I was meaning this Node 3a879f(the one whose logs we’re viewing) is realising it is not responsible for a chunk/account its currently keeping track of which is what we do not expect cos of a NodeLoss, we’d expect that via NodeAdded as in someone new joined so we realise we’re no longer responsible for this acct/chunk anymore as the new node is closer than us to the concerned location. With NodeLoss, we shouldn’t loose responsibility, but potentially gain more due to the lack of the node that used to be online.

There are a couple things wrong/confusing in that log message tho. First up both MaidManager - log and DataManager - log use a similar log message and maybe if we look at the file log, the module name indicates a difference between the two, but console logs dont show the module name and the message is vague between MM/DM.

More importantly, looking at the code, the variable we’re logging is the node_name of the peer going offline in both cases rather than the chunk/acct(data_id.name()/name) we realised we’re not responsible for anymore. That probably explains why we’re seeing “spam” of that message when it occurs cos there are probably few acct/chunks but we don’t log that info but just the node_name which is the same in that loop anyways. Also probably what confuses any1 to make that connection to this being about the lost node than about ourselves.


#203

I think it is the contrary: In logs of node A, message “Moved out of close group of B in a NodeLost event.” means that node A observes that node B is not responsible of a chunk anymore.

I have added logs in one my nodes one day ago, and if I select messages containing “Dropped” or “Moved out” strings I get the following results:

I 19-01-22 18:36:34.781338 Node(bace63..(1)) Dropped b4777e.. from the routing table.
E 19-01-22 18:36:34.804900 Moved out of close group of b4777e.. in a NodeLost event.
...
E 19-01-22 18:36:34.806924 Moved out of close group of b4777e.. in a NodeLost event.

I 19-01-22 23:43:45.005518 Node(bace63..(1)) Dropped f05457.. from the routing table.
E 19-01-22 23:43:45.018251 Moved out of close group of f05457.. in a NodeLost event.
...
E 19-01-22 23:43:45.019292 Moved out of close group of f05457.. in a NodeLost event.

I 19-01-22 23:52:57.261371 Node(bace63..(1)) Dropped c826e8.. from the routing table.
E 19-01-22 23:52:57.271921 Moved out of close group of c826e8.. in a NodeLost event.
...
E 19-01-22 23:52:57.273737 Moved out of close group of c826e8.. in a NodeLost event.

I 19-01-23 10:04:44.707602 Node(bace63..(1)) Dropped 145b96.. from the routing table.
E 19-01-23 10:04:44.715073 Moved out of close group of 145b96.. in a NodeLost event.
...
E 19-01-23 10:04:44.715969 Moved out of close group of 145b96.. in a NodeLost event.

I 19-01-23 18:10:13.946033 Node(bace63..(1)) Dropped 53c140.. from the routing table.
E 19-01-23 18:10:13.952792 Moved out of close group of 53c140.. in a NodeLost event.
...
E 19-01-23 18:10:13.953890 Moved out of close group of 53c140.. in a NodeLost event.

I 19-01-23 18:26:42.687379 Node(bace63..(1)) Dropped 147396.. from the routing table.
E 19-01-23 18:26:42.693290 Moved out of close group of 147396.. in a NodeLost event.
...
E 19-01-23 18:26:42.694316 Moved out of close group of 147396.. in a NodeLost event.

(with series of “Moved out …” messages abbreviated with ...)

My node ID is bace63 and “Moved out …” messages indicates that other nodes (like b4777e) are not responsible for some chunks anymore. It is clearly because these other nodes have been dropped because these messages come just after a “Dropped … from the routing table” message with the same other node ID.

Edit:
This is coherent with what you wrote:

So, IMO there is nothing to worry about and everything is normal.


#204

I have added instructions to get logs from a docker vault in the README.md file.

This can be done on a running container without stopping it. This is possible thanks to the yml file format which allows a dynamic logging configuration (and the reason of my fork of maidsafe_utilities).


#205

Is this related to this issue? Repeated logging of NodeLost event


#206

Yep I’d very much think so, as you’d mentioned in that issue what we’re currently logging there isnt containing the unique info between those 44 occurrences you’ve mentioned which if it included the name/data_id.name() accordingly in MM/DM should have made those unique from what I can see in the code I linked in my previous post.

hmm k let me try a different route to explain my prev post :slight_smile:

Just posting inline the linked portion from the “MaidManager - log” from my previous post.

pub fn handle_node_lost(
        &mut self,
        routing_node: &mut RoutingNode,
        node_name: &XorName,
        routing_table: &RoutingTable<XorName>,
    ) -> Result<(), InternalError> {
        let mut refresh_list: HashMap<XorName, Vec<(XorName, Account)>> = HashMap::default();
        for (name, account) in &self.accounts {
            match routing_table.other_closest_names(name, self.group_size) {
                None => {
                    error!(
                        "Moved out of close group of {:?} in a NodeLost event.",
                        node_name
                    );
}
...

that’s the start portion of the handle_node_lost method. At this pt we’re handling a node lost so indicating that lost node no longer responsible for something in the system getting logged as a error log would make it strange(not saying we couldnt have had a bug for log intensity just doesnt look like it in this seq). node_name is the peer lost. we start iterating through the accounts we currently have(being the MM persona) and for each calculate via routing_table.other_closest_names(name, self.group_size) who the other grp_size-1 nodes are close to the given location(name) excluding us if we are in the close_grp size peers ourselves. It is expected to return None if we ourselves arent one of the grp_size nodes close to the provided location as detailed in the RT method: fn other_closest_names. The None arm then is what gets the corresponding log but its only listing the node_name at that point without name. That part is why I was indicating in my prev message it was about this node concluding it is self not responsible for something its managing in a NodeLoss event being an unexpected flow than about the lost peer.


#207

This happens with immutable data. When the network grows a node can lose the responsability of some data chunks but in the case of immutable data they are not deleted from its chunk store.

The aim is to keep them in case the network shrinks later and the node becomes responsible for them again. For mutable data this is not possible because the value can change in the meantime.

So, for me this branch can reveal an error, but only in the MD case.

I think it should include also the data type (MD/IMD) in order to check my hypothesis, or better only display an error in the MD case (and a clearer one).


#208

Oops, looks like I restarted my vault doing apt-get upgrade.

Anyway, I now have logs from my docker hosted vault :slight_smile:


#209

One Vault show this messages constantly…
Does anyone have the same problem?


#210

I don’t really agree, i think it’s impossible to guess how things will turn out in the wild yet.

If your home vault was actually paying you real money SafeCoin then I think you would work harder to make sure it stays online.

We can’t know how SAFE will turn out until it does


#211

Section 1 had only 8 nodes, meaning it was on the brink to trigger a merge. As section 0 had 11 nodes, I restarted one of my nodes (TFA--03) to move it from section 0 to 1 to balance the network.

I did it twice, because the first time it was recreated in the same section, which is strange because I thought there was a machinery to relocate new nodes to a less populated section.

Edit:

Now there are 9 vaults in each section, that’s not a lot. Few people are motivated to provide a vault and among those who are, some have not powerful enough node.

I think that resource proof tests are not hard enough because some vaults try for hours before succeeding to connect to the network. But they are not reliable enough and won’t last very long, but in the meantime, before the network takes the decision to disconnect them, I am not sure they do any good to the network.

There are VPS providers that propose cheap and very capable nodes. People should use them instead of trying to connect powerless node.


#212

Yeh that log needs to be better but the intent here wasn’t just an arbitrary check but more about detecting the new peer that potentially becomes responsible for a chunk we currently hold regardless of ID/MD due to the previous holder going offline and contacting the newly responsible peer accordingly for relocation. This log was just a conclusion of the routing table method essentially indicating self node itself not responsible for said chunk at this point.

So while its very possible there is a bug(I’d have to look at the flow closer to confirm), just at initial glance the fn our_chunks doesn’t seem to strip the unneeded ID cached chunks from the chunk_store.keys(). That potentially could cause it to include unneeded chunks in that iteration and thereby the log and so on. If that was the case, we’d need that to strip unneeded but include needed ID as well to facilitate relocation else that part would suffer.

That balance of passing a new node to another section isnt at a node count level. It’s at the Prefix length level. so just for example if there was a Prefix(00) aware of neighbour Prefix(1), then it’d pass a joining node to Pfx(1). Here they’re both of a single bit length and thus self accept.


#213

Thanks with a little help from @povilasb I have my vault running with UPnP (without port forwarding)! Though I still have the problem of not being able to connect to the alpha network with the vault in my local network…


#214

Perhaps a second tier community network should be initiated to see how well lower resource requirements fair? There would probably be a lot more participation if the upload speed was reduced to 1Mbs and multiple nodes per home network could be allowed to join.


#215

Beware this is a very complex piece of code and if I understand correctly unneeded chunks are only stripped when the vault runs out of space (in fn clean_chunk_store).

What I am thinking to do is to modify my safe_vault fork to change the error message in the None arm of handle_node_lost:

  • Change it to a debug message in the immutable data case (because it can happen in normal conditions)
  • Keep the error level in mutable data case (because this is the only case where it would represent a real error)

Just to check my theory and that there is nothing wrong here.


#216

Adding unique info to MM and DM logs of this msg with the data id in DM should give the details about the situation yep if thats all you’re intending I’d also log if that chunk was in unneeded_chunks list. just as a sample:

In DM:

for data_id in self.our_chunks() {
    match routing_table.other_closest_names(data_id.name(), self.group_size) {
        None => match data_id {
            DataId::Immutable(idata_id) if self.cache.is_in_unneeded(&idata_id) => {
                debug!("Not in close group of unneeded chunk {:?}.", data_id);
            }
            _ => {
                error!(
                    "Moved out of close group of {:?} in a NodeLost event.",
                    data_id
                );
            }
        },

In MM:

for (name, account) in &self.accounts {
    match routing_table.other_closest_names(name, self.group_size) {
        None => {
            error!(
                "Moved out of close group of account {:?} in a NodeLost event.",
                name
            );
        }

Just note when you say it can happen in normal conditions for ID, that’s not always the case as cases outside unneeded chunks of ID is similar to MD and not normal if it ever occurs. As in a chunk which isnt unneeded according to self.our_chunks which doesn’t report self node according to RT other_closest_peers triggering that behavior.

As for how I’d prolly patch it isnt via clean_chunk_store ofc, I didnt mean stripped as in get rid of from chunk_store as yeh that’d ruin the purpose of it :slight_smile: . I mean in the reported list thats iterated to check for the new node within handle_node_lost, there I’d be filtering the unneeded chunks from the current list of self.our_chunks. Thats more for a later fix that I was indicating as yeh I’d have to check if that was intentional for any relocation patches we added even if that said node wouldnt make the valid DM auth for said chunks anymore, as a side point note pruning chunks(cache/store) is based on chunk store full in a loose way as in it accounts for the MAX_FULL_PERCENT currently at 50%.

EDIT: Made some additions of code and also separate the unneeded chunk log to being debug, just as a sample.


#217

I have updated my safe_vault fork to improve error displayed during a NodeLost event ("Moved out of close group of {:?} in a NodeLost event."). And more importantly, it becomes a simple debug message in case of an immutable data stored in unneeded cache.

This aim is to get rid of the only error message that I came across in my logs.

I have also updated the docker image to generate a log file by default and I am about to launch a general upgrade of all the docker vaults with a service update. The new service needs an existing file at /var/log/safe_vault.log.

@happybeing, please issue following command to create this file:


touch /var/log/safe_vault.log

All docker vaults will be updated and restarted automatically with an interval of 12 mn between each, meaning that the whole migration process will last about 2 hours. Ofc, non-docker vaults won’t be updated.

Edit:

Migration has just started.

I have also updated instructions to start a docker vault and to get the logs from it.

@happybeing, don’t worry about your current vault. It should continue running until you create /var/log/safe_vault.log. At that time your vault will be put in the pending queue of vaults to be updated.

Edit 2:

Migration has ended a few hours ago.

Now, all messages in None arm of handle_node_lost are debug messages concerning immutable data present in unneeded chunk list. As there has been a lot of movements (9 vaults stopped and restarted) this a good indication that everything is fine here.

I will also restart manually some vaults to spread my vaults evenly in the sections.

@happybeing, I am sorry I have been too optimistic because your vault has been stopped when it was its turn to be updated. I suppose your vault will restart when you create a /var/log/safe_vault.log file.

Edit 3:

That’s all for now.

What is amazing is that the safe sites still work after all these movements (almost half of the nodes were restarted, some of them several times).


#218

The pain and suffering when they cut the internet connection for maintenance and you have to restart the vault which has been running for so many week.


#219

Yes, but in safe network restarting a vault is done without interrupting the service, without even the faintest glitch for users.

Myself, I don’t hesitate to restart my vaults when needed (of course not all at once, because they are an important part of the network, but one by one isn’t a problem).


#220

That is an good point. Worst thing was the loss of the achivement from all the work the vault had been doing, it was so nice to watch the vault handle such many requests succesfully over many weeks and watch it progress.

But a very good point to not be afraid to restart a vault, then you can update and try different settings with firewalls and such without being afraid. Compared to when I tried mining, several weeks without downtime is a huge success and the raspberry pi is so little and quiet that you totaly forget it is doing alot of work, compared to all system fans on max, when mining.


#221

Demonstration of a docker vault creation in 10 mn (I write slowly).
This suppose to have an account on Hetzner and to have defined an API key.

The demonstration is longer than 10 mn, because we watch the logs during several minutes
to observe the resource proof completion.

Note: This session created TEST--00 vault, it is the one that triggered the recent split
(but I have deleted it since then).

asciicast