What are the chances for data loss?

I made some quick calculations, but my math skills are terrible, so please correct me if I’m a complete idiot :sweat_smile: If we have a 1% chance of losing any one of the minimum 4 copies within the time frame needed to detect and repair such losses, then we have a 1/100,000,000 chance of losing all 4 copies of a given block, which is really small. However, if we have 100,000 files, each 1 GB in size (1024 blocks each), then there is only about a 36% chance that no file gets lost over any such time frame (i.e. that every block keeps at least one copy.)

(1 - (0.01^4))^(100000*1024) = 0.3591…
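A quick way to sanity-check that arithmetic, assuming the same 1% per-copy loss rate and independent losses (a sketch of the calculation, not a model of the real network):

```python
# Probability that no chunk loses all 4 of its copies within one
# detection window, assuming each copy is lost independently with p = 1%.
p_copy_loss = 0.01          # assumed per-copy loss rate (the 1% guess)
copies = 4                  # minimum replication factor
n_chunks = 100_000 * 1024   # 100,000 files x 1024 one-MB blocks each

p_chunk_lost = p_copy_loss ** copies           # all 4 copies gone: 1e-8
p_all_survive = (1 - p_chunk_lost) ** n_chunks

print(f"{p_all_survive:.3f}")  # prints 0.359, i.e. only a ~36% chance of no loss
```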

The questions:

  1. Is this the correct way to calculate it?
  2. What are the actual chances for losing a single copy?
  3. How quickly would lost blocks be discovered?

The first question I have is: Where did you get the 1%?

BTW - what you’re calling “blocks” are referred to as “chunks”. Like I tell everyone else: as you go along, don’t hesitate to correct any nomenclature in technical subjects, in order to deter misunderstandings.

I don’t know about these calculations; it’s quite hard to calculate the chance that 4 copies are gone at once, but it’s good to think about. Safenet uses non-persistent Vaults. So if a very big datacentre in the US were running 40% of all the Vaults in the network (we would never know if they did!), there might be a chance that all 4 copies of a chunk of a picture of your dog are stored at that place. Now they go broke, or have a big fire, and the power is turned off all at once. Is your picture gone? In that case I think it is. But we would see at least 3 chunks for each file, and each of them stored 4 times (or down to 3 now?), so that would be 12 copies. How big is the chance that all of them are stored in one place on the network?

Under seconds. Vaults check each other. They send some random bytes to a Vault that stores the chunk, ask it to append them to the chunk and calculate the hash. It’s a sort of challenge, so they know quite quickly when a Vault claims it holds a chunk but actually doesn’t.
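The challenge described above can be sketched roughly like this (my own toy version with made-up data; the real vault protocol surely differs in the details):

```python
import hashlib
import os

def issue_challenge() -> bytes:
    """The checker picks some random bytes to append to the chunk."""
    return os.urandom(32)

def prove_storage(chunk: bytes, nonce: bytes) -> str:
    """A vault that really holds the chunk can hash chunk + nonce."""
    return hashlib.sha256(chunk + nonce).hexdigest()

def verify(own_copy: bytes, nonce: bytes, answer: str) -> bool:
    """The checker holds its own copy, so it can recompute the hash."""
    return prove_storage(own_copy, nonce) == answer

chunk = b"stand-in for a real 1 MB chunk"
nonce = issue_challenge()

honest = prove_storage(chunk, nonce)
assert verify(chunk, nonce, honest)        # vault really has the chunk

liar = hashlib.sha256(nonce).hexdigest()   # a vault without the chunk can only guess
assert not verify(chunk, nonce, liar)
```

Because the nonce is fresh every time, a vault can’t cache old answers; it has to hold the actual chunk to respond correctly.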

Arbitrary guess so I can come up with a starting point to ask about; it may be much less than that.

I stand corrected :grin:

Where do the 3 chunks come from? A chunk can store 1 MB; my chihuahua takes up less space than that, so I guess there must be another reason for slicing her in three.

Are the 3 chunks independent? Because that would not add to the robustness anyway; quite the opposite: if any of the 3 gets lost, the whole thing is lost. If they are in a 2-of-3 kind of FEC setup, that’s very different.
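The difference matters a lot. A toy illustration, with a made-up 99% per-chunk survival rate (not a real network figure):

```python
# Survival of a file split into 3 chunks, per-chunk survival s = 0.99.
s = 0.99

# Independent chunks: lose any one chunk and the whole file is gone.
p_plain = s ** 3

# 2-of-3 erasure coding: the file survives if at least 2 chunks do.
p_fec = s**3 + 3 * s**2 * (1 - s)

print(round(p_plain, 6))  # 0.970299 -> worse than storing a single copy
print(round(p_fec, 6))    # 0.999702 -> the redundancy actually helps
```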

I’m not sure we’re talking about the same thing; even if vaults keep checking on each other continuously, going through all previously stored chunks (“resilvering” I think it’s called), seconds seem an unrealistic target if a vault stores more than a handful of chunks.

The network is designed with transiency in mind: vaults are expected to go away, come back, disappear forever, be born from nothing, etc. When a vault is gone, a bunch of chunks go with it. Some time goes by, and then the vaults that stored common chunks with it start to discover the loss, so they send copies of that data somewhere else to restore the 4-fold redundancy.

My question is about that period between losing a chunk and discovering it was lost. That period means there are always a bunch of chunks with only 3 copies, far fewer with only 2 copies, and a very few unlucky ones with just 1 copy. If we wait long enough, some will get lost altogether. With a 1% rate for such limbo chunks and about a billion chunks, we have a 74% chance to lose all 4 copies of one of them during the period between when a vault goes offline and when it’s discovered. Or so I think – hence the three questions in the original post.

Getting the actual percentage would be an exercise in the actuarial sciences - having to figure in human error, fires, power loss and the like - similar to the way risk is calculated in insurance. So it’s a number that is hard to calculate as well as ever-changing; however, I do think that we will find it’s well below 1%.

Something that you may not be aware of is “Sacrificial Data”. While the Network is tasked with maintaining a minimum of 4(?) copies of any given chunk, there very well may be many more than that stored on the Network.

Sacrificial Data was created in order to measure the total available space of the Network.

This should drive that 1% number down dramatically. So let’s see what we have in your hypothetical so far:

  • 100,000 files, each 1GB big = 102,400,000 Chunks [1]
  • 102,400,000 Chunks, each stored twice for Primary Chunks = 204,800,000 Primary Chunks
  • 102,400,000 Chunks, each stored twice for Secondary Chunks = 204,800,000 Secondary Chunks
  • Primary Chunks + Secondary Chunks = 409,600,000 Non-Sacrificial Data Chunks
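Spelled out as code (same numbers as the bullets above):

```python
CHUNKS_PER_GB = 1024               # 1 GB = 1024 one-MB chunks
files = 100_000

chunks = files * CHUNKS_PER_GB                # 102,400,000 unique chunks
primary = chunks * 2                          # 204,800,000 primary copies
secondary = chunks * 2                        # 204,800,000 secondary copies
non_sacrificial = primary + secondary         # 409,600,000 copies in total

assert chunks == 102_400_000
assert non_sacrificial == 409_600_000
```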

But we’re missing something.

The amount of excess space in the Network will determine the amount of Sacrificial Chunks that are out there. Calculating a probability without considering the change in this value inherent to the system would be foolish. So we’ll have to calculate everything three times - for the two extremes, as well as the mean.

  1. Sacrificial Data = Chunks * 2 (optimal - Network is balanced)
  2. Sacrificial Data = Chunks * 1 (mean/average - Network needs more space, but has enough to add more data)
  3. Sacrificial Data = Chunks * 0 (disastrous - Network is full)

Would you please provide me the totals of all Chunks in existence in each of those three scenarios while I start working on the next bit?

[1] Keep in mind that since the files are all 1GB big, we don’t have to worry about splitting a 1MB file up into 3 x 1MB chunks - achieved by padding. We can investigate this later if we wish.

P.S. Nice link to WolframAlpha. That’s a great way to show computations.


Thanks, and that’s a very informative thread as well.

Yes, I expect that number would be less than 1% (please see my prev response about what that 1% is really about), but I also expect we will have vastly more than just a billion chunks. Common sense can be misleading when working with numbers so outside of everyday scales, that’s why I started this thread.

Yes, I was a bit behind on how things will work; if there will be 6 copies even for unpopular data (when the network is healthy, that is) then things look pretty bright.

Not quite:

The triggers from the Chunk Info Holder [DataManager] to the Chunk Holder [Vault] are time-based and will initially start at 2 minutes, doubling every time up to 20 hours. Any failure will reset the schedule.
Autonomous Network - David Irvine et al.

So here we have another spectrum that we have to take into account. However, the average doesn’t mean much to us this time, because the point at which the data is lost does not change. So for the time interval we can specify a minimum and a maximum to use in our equations:

  • Minimum = 2 minutes
  • Maximum = 20 hours

Malicious actors will tend towards the maximum, while legitimate - but temporary - vaults will tend towards the minimum.
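Assuming the quote means “start at 2 minutes, double after each successful check, cap at 20 hours, reset on failure” (my reading, not confirmed against the actual code), the schedule could be sketched as:

```python
def check_schedule(start_min: int = 2, cap_min: int = 20 * 60):
    """Yield successive check intervals in minutes: 2, 4, 8, ... capped at 20 h.

    A failure would reset the schedule (i.e. create a fresh generator).
    """
    interval = start_min
    while True:
        yield min(interval, cap_min)
        interval *= 2

sched = check_schedule()
intervals = [next(sched) for _ in range(12)]
print(intervals)
# [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 1200, 1200]
```

So after roughly ten clean checks a vault is only being probed about once per 20 hours, which is the window the malicious-actor comment is about.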

EDIT: tagging you @Tim87 so you can see this

EDIT2: Just realized that in the time between checks, it’s kind of a “Schrödinger’s Vault” where we have to assume it’s both Up and Down. That’s gonna make it hard.


Now the next step that we need to consider is what is really being checked when a “Validity Check” is being performed:

  1. The integrity of the randomly chosen chunk
  2. The availability of the Vault

Since we’re not concerned with the chunk’s integrity with this line of inquiry, let’s focus on reason #2.

Considering that the Network has been designed with Churn in mind, it’s prudent to account for the average number of vaults that are handling our data in the first place, and how big they are on average. This is far and away the trickiest variable that we have yet encountered.

The magic number for security and stability of the Network has been put at 10,000 farmers running whatever number of vaults is pertinent for them. (If anyone could find that quote I’d be much obliged.) There could of course be more or less, so we’ve got another spectrum variable here - let’s look at the range:

  • 100 farmers (half the number of users on these forums)
  • 10,000 farmers (critical mass)
  • 1 million farmers (mass adoption)

Let’s assume that they all run one vault for the sake of simplicity. What must the average size of their donated space be in terms of num_of_farmers?

EDIT: Answer is total_chunks / num_of_farmers
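Plugging in the hypothetical totals from upthread (409,600,000 non-sacrificial copies at 1 MB each - assumed numbers, not measurements), the average donated space per vault works out as:

```python
total_chunks = 409_600_000      # non-sacrificial copies from the earlier tally

for farmers in (100, 10_000, 1_000_000):
    chunks_per_vault = total_chunks / farmers
    # one chunk = 1 MB, so chunks_per_vault is also megabytes per vault
    print(f"{farmers:>9} farmers -> {chunks_per_vault / 1024:.1f} GB per vault")
# 100 farmers need ~4000 GB each; 10,000 need ~40 GB; 1,000,000 only ~0.4 GB
```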

Doing further research rendered this question unsolvable due to the existence of “Schrödinger’s Vault” - where at any given time in between checks, it could be either up or down.

However, that doesn’t mean that we can’t estimate.

First of all, since the chunks are spread evenly, we can say that:

chunk_section = num_of_farmers / (4 + {0, 1, 2}), depending on how full the Network is.

Take one vault out of there that has your piece: 1 / chunk_section

Now that’s randomly selected. 1 / chunk_section is the probability that you will find your chunk in a random guess - or that a random node will go offline with your chunk.

Now in order to lose your chunk, that has to happen for all of the sections - and in each case it has to be the vault holding that same chunk. So:

P(lose_chunk) = (1/chunk_section) ^ (number_of_sections)

Where the number of sections is at least 4 (2x primary chunks, 2x secondary chunks) and at most 6 (plus 2x sacrificial chunks).

Now that’s at any given moment in time. Let’s use real numbers
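Using the hypothetical farmer counts from above (all inputs assumed, per the formula just given):

```python
def p_lose_chunk(farmers: int, sections: int) -> float:
    # chunk_section = farmers / sections; losing the chunk requires a
    # simultaneous hit in every section, each at probability 1 / chunk_section.
    chunk_section = farmers / sections
    return (1 / chunk_section) ** sections

for farmers in (100, 10_000, 1_000_000):
    for sections in (4, 6):   # 4 = primary + secondary, 6 = + sacrificial
        p = p_lose_chunk(farmers, sections)
        print(f"{farmers:>9} farmers, {sections} sections: {p:.2e}")
# even the worst case (100 farmers, 4 sections) is about 2.6e-6,
# so every scenario lands far below the 1% guess
```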

So I think it’s SAFE to say that it’s quite a bit smaller than 1%.


However, this does not, and cannot, take time into account without further study and behavior analysis.

But the effects can be mitigated by placing the chunks strategically: secondary chunks on vaults with shorter checking intervals, and primary chunks on vaults with longer ones.

That way you always have at least one vault holding your piece that is checked closer to once every two minutes than once every 20 hours.

Just my $.02


Your linked article is 6 years old, and since then it has been stated multiple times that it is about 20 milliseconds :slightly_smiling: .

From: http://techcrunch.com/2014/07/23/maidsafe/

“Our network knows within 20 milliseconds if the status of a piece of data or a node has changed. It has to happen that fast because if you turn your computer off the network has to recreate that chunk on another node on the network to maintain four copies at all time.”

20ms is also known as “Network speed”.

Are you telling me that the DataManagers are in constant contact with the storage_nodes? Have we agreed that the traffic generated by that constant contact is worth the instant response of DataManagers to a churn event?

Does the same go for data integrity checks? Now that seems ridiculous…

So is churn pre-emptive, although going offline isn’t punished in terms of the vault’s reputation?
That would be strange.

I’ve read somewhere that 3 chunks is the minimum. Can’t really find it again.

Number of chunks is a setting and depends on implementation, you may
wish a max number of chunks, or maximum chunk size, this decision and
code is left to the reader

Looking for the video. I remember David saying this in his video presentation. He compared it to BitTorrent and its DHT, where this process is very slow, and said that Safenet will create more overhead but is very fast at recognizing nodes/data being offline.

EDIT: Start at min. 27. (although the whole video is great for all these details and the concept of ownership etc.) David talks about milliseconds to seconds. Depending if it was a graceful log-off or a churn-event like a crashing PC etc.


For every vault there are at least 3 other vaults holding the data online, and 16 dead copies are kept offline - at least that was the case a year ago, if nothing has changed…


If we get an ungraceful logout […] we’ll find out in seconds

HOW?!

I see! If the time to discover a failed node by the interested parties is in the range of seconds, then there’s simply no time for another storage node to accidentally fail during that short period of time.

Is there a way to list the storage nodes for a given chunk? Let’s say there’s a zero-day exploit (e.g. a kernel bug) that could be used to destroy data on those servers.

I looked at David’s video. I believe those nodes are in constant communication with each other; no response means something’s wrong. If small groups of nodes keep tabs on each other, they can relay to the wider network when any of them goes down.


That would indeed seem to be the case.