What are the chances for data loss?

Oops, I guess I wasn’t paying attention :scream_cat:

I exchanged a couple of emails with David years ago, just after I first heard about Lifestuff. I was thinking that an N-of-M kind of setup could give much more resilience for the same storage space. For example, an 8-out-of-12 strategy (splitting a chunk into 8 slices, then generating 4 more parity slices) would require only 50% extra space, yet it would allow any 4 of the 12 slices to be lost with the chunk still available. In other words, we get the same resilience as if we stored 5 whole copies, but for the cost of just half an extra copy.
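For the curious, here is the arithmetic behind those numbers as a quick Python sketch (nothing SAFE-specific; the 1 MB chunk and the 8/12 split are just the example figures from above):

    chunk_mb = 1.0
    k, m = 8, 12                        # 8 data slices + 4 parity slices

    slice_mb = chunk_mb / k             # 0.125 MB (128KB) per slice
    stored_mb = m * slice_mb            # 1.5 MB stored per 1 MB chunk
    extra_space = stored_mb / chunk_mb - 1
    tolerated_losses = m - k            # any 4 of the 12 slices can disappear

    copies_for_same_tolerance = tolerated_losses + 1   # 5 whole copies also survive 4 losses
    print(extra_space)                                 # 0.5 -> 50% extra space
    print(stored_mb, copies_for_same_tolerance * chunk_mb)   # 1.5 MB vs 5.0 MB stored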

I can’t exactly remember what the problem was at the time, or if that same problem still applies today (almost everything has changed since.) All the slicing-dicing would be done with already encrypted, opaque blobs, so it would be just a matter of organizing how they are stored; I can’t think of anything security related that could go wrong.

One truly serious complication (unless/until somebody comes up with more) is that the validity of the slices could no longer be checked as simply as now: send a nonce, get a checksum, done. We could no longer do that, though there may exist some fun math to do something similar. I wonder if the enormous savings on storage might still justify giving it a shot, eventually.

Otherwise, it’s not exactly super complicated, though it would make things a little slower (I doubt it would be noticeable, to be honest.)

We would need to store a two-way mapping between chunks and their slices (a small sketch follows the list):

  • When we need a chunk, we need to know which slices to look for, and then we need to put it together from (some of) them.
  • If a slice is lost, we need to know which chunk to get and put together, so we can re-generate the lost slice.
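A minimal sketch of that two-way directory in Python (the class and method names are made up, and the in-memory dicts are purely illustrative; on the real network this would be a replicated, persistent structure):

    from typing import Dict, List

    class SliceDirectory:
        def __init__(self) -> None:
            self.chunk_to_slices: Dict[bytes, List[bytes]] = {}   # chunk hash -> ordered slice hashes
            self.slice_to_chunk: Dict[bytes, bytes] = {}          # slice hash -> owning chunk hash

        def register(self, chunk_hash: bytes, slice_hashes: List[bytes]) -> None:
            self.chunk_to_slices[chunk_hash] = list(slice_hashes)
            for h in slice_hashes:
                self.slice_to_chunk[h] = chunk_hash

        def slices_for(self, chunk_hash: bytes) -> List[bytes]:
            # "we need a chunk: which slices do we look for?"
            return self.chunk_to_slices[chunk_hash]

        def chunk_for(self, slice_hash: bytes) -> bytes:
            # "a slice is lost: which chunk do we rebuild to regenerate it?"
            return self.slice_to_chunk[slice_hash]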

I can’t speak to the mathematical probability of it being safe, but I used to use Freenet back in the day before it became a pedo haven, and files would constantly become unreachable and unrecoverable. Just like MaidSafe, it was designed to break up files into chunks and spread redundant copies all over the place, but it seemed to fail at that unless a file was really popular, which I guess is why it only served the pedos. Uploads had a shelf life of 2-3 months. Literally all the network had to do was lose 1 chunk of a 3428349823-chunk file and the whole thing was unrecoverable.

I don’t know if MaidSafe stores stuff permanently or only keeps it based on request frequency, but hopefully they manage to keep stuff alive.

It certainly would, but then so does the number of copies.

The problem I have found with N-of-M is that as the file grows, the overhead multiplies.

For instance, let’s examine the relative times for reconstruction of a chunk. (This is my experience from over a decade of using such systems to reconstruct files stored on a medium that only reliably stores 95-99.9% of the time.)

Say for a 1MB file the time to reconstruct one lost block is 1 unit of time;
for a 10MB file this time is around 15 units;
for a 100MB file this time is around 300 units;
for a 1GB file this time is around 7500 units;
for a 10GB file this time can be enormous, depending on the PC’s memory size.

VERY importantly, N-of-M requires the whole file’s worth of chunks to be downloaded to reconstruct EVERY file, because of the way N-of-M builds the M chunks (blocks).

SAFE can retrieve individual chunks without needing to read the whole file. One random chunk requires only 3 chunks to be d/l to decrypt it. Streaming a portion of a file only requires 3 for the first chunk, then one chunk at a time as needed.

Imagine a static database file (say a novel with vids embedded etc., 100GB in size) that has index files (chapter/page), and you want to start reading from chapter whatever. Then you need to download the whole 100GB plus the index files to do that, and have the spare local disk space for reconstruction (about 8 hours on my 100Mbit/s download).

On SAFE that is a second or two (6 chunks to start: 3 for the DB, 3 for the index), and it can use RAM buffers to hold the chunks.

N-of-M also requires disk space to store the file during reconstruction, which means you leave a potential security flaw for the next person using that shared PC to reconstruct personal info from the deleted files. SAFE doesn’t need to store the file on disk at all, temporarily or otherwise.

And that’s where it stays. I’m talking about redundancy at the chunk level: individual chunks, once encrypted etc., are sliced up, parity slices are added, and the slices are stored away. This process wouldn’t know anything about what happens above the chunk level.

Then if you lose all copies of the chunk, you lose the rebuilding info as well. N-of-M loses its appeal when only done per chunk.

SAFE has error checking on each chunk stored, so if there is an error it retrieves another copy of the chunk.

I’m not sure what you mean, and I’m kinda sure you misunderstood what I meant, so let me try again.

storage:

  • we have a random 1 MB chunk (we don’t know and we don’t care if, or what, it is a part of)
  • we make 8 x 128KB slices out of it
  • we generate 4 x 128KB extra slices according to some popular FEC algorithm
  • we put the chunk and its slices on our global, two-way directory that maps them to each other
  • we store the slices based on their hash (just as chunks are stored with the current system)
  • done
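To make the slicing/parity step concrete, here is a rough Python sketch of the “FEC magic”: a simple systematic Reed-Solomon construction over GF(2^8) built with Lagrange interpolation. All names are made up, it is far too slow for real 1 MB chunks, and a real system would use an optimized FEC library; it is only here to show that any 8 of the 12 slices suffice.

    K, M = 8, 12                      # 8 data slices, 4 parity slices

    # exp/log tables for GF(2^8) with the common 0x11d polynomial
    EXP, LOG = [0] * 510, [0] * 256
    _x = 1
    for _i in range(255):
        EXP[_i], LOG[_x] = _x, _i
        _x <<= 1
        if _x & 0x100:
            _x ^= 0x11d
    for _i in range(255, 510):
        EXP[_i] = EXP[_i - 255]

    def gf_mul(a, b):
        return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

    def gf_div(a, b):
        return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]

    def lagrange_coeffs(xs, x):
        """Weights c_i such that p(x) = sum_i c_i * p(xs[i]) for any polynomial of degree < len(xs)."""
        coeffs = []
        for i, xi in enumerate(xs):
            num = den = 1
            for j, xj in enumerate(xs):
                if i != j:
                    num = gf_mul(num, x ^ xj)    # subtraction is XOR in GF(2^8)
                    den = gf_mul(den, xi ^ xj)
            coeffs.append(gf_div(num, den))
        return coeffs

    def make_slices(chunk: bytes):
        """Split a chunk into K data slices and append M-K parity slices (slice i = p(i))."""
        assert len(chunk) % K == 0, "pad the chunk to a multiple of K first"
        size = len(chunk) // K
        data = [chunk[i * size:(i + 1) * size] for i in range(K)]
        slices = list(data)                      # slices 0..K-1 are the plain data
        for x in range(K, M):                    # slices K..M-1 are parity
            c = lagrange_coeffs(list(range(K)), x)
            parity = bytearray(size)
            for pos in range(size):
                b = 0
                for i in range(K):
                    b ^= gf_mul(c[i], data[i][pos])
                parity[pos] = b
            slices.append(bytes(parity))
        return slices

Each slice would then be stored under its own hash, and the chunk-to-slice relationship recorded in the directory sketched earlier.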

retrieval:

  • we have a map for a file; we look up its first chunk
  • we look up the slices that belong to that chunk
  • we pick 8 of them at random, and get them
  • we put them back together into the original chunk
  • we go get the remaining chunks (one by one, independently) the same way
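Continuing the sketch above (same GF(2^8) helpers and make_slices), rebuilding a chunk from any 8 of its 12 slices could look like this; note that if the 8 original data slices all survive, the work really is just concatenation:

    def rebuild_chunk(got):
        """Rebuild the chunk from any K slices. `got` maps slice index -> slice bytes."""
        assert len(got) >= K, "need at least K of the M slices"
        xs = sorted(got)[:K]
        size = len(got[xs[0]])
        parts = []
        for target in range(K):              # we want the data slices p(0)..p(K-1)
            if target in got:                # original data slice survived: use it as-is
                parts.append(bytes(got[target]))
                continue
            c = lagrange_coeffs(xs, target)  # otherwise interpolate it from whatever we have
            out = bytearray(size)
            for pos in range(size):
                b = 0
                for i, xi in enumerate(xs):
                    b ^= gf_mul(c[i], got[xi][pos])
                out[pos] = b
            parts.append(bytes(out))
        return b"".join(parts)

    # e.g. lose any 4 slices and the chunk still comes back:
    chunk = bytes(range(256)) * 32                      # an 8 KB stand-in for a 1 MB chunk
    slices = make_slices(chunk)
    survivors = {i: s for i, s in enumerate(slices) if i not in (1, 3, 5, 10)}
    assert rebuild_chunk(survivors) == chunk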

restoring a lost slice:

  • a slice is reported lost
  • we look up which chunk it belonged to (from the directory mentioned above)
  • we retrieve it and we put it back together, as described above
  • we generate the missing slice using our FEC magic
  • we’re happy because the balance of the universe is restored
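And the repair step falls out of the two sketches above almost for free (again, only an illustration):

    def regenerate_slice(got, lost_index):
        """Recreate one lost slice from any K surviving slices of the same chunk."""
        chunk = rebuild_chunk(got)              # put the chunk back together first...
        return make_slices(chunk)[lost_index]   # ...then re-derive just the missing slice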

verifying if a server indeed has the slice it says it has:

  • no idea whatsoever; may be a deal-breaker :scream_cat:

EDIT: The appeal would be in the storage savings that could be achieved without compromising on redundancy. I didn’t talk about trying to restore a document from an incomplete number of chunks; I believe that kinda thing belongs more to the application layer. I was sticking to what’s already being done: storing chunks in a safe and redundant way.

EDIT 2: Think of it as RAID, but the drives are spread across the globe. The chunks are logical sectors. The slices are the physical sectors on the different disks.

So are you saying each stored element is now 128KB instead of 1MB, and there are now 12 stored elements (using “element” for want of a better word)?

I can see that is more workable than N-of-M over the whole file. And I agree that it is a better way of securing data at the storage level.

So let’s examine this against the current SAFE model:

SAFE model                    N-of-M on a 1MB element of a file
4-6 x 1MB chunks stored       12 x 128KB elements stored
1MB d/l per MB                1MB d/l per MB, or more on error/delay
1 access per MB               8 accesses per MB
No additional processing      Additional processing per MB
  (0 units of time)             (8-12 units of time)
Works on very small devices   Requires a device with more processing
I think the 2 most important issues from the above are the additional processing (& time) and the number of accesses required.

NOTE: IoT devices/pads/phones will number in the hundreds of billions very soon, while the number of PCs is growing slowly. Limiting IoT devices could mean only 10% are capable rather than 33%.

Additional processing (& time) is extremely important because it limits the number of devices capable of running SAFE. At this time it should be possible for many modern IoT devices to run SAFE (even if only running SAFE and serving other IoT devices in their group). This means they are both capable of running the S/W and of doing it in a reasonable time frame.

But N-of-M requires significant processing and increases the S/W footprint and the time required. So fewer devices will even be capable, and of those, fewer will run it fast enough.

Number of accesses is extremely important because of the network traffic produced on each access. If you were just requesting data from a server, it would be a 1-to-1 access metric. But SAFE requires a coordinated effort with many nodes to do each transfer (be it a 4k chunk or a 1MB chunk). N-of-M requires at least 8 times that. So when people try to determine whether the current internet can even handle the level of node traffic (access transactions), N-of-M would almost surely push it over the edge. As network traffic increases, so do the delays for each piece of traffic (packets do need retransmission at times, and so on).

Unrelated to data loss, one interesting consequence of this would be that it’d require SAFE to be divisible so that every vault can get a fraction of SAFE after serving a slice.

Not needed. The farming rate would be divided by 8 to account for this.

You have a one in (1/farming_rate) chance to get a coin for each GET. In other words, on average you get one coin for every (1/farming_rate) GETs.
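A quick sanity check of that argument, treating farming_rate as the per-GET probability of earning a coin (the numeric value below is made up purely for illustration):

    farming_rate = 1.0 / 500        # hypothetical: on average one coin per 500 GETs
    gets_per_mb_now = 1             # one 1 MB chunk fetched per MB today
    gets_per_mb_nofm = 8            # eight 128KB slices fetched per MB with 8-of-12

    coins_now = gets_per_mb_now * farming_rate
    coins_nofm = gets_per_mb_nofm * (farming_rate / 8)    # the rate divided by 8
    assert abs(coins_now - coins_nofm) < 1e-12            # expected reward per MB is unchanged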

Thanks for doing my job with that comparison :smiley_cat:

  • Yes, it’s about the cost of access vs. the cost of storage.
  • If nothing is lost, you can just pick the 8 initial slices, and “processing” is reduced to “put one after the other.” You can think of the 4 parity slices as “sacrificial data.”
  • Important problem: vault verification is not solved (I’m not saying it can’t be, but I have no idea how to do it.)

It may be worth considering that ridiculously high redundancies could be achieved relatively cheaply (e.g. a 4-of-12 scheme is still just 3 MB, and we can lose up to 8 slices of the same chunk), which means we’d need to do fewer checks. That’s fewer accesses (though I’m not sure whether it’s enough fewer to offset the higher cost of getting a full chunk.)

Also, we could do 1 MB slices of 8 MB chunks, and we’re at the same number of accesses. Granted, not everything is big enough for that, but you get the point. Raising the maximum chunk size to 8 MB would result in slices of max 1 MB.

To be honest, I brought up the N-of-M thing only because @Seneca mentioned his parity chunk idea. My initial reservations were about the resilience a mere 4 copies could provide. However, as the vaults storing those 4 chunks are very closely monitored, and action is taken within seconds, I have no reservations about that anymore.


No, I’m taking the probability that a particular chunk is NOT lost. To compute the probability of 2 independent events’ both occurring I need to multiply their probabilities. For 10^12 independent events, I have to multiply those probabilities that many times.

I don’t care about losing one particular chunk (almost zero chance even with a lousy setup); I care about robustness on the network level. I’m trying to compute the probability for “there is a chunk somewhere on the network that had all its copies lost as a result of this x % data loss.” I can’t readily explain why we need to go about it in such a roundabout way, but I believe it’s correct.

EDIT: Oops, I responded to the first sentence before reading the whole thing. Yes, that’s what I mean: the probability for somebody’s (anybody’s) data getting lost as the result of the 1% loss. For that, I first compute the probability that none of the 10^12 chunks had all their 5 copies lost.

You use power if you want probability of ALL.
You use multiply if you want probability of ANY

Thus you use the power to work out the probability that all copies of one chunk are in the 1% (0.01^5).
But you incorrectly use ^1E12, which works out the case where all chunks are not in the 1% at the same time.

But of course that is the wrong question. The question is: what is the probability that ANY chunk will have all its 5 copies in the 1%?

Thus

probability that a particular chunk will have all 5 copies in the 1% 
    = 0.01 ^ 5
Now have 1E12 particular chunks, so probability that ANY of the
chunks will have its 5 copies in the 1%
    = (0.01^5) * 1E12

Analysing your equation 1 - (1 - (0.01^5))^1E12

  • 0.01 = prob of a particular copy in the 1%
  • (0.01^5) = prob of a particular chunk having all its 5 copies in the 1%
  • (1-(0.01^5)) = prob of a particular chunk NOT having all its 5 copies in the 1%
  • (1-(0.01^5))^1E12 = prob of having ALL chunks NOT having all their 5 copies in the 1%
  • 1 - (1-(0.01^5))^1E12 = prob of NOT having ALL chunks NOT having all their 5 copies in the 1%.
    EDIT: removed incorrect statement

Raising to a power is just multiplication over and over again (as many times as there are independent events, which is 10^12 in our case.)

You can never multiply probabilities by the number of events; that’s always incorrect. (It’s easy to see: if you multiply probabilities by numbers greater than 1, you may end up with probabilities greater than 1, and that is inconsistent with the definition of probability.)

That’s the probability of ALL chunks having ALL of the copies in the 1% though.

You can’t just drop the two NOTs and expect to keep the meaning; you also need to change the ALL predicate to EXISTS / ANY: it’s the probability of having ANY chunk with all its 5 copies in the 1%. Which is what I meant to compute. (We’d need parentheses for these things to make sense tbh, and also to see what happens with the NOTs and all.)
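For what it’s worth, plugging the thread’s numbers into both approaches (Python; 1% simultaneous loss, 5 copies per chunk, 10^12 chunks, independence assumed throughout):

    import math

    q, copies, n = 0.01, 5, 10**12

    p_one_chunk = q ** copies                    # all 5 copies of one particular chunk lost: 1e-10
    # probability that ANY of the n chunks loses all its copies: 1 - (1 - q^5)^n, computed stably
    p_any = -math.expm1(n * math.log1p(-p_one_chunk))
    print(p_any)                                 # ~1.0: practically certain some chunk is gone

    # the "multiply by the number of chunks" shortcut gives p_one_chunk * n = 100,
    # which is greater than 1 and therefore cannot be a probability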


You’re right. Edited post for correctness
