What are the chances for data loss?

“Optimally loaded” is NOT “fully loaded.”

A single storage rack in Amazon’s data centers stores multiple petabytes. Let’s say they have only 1% free space. And that’s where I stop.

It’s called optimal because having it fully loaded isn’t desirable even if you can make some money from it.
There’s no point in arguing about this, we’ll see when SAFE gets launched. If you want to place a bet, let me know.

Individual, unstructured blocks with a maximum size of 1 MB that can be evicted at a second’s notice don’t have to add to the “fullness” of a drive, if the drive is organized with that in mind.

One could even modify a file system so it would lend its free blocks as opportunistic storage space for SAFE. Which is a pretty decent idea, if somebody is willing to pick it up :heart_eyes_cat:

If we needed some space (e.g. for a new file), the FS would just do its job, but it would also compile a list of the chunk copies that got deleted in the process, and then report them to the storage manager nodes (or whatever they are called) at the first opportune moment.
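
A minimal sketch of how that hand-off could look (all names here are hypothetical, nothing is MaidSafe API): the file system reclaims blocks it had lent out, remembers which chunk copies were dropped, and reports them later.

```python
class OpportunisticStore:
    """Sketch of a file system lending free blocks to SAFE (hypothetical names)."""

    def __init__(self, report_callback):
        self.lent_blocks = {}           # block_id -> chunk_id parked there
        self.pending_evictions = []     # chunk ids whose copies we dropped
        self.report = report_callback   # e.g. notify the data managers

    def reclaim(self, block_ids):
        """Called by the FS when it needs blocks back for a new file."""
        for block_id in block_ids:
            chunk_id = self.lent_blocks.pop(block_id, None)
            if chunk_id is not None:
                self.pending_evictions.append(chunk_id)
        # The FS carries on immediately; reporting happens later.

    def flush_report(self):
        """At the first opportune moment, tell the network what was evicted."""
        if self.pending_evictions:
            self.report(self.pending_evictions)
            self.pending_evictions = []

# Example: collect the reports in a list instead of talking to the network.
reports = []
store = OpportunisticStore(reports.append)
store.lent_blocks = {1: "chunk-aaa", 2: "chunk-bbb"}
store.reclaim([1, 2])
store.flush_report()    # reports == [["chunk-aaa", "chunk-bbb"]]
```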

The Chunks are obfuscated before they’re sent to a Vault, so the person owning the Vault doesn’t know what’s inside. He will only see random Chunks.

Chunks are spread evenly. That’s because the hash of a Chunk determines its address, which places it with the datamanagers whose addresses are closest to it, and hashes are quite random. So with 1 million Chunks in the network, they’re spread quite evenly over the datamanagers. And losing 2 copies of a Chunk at the same time isn’t really a problem: the network will notice and create 2 extra copies as fast as it can.
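
A toy illustration of why hashing spreads chunks evenly, assuming SHA-256 content addresses and XOR closeness (the real routing parameters may differ):

```python
import hashlib, os

def address(data: bytes) -> int:
    """A chunk's network address is just the hash of its content."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def closest_nodes(chunk_addr: int, node_addrs, k=4):
    """The k nodes whose addresses are nearest the chunk in XOR distance."""
    return sorted(node_addrs, key=lambda n: n ^ chunk_addr)[:k]

# Random node addresses stand in for the datamanager groups.
nodes = [address(os.urandom(32)) for _ in range(100)]

chunk = os.urandom(1024)                      # pretend this is a 1 MB chunk
managers = closest_nodes(address(chunk), nodes)
# Because the hash output is effectively uniform, a large number of chunks
# ends up spread evenly across the whole node address space.
```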

Another thing to notice is caching. Popular files might see 20 or 30 copies around the network in cache. So to create a real problem with storage you need something like this:

A file that hasn’t been touched for some days, where 3 copies of the same Chunk are lost all at once within a couple of seconds. That would be a serious problem. IMO this could only happen if we have centralized datacenters that make up x% of the total network (say 1% or higher) and their power/connection is gone in an instant, and all 3 copies of a Chunk happen to be stored at that datacenter.


Let’s say I upload Star Wars XI to the network because I’m a jerk and I don’t care about all the sweat and suffering that went into making it. The “link” for this file gets public and everybody starts downloading it (now let’s pretend there’s no caching, so we only have our 4 copies to destroy). Basically: the chunk ids are public, correct? Could the malwarized hounds of the RIAA (or whoever else) still not target them?

In other words, the obfuscation you’re talking about: is that another layer on top of the already opaque chunk ids?

I think I wasn’t clear about what I was talking about. I was considering the potential loss caused by single events.

That chunks are spread evenly between vaults doesn’t imply that vaults are spread evenly by location. For example, you think you’re cool because you run a shiny little vault on your laptop, but the guy next door is a total freak who filled up his bedroom with a small server farm and is running a thousand vaults. If your laptop falls in your soup, that’s less of a problem for the network than if your neighbor’s house burns down.


No, they can’t target them. Sorry that I can’t provide a link yet :smiley:, I’ll try to find one. Nobody is able to target any of your Jedi Chunks™. The only thing one could do is request them (GET). If more people do this, then more is cached, so the file becomes faster. What I’ve understood from David is that the datamanagers will obfuscate the Chunks, so the Vaults have no clue. And the Vaults use their own encryption as well. The keys are in RAM, so they’re gone when you turn off your computer. Now say you’re the RIAA and you want to find these damned Jedi Chunks™. Where do you start? How do you know which Vault holds what? Maybe you’re a datamanager and you see 1 Chunk come by, but as far as I know a datamanager doesn’t know the IP address of the Vault. So you’re still in the dark.

Yes, this is the datacenter problem as I said; indeed, not random in physical space. That’s right, there seems to be a weak spot there. Don’t know what the devs think about that. But I guess they’re busy. Not gonna poke them.


Wow, that’s clever; the RIAA (though I’m pretty sure they’re the music business vultures, but anyway) would have to install their malware on both the vaults and the data managers, and not only that, but the two would have to be in touch! My Jedi Chunks™ are safer than I expected :smirk_cat:

I don’t think it’s very serious as long as there aren’t just a handful of really big players commanding the majority of the vaults. This is why I started thinking about Amazon, Rackspace, and other cloud services starting to utilize their free capacity for this purpose. Because then what if one day they just dump all vaults because why not :scream_cat:

The degree of physical centralisation affects the probability of an event occurring where a significant percentage of network data abruptly goes offline, but it does not directly affect the probability of data loss when such an event occurs. So I think they should be treated as two separate questions, though of course they both matter in the end.


Let’s rephrase the question in the OP: What are the odds of losing all copies of a chunk if suddenly 1% (in terms of capacity) of the network goes offline?

Obviously, any copy has a 1% chance of being in that lost 1%. The network tries to keep 6 copies of every chunk, and it considers 4 the bare minimum. For simplicity’s sake, let’s settle on an average health of the network where 5 copies of a chunk are kept. The risk of losing all copies of any particular chunk is then:

0.01^5 = a chance of 1 in 10 billion (or 10 milliard for non-English Europeans and others)

Since most chunks will be 1 MB, this means that if that lost 1% amounts to a capacity of 10 petabytes (i.e. total network capacity is 1 exabyte, a million terabytes), the probability of losing 1 chunk is 1.

Now if we lose 10% capacity at once, then the picture becomes very different… We’d likely lose about 100K chunks if total network capacity is 1 exabyte. This doesn’t take into account caching or a hypothetical data recovery protocol when most of the lost vaults come back online, but it’s still pretty scary in my opinion. One major natural disaster at the “wrong” location could result in data loss, and in my view even merely one lost cat picture would significantly tarnish the perceived reliability of the network.

I’ve proposed it before, but I think it would be a great help if we added a parity chunk to every file’s datamap by default. For the cost of merely one extra chunk of data per file (so between 4 KB and 1 MB), it becomes possible to recover any lost chunk of that file, as long as no more than one chunk was lost. It would reduce the probability of data loss (on a file level) by orders of magnitude. In other words, it becomes acceptable for any file to lose one of its chunks.
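
To make the parity idea concrete, here is a minimal sketch (this is the proposal above, not an existing SAFE feature): the parity chunk is simply the XOR of all of a file’s chunks, padded to equal length, and any one missing chunk can be rebuilt from the survivors plus the parity.

```python
def xor_parity(chunks):
    """Build a single parity chunk as the XOR of all data chunks."""
    size = max(len(c) for c in chunks)
    parity = bytearray(size)
    for c in chunks:
        for i, b in enumerate(c):
            parity[i] ^= b
    return bytes(parity)

def recover(surviving_chunks, parity, lost_len):
    """Rebuild the single missing chunk: XOR the parity with the survivors."""
    rebuilt = bytearray(parity)
    for c in surviving_chunks:
        for i, b in enumerate(c):
            rebuilt[i] ^= b
    return bytes(rebuilt[:lost_len])

chunks = [b"chunk-one", b"chunk-two!", b"chunk-three"]
parity = xor_parity(chunks)
restored = recover([chunks[0], chunks[2]], parity, len(chunks[1]))
assert restored == chunks[1]
```

An N-chunk file would then cost N+1 chunks to store, and reads would be unaffected unless a chunk actually goes missing.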


Noooooo; probabilities don’t work like that. For one, you would need to lose just over 75% of your data (with 4 copies per chunk) for the probability of losing a chunk to literally reach 1. On the other hand, the probability of losing one gets high much quicker than you’d expect.

The problem with your calculation (0.01^5) is that you’re computing the chance for one particular chunk, but what we care about is whether there is any chunk at all that gets lost. And that requires a different formula, for which we also need the number of chunks on the network.

Let’s stick with that exabyte, and say we have 10^12 chunks with 5 copies each, then compute the chance that at least one chunk has all of its copies lost, for different percentages of sudden loss (a short script reproducing these numbers follows below):

  • 1% of storage lost: 1 - (1 - (0.01^5))^(10^12) = 99.9999999999999999999999999999999999999999962799240425795439…% – certain on a post-galactic scale if there’s such a thing
  • 0.5% lost: 95.61% probability
  • 0.4%: 64.08%
  • 0.36997%: 50.00% – equal probability between losing at least one chunk, or not losing any
  • 0.3%: 21.57%
  • 0.2%: 3.15%
  • 0.1%: 0.09995% – I say we’re pretty safe

EDIT: Actually, this formula is not correct because it doesn’t take into account that losing 80%+any amount of the data means there is at least 1 chunk that had all 5 of its copies lost; but I think it’s on the right track.
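
For anyone who wants to play with these numbers, here is a small script that reproduces the list above, under the same assumptions (10^12 chunks, 5 copies each, copies placed independently):

```python
import math

N = 10**12        # number of chunks on the network (assumption from the post)
COPIES = 5        # copies kept per chunk

def p_any_chunk_lost(x):
    """Probability that at least one chunk loses all of its copies when a
    fraction x of the network's storage suddenly disappears."""
    # 1 - (1 - x^5)^N, computed via log1p/expm1 so the tiny numbers survive
    return -math.expm1(N * math.log1p(-(x ** COPIES)))

for x in (0.01, 0.005, 0.004, 0.0036997, 0.003, 0.002, 0.001):
    print(f"{x:.5%} lost -> {p_any_chunk_lost(x):.4%} chance of losing at least one chunk")
```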

I did that, but I didn’t make that step explicit, so I guess it was easy to miss. One chunk has a chance of 1 in 10 billion to be lost in my example, so if 10 billion chunks go offline (that’s the 1% of the 1-exabyte network), then the probability of losing one is 1. 1/10000000000 * 10000000000 = 1

That’s what I’m saying though: it doesn’t work like that. We have a (1 - 0.01^5) probability of not losing a given chunk during a 1% loss. We can’t just multiply probabilities by the number of chunks, because we’re not supposed to be able to end up with something greater than 1, yet we would. Instead, we need to raise this number to the power of how many chunks we have, which gives the probability of not losing any of those chunks. Then “1 minus this” gives the probability of losing at least one chunk. At least that’s how I understand these kinds of things work.

All these calculations don’t consider that archive nodes will also exist (though not at first).

@Tim87 you will find that he was saying that if you take one chunk, then the probability that all its copies were in the 1% loss is 0.01^5. Then if you work this out for every chunk, the probability is 1 (100%).

I like the concept, but one issue is that as the file grows larger, the time required to recover one chunk increases at a very much faster rate than (number of chunks) × (the time to recover a 1-chunk file).

I would still argue that this should be done at the APP level (even if in launcher/client/NFS) so that the user decides which of their files require this extra protection. After all, there are some files for which I would not. That library of my “funny pictures” gathered from my travels over the web is not important enough for me to store an extra chunk per file for the 100,000 photos. But for my family photos I would be willing to store an extra chunk for each of the 1,000 photos.

That does not seem to make sense. You are taking the probability that a particular chunk is in the 1% loss and then raising that to the power of 1E12 (the number of chunks in the network).

What does that even mean??

It would seem you are taking the chance of NOT having all 5 copies of a particular chunk in the 1% (1 - (0.01^5)) and then saying that 1E12 of those exist.

Of course, to have 1% of the network storage in one facility when the network is say 1E12 chunks (that 1% being 1E10 chunks, or 1E16 bytes) is in my opinion very unlikely.

Basic economics and keeping the shareholders happy would exclude any commercial datacentre from attempting this. The ROI for commercial farming is simply not going to be enough, especially when it is more profitable to hire the resources out to customers.

That only leaves a government or a malicious “rich guy” who is not after profit but disruption. That is a hell of a lot of money to make me lose a “cat video” or two. And who is to say that I did not store two copies of my vid? In other words, by the time the network is 1E12 chunks in size it can be considered useful, and thus the ROI on that disruption is not enough to make it a reasonable attempt.

Maybe we need an app that can do par blocks across files (they already exist), mount my SAFE drive, and simply create a set of par blocks for each group of files. Yes, it may cost me 1% more in PUT costs over my total PUT cost to store all those par files, but then even your 1% unlikely situation will not worry me. Except for the cost to reconstruct. Hell, I could even store the par files encrypted on a local drive. That means my personal info & family photos are safe from prying eyes. The par files cannot reconstruct the files without the SAFE drive mounted, so they are useless to prying eyes, encrypted or not.

TL;DR
I am more worried by war disrupting the network than any commercial entity. But then without the SAFE network I may have nothing. So something is better.

Oops, I guess I wasn’t paying attention :scream_cat:

I exchanged a couple of emails with David years ago, just after I first heard about Lifestuff. I was thinking that an N-of-M kind of setup could give much, much more resilience for the same storage space. For example, an 8-out-of-12 strategy (splitting a chunk into 8 slices, then generating 4 more parity slices) would require only 50% extra space, but it would allow any 4 of the 12 slices to be lost and the chunk would still be available. What this means is that we get the same tolerance for losses as storing 5 whole copies, but using 1.5x the space instead of 5x.
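
As a rough sanity check of that intuition, here is a comparison under a deliberately crude model: every stored copy or slice disappears independently with the same probability p. The 8-of-12 parameters are the ones from the example above; everything else is an assumption.

```python
from math import comb

def loss_5_copies(p):
    """Chunk is gone only if all 5 full copies are lost."""
    return p ** 5

def loss_8_of_12(p):
    """Chunk is gone if fewer than 8 of the 12 slices survive,
    i.e. if 5 or more slices are lost."""
    return sum(comb(12, k) * p**k * (1 - p)**(12 - k) for k in range(5, 13))

for p in (0.01, 0.05, 0.10):
    print(f"p={p:.2f}: 5 copies -> {loss_5_copies(p):.2e} (5x space), "
          f"8-of-12 -> {loss_8_of_12(p):.2e} (1.5x space)")
```

Which scheme gives the lower absolute loss probability depends on p and on how correlated the failures are; the main point of the example above is the storage saving for the same number of tolerated losses (4).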

I can’t exactly remember what the problem was at the time, or if that same problem still applies today (almost everything has changed since). All the slicing and dicing would be done on already encrypted, opaque blobs, so it would just be a matter of organizing how they are stored; I can’t think of anything security-related that could go wrong.

One truly serious complication (unless/until somebody comes up with more) is that the validity of the slices could no longer be checked as simply as it is now (sending a nonce, getting a checksum, done: we could no longer do that; there may be some fun math to do something similar, though). I wonder if the enormous savings on storage might still justify giving it a shot, eventually.
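
The simple check being referred to presumably looks something like this (an illustration of the nonce-and-checksum idea, not the actual vault protocol): the checker sends a random nonce and expects back a hash over the nonce plus the stored chunk, which it can verify against a copy it holds itself. With slices, no single holder has the whole chunk, so this exact trick stops working.

```python
import hashlib, os

def challenge_response(nonce: bytes, chunk: bytes) -> bytes:
    # The holder has to possess the full chunk to compute this.
    return hashlib.sha256(nonce + chunk).digest()

def verify(nonce: bytes, trusted_copy: bytes, response: bytes) -> bool:
    # The checker recomputes the expected answer from a copy it trusts.
    return challenge_response(nonce, trusted_copy) == response

chunk = os.urandom(1024)
nonce = os.urandom(16)
assert verify(nonce, chunk, challenge_response(nonce, chunk))
```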

Otherwise, it’s not exactly super complicated, though it would make things a little slower (I doubt it would be noticeable, to be honest.)

We would need to store a two-way mapping between chunks and their slices (a rough sketch follows the list):

  • When we need a chunk, we need to know which slices to look for, and then we need to put it together from (some of) them.
  • If a slice is lost, we need to know which chunk to get and put together, so we can re-generate the lost slice.
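
A minimal sketch of that bookkeeping, with made-up names (the chunk and slice IDs here are just stand-ins):

```python
class SliceMap:
    """Two-way bookkeeping between chunks and their slices (illustrative only)."""

    def __init__(self):
        self.chunk_to_slices = {}   # chunk_id -> ordered list of slice_ids
        self.slice_to_chunk = {}    # slice_id -> (chunk_id, position)

    def register(self, chunk_id, slice_ids):
        self.chunk_to_slices[chunk_id] = list(slice_ids)
        for pos, slice_id in enumerate(slice_ids):
            self.slice_to_chunk[slice_id] = (chunk_id, pos)

    def slices_for(self, chunk_id):
        """Which slices to look for when we need this chunk (any 8 will do)."""
        return self.chunk_to_slices[chunk_id]

    def chunk_for(self, slice_id):
        """Which chunk to rebuild, and at which position the regenerated
        slice belongs, when this slice is lost."""
        return self.slice_to_chunk[slice_id]
```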

I can’t speak to the mathematical probability of SAFE, but I used to use Freenet back in the day, before it became a pedohaven, and files would constantly become unreachable and unrecoverable. Just like MaidSafe it was designed to break up files into chunks and spread redundant copies all over the place, but it seemed to fail at that unless a file was really popular, which I guess is why it only served the pedos. Uploads had a shelf life of 2-3 months. Literally all the network had to do was lose 1 chunk of a 3428349823-chunk file and the whole thing was unrecoverable.

I don’t know if MaidSafe stores stuff permanently or only keeps it based on request frequency, but hopefully they manage to keep stuff alive.

It certainly would, but then so does the number of copies.

The problem I have found with N-of-M is that as the file grows, so does the overhead.

For instance, let’s examine the relative times for reconstructing a chunk. (This is from my experience, having used such systems for over a decade to reconstruct files stored on a medium that only stores reliably 95-99.9% of the time.)

  • 1 MB file: reconstructing one lost block takes 1 unit of time
  • 10 MB file: around 15 units
  • 100 MB file: around 300 units
  • 1 GB file: around 7,500 units
  • 10 GB file: can be enormous, depending on the PC’s memory size

VERY importantly, N-of-M requires the whole file’s worth of chunks to be downloaded to reconstruct EVERY file, because of the way N-of-M builds the M chunks (blocks).

SAFE can retrieve individual chunks without needing to read the whole file. 1 random chunk requires only 3 chunks to be downloaded to decrypt it. Streaming a portion of a file only requires 3 chunks for the first, then one chunk at a time as needed.

Imagine a static database file (say a novel with vids embedded etc., 100 GB in size) that has index files (chapter/page), and you want to start reading from chapter whatever: then you need to download the 100 GB plus the index files to do that, and have the spare local disk space for the reconstruction (about 8 hours on my 100 Mbit/s download).

On SAFE that takes a second or two (6 chunks: 3 for the DB, 3 for the index, to start), and it can use RAM buffers to hold the chunks.

N-of-M also requires disk space to store the file during reconstruction, which leaves a potential security flaw: the next person using that common PC could reconstruct personal info from the deleted files. SAFE doesn’t even need to store the file on disk, temporarily or otherwise.
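
To put rough numbers on that difference, here is the arithmetic for the 100 GB example above, assuming 1 MB chunks and that about 3 chunks are needed to decrypt one chunk (figures are illustrative, not measurements):

```python
GB = 10**9
MB = 10**6

file_size = 100 * GB      # the example database file
chunk_size = 1 * MB

# Whole-file N-of-M: any read needs roughly the whole file's worth of slices.
n_of_m_download = file_size

# Per-chunk retrieval: reading one piece needs about 3 chunks.
per_chunk_download = 3 * chunk_size

print(f"whole-file N-of-M: ~{n_of_m_download / GB:.0f} GB to read one entry")
print(f"per-chunk (SAFE-style): ~{per_chunk_download / MB:.0f} MB to read one entry")
```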

And that’s where it stays. I’m talking about redundancy on the chunk level: individual chunks, once encrypted etc., are sliced up, parity slices are added, and the slices are stored away. This process wouldn’t know anything about what happens above the chunk level.

Then when you lose all copies of the chunk, you lose the rebuilding info as well. N-of-M loses its appeal when it’s only done per chunk.

SAFE has error checking on each chunk stored, so if there is an error it retrieves another copy of the chunk.