Storage proceeding

Well it’s a public forum and that discussion that I linked has been there since 2014.
I am on mobile so I can't easily search, but IIRC David himself has said that once the network gets larger it will be difficult to mount that kind of attack, and I agree. It's easy to envision a laughable scenario with 500 nodes, but with 10,000 nodes that also have to earn basic reputation by behaving well, it wouldn't be easy.

At the same time, as I said in that linked comment, I don't care if the attack is merely "possible". If you need a backup, make a backup. Or post 2 or 3 copies.
Nobody designs storage (or anything else) to be unbreakable. There are costs to consider. Nobody should be required to pay for 28 copies of all chunks, just so that people with large files can get six nines.

Half-baked idea: vaults are filled with real data and those chunks are checked; that's how it works by default. Redundancy is not faked or demonstrated by a vault - it holds only one copy of the chunks entrusted to it. It doesn't know anything about the rest, or whether those other nodes/vaults are lying; that checking is performed by other node roles in SAFE. Each vault takes care of its own business.

3 Likes

Or better yet, use an error-correcting code. Since chunks are self-validating (with their hash), we already know which chunk is missing or corrupt, so these codes can be very efficient. It's possible to create a parity file for an uploaded file that is no larger than the largest chunk of the file (so max 1 MB) and use it to reconstruct any missing chunk, as long as no more than 1 chunk was lost.

Simplified example of a file consisting of 5 chunks of 8 bits in size each:

chunk1: 10110011
chunk2: 01011001
chunk3: 10001011
chunk4: 00111100
chunk5: 11110010

Now we will look at each column of bits and write down a 1 if there's an odd number of 1's in that column, otherwise we write down a 0. The first column is:

1
0
1
0
1

There are three 1's, an odd number, so we write down 1 for this column. If we also do this for the other columns, we end up with this:

chunk1: 10110011
chunk2: 01011001
chunk3: 10001011
chunk4: 00111100
chunk5: 11110010
parity: 10101111

Now let’s imagine that chunk2 is lost due to an attack on the network:

chunk1: 10110011
chunk2: 
chunk3: 10001011
chunk4: 00111100
chunk5: 11110010
parity: 10101111

We can reconstruct chunk2’s data by looking again at the number of 1’s per column. If there’s an even number of 1’s, we enter a 0, else a 1. If we do this (humor me and don’t look back at the answer), we end up with 01011001, which is indeed chunk2’s data if we look back.
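
In code, the same parity trick looks roughly like this (a minimal sketch with names of my own choosing, operating on whole byte chunks instead of single bits; it assumes all chunks are padded to the same length and that at most one chunk is lost):

```python
# Minimal sketch of the parity idea walked through above, applied to whole
# chunks (bytes) instead of single bits. Names and layout are my own
# illustration, not anything from the SAFE codebase.

def make_parity(chunks):
    """XOR all chunks together; the result is the parity chunk."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)

def rebuild_missing(damaged_chunks, parity):
    """Rebuild the single missing chunk (marked as None) by XORing the
    parity chunk with every surviving chunk."""
    rebuilt = bytearray(parity)
    for chunk in damaged_chunks:
        if chunk is not None:
            for i, byte in enumerate(chunk):
                rebuilt[i] ^= byte
    return bytes(rebuilt)

# The 5-chunk example from above, one byte per chunk:
chunks = [bytes([0b10110011]), bytes([0b01011001]), bytes([0b10001011]),
          bytes([0b00111100]), bytes([0b11110010])]
parity = make_parity(chunks)                # 0b10101111, matching the example
damaged = [chunks[0], None] + chunks[2:]    # chunk2 lost in the attack
assert rebuild_missing(damaged, parity) == chunks[1]
```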

This means that in order to corrupt the file, an attack on the network has to delete at least two of its chunks. If there's a one in a million chance that any given chunk of your file would be deleted by the attack described previously, the chance of losing any particular pair of chunks is about one in a trillion (10^-12). So by uploading at most 1 MB of extra data per file, you reduce the chance of being the victim of non-targeted data corruption by orders of magnitude.
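
For a rough sense of the numbers with the combinatorics included (my own back-of-the-envelope assumptions, not anything measured on the network), here's the same reasoning: with one parity chunk, the file is only unrecoverable when two or more of its chunks are lost.

```python
# Back-of-the-envelope check: if each chunk of a file is lost independently
# with probability p, compare "any chunk lost" (no parity) against
# "two or more chunks lost" (one parity chunk).
from math import comb

p = 1e-6   # assumed chance of losing any one chunk in the attack
k = 100    # assumed number of chunks in the file (e.g. a ~100 MB file)

p_no_parity = 1 - (1 - p) ** k                       # any chunk lost: ~1e-4
p_with_parity = sum(comb(k, i) * p ** i * (1 - p) ** (k - i)
                    for i in range(2, k + 1))        # 2+ chunks lost: ~5e-9
print(p_no_parity, p_with_parity)
```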

4 Likes

“.par2” files have been used for USENET binaries for a long time. You can have many par files, and you can recover as many lost chunks as you created 1 MB par chunks.

3 Likes

Yes, I was sure that was possible but didn't know exactly how, so I couldn't explain it. The algorithm used for that is more complicated.

Perhaps we should contemplate integrating this functionality into the base client, or perhaps even into the core network. It'd be an extremely cost-effective additional protective layer against data chunk loss. Such parity chunks could be appended to the datamap, and they'd only be downloaded by the client if any chunks cannot be retrieved.
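
Something like the following client-side logic is what I have in mind (purely hypothetical sketch; the datamap layout and function names are invented for illustration and are not the actual SAFE API):

```python
# Hypothetical sketch only: parity chunk references ride along in the
# datamap, and the client downloads parity only when a data chunk cannot be
# retrieved. Assumes a single parity chunk and equal-length, padded chunks.

def fetch_file(datamap, get_chunk):
    """datamap: {'chunks': [addr, ...], 'parity': [addr]} (assumed layout).
    get_chunk(addr) returns the chunk bytes, or None if retrieval fails."""
    chunks = [get_chunk(addr) for addr in datamap['chunks']]
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) == 1:
        # Fallback path: fetch the parity chunk and XOR-reconstruct the gap.
        parity = get_chunk(datamap['parity'][0])
        if parity is None:
            raise IOError("data chunk and parity chunk both unavailable")
        rebuilt = bytearray(parity)
        for c in chunks:
            if c is not None:
                for i, b in enumerate(c):
                    rebuilt[i] ^= b
        chunks[missing[0]] = bytes(rebuilt)
    elif missing:
        raise IOError("more chunks lost than a single parity chunk can repair")
    return b"".join(chunks)
```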

2 Likes

There has already been a thread along these lines.

It proposed using the SIA network's method of storing only par2-type chunks and reconstructing the file from them: store more error-correcting chunks (n) than the number of chunks in the original file (m), and the file can be recreated from any m of those chunks.

It creates a large processing overhead for perhaps little benefit, and the processing time multiplies as the file gets larger.
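
For anyone unfamiliar with how the "recreate from any m chunks" property can work at all, here is a minimal, self-contained toy sketch of my own (it uses exact rational arithmetic and polynomial interpolation purely for illustration; real schemes like par2 or SIA's work over finite fields on megabyte-sized chunks, which is exactly where the processing overhead comes from):

```python
# Reed-Solomon-style "any m of n" recovery in miniature: treat the m data
# values as polynomial coefficients, store n evaluations of that polynomial,
# and recover the coefficients from any m surviving evaluations.
from fractions import Fraction

def encode(data, n):
    """Evaluate the data polynomial at n distinct points; each (x, y) pair
    is one stored 'chunk'."""
    poly = lambda x: sum(Fraction(c) * x ** i for i, c in enumerate(data))
    return [(x, poly(x)) for x in range(1, n + 1)]

def decode(chunks, m):
    """Recover the m coefficients from any m surviving (x, y) chunks via
    Lagrange interpolation."""
    pts = chunks[:m]
    coeffs = [Fraction(0)] * m
    for j, (xj, yj) in enumerate(pts):
        # Build the Lagrange basis polynomial L_j(x) = prod_{k!=j} (x-xk)/(xj-xk)
        basis, denom = [Fraction(1)], Fraction(1)
        for k, (xk, _) in enumerate(pts):
            if k == j:
                continue
            new = [Fraction(0)] * (len(basis) + 1)
            for i, c in enumerate(basis):
                new[i] += c * -xk       # constant term of (x - xk)
                new[i + 1] += c         # x term of (x - xk)
            basis, denom = new, denom * (xj - xk)
        for i in range(m):
            coeffs[i] += yj * basis[i] / denom
    return [int(c) for c in coeffs]

data = [17, 42, 7]                       # m = 3 original values
stored = encode(data, n=5)               # 5 chunks; any 3 are enough
assert decode([stored[0], stored[3], stored[4]], m=3) == data
```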

The idea of one or two error-correcting "par" files would not have the same massive overhead for large files, but creation would still be significant. And it would mean that all files stored would cost more: the PUTs for the 1 or 2 "par" chunks would have to be paid for as well.

I suggest that an APP (native to the computer OR a SAFE APP) is used to create the "par" files, so the user remains in control of which files get an error-correction file. It could be made seamless for the user, with the SAFE APP doing the work of creation and rebuilding if needed. The technology for error-correction files changes often, and this approach lets the core network remain simple while the user chooses which method he wants.

2 Likes

Yes, putting parity data creation or validation network-side is probably not a good idea; it should be client-side work only. I'd prefer to see it in the base SAFE client code though, so no one will be unaware of the possibility.

What does that gain you in the first place?

Nobody does that. If you want to fix something, ask SAFE to enable a --paranoid mode or something that makes 2x or 3x the number of chunk replicas vs. the default.

Why??
The network does integrity checking for you (if it doesn’t, a bug report should be submitted).
Imagine people writing shell scripts to check the validity of their own disks in an LVM volume.
That would be totally ludicrous.

One may want to check whether the file is there, not whether it's corrupt. Even that is silly, though. If you don't think your data can survive, you should use a different system rather than spend all day verifying checksums from random chunks (GETs which cost SC, by the way).

1 Like

If it were done network-side it would be a network-wide feature by default, which would make the network more secure but also make vaults heavier to run.

The idea is not that people will do this manually, but that it’s a simple button/slider/setting integrated in the client software, with a basic explanation on a mouse-over for example. Also, additional copies of the entire file are horribly inefficient compared to additional parity chunks.

1 Like

It can't be done on the network side because the network doesn't know the location of the datamap.

Do you mean in the client? Doing it network-side would mean creating a new storage mechanism that keeps the datamap as part of the network.

I like @janitor's idea of a "--paranoid" option, and having that as a client option. Personally I'd prefer it on a file-by-file basis and would only use it for certain files. That image I put up for people to laugh at, I wouldn't be concerned about.

Seeing as SAFE is going to be very close to never losing data, I'd say two extra chunks would make data loss for those files extremely rare.

1 Like

I don’t see the point in a paranoid option.

The network is designed to not lose data. If the design needs tweaking to ensure this is robust, that will happen.

Obviously no system can ever be 100% though, even with a paranoid option. We have “the thing we never thought of”, and “the thing we thought would never happen” to thank for that.

So as is standard practice, if you want to dramatically improve your data security, you need backups that are independent, held on separate systems, in separate locations, regularly tested etc.

@janitor has also said this, and I think it is the only sensible alternative really. Bumping up chunk redundancy is a very poor substitute IMO.

Maybe David will one day convince us that decentralisation is just as good, but it still has single points of failure at this point (protocols, code, compiler etc). One day that may change, but until then, I think I’ll be keeping backups of data I don’t ever want to lose.

2 Likes

That way it would impose a tremendous burden on all responsible farmers. Total socialization of risk (with no payback for those who would be paying for it).

At least in the less terrible version the data owner would have to GET the chunks (and pay for that).

What is the connection between the question (loss of all chunk replicas) and your “solution” (prevention of non-targeted data corruption)?

Spoken like a true uploader! Farmers already look likely to get screwed and you’re proposing yet another non-paying workload for them.

Well, those who read this post below may already be convinced that the level of protection is good enough.
Hilariously, that topic was started by Seneca, who was a capitalist back then (quote: "I can imagine rich/corporate users would be willing to pay for more redundancy" - whereas now he's arguing that the cost of added protection for "the rich" should be paid by the farmers :slight_smile: ), and David already addressed those questions by posting multiple comments.

What? No, it’s just a few extra data chunks that farmers get paid for hosting. And farmers only look likely to get screwed in your own head due to the assumptions you make.

Yes, the parity files don't have a big overhead, but verification has a very high cost in terms of compute and network resources.

Can you explain, since the network already checks status of all chunks, what is the benefit of having this extra layer?

Is it “if all 4 replicas of the same chunk get deleted or corrupt at the same time, they can be recreated”? What is the likelihood of that happening in a 10,000 vault network?

It is not uncommon for large geographic areas to have internet outages that go on for hours…

I think a better solution would be to create persistent vaults that serve as backups for the rest of the network…

Farmers will not “get screwed”

The market will do what it does. I suspect that farmers are going to be abundant, and thus they will get paid poorly. This doesn't mean they are "getting screwed". The value is in the network, not the $$ return… If the network isn't enough payment for you, then don't farm. If enough farmers do this, the abundance will go away and prices will increase. It will nearly always be a break-even endeavor, however; it's a commodity market with little barrier to entry…

But that is irrelevant to the issue at hand.

You well know that replicas are randomly scattered around the world, so not only would 4 "regions" (continents) need to go down, they'd have to go down at about the same time and never come back.
I don't know why it's so often necessary to re-explain things in one and the same topic. Just two comments ago I asked Seneca for his estimate of the likelihood of such a catastrophic loss. You probably saw that, but you still made a comment about isolated regional downtime (which doesn't even involve the loss of a single chunk replica (let alone four), but is rather temporary downtime of one chunk, which is almost impossible for the user to even notice).

It's not market behavior when irresponsible farmers create a non-paying workload for responsible farmers.
That is like Obamacare: you don't want to participate in mandatory insurance, but you have to, and the more irresponsible you are, the more system resources are allocated to fix that.
In an interconnected system such as this one some "socialization" must be present, as it's not always possible to precisely calculate and allocate costs on the fly without causing outages, but why add more unnecessary overhead without rational justification (according to David's comment I linked above)?

What would you tell people (be they farmers, uploaders, or downloaders) who "exploit" the network?
Is there any use of the network that you would consider economically unethical (say, some users taking advantage of the bad economics at the expense of other users)?

Interesting reference. So I guess what you're saying is that Obamacare is flawed, and you assume the system will not work out because it gets naturally exploited. As a citizen of a state where insurance has been obligatory for a long time, I can tell you that the system works far more efficiently than the handling in the US prior to Obamacare, precisely because people see some value in not being ill. That's not only a good metaphor for what I suppose @jreighley is saying here: the value is in the system. Let's say it costs you 10 EUR to store 100 MB on SAFE. That doesn't sound like a good deal to you, I guess? Yes, this is not competitive with Google's free plan for 15 GB, but we all know that this is NOT free, and if you host your own server then that comes with some drawbacks as well. At some point people may have an interest in buying this (potentially!) expensive storage, knowing that it's securely encrypted and that people cannot take it down (easily).

There is a huge difference in that I don’t have to farm…

I get fined if I don’t buy insurance.

The users will pay for PUTs, and the network will pay the farmers whatever it needs to pay to have enough farmers to do whatever needs to be done. If you don't like it, don't farm. If you do, do. There is no coercion in the deal, thus you are not "getting screwed" if you don't get paid what you think you ought to get paid.

Those who "exploit" the network will have to pay the price that the network decides to charge them. If they pay what they are asked to pay, then it isn't "exploiting".

Folks will use SAFE because they don't want their data to be hacked. That is plenty of incentive to hand over as much hard disk space and bandwidth as is required to sustain the amount of data that you care to store… You don't need to be paid in FIAT of any flavor for this to be a worthwhile transaction, because you would have hard drives and bandwidth that you are paying for either way. Using SAFE just gets you the benefit of being rather hackproof, and redundantly backed up…

Okay, that explanation makes sense - a matter of positioning (not necessarily for the lowest cost of service).
David claimed that 4 replicas is plenty (maybe that could even be lowered to 3, he said on that link) and I agree with that (because the likelihood of 4 copies of the same chunk being destroyed within an hour or so is ridiculously small). Note that in this extremely unlikely scenario there would be a loss of 1 file (not some user's entire SAFE data, etc.).
Related to Obamacare, adding more insurance would be like bundling mandatory insurance against some rare jungle disease into all plans, I suppose.

Yes, but you're effectively transferring the cost of a premium service to users who don't need that service (who think 4x is fine).
Or maybe they do. I personally would be happy with 3x. Not many enterprises have more than 3 copies of their data (not willingly, anyway).
At some point (4 replicas, 8 replicas, etc.) the cost starts impacting adoption, and you might start wondering what the priority of the project is: an Iron Mountain-like service for enterprise customers, or a secure, safe, low-cost anti-censorship content distribution network for (nearly) everyone.

I discuss this above:

X is the fraction of vaults which disappear permanently and simultaneously. N is the file size in megabytes. Possible scenarios where X is large:

  • A country doing deep packet inspection suddenly cuts off access to SAFE.
  • A group of farmers, feeling they are being “screwed”, collectively agree to run a shutdown -h <time> command.
  • An investor/speculator wants to drive down the price of Safecoin compared to fiat, so they invest in a huge amount of vault capacity, wait years, then shut it off. They buy up Safecoin on the cheap, then submit a PR to GitHub which addresses the problems discussed in this thread. The Safecoin price jumps once confidence is restored. The speculator gets rich.
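
As a rough illustration of the quantities involved (my own numbers, assuming 4 replicas per chunk and roughly one 1 MB chunk per megabyte of file, so about N chunks): the file is lost if all 4 replicas of any one of its chunks fall inside the vanished fraction X, i.e. P(file lost) ≈ 1 − (1 − X^4)^N ≈ N·X^4 for small X. With X = 0.1 and N = 100 that works out to about 1%.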

edit: the formula above has the number 4 as a constant, which is too pessimistic, on average