Storage proceeding

Actually MIN_CHUNK_SIZE=1024 so:

Very small files (less than 3072 bytes, 3 * MIN_CHUNK_SIZE) are not split into chunks and are stored in their entirety.

1 Like

I thought David Irvine said it was combined into the datamap (3 KB of metadata) or something like that, if it’s smaller than the chunk size. But IDK, it was a long time ago and things change.

1 Like

While we’re on the topic of files, and how data could potentially be lost if all 4 copies went down, I want to expand on one thing that I haven’t yet found on this forum: file corruption.

Suppose vault 1 has a corrupted file and vault 2 goes down. Another vault takes its place and gets the data from vault 1, which is corrupted. Now you have 2 corrupted copies and 2 non-corrupted copies. If a non-corrupted copy goes down and is replaced from vault 1 or 2, the corruption spreads again. Once 3 vaults hold the corrupted copy, the 4th vault will eventually be corrupted too.

I was thinking: what if we use something like btrfs “scrub”, which takes the metadata and compares it against each file to ensure its integrity hasn’t been compromised? If it has, it replenishes the data by replacing the corrupted file with a clean, non-corrupted copy.

1 Like

They have changed, and even the chunk size issue. I’m not even sure the 4K minimum is valid anymore. I (seem to) remember there was talk of fixing the chunk size to 1MB to remove some complexity, but I cannot remember where. Maybe it is referred to as 1MB for simplicity of discussion, but I am sure things have changed more recently. I guess when testing is done the chunk size (min/max) will be determined and fixed for the initial version of the network.

Won’t happen because the chunk will not validate and thus vault 1’s chunk will be declared invalid and itself be replaced or removed.

Of course, everyone is aware of this and it was discussed on several occasions before.
The idea is that once the network gets big, it would take a lot of resources to control a big enough share of it.

If you have a RAID that loses 35% of its disks, what happens? You lose data. So you may need a backup.

The file name of a chunk, and its position in XOR space, is based on the SHA-512 hash of its own contents. So a corrupt chunk is detected immediately and eliminated.
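To illustrate that point (a minimal sketch only, not MaidSafe’s actual vault code; it assumes the sha2 crate, and chunk_is_valid / chunk_name are hypothetical names), checking a chunk amounts to rehashing its contents and comparing the result to its name:

// Minimal illustration only: if a chunk's name is the SHA-512 hash of its
// contents, corruption is detected by simply rehashing the contents.
// Assumes the `sha2` crate; `chunk_name` and `chunk_is_valid` are hypothetical.
use sha2::{Digest, Sha512};

fn chunk_is_valid(chunk_name: &[u8], chunk_content: &[u8]) -> bool {
    let recomputed = Sha512::digest(chunk_content);
    recomputed.as_slice() == chunk_name
}

fn main() {
    let content = b"some chunk bytes";
    let name = Sha512::digest(content);

    // An intact chunk validates; a flipped byte does not.
    assert!(chunk_is_valid(name.as_slice(), content));
    assert!(!chunk_is_valid(name.as_slice(), b"some chunk bytez"));
}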

About the chunk size, this is the actual code:

/// Holds the information that is required to recover the content of the encrypted file. Depending
/// on the file size, such info can be held as a vector of ChunkDetails, or as raw data.
#[derive(RustcEncodable, RustcDecodable, PartialEq, Eq, PartialOrd, Ord, Clone)]
pub enum DataMap {
    /// If the file is large enough (larger than 3072 bytes, 3 * MIN_CHUNK_SIZE), this algorithm
    /// holds the list of the file's chunks and corresponding hashes.
    Chunks(Vec<ChunkDetails>),
    /// Very small files (less than 3072 bytes, 3 * MIN_CHUNK_SIZE) are not split into chunks and
    /// are put in here in their entirety.
    Content(Vec<u8>),
    /// Empty datamap.
    None,
}

People really knew that? @Artiscience asked “does that mean that large files are exponentially more dependent on a stable network?” and I think the answer is yes. Also there’s old discussion about controlling 75% of the network as being some sort of magic cutoff but that’s not quite accurate.

Tangentially related… there was discussion in an old thread about proof of redundancy (e.g., many vaults and one shared disk). I gather it’s not quite mathematically solved but may be impractical given the ranking system. However, I have this half-baked idea that maybe filling up a vault’s free space with some sort of junk data and occasionally asking them to hash something against that could be validation of their claimed unique storage. There would be a lot more to flesh out there to actually make it work… Or alternatively, always fill up a vault immediately upon joining the network (so increase redundancy to fill 100% of network storage). Then anyone trying to fake their redundancy will be found out eventually and penalized.

Well it’s a public forum and that discussion that I linked has been there since 2014.
I am on mobile so I can’t easily search but IIRC David himself has said that once the network gets larger it will be difficult to create that kind of attack and I agree. It’s easy to envision a laughable scenario with 500 nodes, but with 10,000 nodes that also have to earn basic reputation by behaving well, that wouldn’t be easy.

At the same time, like I said on that linked comment, I don’t care if the attack is merely “possible”. If you need a backup, make a backup. Or post 2 or 3 copies.
Nobody designs storage (or anything else) to be unbreakable. There are costs to consider. Nobody should be required to pay for 28 copies of all chunks, just so that people with large files can get six nines.

Half-baked idea: vaults are already filled with real data, and that data is checked; that’s how it works by default. Redundancy is not faked or demonstrated by a vault - it holds only 1 copy of the chunks entrusted to it. It doesn’t know jack about the rest, or whether those other nodes/vaults are lying; that checking is performed by other node roles in SAFE. Each vault takes care of its own business.

3 Likes

Or better yet, use an error correction algorithm. Since chunks are self-validating (with their hash), these can be very efficient. It’s possible to create a parity file for an uploaded file with a size no larger than the largest chunk of the file (so max 1 MB) and use it to reconstruct any missing chunk, as long as no more than 1 chunk was lost.

Simplified example of a file consisting of 5 chunks of 8 bits in size each:

chunk1: 10110011
chunk2: 01011001
chunk3: 10001011
chunk4: 00111100
chunk5: 11110010

Now we will look at each column of bits and write down 1 if there’s an odd number of 1’s in that column, otherwise we write down 0. The first column is:

1
0
1
0
1

There are three 1’s, an odd number, so we write down 1 for this column. If we also do this for the other columns, we end up with this:

chunk1: 10110011
chunk2: 01011001
chunk3: 10001011
chunk4: 00111100
chunk5: 11110010
parity: 10101111

Now let’s imagine that chunk2 is lost due to an attack on the network:

chunk1: 10110011
chunk2: 
chunk3: 10001011
chunk4: 00111100
chunk5: 11110010
parity: 10101111

We can reconstruct chunk2’s data by looking again at the number of 1’s per column. If there’s an even number of 1’s, we enter a 0, else a 1. If we do this (humor me and don’t look back at the answer), we end up with 01011001, which is indeed chunk2’s data if we look back.

This means that in order to corrupt a file, an attack on the network has to delete a minimum of two chunks of that file. If there’s a one-in-a-million chance that one chunk of your file would be deleted by the attack described previously, the chance of losing two chunks of your file is roughly one in a million million (a trillion in American usage), assuming the losses are independent. So by merely uploading at most 1 MB of extra data per file, you exponentially reduce the chance of being the victim of non-targeted data corruption.
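For the curious, here is a minimal Rust sketch of the same single-parity XOR scheme (illustrative only, not network code; the xor_chunks helper is a hypothetical name and the chunks are the toy 8-bit values from the example above):

// Illustrative sketch only, not SAFE network code: single-parity XOR erasure
// coding, using the same 8-bit "chunks" as the worked example above.

/// XOR a set of equally sized chunks together. The same operation builds the
/// parity chunk and reconstructs a single missing chunk.
fn xor_chunks(chunks: &[&[u8]]) -> Vec<u8> {
    let mut out = vec![0u8; chunks[0].len()];
    for chunk in chunks {
        for (o, b) in out.iter_mut().zip(chunk.iter()) {
            *o ^= *b;
        }
    }
    out
}

fn main() {
    let chunk1: &[u8] = &[0b1011_0011];
    let chunk2: &[u8] = &[0b0101_1001];
    let chunk3: &[u8] = &[0b1000_1011];
    let chunk4: &[u8] = &[0b0011_1100];
    let chunk5: &[u8] = &[0b1111_0010];

    // The parity chunk is the XOR of all data chunks: 10101111, as above.
    let parity = xor_chunks(&[chunk1, chunk2, chunk3, chunk4, chunk5]);
    assert_eq!(parity, vec![0b1010_1111u8]);

    // If chunk2 is lost, XOR-ing the surviving chunks with the parity chunk recovers it.
    let recovered = xor_chunks(&[chunk1, chunk3, chunk4, chunk5, parity.as_slice()]);
    assert_eq!(recovered.as_slice(), chunk2);
}

With a single parity chunk you can recover only one lost chunk; surviving the loss of two or more requires stronger codes, which is what the following posts get into.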

4 Likes

“.par2” files have been used for USENET binaries for a long time; you can have many par files, and you can recover as many lost chunks as you created 1MB par chunks.

3 Likes

Yes, I was sure that was possible but didn’t know exactly how, so I couldn’t explain it. The algorithm used for that is more complicated.

Perhaps we should contemplate integrating this functionality into the base client, or even into the core network. It’d be an extremely cost-effective additional protective layer against data chunk loss. Such parity chunks could be appended to the datamap, and they’d only be downloaded by the client if any chunks cannot be retrieved.

2 Likes

There has already been a thread along these lines.

It was about using the SIA network’s method of storing only erasure-coded (“par2”-type) chunks and reconstructing the file from them: store more chunks (n) of the error-correcting data than the original file’s chunk count (m), and the file can be recreated from any m of those n chunks (see the sketch below).
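As a rough sketch of that m-of-n idea (an illustration only, not SIA’s or SAFE’s actual code; it assumes the third-party reed-solomon-erasure Rust crate and toy 4-byte shards):

// Rough illustration of m-of-n erasure coding using the `reed-solomon-erasure`
// crate (an assumption for this sketch, not part of SAFE). With 3 data shards
// and 2 parity shards, the data survives the loss of any 2 shards.
use reed_solomon_erasure::galois_8::ReedSolomon;

fn main() {
    let r = ReedSolomon::new(3, 2).unwrap(); // m = 3 data shards, 2 parity shards

    let mut shards: Vec<Vec<u8>> = vec![
        vec![0, 1, 2, 3],
        vec![4, 5, 6, 7],
        vec![8, 9, 10, 11],
        vec![0; 4], // parity shard, filled in by encode()
        vec![0; 4], // parity shard, filled in by encode()
    ];
    r.encode(&mut shards).unwrap();

    // Simulate losing any 2 shards (data or parity).
    let mut received: Vec<Option<Vec<u8>>> = shards.iter().cloned().map(Some).collect();
    received[0] = None;
    received[4] = None;

    // Any 3 surviving shards are enough to rebuild the missing ones.
    r.reconstruct(&mut received).unwrap();
    let recovered: Vec<Vec<u8>> = received.into_iter().map(Option::unwrap).collect();
    assert_eq!(recovered, shards);
}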

That scheme creates a large processing overhead for perhaps little benefit, and the larger the file, the more the processing time multiplies.

The idea of one or two error-correcting “par” files would not have the same massive overhead for large files, though creation would still be significant. And it would mean that all files stored would cost more, because of the PUTs for the 1 or 2 “par” chunks.

I suggest that an APP is used (native to the computer OR a SAFE APP) to create the “par” files, so the user remains in control of which files get an error-correction file. It could be made seamless for the user, with the SAFE APP doing the work of creation and rebuilding if needed. The technology for error-correction files changes often; this keeps the core network simple and lets the user choose which method to use.

2 Likes

Yes, putting parity data creation or validation network-side is probably not a good idea; it should be client-side work only. I think I’d prefer to see it in the base SAFE client code though, so no one will be unaware of the possibility.

What does that gain you in the first place?

Nobody does that. If you want to fix something, ask SAFE to enable a --paranoid mode or something, which makes 2x or 3x the number of chunk replicas vs. the default.

Why??
The network does integrity checking for you (if it doesn’t, a bug report should be submitted).
Imagine people writing shell scripts to check the validity of their own disks in an LVM volume.
That would be totally ludicrous.

One may want to check whether the file is there, not whether it’s corrupt. Even that is silly, though. If you don’t think your data can survive, you should use a different system, rather than spend all day verifying checksums of random chunks (GETs which cost SC, by the way).

1 Like

If it were done network-side it would be a network-wide feature by default, which would make the network more secure but also make vaults heavier to run.

The idea is not that people will do this manually, but that it’s a simple button/slider/setting integrated in the client software, with a basic explanation on a mouse-over for example. Also, additional copies of the entire file are horribly inefficient compared to additional parity chunks.

1 Like

It can’t be done on the network side because the network doesn’t know the location of the data map.

Do you mean in the client? Doing it network-side would mean creating a new storage mechanism in which the network itself remembers the datamap.

I like @janitor’s idea of a “--paranoid” option, and having that as a client option. Personally I’d prefer it on a file-by-file basis and would only use it for certain files. That image I put up for people to laugh at, I’d not be concerned about.

Seeing as SAFE is going to be very close to never losing data, I’d say two extra chunks would make data loss for those files extremely rare.

1 Like

I don’t see the point in a paranoid option.

The network is designed to not lose data. If the design needs tweaking to ensure this is robust, that will happen.

Obviously no system can ever be 100% though, even with a paranoid option. We have “the thing we never thought of”, and “the thing we thought would never happen” to thank for that.

So as is standard practice, if you want to dramatically improve your data security, you need backups that are independent, held on separate systems, in separate locations, regularly tested etc.

@janitor has also said this, and I think it is the only sensible alternative really. Bumping up chunk redundancy is a very poor substitute IMO.

Maybe David will one day convince us that decentralisation is just as good, but it still has single points of failure at this point (protocols, code, compiler etc). One day that may change, but until then, I think I’ll be keeping backups of data I don’t ever want to lose.

2 Likes

That way it would impose a tremendous burden on all responsible farmers: total socialization of risk, with no payback to those who would be paying for it.

At least in the less terrible version, the data owner would have to GET the chunks (and pay for that).

What is the connection between the question (loss of all chunk replicas) and your “solution” (prevention of non-targeted data corruption)?

Spoken like a true uploader! Farmers already look likely to get screwed and you’re proposing yet another non-paying workload for them.

Well, those who read the post below may already be convinced that the level of protection is good enough.
Hilariously, that topic was started by Seneca, who was a capitalist back then (quote: “I can imagine rich/corporate users would be willing to pay for more redundancy”) - whereas now he’s arguing the cost of added protection for “the rich” should be paid by the farmers :slight_smile: - and David already addressed those questions by posting multiple comments.