Preventing data loss using parity checks

Self encryption breaks a file up into chunks before being uploaded. The entire file becomes corrupt if one of those chunks goes missing. I’m interested in techniques to recover from this sort of issue.

Let’s put aside the existing high security of chunks via redundancy and data chains, and consider how users themselves may further reduce the chance of corruption or loss of their files on the network.

The most promising technique I found is Parchive which creates .par files used to recover missing data (originally designed for usenet). From the par2 man page:

If test.mpg is an 800 MB file, then par2 will create a total of 8 PAR2 files.

test.mpg.par2 - This is an index file for verification only
test.mpg.vol00+01.par2 - Recovery file with 1 recovery
test.mpg.vol01+02.par2 - Recovery file with 2 recovery blocks
test.mpg.vol03+04.par2 - Recovery file with 4 recovery
test.mpg.vol07+08.par2 - Recovery file with 8 recovery blocks
test.mpg.vol15+16.par2 - Recovery file with 16 recovery blocks
test.mpg.vol31+32.par2 - Recovery file with 32 recovery blocks
test.mpg.vol63+37.par2 - Recovery file with 37 recovery blocks

The test.mpg.par2 file is 39 KB in size and the other files vary in size from 443 KB to 15 MB.

These par2 files will enable the recovery of up to 100 errors totalling 40 MB of lost or damaged data from the original test.mpg file

This seems quite efficient, and the degree of recovery can be tailored (eg instead of 40 MB recoverable it could be 80 MB, and instead of 100 errors it could be more). If you only need a little bit of recovery you only need to down load a small par file. It’s really quite elegant.

It could easily be included by users as a second upload beside the original file, since it simply generates .par files which can also be uploaded to the safe network like any other file.

There are some other options which are quite interesting.

Shamir Secret Sharing Scheme which requires M parts of N total parts to be able to recreate the file. But this is much less efficient than most parity-style systems.

RAID also implements parity checks, but is at the filesystem level rather than the file level.

Reed-Solomon codes as used on CDs which is very efficient (1 parity bit for 3 data bits)

There’s a wealth of information on Error Correction and Forward Error Correction on wikipedia

In summary, I think there’s no need to add compulsory parity checks at any layer of the network, but it’s definitely something of interest as a second layer of security for files that users may opt into if it suits their needs.

Do you have any other thoughts on users adding extra layers of security to their data?

15 Likes

Its been a while since we discussed this. Must be over a year now.

This would be an easy way to ensure all parts of the files exist for critical files, even if large outages are current.

The biggest issue I see is that we don’t want this at the core level and as an optional feature to be used on selected files. Also the memory usage problem, big files require a lot of CPU/time (increases exponentially as file size increases) to check let alone correct errors. This makes it not suitable for media files since it might require the whole file worth of chunks downloaded before starting to play. Useless for movie watching on most phones.

One solution was suggested that the pars is on the chunk, so the chunk is split into 8 sections and have 6 data parts and 2 par parts. This way the checking is on each chunk and no need for downloading enough chunks to recreate the whole file (maybe GBs) Doing it on a per chunk basis also means that if data transmission error occurs then the chunk doesn’t necessarily need to be downloaded again.

And it would be suitable for media files since its done on a chunk at a time. PAR checking of 1MB is very quick.

So each part is 128KB and there is 6x128KB of data stored in the chunk and 2x128KB par parts

Again even on a chunk basis I think it needs to be optional and really a layer above the SAFE code.

For extra reading on this
EDIT: one discussion that progresses into PAR "files"

Another one

Still cannot find the topic specifically discussing this.

EDIT: thanks to @mav who found the topic

8 Likes

Is it this one?

1 Like

Nope, it was one where the opening post was about another project which was implementing M + N storing. They use the (semi) official name for the technique. There was quite a lot of discussion in it.

2 Likes

I think this may be it, about the sia network

the chunks of the Mojette transform have an interesting property, they are correlated and you only need n out m chunks of the Mojette Tranform file the reconstruct the original file.

2 Likes

@mav, yes that is the one. Thank you

What do you think of the idea to par each chunk (6 parts data + 2 parts par)?

While it doesn’t help with say a large outage that makes all the copies of some chunks unavailable till the outage is over, it does help with the more common issue of packets in error. So then the need to retransmit packets or even chunks will be reduced greatly.

2 Likes

I think par the full original file rather than each chunk. This gives the most flexibility in the size of the par files generated (and thus how much it costs the user to save the redundancy to the network).

I reckon the most sensible implementation is to let the user decide via the interface of the app.

Have a checkbox to ‘Enable additional redundancy using parity checks’. If checked, the user can use a slider to select how many MB of par files to create, and is shown both the additional cost of uploading the par files, and the degree of improvement to the durability of the file.

The user can then decide if the value of the redundancy is worth the cost of it.

Setting sensible defaults will go a long way; I’d say 10 MB would probably be enough. I can’t imagine more than 10 chunks of a file going missing at the same time.

It will be interesting to see whether chunks ever go missing on the live network, since that’s the only thing parity checks protect against. I’m guessing parity checks will probably turn out to be completely unnecessary.

1 Like

The issue I see with this is that its the one who uses the file that has no say. You don’t have just one user but all those who want to view the file. In effect one user decides for all who use the file, be they mobile phone or power users.

This is likely true but if you have a 1000 chunk file and allow 10, this is 1 in a hundred. So if you have 10000 then the chances are 10 times that it may not be enough. I still think that the number of pars should relate to the size of the file, and you only download as many pars as necessary. So if one chunk is lost you are downloading one par. Maybe 2 minimum and 1 par chunk per 10-50 chunks

It most certainly will be interesting.


The big problem I have with file wide pars is the time/resources needed for large files.

The increase in processing time to repair a 100MB file compared to a 10MB file is not 10 times longer but more like 25 times (or much longer). And the memory usage is also greater. But for 100MB to 1000MB it is more like 50-75 times, and not possible on smaller devices.

If you want to recover lost chunks then instead of file wide pars. I would do multiple chunk pars.

That is you take a number of chunks (generating user selectable) but relatively small like 3-6 chunks and generate 1 or 2 par chunks. The advantage is that you can still recover lost chunks, but the compromise is that sequential lost chunks would still fail the recovery. But lost chunks would be random so less a problem for SAFE than for a faulty disk drive or hiccup in uploading a file to NNTP (usenet), both of which is more likely sequential errors.

Remember that pars would only be need to be downloaded if required, so generating more pars only costs “PUTS” for the uploader and time. For the downloader it only costs time if a par is needed.

Obviously this is at the APP level

4 Likes