Are Erasure Codes (Storj) better than Replication for the SAFE network?


#1

Storj Launches Version 3 of Its Decentralized Cloud Storage Platform

Of interest: Erasure codes

“Files are divided into segments, which are then divided into stripes. After, stripes are organized into erasure shares and uploaded.” Storj claims that erasure shares enable video streaming and buffering functionality similar to the YouTube experience, even if using 4K settings.

At a high level, erasure codes allow the receiver to recover all of the data from any sufficiently large subset of the encoded pieces. For example, erasure codes are used by satellites when they transmit data, because they assume some data will not reach its final destination.
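As a toy illustration of the idea (not Storj's actual scheme, which uses Reed-Solomon), a 2-of-3 XOR parity code shows how any 2 of 3 pieces can rebuild the original:

```python
# Minimal 2-of-3 erasure code using XOR parity. This is a toy stand-in
# for Reed-Solomon: any 2 of the 3 pieces suffice to rebuild the data.

def encode(data: bytes) -> list:
    half = (len(data) + 1) // 2
    a, b = data[:half], data[half:].ljust(half, b"\x00")
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]

def decode(pieces: dict, orig_len: int) -> bytes:
    # pieces maps index (0=a, 1=b, 2=parity) -> piece; any 2 of 3 work
    if 0 in pieces and 1 in pieces:
        a, b = pieces[0], pieces[1]
    elif 0 in pieces:                      # lost b: b = a XOR parity
        a = pieces[0]
        b = bytes(x ^ y for x, y in zip(a, pieces[2]))
    else:                                  # lost a: a = b XOR parity
        b = pieces[1]
        a = bytes(x ^ y for x, y in zip(b, pieces[2]))
    return (a + b)[:orig_len]

data = b"hello erasure codes"
pieces = encode(data)
# drop piece 0 entirely; the remaining two still recover the data
recovered = decode({1: pieces[1], 2: pieces[2]}, len(data))
assert recovered == data
```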

Storj’s decision to utilize erasure codes to deliver file resiliency is unlike most decentralized cloud storage providers, which choose replication to deliver reliability in case storage nodes fail.

“Purely using erasure codes for resiliency is much more efficient in terms of the required storage capacity and bandwidth used to meet service level agreements. We found that our new architecture is able to achieve AWS-level resiliency with an expansion factor of 2-3, meaning for every gigabyte of data stored, we use 2-3 gigabytes of storage capacity on the network. Systems that use replication to achieve the same resiliency would require 10-16 gigabytes of storage capacity per gigabyte of data stored. In our preliminary tests, compared to our previous network (which predominantly used replication), we have greatly improved file durability while cutting the expansion factor in half.”
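The quoted figures reduce to simple ratios: for a k-of-n erasure code each piece is 1/k of the chunk, so the expansion factor is n/k. The parameters below are illustrative, not Storj's actual settings:

```python
# Expansion factor = stored bytes / logical bytes. For a k-of-n erasure
# code each piece is 1/k of the chunk, so expansion = n / k.
# Parameters here are illustrative, not Storj's actual configuration.

def expansion(k: int, n: int) -> float:
    return n / k

# e.g. a 20-of-50 code lands in the "2-3" range quoted above:
assert expansion(20, 50) == 2.5

# plain replication with r copies has expansion = r, so the quoted
# "10-16 GB per GB" corresponds to 10-16 full replicas:
assert expansion(1, 16) == 16.0
```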


#2

That’s a bit of a stretch. It’s basically p-of-n, where you set the ratio you wish. So n can be 20 and p 10; then any 10 parts will give the file. It’s a debate between replication and erasure codes (error coding), striping, etc. Functionally they are similar; the debates are really about efficiency. So the number of replicants versus the p/n ratio is the argument of the past. I have always favored replication as it is simpler and faster. p/n ratios had a stronger argument when there were disc space efficiency issues, as with RAID 5 and so on. I feel this is confused in distributed computing though, where the number of stripes can be as many as we wish (replicants). As I say though, it can be argued for a very long time.

Talk of AWS resiliency is not really about erasure codes, and neither are data transmission losses (that is what reliable protocols are for). These are kinda misleading I think. It is about the number of copies you need to get (however they are encoded) to retrieve a file, plus taking into account the probability that the machines they are stored on will be able to give those chunks. It really is that simple. AWS resilience will be matched by many projects after launch: when their software is proven over time to be resilient, then the underlying algorithms can be considered resilient, but not before.

10-16 GB would mean they set replication somewhere from 10 to 16? So which was it?
2-3 is also unclear: does it mean 2 or 3? With erasure coding that number should be known up front; otherwise n and p have not been selected.

So it seems strange to me to have ranges of numbers like that. However, if replication was 16, it meant the network needs 16 copies so that some may vanish (up to 15) and it still has the data (not at AWS resilience, but beyond that to 100%). If 15 machines could disappear and you have an erasure code where you need 2 or 3 parts, then those parts could have existed on those 15 lost machines. So this all makes little sense to me. If you only need 2 copies of 3, say, you could store the whole file on 2 machines with the exact same resilience. This is a confusing way to frame erasure codes in a decentralised network of nodes.

So erasure codes can be used, I agree (both these and replication do work), but these figures do not make sense.


#3

Hey David! This got cross-posted to our subreddit. I replied there: https://www.reddit.com/r/storj/comments/9snqix/storj_v3_whitepaper_has_just_been_published_here/e8tfo0o/


#4

Excellent.

I cannot understand the erasure code play here to be honest. I will try and elaborate.

So we have, say, 3 chunks: A, B & C. Consider then an erasure code where 2 of 3 is OK. Now set aside the size issues in getting the data (replicated chunks will be the most efficient) and the size of total data stored for X resilience (where X is the % you can lose), where erasure codes look to be more efficient (but I will explain below why they may not be).

So in a decentralised network, you need to know that when the network loses a chunk (of any kind), it has Y time to create a copy of that chunk. If replicated, that is easy, as there is more than 1 copy of each chunk (we have that set to 8 right now, but 4 is probably way beyond what is required).

However, if you have an erasure code system, A, B & C are all different chunks. So if you lose any, you would need to make a copy. If you lose A, for example, the network would need to make a copy of A; to do that, nodes would need to get copies of B & C to recreate A (thanks @draw). This is a lot of work, and if B or C also goes down then we are screwed. So erasure codes used like this mean that instead of having 4 exact copies, we may only have 3 different copies, where 2 can recreate the file. To increase resilience and make it easier to recover a single chunk without getting p chunks of the n total, it would make sense to replicate each of the n parts. Then it becomes a question of how much to replicate each part, or whether the network, when a part goes missing, should create the swarm of messages to all the holders of the n parts to restore the chunk.

This part could be very costly IMO. So the balance of using the replicant number of chunks versus erasure codes, where space efficiency is definitely improved, is upset by the network's ability to maintain the chunks it should. I hope I am making sense, but it is an interesting aspect to consider IMO.

It has been debated a few times, but it is interesting to see why this seems better than replicant copies, and you guys seem to have put in some time. So I am interested to hear differing views.


#5

To make sure I understand, I assume you mean recreate A instead of C in the quote above?
I’m not a RAID specialist, but that is something like RAID 5.
I see the problems you describe, certainly if n and p are rather big numbers.


#6

I agree it sounds bad with few blocks per chunk, but what if a chunk is stored in a way that 4 out of 12 (or 16) blocks are necessary?

Each block is 1/4th the size of the original chunk, and the total storage cost is 3x (4x) the size of the chunk. If a block is lost, any 4 of the remaining 11 (or 15) will be enough to rebuild it. It’s still the same amount of data that’s being transferred, just in 4 x 1/4 chunk sized blocks. It does need some more organizing and it takes more computation, but it gives unbelievably higher redundancy than simple replication (anybody want to calculate it?)
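The calculation invited above is a straightforward binomial sum. Assuming each block sits on an independent node that is up with probability p (p = 0.9 is an arbitrary illustrative figure, not a measured one), a 4-of-12 code and 3x replication have the same expansion factor of 3 but very different durability:

```python
# Durability of a k-of-n erasure code vs r-way replication, assuming
# each piece lives on an independent node that is up with probability p.
from math import comb

def erasure_durability(k: int, n: int, p: float) -> float:
    # the chunk survives if at least k of the n pieces survive
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def replication_durability(r: int, p: float) -> float:
    # the chunk survives if at least 1 of the r copies survives
    return 1 - (1 - p)**r

p = 0.9  # illustrative per-node availability
ec = erasure_durability(4, 12, p)    # 4-of-12: expansion factor 3
rep = replication_durability(3, p)   # 3 replicas: expansion factor 3
print(f"4-of-12 erasure: {ec:.10f}")   # ~0.9999998
print(f"3x replication:  {rep:.10f}")  # ~0.999
```

Same storage cost, several orders of magnitude fewer expected losses for the erasure-coded chunk.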


#7

Ah yes, I will edit that. Yip, that is what RAID does, but all the bits are local to the machine, so it works really well there.


#8

Only if you know which chunks to ask for, which means a more complex data map. If it were a pure erasure code then you can ask for the missing chunks (do you wait? which is synchronous) or you have to ask for all of them, which means more bandwidth and message complexity.

There is also a tough call here as well: if a hacker sends you a scrambled chunk, then when rebuilding it is polynomially hard to figure out which chunk is bad (this can be resolved by using name == hash of chunk).

It does not really; it's a matter of selecting the number of replicants to match the n and p parameters. [edit: to be clear, it's higher space efficiency I think you mean, and that is not a debate I would have with you :slight_smile: as you would win, but it's purely space efficiency, not redundancy. At least in a p2p network]

This is what needs to be calculated: space is cheap and getting cheaper. The computation and message complexity of restoring a missing chunk will not scale downwards in the way storage costs do.


#9

So we have say 3 chunks, A B & C. Consider then erasure code where 2 of 3 is OK. Now forgetting the size issues on getting the data (replicated chunks will be the most efficient) or the size of total data stored for X resilience (where X is % you can lose) where erasure codes look to be more efficient (but I will explain below, why they may not be).

Okay, just to make sure I understand what you’re saying, let’s say you have some chunk you’d like to store, X. You have two redundancy schemes, replication and erasure codes. In the replication scenario, R_r(X) = (A, B, C) where A = B = C = X. In the erasure code scenario, R_ec(X) = (A, B, C) where A, B, and C are all unique. In the replication scenario, you only need one of A, B, or C (i.e., one copy of X) to recover X, and in the erasure code scenario, you need two of the three of A, B, and C.

I’m puzzled that you say replicated chunks are the most efficient for getting the data. In the replication scenario, A, B, C, and X are all the same size. In the erasure code scenario, A, B, and C are each 1/2 the size of X. When doing a retrieval, you only need 2 of the erasure code copies, which is still overall the same number of bytes of data transfer as replication, but you get the benefit of being able to recover it from two sources in parallel.

In terms of resilience, or durability, in this 1/3 or 2/3 scenario, yes, replication is more efficient, but there’s a tradeoff. The total amount of data in the replication scenario is 3 times, or what we’ve been calling an expansion factor of 3. In the erasure code scenario, the total amount of data is 1.5 times, or half as much. Erasure coding is twice as efficient in terms of bandwidth and disk space for put operations. If we were going to use the same amount of bandwidth and disk space, erasure coding once again would win in terms of resiliency.

So in a decentralised network, you need to know that when the network loses a chunk (of any kind) then it has Y time to create a copy of that chunk. If replicated then that is easy as there are more than 1 copies of each chunk (we have that set to 8 right now, but 4 is probably way beyond what is required).

Sure, in an erasure code scenario, let’s say you have 50 total pieces, and you need 20 of them. You don’t really need to repair until you drop to a certain threshold. Maybe 30? When the redundancy drops to 30, you can download the chunk once (1x the size of X (20 * X/20)), then store the missing 20 pieces (in this case, each piece is 1/20th the size of X, so again, 1x the total bandwidth transfer). We’ve recovered significant redundancy with very little bandwidth transfer, because 20 of the repairs were batched.
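Normalizing the chunk size to 1, the bandwidth arithmetic in that 20-of-50 example works out as follows:

```python
# Repair arithmetic for the 20-of-50 example above, with repair
# triggered when redundancy drops to the threshold of 30 pieces.
chunk = 1.0                  # logical chunk size, normalized
k, n, threshold = 20, 50, 30

piece = chunk / k            # each piece is 1/20 of the chunk
lost = n - threshold         # 20 pieces must be regenerated

download = k * piece         # fetch any 20 pieces: about one full chunk
upload = lost * piece        # re-store the 20 lost pieces: about one full chunk
per_piece = (download + upload) / lost   # ~0.1 chunk per repaired piece
# vs replication, which spends a full 1.0 chunk per single lost copy
```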

However, if you have an erasure code system A B & C are all different chunks. So if you lose any you would need to make a copy. So if you lose A for example then the network would require to make a copy of A, to do that nodes would need to get copies of B & C to recreate A (thanks @draw ) .

Not right away. This is especially true if you’re using a 30/80 erasure code.

This is a lot of work and if B or C also goes down then we are screwed. So erasure codes used like this means that instead of having 4 exact copies then we may only have 3 different copies, where 2 can recreate the file.

Yeah, erasure codes where you do 2 out of 3 are pitifully low numbers. This 2/3 example would never work, I agree.

This part could be very costly IMO. So the balance of using the replicant number of chunks versus erasure codes where space efficiency is definitely improved is upset by the networks ability to maintain the chunks it should. I hope I am making sense, but it is an interesting aspect to consider IMO.

The primary reason for us to use erasure codes is not disk space efficiency, but overall bandwidth efficiency. Erasure codes where you repair immediately if any piece is lost would be bad, obviously, so we don’t do that. We only repair when we get to a repair threshold, where we start to consider that hey, you know, maybe the data is at risk. We can still tolerate 10 losses or something, but that’s too close for comfort, so we’ll repair the missing 30.

Overall, this means that we spend significantly less bandwidth. With replication, you have to spend 1x the overall size of the chunk every time you lose a piece.


#10

You may be interested in this paper: http://www.eecs.harvard.edu/~margo/cs261/papers-a1/blake.pdf — specifically, figure 3. Bandwidth is far and away the limiting factor in systems such as ours. This is especially true with respect to bandwidth caps. Comcast limits bandwidth to 1TB/month, which really means you can’t use more than 385KB/s all month long. There’s not a lot of available bandwidth, and with high network churn, replication will kill you.
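The 385KB/s figure checks out if you spread a 1 TB cap evenly over a month:

```python
# Sanity-check of the Comcast figure: a 1 TB/month cap spread evenly
# over a 30-day month.
cap = 1e12                   # 1 TB in bytes (decimal)
seconds = 30 * 24 * 3600     # seconds in a 30-day month
rate_kb = cap / seconds / 1e3
print(round(rate_kb))        # ≈ 386 KB/s sustained, all month long
```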

It’s worth writing a simulation or toy model. Batched repair with erasure coding is really the only way this works.
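A minimal toy model along those lines might look like the sketch below. All parameters are assumed for illustration: a per-epoch node failure probability q, the 16x replication figure quoted from the whitepaper discussion as a roughly durability-equivalent replication level, and a 20-of-50 code repaired in batches at a threshold of 30:

```python
# Toy model of repair bandwidth under churn. Assumptions (illustrative):
# each epoch, every piece-holder fails independently with probability q,
# and lost pieces are replaced onto fresh nodes. Bandwidth is counted in
# units of one chunk.
import random

random.seed(1)
q, epochs = 0.05, 1000

def replication_bw(r: int) -> float:
    # each lost copy costs 1 full chunk to re-replicate immediately
    bw = 0.0
    for _ in range(epochs):
        lost = sum(random.random() < q for _ in range(r))
        bw += lost * 1.0
    return bw

def erasure_bw(k: int, n: int, threshold: int) -> float:
    # batched repair: only when survivors drop below the threshold do we
    # download k pieces (1 chunk) and upload the missing ones (1/k each)
    bw, alive = 0.0, n
    for _ in range(epochs):
        alive -= sum(random.random() < q for _ in range(alive))
        if alive < threshold:
            bw += 1.0 + (n - alive) / k
            alive = n
    return bw

# 16x replication vs a 20-of-50 code (roughly comparable durability,
# per the expansion figures quoted in the whitepaper discussion above)
print("16x replication :", replication_bw(16))     # roughly 800 chunk-units
print("20-of-50 batched:", erasure_bw(20, 50, 30)) # roughly 200 chunk-units
```

The exact numbers depend entirely on the assumed churn rate, but the batching effect (many pieces repaired per 2-chunk repair event) is what drives the gap.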


#11

Each block is 1/4th the size of the original chunk, and the total storage cost is 3x (4x) the size of the chunk. If a block is lost, any 4 of the remaining 11 (or 15) will be enough to rebuild it. It’s still the same amount of data that’s being transferred, just in 4 x 1/4 chunk sized blocks. It does need some more organizing and it takes more computation, but it gives unbelievably higher redundancy than simple replication (anybody want to calculate it?)

Yes! Section 3.4 in our paper has that exact calculation: https://storj.io/storjv3.pdf


#12

Nature favors the SAFE use of replication in addition to error codes. DNA repair mechanisms offer a pretty good example of using parity bits to correct corrupted chunks at the nanoscale. And this operates within every replicated cell. So maybe it’s not a question of replication vs error correction, but of the simplest way to get both?

However, those users that care to have both can just upload their Parchive files to SAFE… Quite a hassle though, I won’t. Much easier to upload a file twice to get 16 copies and sleep easy :wink:

*To be fair, maybe someone will eventually write an app that combines Parchive and SAFE-drive client side in order to make error codes easy to use.


#13

@jtolds Are you sure about that? Lose a redundant chunk, make another copy. Who cares how big the file is?


#14

As long as those par blobs are for a single chunk. Ever try to fix a 20GB file that had par sections? The time taken to repair data is not linearly proportional to the size of the original data section, but some sort of exponentially increasing function.

There were some discussions in the past about self encryption having par ?slices? within the chunk in order to repair the chunk if some transmission error occurred. But then transmission errors are actually quite rare, since faulty packets are detected and resent.

Replication is perhaps overkill, but it does solve some other issues that in a global system start to outweigh the space question. I am sure both Storj’s method and SAFE’s method will work. But the question is how well each will compare overall. In SAFE’s design, Storj’s method would require some redesign and would make delivery of chunks quite a bit more complex, for what savings?

Storj’s design, if I am not mistaken, relies on the storage nodes being continuously active with small downtimes. Maybe up for weeks at a time. But SAFE does not have this requirement, and could actually work with nodes with much higher downtime, e.g. some nodes up for days at a time and some a day or less, so that running a node includes a wider cross-section of the world’s population.


#15

That’s an obvious mistake. Only the chunk needs the extra copy.

There are a number of mistakes (on both sides), and hopefully they don’t make the discussion too murky.


#16

Okay, my mistake; I’m unclear on SAFE terminology in terms of chunks versus files.

I’ve edited my post to clear up when I mean files and chunks.


#17

Storj’s design if I am not mistaken relies on the storage nodes to be continuously active with small downtimes. Maybe up for weeks at a time.

Months.

But SAFE does not have this requirement, and could actually work with nodes with much higher downtime.

There simply isn’t the bandwidth available for this much churn. There’s a tight relationship between repair bandwidth, churn, redundancy strategy, and data access. See http://www.eecs.harvard.edu/~margo/cs261/papers-a1/blake.pdf


#18

Is that based on American bandwidth allocations by ISPs? Many parts of the world don’t have such restrictions. Australia, for one, has seen allowed (acceptable) quotas grow some 1000-fold in a decade, up to quota-unlimited plans. And Australia is restrictive compared to other parts of the world too.


#19

Yeah. Situations like Australia’s certainly help a lot, but there’s a very non-linear relationship between churn and repair bandwidth anyway. The difference in repair bandwidth usage between nodes that are online 95% of the time and nodes online 96% of the time is much more than the difference between 96% and 97%. If nodes are offline a bunch, the amount of bandwidth required for maintenance will be astronomical. Over a wide-area network, there’s simply only so much you can do.
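One simple (assumed) model that produces exactly this non-linearity: declare a piece lost, and repair it, only after it misses g consecutive availability checks. The repair-trigger rate then scales like (1 - a)^g, which is convex in the offline fraction:

```python
# Assumed model: a piece is repaired after missing g consecutive
# availability checks, so the rate of (possibly spurious) repairs
# scales like (1 - a)**g for per-check availability a.
g = 3  # grace period in checks (illustrative value)

def repair_rate(a: float) -> float:
    return (1 - a) ** g

d_95_96 = repair_rate(0.95) - repair_rate(0.96)
d_96_97 = repair_rate(0.96) - repair_rate(0.97)
# each extra point of availability saves less repair traffic than the last
assert d_95_96 > d_96_97
print(d_95_96, d_96_97)
```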


#20

Yes, bandwidth is much higher, and it’ll be interesting to see how Storj will go with the method you are using.

For SAFE I expect that issues beyond just the argument between n-of-m and replication need to be considered, so they have to be weighed in the mix of things.

Your mention of a non-linear comparison does not make replication worse; rather, as the size of the data sections being protected increases, so does the overhead. SAFE, by contrast, is very linear. One for one, if you will.

Also, churn is not necessarily happening constantly, since there will be cases where chunks have more copies than required, so a node going offline will not need all its chunks copied to new nodes. For instance, when a node that went offline for a period returns, in many cases its store will be validated and its chunks will add to the total number of copies of those chunks. This can actually make the cost non-linear, but below the linear line.

Also, with your nodes staying up for months, you are targeting a decidedly different audience for your nodes.