What is the probability that a file uploaded could be lost?

Just heard about this recently; very exciting stuff. One of the things I’m most excited about is the potential to use the network for long-term data storage, a problem currently in search of a solution.

So, as I understand it, this system splits the data into “chunks” and stores them across the network. It tries to maintain at least four active copies of each chunk, and makes new ones to replace those stored on machines that go offline.

So, this means that a chunk will be lost if all four machines happen to go offline before the system can make more copies. This is probably unlikely, but not impossible. And, correct me if I’m wrong, if one chunk is lost, the entire file is lost?

Have there been any calculations or simulations to determine the probability of this happening?


I think you might find this topic of particular interest; it seems to have been made for you.

And the network now attempts to maintain 6 copies.


The post @neo points to is pretty in depth.

Calculations would first require knowing how many vaults were running, and then the density of vaults in a geographic region. Losing a chunk would take a catastrophic internet outage in a region that happened to hold all four copies of it. Those four copies being in the same region is unlikely if the distribution of vaults is worldwide. The likelihood of a catastrophic event in one region: unlikely. A global catastrophic event: unlikely, and if that happens we probably don’t care about data after that. My .02 anyway.

Thanks for the answers. It seems like a really good thing. We’ve pretty much accepted that much of the digital work of our era will be lost to history, but this could be a chance to preserve it.

I wasn’t picturing a catastrophic event, though. Just 4 (or 6) people who happen to turn off their computers at the same time. I don’t know how close together those shutdowns would have to be to lose the data.
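To make the question a bit more concrete, here is a toy back-of-the-envelope model of exactly that scenario. It is definitely not how the network actually works: it just assumes each machine holding a copy goes offline independently with some probability during the window before the network re-replicates, and that a file is lost if any one of its chunks is lost. All the numbers are made up for illustration.

```python
# Toy model of losing a chunk to everyday churn: each of the N machines
# holding a copy goes offline, independently, with probability p_offline
# during one re-replication window. All numbers here are invented.
def p_chunk_lost(replicas: int = 4, p_offline: float = 0.01) -> float:
    return p_offline ** replicas

# And (if I understand correctly) a file is lost if any one chunk is lost.
def p_file_lost(chunks: int, replicas: int = 4, p_offline: float = 0.01) -> float:
    p = p_chunk_lost(replicas, p_offline)
    return 1 - (1 - p) ** chunks

print(p_chunk_lost(replicas=4))   # 1e-08 per window with these made-up numbers
print(p_chunk_lost(replicas=6))   # 1e-12 per window with 6 copies
print(p_file_lost(chunks=1000))   # ~1e-05 per window for a 1000-chunk file
```

Even this crude model suggests the per-file risk scales with the number of chunks, which is why I’m curious whether anyone has done a proper calculation with realistic churn numbers.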

But since we’re talking about catastrophes, what if the available storage space goes down? I know we probably expect it to keep growing, with advances in computer technology and all, but it seems like if it doesn’t, it would be pretty catastrophic.

As the available space drops, the rewards for farming rise, encouraging more people to farm and to increase their vault sizes. Considering this is worldwide, it is reasonable to expect this to happen within a reasonable time frame. There can be a 50% drop in space after the price rises before the network can be considered nominally full. At that point farmers are being paid the big MAIDs.

But even at that point the space can drop further. I doubt it will reach that point, because farmers would then be getting a SAFEcoin for every chunk they retrieve for the network. Very high incentive to farm more.

Then there are what are called archive nodes in the planning, which will hold rarely retrieved chunks and likely a copy of other chunks as well. The details are not set, and the dev/planning team have not revealed them yet.

So data loss mitigation is of great importance to the development team, and it seems they have plans to help prevent it.

Yeah, I realize it’s not likely, but I was thinking something seriously disruptive, like a world war, that damages the internet infrastructure over a large area. I’m thinking serious long term, not within our lifetimes.

Could you clarify this statement? I thought the total supply of SafeCoins was fixed.

That’s a real big subject, and if one is used to Bitcoin then it’s like the 4th dimension of crypto. SAFEcoin is used to pay for storage. The coin is recycled and available for reissuance.

While the total possible number of coins is 2^32 (approx 4 billion), the amount of coin issued over the years will be many times that 4 billion, because of recycling when people pay to upload data.
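To illustrate the recycling idea with a toy model (my own sketch, not the actual SAFEcoin algorithm, and all the yearly figures are invented): the circulating supply can never exceed the 2^32 cap, but the cumulative amount ever issued to farmers keeps growing, because coins spent on uploads return to the network for reissue.

```python
CAP = 2 ** 32          # total possible SAFEcoins (approx 4.29 billion)

circulating = 0        # coins currently in circulation
ever_issued = 0        # running total of every coin ever paid to farmers

# Entirely made-up yearly figures, just to show the mechanism.
for year in range(50):
    headroom = CAP - circulating          # coins available to be (re)issued
    farmed = min(200_000_000, headroom)   # paid out to farmers this year
    circulating += farmed
    ever_issued += farmed

    recycled = int(circulating * 0.05)    # spent on uploads, returned to the network
    circulating -= recycled

print(circulating <= CAP)   # True: circulation never exceeds the cap
print(ever_issued > CAP)    # True: total issuance over time exceeds the cap
```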

Farmers are rewarded according to farming rate (FR).

FR is calculated in real time according to the amount of available space in the network.

As the available space drops, FR rises, which results in farmers being paid more often.

So while there is plenty of available space, farmers might be rewarded at a rate of 1 coin per 10 thousand chunks retrieved (GETs).

And when the space drops, they might get something like 1 coin per 5000 GETs.

Then as the available space (compared to used space) drops further, the rate increases.

The current algorithm will reward farmers at the rate of 1 coin per GET when all the available space is used.
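As a rough illustration of that reward curve (my own sketch of the figures quoted above, not the network’s actual algorithm), here is a toy function that interpolates between roughly 1 coin per 10,000 GETs when there is plenty of spare space and 1 coin per GET when all the available space is used.

```python
def gets_per_coin(available: int, total: int) -> float:
    """Toy farming-rate curve: expected GETs served per coin earned.

    Interpolates the example figures above: ~1 coin per 10,000 GETs with
    lots of spare space, ~1 per 5,000 as it drops, and 1 coin per GET
    when all available space is used. Purely an illustration.
    """
    spare_ratio = available / total            # 1.0 = empty network, 0.0 = full
    return 1 + (10_000 - 1) * spare_ratio

print(gets_per_coin(available=100, total=100))  # ~10000: plenty of space
print(gets_per_coin(available=50, total=100))   # ~5000: space is tightening
print(gets_per_coin(available=0, total=100))    # 1: every GET earns a coin
```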

After that the average number of copies will drop, but the network is still safe and data secure. And the farmers are rewarded one coin per GET. I doubt many capable users will resist those payment rates. That is why I doubt it would actually get to that point and I am confident that people will increase farming/vaults to take advantage of the great rates.

It is expected that the farming rate will settle at a point where people are happy. If the rate is not enough, then some will stop farming and the rate rises to a point where enough are farming to keep it about there.

As to worldwide events killing off SAFE storage, I doubt data storage would be our biggest problem then. Also, I imagine SAFE will not be the only store for critical data, which is just good backup/storage management.

Also, there are a number of advancements coming out that allow massive backup storage in small devices (e.g. optical storage), and if the archive nodes used them we could likely have an archive store of every chunk. If that were done, then even a worldwide event would not kill the data, but just require the network to come back up.


I’ve thought about this and concluded that the SAFE network is mainly useful for making data available and secondarily for backup, and that in either case it represents redundancy of off-line data.

In other words, I plan to keep off-line backups, suitably encrypted.

If there is a dramatic reduction in storage capacity, entities which PUT data onto the network need to be able to set and allocate funds for a max payment. At that point, as supply is being reduced, data which no longer has funds, or is unwilling to fund its existence in the “new reality”, is cut out.

I think one important question is (and I think the developers have already addressed it somewhere) how the non-permanent nature of vaults relates to data loss. When vaults were still permanent, it was not a big problem that a vault could go offline for a while and then come online again (power outage, reboot of the computer, reboot of the DSL modem, etc.). But since vaults now have to start from zero (at least with regard to reputation) once they have been offline (for how long?), the question was whether the chunks they still have stored can be used to restore data that would otherwise have been lost. I think I remember that the answer was yes; maybe somebody can point us to the thread where this was discussed (I can’t find it…).

The ability to republish data is the key issue to solve for worldwide network outages, etc. This is an interesting issue, but we are 100% focussed on launch right now. So this is “archive nodes”, and you will find conversations about this on the forum.

A node with data, for instance, is not trusted. There is no immediate way to republish this info. There are easy ways around it, but there are very likely more efficient ways. A simple way is to record network “difficulty”, which is the distance between nodes. If this is reduced, then quorum reduces to match it, all the way to zero. It’s not currently through RFC so we won’t answer it before beta (I think). This is similar to Bitcoin difficulty, and perhaps what would happen in a worldwide outage, with similar issues to resolve, like which chains we believe; the longest may be first on the network, etc.

So archive nodes and outages will share at least parts of this mechanism.


I have to admit that I am not familiar with the exact workings of the network, but there is something I don’t understand: I imagine that the hashes of all chunks have to be stored somewhere on the network, otherwise how can the proof-of-resources work? And if the hashes are known to e.g. the close group (or whomever), why can’t the chunks stored by a “dead” vault, which has lost its reputation, be used to restore otherwise lost data? I mean just in an emergency? I understand that vaults have to be punished when they don’t behave as expected, but I don’t understand why their data cannot be trusted when the hashes of the chunks they deliver have the correct value… Sorry for my limited understanding of the network, again :wink:

Here’s my understanding: hashes aren’t stored centrally, but in the datamap which the requester uses to access her files. The hash is also the address where the chunk is stored, so to get the file (or rather each chunk of the file) you request the chunks according to their hash/address.

Only devices that can supply a chunk that matches the hash can get the reward for serving the chunk, which of course proves they have the chunk.
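To make the content-addressing idea concrete, here is a minimal sketch (I’m assuming SHA-256 purely for illustration; this is not the network’s actual chunk format or hash choice): the chunk’s address is derived from its own bytes, so a vault proves it holds the chunk simply by serving data that hashes to the requested address.

```python
import hashlib

def chunk_address(chunk: bytes) -> str:
    # Content addressing: the address is just the hash of the chunk's bytes.
    return hashlib.sha256(chunk).hexdigest()

def verify_served_chunk(requested_address: str, served: bytes) -> bool:
    # A vault "proves" it holds the chunk by serving bytes whose hash
    # matches the address that was requested.
    return chunk_address(served) == requested_address

chunk = b"some self-encrypted chunk of a file"
addr = chunk_address(chunk)

print(verify_served_chunk(addr, chunk))        # True: the data matches its address
print(verify_served_chunk(addr, b"tampered"))  # False: wrong or corrupted data
```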

If all vaults holding the chunk go offline, the chunk is not necessarily lost, provided any of those vaults come back at some point. But restoring those chunks would, I think, need some special recovery logic, and it isn’t clear if this is planned or will indeed be worthwhile. It depends on what the actual likelihood of this happening will be.


I think we agree on this: this recovery logic would be a nice-to-have, and experience will show if it is really necessary. I guess that a lot of redundant copies (6 nowadays?) will do a good job, but the SAFEr the better (pun intended :wink: ). The bigger the amount of data, the higher the chance that a loss happens in singular cases, and it would be nice for the reputation of the network if we could say that it “has never lost a single byte”, right?


A lot of this can be figured out once network vaults hit end users (very soon, based on the last update) and we can try to measure how fast churn happens across the network. With 6 copies spread throughout the world, the chances of all 6 copies being in one country (the catastrophe scenario) or even one continent are pretty slim.

I am not even concerned with catastrophic events, just with everyday churn. I guess it is difficult to predict… I just imagine all the people with slow upload, and the network being very busy delivering regular GET requests, and I could imagine that in very, very rare cases (which we would like to avoid anyway) the network might fall behind on maintaining enough redundant copies, until the number drops to zero. But of course I hope we will soon be shown that the network is more stable than that!

I know that they’ve done some pretty intense churn tests on the testnet they currently have up and running. It’s not a closed network; it’s got little VPNs all over the world. I’m not sure how much they’ve tried to stress it, but I know they had constant churn for days at a time with no loss. Can’t comment on speed.

The reason I am thinking about this is that I would prefer there to be no “discrimination against unpopular data”. The “popcorn-time” data of the latest TV shows will all be in cache and maybe have even more than 6 regular copies (and I think this is a very nice use case), but I am also very much looking forward to the “dropbox” use case, and when you back up some “unpopular” files that are indeed very important to you, it would be a bad feeling to know that they are less safe than the latest TV show/porn movie… :stuck_out_tongue:

EDIT: of course, you could make your files artificially popular by regularly requesting them; it would just cost you some of your bandwidth.

I’ll look for the post, but this has come up. Eventually there will be vaults that hold data so long and so reliably that they will be “promoted” to archive nodes. What all that entails I don’t remember, but as far as vault ranks go, that will be way up there. Again, not sure what all vault ranks will be used for in the future either.

Found the post:

It’s a big one, but has some great info in it.


I think Storj has a solution for this.

Storj stores all the files on the farmers’ HDDs. When a farmer goes offline, his data remains on the HDD, and when he comes back online the network checks whether the file is still available by using the unique hash for the file stored on the HDD. If it doesn’t match the hash, the file is downloaded from another farmer.

Also, your data is stored in different geographic locations like Asia, Africa, Europe and other time zones, so that it will be available at all times.

If the number of online peers drops, then a new farmer will host your data, and you can set the number of farmers per your needs. If you store data on 20 farmers, then your files will be safe, since all 20 farmers are unlikely to go offline at once.

I think MaidSafe also has a similar architecture.