I have recently found out about this project and I find it very interesting.
What will happen to the data I put into the safenet if I decide to never delete it, if I stop using the service or if I die? Will the copies of my data be stored in the SAFE network forever or till the whole network is taken down? Or will it get deleted when I stop paying for storage space or when I stop contributing with storage space for the safenet? What will happen if my storage solution does not work for some time, or my safecoins runs out? Do I need to make back ups of all my data first?
The plan is you pay when you store data (“PUT”) and that’s it, so if you don’t delete it, the data remains indefinitely. If this model proves unsustainable, the network might need to adapt in some way, but for now that’s the plan.
Most PC hard drives are half full at best anyway. SAFE would only use the parts of your hard drive not already filled. As a result, orphaned data - rather than “forgotten” - would hang around until it gets pushed out by more recent orphaned data. The faster newly orphaned data comes in, the sooner existing orphaned data gets pushed out.
BTW your PC already works like this. When you delete a file, the data is usually still there; the space is just marked as free. Over time it’ll get overwritten with new stuff.
As a result, orphaned data - rather than “forgotten” - would hang around until it gets pushed out by more recent orphaned data.
Is this actually how MaidSafe is currently designed? Can you elaborate more?
Since farmers only get paid for PUT commands on their machine, it makes sense that they would want PUT requests more frequently and would get rid of old chunks on their servers, as old data makes them lose money. Won’t there be an incentive to clear their drive every month or so to make way for new incoming data?
David will probably reply to say I am totally wrong, but …
Remember you have absolutely no idea what the data stored by SAFE is on your machine - you cannot tell which is the most valuable nor which is orphaned (you CAN tell which is the most popular, but not what it actually is). You therefore have no way of discerning between different types of chunk - it’s all the same to you.
I should imagine that SAFE will hang onto orphaned data if it has nothing else to do with that storage, just the same way your filing system already does. The difference is that if orphaned data suddenly becomes deorphaned, it can be immediately marked as in use again without having to redownload that chunk. This is the big advantage of content addressable storage designs.
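To make that concrete, here is a toy Python sketch of the idea, assuming nothing about MaidSafe’s actual internals (the `ChunkStore` class and its method names are entirely invented for illustration). Chunks are keyed by content hash, so an orphaned chunk that is still on disk can be marked in-use again without any network transfer.

```python
import hashlib

class ChunkStore:
    """Toy content-addressed store: chunks are keyed by their SHA-512 hash."""

    def __init__(self):
        self.chunks = {}      # hash -> chunk bytes, still on disk
        self.in_use = set()   # hashes currently referenced by an owner

    def put(self, data: bytes) -> str:
        key = hashlib.sha512(data).hexdigest()
        self.chunks[key] = data
        self.in_use.add(key)
        return key

    def orphan(self, key: str):
        # Owner went away: the chunk stays on disk, just no longer marked in use.
        self.in_use.discard(key)

    def reclaim(self, key: str) -> bool:
        # Deorphaning: if the chunk is still lying around, flip it back to
        # in-use -- no redownload needed.
        if key in self.chunks:
            self.in_use.add(key)
            return True
        return False

store = ChunkStore()
key = store.put(b"some chunk")
store.orphan(key)            # data hangs around, unreferenced
reclaimed = store.reclaim(key)  # True: marked in use again, zero transfer
```

The point is only that content addressing makes “is this chunk already here?” a single hash lookup, which is what makes cheap deorphaning possible.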
Equally, if you drag the slider for how much of your free space SAFE can use down to minimum, SAFE will destroy all orphaned data. It may also penalise the storage of your data on other people’s machines, because with SAFE you are only guaranteed to get from others what you yourself give to others.
Gaming the System
Farmers get paid on GET requests not PUT. I think this is what you meant. Having more chunks PUT in their vault increases their chances of more Safecoin GET requests.
If a farmer can measure chunk request frequency, they would want to delete chunks that are not being requested as often. To do this, they make a daily/weekly/monthly log of individual chunk replies. Example below.
Vault Drive E (30 days)
chunk abc1 - requested 0 times
chunk abc2 - requested 1 time
chunk bce4 - requested 5 times
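A log like the one above would be trivial for a farmer to keep; here is a minimal sketch (names invented, not anything from the MaidSafe codebase) of the counting and the “gaming” move of picking the least-requested chunk:

```python
from collections import Counter

# Toy per-vault request log over some window (e.g. 30 days),
# seeded with the example counts above.
request_counts = Counter({"abc1": 0, "abc2": 1, "bce4": 5})

def record_get(chunk_id: str):
    # Called whenever the vault serves a GET for this chunk.
    request_counts[chunk_id] += 1

# The gaming move: find the chunk earning the least and consider deleting it.
least_requested = min(request_counts, key=request_counts.get)
print(least_requested)  # abc1
```

Nothing here requires knowing what the chunks contain; frequency alone is enough to rank them, which is why the incentive problem exists.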
Naturally, this farmer would want to delete chunk “abc1”. So how do we discourage that behavior? Here are a few ideas.

1. The MaidManagers can do a daily/weekly/monthly audit. This means a “fake” GET request for all chunks that are supposed to be stored in the vault. Failure to reply back with 100% of them will result in a rank downgrade. This is resource intensive, and not the most efficient way to handle it.

2. If the deleted chunk is ever requested, and the vault fails to reply, then the vault will be downgraded. I believe we plan to have this ranking system in place, but it does not go far enough.

3. We could also make vaults with higher ranks have a higher payout. If a vault is hosting many unpopular chunks, they will still be compensated because they are online more often and more reliable with less popular data. Exact details of how high the rank goes and what is required to obtain each level are still being formulated.
So now the farmer is faced with a hard decision: keep hosting unpopular data and keep their rank high, thus maintaining their income, or take a chance at deleting data, getting downranked, and losing income. Also, the vault will get removed from the Network if it continues to misbehave, which means they have to start over.
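The audit idea can be sketched in a few lines. This is a toy model under invented names (`Vault`, `audit`, the rank arithmetic are all assumptions for illustration, not the network’s actual design): a fake GET is issued for every chunk the vault is supposed to hold, and each miss costs rank.

```python
class Vault:
    def __init__(self, expected_chunks):
        self.expected = set(expected_chunks)  # chunks the network assigned here
        self.stored = set(expected_chunks)    # chunks actually still on disk
        self.rank = 10

    def cheat_delete(self, chunk_id: str):
        # The farmer quietly drops an unpopular chunk.
        self.stored.discard(chunk_id)

    def respond_to_get(self, chunk_id: str) -> bool:
        return chunk_id in self.stored

def audit(vault: Vault):
    # Idea 1: a "fake" GET for every chunk the vault should hold;
    # any chunk it can't produce costs a rank point.
    for chunk in vault.expected:
        if not vault.respond_to_get(chunk):
            vault.rank -= 1

vault = Vault({"abc1", "abc2", "bce4"})
vault.cheat_delete("abc1")
audit(vault)
print(vault.rank)  # 9 -- one missing chunk, one rank point lost
```

As the post says, doing this for every chunk is resource intensive; a real design would presumably sample chunks rather than sweep them all.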
Q: Can a vault start over with popular chunks already stored?
A: I don’t think so. The way the XOR Network operates is based on the hash ID, which initially determines where a chunk is stored. So the Network will not be looking for those chunks if your node (vault) address has changed. I could be wrong though. This would be something for MaidSafe to answer.
Q: How much more beneficial is it to have a higher rank in terms of collecting chunks?
A: If the Network has to choose between 2 vaults, it should prioritize the higher ranked vault over the lower one. This means higher ranked vaults get more chunks stored compared to lower ranked vaults.
Thanks for the detailed answer. I like option 3. I’m not sure how likely it is, but if data does, in fact, live forever on the network, it’s possible that the majority of data on your hard drive would never get requested. The higher payouts on the small amount of data that does get requested may not be enough.
Number 1 doesn’t make sense, since calculations can be done differently. A farmer may notice that a good income involves chunks that are requested a hundred times a day, so anything less than that would be deleted. We don’t really know what they’ll set their baseline at. They could reset their hard drives until they reach their desired 100 GET requests a day. Hopefully the penalties in place would prevent this from ever happening.
Don’t forget, the rewards are adjusted to balance provision with demand. And yes, resetting to try and capture fresh frequently accessed data will be penalised, and the network needs to adjust itself to minimise this.
If a farmer regularly deletes unpopular data, their node will frequently be asked to redownload a copy of that data, increasing their bandwidth costs. If the node refuses any request to download a copy of data, or pretends to download it and doesn’t actually bother, the amount of storage the network believes it contributes is reduced. Your allocation on other people’s machines is therefore also reduced, your costs for keeping your data on other people’s machines rise, and your surplus drops.
The handy thing about SHA512 content addressable data and an XOR chunk map is that chunks really are allocated as close to randomly as is possible. So long as you supply a reasonable amount of storage to the network, your income will tend very closely to the network average. And that network average is the optimum: it’s as good as you can get for any sustained period if you supply a reasonable amount of storage.
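Here is a small sketch of why placement looks random. This assumes nothing beyond what the post states (SHA-512 addresses, XOR distance); all function names and the toy node IDs are invented. A chunk’s address is its hash, so addresses are spread uniformly regardless of content, and the node “closest” in XOR space holds it.

```python
import hashlib

def chunk_address(data: bytes) -> int:
    # A chunk's network address is just its SHA-512 hash, so addresses
    # are uniformly spread no matter what the data actually is.
    return int.from_bytes(hashlib.sha512(data).digest(), "big")

def xor_distance(a: int, b: int) -> int:
    # XOR of two IDs, interpreted as an integer, is the "distance" metric.
    return a ^ b

def closest_node(chunk_addr: int, node_ids: list) -> int:
    # The chunk lands on the node(s) closest to it in XOR space.
    return min(node_ids, key=lambda n: xor_distance(n, chunk_addr))

# Toy network: node IDs derived the same way, also effectively random.
nodes = [chunk_address(bytes([i])) for i in range(8)]
addr = chunk_address(b"some chunk")
owner = closest_node(addr, nodes)
```

Because both chunk addresses and node IDs are hash outputs, no farmer can position themselves to attract “good” chunks, which is what makes income tend to the network average.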
MaidSafe spent a ton of time thinking of every conceivable way one could game the system and made sure you lose out from it. Sans bugs, of course.
If its owner deletes the data, and no other copies exist, and no one uploads identical data using the same owner keys, it’ll eventually get garbage collected subject to free space pressure.
The handy part of this is if you delete everything from SAFE, and then reupload it again, the network will spot that the chunks are already uploaded and you don’t actually need to upload anything.
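That “re-upload is a no-op” behaviour falls straight out of content addressing. A minimal sketch, with invented names (`network`, `put_chunk`) and no claim about the real API:

```python
import hashlib

network = {}  # hash -> chunk: what the network already holds

def put_chunk(data: bytes):
    """Returns (chunk_id, uploaded). If a chunk with the same hash already
    exists, nothing new needs to be stored or sent."""
    key = hashlib.sha512(data).hexdigest()
    if key in network:
        return key, False      # already there: dedup, no upload
    network[key] = data
    return key, True

k1, uploaded1 = put_chunk(b"holiday photos")
k2, uploaded2 = put_chunk(b"holiday photos")  # "re-upload" after a delete
# k1 == k2, and the second call stored nothing.
```

Identical data always hashes to the same chunk ID, so the network can answer “already got it” without ever seeing the content twice.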
If you’re more worried about the privacy/security thing that deleting data doesn’t really delete data, that’s already the case for any filing system, and any cloud storage. Nobody actually deletes data, they simply flag the data as deleted and they may or may not garbage collect some day.
The big difference with SAFE is that the connection between data and its owner is one way. There is no mathematically feasible way of figuring out who owns some chunk. If one were the NSA and could watch the entire internet, one could watch network traffic patterns in SAFE to infer connections between chunks and owners, but even then you can only narrow an owner down to their local group at best, so that’s a 1 in 32 chance, and this assumes the attacker has 100% visibility into all global internet traffic flows.
And let me turn it the other way round. If you have some data which you need to delete in a hurry because someone is coming for you in the next five minutes, the very last thing SAFE should do is actually delete anything because that identifies the chunks belonging to you. Far better simply to disconnect you mathematically from your data using an entirely local operation of destroying your encryption keys.
That’s not totally true. When I empty the trash on my Mac, I actually “secure empty trash” so that the data is overwritten with random ones and zeroes, which is pretty close to actual deletion of the files.
I’m not sure I agree that keeping all old data around forever is a good idea. Couldn’t there be some way to require more safecoin to keep data for, e.g., 100 years, rather than keeping it around indefinitely even when the account it belongs to has been inactive for years?
Good point… but it seems like it would be nice if there were some way to decrease the amount of cruft. For instance, maybe people could earn coin when they delete their data. Not enough to make people delete data they should probably keep, but maybe enough to make it worth it for people not to be complete packrats who leave data lying around decades later.
I recently had a brief email discussion with Fraser about this. The mechanism for removing chunks from the network hasn’t been determined yet, and the API is initially not going to notify the network when a user deletes a chunk. Hopefully this will change by release. Determining when no entity is pointing to a de-duplicated chunk is difficult without exposing metadata about the entities pointing to it (at least I think it’s difficult). I know @dirvine and @Fraser had some thoughts on this topic, but I’m a bit lazy to look it up right now (sorry, it’s on the forum somewhere). Currently the only things (at the data storage layer) tagged with owner information are the SDVs, which store the networkID (SHA512 hash) of the encrypted data map of the containers (directories).
Or are you referring to people naturally aging off data that isn’t accessed? I know there has been discussion of ways of probing the vaults to ensure this doesn’t happen, because someone’s data could be unexpectedly purged. The pay-once scheme seems to present some interesting storage issues.
This isn’t always true currently. If the API isn’t aware a chunk is already on the network, it will issue a PutChunk request (name changed slightly). The PutChunk request will always upload the entire chunk in the request. The PutChunk request could ask the network first, which would require another RTT but would reduce bandwidth in some situations. Do you think it’s worth it? That portion will be re-written/updated to use routing_v2, so it’s going to be worked on shortly anyway.
FWIW, Drive (and by extension the Posix/REST API) has some special cases where it knows a priori that the chunk is on the network, and in those cases 0 network events should occur. I recently got GoogleMock set up between both the (test only) disk layer and the network layer, so the Posix/REST API should be well tested for the exact number of network requests (and failure injection!).
This doesn’t seem to be as effective with SSDs. Apparently OSX will notify you of this, so sounds like you have a traditional HDD.
That should be accurate, but there’s lots to keep track of on the data storage front…
I think a mechanism for each group to regularly ping the chunks held by that group in other groups is wise. A bit like RAM row refresh, it keeps those chunks alive.
Anything unpinged falls to the bottom of a LRU cache, and those chunks get scavenged as free disc space pressure demands.
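The ping-refresh plus LRU-scavenge idea above can be sketched with an ordered map: pinged chunks move to the “fresh” end, and under space pressure the least recently pinged chunks go first. All names here are invented for illustration, not a proposed implementation:

```python
from collections import OrderedDict

class ScavengingStore:
    """Toy LRU store: a ping refreshes a chunk (like RAM row refresh);
    under space pressure, the least recently pinged chunks are scavenged."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.chunks = OrderedDict()  # chunk_id -> data, stalest ping first

    def put(self, chunk_id: str, data: bytes):
        self.chunks[chunk_id] = data
        self.chunks.move_to_end(chunk_id)   # freshly stored = freshly pinged
        self._scavenge()

    def ping(self, chunk_id: str):
        # A group pinging its chunks keeps them alive.
        if chunk_id in self.chunks:
            self.chunks.move_to_end(chunk_id)

    def _scavenge(self):
        # Free-space pressure: drop from the unpinged end.
        while len(self.chunks) > self.capacity:
            self.chunks.popitem(last=False)

store = ScavengingStore(capacity=2)
store.put("a", b"1")
store.put("b", b"2")
store.ping("a")        # "a" refreshed, so "b" is now the stalest
store.put("c", b"3")   # pressure: "b" gets scavenged, "a" and "c" survive
```

Orphaned (never-pinged) chunks sink to the stale end on their own, which is exactly the “fall to the bottom of a LRU cache” behaviour described.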
I had thought this was per user only, so dedup only happens per user, possibly per group. For obvious reasons dedup past that leaks information.
To be honest, we get a free pass next few years anyway as hard drive capacities are still experiencing (slowing) exponential growth. We will have to do something eventually though like that ping idea.
I personally want everything to be synchronous, so when the user hits Ctrl-S to save they wait the ten or fifteen seconds until enough copies land, and fsync completes, on the end destination nodes to warrant claiming the save completed. I appreciate that isn’t possible, and it introduces unpleasant complications to close() semantics, but that’s what I’d personally prefer. In a fully synchronous design, RTT is always an optimisation.
Just because you rewrite a file’s content doesn’t mean the hard drive physically overwrites the original. Indeed with a SSD it almost certainly would not.
The only way of really deleting data is ATA Secure Erase, and the only way of deleting data and being sure you have done so is a degauss round on magnetic disks.