Data Deduplication

Too complicated - would inevitably involve “magic numbers” that @dirvine dislikes.

But I’m willing to be convinced otherwise?

I suspect we can estimate the cost of checking for existing data vs saving the data, then run with it. The main point is that it should cost something, no matter how small. People generally don’t like spending money to run a DoS.

3 Likes

Totally agree.
I tend toward the view that if you want to PUT something, it should cost, whether or not it actually results in a new PUT.
The “profit” from this would go to general funds, split between farmers, PtP, whatever.

Anything, just anything that results in fewer pictures of kittens on teh intrawebs…

But absolutely no freebies for something that could get used as a DoS vector.

4 Likes

Thanks for that.

I agree with this. If you commit to something, then it’s no loss to you that the system has already stored it for someone else.

That too.

But still (if I wasn’t in a hurry) I would use an APP that checks whether the chunks already exist. It would do something like take my 20 GB file, self-encrypt the first 10 chunks and try to retrieve them from the SAFE network. If they don’t exist, then it’s unlikely the rest do, so it uploads the file in the normal manner. If the first 10 do exist, there’s a good chance the rest do too, so it encrypts more and checks, say, the last 10, and so on. That way I never ask the network to store the data, I can’t be asked to pay, and the network hasn’t done 20 GB worth of attempted stores.
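
Something like this, as a rough sketch - `self_encrypt_chunks`, `chunk_exists` and `put_chunk` here are hypothetical stand-ins for whatever the real client API exposes, not actual safe_core calls:

```rust
// Rough sketch: probe a few chunk addresses before paying to upload.

struct Chunk {
    address: [u8; 32], // network address, derived from the encrypted content
    content: Vec<u8>,
}

fn probably_already_stored(chunks: &[Chunk], sample: usize) -> bool {
    // Probe only the first `sample` chunks; if they all exist there is a good
    // chance the rest of the file does too, if none exist it almost certainly doesn't.
    chunks.iter().take(sample).all(|c| chunk_exists(&c.address))
}

fn upload_file(data: &[u8]) {
    let chunks = self_encrypt_chunks(data); // done locally, costs nothing
    if probably_already_stored(&chunks, 10) {
        // Could widen the probe (e.g. the last 10 chunks) before concluding
        // the whole file is already on the network.
        println!("looks like it's already stored - skip the PUTs, pay nothing");
    } else {
        for chunk in &chunks {
            put_chunk(chunk); // this is the step that gets charged
        }
    }
}

// Stubs so the sketch compiles; the real calls would hit the network.
fn self_encrypt_chunks(data: &[u8]) -> Vec<Chunk> {
    vec![Chunk { address: [0; 32], content: data.to_vec() }]
}
fn chunk_exists(_address: &[u8; 32]) -> bool { false }
fn put_chunk(_chunk: &Chunk) {}

fn main() {
    upload_file(b"pretend this is a 20 GB file");
}
```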

3 Likes

Yes, but that’s only for Structured Data and Appendable Data.

If a chunk of ImmutableData already exists, the Vaults send a PUT success (no refund):

https://github.com/maidsafe/safe_vault/blob/310c363ea379cb4d608d065d9007fad52ed97efa/src/personas/data_manager.rs#L482-L488

Immutable Data offers no refund because those chunks aren’t owner-specific, so the user simply gets a PUT success. This isn’t the case for SD/AD, which is why those get refunds.
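
Roughly, the flow in that linked code boils down to something like this - a simplified paraphrase with made-up types, not the actual safe_vault code:

```rust
use std::collections::HashSet;

enum Data {
    Immutable { name: [u8; 32] },
    Structured { name: [u8; 32] },
}

fn handle_put(store: &mut HashSet<[u8; 32]>, data: Data) -> Result<(), String> {
    match data {
        Data::Immutable { name } => {
            if store.contains(&name) {
                // Chunk already held: report success without storing it again.
                // The uploader is still charged and gets no refund, because
                // immutable chunks are not owner-specific.
                return Ok(());
            }
            store.insert(name);
            Ok(())
        }
        Data::Structured { name } => {
            if store.contains(&name) {
                // SD/AD are owner-specific, so an existing name is an error
                // and the PUT charge is refunded.
                return Err("DataExists - refund the PUT".to_string());
            }
            store.insert(name);
            Ok(())
        }
    }
}

fn main() {
    let mut store = HashSet::new();
    let name = [7u8; 32];
    // First PUT stores the chunk; the duplicate still reports success.
    assert!(handle_put(&mut store, Data::Immutable { name }).is_ok());
    assert!(handle_put(&mut store, Data::Immutable { name }).is_ok());
    // A clashing SD/AD name is rejected, which triggers the refund.
    assert!(handle_put(&mut store, Data::Structured { name: [9u8; 32] }).is_ok());
    assert!(handle_put(&mut store, Data::Structured { name: [9u8; 32] }).is_err());
}
```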

5 Likes

High five to @neo - right again man! :slight_smile:

1 Like

Actually I was repeating what David said so it should be high five to @dirvine :wink:

And thanks to @frabrunelle for tracking it down

4 Likes

Too modest for an Aussie! :wink:

David’s qualities are contagious :slight_smile:

3 Likes

I’m trying to learn about SAFE Network and I’m stuck on deduplication (which is pretty early in the FAQ). I understand that the pre-encryption hash allows a client to detect if a duplicate chunk is already stored on the network, but isn’t the stored chunk encrypted with a client-specific key? If so, then client B could not decrypt the encrypted chunk stored by client A. Also, does data deduplication create a problem when a client wants to delete a file/chunk? Is there a use-count for each chunk? I’m a programmer so feel free to get technical or point me to source code files. Thanks!

No, the self-encryption algo uses the data itself to form the encryption keys. This allows anyone who has the datamap of the chunks to decode the file. Each chunk on its own is “impossible” to decrypt, but once you have all the chunks in sequence (the datamap), then it’s possible to decrypt them.

Sorry I don’t have the link for self encryption with me, but I’m sure a search on “self encryption” in the forum will bring up a wealth of information.

EDIT: follow this link - the post has a video in it that explains it (hopefully, since I didn’t watch the vid to check)

5 Likes

Is this deduplication process implemented yet? I ask because if I repeatedly rename and re-upload a largish file, the process takes pretty much exactly the same time to complete on each occasion, whereas I’d expect it to be quicker after the first time, since the network will realise an identical chunk already exists and thus not bother to store it again. Perhaps the upload has to happen in full anyway before the network can make that decision? Or, alternatively, is it the encryption process rather than the uploading process that is the time-consuming bit?

3 Likes

The concept of dedup is to save on storage (it’s inherent to SAFE) BUT not let the user know that the chunk already exists on the network.

Inherent to SAFE
This means that the network stores the chunk according to its contents: the storage address is a function of the hash of the contents. So 2 chunks with the same data will have the same storage address.
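
In other words, something like the following - SHA-256 via the `sha2` crate is used here purely as a stand-in for whatever hash the real network applies:

```rust
use sha2::{Digest, Sha256};

// The storage address is a function of the chunk's own bytes, so two chunks
// with identical content always map to the same address.
fn chunk_address(chunk: &[u8]) -> Vec<u8> {
    Sha256::digest(chunk).to_vec()
}

fn main() {
    let a = chunk_address(b"the same kitten picture");
    let b = chunk_address(b"the same kitten picture");
    let c = chunk_address(b"a different kitten picture");

    assert_eq!(a, b); // same content -> same address -> stored only once
    assert_ne!(a, c); // different content -> different address
    println!("duplicate spotted purely from the address, content never inspected");
}
```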

Not let you know it already exists
This is immutable data we are talking about. You will be charged the “PUT” cost. If done properly, the client will still upload the chunk, but the network has no need to actually store it, so the client cannot deduce that the chunk already existed. Yes, charging for the chunk store has been confirmed.

Now the flaw is that an APP could be written to check, prior to uploading, whether the chunks already exist. Simply do the self-encryption process on your PC and then request the chunk at that address to see if it exists. The problem with this is the time consumed and bandwidth used, so much so that people might prefer to pay the minuscule amount just to upload the file and be done with it.

6 Likes

Why hide that a chunk exists, though? Does it really reveal anything? Isn’t it just a waste to upload data again if it already exists, instead of just hashing it on the client and only uploading if it doesn’t already exist? I think it would be nicer if I didn’t have to pay PUT costs for data that already exists on the network.

Each time you reveal something you potentially reduce anonymity/security, even if only by a very small amount.

Let’s say you gave a confidential document to just one person. Then you discover that document is already on SAFE because de-dup reports the chunks exist. I am sure a ton of examples could be given of how this kind of revelation can lead to an information leak.

Also, in order to pay for the resources the network used to attempt to store the chunk, the charge for the PUT still applies, and so you need to not reveal that the chunk exists. Otherwise people get a little snotty about paying for something they feel they shouldn’t have had to.

Also it would be nice not to have to pay for anything too :slight_smile:

But dedup becomes a significant source of “income” for the network, since over time more and more data will be duplicates. The network still has to perform work when you try to upload a duplicate file - even the checking costs bandwidth. Also, with “pay once” storage, the cost of retrieving popular files is offset somewhat by dedup. I think it’s reasonable to assume that a lot of popular files will also be uploaded by more than one person.

So if you don’t want to pay, then use an APP that wastes your time checking whether the chunks exist prior to attempting to upload. Since reads are free, it won’t cost you anything (except time/bandwidth) to check. But as I said before, many will most likely just upload the file to save the bother. And yes, it uses SAFE resources to do this, but that’s the tradeoff of giving a free-to-read network.

9 Likes

I have to say it makes sense to charge even if the file is already on the network. The idea being that if person A (the original uploader) deletes their file, the file will stay because person B (the second uploader) has paid for it to remain on the network. Extra work has to be done to verify it stays on the network, and that is what B’s payment goes towards.

1 Like

Given that the average Maidsafe host is only going to have modest bandwidth at their disposal, there are going to be use-cases where higher availability is desired. Imagine hundreds of thousands of people trying to access the same video that has just gone viral… Wouldn’t it make sense for there to be (many) multiple copies of that high-demand data available?

In that case there will be a huge number of vaults and caches supplying those chunks that make up the viral video.

Also remember that the network has 6-8 copies of each chunk stored in vaults.

When a file (vid) becomes popular, the built-in network caching will often be supplying the chunks rather than the chunks coming from vaults. The more popular the file, the more caches hold a copy.

4 Likes

How can the system find duplicates if the chunks are encrypted? Duplicate chunks are possible, of course, but only as a result of random data collisions: processing the same chunk with different passwords would produce different data blocks. As a result, different encrypted chunks might represent the same data, and vice versa - identical chunks might come from different files.

Good block ciphers are designed to produce near-random output, so the expected collision rate should be very small if the chunk size is relatively big, so this feature might not be as effective as it is in traditional cloud services.

If they are exact duplicates then the hashes are identical, so the network can identify them by the hashes without knowing the content.

3 Likes

The encryption of chunks follows a method that encrypts the data using the data itself - search out self-encryption. This way you only need the datamap to decrypt a file; no other keys are required, so when sharing a private file you do not need to give any of your keys to the other person. You only need to give them the datamap. This is how all files are encrypted when stored as immutable data. This way public files can be stored encrypted yet anyone can read the whole file: the datamap is shared (made public) and so anyone can read the file. But a vault cannot read a chunk since it’s encrypted.

As @Jabba says, the address of a chunk is derived from its hash. Since self-encryption is used, 2 different people encrypting the same file will end up with exactly the same chunks, and thus the same chunk addresses - this is how the network knows it’s a duplicate.
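
To make that concrete, here is a toy sketch of the idea - NOT the real `self_encryption` crate (which splits files into many chunks, derives keys from neighbouring chunks’ hashes and uses proper AES plus obfuscation), just enough to show why the same file always produces the same chunks and why the datamap alone is enough to decrypt:

```rust
use sha2::{Digest, Sha256};

// One datamap entry: where to fetch the chunk and how to decrypt it.
struct DataMapEntry {
    pre_hash: Vec<u8>,  // hash of the plaintext chunk, used as the key here
    post_hash: Vec<u8>, // hash of the encrypted chunk = its network address
}

fn xor_with_key(data: &[u8], key: &[u8]) -> Vec<u8> {
    // Stand-in for a real cipher: repeat the key and XOR.
    data.iter().zip(key.iter().cycle()).map(|(d, k)| d ^ k).collect()
}

fn encrypt_chunk(plain: &[u8]) -> (DataMapEntry, Vec<u8>) {
    let pre_hash = Sha256::digest(plain).to_vec();
    let encrypted = xor_with_key(plain, &pre_hash); // key comes from the data itself
    let post_hash = Sha256::digest(&encrypted).to_vec();
    (DataMapEntry { pre_hash, post_hash }, encrypted)
}

fn decrypt_chunk(entry: &DataMapEntry, encrypted: &[u8]) -> Vec<u8> {
    // Anyone holding the datamap entry can decrypt - no owner keys involved.
    xor_with_key(encrypted, &entry.pre_hash)
}

fn main() {
    let chunk = b"some file chunk".to_vec();
    let (entry_a, enc_a) = encrypt_chunk(&chunk);
    let (entry_b, enc_b) = encrypt_chunk(&chunk); // a different uploader, same file

    // Same plaintext -> identical encrypted chunk and identical address,
    // which is what lets the network deduplicate without reading the content.
    assert_eq!(enc_a, enc_b);
    assert_eq!(entry_a.post_hash, entry_b.post_hash);
    assert_eq!(decrypt_chunk(&entry_a, &enc_a), chunk);
    println!("dedup-friendly, and decryptable from the datamap alone");
}
```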

8 Likes