Data Deduplication


#1

An associate browsed the MaidSafe.net features page and sees deduplication as a safety concern. I think he is working under some misconceptions about how the system works. Here’s his specific concern:

The system having the ability to identify identical data segments negates the inherent security of those data segments. In the statement, it makes it seem as if no one knows where the segments are when in fact, they are closely tracked and managed by the system. Any file you store that another person also stores will be identified as such, allowing authorities to identify users who store the same file simply by monitoring the deduplication process.

My guess would be that there is no “deduplication process” to monitor because deduplication is a simple consequence of how chunks are routed to locations in XOR space based on their hash codes, which are calculated before they even enter the network. Also, I believe there is no way for authorities to monitor a user’s activity, apart from the user’s computer already being compromised, since each time a user logs on they are assigned a different XOR address.
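To make the "no process to monitor" point concrete, here is a minimal sketch. It assumes SHA-256 as the hash and a handful of made-up node ids; `chunk_address` and `closest_nodes` are hypothetical names for illustration, not the network's actual API.

```python
import hashlib
from typing import List


def chunk_address(chunk: bytes) -> int:
    """Address of a chunk: the hash of its content, read as an integer in XOR space."""
    return int.from_bytes(hashlib.sha256(chunk).digest(), "big")


def closest_nodes(address: int, node_ids: List[int], count: int = 2) -> List[int]:
    """Return the `count` node ids closest to `address` by XOR distance."""
    return sorted(node_ids, key=lambda node: node ^ address)[:count]


# Hypothetical node ids, derived from names only to get some spread.
nodes = [chunk_address(name.encode()) for name in ("n1", "n2", "n3", "n4", "n5")]

# Two users independently produce the same (already encrypted) chunk bytes.
alice_chunk = b"identical encrypted chunk bytes"
bob_chunk = b"identical encrypted chunk bytes"

# Same content gives the same address and the same responsible nodes.
# Nothing ever compares Alice's chunk with Bob's.
alice_target = closest_nodes(chunk_address(alice_chunk), nodes)
bob_target = closest_nodes(chunk_address(bob_chunk), nodes)
```

The address is computed on the client before anything is sent, so deduplication here is just two writes landing on the same key.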

Are my statements correct, and do they fully address his concern?


#2

You are correct, he’s making assumptions on how deduplication is achieved that are not valid.

Files are ‘chunked’ and then encrypted before they leave your machine, and only you have a map that can be used to relate those chunks together.

Once on the network those chunks end up at random locations and only someone who has a data map for the particular file can know they correspond to that particular file, and nobody can tie a chunk on the network to a user (or file) without first having that same data map (for the particular file).


#3

Saying what @happybeing said, in different words:

You store a file

  • it is split and self encrypted in chunks
  • each chunk is sent to the network to be stored
  • for each chunk you pay the network to do the store
  • you have the datamap (list of chunk addresses obtained from the self encryption process IIRC)
  • the network does the storing
  • if the chunk already exists (duplicate chunk) then no store happens
  • you get back a stored_OK type of response. There is no indication that the chunk already existed, and there are no refunds.

In other words there is no response back to you that the chunk existed or not and your chunk store appears exactly the same whether the chunk previously existed or not.
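The flow above can be sketched as a toy in-memory store. Everything here is an assumption for illustration: the `ToyNetwork` class, the flat `PUT_COST`, and the response strings are made up, and SHA-256 stands in for whatever hash the network uses.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict

PUT_COST = 1  # hypothetical flat price per chunk


@dataclass
class ToyNetwork:
    chunks: Dict[str, bytes] = field(default_factory=dict)
    balances: Dict[str, int] = field(default_factory=dict)

    def put(self, client: str, chunk: bytes) -> str:
        """Charge the client, store the chunk only if it is new, always answer the same."""
        if self.balances.get(client, 0) < PUT_COST:
            return "insufficient_balance"
        self.balances[client] -= PUT_COST  # charged whether or not the chunk exists
        address = hashlib.sha256(chunk).hexdigest()
        self.chunks.setdefault(address, chunk)  # no-op when the chunk is a duplicate
        return "stored_OK"


net = ToyNetwork(balances={"alice": 5, "bob": 5})
first = net.put("alice", b"some encrypted chunk")
second = net.put("bob", b"some encrypted chunk")  # duplicate of Alice's chunk
```

Both callers see the same response and pay the same amount, so the PUT itself does not reveal whether the chunk was already there.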

EDIT: for the no refunds being confirmed see Data Deduplication


#4

Thanks @neo, a good, clear, more detailed description.

Are you sure about charging when already stored? My understanding was maidsafe still plan to charge, or at least that this is not yet decided.


#5

That was declared by David as Yes it will.

Some of the reasons:

  • if you don’t charge, it tells you that the file (chunks) already exists, which reduces security.
  • refunding the amount makes the network more complex.
    • something to do with having to pass the refund back up the line.
    • the network has done the work of processing the chunk, except for the very last step of actually storing it
  • charging each time doesn’t favor one person over another. Everyone is treated the same in storage costs.
  • it improves the network economics

Of course this doesn’t consider a technique completely separate from de-duplication: write an APP that does the self encryption without trying to store the file and then requests the chunks. This tells you whether the chunks exist without needing to attempt to store them.


#6

Master branch of safe_vault already implements the refund when put data already exists in the network.

Edit : As indicated below by @frabrunelle, this is only true for SD and MD and not for immutable data.


#7

Hmm, so isn’t it a DoS attack vector to keep putting the same data? It won’t take up more network space, but would take processing time (and bandwidth?). And you could do it for free.


#8

I didn’t think it was up to the vault component to issue the refund back to the “PUT” balance.


#9

Yes, it is. Account balances are managed by MaidManager persona of vaults, including refund when a put is unsuccessful because the data already exists.

Edit : As indicated below by @frabrunelle, this is only true for SD and MD and not for immutable data.


#10

Perhaps a partial refund would be better.


#11

Too complicated - would inevitably involve “magic numbers” that @dirvine dislikes.

But I’m willing to be convinced otherwise?


#12

I suspect we can estimate cost of checking for existing data vs saving the data, then run with it. The main point is that it should cost something no matter how small. People don’t like spending money to do DoS generally.


#13

Totally agree.
I tend toward the view that if you want to PUT something, it should cost, whether or not it actually results in a new PUT.
“Profit” from this to go to general funds split between farmers, PtP whatever.

Anything, just anything that results in fewer pictures of kittens on teh intrawebs…

But absolutely no freebies for something that could get used as a DoS vector.


#14

Thanks for that.

I agree with this. If you commit to something then it’s no loss to you that the system has already stored it for someone else.

That too.

But still (if I wasn’t in a hurry) I would use an APP that checks whether the chunks already exist. It would do something like take my 20 GB file, self encrypt the first 10 chunks, and then try to retrieve them from the SAFE network. If they don’t exist then it’s unlikely the rest do, so upload the file in the normal manner. If the first 10 exist then there’s a good chance the rest do, so it encrypts more and checks, say, the last 10, and so on. Thus I never ask the network to store the data, cannot be asked to pay, and the network hasn’t done 20 GB worth of attempted stores.
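A rough sketch of that probing APP, under stated assumptions: `exists` is a hypothetical stand-in for "try to GET this address from the network", SHA-256 stands in for the real address hash, and the chunks are assumed to be already self encrypted. It only samples the head and tail, so a positive answer is a good guess, not a proof.

```python
import hashlib
from typing import Callable, List


def probably_stored(
    chunks: List[bytes],
    exists: Callable[[str], bool],
    sample: int = 10,
) -> bool:
    """Guess whether a file is already stored by probing its first and last chunks."""
    if not chunks:
        return False

    def all_present(subset: List[bytes]) -> bool:
        # Only the sampled chunks are hashed and looked up.
        return all(exists(hashlib.sha256(c).hexdigest()) for c in subset)

    # If any of the first `sample` chunks is missing, upload normally.
    if not all_present(chunks[:sample]):
        return False
    # Head is present, so spot-check the tail as well.
    return all_present(chunks[-sample:])


# Demo with a set of addresses standing in for the network.
demo_chunks = [bytes([i]) * 8 for i in range(30)]
demo_network = {hashlib.sha256(c).hexdigest() for c in demo_chunks}
already_there = probably_stored(demo_chunks, demo_network.__contains__)
```

Since only GETs are issued, no PUT is ever attempted for a file that is probably present, which is the cost saving described above.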


#15

Yes, but that’s only for Structured Data and Appendable Data.

If a chunk of ImmutableData already exists, the Vaults send a PUT success (no refund):

Immutable Data offers no refund because those chunks aren’t owner specific, so the user gets a PUT success. This isn’t the case for SD/AD and thereby the refunds.
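As a summary of the rule described in this post, here is a small sketch. The enum, function name, and response strings are made up for illustration and are not taken from safe_vault.

```python
from enum import Enum
from typing import Tuple


class DataKind(Enum):
    IMMUTABLE = "ImmutableData"
    STRUCTURED = "StructuredData"
    APPENDABLE = "AppendableData"


def put_outcome_when_exists(kind: DataKind) -> Tuple[str, bool]:
    """Return (response, refunded) for a PUT whose target already exists."""
    if kind is DataKind.IMMUTABLE:
        # Chunks are not owner specific: report success and keep the payment.
        return ("put_success", False)
    # SD/AD are owner specific: the PUT fails and the charge is refunded.
    return ("put_failure_data_exists", True)
```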


#16

High five to @neo - right again man! :slight_smile:


#17

Actually I was repeating what David said so it should be high five to @dirvine :wink:

And thanks to @frabrunelle for tracking it down


#18

Too modest for an Aussie! :wink:

David’s qualities are contagious :slight_smile:


#19

I’m trying to learn about SAFE Network and I’m stuck on deduplication (which is pretty early in the FAQ). I understand that the pre-encryption hash allows a client to detect if a duplicate chunk is already stored on the network, but isn’t the stored chunk encrypted by a client-specific key? If so, then client B could not decrypt the encrypted chunk stored by client A. Also, does data deduplication create a problem when a client wants to delete a file/chunk? Is there a use-count for each chunk? I’m a programmer so feel free to get technical or point me to source code files. Thanks!


#20

No, the self encryption algo uses the data itself to form the encryption keys, so there is no client-specific key. This allows anyone who has the datamap of the chunks to decode the file. Each chunk on its own is “impossible” to decrypt, but once you have all the chunks in sequence (datamap) then it’s possible to decrypt them.

Sorry I don’t have the link to the self encryption explanation with me, but I’m sure a search on “self encryption” in the forum will bring a wealth of information.

EDIT: follow this link and the post has a video in it that explains it (hopefully since I didn’t watch the vid to check)
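For a programmer, the idea of "the data itself forms the encryption" can be shown with a toy. This is not the real self encryption algorithm: it assumes a tiny chunk size, SHA-256, and a simple XOR keystream in place of a proper cipher, and all function names are made up. Each chunk's key is derived from the pre-encryption hashes of two other chunks, and the datamap records those hashes plus the address of each encrypted chunk.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # tiny, for demonstration only
DataMap = List[Tuple[bytes, str]]  # (pre-encryption hash, encrypted chunk address)


def _keystream(seed: bytes, length: int) -> bytes:
    """Expand a seed into `length` pseudo-random bytes (toy, not a real cipher)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def _xor(data: bytes, pad: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, pad))


def _seed(pre_hashes: List[bytes], i: int) -> bytes:
    """Key material for chunk i comes from the hashes of the two preceding chunks."""
    n = len(pre_hashes)
    return pre_hashes[(i - 1) % n] + pre_hashes[(i - 2) % n]


def self_encrypt(data: bytes) -> Tuple[DataMap, Dict[str, bytes]]:
    plain = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    pre_hashes = [hashlib.sha256(p).digest() for p in plain]
    data_map: DataMap = []
    store: Dict[str, bytes] = {}
    for i, p in enumerate(plain):
        enc = _xor(p, _keystream(_seed(pre_hashes, i), len(p)))
        address = hashlib.sha256(enc).hexdigest()
        data_map.append((pre_hashes[i], address))
        store[address] = enc
    return data_map, store


def self_decrypt(data_map: DataMap, store: Dict[str, bytes]) -> bytes:
    pre_hashes = [pre for pre, _ in data_map]
    out = b""
    for i, (_, address) in enumerate(data_map):
        enc = store[address]
        out += _xor(enc, _keystream(_seed(pre_hashes, i), len(enc)))
    return out
```

Because no client secret is involved, two clients encrypting the same file get byte-identical chunks at identical addresses, which is what lets client B read a chunk that client A stored, provided B holds the datamap.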