De-duplication Fidelity


#1

This is my first post on these forums, although I’ve done a fair amount of reading on them recently. For now, I’ll start with a question, which I haven’t seen come up yet (in this aspect, at least). If it has already been discussed in detail, though, I apologize in advance for being redundant.

That being said, I am curious about how MaidSafe’s de-duplication process will actually work. If my understanding is correct, data will be divided into ‘chunks’, and identical chunks will be shared across the SAFE Network, rather than having many copies of each chunk. Assuming that’s right, how identical is “identical”? Does that mean data will only be de-duplicated if it is 100% bit-for-bit identical?

In a blog post, Nick Lambert gave this example: “So, if 10,000 copies of The Beatles ‘A Day in the Life’ are uploaded, the network will identify that they are identical and update the subscriber count to 10,000, but keep only 4 copies of each file.” [Source: http://blog.maidsafe.net/2014/05/14/the-economics-of-safecoin/ ]

So, following this example of a song, one might naturally wonder about the many varieties that may exist of a given song. An obvious example would be different release forms (CDs, remasterings, collections, various geographical regions, etc.), all of which might have slight variations that one person may prefer over another. Or, perhaps someone even wants to have every possible version of this particular song. Additionally, if the source is not the issue, one person may simply have ripped a track with greater accuracy than another person. Similarly, what if people have the same song, from the same release (perhaps even originating from the same shared digital source), but have customized its metadata differently, or normalized the audio levels compared to other tracks in a playlist?

How would such differences be handled? Could data be “de-duplicated” even if it is not an exact match?

Thanks in advance to anyone who might be able to help me understand this process better. And, again, if I’ve repeated an already-discussed issue, perhaps my post can be de-duplicated too.


#2

[I’m not sure if my own computer is just having problems, but it is displaying my post very strangely. After my first paragraph, the next five are all set in a separate scrolling box. I’m not sure what may have happened there, or if it’s visible to others. Please let me know if anyone else has any difficulty in viewing it.]


#3

I was able to read your post just fine. Two different fonts, is all.


#4

Okay, thanks. Sorry for that. I think I just figured it out. It looks like the forum software doesn’t like me indenting paragraphs.


#5

This thread relates to your post


#6

Thanks. I found that topic earlier too, but I wasn’t sure if it was answering the same issue. I know there was discussion of how compression would affect files being de-duplicated, but it didn’t seem conclusive. Anyway, if I’ve misunderstood something, I’ll be happy to be corrected.


#7

@SilasB welcome to posting!

You’ve understood this as I do - only exact copies are deduplicated, and I don’t foresee the network doing any fuzzy matching for this purpose.

A fuzzy search engine app could, though, attempt to collate multiple versions of the same data for those who want to find different mixes or do clever “like this” searches. I’m not aware of anyone planning this, though.


#8

No, it cannot be.
How could the network change one’s file to point to a different file that belongs to another user?
That would be completely unacceptable.
While there are ways to make this work better, I don’t think any of them can be applied to the SAFE network anytime soon.

Imagine you use some popular Excel invoice template to charge folks for your services.
Imagine I use the same template.
Now imagine that MaidSafe figures out our Excel docs are 99% similar (the only difference perhaps being in bank account info) and replaces your document with mine. Then you can invoice your customers and have them pay to my bank account.

From the free market perspective, I don’t care that deduplication can’t handle all these fancy scenarios.
The majority of the garbage that we have is not unique and will be deduplicated. The rest won’t, but we have to pay for it, and the more careless we are, the more expensive it’ll be. So ultimately we’ll want to manage our data better.
Also, there’s an economic motivation to share data and make money from it (which offsets one’s costs), although, as discussed in other topics, this can de-incentivize authors from creating original content (which is a different topic).


#9

Your example does a great job of illustrating the point. I actually feel rather foolish now for overlooking something so simple. Regardless, thanks; your response really helped.

I guess now I understand why the issue of encryption came up when this issue was discussed before. Perhaps even a tiny change in a very large file, which was later compressed, could result in none of its data chunks being identical to other compressed copies of the file.
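
To get a feel for this effect, here is a minimal sketch. It assumes fixed-size chunks, SHA-256 as the chunk checksum, and zlib as the compressor; all three are stand-ins chosen for illustration, not what the SAFE network actually uses. It changes one byte near the start of a file and counts how many chunks still match, first on the raw data and then on the compressed data.

```python
import hashlib
import zlib

CHUNK_SIZE = 1024  # small illustrative chunk size (assumption), not the network's real value


def chunk_hashes(data: bytes, size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return the SHA-256 digest of each."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def count_matching(a: list[str], b: list[str]) -> int:
    """Count positions where both chunk lists hold the same digest."""
    return sum(1 for x, y in zip(a, b) if x == y)


# Two 'files' of equal length that differ in a single byte at the start.
original = b"".join(b"record %06d\n" % i for i in range(20000))
edited = b"X" + original[1:]

raw_hashes = chunk_hashes(original)
packed_hashes = chunk_hashes(zlib.compress(original))

raw_matches = count_matching(raw_hashes, chunk_hashes(edited))
packed_matches = count_matching(packed_hashes, chunk_hashes(zlib.compress(edited)))

print("raw chunks still identical:       ", raw_matches, "of", len(raw_hashes))
print("compressed chunks still identical:", packed_matches, "of", len(packed_hashes))
```

On the raw data only the first chunk changes. How many compressed chunks survive depends on the compressor and the data, so the second number is best read as an experiment rather than a guarantee.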

On the issue of chunk size: If I’ve read things correctly, it seems that the size of chunks for de-duplication will be 1 MB. Is there any reason for setting the limit this large? In the case of popular image files, for example, many of these will surely be under 1 MB, and there may be many thousands of copies of them. Is there some technical reason for not applying de-duplication to them as well? Or have I just misunderstood something?


#10

Fortunately, and with many thanks to the universe, I had a chance to meet and speak with the founder of http://pexe.so/ and also to see these algorithms working behind the scenes.

But the point is:

A video within another video could even be detected:

Essentially, they’ve been able to take a short clip of any video and find its matches across the internet, in any medium, from raw, unaltered copies down to complex cases like this ^^.


#11

I hadn’t heard of Pexeso before, but it’s interesting to see how far along they are with it. It reminds me of some of the image-matching algorithms used by some of the search engines now—although Pexeso seems to do a better job of finding partial matches than the search engines currently do for still images (which is perhaps also an intentional result of different design motivations).

But, I guess if human brains can recognize patterns of similarity, it shouldn’t be too surprising that computer algorithms could also have similar capabilities (especially as things continue to develop in these fields). Anyway, interesting stuff.


#12

Also, I hope it won’t be considered rude for me to reiterate my question from my second-to-last post here… But does anyone know about the “chunk size” issue I asked about? I’ve done a bit more searching, but still haven’t come across an answer, and it’s something I’m curious about.

My apologies, though, if I’ve overlooked a similar answer elsewhere.


#13

It’s a balance. There will always be a minimum of three chunks (so they could be a few bytes each). Each chunk has an overhead on the network (maintenance), so small chunks dedupe better but cost more in maintenance. This is an area that will evolve, though. Essentially, a single-byte chunk would be best, but then what size would the pointer to it be, and so on. It’s a rather large area to consider.
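
A back-of-the-envelope sketch of the pointer side of that trade-off follows. It assumes each chunk is referenced by a 32-byte pointer (for example a 256-bit hash); that figure and the `pointer_overhead` helper are hypothetical, and the model ignores every other maintenance cost.

```python
import math

POINTER_BYTES = 32  # assumed size of one chunk reference (e.g. a 256-bit hash); hypothetical


def pointer_overhead(file_size: int, chunk_size: int) -> tuple[int, float]:
    """Return (chunk count, pointer bytes needed per byte of stored data) for one file."""
    chunks = math.ceil(file_size / chunk_size)
    return chunks, chunks * POINTER_BYTES / file_size


file_size = 4 * 1024 * 1024  # a 4 MiB file, chosen only for illustration
for chunk_size in (1, 1024, 1024 * 1024):
    chunks, ratio = pointer_overhead(file_size, chunk_size)
    print(f"chunk size {chunk_size:>8} B -> {chunks:>8} chunks, "
          f"{ratio:.6f} pointer bytes per data byte")
```

Under these assumptions, single-byte chunks need 32 bytes of pointer for every byte of data, while 1 MiB chunks need almost none. That is one way to see why the smallest chunk is not automatically the cheapest.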


#14

De-duplication of all data? Is all of the data de-duplicated successfully?

The reason for my inquiry is: does the system deduplicate data with precision? If so, does that mean there is a universal solution for finding duplicate 3D movies, for example?

At the end of the article, the engineers behind Pexeso indicated that they are still working on being able to detect a 3D movie duplicate. Does MaidSafe already do this by default, given the way it is built for deduplication?


#15

Okay. That makes sense. I didn’t realize it was a trade-off like that. Thanks for clarifying.


#16

If I understand your question, I think @happybeing and @janitor may have answered it when responding to my earlier question here.

That is, only exact copies of data are de-duplicated—but since they’re only chunks of data, the larger “file” the data is a part of is not particularly important. In a conceptual sense, then, it’s actually quite simple. It’s not looking for partial patterns (as it would be if trying to match similar, but inexact, copies of something, like an entire song or movie); rather, it’s simply finding exact copies of data chunks, and organizing them efficiently.
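
The idea can be sketched as a toy content-addressed store. Everything here is an assumption for illustration: the `ChunkStore` class, the 4-byte chunk size, and SHA-256 as the address function. It also skips encryption entirely. Two different "files" that happen to share some identical chunks end up storing those chunks only once.

```python
import hashlib

CHUNK_SIZE = 4  # tiny chunk size so the example is easy to follow (assumption)


class ChunkStore:
    """Toy content-addressed store: one copy per distinct chunk, plus a reference count."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}
        self.refs: dict[str, int] = {}

    def put_file(self, data: bytes) -> list[str]:
        """Store a file chunk by chunk and return the list of chunk addresses."""
        addresses = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            address = hashlib.sha256(chunk).hexdigest()
            if address not in self.chunks:
                self.chunks[address] = chunk  # first time this exact chunk is seen
            self.refs[address] = self.refs.get(address, 0) + 1
            addresses.append(address)
        return addresses

    def get_file(self, addresses: list[str]) -> bytes:
        """Rebuild a file from its ordered list of chunk addresses."""
        return b"".join(self.chunks[a] for a in addresses)


store = ChunkStore()
song = store.put_file(b"AAAABBBBCCCC")
remix = store.put_file(b"AAAAZZZZCCCC")  # same first and last chunk, different middle

print(len(store.chunks))     # 4 distinct chunks stored for 6 chunks uploaded
print(store.refs[song[0]])   # 2: the shared first chunk is referenced by both files
```

Each file is just an ordered list of chunk addresses, so the store never needs to know that the two uploads are "versions" of the same thing.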

This is my very elemental understanding, though, so perhaps someone else can correct my explanation if I’m mistaken.


#17

The precision level is “chunk”, a 1MB “unit of allocation” on the SAFE network.
It doesn’t matter what’s inside, what file type it is, etc.
A checksum is calculated and compared. Different checksums for 2 chunks means the chunks are different and can’t be deduplicated and referenced by the same “pointer”.
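
As a minimal sketch of that comparison, assuming SHA-256 as the checksum (the actual function used by the network is not stated in this thread) and a hypothetical `same_chunk` helper:

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MB, the chunk size mentioned in this thread


def same_chunk(a: bytes, b: bytes) -> bool:
    """Treat two chunks as duplicates only if their SHA-256 digests match."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


chunk = bytes(CHUNK_SIZE)            # 1 MB of zero bytes
exact_copy = bytes(CHUNK_SIZE)       # bit-for-bit identical content
near_copy = chunk[:-1] + b"\x01"     # differs in the final byte only

print(same_chunk(chunk, exact_copy))  # True
print(same_chunk(chunk, near_copy))   # False
```

A single differing byte in 1 MB is enough to produce a different checksum, so the near copy would be stored as a separate chunk.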

What you’re asking about is specific to applications. That kind of matching is done at a different level, and it can be done, but it would be ridiculously “expensive” in terms of resources. Not that this cost can’t be afforded (it can, which is why the 3D dudes are developing such an app), but rather that the SAFE network isn’t built (primarily) for that. Maybe later it’ll become feasible and interesting, but at the moment it’s not a priority for the project (I guess - I’m not a MaidSafe employee).