Deduplication

I’m curious about the predictions and calculations of usable storage space that come from pooling spare disk space and uploading user files.

If you know of any useful references on deduplication rates relevant to this application, or to MaidSafe reasoning on this topic, please post them here.

I did some quick googling; there is not much now that’s not behind a paywall. We used to have a ton of these references in the office, and I am sure with a few hours you can find some stats. The figures we had were:

1:4 (low)
1:20 (median)
1:80 (high)

These represent the dedupe rates per organisation: the low end means storing only 25% of the original size, and the median means a 95% saving. These numbers are the size of tape required per amount of data, so at the low end a 1 MB tape can back up 4 MB of data if dedupe is used. We use the median number but expect much higher, as we are talking global dedupe and not a single organisation. The network will be able to tell us in real time what the rates are, though. It is not so important now, as the farmers should be continually farming whatever amount of data there is. It was much more important when we were matching resource to users and could reduce the resource required per file.
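To make the arithmetic behind those ratios concrete, here is a quick sketch (plain Python; the ratios are the figures above, not MaidSafe calculations):

```python
# Convert a dedupe ratio like 1:20 into a stored fraction and a saving.
for label, ratio in [("low", 4), ("median", 20), ("high", 80)]:
    stored_fraction = 1 / ratio      # 1:20 -> only 5% of the raw size is stored
    saving = 1 - stored_fraction     # 1:20 -> a 95% saving
    print(f"{label:>6} 1:{ratio:<2} -> store {stored_fraction:.1%}, save {saving:.1%}")
```

So a 1:4 ratio means storing 25% of the original size, and 1:20 means a 95% saving, which is where the two percentages above come from.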

The dedupe industry has become very competitive, and it seems vendors are keeping their numbers close to their chests, but some digging will turn something up. Even a few years back it was hell getting these numbers.


In the current system, is there a way, through deduplication, to determine who the prime mover of a file was? If I make or write something, bringing the content into existence, is there a way, at the lowest level, to know (in a fully anonymous way) who the creator was? That could be handy.

Not so that anyone could determine who uploaded something, but so that the person who uploaded it could claim to be the file’s creator.

EDIT: Thinking about it more, maybe this would be left up to the content creator. If I want to tag or sign my file, I can; if I don’t, then I don’t. And if I want to build a locked-down video streaming service and gate access to the media through a special player, that’s my prerogative. The underpinnings of the system don’t need to accommodate that.
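Opt-in signing like that needs no network support at all; it can live entirely at the application layer. A minimal sketch using the Python `cryptography` package (the key handling and the file name are illustrative, not any SAFE API):

```python
# A creator signs the hash of their file to make a verifiable authorship claim.
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()     # the creator's signing key
content = open("my_video.mp4", "rb").read()    # illustrative file name
digest = hashlib.sha256(content).digest()      # fingerprint of the content

signature = private_key.sign(digest)           # the authorship claim

# Anyone holding the matching public key can verify the claim later;
# verify() raises InvalidSignature if the claim is forged.
private_key.public_key().verify(signature, digest)
```

The signature can travel with the file or its metadata, so the claim is there for anyone who cares to check it and invisible to everyone else.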


Here’s Nick with a reference and some stats from cloud storage research on file access, deduplication, etc.:

Deduplication savings are dependent on the ratio of unique data blocks to redundant data blocks in a data set.

Without deduplication:
Storage Space Consumed = Total Number of Blocks * Block Size

With deduplication:
Storage Space Consumed = (Number of Unique Blocks * Block Size) + (Number of Redundant Blocks * Dedup Overhead)

To maximize deduplication savings you need two things: first, very few unique blocks; second, a dedup overhead per redundant block that is significantly smaller than the block size.
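Those two formulas translate directly into code. A quick sketch (the block counts and the 64-byte per-reference overhead are made-up numbers for illustration):

```python
def storage_without_dedup(total_blocks: int, block_size: int) -> int:
    return total_blocks * block_size

def storage_with_dedup(unique_blocks: int, redundant_blocks: int,
                       block_size: int, dedup_overhead: int) -> int:
    # Each redundant block costs only its reference overhead, not a full block.
    return unique_blocks * block_size + redundant_blocks * dedup_overhead

# Example: 1,000 blocks of 1 MiB, of which 600 are redundant.
raw = storage_without_dedup(1000, 2**20)
deduped = storage_with_dedup(400, 600, 2**20, 64)
print(f"saving: {1 - deduped / raw:.1%}")  # ~60.0% in this made-up case
```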

Block size and the probability of finding a redundant block are inversely related: as block sizes get smaller, the probability of a block being redundant gets higher. However, as block sizes get smaller, the storage savings per redundant block also get smaller. Block size must be carefully considered, as it plays a large role in determining storage efficiency (and transfer time): if the block size is too large or too small compared to the average size of a file, storage efficiency will go down.
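That trade-off is easy to see empirically: chunk the same data at several block sizes, count unique chunks by hash, and charge each redundant chunk a reference overhead. The synthetic data and the 64-byte overhead below are assumptions, not measured figures:

```python
# Fixed-size-block dedup at several block sizes over the same synthetic data.
import hashlib, os

unit = os.urandom(4096)
data = unit * 200 + os.urandom(64 * 1024)   # heavy duplication plus a unique tail

for block_size in (512, 4096, 65536):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    unique = {hashlib.sha256(b).digest() for b in blocks}
    stored = len(unique) * block_size + (len(blocks) - len(unique)) * 64
    print(f"block size {block_size:>5}: {len(blocks):>4} blocks, "
          f"{len(unique):>3} unique, saving {1 - stored / len(data):.1%}")
```

With this data the middle size wins: at 512 bytes the per-reference overhead eats into the savings, and at 64 KiB the blocks straddle the unique tail, so fewer of them match.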

In general, multimedia content is likely to observe minimal storage savings from deduplication while traditional content (text and documents) is likely to observe large savings.


I agree with the thrust of your post, certainly that block size matters (though not only because of deduplication; it also affects communications and other network effects), but I’m not sure how you reach such a definitive conclusion as the one quoted above.


I offered an explanation yesterday regarding deduplication and, without thinking, implied that deduplication works on both private and public shares.

Does deduplication work on all ingress to the network or just public shares?

I also wonder whether deduplication will be exposed via an API, so apps can offer feedback on what is unique. You could almost make a competitive app for youngsters, whereby they hunt for files that are yet to be uploaded… rounding up the data… with the knowledge that they might get some pocket money. :smile:


It works for all ImmutableData, private or public.
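The reason this can cover private data too is content addressing: a chunk is stored under a name derived from its contents, so two uploads of identical data map to the same chunk and it is only stored once. A toy sketch of the idea, not the actual implementation (the real network self-encrypts data first, in a way that preserves this convergence):

```python
# Toy content-addressed store: identical chunks dedupe automatically
# because the storage key is derived from the chunk's contents.
import hashlib

store: dict[bytes, bytes] = {}

def put_chunk(chunk: bytes) -> bytes:
    name = hashlib.sha256(chunk).digest()  # content-derived address
    store.setdefault(name, chunk)          # a second upload is a no-op
    return name

a = put_chunk(b"the same bytes")
b = put_chunk(b"the same bytes")
assert a == b and len(store) == 1          # stored once, addressed twice
```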

Nice idea, thinking now …


Now that we’re at Test 9, is it too early to ask what the deduplication rate is for the test data being stored, and how comparable this rate is expected to be to that of the live SAFE Network? It seems to me that deduplication will be the make-or-break figure of merit for the SAFE Network idea.
Thanks.
