Grouping vaults by owner, to strengthen redundancy?

@dirvine “A bit confusing” — whoa, that’s an understatement :wink: Probably another safety level. Is this extra level of encryption done by the vault itself? Or are other vaults or managers responsible for that? Can I do a read-out of my vault and see which chunks are inside?

What do you mean by “which”?

If the vault on my computer is storing 20,000 chunks, can I see them, scan them? See their names? I know they will look like random data, but when there’s a very popular music clip on SAFE Net, and someone does self-encryption on that video and shares the chunks, can I spot one of these popular chunks in my vault if it’s in there?

How is the metadata (what chunks there are, decryption keys…) for a publicly shared file stored?
This metadata is also distributed around the network somehow, correct?
Then it follows that if you had access to that metadata, identifying how to fetch and decrypt a file, and you had copies of all this data on your own drives, you wouldn’t need external help from the network to read a file stored in your vaults.
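
To make that reasoning concrete, here’s a rough sketch of the idea (purely my own illustration; the ChunkRef fields and the decrypt callback are placeholders, not the real data-map or self-encryption format): holding the map plus every referenced chunk locally would, in principle, let you rebuild the file without touching the network.

```python
# Illustration only: a hypothetical "data map" listing each chunk's network name
# and whatever is needed to decrypt it. If every referenced chunk is already in
# your own vaults, no external fetch is required.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ChunkRef:
    chunk_name: bytes    # network address of the encrypted chunk (hypothetical field)
    decrypt_key: bytes   # placeholder for whatever the map stores to decrypt it

def reassemble_locally(data_map: list[ChunkRef],
                       local_chunks: dict[bytes, bytes],
                       decrypt: Callable[[bytes, bytes], bytes]) -> Optional[bytes]:
    """Return the file only if every referenced chunk is held locally."""
    parts = []
    for ref in data_map:
        blob = local_chunks.get(ref.chunk_name)
        if blob is None:
            return None                  # missing a chunk: you'd need the network after all
        parts.append(decrypt(blob, ref.decrypt_key))
    return b"".join(parts)
```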

Of course this is all highly unlikely if you have 500GB of capacity in a multi-PB network.
But I’ve seen at least one person on here who plans to add about 100TB of space over 30 vaults.
Especially in the early days, this would certainly be a significant percentage of the entire SAFE network, and those are the cases I’m concerned about.

Chunks are up to 1MB in size, correct?
Then for small files, it doesn’t seem too unlikely that a person with 100TB of space may store all the chunks of an entire file, and for large files, several - or even all - copies of the same chunk.

I did search, and didn’t find it. Would you mind linking to it?

Yes :smile:

No, you cannot tell what you hold, neither the name nor the content :smile:

The vault holds data that is obfuscated based on the original name. When the network does a GET on your vault with the original name, the vault goes through a decryption process to establish whether you have the original content.
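
As a very rough sketch of what name-based obfuscation could look like (my own toy example, not the actual vault code; the SHA-512 keystream and XOR details are placeholders): the stored bytes only turn back into the original content when the original name is presented.

```python
# Toy illustration of name-based obfuscation, not the real vault implementation.
import hashlib

def _keystream(name: bytes, length: int) -> bytes:
    """Derive a repeating key stream from the chunk's original name (placeholder scheme)."""
    key = hashlib.sha512(name).digest()
    return (key * (length // len(key) + 1))[:length]

def obfuscate(name: bytes, content: bytes) -> bytes:
    """Obfuscate content with a key stream derived from its original name."""
    return bytes(c ^ k for c, k in zip(content, _keystream(name, len(content))))

def on_get(stored: bytes, requested_name: bytes) -> bytes:
    """On a GET with the original name, reverse the obfuscation to recover the content."""
    return obfuscate(requested_name, stored)   # XOR is its own inverse
```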

If you are talking about public data only, then you could go to some lengths and try to work out whether you have a chunk; not easy, but possible. Not for private data, though. If it is public data then there is much less to worry about, I would imagine. It’s private data I am much more concerned about staying private. Public data by default would be identifiable at some stage, so it should not be a surprise; the issue is whether you know you are storing it, and no, you don’t, unless you go to lengths beyond what a normal user would.

privacy of public data is a whole other subject :wink:

It is very unlikely unless the network is very, very small or very imbalanced; our tests so far show you need circa 3X the network size to achieve a group (one chunk replica). Of course, if you have infinite resources it will always work eventually, but it gets a bit mad at that stage.

Yeah, I am only talking about public data. Private data (i.e. data that requires information not stored on the network to access) is of course inaccessible to people who don’t have that information, I’m aware of that :wink:

I also didn’t mean to imply that it would be easy/obvious, but, while I am not a lawyer, I could imagine governments attempting to force large farmers to go to any extent necessary to identify what they are storing, and that’s what worries me.

Thank you, that is a relief to hear. Running 75% of the network does indeed seem a little unlikely :slight_smile:
That said, I am planning to contribute many terabytes of space on several hundred Mbit/s myself once the network is advanced enough for realistic usage (depending on whether I have the funding for it and whether I predict farming will cover my costs), and I’d be happy to do real-world tests then.

https://forum.autonomi.community/t/potential-way-to-weed-out-illegal-content/568/53

If you plan to have 100TB, what makes you think you’ll be the only one?
Also, it takes a handful of losers like myself donating a small fraction of their new HDDs (100 folks with 30% of their new 4TB HDDs) to offer more capacity than you’ll have.
I predict you will never have more than 5% of the network capacity, even if you launch on Day 1.
As I mentioned above, you’d need 70+% in various locations around the world.

Yes if it’s public data (not encrypted) you could compare those against chunks of the same file you generated yourself and realize they both belong to the same file, but in order to do that it would take a huge amount of compute power and be relatively useless knowledge since the files are apparently public. Furthermore, one could append random garbage to files to make this kind of search and comparison much less likely to succeed in a timely manner.
All this, by the way, was discussed on the forum several months ago.
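
Something like the comparison described above could look like this (a hypothetical sketch; the fixed-size chunking and SHA-256 hashing here are simplifications, not the real self-encryption scheme, which you would also have to reproduce exactly for any hashes to match):

```python
# Hypothetical sketch: chunk a known public file the same way the network would,
# hash the chunks, and look for overlap with what a vault stores. Real chunking
# and obfuscation would make this far more involved than shown here.
import hashlib

CHUNK_SIZE = 1024 * 1024   # chunks are up to 1MB, per the discussion above

def chunk_hashes(data: bytes) -> set:
    """Split data into <=1MB pieces and hash each one."""
    return {hashlib.sha256(data[i:i + CHUNK_SIZE]).digest()
            for i in range(0, len(data), CHUNK_SIZE)}

def overlap(known_public_file: bytes, stored_hashes: set) -> set:
    """Hashes of the known file's chunks that also appear in the vault."""
    return chunk_hashes(known_public_file) & stored_hashes
```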

And when one of the systems loses all its data due to a failed HDD, your entire ranking gets damaged.
Surely that would be a hugely popular feature! :slight_smile:

I was never talking about tagging data. Just vaults.

I’d say if you’re offering 1.2TB of your hard drive, you’re more dedicated than the average user already :wink:
Anyway, let’s do the maths.
Under the assumption that distribution is initially random (in practice, it would probably shift towards large-scale commercial farmers pretty quickly, as they have 24/7 uptime and much larger bandwidth reserves than the average user), that the total storage space on SAFE is 1PB on day 1 (I mean, I’d love that too, but I think that’s quite an optimistic estimate), and that you provide 100TB of space yourself:

When a chunk replica is uploaded, the probability of you not getting it is 100% minus your share of total storage space (as long as distribution is indeed random), so 90%. As 4 replicas are uploaded, the probability of you not getting a single one of them is 90%^4, so 65.61%.

So for every chunk uploaded, you have a 35% chance of getting at least one copy.

For a small file of 3 chunks, your probability of getting a full copy of it is then 35%^3, so roughly 4.2%, and the inverse, the probability of you not getting a full copy, is 95.8%.

Let’s say someone uploads 100 photos which all fit into 3 chunks each. The probability of you not getting at least one full file then is 95.8%^100, which is… about 1.37%.
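
For anyone who wants to check my arithmetic, here is a quick script with the same assumptions (10% share of a 1PB network, random placement, 4 replicas per chunk, 3 chunks per photo, 100 photos):

```python
# Back-of-the-envelope check of the figures above; assumes perfectly random
# chunk placement, which real churn and vault ranking would change.
share = 100 / 1000                 # 100TB out of 1PB
replicas = 4
chunks_per_file = 3
files = 100

p_chunk = 1 - (1 - share) ** replicas     # >=1 replica of a given chunk: ~34.4%
p_file = p_chunk ** chunks_per_file       # a full 3-chunk file: ~4.1%
p_none = (1 - p_file) ** files            # no full file out of 100: ~1.6%

print(f"per chunk: {p_chunk:.2%}, per file: {p_file:.2%}, none of 100 files: {p_none:.2%}")
# Using the exact 34.39% instead of the rounded 35% gives ~1.6% rather than 1.37%,
# i.e. roughly a 98% chance of holding at least one full photo under these assumptions.
```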

@dirvine has stated that in practical tests this isn’t a problem, so I’m probably wrong somewhere, and I’d appreciate it if someone could point out where. It’s not like I want this problem to be real :wink:

Getting slightly off topic here, but if you run a server farm, you don’t let a hard drive failure destroy your data :slight_smile:

Vault tagging: okay, I misread that.
Couldn’t a colluding group tag a bunch of vaults with the same tag, thereby rendering the network unable to store enough copies? Also, as a big farmer I would never tag my vaults, because that would mean less business for me.
So I doubt the idea is feasible.

You’d have a full copy [of a chunk] with only one replica. Oh, you mean all 4 replicas. Okay.

For private data you can’t tell which vaults (if any) hold those chunks. You’d have to destroy 100% of your investment in vaults (assuming you’re renting the h/w), and even then:

  • You would not be 100% sure that you could cause data loss
  • Impacted power users would lose just a fraction of their data
  • It would require enormous concentration (10%) across multiple geographically distributed locations

That doesn’t sound too concerning to me, but then again I am a guy who’s been consistently stating on this forum that I would only store copies of my data on the network. If the disappearance of a file bothered me, I’d upload it again.

Sure, but you’re already disadvantaged by the fact that you’re running a professional operation (you pay for hosting, maybe tax, etc.) vs. people who run vaults at home to offset the cost of an HDD they had to buy anyway.
Now on top of that you want to sacrifice another 35% of your capacity on RAID6. I don’t think you’ll get very far with that approach… Did I mention your competitors also don’t pay anything for the bandwidth?
But if you’re so sure, that’s great - rest assured that the rat race will commence as soon as it becomes clearer (say, in beta) how the financials work out.
Anyone can spawn 100TB worth of “fat” EC2 instances at short notice :wink:

hs1.8xlarge: $3,968 upfront + $2.24 per hour, or $5,997 upfront + $1.81 per hour

But my biggest doubt is reserved for the idea that running a hosted farm can compete with storage from the same provider. How? If someone like Amazon charges you X, and you and MaidSafe add a 10% mark-up on top of that, how can that be competitive vs. the same user saving his data to S3?

I took a quick look at Google’s pricing: each GB costs $0.04 to store (per month) and $0.01 to serve (once). If you store 1TB and serve it once a month, that’s $600/year per TB, or $60K/year for your little 100TB farm. Why would anyone pay you $70K for a service they can get from The Unevil Co. for $60K/year? Maybe my calculation is wrong, but I also tried this wizard (http://calculator.s3.amazonaws.com/index.html) and 1TB of data storage with 10TB of monthly transfers to the Internet costs $1,300/month, 90% of which is bandwidth charges. Sure, you could say you won’t need that much, but how much exactly will you need? If you just store data and keep it there, you won’t make any money. And in any case, to download the 1TB of garbage backups I have on S3 it’d cost me $10, but to do the same from your farm hosted there would have to cost at least $12. And you’d be getting just 1/4 of my requests.
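
Rough numbers behind that, for anyone who wants to tweak them (my own script, using the per-GB prices quoted above; real cloud pricing varies by region and tier):

```python
# Yearly cost of storing and serving data at the quoted per-GB prices.
store_per_gb_month = 0.04      # $/GB/month to store (figure quoted above)
serve_per_gb = 0.01            # $/GB to serve once (figure quoted above)
gb_per_tb = 1000

monthly_per_tb = gb_per_tb * (store_per_gb_month + serve_per_gb)   # store + serve once = $50
yearly_per_tb = 12 * monthly_per_tb                                # = $600
farm_tb = 100                                                      # the "little farm" above

print(f"${yearly_per_tb:,.0f}/TB/year, ${yearly_per_tb * farm_tb:,.0f}/year for {farm_tb}TB")
```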

Okay, so this was a random rant, but what ties it together is my belief that many posters think that not dealing with $ or other fiat gives them magical economic powers. It doesn’t.

One last OT observation: I noticed that Amazon asks for an up-front payment for those instances (see the $3,968 upfront figure above). Not knowing whether you’ll ever get any requests for the data stored on your box turns this into a big gamble on multiple factors (who does what, how SAFE fares, etc.).
I do not believe that farmer concentration and professional farming will be as pronounced as is the case with Bitcoin miners and their pools.

Only if those “tags” are open to anyone. If you need to, say, present a certificate to associate your vault with others, the worst you could do is prevent the network from storing as much data as you’d like on your own vaults - you couldn’t decrease the capacity everyone else has.

Not necessarily; the business you get would just come from different places. Subject to good implementation, of course.
Besides, you might have to do that to avoid legal action or getting your servers shut down for not attempting to identify “bad” data - who knows what the legislators and courts of the world will come up with.

For a copy of a file that is split in three chunks (which is, I believe, the minimum number of chunks a file may be split into), each of which would be replicated four times, you’d need one replica of each of the three chunks that make up the file.

I’m concerned about both situations - one person getting a full public file, enabling them to know what data is stored on their drives, and one person getting all replicas of one or more chunks, compromising redundancy.

Fair enough.

Also, I’m not going to argue whether large-scale commercial farming is viable/competitive in this thread; we can take that to another thread or PMs if you’d like. I will say that I’m excited to see how things turn out in the beta and beyond, though :wink:

Do you have any plans for splitting the HDDs and assigning them to multiple vaults?

I’m not sure yet, maybe I will. I’ve yet to read up on the advantages of such an approach.

From: Price discovery for purchase of network resources - #56 by fergish

I asked the question, as vault images will be made available as ‘docker files’, which will be an easy way to split an HDD into many vaults.

I see. I’ll do that then.
The technical side should be no problem, anyway. I assume we’ll be able to specify resource limits in a config file?

I would think so, not aware of the details though.
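
Just to illustrate what that might look like in practice (entirely hypothetical: the image name, paths, and limits below are made up, since no official vault image exists yet), Docker already lets you cap memory and CPU per container, so one HDD could host several vaults along these lines:

```python
# Hypothetical sketch: start four vault containers on one HDD, each with its own
# data directory and resource caps. Image name and paths are placeholders.
import subprocess

HDD_MOUNT = "/mnt/big_hdd"     # hypothetical mount point of the shared HDD
VAULTS = 4

for i in range(VAULTS):
    subprocess.run([
        "docker", "run", "-d",
        "--name", f"vault{i}",
        "--memory", "1g",                        # RAM cap per vault
        "--cpus", "0.5",                         # CPU cap per vault
        "-v", f"{HDD_MOUNT}/vault{i}:/data",     # each vault gets its own directory
        "safe-vault-image",                      # placeholder image name
    ], check=True)
```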

The argument is that several chicken eggs in a basket are safer than one ostrich egg in the same basket.

I hope people do split vaults as it adds to the number of good nodes on the network (safety in numbers does work in decentralised systems).

Doesn’t this assume that only the good guys will split their vaults? I believe it will be the opposite: splitting the vaults among multiple farm instances artificially increases the number of “good” nodes at the slight expense of performance (because of less efficient use of compute resources).
If I were the bad guys I’d use the same approach, but I’d take it to the extreme because I wouldn’t care about being in the farming business anyway.

If it makes sense for the good guys to run up to 4 farms per HDD, there’s no reason why the bad guys wouldn’t run 40 per HDD. At the very least they too would run 4 farms per HDD, so the overall effect on the network would in the end be nil. (The little guy with an ARM-based box who can barely run one would probably be hurt.)

Read and write requests are not supposed to be concentrated on any individual instance and you should assume that if an instance is busy there’s a reason for that (like, it’s making SAFE coins for you). The main problem with this approach is a small performance hit and more labor-intensive maintenance.

Let’s wait and revisit this once the beta is out!

When the inventor of SAFE comes out and declares:

I hope people do split vaults as it adds to the number of good nodes on the network (safety in numbers does work in decentralized systems).

…and someone counters with

I believe it will be the opposite

I think I’ll stick with the inventor’s recommendation.
