Can the SAFE network replace the current internet?


#1

The idea of SAFE replacing the current internet is seductive, but what would that really look like? This post is a back-of-the-envelope look at what the SAFE network would be like if it gained significant traction.

Total Data

In 2014 Facebook stored over 300 PB of data at a rate of 600 TB / day (source).

Let’s say it’s ten times more 4 years later in 2018, and then another ten times more to account for Google / Amazon etc.

That makes an estimate of 30 EB total and 60 PB of data per day.
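The arithmetic behind these figures can be sketched as follows (the 2014 Facebook numbers come from the cited source; the two 10x multipliers are the stated assumptions, not measured values):

```python
# Back-of-envelope data totals. All inputs are rough assumptions
# from the post, not measured values.
PB = 10**15  # petabyte in bytes
EB = 10**18  # exabyte in bytes

facebook_2014_total = 300 * PB        # ~300 PB stored (2014)
facebook_2014_daily = 600 * 10**12    # ~600 TB/day ingest (2014)

growth_4yr = 10    # assumed 10x growth 2014 -> 2018
other_giants = 10  # assumed 10x to cover Google, Amazon, etc.

total_data = facebook_2014_total * growth_4yr * other_giants
daily_data = facebook_2014_daily * growth_4yr * other_giants

print(total_data / EB)  # 30.0 (EB total)
print(daily_data / PB)  # 60.0 (PB per day)
```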

Farmers

There are about 2 billion Facebook users and about 1 billion WeChat users. I think combining them into 3B users is a reasonable estimate for the number of people who might possibly be farmers.

I don’t think it’s reasonable to use individuals as the measure here since companies with data centres will probably control the majority of farming resources (especially bandwidth), but anyway let’s go with the ideal ‘distributed among end users’ idea.

Initial Storage

Since chunk names are distributed very evenly, the distribution of data should be uniform for each farmer.

Assuming an average of 50 farmers per section, there are about 3B / 50 = 60M sections on the network.

30 EB / 60M sections = 500 GB per section

So every farmer would need to store around 500 GB (assuming every vault stores every chunk in its section).

This seems very achievable.
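A quick sketch of the storage arithmetic, using the assumed figures above (3B farmers, 50 farmers per section, 30 EB total):

```python
# Storage per farmer, assuming chunks are spread uniformly across
# sections and every vault stores its whole section's data.
farmers = 3_000_000_000     # assumed potential farmer count
farmers_per_section = 50    # assumed average section size
total_data_eb = 30          # from the earlier estimate

sections = farmers // farmers_per_section          # number of sections
gb_per_section = total_data_eb * 10**9 / sections  # 1 EB = 10**9 GB

print(sections)        # 60000000 (60M sections)
print(gb_per_section)  # 500.0 (GB per section, i.e. per farmer)
```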

End-User Bandwidth

When storing new data, bandwidth requirements are fairly low.

60 PB per day / 60M sections = 1 GB / day per section = 0.1 Mbps

What is the ratio of read to write? Maybe 10 times more reads? That makes bandwidth 1.1 Mbps continuous (ie not accounting for diurnal peaks etc).

1.1 Mbps seems quite manageable (see Internet Speeds By Country). But this is just to service end users; there’s also the bandwidth consumption for network activity (eg churn, consensus, message routing etc).
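The bandwidth arithmetic can be sketched like this (the 10:1 read-to-write ratio is the assumption from above):

```python
# End-user bandwidth per section (== per farmer, if every vault
# stores and serves the whole section's data).
daily_pb = 60             # PB of new data per day (estimate above)
sections = 60_000_000     # from the earlier estimate

gb_per_section_per_day = daily_pb * 10**6 / sections  # 1 PB = 10**6 GB
write_mbps = gb_per_section_per_day * 8 * 1000 / 86400  # GB/day -> Mbps
read_mbps = 10 * write_mbps                             # assumed 10x reads

print(round(write_mbps, 2))              # ~0.09 Mbps for writes
print(round(write_mbps + read_mbps, 2))  # ~1.02 Mbps total (continuous)
```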

Churn

When a vault starts, it would need to store 500 GB of data to join the section. That’s a lot of data! If this were to take one day it would need 50 Mbps continuous data flow.

This is supplied by all farmers in the section, so at 50 farmers per section it should be about 1 Mbps of additional data per farmer to bring a new vault into the section. Combined with end-user bandwidth of 1.1 Mbps, it’s about 2 Mbps.

This also happens when a vault is relocating.

This seems like quite a significant factor and puts some constraints around the desired frequency of relocating.
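A sketch of the churn arithmetic (the post rounds these figures to 50 Mbps and 1 Mbps):

```python
# Bandwidth to bring a new vault up to the section's full data set.
section_gb = 500          # data the joining vault must fetch (estimate)
join_days = 1             # assumed time allowed for the transfer
farmers_per_section = 50  # assumed section size

# Sustained rate the new vault must download at:
new_vault_mbps = section_gb * 8 * 1000 / (join_days * 86400)
# Shared among the existing vaults that supply the data:
per_farmer_mbps = new_vault_mbps / farmers_per_section

print(round(new_vault_mbps, 1))   # ~46.3 Mbps for the joining vault
print(round(per_farmer_mbps, 2))  # ~0.93 Mbps extra per existing farmer
```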

Considerations

Maybe users will run more than one vault at a time. This means bandwidth and storage per vault is less, but it doesn’t really reduce the per user requirements.

Some farmers will contribute a lot of resources and others not much. But a farmer that can’t ‘keep up’ with the section will be penalized, so it seems like there will be a tendency for large resource providers to have an advantage in that sense.

Maybe not every vault in a section will store every chunk for that section. This would reduce the storage and bandwidth demands, but also reduce the redundancy and security of data.

Maybe joining and relocating will be a gradual process rather than get-all-the-data-right-now, which would lower the bandwidth demand. All the same, there is some need to complete the move eventually and it can’t take too long.

Summary

A network of 3B farmers would be looking at approximately 500 GB storage per farmer and 2 Mbps bandwidth consumption. That’s a surprisingly achievable target.

But do the assumptions hold up? There must be some holes or improvements to this reasoning. What do you think?


#2

As a “back of envelope” (not Rudd’s though) analysis I think it’s a reasonable assessment.

I think some work will be needed on this. David has suggested that they are looking at not all vaults in a section storing all chunks for that section. I think this is very important.

If not every vault were required to store all chunks, then the requirement would be to store at least the minimum number of copies of each chunk. So if a section had 16 vaults and 8 copies per chunk, only half of the section’s data would need to be stored on any one vault on average.

This also allows a new vault to store the new data arriving at the section, plus only whatever older data is needed to make up the minimum copies of any chunk.

Oh I see that you mentioned this

I also disagree that the major portion of farmers would be data centres, but I will not discuss this here since it’s not part of the topic. They will be a portion of vaults, but I doubt a major portion.


#3

I was a little surprised at how little would be needed if SAFE became a major portion of the internet.


#4

The other thing to consider is farmers may have 10 or 100 vaults to ease the burden of relocations and joins (more sections but less data per section). It will come down to the economics of safecoin in the end.


#5

Not sure what you mean here. Is it the % burden? Or are you thinking some actual work is saved because of this?

I would think that with the random network design and 60 million nodes/vaults, there is little chance that a node/vault relocates to a section that one of its owner’s other 100 vaults is in.

Also have you taken into account that hopping of the data occurs?

So with “Y” nodes there are “X” hops (avg) and say “Z” GETs. Thus each vault will have an average of ( Z * X ) / Y chunks pass through it. Now if we assume each node/vault serves an evenly distributed number of GET retrievals, then we get

Retrievals == Z / Y
chunk downloads from hopping == (Z*X)/Y
total chunk uploads == (Z*(X+1))/Y

This of course alters your bandwidth down and up, doesn’t it.
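The hop model above can be sketched directly (Y, X and Z as defined in the post; the numbers plugged in are purely illustrative):

```python
# Per-vault chunk traffic from the hop model: Y vaults, X average
# hops per GET, Z total GETs on the network.
def per_vault_traffic(Y, X, Z):
    retrievals = Z / Y            # GETs served directly per vault
    relay_downloads = Z * X / Y   # chunks passing through while hopping
    uploads = Z * (X + 1) / Y     # relayed chunks plus final deliveries
    return retrievals, relay_downloads, uploads

# Illustrative (made-up) numbers: 60M vaults, 5 hops avg, 1B GETs.
r, d, u = per_vault_traffic(60_000_000, 5, 1_000_000_000)
print(round(r, 1), round(d, 1), round(u, 1))
```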

Well, farming rewards are meant to cover all the costs of farming (incremental costs of electricity, disk, bandwidth). For home farmers this should be plenty, although for data centres, well ???


#6

In 2007 the total amount of stored data was estimated to be close to 300 EB. This also includes data that’s not available through the internet, though.

Pdf for the study
http://sci-hub.tw/http://science.sciencemag.org/content/early/2011/02/09/science.1200970


#7

Those are rookie numbers in this racket… :joy:

I won’t be surprised if this place in the Netherlands gets highly concentrated with data centers.

This will be such a rude awakening for countries that kept their internet speeds deliberately low…


#8

I honestly struggle to see a world where people/entities don’t try to amass massive farming operations. Something about humanity makes us drive towards centralization like entropy, it seems. I wonder what would happen if you have 3 or 4 operations trying to supply 50% of capacity. How would the network manage that sort of centralization? Or could the network really know? There might be a very simple answer to this on the forum already; if so, drop me a link please. :slight_smile:


#9

I agree, there will be centralised operations, given those will be the better nodes with the higher upload ratios, which will be essential for things like 4K video streaming. That’s not necessarily a bad thing, given the model fully allows for a combination of centralised/decentralised, and farmers can’t see the data either way.


#10

Thanks in advance for your patience with this thought exercise. A quick follow up question…

Theoretically speaking, if one or two groups are that powerful (e.g., ~30% capacity), could they collude to wreak havoc kind of like Bitcoin miners have done (e.g., threaten to go offline all at once unless certain stipulations are met)? In the mid- to long-term, the intersection of supply, demand, and price would restore stasis to the network, but in the short term, how might the network react/cope?


#11

Since the data is distributed approximately randomly across all sections, this would be an impossible situation. On the other hand, if one or two data centres had that capacity, then look up the “google attack” topic for a lot of good discussion about what happens if an attacker tries this.

Since one of Maidsafe’s goals is home farming, I am confident that the reward structure will be such that data centre farming, with data centres having to pay for all their inputs, is unable to compete successfully on a global scale against home vaults with their 100 Mbps to 1 Gbps to 10 Gbps bandwidths. So while data centres could turn a profit, they would still be looking to get income from renting their services out to those still needing them, as that would be more profitable. Thus they would only farm with their “hot” spare capacity.

Also, the distribution of home vaults across all the ISPs means that bandwidth choke points are more distributed than with massive data centres, where tens or hundreds of thousands of vaults all require the use of a limited number of 10 or 40 Gbps links to the backbone. By contrast, tens or hundreds of thousands of home vaults are spread across hundreds of countries, each with tens to hundreds of ISPs and their own backbones into the internet.


#12

That was a nice set of approximations to see played out, thanks mav.

A more recent set of figures can be found in IDC’s Data Age 2025 study, sponsored by Seagate, April 2017. They estimate that the current total amount of digital data is somewhere around 25 ZB and will grow to 160 ZB by 2025. It will probably go higher. I wonder if these figures include redundancy and data duplication? Probably no and yes. So 8x redundancy coupled with an 8x reduction due to deduplication probably makes this a reasonable estimate.

jlpell’s other guesstimations:
Most dedicated desktop users who want to get involved will throw 500GB to 10TB at SAFE on launch. Prosumers will throw about 10TB to 100TB and mobile, timid, or just curious will be about 10GB to 100GB. Business ventures will be in PB. Depending on timelines, these numbers might be higher by a factor of 2x. I don’t think storage will be the issue, and the surplus storage will allow for extra redundancy to help spread out bandwidth load. My hypothesis is that working with ISPs and forming new ones or mesh networks in order to get low latency and stable connectivity in an evolving regulatory landscape will be more difficult as popularity rises.

Yes, from a basic user’s point of view no one wants to sit and wait for a 500GB download. I think about 1 hour is a psychological limit for most, before they need to start seeing some kind of safecoin flow their way no matter how small. Current typical broadband speeds allow for a single 10GB vault to be filled in about an hour. Multiple 10GB vaults could be run in series to fill up a 1TB drive, but there are limits to going the multi-vault route based on the number of processor cores and computational requirements for each node process.

Tiered groups ranked by performance level might alleviate this, i.e. smaller groups of higher-performance nodes vs. large groups of lower-performance nodes. This keeps Google competing with Amazon, and us competing with each other. Kind of like weight classes in sumo wrestling :smile:. Computation is going to need nodes that are clustered by performance level anyway, although nothing says that computation nodes and storage nodes need to coincide on the same machine.


#13

My first thought is how much of that is backup data. I’d expect that a lot of big businesses keep a number of full copies of their data, some of it 5, 10 or 15 years old, with each latest full backup containing all the data from the previous backup plus changes. Then of course there are all the incremental backups.

Does that figure include the RAID parity blocks?

Maybe then we could halve the 25 ZB figure when excluding all the backup media and RAID parity blocks.


#14

I do not like Facebook, Twitter etc. This is why I joined safenet. Thanks for the calculations and explanations, it helps a lot. Thank you.


#15

How much of that was pointless data only of value to Facebook and their advertisers?


#16

It’ll never ever replace it; that’s what people don’t understand: it’s a layer. Just because we have airplanes we don’t replace cars, and the same goes with this. This is why creating conduits via browsers, especially existing browsers, but also on clearnet websites and apps, is critical. It’s what’s going to define our success. If we can come up with ways to do this we’ll change the world.


#17

The clearnet is the insecure net.

It’s also the sponsored, captured, censored, spied-upon net.

Layers are Quantum SAFE, and SAFE on soft radio mesh, and SAFE on LiFi, and SAFE on longer-distance line-of-sight free-space optical, and SAFE on some of these new open sat nets.


#18

I see some interesting algorithm here.
When a new node joins a section, how will priority be set for data to be stored in it?

First of all, I would assume we want to ensure minimum copies of a chunk.
That would always be priority.
After that, storing new data should be important for at least a couple of reasons:

  1. Providing storage for the network as soon as possible.
  2. Enabling faster feedback (rewards) for the joining farmer, so as to increase uptake / minimise dropout.

But if a section does not require all nodes to keep a copy, then letting a new node take part of the stream of incoming new data could mean some older node misses out (given that it still has free space), i.e. sees a decrease in its receiving rate. There could be a couple of opposing interests and needs here.
The rationale behind a priority for this is an interesting question, with lots of dynamics.


#19

The rule is simple: a node stores a chunk of data if the address of the node is near the address of the chunk; more precisely, if the node belongs to the group of the 8 nodes nearest to the chunk. There isn’t any priority, and this rule is valid for any data (old or new) and any node (old or new).

8 is a parameter of the network that may need to be adapted after some simulations and tests. Also, datachain will add the notion of archive nodes. But beyond those 2 elements, I don’t expect this rule to be fundamentally modified in the future.
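As a sketch, assuming addresses live in an XOR address space (Kademlia-style closeness, with the group size of 8 mentioned above), the rule might look like this; the function names are illustrative, not from any actual implementation:

```python
# Sketch of the "8 nearest nodes store the chunk" rule, assuming
# XOR distance between addresses (Kademlia-style).
def xor_distance(a: int, b: int) -> int:
    return a ^ b

def storers(chunk_addr: int, node_addrs: list[int], copies: int = 8):
    """Return the `copies` node addresses XOR-closest to the chunk."""
    return sorted(node_addrs, key=lambda n: xor_distance(n, chunk_addr))[:copies]

# A node stores a chunk iff it appears in this set; no other priority.
nodes = [0b0001, 0b0010, 0b0100, 0b1000, 0b1111, 0b1010]
print(storers(0b0000, nodes, copies=3))  # [1, 2, 4]
```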


#20

How about redundancy? It’s 4x by default, right? 2 TB. The content you mentioned is personal content, so deduplication doesn’t help. But maybe the Facebook figures were with redundancy; then 500 GB is correct. Somebody mentioned RAID-like blocks for Safe Network that give more efficient redundancy.

Correct. On to the next question:

Not at all. Most farmers will be small (running one vault), some will be bigger (several vaults), fewer will be big (many vaults), and a few will be huge (a datacenter full of vaults). If we apply Pareto’s Law recursively for a back-of-envelope calculation:

  • 80% of the data will be served by 20% of the farmers.
  • 64% of the data will be served by 4% of the farmers.
  • 51.2% of the data will be served by 0.8% of the farmers.
  • 40.96% of the data will be served by 0.16% of the farmers.
  • 32.77% of the data will be served by 0.032% of the farmers.
  • 26.21% of the data will be served by 0.0064% of the farmers.
  • 20.97% of the data will be served by 0.00128% of the farmers.
  • 16.78% of the data will be served by 0.000256% of the farmers.
  • 13.42% of the data will be served by 0.0000512% of the farmers.
  • 10.74% of the data will be served by 0.00001024% of the farmers.

Some of those in words:

  • Over half of the data will be served by less than 1% of the farmers.
  • Almost a third of the data will be served by less than 1/3000th of the farmers.
  • Over a fourth of the data will be served by less than 1 in 15,000 farmers.
  • Every tenth chunk will come from about 1 in 10 million farmers.

Averages are useless for this.
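The recursive Pareto ladder above can be reproduced in a few lines (each step applies the 80/20 split again within the previous top group, so the top 20%^n of farmers serve 80%^n of the data):

```python
# Recursive Pareto (80/20) ladder: at step n, the top 0.2**n of
# farmers serve 0.8**n of all the data.
data_share, farmer_share = 1.0, 1.0
for step in range(1, 11):
    data_share *= 0.8     # share of all data served by this top group
    farmer_share *= 0.2   # share of all farmers in this top group
    print(f"step {step}: {data_share:.4%} of data "
          f"served by {farmer_share:.6%} of farmers")
```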