Business Uses for the SAFE network

The impact of the SAFE network on the delivery of ‘standard’ web content such as video, images and text is obvious; user experience can and will closely match the current web.

But there’s an interesting talk from Noel Gorelick of Google Earth Engine (source video 17m-40m) that highlights a more specialised use case for the SAFE network.

At the 36:35m mark there’s a slide that explains, from an engineering perspective, why centralization of datasets is happening (i.e. not for political / business / economic / conspiratorial reasons). Data transfer and storage are the major limiting factors on the end goal of running computation and exploring the output. End users want centralization so they can reap its benefits, namely speed of output.

Consumer end users will gladly download a 10 GB movie from SAFE, but industry end users will not download a 1 PB dataset. At a rate of 1 TB per day that would take about 1,000 days, or nearly 3 years. It’s not a matter of desire, it’s a matter of impossibility.

Scientific data is currently being centralized in a few massive data centres because it simply cannot be delivered, even between warehouses. This means computation must happen close to the data.

I admire the work of Google Earth Engine and their goals (the response to “what’s the business model” is “we have to live on this planet too”, i.e. their business is to improve the health of the planet). However, I think they’re wrong about the inevitability of data centralization.

A SAFE app which coordinates map-reduce operations (like Hadoop or BOINC) will be a natural fit for the safe network and hopefully increase the utility of massive scientific datasets. Eventually it will be nice to have compute as a native part of the network, but in the short term I can see enormous value in a grid computing application.

There’s also an interesting slide at 36:03 about the cost of data processing.

Processing 30 years of global satellite imagery costs around 30K euros. 25K of that relates to data (i.e. bandwidth) costs, the other 5K is CPU. This gives an indication of the current market price for data and processing, and some insight into what cost / value / price industrial users can afford to pay for the SAFE network.

A lot of scientific and industrial data work isn’t especially sexy, but I think that industry opens the door to a valuable opportunity for the safe network.

17 Likes

Joyent’s Triton approach would be awesome to bring to the SAFE network (in a far distant future).
Triton is an object store like S3, but with a difference: you can send compute jobs to an object, and they will be executed by the node holding the data.

In a normal setting, you would have a VM which has to retrieve the object before doing any data processing on it.
In Triton, you can leave the object where it is, and just send your awk script or whatever to the object to process it.

3 Likes

That’s actually very cheap.

Just saying; this isn’t solving the problem, but it gives some perspective.

1 TB per day represents approximately a 100 Mbit/sec business link (i.e. one you are able to max out).

Businesses using data centres can get at least 10 times that bandwidth, which brings it down to about 100 days. If they can afford a 10 Gbit/sec network then it’s about 10 days. If they can work with the data centre to get one of its 40-100 Gbit/sec links, they can do it in 1-2.5 days. This assumes the source can deliver at that rate, which is the less likely situation.
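For anyone who wants to check the arithmetic in this exchange, here is a rough sketch. It assumes a decimal petabyte (10^15 bytes) and a link that can be saturated end to end for the whole transfer; `transfer_days` is just an illustrative helper name.

```python
SECONDS_PER_DAY = 86_400
PB = 10**15  # assumption: decimal petabyte; the thread does not say decimal or binary


def transfer_days(dataset_bytes: float, link_bits_per_sec: float) -> float:
    """Days needed to move dataset_bytes over a fully saturated link."""
    seconds = dataset_bytes * 8 / link_bits_per_sec
    return seconds / SECONDS_PER_DAY


# Link speeds mentioned in the thread, in bits per second.
links = [
    ("100 Mbit/s", 100e6),
    ("1 Gbit/s", 1e9),
    ("10 Gbit/s", 10e9),
    ("40 Gbit/s", 40e9),
    ("100 Gbit/s", 100e9),
]

for label, rate in links:
    print(f"{label:>10}: {transfer_days(PB, rate):8.1f} days")
```

At 100 Mbit/s this comes out at roughly 926 days, which is where the "about 3 years" figure comes from, and each tenfold increase in bandwidth cuts the time by ten.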

Now SAFE will allow them to use multiple links and get data as fast as they can process it, since it’s utilising the world’s internet structure to supply their multiple links.

4 Likes

If a vault could do computation on the data it stores, that would open up lots of applications. There would be some specific requirements for the data though. First, the data would have to be split into chunks where each chunk is a piece of data that can be independently processed as part of a larger computation. Second, the vault would have to be able to decrypt the data, unless something like fully homomorphic encryption were used.

At least for public data it shouldn’t be an issue. If a public dataset were stored on SAFE and each chunk represented the data for a specific entity (in JSON format, for example), then a user could send a request for the data with a specific ID, together with a script for the vault to run on the data before returning it to the user. Lots of these could be run in parallel.
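To make the idea concrete, here is a toy sketch of "send the script to the chunk". Nothing here is a real SAFE API: `CHUNKS`, `vault_run` and `map_reduce` are hypothetical names, the vaults are simulated by a local dict, and the "script" is an ordinary Python function rather than sandboxed code.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical public chunks: one JSON document per entity, keyed by ID.
CHUNKS = {
    "id-1": json.dumps({"name": "tile-a", "readings": [1, 2, 3]}),
    "id-2": json.dumps({"name": "tile-b", "readings": [10, 20]}),
    "id-3": json.dumps({"name": "tile-c", "readings": []}),
}


def vault_run(chunk_id, script):
    """Stand-in for a vault: parse the chunk it holds and run the script on it."""
    record = json.loads(CHUNKS[chunk_id])
    return script(record)


def total_readings(record):
    """Example per-chunk script: only a small number leaves the vault."""
    return sum(record["readings"])


def map_reduce(chunk_ids, script, reduce_fn):
    """Run the script against every chunk in parallel, then combine the results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda cid: vault_run(cid, script), chunk_ids))
    return reduce_fn(partials)


print(map_reduce(sorted(CHUNKS), total_readings, sum))  # 36
```

The point of the sketch is only that each chunk is processed independently where it lives, and the requester receives small partial results instead of the raw data.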

1 Like

I agree in principle but I think in practice this is hard.

Consider a raspberry pi with 100 TB attached on a 100 MB internet connection - great as a vault for storage, but hopeless as a vault for compute.

This does segue nicely into the idea of capabilities. I think there may be an evolution later in the network where vaults will provide services at a fee depending on their specific assurances of performance (be it storage longevity or bandwidth provision or computation performance).

3 Likes

This would be really great, especially if you could add some metadata when you PUT to describe the capabilities you want. You could then pay a bit more to store the data at nodes that, for example, have 50% more processing power than the network average, have a fast GPU, have an FPGA chip or have some specific minimum bandwidth, etc.

I don’t think this would be desirable as part of the underlying infrastructure (cf. loss of net neutrality), but could perhaps be part of an application layer on top. Otherwise you end up with a balkanised network, which has problems with respect to accessibility, and also could make your “premium” data vulnerable to attack (because it would live on a subset of the network).

Although for computation I can see potential for taking the node’s computing capability into account. It’s no good trying to get a quick turnaround if your number crunching routine is given to a phone node, or if your computation requires certain processor capabilities in order for the task to complete and not take months/years.

Data on the other hand is handled by 8 or more vaults for each chunk so the quickest wins the chance for a reward and so the requirement for “premium” is unnecessary. Very unlikely to get 8 slow nodes for each chunk of your file.

Maybe capabilities could be used to select nodes for some caching but not the basic storage. Data that needs to be processed by GPUs could be cached on vaults with a good GPU score for example.

I see the same principle operating for computation as storage. The problem is given out to the close group, all are required to provide a result or they will get kicked, voting ensures the correct result is returned, fastest qualifies for the reward.
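A toy sketch of that rule, purely as an illustration: `Response` and `settle` are hypothetical names, and requiring a strict majority of the whole group is my assumption, since the post only says that voting decides the result.

```python
from collections import Counter
from typing import NamedTuple


class Response(NamedTuple):
    node: str
    result: int
    seconds: float


def settle(responses, group):
    """Return (agreed_result, rewarded_node, kicked_nodes) for one compute task."""
    tally = Counter(r.result for r in responses)
    agreed, votes = tally.most_common(1)[0]
    # Assumption: the agreed result needs a strict majority of the whole group.
    if votes * 2 <= len(group):
        raise ValueError("no majority result")
    # Only nodes that returned the agreed result compete for the reward.
    correct = [r for r in responses if r.result == agreed]
    rewarded = min(correct, key=lambda r: r.seconds).node
    # Nodes that returned nothing at all are kicked from the group.
    kicked = sorted(set(group) - {r.node for r in responses})
    return agreed, rewarded, kicked


group = ["a", "b", "c", "d", "e"]
responses = [
    Response("a", 42, 0.9),
    Response("b", 42, 0.3),
    Response("c", 41, 0.1),  # fastest, but disagrees with the majority
    Response("d", 42, 0.5),
]
print(settle(responses, group))  # (42, 'b', ['e'])
```

Note that node "c" answers first but earns nothing, because speed only counts among nodes that agree with the voted result.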

So yes, a phone node can be part of the compute network but will be kicked out of the group unless it passes a minimum performance threshold. It has a similar effect, but keeps the access side democratic rather than making it pay to play.

1 Like

Just some musings of mine

I was thinking that the user specifies the minimum compute necessary to complete the task. This allows the phone to only be given appropriate compute tasks to perform and not be kicked out because it cannot perform most of the compute tasks.

To me this makes it more inclusive, so that less capable devices are not refused all compute work.

I do understand though that this may end up being selective. But you could look at it not as one kind of compute, but as separate classes:

  • COMPUTE - something that all devices providing any sort of compute will be competing to do. Small tasks
  • HiCompute - for large compute tasks that are expected to take time and need hi performance devices
  • GPUCompute - for GPU computational tasks

The device specifies what it can do, the user specifies the type of compute, and the appropriate devices will receive the job.
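Roughly what I have in mind, as a sketch only. The three class names come from the list above; `Device`, `offers` and `eligible` are made-up names for illustration, not anything in the network code.

```python
from dataclasses import dataclass, field

# The three compute classes from the list above.
CLASSES = ("COMPUTE", "HiCompute", "GPUCompute")


@dataclass
class Device:
    name: str
    offers: set = field(default_factory=set)  # classes the device says it can do


def eligible(devices, job_class):
    """Names of the devices that declared support for the requested class."""
    if job_class not in CLASSES:
        raise ValueError(f"unknown compute class: {job_class}")
    return [d.name for d in devices if job_class in d.offers]


devices = [
    Device("phone", {"COMPUTE"}),
    Device("workstation", {"COMPUTE", "HiCompute"}),
    Device("gpu-rig", {"COMPUTE", "HiCompute", "GPUCompute"}),
]

print(eligible(devices, "COMPUTE"))     # ['phone', 'workstation', 'gpu-rig']
print(eligible(devices, "HiCompute"))   # ['workstation', 'gpu-rig']
print(eligible(devices, "GPUCompute"))  # ['gpu-rig']
```

The phone still competes for every small task, it just never receives a job it has no chance of finishing.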

This is unlike the “premium” storage concept, which is an illusion because there are 8 copies and most devices have similar performance, which is usually affected by bandwidth and physical location. It also breaks the security model of the storage.

With storage the difference in speeds is small compared to the difference in speeds of compute. Compute speeds can vary by factors of 1,000s or 10,000s, and I suppose that is why I am willing to consider separating compute into at least 3 different and separate components. Devices within each component are treated the same. I also didn’t think a compute task was going to be run by all the nodes of a close group. I was under the impression that something like 3 nodes will execute the compute task and those 3 may not be in the same group. Disjoint groups change this anyhow.

1 Like

Yes, without going through in detail I think the supply side can be fine in various ways.

My concern is mainly demand side, with differential pricing leading to:

  • identifiable sub networks, lowering the barrier to an attack
  • offering the same service at different rates, creating a two tier service that leads to centralisation & partitioning, of capability/security on one and limitations/risk on the other

1 Like

The quickest wins? Is this correct? I hoped this was the case but read somewhere that the node with the name closest to the data chunk in question is the one that actually supplies the data.

https://blog.maidsafe.net/category/technical/vault/

Within the Get section it says:

“The Client sends a message to the ID for the data they are looking for which is picked up by the closest ManagedNode among the DataManager group and responds with the data itself. If there is a problem obtaining the data from this Vault, a short timeout will trigger the second closest ManagedNode to instead respond with the data and so on.”

Yes that is different to my understanding that all the vaults that hold the chunk respond and the quickest is rewarded.

It is 15 months old and not written by David, so there may be some minor points that differ from what is the case now.

Maybe David @dirvine can tell us if https://blog.maidsafe.net/category/technical/vault/ is correct or if it is as I understand it

2 Likes

This is the older (current for now) message routing that is in the code. The fastest node will be the one that delivers the data. It’s very likely they send a tiny message we call a Vote and the recipient gets the data from the first respondent. So it will work as you think, @neo. This is all under discussion right now in the routing design with data chains, so there could be small changes.
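To show the difference between the two behaviours discussed here, a small simulation sketch. It is not the routing code: `closest_first` and `fastest_first` are hypothetical names, latencies are made-up numbers in seconds, and `None` stands for a vault that never answers.

```python
def closest_first(holders, timeout):
    """Blog-post behaviour: ask holders in closeness order, moving on after a timeout.

    holders is a list of (name, latency_or_None), already sorted closest first.
    Returns (responding_vault, total_wait_seconds).
    """
    waited = 0.0
    for name, latency in holders:
        if latency is not None and latency <= timeout:
            return name, waited + latency
        waited += timeout  # this vault failed or was too slow; try the next one
    raise LookupError("no holder responded in time")


def fastest_first(holders):
    """Behaviour described above: every holder responds, the first response is used."""
    live = [(latency, name) for name, latency in holders if latency is not None]
    if not live:
        raise LookupError("no holder responded")
    latency, name = min(live)
    return name, latency


# Closest vault is offline, second closest is slow, third is nearby and quick.
holders = [("v1", None), ("v2", 0.40), ("v3", 0.05)]

print(closest_first(holders, timeout=0.5))
print(fastest_first(holders))
```

With these made-up numbers the closest-first scheme waits out one timeout and then gets the data from v2, while the fastest-first scheme is served by v3 almost immediately.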

7 Likes

Great, thanks for your answers @neo and @dirvine. I’m very happy this is the case.

Given that, I can’t see a scenario where farming isn’t dominated by commercial interests (a desirable outcome in my opinion).

I suppose the whole ‘spare resources of hobbyist farmers’ thing is great in terms of supporting a positive narrative but likely not how this plays out in the long run.

This doesn’t follow. Just because someone has a fast connection to the backbone does not mean they can be first to respond. The lag from your vault to the group can span a quarter of the way around the world, in which case a home user in the same country as the majority of nodes in the group will beat your commercial interest. Once you take the overheads of commercial installations into account, the extra rewards gained from a faster network may not even cover the extra costs.

It’s something that will have to be gauged in the tests. Commercial operation has the drawback of cost overheads; running spare resources in a data centre still costs a lot.

4 Likes

I get the feeling he’s right for maybe the first 5 years and it’s a big plus for the health of the network in the short term

The way I see it, maybe we’ll get mesh networks in the future, but at present we have star-type structures, with peering points / co-location being a way for the little guys to compete… what do you think?