Columnar Data Storage on the Safe Network

Would the Safe Network improve the Big Data space? Despite what many proponents of unstructured data claim, structured data has kept its place at the top of Big Data analytics. The reason is simple: SQL is still the best and easiest way to interact with data for analytics requests, and SQL by design requires data to be stored in a somewhat normalized structure to be usable.

The basics of efficient data storage for query consumption come down to choosing partitions and sort orders so that multiple nodes/AMPs/servers can share the work in parallel as evenly and efficiently as possible.
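To make the partitioning idea concrete, here is a minimal sketch in Python. It is not anything the Safe Network provides: the function names (`partition_rows`, `partial_sum`, `parallel_group_sum`) and the sample columns are made up for illustration, and threads stand in for separate nodes. Rows are hash-partitioned on the grouping column, each partition is aggregated independently, and the partial results are merged.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, key, num_partitions):
    """Assign each row to a partition by hashing its partition key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[idx].append(row)
    return partitions


def partial_sum(partition, group_key, value_key):
    """Aggregate one partition on its own, as a single node would."""
    totals = defaultdict(int)
    for row in partition:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


def parallel_group_sum(rows, group_key, value_key, num_partitions=4):
    """Rough equivalent of: SELECT group_key, SUM(value_key) ... GROUP BY group_key."""
    partitions = partition_rows(rows, group_key, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(
            pool.map(lambda p: partial_sum(p, group_key, value_key), partitions)
        )
    # Rows were partitioned on the grouping column, so every group lives in
    # exactly one partition and the partial results never share a key.
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


if __name__ == "__main__":
    sales = [
        {"region": "eu", "amount": 10},
        {"region": "us", "amount": 5},
        {"region": "eu", "amount": 7},
        {"region": "asia", "amount": 3},
    ]
    print(parallel_group_sum(sales, "region", "amount", num_partitions=3))
```

Partitioning on the column the query groups by is what keeps the merge step trivial. With a poorly chosen key, one partition ends up with most of the rows and the parallelism is wasted, which is why the choice of partition key matters so much.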

Couldn’t the Safe Network revolutionize the storing and retrieving of information in this way? In theory, if the data is distributed intelligently, then running SQL queries against the Safe Network could blow away services like Oracle, Google BigQuery or Amazon Redshift.

The logical evolution here would be that companies consider developing on the Safe Network first, since access to their data would be fastest (and in theory cheapest) there. It could very well be the spark for mass adoption.

Is there any type of structured data storage feature on the roadmap?

There are certainly challenges in doing this, but I do expect SAFEnetwork to be used in this way, and I expect we can work with and learn from other projects facing similar challenges (searching across distributed data sets). It’s not my area, but I’ve come across some work on this in Tim Berners-Lee’s Project Solid, where they have developed techniques for queries that pull data from multiple servers without needing to know anything in advance about those servers or what they hold (see SPARQL ‘federated queries’).

I’m not very technical, but couldn’t big data be drawn directly from the user? Upon installing an app, the user would be asked whether they mind sharing analytics, bug reports, etc. anonymously, using perfect forward secrecy. That way the data stays decoupled and anonymous, whether the public ID represents an open, public person or serves as a private alias. I believe Apple uses perfect forward secrecy to gather analytics.

I think storing isn’t as much the issue here as retrieving. Data is typically gathered transactionally, with many complementary attributes included. Knowing precisely how data will be consumed helps dictate how it should be stored so that it can be retrieved most efficiently. The approaches here tend to converge, because creating smaller partitions for parallel processing is usually the fastest way to retrieve data. This is where the Safe Network could be greatly disruptive: data could be distributed across the network in an intelligent way, allowing extremely efficient retrieval.
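As a rough illustration of storing data according to how it will be read, the Python sketch below sorts rows on the column queries filter by, cuts them into small partitions, stores each partition column by column, and records the min/max of the sort key per partition. Everything in it (`build_partitions`, `sum_in_range`, the `day`/`amount` columns) is hypothetical and only assumes that every row has the same columns. It says nothing about how the Safe Network would actually lay out data.

```python
def build_partitions(rows, sort_key, partition_size):
    """Sort rows on the filter column and split them into small columnar partitions."""
    ordered = sorted(rows, key=lambda r: r[sort_key])
    partitions = []
    for start in range(0, len(ordered), partition_size):
        chunk = ordered[start:start + partition_size]
        # Columnar layout: one list per column instead of one dict per row.
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        partitions.append({
            "min": chunk[0][sort_key],   # smallest sort key in this partition
            "max": chunk[-1][sort_key],  # largest sort key in this partition
            "columns": columns,
        })
    return partitions


def sum_in_range(partitions, sort_key, value_key, low, high):
    """Sum value_key where low <= sort_key <= high, skipping irrelevant partitions."""
    total = 0
    scanned = 0
    for part in partitions:
        # Partition pruning: the min/max metadata shows this partition cannot match.
        if part["max"] < low or part["min"] > high:
            continue
        scanned += 1
        # Only the two columns the query needs are read.
        keys = part["columns"][sort_key]
        values = part["columns"][value_key]
        total += sum(v for k, v in zip(keys, values) if low <= k <= high)
    return total, scanned


if __name__ == "__main__":
    rows = [{"day": d, "amount": d * 10, "note": "n/a"} for d in range(1, 9)]
    parts = build_partitions(rows, "day", partition_size=2)
    print(sum_in_range(parts, "day", "amount", 3, 4))
```

Because the rows are sorted before being split, a narrow range query touches only the partitions whose min/max overlap the range, and the surviving partitions could be scanned on different nodes in parallel.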
