Where do decentralized applications store their data?

Introduction

Now there is a boom in blockchain projects. Some blockchains are so powerful that they claim to be a platform for building applications on top of them. Applications automatically turn out to be decentralized, resistant to censorship and blocking. But is everything so good and simple? In this article, we will try to look at the blockchain as a platform for applications, taking off the rose-colored glasses.

So what the blockchain is?

Blockchain (chain of blocks) is an immutable data structure consisting of a list of blocks where each next block contains a hash of the previous block. As a result of this hashing, the chain of blocks becomes unchanged: you can not change or delete a block from the middle of the chain without rebuilding all the blocks above, because the slightest change will require a rebuild (recalculating hashes) of all blocks above the changed one.

If the calculation of the hash of each block is computationally or economically complex operation, then the data change in the middle of the chain becomes practically impossible at all. The combination of the new block hash calculation complexity, as well as the ease of checking the correctness of the hash, provides blockchain with a serious resistance to unsanctioned changes. This is what holds the safety of bitcoin and other blockchains.

Thanks to this, blockchain projects can be publicly decentralized. That is, anyone can put the working node of the blockchain and generate new blocks. In most implementations of the blockchain for the generation of a block a reward is given - this process is called mining. And since it is difficult to mine, and your results can be easily verified, it is profitable to act only honestly. Otherwise, you will spend resources on mining, and other miners will not take your block - all the work is a nuisance. Thus, with complete decentralization and independence of individual nodes, the network of blockchain nodes operates as a single whole.

But okay, say, one dishonest miner is easy to determine and ignore. But what if there are a lot of them and they have conspired? Imagine that all the people around you think the red light is green. :slight_smile: And they look at you as if you were abnormal hence you think otherwise. Social experiments show that most people in this situation begin to doubt and join the majority opinion. But in fact, the rule of the majority works in blockchain as well!

Similar problem of finding out the truth under conditions when your peers can shamelessly lie, was called by Leslie Lamport “The Byzantine Generals Problem”, and was solved two years earlier in 1980 by him together with other authors. It has been shown that with n spies that can lie and distort information, a consensus among the participants can be achieved with a total of 3 n + 1 participants. And if you guarantee that spies can not distort the messages transmitted through them, then 2 n + 1 is enough. In a blockchain due to electronic signature malicious nodes can not distort information, so if there are less than half of malicious nodes in the block, the network is stable.

Tolerance of the network to malicious nodes is called Byzantine Fault Tolerance (BFT). BFT is very important for public network systems, in which arbitrary nodes can be freely added. Such systems are the majority of projects on the blockchain.

The use of blockchain is not limited to creation of cryptocurrencies. You can put anything inside the block. Bitcoin puts there a list of new transactions and it is used to exchange cryptocurrency between its owners. In NameCoin blocks store arbitrary key-value pairs, which can be used to create decentralized DNS. In other implementations of the blockchain some more features are used. But Ethereum went a way further. It allows you to store in blockchain not only transactions, but also full-fledged Turing-complete programs, called smart contracts, which allow very fine tuning of the blockchain for an applied task. For example, NameCoin can be implemented in just 5 lines of Ethereum code.

Ethereum was designed as a universal platform for creating decentralized projects based on blockchain. Why to implement the entire blockchain system again, deploy individual infrastructure, if it is possible to realize what you need on Ethereum with just a couple of smart contracts, such as the analogue of NameCoin? Therefore, currently Ethereum is experiencing rapid growth. Since March 2017 ETH (cryptocurrency Ethereum) has grown 5-fold in just two months, and growth continues. There are already hundreds of applications on Ethereum, for example, the social network AKASHA, the freelance exchange Ethlance, the game of words, and a lots more!

Blockchain with smart contracts provides application with entire infrastructure. Applications have a code executed on blockchains smart contracts. Applications can store in blockchain any information, passing it to their smart contracts as data. Applications can read this information from the blockchain, because the state of Ethereum blockchain is, in fact, a key-value database.

Some may wonder, what else is needed? Applications are really decentralized, unaffected by censorship and prohibition. In general, blockchain - it’s pure advantages! However if everything was indeed so good… When creating really powerful applications, one immediately finds caveats.

Immutability. Immutability is, of course, a virtue. It is immutability that gives the blockchain openness and BFT. However, there is a downside to the coin. All the data that applications write to the blockchain remain there forever. You have played words game - the blockchain remembered this. You placed information in the social network - it is permanently stored in blockchain, even if you later deleted your profile. The explosive growth in the number of applications on the blockchain leads to a strong inflation of the blockchain in size. The size of Ethereum full node has now exceeded 130GB, although it has been running for less than 2 years. Bitcoin takes less space, even with its’ respectable age over 7 years.

Of course, some implementations of Ethereum include State Tree Pruning technology, which allows you to store only the most current state of blockchain, with a limited history for about a day, which allows you to reduce the stored information by 20 times. For example, go-ethereum full node requires 130 GB for storage, and Paritywith support for this technology - only 6 GB. However, given that the growth in the number of applications has just begun, and each Ethereum node has to store all the data of all applications, even if it looks necessary, it is just a delay of the problem. With the increase in size of blockchain, it will cease to be placed on mass-produced hard disks, its maintenance will be affordable only to large organizations, which leads to dangerous centralization - the concentration of control over more than 50% of the network in one organization. This can disrupt the BFT.

Slowness of transactions. For their sustainability blockchain has paid with its’ speed of transactions. Bitcoin has 7 transactions per second, Ethereum has 15. And this is for the entire network, because each node completely replicates other nodes. Adding a new node increases the stability of the system, but in no way increases the speed of its operation or the maximum amount of data storage. That is, changing the data (and each data change in the blockchain is a transaction) is a bottleneck. Popular apps will face this limitation at once.

Primitive data storage. While the state of blockchain is already a key-value database, it is rather primitive. Search is possible only on the primary key, the amount of stored data is very limited. For serious applications, this is clearly not enough.

Thus, when developing applications on blockchain, for example, Ethereum, the problem of data storage is very actual. So far there is no satisfactory way to solve it.

But after all, existing applications, for example, AKASHA, somehow got away … In the next part, we will consider the existing approaches to solving this problem.

Overview of Decentralized Databases

First of all let us formally enumerate the requirements to the ideal decentralized database that would be suitable for the decentralized applications needs.

  1. Distribution and decentralization - as the applications are decentralized the database should be either.
  2. Publicity - blockchain allows anyone to contribute to the network. So the database should do.
  3. Byzantine Fault Tolerance and tolerance for other types of attacks in the public network - we can not do without BFT in public network
  4. Sharding support - if we expect that an application will be popular and is going to store large amount of data then it would be great to make use of the total network power to increase maximum storage capacity. Full data replication on each node, of course, diminishes risk of data loss in the case of problems with a particular node. However in a large network of hundreds of thousands of nodes full data replication on all nodes is too redundant, and the level of replication can be decreased in favor of increasing total storage capacity. That is if we have N nodes then each record should be replicated to m nodes, where m < N. It allows linearly scale total storage capacity with adding new nodes.
  5. High speed - popular applications may require hundreds of thousands to million database transactions per second.
  6. Ability to store structured data - the storage should be able to understand internal structure of the stored data to allow applications to link records to each other.
  7. Ability to delete data - the database should be able to delete records that applications do not need anymore to conserve space.
  8. Secondary indexes, full-text search - applications should be able to perform fast searches of records taking into account their internal structure.
    Let us see whether existing technologies meet these requirements.

IPFS

IPFS (InterPlanetary File System) is a distributed file system technology based on DHT (Distributed Hash Table) and BitTorrent protocol. It allows you to combine file systems on different devices into one, using content addressing.

Advantages:

  1. Each device stores only those files that it needs, plus meta-information about the location of files on other devices. Therefore, no additional motivation for storing files is required.
  2. There is no need to trust peers, because the addressing of files is done by content.
  3. Resistance to flooding (uploading unnecessary files to the network), because the files are placed only on their originating device.
  4. High bandwidth (thanks to BitTorrent)

Disadvantages:

  1. Storing only files (unstructured information).
  2. After the file is placed, you can not leave the network until the file will be downloaded by anyone else.
  3. Data storage by other devices is not guaranteed, to guarantee the provision of your file to others you have to stay online
  4. Files are static (unchangeable)
  5. Deleting a file is not supported.
  6. Social network AKASHA (Ethereum + IPFS) and the OpenBazaar trading platform are built with the use of IPFS. They fully inherit the aforementioned disadvantages of IPFS, the main one of which is that you can not post information on the network and go offline until it has spread to peers.

Distributed file storage

Such storages allow you to merge separate devices into a common cloud storage. As a result, users can store their files there just as they could in a classic centralized storage, for example, Dropbox, but cheaper. Owners of devices (“farmers”), providing a place for storing other people’s files, get money from users according to their contribution. To measure the contribution, ensure the reliability of storage and to stop abuse, various checks are used, for example, proof of storage, proof of retrievability, which are based on cryptography. For successful verification passing, the user pays, and the farmer receives a certain amount in cryptocurrency.

Such projects are built mainly using DHT technology and content addressing. Some additionally use blockchain and smart contracts.

There are a lot of such projects on the market at the moment, for example, Sia, Storj, Ethereum Swarm, MadeSAFE. They are all built according to similar principles. And beside other things Ethereum Swarm was to provide a convenient file storage for dApps.

Advantages:

  1. Files are stored in the cloud and are available regardless of the availability of their owner.
  2. High throughput.
  3. Due to financial motivation, the reliability of storing and retrieving files is ensured.
  4. Deleting unnecessary files is possible

Disadvantages:

  1. Storing only files (rather than structured information)
  2. Files are static
  3. Storage is not free
  4. To store files, distributed file repositories look attractive. However, for storing structured dynamic information, as social network data, storing data in static files is a serious problem. Meaning that file repositories do not know anything about the contents of the file, and in our case it is required to look for information not only by the file’s identifier (or name), but also by its content. For example, find all users with the keyword “blockchain”. Or located in a certain city. Or even carry out a complete full text search of publications. Therefore it is necessary to search for the better implementation.

Distributed databases

Unfortunately, by virtue of the CAP theorem, it is impossible to create a fully distributed database, which simultaneously ensures consistency, availability and partition tolerance.

In our case, we need a distributed database, and it should be, of course, partition tolerant and accessible - we need to quickly get a response from it, although, perhaps, not the latest. This restricts our selection to a number of NoSQL databases, because ACID SQL DBMS primarily provides consistency.

The number of distributed NoSQL databases implementation is great. For example, MongoDB, Cassandra, RethinkDB. All of them are able to work with a large number of replicas that are clustered together. The client works with one of the replicas, and the data is automatically synchronized with the others. For load balancing, sharding can be used when part of the data is stored only on a part of the replicas. Adding a new replica to the cluster practically scales the cluster linearly, and some implementations (for example, Cassandra) allow the replica to automatically take part of the cluster load.

NoSQL databases provide “eventual consistency”, that is, the data is matched after a while, when individual replicas are synchronized. In this way they are similar to blockchain - the confirmation of a transaction is more likely the more time has passed.

NoSQL databases can store both a simple key-value, and maintain the internal structure of the value, as well as additional indexes. The most advanced ones also have basic transaction support and an SQL-like query language (for example, Cassandra).

In all of the above, this type of databases may seem ideal for use in a blockchain. Nevertheless, imagine that someone in the well-established cluster of such databases added a malicious replica, which begins to inform other replicas in the cluster that all data must be deleted! All other replicas obediently delete all data, and the database will be hopelessly corrupted. That is, such a coordinated work of replicas is possible now only in a trusted environment (a cluster of such databases is not Byzantine fault tolerant). If a maliciously working replica is placed in a cluster, it can cause the destruction of the entire cluster data.

Advantages

  1. High speed
  2. Linear scaling of the speed and storage capacity
  3. Resistance to the inaccessibility of individual replicas
  4. Mature realizations

Disadvantages

  1. The need to trust the peers - byzantine fault intolerance

BigChainDB

There is a blockchain implementation called BigChainDB or, as it is called, IPDB (InterPlanetary DataBase). The authors claim a very high transaction speed (1 million / sec), and a huge storage capacity (due to distributed storage with partial replication). BigChainDB gets these benefits through a simplified consensus when building blocks, and by storing all blocks and transactions in an existing noSql database implementation - RethinkDB or MongoDB.

Unfortunately, this architecture has a significant flaw - each node has full rights to write to the common data store, which means that the system as a whole is not Byzantine fault tolerant. The authors of this project know about it, promising to think about this later. However, fixing the fundamental flaws in the underlying architecture after the release of the product is very laborious and often impossible, as this can lead to a significantly different product with a different architecture. Such an easy approach to the fundamental problem causes criticism of the project community, since the high speed and mass characteristics of BigChainDB, demonstrated in the absence of BFT (Byzantine fault tolerance), are in fact those demonstrated by the RethinkDB and MongoDB databases used for data storage. But since you still need complete trust between the nodes, why not use the specified databases directly?

Thus, the actual use of BigChainDB is limited to private networks. In order not to mislead people, BigChainDB should be called BigPrivateBlockChain, then no issues would arise. For public networks, it does not fit.

Advantages:

  1. Speed and storage, comparable to distributed noSql databases

Disadvantages:

  1. In fact, this is just a usual noSql database, supplemented by the shortcomings of the blockchain
  2. Immutability (data can not be deleted legally, but it can be deleted maliciously)
  3. Byzantine fault intolerance, therefore, the inability to be used in a public network
  4. Thus, BigChainDB is completely unsuitable for use in public networks.

Concept of a new database

Thus, the current implementations of storages do not meet one or the other of the requirements for a database, which would suit a wide range of applications on the blockchain.

While it is affordable for a database to be softly linked with the blockchain (that is, not all database transactions must pass through the blockchain), it must be resistant to the malicious behavior of other DB nodes, provide sufficient level of replication, and have mechanisms to motivate participants to support the network.

As for the principles of data storage, it is proposed to provide DB with the following properties:

  1. The database is public, the user (client) of the database is identified by its public key (the key of his blockchain account) - which is the user ID.
  2. Each user can make transactions to the database, each transaction must be signed by this user.
  3. A new record created by the user imprints his ownership of the data.
  4. Change to the record after creation can only be made by the owner (or the user for whom the trust is established through the permissions mechanism, implemented as a smart contract on the blockchain).
  5. Everybody can read all the records.
  6. To ensure that there is no conflict between the records keys of different users, all the records keys of the user are prefixed with user ID.
  7. More complex permissions can be configured using a smart contract in the blockchain (for example, trust between specific users, rights to create / delete tables), etc.
  8. All permissions must be checked both for transactions and for replication.

The mandatory cryptographic signature of each record ensures that its modification or deletion by a malicious node without knowing the record owner private key is impossible. That is, thus constructed database is Byzantine fault tolerant even without a consensus protocol.

This gives hope that the operation speed of such a scheme will differ little from the speed of existing implementations of noSql databases.

However, an attacker can conduct a Sybil attack, generating key pairs and creating garbage records in the database. This problem is solved by introducing incentives mechanism and will be discussed below.

Motivation

The public network assumes that participants can freely join it, providing equipment that enhances computing power, storage capacity and network distribution. To stimulate such behavior, equipment owners should receive a reward motivating them to perform honest work.

Similarly to Ethereum Swarm the following rewards are supposed:

  • Incentive for retrieving data
  • Incentive for data storage

Rewards are paid by the user making inquiries. Since payments through the blockchain are very slow, you can use two methods for fast payments: off-chain transactions and checkbooks. In the case of off-chain transactions, the user will need to create an off-chain channel with each node of the database (or use intermediate channels between nodes). Given that each such channel requires the reservation of funds on it, it can be quite costly. Therefore, we will adopt the “checkbook” approach as the main approach. Before accessing the database, the user must reserve part of funds on a special smart contract “checkbook.” Next, the address of this contract is used by the node to receive a reward - the “checkbook” contract stores the money of its owner and allows third parties to cash the signed checks by simply sending the transaction with the check as data to the contract method “cash”.

  • The contract tracks the total amount issued to each recipient at the time of the connection.
  • The owner of a check must necessarily remember the total amount sent.

A check is cashed if

  • The address of the contract corresponds to the address on the check.
  • Check is signed by the owner (user ID - public key)
  • The total amount on the check is greater than in the previous redeemed check for this recipient.

Then if it is necessary to reward the node the user just sends a check. The recipient node can save only the last received check from each user and periodically cash it out by sending it to a “checkbook” contract, which allows you to conserve some of the blockchain transactions with some confidence.

Retrieval incentive

The data on the DB nodes have a certain level of replication, that is, the data with a specific key is stored only on a part of the nodes, for example, on N. However, the user can connect to any node for the data. The node, to which the user connects, acts further as a “coordinator”.

When the user makes request, by the value of the data key, the N nodes responsible for storing these keys are calculated and the requests are routed to them. The data returned by the nodes is checked by the coordinator for compliance with electronic signatures, compared by the time stamp, and the most recent record is returned to the user.

The work of the coordinator and replicas storing the data is subjected to payment. The proportions of payment are subject to more detailed calculation, but in order to stimulate correct behavior it is necessary to fulfill the following principles:

The faster the node returned data, the larger payment share it deserves
If the node returned old data, the payment decreases
The node, that did not return data, does not receive anything
The coordinator receives a fixed small share
Together with the data, the coordinator issues an invoice with the total payment to each related node. The user issues checks for each node. The coordinator sends checks to the nodes. It also sends the updated data if the node did not return anything or returned old data.

To protect against malicious coordinators and users who will not pay, each node maintains a list of users from which it expects payment and coordinators sending requests from these users. If the debt level exceeds a certain threshold, the node may stop accepting requests from the specified users and coordinators. Upon receiving checks, the lists are corrected.

Storage incentive

The reward for retrieval indirectly incentivizes storage of data, but works only with relation to popular and often requested data. To stimulate long-term data storage, especially if data are rarely requested, a reward for storage is required.

The article on Ethereum Swarm describes the system of storage incentives. Nodes enter into a data storage contract with the information owner for a period of time. The storage can be paid at the time of data storage (update) or after a while, provided that the data is actually stored. In the event that a loss of data is detected before the contract expiration, the node may be penalized, for which each node requires an initial registration with a security deposit.

When you save data, the node returns a receipt that proves that the node has accepted a file for storage. This receipt subsequently allows you to check whether the relevant data is still stored, and if not - to initiate a legislation, i.e. a special smart contract that will allow you to penalize the offensive node.

In our case, data is not static, a record with the same key can be rewritten several times. In this case, not only the original record can correspond to the presented receipt, but also a record with the same key that is newer in accordance with the timestamp.

When the user initiates a data deletion operation, instead of physically deleting data, the data is replaced with a special “zero” record. The record can be physically deleted after expiration of its storage contract.

Full-text search, secondary indexes

In noSql databases, a quick search with a small number of nodes involved is possible only with the primary key. The DB we research differs little from that model. Meanwhile, the search for a business partner by keywords, as well as the grouping of records by some criteria is difficult to achieve without secondary indexes and full-text search. For full-text search, as well as usage of secondary indexes, we propose to use a solution similar to Elassandra. This solution is the local full-text indexes of ElasticSearch on each node of the distributed noSql Cassandra database. Full-text queries are sent by the coordinator to all nodes, then mixed and returned to the client. Since additional indexes are created locally and independently on each node, additional Byzantine fault tolerance is not required.

General conclusions

Thus, the considered DB can be attributed to the next generation of databases satisfying the abovementioned principles. Let us review these principles once more.

  1. Distribution DB supports an unlimited number of replicas, each of which can be a coordinator. That is, by referring to one of them, the user gets access to all data.

  2. Publicity DB is designed to work in a public environment. New nodes can be added to the network and take on some of the load at any time.

  3. Byzantine fault tolerance and resistance to other types of attacks in the public network Given that all data placed in DB is signed by their owner, the nodes cannot, at their discretion, change the data, nor can they corrupt data when replicating data to other nodes. Attempts to tamper with the data are immediately detected due to the electronic signature mechanism. For a tampering attempt a node-intruder may be deprived of the registration deposit and excluded from the network. To place the deposit, set access rights, the mechanisms for mutual settlements between the nodes, an external (for DB) blockchain is used, which should support Turing-complete smart contracts.

  4. Shard support (the ability to replicate only a fraction of the data on each node to increase the maximum total amount of data) Each DB node is responsible for a certain interval of primary keys that it stores. The level of replication (the number of nodes storing copies of data with the same primary key) is specified separately and can grow with the growth of the network.

  5. Speed Principles of data storage suggest that the speed of writing and reading data in DB will not be very different from the current implementations of databases as Apache Cassandra.

  6. Ability to store structured data Data in DB supports the structure. This can be a JSON document with a structure that meets the needs of a specific application.

  7. Ability to delete data Data deletion is supported in DB. You can not guarantee instant removal, but in the end, with good behavior, the data will be deleted. A malicious node can deliberately store all the data that is deleted. However, it will not be able to do it for all the data, because it only receives requests in a certain range of primary keys.

  8. The query language with the ability to search not only by the primary key Using ElasticSearch, similar to the methods of integration with Cassandra in the Elassandra project, it is possible to extend the query language with secondary keys and full-text search.

This DB can be used in different decentralized projects. It relies on a blockchain supporting Turing-complete smart contracts. In this regard, it can be used for the needs of decentralized applications built on top of a variety of the blockchains such as Ethereum, RChain and others.

The concept of such a DB was created as a part of an open source Übermensch project. There is no implementation yet but the demand for it is unquestionably high. So this concept is presented for your consideration. We hope that this new concept is interesting enough for you to discuss it and even join the development. You can also join our slack.

6 Likes

Thank you - had I found this 5 hours ago I could have saved myself 4 hours and 30 minutes :slight_smile:

4 Likes

And the original, of course: https://github.com/UbermenschProject/ubermensch-docs/wiki/Where-do-decentralized-applications-store-their-data%3F

There are plenty of hyperlinks with some interesting research behind the article we could not post because of the limitations :slight_smile:

1 Like