Appendable Data discussion

Sounds like a pretty big task. Can I ask what the reasoning behind it is from the point of view of the 21 fundamentals, network operations and UX?

17 Likes

Not that big really. It simplifies the backend a lot, but makes front-end use more CRDT-type practices, and that complicates that side. The main thing is we say data is immutable so it cannot be lost/deleted, but not all data is: metadata, session info etc. are not. Also short-term stuff should not be put on the network, unless some rich folk want to of course :slight_smile:
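To illustrate what "more CRDT-type practices" on the front end might look like (a rough sketch only, with made-up types, not anything from the SAFE APIs): a last-writer-wins register lets two clients update the same logical value independently and still converge, without any in-place mutation on the network.

```rust
// Minimal last-writer-wins (LWW) register: the kind of CRDT-style merge
// logic a client could layer over append-only storage. The timestamp and
// actor fields here are illustrative, not part of any real SAFE API.
#[derive(Clone, Debug, PartialEq)]
struct LwwRegister<T> {
    value: T,
    timestamp: u64, // e.g. a Lamport clock or hybrid logical clock
    actor: u64,     // tie-breaker when timestamps collide
}

impl<T: Clone> LwwRegister<T> {
    // Merge two replicas: the entry with the higher (timestamp, actor)
    // pair wins, so both sides converge regardless of delivery order.
    fn merge(&self, other: &Self) -> Self {
        if (other.timestamp, other.actor) > (self.timestamp, self.actor) {
            other.clone()
        } else {
            self.clone()
        }
    }
}

fn main() {
    let a = LwwRegister { value: "v1", timestamp: 1, actor: 1 };
    let b = LwwRegister { value: "v2", timestamp: 2, actor: 2 };
    assert_eq!(a.merge(&b), b.merge(&a)); // merge is commutative
    println!("converged to {:?}", a.merge(&b).value);
}
```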

14 Likes

@dirvine If all data is immutable then that might simplify things for PtP? ahem :grin:

8 Likes

I can hide nothing :smiley: :smiley:

9 Likes

I'm on to you man. I might have some theories about the AI stuff mentioned in the interview :wink: Have to admit I'm a fanboy. Your shoulders, and those of others at @maidsafe, will be amongst the ones the folks of the future stand on, along with all of those this project has had the opportunity to stand on.

20 Likes

If we only have immutable data, will that have an effect on dynamic website variables, databases and the divisibility of Safecoin (the array model)? If yes, how will it affect those fields and similar ones?

Fantastic to have a road to a fully asynchronous Parsec, thanks @AndreasF and the POA Network people.

16 Likes

Regarding parsec, fantastic news! :slight_smile:

For the database types that I am working on, it would have some effect. The most recent one is a sort of document DB, where entries can be removed. If you want to remove something when the underlying structure is append-only instead of mutable data, then you'd need to keep metadata slots available so you can set flags like 'deleted=true', or have a way to always refer to the end of some structure and let a deleted entry be a tombstone at the end.
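A minimal sketch of that tombstone idea, assuming nothing more than a generic append-only log (the types below are made up for illustration, not a SAFE data type): a delete is just another appended entry, and the current view is materialised by replaying the log, so history is never lost.

```rust
use std::collections::HashMap;

// Entries in an append-only log: a delete is just another append (a tombstone).
enum Entry {
    Put { key: String, value: String },
    Tombstone { key: String },
}

// Materialise the current view by replaying the log from the start.
// Nothing is ever removed from the log itself, so history stays intact.
fn materialise(log: &[Entry]) -> HashMap<String, String> {
    let mut state = HashMap::new();
    for entry in log {
        match entry {
            Entry::Put { key, value } => { state.insert(key.clone(), value.clone()); }
            Entry::Tombstone { key } => { state.remove(key); }
        }
    }
    state
}

fn main() {
    let log = vec![
        Entry::Put { key: "doc1".into(), value: "hello".into() },
        Entry::Put { key: "doc2".into(), value: "world".into() },
        Entry::Tombstone { key: "doc1".into() }, // "deleted", yet still in history
    ];
    let state = materialise(&log);
    assert!(!state.contains_key("doc1"));
    assert_eq!(state.get("doc2").map(String::as_str), Some("world"));
}
```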

I think that with the ever-expanding database I have devised an implementation of, I would maybe let a single instance of a value be a database instance. It can then change as many times as you can afford, or until the network runs out of space, and you would always have a reference to the latest value (as well as the entire history of its state).
There's some overhead there if you want to do it on the most primitive values… it might be possible to cut down on it and keep the necessary properties. But there are other ways to go about it as well.
Mixing non-SAFE and SAFE storage would be OK for some types of applications, but not for others. You could also have throw-away appendable instances, and just evolve a map of all instances (i.e. let the map be the ever-expanding DB, where the last entry points to the latest version of the map), then compose that in an arbitrary number of layers.
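A sketch of that "single value as a database instance" idea (in-memory only, with made-up names, so versions are Vec entries rather than stored network objects): every change appends a new version, the latest value is simply the last entry, and the full history stays readable.

```rust
// An append-only versioned value: every change is an append, so the latest
// state and the entire history of the value are both always available.
// On the network each version would be a stored object rather than a Vec entry.
struct VersionedValue<T> {
    versions: Vec<T>,
}

impl<T> VersionedValue<T> {
    fn new(initial: T) -> Self {
        Self { versions: vec![initial] }
    }
    // "Changing" the value just appends a new version; nothing is overwritten.
    fn append(&mut self, next: T) {
        self.versions.push(next);
    }
    // A reference to the latest value is always the last entry.
    fn latest(&self) -> &T {
        self.versions.last().expect("always at least one version")
    }
    // The entire history of the value's state.
    fn history(&self) -> &[T] {
        &self.versions
    }
}

fn main() {
    let mut doc = VersionedValue::new("draft 1".to_string());
    doc.append("draft 2".to_string());
    doc.append("final".to_string());
    assert_eq!(doc.latest(), "final");
    assert_eq!(doc.history().len(), 3); // full history preserved
}
```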

For event stores, it's basically the possibility of truly immutable event streams :slight_smile:
It makes me wonder if it wouldn't be possible to include the ever-expanding logic in the core libraries, and have 'endless' (up to your wallet, or the network limit) streams? @dirvine

10 Likes

Dropped that and ran :slightly_smiling_face:

Is this official or a one night stand?

4 Likes

AppendableData! Again!

History repeats itself: Structured Data -> Appendable Data -> Mutable Data -> Appendable Data

So, you are going to define the 4th version of this kind of data. This is typically over-engineering. In French there is an expression to describe a situation like this: the best is the enemy of the good.

At least you can take advantage of it to put back signatures in the new structure. I have mentioned several times that this was a regression compared to the initial SD; without them there is:

  • no multi-sig
  • no static check
12 Likes

I like this idea at first glance.

Choosing an arbitrary xorname for MD was a concern to me (e.g. data density attacks). Being able to use the content of AD to derive the xorname should be more secure. It also allows simpler caching of AD, whereas caching MD is very complex and challenging.

Having an initially blank field next_data_xor_name that can be updated just once (i.e. the append operation, which would not affect the original xorname) allows arbitrarily large appendable data sets, but comes with a cost of needing more lookups to fully download the data or reach the latest point in the chain. I think there are ways to simplify this though, e.g. via communications rather than storage, maybe some overlay network recording AD-LATEST-XORNAMES. I feel like there's a strong parallel between bitcoin segregated witness and the 'appendable' part of AD. Navigating the 'next_data' direction of AD is an interesting problem.
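A toy sketch of that layout and of the lookup cost it implies (the hashing and the 'network' map below are stand-ins; a real design would use XorNames from a cryptographic hash and routing lookups, and next_data_xor_name just follows the naming above): each AD's address is derived from its content, the next pointer is written exactly once, and reaching the tip of the chain costs one lookup per appended link.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Toy stand-in for a content-derived address: a real network would use a
// cryptographic hash (e.g. SHA3-256) rather than DefaultHasher.
type XorName = u64;

struct AppendableData {
    content: String,
    // Initially None; the single "append" operation fills it in exactly once,
    // pointing at the next AD. The original xorname (derived from `content`
    // alone) is unaffected by this.
    next_data_xor_name: Option<XorName>,
}

fn xor_name_of(content: &str) -> XorName {
    let mut h = DefaultHasher::new();
    content.hash(&mut h);
    h.finish()
}

// Walking to the latest entry costs one lookup per link in the chain.
fn latest(network: &HashMap<XorName, AppendableData>, head: XorName) -> (XorName, usize) {
    let (mut name, mut hops) = (head, 0);
    while let Some(next) = network.get(&name).and_then(|ad| ad.next_data_xor_name) {
        name = next;
        hops += 1;
    }
    (name, hops)
}

fn main() {
    let contents = ["v1", "v2", "v3"];
    let names: Vec<XorName> = contents.iter().map(|c| xor_name_of(c)).collect();

    // Build the chain v1 -> v2 -> v3; each next pointer is written only once.
    let mut network = HashMap::new();
    for (i, content) in contents.iter().enumerate() {
        network.insert(names[i], AppendableData {
            content: content.to_string(),
            next_data_xor_name: names.get(i + 1).copied(),
        });
    }

    let (tip, hops) = latest(&network, names[0]);
    assert_eq!(tip, names[2]);
    assert_eq!(hops, 2); // one extra lookup per appended link
    println!("latest content: {}", network[&tip].content);
}
```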

This maybe has implications for safecoin history and privacy since the new owner must be appended rather than replaced. But with a suitable signature scheme privacy should be retained. Hard to imagine this would still permit free txs though.

AD also really challenges (and strengthens) the idea of volatile vs permanent data. Is SAFE used to 'store data forever' or is it used to 'securely arrange meetings with other people and then transfer data p2p in a volatile way'? A bit of both I guess… I think AD progresses this concept in the right direction. A bit like how the lightning network is used to transfer the day-to-day data (volatile) and then every so often the aggregate result is written to the ledger (permanent), where the bitcoin blockchain and lightning are acting as "a highly accessible and perfectly trustworthy robotic judge and conduct most of our business outside of the court room" (source). MD feels like pre-lightning-network bitcoin, AD like bitcoin + lightning network, because AD encourages volatile data transfer using efficient 'off-chain' ways.

Will we get to hear more details about how AD has been discussed within maidsafe?


Very cool. I know nothing of this technology, but having read the medium article posted by @dirvine it will be interesting to see if this particular drawback ends up being significant or trivial:

BLS signature verification is an order of magnitude harder than ECDSA. Signature aggregation for a whole block with 1,000 transactions still requires computing 1,000 pairings, so verifying one tiny signature in a block may take longer than verifying 1,000 separate ECDSA signatures.

13 Likes

This is not at all decided yet though.

There are many ideas floating around, and this is just one of them (admittedly the one I am advocating for personally, for multiple reasons :slightly_smiling_face:). The primary goal here is to store the data perpetually, as per the network fundamentals. But I also think secondary goals like having the backwards API compatibility with Mutable Data are very important too.

How exactly this data type will be named and implemented is yet to be decided though, so for now we're at the discussion and pre-RFC stage.

14 Likes

EDIT: some of the information in this post is misguided because ADs would not be implemented as data objects but as link objects. So an AD should really be called an ALD, and this changes some of the logic/assumptions my post was based on.

Not smart in my view. Remember we want to support devices that do not have large disk capacity, and/or users with privacy concerns who need no trace left on the device.

One notable use case is database operations. Some very large databases do thousands of mutations per minute, or even per second, during their peak hours. If you make all data immutable then the space required for these types of databases will balloon out and swamp the storage capacity. It's one reason these multi-terabyte databases do not keep a log of every mutation made on them and only take snapshots of the data. And even those snapshots are kept only for a set period of time.

In my opinion you must provide the ability for fast-changing data(bases) not to keep every change made to them. At least have it as an option.

Also, databases with append-only data mean there is no simple field-change function. You either have to have a procedure that traces through all the changes (which may need to read many MDs == very slow access now due to lag time) in order to reconstruct the record, which is time-consuming when done on every record read and wastes all that energy, OR the database makes a complete copy of the record and appends it, so that the procedure to reconstruct the record is easy.

To build just one object for display may require tens to hundreds of relational database records to be read from multiple files, plus index records too, and if the database is very active then the work to reconstruct each of those hundreds of records could require more than one MD per record. And it's not a parallel situation, since the index fields for the other files are held in the records being read.
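A small sketch of the reconstruction cost being described here (the record layout and change type are made up for illustration, not any actual SAFE or database API): if each appended change sets a single field, rebuilding the current record means replaying every change, so reads scale with the number of mutations rather than the size of the record.

```rust
use std::collections::HashMap;

// One appended change: set a single field of a record to a new value.
// In an append-only design every such change is kept forever; in a mutable
// design it would simply overwrite the field in place.
struct FieldChange {
    field: &'static str,
    value: String,
}

// Reconstruct the current record by replaying every appended change.
// If each change lives in a separate network object, the cost is roughly
// one read (and one round trip) per change, per record displayed.
fn reconstruct(
    initial: &HashMap<&'static str, String>,
    changes: &[FieldChange],
) -> HashMap<&'static str, String> {
    let mut record = initial.clone();
    for c in changes {
        record.insert(c.field, c.value.clone());
    }
    record
}

fn main() {
    let mut initial = HashMap::new();
    initial.insert("name", "Alice".to_string());
    initial.insert("balance", "100".to_string());

    // A busy record accumulates many appended changes over time.
    let changes = vec![
        FieldChange { field: "balance", value: "80".into() },
        FieldChange { field: "balance", value: "95".into() },
        FieldChange { field: "name", value: "Alice B".into() },
    ];

    let current = reconstruct(&initial, &changes);
    assert_eq!(current["balance"], "95");
    println!("replayed {} changes to rebuild one record", changes.len());
}
```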

Then in my opinion there is a use case for temporary files too. For instance, editors store a temp file and discard it once the editing session is finished, so the temp file is useless once discarded; the saved file and previous versions are the actual files. Remember SAFE will be run on devices that cannot hold large temp files on their disk, and/or by people with privacy concerns.

Also, these application temporary files are often heavily mutated, some on every group of characters changed and others on larger changes. This is for recovery purposes, and if someone wants privacy (whistle blowers, ordinary people) on a shared device then the temp files have to be on the network. If you keep all these mutations then this represents a lot of wasted space for no benefit: the changes are saved when the file is saved and the session is over. Think of all those hundreds of millions of Word documents that office staff work on each day, and you want to save every character/line/paragraph change (for no benefit). That's many terabytes or more a day of useless data (never accessed again, no information gained or lost by keeping or discarding it).

And deleting this very temporary information does not take away meaningful information, since each version of those documents is still kept in immutable data as files. That reinforces the fact that it is keeping data with no benefit to the people using those applications or to the future world.

The world of data storage is a lot more than web sites.

So in my view keeping website changes is good, BUT not EVERY character or word or tag that is changed during an editing session. Just keep the saved files, for goodness' sake.

tl;dr

  • Remember one of the early promises was that you could log in on any device and when you logged out there would be no trace left. Requiring temp files to be stored on the device means there is a trace. On SSD devices/memsticks, wiping files using overwrite methods doesn't work properly and files can often be recovered. EDIT: even if you encrypt the temp files, the fact they ever existed (metadata) can cause problems. Remember the ex-NSA chief who said "we kill people based on metadata".
  • Databases will require a method to reconstruct records by tracing through all the appended changes and building the record.
    • This dramatically increases the time to access data, since a lot of those records will now be multi-MD in size because of the changes made to the record.
    • Index files now become almost useless (speed-wise and size-wise) due to having to reconstruct the index record. Just read up on how index records are optimised and you might get an idea of the problems append-only data will cause.
    • The multi-terabyte databases with thousands of mutations per minute or second will suffer a data-space blowout for no measurable benefit.
  • The world of data is so much more than web sites and I agree that each version of the website should be kept, but not all the temporary data/files involved in the edit of each web page.
  • Privacy & Security
    • Once you have append-only data then you force temporary files back onto the device, if indeed the device can support it. This has serious implications for those in the world who want privacy and security for their data. Not every activist or whistle blower can have a device that can support large temp files and that also cannot be taken from them. Often they use shared devices or phones/tablets that can be taken from them. If the temp files are not on SAFE but on the device, then the device can betray them.
    • But if you put all temp files on SAFE then data space will balloon. For instance, if I have 10 MB of documents and I edit each on average 3 times a year, then I end up adding 60-300 MB of appended temporary files, plus 3 MB to 30 MB of immutable data (the various versions). Now I have 60-300 MB of pure wasted space stored on SAFE, many times more than each version requires (see the rough arithmetic sketched just below).
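Rough arithmetic behind that estimate, assuming (purely for illustration) that an editing session's temp-file churn amounts to somewhere between 2× and 10× the size of the document being edited: 10 MB of documents × 3 edits a year × a 2-10× churn factor ≈ 60-300 MB of appended temporary data per year, against only 3-30 MB for the saved versions themselves.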
31 Likes

Will Fleming still use the existing model?

Fabulous! I'm always impressed by the open-mindedness of the team and hence the ability to learn and adapt new tech for the project. Possibly a lot of this is simply due to not yet being in beta, and perhaps this attitude will shift a bit when beta is launched; however I think there is also a sense of humility here - knowing that you can't figure everything out yourselves - so you keep looking at other projects for better solutions. Maidsafe isn't the only team in the cryptosphere this mature-minded for sure, but it's good to see.

Can we have a "pros versus cons" thread on this? @neo has made some interesting 'pro' (keep MD) points. I wonder about the costs and cons of MD to the network though - and importantly how they are or may be addressed without giving up MD altogether.

Great to see the new videos, events, website dev, and general marketing pressure maintaining itself week after week. It will pay off handsomely down the track, so do keep it up!

Awesome update as usual. Thanks to the whole team for the hard efforts.

Cheers

11 Likes

One cost I forgot to mention is that it will make the barrier to adoption of SAFE a lot higher for large businesses and systems. Even small database systems will not touch SAFE because they will not want to write the code to reconstruct records.

So the cost is non-use of SAFE by some/many, including those seeking privacy and security by not leaving traces of temp files on their device.

9 Likes

Is it insane to have immutable data, mutable data, AND appendable data? It seems they each have irreplaceable qualities that have been the cause of continuous debate.

8 Likes

I'm only guessing, but perhaps one of the cons of MD is more network traffic? And of course more code complexity.

Given the seemingly strong pros though, I'm scratching my head at why those cons would be much of a trade-off. Which is why I'd like to see a clearly laid out thread on the pros versus the cons. Maybe this discussion has happened in the past and it's already on the forum, but a quick search found nothing.

5 Likes

But of course if it means that adoption fails when it comes to real database applications and web sites then …

Yes and it seems that the problems with removing mutable data have been forgotten.

5 Likes

Would not surprise me at all. It's a long-running project and people have left and new people have joined. Hopefully this can be resolved quickly and the solution cemented firmly in place - it's frustrating to see this coming up again now. I'm not discounting that there may be a good reason, but would like to know what it is.

2 Likes

This would be helpful. I understand that keeping data forever is one of the big ideas, but already it's not an absolute rule since metadata and messaging won't be kept - presumably that includes all the machine-to-machine sensor data that will grow exponentially as the IoT comes online. So there's already a dividing line between data that will be kept and data that won't, and as @neo points out having only the immutable data option could potentially be restrictive.

From an end-user's point of view this would seem to be the ideal scenario. What would be the downsides? More complexity presumably … any others? Interested in your thoughts @nbaksalyar.

8 Likes