Appendable Data discussion

dirvine · February 10, 2019, 10:27am

Man this is a long convo, there is defo a miscommunication here.

Can I start this reply with - AD is what we currently have sans multisig. We are not losing here but gaining. The only thing we remove is the zero out of a value from a key value pair. It allows AD to be more concise and exact in its specification. It does not prohibit more data types down the line in any way.

Also AD and ID do not mean apps cannot edit data, they can, they can remove things from presentation (“delete”) so apps will just see a network they can do all these things on from a user perspective. I think folk think we are removing the ability for users to edit data here and we are certainly not.

I don’t think it is, but it is what we have now in a pseudo way. True Appendable Data as we will provide it makes this more elegant.

Data is immutable, due to the immutable data type. Where to find that is the appendable data type. If we end up in the future with actual Mutable Data (edit/delete) then that is all OK, but it has to be in a way that other things like perpetual data cannot be damaged.

Two types: Immutable Data (hash of content == name), this is self-checking data. Then Appendable Data, easy to think of this as pointers to immutable data. This allows data representation to change, i.e. now this file has this content (it points to new chunks). All the history and data is still online though.

An immutable data chunk can also be a list of pointers, so it allows a mix of data types, for instance to make up a video etc. Think of it like this: video == 1 billion chunks, so AD has pointers to 1000 but the last pointer is an ID chunk that has more pointers etc. This is how you can have a mix of data types to represent a file.
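The two types described above can be sketched in a few lines. This is a minimal illustration, not the network's implementation: the in-memory `chunks` dict stands in for the network, SHA-256 is an assumed hash, and `put_immutable` / `get_immutable` are hypothetical names.

```python
import hashlib
import json

# Hypothetical in-memory stand-in for the network's immutable data store.
chunks = {}  # name -> content


def put_immutable(content):
    """Store a chunk under the hash of its content (hash of content == name)."""
    name = hashlib.sha256(content).hexdigest()  # SHA-256 is an assumption
    chunks[name] = content
    return name


def get_immutable(name):
    """Fetch a chunk and self-check it against its name."""
    content = chunks[name]
    if hashlib.sha256(content).hexdigest() != name:
        raise ValueError("chunk does not match its name")
    return content


v1 = put_immutable(b"first version of the file")
v2 = put_immutable(b"second version of the file")

# The appendable data: an append-only list of pointers to immutable chunks.
appendable = []
appendable.append(v1)
appendable.append(v2)  # the "edit": the file now points to new content

current = get_immutable(appendable[-1])               # what apps present
history = [get_immutable(p) for p in appendable]      # nothing was lost

# An immutable chunk can itself hold a list of pointers.
more_pointers = put_immutable(json.dumps([v1, v2]).encode())
resolved = [get_immutable(p) for p in json.loads(get_immutable(more_pointers))]
```

Because a chunk's name is derived from its content, a pointer in the list can never silently start referring to different data.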

neo:

The new AD (appendable data) type which is replacing the MD (mutable data) type means that whenever a change is made to one or more bytes then a new field is created with the changes in that field. How are the changes to be stored?

will the change be something along the lines of bytes xxx-yyy are changed to whatever OR

The previous field is flagged as old and the complete data is stored in the new field.

And I imagine that just adding (appending) to the data results in the new field just holding the new data to be appended.

What happens when the changes exceed the 1MB size of the AD?

What happens when hundreds of changes have been made, e.g. to a web site or document?

will the system have to add an AD when a particular change cannot fit in the current AD?

how will it determine the address of the continuing AD? what if the ADs at the calculated addresses are in use? How will the linkage be done.

What happens when the web page is 900KB long and it is changed 50 or 100 times? How many ADs will be linked for this one? How long will the processing take to get the actual page to display? In other words will the browser have to retrieve the 50 or 100 ADs that this one AD has grown to?

I am aware the guys are finalising the RFC here so don’t want to step on toes, but related to the last point: AD is a list of pointers (those can be identities etc., basically xor addresses) that is finite. The last item of this list will be a pointer to further history contained in an ID chunk.

AD is as MD is now, when you register a change it appends a new list item (right now it is a key value pair, where the value nullifies). So change there is not a key value, but a list, so smaller, more efficient and not scrubbing history. The existing thing scrubs history but leaves the data in place (zeroed out).
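To make the contrast concrete, here is a toy comparison. It assumes the Alpha 2 behaviour is as described in this post (a key value pair whose value is zeroed out on delete) and uses plain Python containers; real AD entries would be pointers rather than inline bytes.

```python
# MD-style, per the description above: key value pairs, "delete" zeroes the value.
md = {"page": b"version 1"}
md["page"] = b"version 2"   # the change replaces the visible value
md["page"] = b""            # "delete": the key stays, the value is zeroed out

# AD-style: an append-only list, every change is a new list item.
ad = []
ad.append(b"version 1")
ad.append(b"version 2")

latest = ad[-1]      # what an app would normally present
history = list(ad)   # earlier entries are still there, nothing scrubbed
```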

If you wish to put temp stuff on a perpetual data network then you can, but there is a network cost to doing so and the network will need to take a payment. So now all we do is say it will be permanent and this is the cost. If it were not permanent then you need to sign it to prove you can delete it, then we need to know you have not shared it, then perhaps it can be deleted, BUT you just exposed the inner workings of your “stuff” to the world on a network that folk can ask for your stuff. Did you really encrypt all that stuff? If you did, it must have been in RAM etc. or stored on disk locally (please not plaintext). All this is much more additional work/design to allow folk to use the network as a temporary storage solution? It might be easier to ask the client managers you are connected to for some play areas etc. It is a new feature though that needs to be really thoroughly thought out.

I claim this is all not thought out as much as it should be and is additional work/design/code and will delay launch. If the whole community want us to do that then it can be done, but it is absolutely not required at all. Local apps can use local temp storage that is secured and wiped, that is faster and more efficient. Where they cannot then they log into a bigger machine that can, at least right now.

I agree, so say you render a 100Gb video, that render can create many terabytes of temp data, it does so locally and scrubs it, but if you wanted the network to hold it you have huge bandwidth costs etc. (SAFE and networks like it are bandwidth bound).

It is doing as data is doing now, but more efficiently and not scrubbing entries.

By the app as the last item is that AD completed. So the app has to make the last item a link to another data element or the AD is closed (just as now).

Just as now really.

Of course, but this would be to do more than we currently do in Alpha2 and will need consensus on edit/delete as you can have collisions there even from a single user logged into 2 separate devices and so on. So there has to be consensus on those changes, therefore it will be slower. If data is CRDT-like then you can do things very fast compared to consensus-requiring data types. This is due to the latter requiring a mechanism to handle conflicts. Even with append only there are opportunities for conflicts, but they are much simpler to solve.

I think the Appendable Data RFC needs to happen sooner rather than later for the sake of the community, to show this is expanding what we have in alpha 2, so all alpha2 apps can work but more will be possible with true AD and multisig.

I feel there is a knee jerk here that folk think we are losing something here and need to quickly design alternatives, but we are not losing stuff here, we are getting more.

Then a debate on temp files, rewriting history, deleting certain data etc. can take place without the distraction of quickly repairing something that is not broken.

EDIT - I burned my porridge answering that, the price we pay

neo · February 10, 2019, 10:53am

Can I point out we are losing the promise or envisioned MD structure that was to be implemented.

Yes that is understood and clear in my discussions

There are other very elegant methods. Remember that elegant for one usage does not equate to elegant in the general sense. Appendable data is one method that is a computer science concept that is studied, but then students move on to better methods. Appendable data is great for logging systems where change is minimal and most data is appended.

As my proposal does and gives you the cake and you can eat it too. Appendable data causes the workload to be placed at the time of reading and since data is typically read many more times than written then it ends up causing greater computing effort reading and a consequential slow down as simple examples show.

Whereas version-keeping MDs require little or no extra processing in the storing and no extra processing in reading. This is CS201 stuff. Also ADs increase energy usage due to extra processing when trawling through the changes to reconstruct the data (or follow index links), whereas version-kept MDs do not.

Both provide perpetual data

So you are keeping the 2 types of data. Good.

Again an invalid situation. 1x10^9 * 1x10^6 == 1x10^15 bytes for a video. Hmmmm I find that highly unlikely 1,000 TB video file. About 200 thousand hours of 4K video.
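The arithmetic can be checked in a few lines. This sketch assumes decimal units (10^6 bytes per chunk, 10^12 bytes per TB); the bitrate is only what the 200 thousand hours figure implies, it is not stated in the post.

```python
chunk_count = 10**9          # "1 billion chunks"
chunk_size_bytes = 10**6     # assumed 1 MB (decimal) per chunk

total_bytes = chunk_count * chunk_size_bytes
total_tb = total_bytes / 10**12

hours = 200_000              # "about 200 thousand hours"
bytes_per_hour = total_bytes / hours
mbit_per_s = bytes_per_hour * 8 / 3600 / 10**6

print(total_bytes)           # 1000000000000000
print(total_tb)              # 1000.0
print(round(mbit_per_s, 1))  # 11.1
```

So the estimate corresponds to roughly 5 GB per hour of video, about 11 Mbit/s.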

In any case the video itself is still immutable data and if I have the datamap of it then even your example falls apart since I have both ADs

In any case my proposal ALSO disallows the pointers to be modified if you take the version at the time of storing the video.

My proposal makes MDs as perpetual as ADs without the additional cost to reading the data as ADs introduce.

So is the original data copied over (with changes made) to the new place where its stored or what?

This sounds like if I want the data held then I first access the AD which is a list of pointers then I have to get the actual data stored elsewhere???

But it still has more overhead than simply version-keeping the MDs. Here MDs means the promised MDs that were to be implemented.

But as now has not had the issues of much appending. Cannot do much with 1000 PUTs. The problem occurs when the data is actually being changed and modified. These problems will then increase over time.

The more network accesses you have to do to get the actual data slows down applications.

Sorry about that, you can always make more though.

Thanks for that, it makes a little more sense but I still disagree that it’s better than simply implementing MDs as proposed for months (I know not yet implemented) and keeping a copy of each version.

A lot quicker than either trawling through changes or indexes which then require further network access to get the actual data.

And if an interim measure to get to Maxwell then so be it.

dirvine · February 10, 2019, 11:24am

What I was saying: if you have video rendering, especially animation type work, then it will produce many more times the size of the file. These systems create layers and tons and tons of temporary files to get the finished product. This is why video rendering is usually done on large machines with tons of RAM etc. If you tried to do that on a phone or system with zero local storage it would be cripplingly slow and mean uploading huge amounts of temp data. So an extreme example, but valid if we say allow all temp stuff on the network, it just won’t work in many cases.

Is it worth a separate thread where the OP is your “finalised” original proposal for discussion?

No the new data is new chunks that are pointed to with the new AD entry. This means the history of an AD list is a list of the versions of the original data.

The MD proposal zero’d out entries though, is this what you mean or another MD proposal?

I just added more honey, it was still good and now stuck to my ribs

I think this will be a concrete data type for the foreseeable future. If we then wanted in-place edit/delete of network data elements we would need to add another data type so they do not get coupled. Well I would hope we have them separate, but you never know.

neo · February 10, 2019, 11:34am

Not clear really.

If I change 5 words (in random places) in my web page what is written?

So my web page is 750KB long. Where is the data stored? It sounds like the actual data is never stored in an AD but in immutable chunks (3 chunks for 750KB) and the AD only points to those 3 chunks.

So what happens with those 5 changes.

Do you write out 3 new chunks and append the new pointers?

OR do you store the 5 words and how to change the original 3 chunks to make the new web page?

What was understood is that the original MD stored the actual data in the MD fields. And each field could be changed (no zeroing or appending for a change)

For us here it’s like 28-30°C by breakfast time. Chilled milk on toasted oats with fruit here (if I have any) and coffee. Got to have that coffee

oetyng · February 10, 2019, 11:39am

Hey David, this short one.
(Damn neo, you must short down your texts, I feel like we’re in a bird’s nest and only the biggest loudest chick gets any worms )

Is an AD 1k of value entries, i.e. a list?
What mechanism tells us where the last entry is at, will there be a field for current index or must we search from index zero?

dirvine · February 10, 2019, 11:40am

Either those chunks or another chunk that only holds the open (unencrypted) data map.

Yes new chunks and new pointer in this case, but it is not the only way, it is the simple way to do it.
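The "simple way" can be sketched as follows. Everything here is an assumption for illustration: a 250,000 byte chunk size, SHA-256 naming, a JSON data map stored as its own chunk, and no self-encryption (which this deliberately ignores, so the real network may write more new chunks than this toy does).

```python
import hashlib
import json

CHUNK_SIZE = 250_000  # hypothetical: gives 3 chunks for a 750KB page
store = {}            # toy stand-in for the network: name -> content


def put_chunk(content):
    """Store immutable content under the hash of that content."""
    name = hashlib.sha256(content).hexdigest()
    store[name] = content
    return name


def put_file(data):
    """Chunk the data, then store an open data map listing the chunk names."""
    names = [put_chunk(data[i:i + CHUNK_SIZE]) for i in range(0, len(data), CHUNK_SIZE)]
    return put_chunk(json.dumps(names).encode())


def read_file(data_map_name):
    """Follow the data map to rebuild the file."""
    names = json.loads(store[data_map_name])
    return b"".join(store[n] for n in names)


page_v1 = bytes(i % 251 for i in range(750_000))  # 750KB of filler content
ad = [put_file(page_v1)]                          # AD entry 0 -> data map of v1

page_v2 = b"HELLO" + page_v1[5:]                  # a small edit near the start
ad.append(put_file(page_v2))                      # AD entry 1 -> data map of v2

print(len(store))  # 6: 3 chunks + map for v1, then 1 changed chunk + new map
```

Reading the current page is then one AD lookup, one data map fetch and the chunk fetches; the old version stays readable through `ad[0]`.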

I think this is important and where we do need to get one part right first and then solidly look at true mutable data elements. It’s a larger discussion: if we had immutable public data and folk published websites in MD types, then it is not perpetual at all. So these parts need huge debate to get them right. As the network does not know what those MD content types are, it means stuff can be hidden, or web pages put there for easy edit, but that does scrub history as well. So I agree it would be great, but I suspect it also needs to be clarified what can be done there, like not being able to address a field and publish that field, or not etc.

dirvine · February 10, 2019, 11:42am

Yes it will be (AFAIK)

I would hope the last entry is the first one you Get if you ask for the AD item. Then for history you get the whole thing and traverse it. The RFC is being finalised by others though, so this should be the case IMO, but let’s see if we need to debate it in the RFC.

neo · February 10, 2019, 11:43am

It is perpetual if you version-keep all versions. This could be made optional for temp files if we could be sure it’s not able to be abused. Or there is not an option to not do it.

dirvine · February 10, 2019, 11:44am

I agree, but we would need a mechanism to identify temp files and the edge cases around that.

oetyng · February 10, 2019, 11:52am

It seems AD is much smaller than MD then.

When you store ‘345’ to an AD, that’s the only value you have there (as long as you only GET, with param includeHistory=false).

While an MD can have ‘0’ to ‘999’, and you always get all of them.
EDIT: (And I assume now, that behind every MD value, there is ulong.MaxValue number of previous versions, but not exposed through the API).

dirvine · February 10, 2019, 11:57am

Yes, I expect this to be a known value as folk appending will need to know they are at the last entry before full. In that case it will link to another AD, I hope. So perhaps the last entry is a different call/rpc etc. where if you try and write to it then you get an error showing you need to link this, or perhaps the network just will do that in a way that it is invisible to users. However in the latter case the return of an all history call may include a few AD items.
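One way to picture the linking described here is a chain of fixed-size ADs where the last slot is reserved for a link to the next one. This is only a sketch of the idea, not the RFC: the class, the function names and the tiny capacity are hypothetical, and whether the app or the network does the linking is still open.

```python
class AppendableData:
    CAPACITY = 4  # tiny for the demo; the thread suggests around 1000 entries

    def __init__(self):
        self.entries = []
        self.next = None  # link to the continuing AD once this one is full

    def is_full(self):
        # The last slot is reserved for the link, so one fewer entry fits.
        return len(self.entries) >= self.CAPACITY - 1


def tail_of(head):
    """Follow the links to the AD currently being appended to."""
    node = head
    while node.next is not None:
        node = node.next
    return node


def append(head, value):
    """Append a value, linking a fresh AD when the current tail is full."""
    tail = tail_of(head)
    if tail.is_full():
        tail.next = AppendableData()
        tail = tail.next
    tail.entries.append(value)


def latest(head):
    """The current value: the last entry of the last AD in the chain."""
    tail = tail_of(head)
    if not tail.entries:
        raise LookupError("nothing has been appended yet")
    return tail.entries[-1]


def history(head):
    """An all-history call: may span a few linked AD items."""
    out = []
    node = head
    while node is not None:
        out.extend(node.entries)
        node = node.next
    return out


head = AppendableData()
for version in range(7):
    append(head, version)
```

With three usable slots per AD, the seven appends above end up spread over three linked ADs, which is the traversal cost raised earlier in the thread.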

JPL · February 10, 2019, 12:10pm

Are there any examples of appendable-only CRDT applications in the wild outside of the logfile use case? A quick search tells me that CRDTs are used by Riak and Dynamo NoSQL Key-Value databases. The detail is rather over my head, but I can’t see the word ‘appendable’ there. Another one is OrbitDB on IPFS-logs, but again the use case seems to be logfiles.

dirvine · February 10, 2019, 12:20pm

Append only is a list of pointers (for us these can be ID’s as well).

CRDT is a conflict-free replicated data type; these are generally not byzantine tolerant.

Append or constant grow is CRDT capable, i.e. no delete or edit. There are some nice papers on CRDT in the wild for edit etc. https://www.researchgate.net/publication/268332406_Byzantine_Fault_Tolerance_for_Services_with_Commutative_Operations

Note that we do not currently use CRDT functionality in SAFE but use PARSEC or eventual consistency for agreeing what a data element is. However if the underlying data is CRDT capable (i.e. easily used in a CRDT network) then it means we have the ability to use CRDT functions and sets to be able to agree on these types without having to go to full consensus (ordering).

In terms of networks that do this, I am not sure; blockchain has such a mechanism. Riak and others like Paxos-based networks have CRDT types, but importantly not in byzantine networks. Byzantine fault tolerant networks are where I see SAFE at the leading edge. So public stuff like bittorrent, freenet etc. do some things to stop overwrite with different data, but not CRDT. CRDT is more prominent in protected networks, i.e. amazon used vector clocks in dynamo to achieve this (a similar data type that only grows).

We use appendable data as a term, others will not do this. So even ORSWOT/OR-sets in CRDT land are appendable (ORSWOT is debatable as you do drop elements). If you look at an OR-set though, then any removed item is in the removed item list, so it is still there, but can be seen as removed. Clients will not access the removed set and only get the added part.
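A tombstone-based OR-set along those lines can be sketched like this. It is a textbook-style illustration with hypothetical names, not anything SAFE uses today, and it has no byzantine protection.

```python
import uuid


class ORSet:
    """Observed-remove set where both internal sets only ever grow."""

    def __init__(self):
        self.added = set()    # (element, unique tag) pairs
        self.removed = set()  # tombstones: pairs that have been removed

    def add(self, element):
        # Every add gets a fresh tag so it can be told apart from earlier adds.
        self.added.add((element, uuid.uuid4().hex))

    def remove(self, element):
        # Only tombstone the adds this replica has actually observed.
        self.removed |= {(e, t) for (e, t) in self.added if e == element}

    def values(self):
        # Clients only see what was added and not removed.
        return {e for (e, t) in self.added - self.removed}

    def merge(self, other):
        # Merging is a plain union, so replicas converge without ordering.
        self.added |= other.added
        self.removed |= other.removed


a, b = ORSet(), ORSet()
a.add("x")
b.merge(a)
b.remove("x")   # b removes the "x" it has seen
a.add("x")      # meanwhile a adds "x" again
a.merge(b)
b.merge(a)
```

After both merges the two replicas agree that "x" is present, because the second add was never observed by the remove, while the removed pair is still held in the tombstone set.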

bones · February 10, 2019, 2:44pm

There’s a lot of discussion and I see good and bad points, as others have pointed out.

Which option gets us to a mvp without compromising network security?

Coders and engineers will always see something that could be done differently.

I just want a functional, safe network in the wild.

happybeing · February 10, 2019, 3:46pm

I’ve read /skimmed all this and still am not clear what is being discussed. I’ll wait for the RFC

JPL · February 10, 2019, 4:05pm

Your excellent post raises some very pertinent points about why this is a complex area that needs a lot of thought. The censorship game has changed. It’s more about creating multiple alternative ‘truths’ to sow mistrust in the genuine fact, diluting rather than deleting it. There’s nothing new about this tactic but these days using trolls and bots to spread misinformation and disinformation rather is cheaper and more effective than banning outright. Even those arch-banners in the Chinese government are doing this now (although they ban plenty of stuff too). Coincidentally I just read an article about that here The new censors won’t delete your words — they’ll drown them out

dirvine · February 10, 2019, 4:22pm

It is not that published “anything” is true; the truth part is “this was published” (and perhaps by this authority), if that makes sense. So we do not in the future rewrite what was actually said and done now.

happybeing · February 10, 2019, 4:41pm

One of the goals here is to create accountability. This is one of the things that we lose at scale, but if the record is public and permanent, we all can be held to account for what we said in public in the past. Through that we can all have a better chance to filter out rubbish, and tune into a more reliable version of reality.

dirvine · February 10, 2019, 5:44pm

I think this is key: if we capture everything published and make it immutable then I am sure there are/will be algorithms to find the most likely truth for anything published. Hopefully that includes studies, records and reports as well as the usual claimed truth. I would love where all these wars are happening to have records of what we do. When I was in the Middle East, the best place for western folks was considered Iraq, until we made them our enemy. There is a ton of this about, so capturing it all for our descendants to evaluate will be important.

So rather than history being written by the victor we hopefully will have enough proof points for our descendants to write a better history than we have had, well we can hope so, but step 1 must be, do not allow published stuff to change.

Toivo · February 10, 2019, 5:53pm

How do you think this affects the will to publish in SAFE Network?
