Appendable Data discussion

appendable-data
immutable-data
mutable-data

#101

So is that on the network or locally on my computer? Must be on my computer I guess otherwise would be in some sort of data type right?


#102

Personally I don’t see anything wrong in having permanent data. In fact, I welcome it, and from a consumer point of view, keeping a data history is what initially interested me in this project. Classically, history is always edited, and is written by the victor. Having data history means a lot to historians, because they will have uncensored sources. Way back in the early 90s, I had one of the few listed websites on the WWW, called The Naked Truth, on Yahoo, alongside the Captain Kirk Sing-a-long page and the Evil Clown page. Today that site is gone, and so are the original files. So my point: I wish I could have preserved history.


#103

everything is on the network.

Yes in a data field of your account record that only the network can modify.

Mind you, append-only ADs mean that the set of ADs holding your account info will keep growing, because every time you do something account-wise the account record has data appended, and next time those appendices have to be processed to recreate your account information.


#104

Yes that is fine. As long as it is not using appendable data objects.

I suggested keeping each version of the MD as it is mutated. This speeds things up by not requiring the reconstruction of the data after processing the appends. And makes a simple network access since the copy is just like an MD that cannot be modified. (ie read only).


#105

@dirvine David, I understand the keeping of files, data and web data/sites in perpetuity, and fundamentally fully agree with this in principle and practice. The issue that most concerns me is the use of Appendable Data as the ONLY way to achieve this.

You mentioned an issue of perspectives and how we are potentially misreading yours. This may well be true, but if it’s a fact that the direction is to implement perpetuity via ADs then it is a direction change from the MD that has been discussed many times in the past. It’s OK to change things, but I feel it is a change for the worse in a practical sense.

Can you confirm if the following questions are true or not and please provide some thoughts on the issues/questions contained in each point.

  1. immutable file chunks and AD are somehow being combined as one as per your implication further up, or is this a misunderstanding? You were talking of having only one type of data and that immutable video files are somehow also able to contain the MD (now becoming AD) data types. This is very confusing when you make such arguments. And if I am confused as to what is happening then I am sure others will be too

  2. The new AD (appendable data) type which is replacing the MD (mutable data) type means that whenever a change is made to one or more bytes then a new field is created with the changes in that field. How are the changes to be stored?

    • will the change be something along the lines of bytes xxx-yyy are changed to whatever OR
    • The previous field is flagged as old and the complete data is stored in the new field.
    • And I imagine that just adding (appending) to the data results in the new field just holding the new data to be appended.
    • What happens when the changes exceed the 1MB size of the AD?
    • What happens when hundreds of changes have been made, e.g. to a web site or document?
      • will the system have to add an AD when a particular change cannot fit in the current AD?
      • how will it determine the address of the continuing AD? What if the ADs at the calculated addresses are in use? How will the linkage be done?
      • What happens when the web page is 900KB long and it is changed 50 or 100 times? How many ADs will be linked for this one? How long will the processing take to get the actual page to display? In other words, will the browser have to retrieve the 50 or 100 ADs that this one AD has grown to?
  3. In your talk with @fergish, you say at 45:36-> that

when we talk of data in perpetuity we don’t mean every single scrap of nonsense that you write on a scrap piece of paper or draw something in a window in your brain or thought … when we … talking about data we are talking of data you want to keep and you value that data enough that you’re prepared to make a tiny tiny tiny small payment - almost like - if you get out the good writing paper ?? want to keep and make data last forever - just not something we have as a fluffy goal that would be ?? This is something that we feel is for humanity we must do that must be able …

  • so why aren’t the temporary files that editors use (either via safe-drive or native safe app) not considered as digital scraps of paper?
    • the temp file is a jumble to anything other than that editing session. The context to the editor is lost once you start a new session or start on a new file
    • The original file is still there in immutable data (or in your new one combined ADs)
    • the new version of the file is still there in immutable data (or …
  • I put it to you that the temp file is that digital version of the scrap of paper that is used to keep the changes for undos and unexpected app close, till the new file is written. After the new file is written the temp file loses all context and its content cannot be used again even by the editor since there is no context for it.
  4. As I alluded to before - by using append-only data types (AD) there has to be a method to provide the data as it now is.
  • how is that process to run? Especially once the changes (appendixes) cause the AD to overflow into one or more additional ADs
    • is it done by some manager network side?
    • is it done by the client code?
    • Is it done by the application?
  • how will the implementation prevent the access time of a particular AD increasing as more and more appendixes are added due to changes and multiple ADs need to be read to get the data for that original AD? I am especially thinking of changes adding appendixes that cannot fit on the original AD and new additional ADs are needed to store the appendixes.
    • what will the additional time (lag time etc.) needed to access these additional appendix ADs do to performance? For example, the 900KB web page that undergoes multiple changes, each adding maybe 300KB to 900KB as an appendix, so that the AD is now spread across many ADs.

Also can you comment on my proposal above to keep the essentials of the MDs as outlined when MDs were being introduced. You know that an MD can be mutated/appended to/deleted

  • The proposal mainly introduces keeping the previous version of an MD as a history. The changes to an MD causes a new version of the MD which essentially is a new data object to be written to the vaults.
    • this means that history is kept - ie perpetuity of data
    • Owner is also kept which is an essential part of the data. (who wrote what is important)
    • Effectively the version # becomes the lower bits in the final address of the actual data object to be read. This can be done by either reducing the # of bits in an MD address or increasing the address space of these objects.
  • The one exception is to allow a “temp” MD where the “keep-a-copy-of-old-versions” flag is set to false.
    • This allows apps and web browsers to know if the data from the MD is temporary and can either flag it or the app running in the browser can reject it or not.
    • OR the API to retrieve an MD can have an “Allow-Temp” option which defaults to no. This allows the 90-99% of APPs that do not trust these MDs to not even see them. And the other APPs (utility or, say, data handling APPs) can access them if they want. So then editors can have their temp files if they wish.
  • even data handling APPs or database APPs can use the version-keeping MDs without the penalty of access speed slowdowns due to expanding appendixes to trawl through.

So by using this proposal you can prevent any loss of speed due to trawling through appendixes, which over time would cause continually increasing access times for any data that is changing, but keep the benefit of perpetual data.

If you decide that making the keeping of copies optional (for temp files/data) is bad or can be abused, then losing the ability to turn off history would not affect the APPs using temp files or temp data, or database-type apps, since they are just accessing the latest version of the MD.
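
A minimal sketch of the version-keeping proposal above, in Python. Everything in it is hypothetical: `VersionedStore`, `VERSION_BITS`, `keep_history` and `allow_temp` are illustrative names, not part of any SAFE API. It assumes the version number occupies the low bits of the final address, so any version is a single lookup and nothing is replayed on read.

```python
from dataclasses import dataclass, field

VERSION_BITS = 16  # assumed width of the version field inside the address


@dataclass
class VersionedStore:
    """Hypothetical vault holding one read-only object per MD version."""
    objects: dict = field(default_factory=dict)  # full address -> bytes
    latest: dict = field(default_factory=dict)   # base address -> latest version
    temp: set = field(default_factory=set)       # base addresses with history off

    @staticmethod
    def full_address(base: int, version: int) -> int:
        # The version number becomes the low bits of the final address.
        return (base << VERSION_BITS) | version

    def put(self, base: int, data: bytes, keep_history: bool = True) -> int:
        version = self.latest.get(base, -1) + 1
        if version >= (1 << VERSION_BITS):
            raise OverflowError("version field exhausted")
        if not keep_history:
            self.temp.add(base)
            if version > 0:
                # A "temp" MD drops the previous version instead of keeping it.
                self.objects.pop(self.full_address(base, version - 1), None)
        self.objects[self.full_address(base, version)] = data
        self.latest[base] = version
        return version

    def get(self, base: int, version=None, allow_temp: bool = False) -> bytes:
        # Apps that do not opt in never see temp MDs.
        if base in self.temp and not allow_temp:
            raise PermissionError("temp MD hidden unless allow_temp=True")
        if version is None:
            version = self.latest[base]
        return self.objects[self.full_address(base, version)]
```

With this layout, `store.get(base)` returns the current data and `store.get(base, version=0)` returns the original, each as one direct lookup.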


#106

An additional question:

Is this correct?:
An MD always has been append only, in that there are ulong.MaxValue number of entries in each of the 1k MD entries (one for every version), but that the API has only exposed the most recent version (and calculated size only for the most recent versions? Otherwise you’d soon have no more space, even though size of visible data was not that big).


#107

Man this is a long convo, there is defo a miscommunication here.

Can I start this reply with: AD is what we currently have, sans multisig. We are not losing here but gaining. The only thing we remove is the zeroing out of a value in a key-value pair. This allows AD to be more concise and exact in its specification. It does not prohibit more data types down the line in any way.

Also, AD and ID do not mean apps cannot edit data. They can, and they can remove things from presentation (“delete”), so apps will just see a network they can do all these things on from a user perspective. I think folk think we are removing the ability for users to edit data here, and we are certainly not.

I don’t think it is, but it is what we have now in a pseudo way. True Appendable Data as we will provide it makes this more elegant.

Data is immutable, due to the immutable data type. Where to find that is the appendable data type. If we end up in the future with actual Mutable Data (edit/delete) then that is all OK, but it has to be in a way that other things like perpetual data cannot be damaged.

2 types: Immutable Data (hash of content == name), which is self-checking data. Then appendable data; it is easy to think of this as pointers to immutable data. This allows data representation to change, i.e. now this file has this content (i.e. it points to new chunks). All the history and data is still online though.
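
A rough Python sketch of the two types as described here, under stated assumptions: SHA-256 stands in for whatever hash the network actually uses, and `put_immutable`, `get_immutable` and the plain list used as an AD are hypothetical helpers, not SAFE APIs.

```python
import hashlib

chunks = {}  # hypothetical immutable store: name -> content


def put_immutable(content: bytes) -> str:
    """Store a chunk under the hash of its content (self-checking data)."""
    name = hashlib.sha256(content).hexdigest()
    chunks[name] = content
    return name


def get_immutable(name: str) -> bytes:
    content = chunks[name]
    # Anyone can verify a chunk: its name must equal the hash of its content.
    if hashlib.sha256(content).hexdigest() != name:
        raise ValueError("chunk does not match its name")
    return content


# Hypothetical appendable data: an append-only list of pointers to chunks.
appendable = []
appendable.append(put_immutable(b"file content, version 1"))
appendable.append(put_immutable(b"file content, version 2"))

current = get_immutable(appendable[-1])   # what the file is now
original = get_immutable(appendable[0])   # history is still reachable
```

Changing the file only appends a new pointer; the earlier chunk and the earlier pointer both stay in place.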

An immutable data chunk can also be a list of pointers, so it allows a mix of data types, for instance to make up a video etc. Think of it like this: video == 1 billion chunks, so the AD has pointers to 1000, but the last pointer is an ID chunk that has more pointers, etc. This is how you can have a mix of data types to represent a file.

I am aware the guys are finalising the RFC here so don’t want to step on toes, but related to the last point: AD is a finite list of pointers (those can be identities etc., basically xor addresses). The last item of this list will be a pointer to further history contained in an ID chunk.
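
One way to picture a finite pointer list whose last item links to an immutable chunk holding more pointers is the Python sketch below. The names (`build_pointer_list`, `walk`, `MAX_POINTERS`), the JSON encoding and the use of SHA-256 are all assumptions made for illustration; the RFC may define this quite differently.

```python
import hashlib
import json

store = {}        # hypothetical immutable chunk store: name -> bytes
MAX_POINTERS = 4  # tiny limit for illustration; the thread mentions about 1000


def put(content: bytes) -> str:
    name = hashlib.sha256(content).hexdigest()
    store[name] = content
    return name


def build_pointer_list(names):
    """Return at most MAX_POINTERS entries; overflow goes into a linked chunk."""
    if len(names) <= MAX_POINTERS:
        return {"pointers": list(names), "next": None}
    head, rest = names[:MAX_POINTERS - 1], names[MAX_POINTERS - 1:]
    # The last slot points at an immutable chunk that holds more pointers.
    overflow = put(json.dumps(build_pointer_list(rest)).encode())
    return {"pointers": list(head), "next": overflow}


def walk(pointer_list):
    """Yield every data pointer, following continuation chunks in order."""
    while pointer_list is not None:
        yield from pointer_list["pointers"]
        nxt = pointer_list["next"]
        pointer_list = json.loads(store[nxt]) if nxt else None
```

Reading everything costs one extra fetch per continuation chunk, which is the access pattern the questions earlier in the thread are asking about.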

AD is as MD is now, when you register a change it appends a new list item (right now it is a key value pair, where the value nullifies). So change there is not a key value, but a list, so smaller, more efficient and not scrubbing history. The existing thing scrubs history but leaves the data in place (zeroed out).

If you wish to put temp stuff on a perpetual data network then you can, but there is a network cost to doing so and the network will need to take a payment. So now all we do is say it will be permanent and this is the cost. If it were not permanent then you need to sign it to prove you can delete it, then we need to know you have not shared it, then perhaps it can be deleted. BUT you just exposed the inner workings of your “stuff” to the world on a network where folk can ask for your stuff. Did you really encrypt all that stuff? If you did, it must have been in RAM etc. or stored on disk locally (please not plaintext). All this is much more additional work/design to allow folk to use the network as a temporary storage solution? It might be easier to ask the client managers you are connected to for some play areas etc. It is a new feature though, one that needs to be really thoroughly thought out.

I claim this is all not thought out as much as it should be and is additional work/design/code and will delay launch. If the whole community want us to do that then it can be done, but it is absolutely not required at all. Local apps can use local temp storage that is secured and wiped, that is faster and more efficient. Where they cannot then they log into a bigger machine that can, at least right now.

I agree, so say you render a 100GB video: that render can create many terabytes of temp data. It does so locally and scrubs it, but if you wanted the network to hold it you have huge bandwidth costs etc. (SAFE and networks like it are bandwidth bound).

It is doing as data is doing now, but more efficiently and not scrubbing entries.

By the app as the last item is that AD completed. So the app has to make the last item a link to another data element or the AD is closed (just as now).

Just as now really.

Of course, but this would be to do more than we currently do in Alpha2 and will need consensus on edit/delete, as you can have collisions there even from a single user logged into 2 separate devices and so on. So there has to be consensus on those changes, therefore it will be slower. If data is CRDT-like then you can do things very fast compared to consensus-requiring data types. This is due to the latter requiring a mechanism to handle conflicts. Even with append-only there are opportunities for conflicts, but they are much simpler to solve.

I think the Appendable Data RFC needs to happen sooner rather than later for the sake of the community, to show this is expanding what we have in alpha 2, so all alpha2 apps can work but more will be possible with true AD and multisig.

I feel there is a knee-jerk reaction here where folk think we are losing something and need to quickly design alternatives, but we are not losing stuff here, we are getting more.

Then a debate on temp files, rewriting history, deleting certain data etc. can take place without distraction of quickly repairing something that is not broken :wink:

EDIT - I burned my porridge answering that, the price we pay :smiley: :smiley:


#108

Can I point out we are losing the promised or envisioned MD structure that was to be implemented.

Yes that is understood and clear in my discussions

There are other very elegant methods. Remember that being elegant for one usage does not equate to being elegant in a general sense. Appendable data is one method that is a computer science concept that is studied, but then students move on to better methods. Appendable data is great for logging systems where change is minimal and most data is appended.

As my proposal does and gives you the cake and you can eat it too. Appendable data causes the workload to be placed at the time of reading and since data is typically read many more times than written then it ends up causing greater computing effort reading and a consequential slow down as simple examples show.

Whereas version-keeping MDs require little or no extra processing in the storing and no extra processing in reading. This is CS201 stuff. Also ADs increase energy usage due to extra processing when trawling through the changes to reconstruct the data (or follow index links), whereas version-kept MDs do not.
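
The read-cost argument can be illustrated with a small Python sketch. It assumes an AD stores each change as an (offset, bytes) delta that must be replayed on read, which is precisely the point under debate in this thread, so treat it as one possible reading rather than how AD works. All names are hypothetical.

```python
def read_by_replay(base: bytes, appends):
    """Rebuild current data by applying every appended (offset, bytes) change."""
    data = bytearray(base)
    steps = 0
    for offset, patch in appends:
        data[offset:offset + len(patch)] = patch
        steps += 1
    return bytes(data), steps


def read_latest_version(versions):
    """Version-kept objects: the latest full copy is read directly."""
    return versions[-1], 1


base = b"hello world"
appends = [(0, b"J"), (6, b"W"), (10, b"D")]

# Writing side of the version-keeping scheme: store a full copy per change.
versions = [base]
for i in range(len(appends)):
    versions.append(read_by_replay(base, appends[: i + 1])[0])

replayed, replay_steps = read_by_replay(base, appends)
direct, direct_steps = read_latest_version(versions)
```

Both reads return the same bytes, but the replay does work proportional to the number of changes on every read, while the version-kept read does not. The trade is that version keeping stores a full copy per change.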

Both provide perpetual data

So you are keeping the 2 types of data. Good.

Again an invalid situation. 1x10^9 * 1x10^6 == 1x10^15 bytes for a video. Hmmmm I find that highly unlikely 1,000 TB video file. About 200 thousand hours of 4K video.
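
The arithmetic here can be checked in a few lines of Python. The chunk size of roughly 1 MB comes from the thread; the figure of 5 GB per hour of 4K video is an assumption chosen because it reproduces the "200 thousand hours" estimate, not a number stated in the post.

```python
chunks = 10**9        # "1 billion chunks" from the example being answered
chunk_bytes = 10**6   # assumes roughly 1 MB per chunk
total_bytes = chunks * chunk_bytes

total_tb = total_bytes / 10**12

# Assumption for illustration only: about 5 GB per hour of 4K video.
bytes_per_hour = 5 * 10**9
hours = total_bytes / bytes_per_hour
```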

In any case the video itself is still immutable data and if I have the datamap of it then even your example falls apart since I have both ADs

In any case my proposal ALSO disallows the pointers to be modified if you take the version at the time of storing the video.

My proposal makes MDs as perpetual as ADs without the additional cost to reading the data as ADs introduce.

So is the original data copied over (with changes made) to the new place where it’s stored, or what?

This sounds like if I want the data held then I first access the AD which is a list of pointers then I have to get the actual data stored elsewhere???

But it still has more overhead than simply version-keeping the MDs. Here MDs means the promised MDs that were to be implemented.

But as now has not had the issues of much appending. Cannot do much with 1000 PUTs. The problem occurs when the data is actually being changed and modified. These problems will then increase over time.

The more network accesses you have to do to get the actual data slows down applications.

Sorry about that, you can always make more though.


Thanks for that, it makes a little more sense, but I still disagree that it’s better than simply implementing MDs as proposed for months (I know not yet implemented) and keeping a copy of each version.

A lot quicker than either trawling through changes or indexes which then require further network access to get the actual data.

And if an interim measure to get to Maxwell then so be it.


#109

What I was saying: if you have video rendering, especially animation-type work, then it will produce many times the size of the file. These systems create layers and tons and tons of temporary files to get the finished product. This is why video rendering is usually done on large machines with tons of RAM etc. If you tried to do that on a phone or system with zero local storage it would be cripplingly slow and mean uploading huge amounts of temp data. So it’s an extreme example, but valid: if we say allow all temp stuff on the network, it just won’t work in many cases.

Is it worth a separate thread where the OP is your “finalised” original proposal for discussion?

No the new data is new chunks that are pointed to with the new AD entry. This means the history of an AD list is a list of the versions of the original data.

The MD proposal zero’d out entries though, is this what you mean or another MD proposal?

I just added more honey, it was still good and now stuck to my ribs :wink:

I think this will be a concrete data type for the foreseeable future. If we then wanted in-place edit/delete of network data elements we would need to add another data type so they do not get coupled. Well, I would hope we have them separate, but you never know.


#110

Not clear really.

If I change 5 words (in random places) in my web page what is written?

So my web page is 750KB long. Where is the data stored? It sounds like the actual data is never stored in an AD but in immutable chunks (3 chunks for 750KB) and the AD only points to those 3 chunks.

So what happens with those 5 changes?

Do you write out 3 new chunks and append the new pointers?

OR do you store the 5 words and how to change the original 3 chunks to make the new web page?

What was understood is that the original MD stored the actual data in the MD fields. And each field could be changed (no zeroing or appending for a change)

For us here it’s like 28-30°C by breakfast time. Chilled milk on toasted oats with fruit here (if I have any) and coffee. Got to have that coffee.


#111

Hey David, this short one.
(Damn neo, you must short down your texts, I feel like we’re in a bird’s nest and only the biggest loudest chick gets any worms :joy:)

Is an AD 1k of value entries, i.e. a list?
What mechanism tells us where the last entry is at, will there be a field for current index or must we search from index zero?


#112

Either those chunks or another chunk that only holds the open (unencrypted) data map.

Yes new chunks and new pointer in this case, but it is not the only way, it is the simple way to do it.

I think this is important, and where we do need to get one part right first and then solidly look at true mutable data elements. It’s a larger discussion: if we had immutable public data and folk published websites in MD types, then it is not perpetual at all. So these parts need huge debate to get them right. As the network does not know what those MD content types are, it means stuff can be hidden, or web pages put there for easy edit, but that does scrub history as well. So I agree it would be great, but I suspect it also needs to be clarified what can be done there, like being able to address a field and publish that field, or not, etc.


#113

Yes it will be (AFAIK)

I would hope the last entry is the first one you Get if you ask for the AD item. Then for history you get the whole thing and traverse it. The RFC is being finalised by others though, so this should be the case IMO, but let’s see if we need to debate it in the RFC.


#114

It is perpetual if you version-keep all versions. This could be made optional for temp files if we could be sure it’s not able to be abused. Or there is simply no option to not do it.


#115

I agree, but we would need a mechanism to identify temp files and the edge cases around that.


#116

It seems AD is much smaller than MD then.

When you store ‘345’ to an AD, that’s the only value you have there (as long as you only GET, with param includeHistory=false).

While an MD can have ‘0’ to ‘999’, and you always get all of them.
EDIT: (And I assume now, that behind every MD value, there is ulong.MaxValue number of previous versions, but not exposed through the API).
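
A small Python sketch of the GET behaviour described in this post. The class and its `include_history` parameter (mirroring the `includeHistory` name used above) are hypothetical; nothing here reflects a confirmed API.

```python
class AppendableData:
    """Hypothetical AD: an append-only list of values."""

    def __init__(self):
        self._entries = []

    def append(self, value):
        self._entries.append(value)

    def get(self, include_history: bool = False):
        if not self._entries:
            return [] if include_history else None
        # Default GET returns only the newest entry; history is opt-in.
        return list(self._entries) if include_history else self._entries[-1]
```

So after appending '123' and then '345', a plain `get()` returns only '345', while `get(include_history=True)` returns both entries in order.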


#117

Yes, I expect this to be a known value as folk appending will need to know they are at the last entry before full. In that case it will link to another AD, I hope. So perhaps the last entry is a different call/rpc etc. where if you try and write to it then you get an error showing you need to link this, or perhaps the network just will do that in a way that it is invisible to users. However in the latter case the return of an all history call may include a few AD items.
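
Both options mentioned here (an error that tells the writer to link, or linking done invisibly) can be sketched in Python as below. `BoundedAD`, `ADFull`, `append_with_auto_link` and `full_history` are made-up names, and reserving the last slot for the link is an assumption.

```python
class ADFull(Exception):
    """Raised when only the reserved link slot is left."""


class BoundedAD:
    def __init__(self, capacity: int):
        self.capacity = capacity  # total slots, the last one reserved for a link
        self.entries = []
        self.next_ad = None       # set once this AD is linked to a successor

    def append(self, value):
        if self.next_ad is not None or len(self.entries) >= self.capacity - 1:
            raise ADFull("link a new AD before appending")
        self.entries.append(value)

    def link(self, successor: "BoundedAD"):
        if self.next_ad is not None:
            raise ADFull("already linked")
        self.next_ad = successor


def append_with_auto_link(tail: BoundedAD, value) -> BoundedAD:
    """The 'invisible to users' option: link a fresh AD when the tail is full."""
    try:
        tail.append(value)
        return tail
    except ADFull:
        successor = BoundedAD(tail.capacity)
        tail.link(successor)
        successor.append(value)
        return successor


def full_history(head: BoundedAD):
    """An all-history call may span several AD items."""
    out, node = [], head
    while node is not None:
        out.extend(node.entries)
        node = node.next_ad
    return out
```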


#118

Are there any examples of appendable-only CRDT applications in the wild outside of the logfile use case? A quick search tells me that CRDTs are used by Riak and Dynamo NoSQL Key-Value databases. The detail is rather over my head, but I can’t see the word ‘appendable’ there. Another one is OrbitDB on IPFS-logs, but again the use case seems to be logfiles.


#119

Append only is a list of pointers (for us these can be ID’s as well).

CRDT is conflict-free replicated data type; these are generally not byzantine tolerant.

Append or constant grow is CRDT capable, i.e. no delete or edit. There are some nice papers on CRDT in the wild for edit etc. https://www.researchgate.net/publication/268332406_Byzantine_Fault_Tolerance_for_Services_with_Commutative_Operations

Note that we do not currently use CRDT functionality in SAFE but use PARSEC or eventual consistency for agreeing what a data element is. However if the underlying data is CRDT capable (i.e. easily used in a CRDT network) then it means we have the ability to use CRDT functions and sets to be able to agree on these types without having to go to full consensus (ordering).
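
To show why grow-only data is easy to agree on without ordering, here is a minimal Python sketch of a grow-only set, the simplest CRDT. It is a generic textbook illustration, not SAFE code, and it uses an unordered set rather than the ordered pointer list an AD would hold.

```python
def merge(a: frozenset, b: frozenset) -> frozenset:
    """Grow-only set merge: set union is commutative, associative, idempotent."""
    return a | b


replica_1 = frozenset({"entry-a", "entry-b"})
replica_2 = frozenset({"entry-b", "entry-c"})
replica_3 = frozenset({"entry-d"})

# Replicas can exchange state in any order and still converge.
one_way = merge(merge(replica_1, replica_2), replica_3)
other_way = merge(replica_3, merge(replica_2, replica_1))
```

Because nothing is ever removed or edited, no replica needs to know which update came first.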

In terms of networks that do this, I am not sure; blockchain has such a mechanism. Riak and others, like Paxos-based networks, have CRDT types, but importantly not in byzantine networks. Byzantine fault tolerant networks are where I see SAFE as being at the leading edge. So public stuff like bittorrent, freenet etc. do some things to stop overwrite with different data, but not CRDT. CRDT is more prominent in protected networks, i.e. Amazon used vector clocks in Dynamo to achieve this (similar data type that only grows).

We use appendable data as a term; others will not do this. So even ORSWOT/OR-Sets in CRDT land are appendable (ORSWOT is debatable as you do drop elements). If you look at an OR-Set though, then any removed item is in the removed item list, so it is still there, but can be seen as removed. Clients will not access the removed set and only get the added part.
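
The OR-Set behaviour described here, where removed items stay in a removed list, can be sketched in Python as follows. This is a simplified textbook-style version with hypothetical names, using random tags from `uuid`; it is not taken from Riak or any SAFE code.

```python
import uuid


class ORSet:
    """Observed-remove set: removals are recorded, never erased."""

    def __init__(self):
        self.added = set()    # (element, unique tag) pairs
        self.removed = set()  # (element, tag) pairs observed at removal time

    def add(self, element):
        self.added.add((element, uuid.uuid4().hex))

    def remove(self, element):
        # Only pairs this replica has observed are moved to the removed set.
        self.removed |= {pair for pair in self.added if pair[0] == element}

    def merge(self, other: "ORSet"):
        # Both sets only ever grow, so merging is a plain union.
        self.added |= other.added
        self.removed |= other.removed

    def value(self):
        """What clients see: elements with at least one tag not yet removed."""
        return {element for (element, _tag) in self.added - self.removed}
```

Both internal sets are grow-only, which is the sense in which the structure is appendable: a removed item is hidden from `value()` but is still recorded.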


#120

There’s a lot of discussion and I see good and bad points, as others have pointed out.

Which option gets us to an MVP without compromising network security?

Coders and engineers will always see something that could be done differently.

I just want a functional, safe network in the wild.