[RFC] Data Types Refinement

Ah, I don’t know which past topic you are referring to. I might have been unclear at some point (this topic here is the only one where I’ve discussed Map, though, I think).

To clarify:

Map replaces MutableData. MutableData features go more or less into PrivateMap. What is newly introduced is PublicMap, so as to be consistent with the other types (every type has a Private and a Public scope).

In reality though, it is the key-val semantics of AppendOnlyData that has been extracted out into a Map type, and MutableData has more or less been scrapped, with the exception of a few parts that were merged in.

This is because the AD key-val parts and MD were overlapping (two different implementations of key-val that were never consolidated before). There was no Private|Public duality over one single type; instead it was spread over two different types (MD and AD), and at the same time AD was in essence two types (Map and Sequence). That Map was very limited though, meaning there was no fully capable equivalent of a “Public MutableData”. Fixing that is the biggest feature addition in this RFC.

You can do Update and Delete on both Private and Public Map, but on the Public one, there’s always a history of values for every key - i.e. no public data can be truly deleted.

3 Likes

I suppose you could even add ( Private | Public ) as an Enum parameter. Not sure whether that would be better or not. It simplifies the data types a bit, but makes it slightly less immediately clear if something is private or public.

2 Likes

Might be, eventually, with other improvements that are being planned as we speak.
I think for now, with this RFC, it’s good as is.

But there is currently a “step 2” of the data types work, which, if accepted, will go into a new RFC. It deals with the backend handling of data, but also with how the Private | Public scope applies to the data type representations and the underlying data.

So, these things are likely to be discussed there in that case.

6 Likes

Sorry, I meant a past post in this topic. More precisely:

PublicMap is equivalent to these 2 elements (previous Sequenced/Unsequenced subclassing being replaced by the new ExpectedVersion enum), simply because:

and so, logically, PublicMap shouldn’t exist.

I am not sure I understand what you intend to do about public/private scopes, but don’t forget that we need the previous AppendableData feature not only in a public way, but also in a private way.

Edit: To be more precise on my doubts:

Historical values is a feature that should be orthogonal to public/private scope.

1 Like

I think you need to read the RFC again, because the things you say don’t make sense.

These few lines from the RFC should say it all:

This proposal replaces MutableData and AppendOnlyData .
It merges MD and the key-val part of AD into a single type Map , and separates a Sequence type out from AD .

Map forms a perpetual MD in private and public form, while Sequence is essentially an AD without key semantics.

[…]

PrivateMap ~ ( UnpublishedUnsequencedMutableData / UnpublishedSequencedMutableData )
PublicMap - New!

PrivateSequence ~ ( UnpublishedUnsequencedAppendOnlyData / UnpublishedSequencedAppendOnlyData )
PublicSequence ~ ( PublishedUnsequencedAppendOnlyData / PublishedSequencedAppendOnlyData )

But let me know if it’s still unclear.


It’s the same as before.


Public types are perpetual, that means nothing is deleted, data is only appended, i.e. a “history of values” builds up.

3 Likes

I’ve been reflecting on what I’ve read in the forum and on github with regard to how the data types are used/structured, in addition to the ideas presented in this RFC. I wanted to offer the following diagram and set of definitions as a variation of a possible object hierarchy. I won’t go into the nuances of private vs. public and guarded vs. unguarded flavors discussed in this RFC since those are easy extensions.

A few definitions.

  • Chunk : a generic term to refer to any contiguous piece of data that is 1MB in size.
    • Block : a permanent and immutable Chunk that cannot be changed once it is written. Its XOR address is based on a hash of the data content.
    • Blob : a mutable Chunk that can be updated. All of a Blob’s content is rewritten during every update, but its unique XOR address stays the same.
    • Blend : an appendable Chunk that consists of 250 lines/entries, each 4kB in size. Only the first empty line is written during each update. Each entry is immutable once written, but the Chunk itself is mutated as new lines are added, hence it is a Blend of a Block and a Blob.

  • File : The highest level of organization for a set of data. “Everything is a File…” before it enters the SAFE Network and when it leaves. This is what is provided by and to the human client.

  • Sequence : an ordered collection of Chunks where repetitions are allowed. Could be built with, or be an extension of, a Blend.

  • Map : A File descriptor that points to zero or more Chunks and/or Sequences. Could be built with, or be an extension of, a Blob. Small files could be stored entirely within a Map, without the need for a Sequence. This is analogous to the way that modern filesystems store small files within an inode, rather than taking up another entire disk sector.
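To make the hierarchy concrete, here is a minimal Rust sketch of these definitions. It is purely illustrative; all names, sizes and field layouts are my own assumptions, not part of the RFC.

const CHUNK_SIZE: usize = 1024 * 1024;   // 1 MB per Chunk
const BLEND_LINE_SIZE: usize = 4 * 1024; // 4 kB per Blend line
const BLEND_LINES: usize = 250;          // 250 * 4 kB ≈ 1 MB

/// A permanent, immutable Chunk; its XOR address is a hash of the content.
struct Block {
    content: Vec<u8>,
}

/// A mutable Chunk; rewritten in full on every update, the XOR address stays the same.
struct Blob {
    address: [u8; 32],
    content: Vec<u8>,
}

/// An appendable Chunk: immutable lines are appended until the Chunk is full.
struct Blend {
    address: [u8; 32],
    lines: Vec<[u8; BLEND_LINE_SIZE]>, // at most BLEND_LINES entries
}

/// An ordered collection of Chunks (repetitions allowed).
struct Sequence {
    chunks: Vec<[u8; 32]>, // XOR addresses, in order
}

/// A File descriptor: points to zero or more Chunks and/or Sequences,
/// or inlines small files directly (like small files in an inode).
enum MapEntry {
    Inline(Vec<u8>),
    Chunk([u8; 32]),
    Sequence(Sequence),
}

struct Map {
    entries: Vec<MapEntry>,
}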

18 Likes

Great to see all these ideas and inputs. It really feels like great group think and this is proving it works.

15 Likes

Thanks @jlpell!
Fantastic that you are putting in time and thinking of how we can do this better.

Your object hierarchy idea is similar to what I am working on as a step 2 of data types work. I get very tempted to put that out for comments now :smile:

I will look closer at your ideas soon, and I think they are going to be absolutely great to weave into that discussion.

10 Likes

I will rephrase it differently to try to make myself understood. I have 2 fears:

Firstly, when you say:

I interpret this as: if the map is private then previous versions of keys and values are not preserved and the map needs to be public to preserve their whole history. But by doing this, you lose former Unpublished(…)AppendOnlyData possibilities.

Secondly, the extraction of the notion of Map might open the door to a workaround of the Perpetual Web principle.

I cannot be more specific because you did not show how the elements you want to separate (map and sequence) will be recombined to implement the existing data structures.

Concretely, can you show how the following example will be translated in the new system:

Take an AD at version V0:

V0
KeyA ValueA
KeyB ValueB

Update it in a transaction with one insertion, one deletion, and one update; then it has 2 versions:

V0             V1
KeyA ValueA
KeyB ValueB    KeyB New ValueB
               KeyC ValueC

The current version is V1, but data at version V0 is still accessible. This was possible for both a published and an unpublished AD (this is about my first fear).

We couldn’t cheat because the AD was a single structure. With a diagram for an equivalent example in the new system, I could check that there is no risk of cheating with the extracted map part (this is about my second fear).

2 Likes

Ok, but I go through this quite extensively. Did you actually read the RFC again @tfa?
I think this part here answers all your questions. But I will go back there now and clarify even further.

From the RFC, under Map:

Key versioning

Both Public and Private Maps use key versioning. This means that there is a history of values for each key in the Map. An Update SHALL append a new value at the specific key, thus incrementing the key version. A Delete SHALL append a Tombstone to that key’s value array (and similarly increment the key version).
[…]
The difference between Private and Public is that the Private Map allows hard-delete and hard-update, which erase the actual value.
[…]

Private Map

The characteristics of a PrivateMap are that Delete actually deletes the value; that data is permanently removed from the network.
The PrivateMap API is extended with ‘hard_delete’ and ‘hard_update’ to allow for this.
Using delete, the actual value is still available in the history. With hard_delete, the value is deleted, but the previous history is kept intact. Deletion of a key along with its entire history is also possible (this is similar to how MD works today).
And ultimately, the entire instance can be deleted from the network.

hard_delete and hard_update

These operations replace the old value with a Tombstone, and then append another Tombstone in the case of delete, or the new value in the case of update.

The vector also reflects the version, which is the reason for inserting a Tombstone: it hard-deletes the data, but also increments the version.

Using hard_delete and hard_update operations, the data history would look something like this:

[value] -> update(new_value) -> [Tombstone, new_value] -> expected_version = 2
[Tombstone, new_value] -> delete -> [Tombstone, Tombstone, Tombstone] -> expected_version = 3

…while the ‘soft’ delete and update gives:

[value] -> update(new_value) -> [value, new_value] -> expected_version = 2
[value, new_value] -> delete -> [value, new_value, Tombstone] -> expected_version = 3

There will always be a version. The PrivateMap allows for erasing a value, but the version will always be bumped, so there’s no fooling anyone about that.
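As a rough illustration (not the actual safe-nd code), the per-key value history could be modelled as below; the names are my own and the transitions mirror the ones listed above.

#[derive(Clone, Debug, PartialEq)]
enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

type History = Vec<Entry>;

// Soft ops: the old value stays in the history.
fn update(history: &mut History, new_value: Vec<u8>) {
    history.push(Entry::Value(new_value));
}

fn delete(history: &mut History) {
    history.push(Entry::Tombstone);
}

// Hard ops (Private scope only): the old value is erased,
// but the version is still bumped by the appended entry.
fn hard_update(history: &mut History, new_value: Vec<u8>) {
    if let Some(last) = history.last_mut() {
        *last = Entry::Tombstone; // erase the actual value
    }
    history.push(Entry::Value(new_value));
}

fn hard_delete(history: &mut History) {
    if let Some(last) = history.last_mut() {
        *last = Entry::Tombstone;
    }
    history.push(Entry::Tombstone);
}

// The expected version is simply the length of the history.
fn expected_version(history: &History) -> u64 {
    history.len() as u64
}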

1 Like

Maybe the source of the misunderstanding is that after having extracted basic operations (key/value management and sequence management) I thought you would re-implement the original (Un)published(Un)sequencedAppendOnlyData multi-key feature by combining these basic operations in a certain way.

In fact you didn’t mention this, but in my mind it was obvious and implicit that this feature would be preserved. For example, it is used in the safe files sync command, because a directory is implemented by such an object.

Maybe the problem is that you don’t intend to do this and this explains why we are going in circles.

But if you intend to re-implement the original *AppendOnlyData structures, you should say more about it. For example you could explain how the use case example I mentioned in my last post would be handled.

2 Likes

I’ll clarify over the coming days and we’ll see where we land :slight_smile:

3 Likes

Just wanting to mention that I am very happy to see this simplification for data types :face_with_monocle: :hugs:

6 Likes

Each key in a Map has a vec. The length of the vec is the expected version. That is our convention for denoting the version we expect the key to have next. (The current version of a key is the last index of the vec.)
A Map instance also has a data version, which increments on changes to any key in the Map.
So, any time you do an operation on a Map key, the key version is bumped, but also the Map data version is bumped.

(Read more about expected version here)

Expected version is the value passed in for optimistic concurrency control; it is the version you expect to come next, not the current version. This is a convention.
You could pass in the current version, but then you’d need to be able to pass in a value meaning “empty”, which is a complication. Additionally, with an unsigned integer it would not be representable.

Here is the convention explained (only, it’s called index instead of version, something that got updated later): https://github.com/maidsafe/safe-nd/pull/126
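For illustration, a hedged sketch of that convention; the enum here is my own simplification, not the exact type from that PR.

enum ExpectedVersion {
    Any,           // don't check; always write
    Specific(u64), // must equal the version we expect the key to have next
}

fn version_matches(history_len: u64, expected: &ExpectedVersion) -> bool {
    match expected {
        ExpectedVersion::Any => true,
        ExpectedVersion::Specific(v) => *v == history_len, // expected next == current length
    }
}

// A key with history ["item_0", "item_1", "item_2"]:
//   current version  = 2 (last index)
//   expected version = 3 (length), which is what you pass with the next op.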

Examples

Private and Public Map

You have keys B and C.

The underlying representation looks like this:

"B": ["item_0"]
"C": ["item_0", "item_1", "item_2"]

But the current state view, would look like this

"B": "item_0"
"C": "item_2"

The expected versions are these

"B": 1
"C": 3
Map instance: 5

This is the same for Private and Public scopes.


Ops on Private and Public Map

Now say you want to send in a transaction consisting of 1 insert, 1 delete and 1 update:

tx {
    "insert" : [ { "A" : "inserted_val" } ],
    "delete" : [ "B" ],
    "update" : [ { "C" : "updated_val" } ],
}

(Here I have excluded the details of concurrency control. You could, for example, attach an expected version per key, or for the Map instance, and if any did not match, the entire tx would be rejected.)

The result for both a Public and a Private Map is the following:

Results of insert, delete, update:

The underlying representation looks like this:

"A": ["inserted_val"]
"B": ["item_0", "Tombstone"]
"C": ["item_0", "item_1", "item_2", "updated_val"]

But the current state view, would look like this

"A": "inserted_val"
"C": "updated_val"

The expected versions are these

"A": 1
"B": 2
"C": 4
Map instance: 8

You can see that the current state view considers the key B to be deleted.
But there is an API for accessing the key history, both the entirety, and a range, as well as accessing a single entry at a specific version of a key.

Again, this is exactly the same for Private and Public scopes.
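For the curious, here is a rough sketch (not the proposed API) of how such a tx could be applied to the underlying representation; the names are my own, and concurrency control and existence checks are left out, as in the example.

use std::collections::BTreeMap;

#[derive(Clone, Debug)]
enum Entry {
    Value(String),
    Tombstone,
}

// Underlying representation: each key maps to its history vec.
type KeyHistories = BTreeMap<String, Vec<Entry>>;

struct Tx {
    insert: Vec<(String, String)>,
    delete: Vec<String>,
    update: Vec<(String, String)>,
}

fn apply(map: &mut KeyHistories, data_version: &mut u64, tx: Tx) {
    for (key, val) in tx.insert {
        map.entry(key).or_default().push(Entry::Value(val));
        *data_version += 1;
    }
    for key in tx.delete {
        map.entry(key).or_default().push(Entry::Tombstone);
        *data_version += 1;
    }
    for (key, val) in tx.update {
        map.entry(key).or_default().push(Entry::Value(val));
        *data_version += 1;
    }
}

// Starting from B: ["item_0"], C: ["item_0", "item_1", "item_2"] and Map version 5,
// applying the tx above yields expected versions A: 1, B: 2, C: 4 and Map version 8.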


Ops on Private Map

So, let’s look at hard_delete and hard_update. These are available only in the Private scope.

Let’s operate on the above result. And we want to send in this tx:

tx {
    "hard_delete" : [ "A" ],
    "hard_update" : [ { "C" : "hard_updated_val" } ],
}

Results of hard_delete and hard_update:

The underlying representation looks like this:

"A": ["Tombstone", "Tombstone"]
"B": ["item_0", "Tombstone"]
"C": ["item_0", "item_1", "item_2", "Tombstone", "hard_updated_val"]

But the current state view, would look like this

"C": "hard_updated_val"

The expected versions are these

"A": 2
"B": 2
"C": 5
Map instance: 10

You can see that with hard_delete, the actual value for A was deleted, which is represented with a Tombstone. Additionally, since we have to increment version, we append yet another Tombstone.
For C, the hard_update meant that its actual value was deleted, and then the new value appended.
B didn’t change since we did nothing there.

Again, these operations are only available in Private scope.


Details

Always write

Also, when implemented, you could specify ExpectedVersion.Any as a parameter, and the tx would always write, regardless of version - given that the keys to delete and update exist, and the one to insert does not.

Tombstone

If the current value of a key is Tombstone, that is interpreted as the key not existing when trying to operate on it. An update or delete on that key would thus fail with KeyDoesNotExist. Similarly, for an insert to succeed, the key must never have existed, or have the current value Tombstone, otherwise it would fail with KeyAlreadyExists.
There is still an API for accessing the key history, both the entirety, and a range, as well as accessing a single entry at a specific version of a key.
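A small sketch of that existence rule, using the same kind of assumed Entry type as in the sketches above (my own naming):

#[derive(Clone, Debug)]
enum Entry {
    Value(String),
    Tombstone,
}

/// A key "exists" only if its latest entry is a value (not absent, not a Tombstone).
fn exists(history: Option<&Vec<Entry>>) -> bool {
    matches!(history.and_then(|h| h.last()), Some(Entry::Value(_)))
}

// Per the rules above:
//   update/delete where exists(..) == false -> KeyDoesNotExist
//   insert        where exists(..) == true  -> KeyAlreadyExists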

Permissions

A Map in Private scope will have a unique permission for hard_erasure (covering both hard_delete | hard_update).
If you want to collaborate smoothly with someone on some shared private data, but not allow hard_erasure, you’d simply not give them that permission.
Even if you did give that permission, you’d always see, by the versions and the Tombstones, that a hard_erasure had happened.

3 Likes

Previous AppendOnlyData was a sequence of versions, and each version was a map of Key/Value pairs. My example was structured like the following:

V0             V1
KeyA ValueA    KeyB New ValueB
KeyB ValueB    KeyC ValueC

(global versions listed horizontally)

In your proposal they are structured the other way around, with a map of keys, and each key is a sequence of versioned values. The same example is structured like this:

       KeyA         KeyB            KeyC
V0     ValueA       ValueB          ValueC
V1     Tombstone    New ValueB

(key versions listed vertically)

If my understanding is correct this means that:

  • Beforehand, each global version was immediate to retrieve, but each version of an individual key was complex to retrieve (though feasible).

  • Now it is the reverse: each version of a key is immediate to retrieve, but a global version is complex to retrieve (and I am not even sure this is feasible).

I guess there are cases where the new structure is better adapted, but here I am talking specifically of previous AppendOnlyData features for which the previous structure seems better adapted.

An example is implementation of directories: With current safe-api we can display the listing of files of a directory at any specific version. I don’t know how you will be able to do the same with new structure.

Concrete example: display version 2, 1 and 0 of a directory:

$ ./safe cat safe://hnyynyiq7dzd63qo3a6hpzeckyzcxkiqzmrt4z5fmnuf13wm3hpj8epdkrbnc?v=2
Files of FilesContainer (version 2) at "safe://hnyynyiq7dzd63qo3a6hpzeckyzcxkiqzmrt4z5fmnuf13wm3hpj8epdkrbnc?v=2":
+-------------------------+------+----------------------+----------------------+-------------------------------------------------------------------+
| Name                    | Size | Created              | Modified             | Link                                                              |
+-------------------------+------+----------------------+----------------------+-------------------------------------------------------------------+
| /dummy.txt              | 13   | 2020-01-25T21:55:13Z | 2020-01-25T21:55:54Z | safe://hbhydydu1ty15s4s1p6iyh9whg8pqy1h3ro3aeiu6j9gbpo9mmp93nbgex |
+-------------------------+------+----------------------+----------------------+-------------------------------------------------------------------+
| /img/safe_logo_blue.svg | 5851 | 2020-01-25T21:51:49Z | 2020-01-25T21:51:49Z | safe://hbwynod35mhrikn8ohpdh5iiu6ohnw6up1zfiqhhkaey98txmoh79haqub |
+-------------------------+------+----------------------+----------------------+-------------------------------------------------------------------+
| /index.html             | 639  | 2020-01-25T21:51:49Z | 2020-01-25T21:51:49Z | safe://hbhybynbpbbdgaee46jbdc8qcsd5dm9igyw8ku77mfukiz7a7y69w8cdmj |
+-------------------------+------+----------------------+----------------------+-------------------------------------------------------------------+
$ ./safe cat safe://hnyynyiq7dzd63qo3a6hpzeckyzcxkiqzmrt4z5fmnuf13wm3hpj8epdkrbnc?v=1
Files of FilesContainer (version 1) at "safe://hnyynyiq7dzd63qo3a6hpzeckyzcxkiqzmrt4z5fmnuf13wm3hpj8epdkrbnc?v=1":
+-------------------------+------+----------------------+----------------------+-------------------------------------------------------------------+
| Name                    | Size | Created              | Modified             | Link                                                              |
+-------------------------+------+----------------------+----------------------+-------------------------------------------------------------------+
| /dummy.txt              | 6    | 2020-01-25T21:55:13Z | 2020-01-25T21:55:13Z | safe://hbhydydw9t33q47dgax19r935say134ox89zwgj1dx7bcmfyp541duqhfj |
+-------------------------+------+----------------------+----------------------+-------------------------------------------------------------------+
| /img/safe_logo_blue.svg | 5851 | 2020-01-25T21:51:49Z | 2020-01-25T21:51:49Z | safe://hbwynod35mhrikn8ohpdh5iiu6ohnw6up1zfiqhhkaey98txmoh79haqub |
+-------------------------+------+----------------------+----------------------+-------------------------------------------------------------------+
| /index.html             | 639  | 2020-01-25T21:51:49Z | 2020-01-25T21:51:49Z | safe://hbhybynbpbbdgaee46jbdc8qcsd5dm9igyw8ku77mfukiz7a7y69w8cdmj |
+-------------------------+------+----------------------+----------------------+-------------------------------------------------------------------+
$ ./safe cat safe://hnyynyiq7dzd63qo3a6hpzeckyzcxkiqzmrt4z5fmnuf13wm3hpj8epdkrbnc?v=0
Files of FilesContainer (version 0) at "safe://hnyynyiq7dzd63qo3a6hpzeckyzcxkiqzmrt4z5fmnuf13wm3hpj8epdkrbnc?v=0":
+-------------------------+------+----------------------+----------------------+-------------------------------------------------------------------+
| Name                    | Size | Created              | Modified             | Link                                                              |
+-------------------------+------+----------------------+----------------------+-------------------------------------------------------------------+
| /img/safe_logo_blue.svg | 5851 | 2020-01-25T21:51:49Z | 2020-01-25T21:51:49Z | safe://hbwynod35mhrikn8ohpdh5iiu6ohnw6up1zfiqhhkaey98txmoh79haqub |
+-------------------------+------+----------------------+----------------------+-------------------------------------------------------------------+
| /index.html             | 639  | 2020-01-25T21:51:49Z | 2020-01-25T21:51:49Z | safe://hbhybynbpbbdgaee46jbdc8qcsd5dm9igyw8ku77mfukiz7a7y69w8cdmj |
+-------------------------+------+----------------------+----------------------+-------------------------------------------------------------------+

And a directory is an AppendOnlyData:

$ ./safe dog safe://hnyynyiq7dzd63qo3a6hpzeckyzcxkiqzmrt4z5fmnuf13wm3hpj8epdkrbnc
Native data type: PublishedSeqAppendOnlyData
Version: 2
Type tag: 1100
XOR name: 0x5dd1dc7ecba19c7b8dba18a05d8f555d75923abecab14cb2cd179e35274346a2
XOR-URL: safe://hnyynyiq7dzd63qo3a6hpzeckyzcxkiqzmrt4z5fmnuf13wm3hpj8epdkrbnc
3 Likes

Aha, yes I see what you mean. (Edit: But your description is not quite correct, and your example will work the same with this RFC, see next post.)
There are ways to do it; off the top of my head, this for example (using a different way of handling the global version than I previously described):
The global version could be handled similarly to the key versions, as a vec, where every update to key(s) appends the keys involved and their new versions.

0: { "A": 0 }
1: { "A": 1, "B":0 }
2: { "C": 0 }
3: { "B": 1, "C": 2, "D": 0 }

To get the versions of all keys as of a specific global version, I’d start at that version and move down towards 0. Every time I see a key for the first time, that version is collected.
The result will be all key versions as of a specific global version.

Very long-lived / active constructs would, after a certain number of versions (10k, 100k? Something like that, I’d guess), benefit from a snapshot.
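A hedged sketch of that walk-back, assuming the global history is a vec where entry i holds the keys touched at global version i and their new key versions (names are mine):

use std::collections::BTreeMap;

type GlobalHistory = Vec<BTreeMap<String, u64>>;

/// All key versions as of global version `at` (inclusive).
fn key_versions_at(history: &GlobalHistory, at: usize) -> BTreeMap<String, u64> {
    let mut result = BTreeMap::new();
    // Walk from `at` down towards 0; the first time a key is seen,
    // that is its most recent version as of `at`.
    for entry in history[..=at].iter().rev() {
        for (key, version) in entry {
            result.entry(key.clone()).or_insert(*version);
        }
    }
    result
}

// With the example history above, key_versions_at(&history, 2)
// gives { "A": 1, "B": 0, "C": 0 }.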

Edit: I’d add though that I’d probably rather see all of this managed in a db on disk, instead of using runtime constructs like this. Using a db would, I guess, be the case when moving away from storing these things in a chunk together with the data (i.e. in one file on disk). I think these things would be more efficiently managed and queried that way. We’ll see when we get there.

1 Like

This is not quite correct @tfa.

I did think there was something suspect about your representation when I saw it, because I knew the underlying AD representation is a single vec of an Entry struct with two fields, key and value, which is not what you describe.

But I was confused because I have not worked with safe-cli and safe-api, and I was on my phone and going to sleep. Now I’ve been at the computer to check this out though.

Summary

  • The functionality you describe is not changed by this RFC (upper layers simply keep storing the FilesMap as described below).

Because:

  • ADs are not structured the way you describe.
  • It’s a vec of key value pairs.
  • The version is the length of the vec minus 1.
  • Every new global version has a single key value pair.
  • When the AD is used as a files container, the value of one of those is a serialized FilesMap (but that’s app usage, not part of the AD structure).
  • The key of that pair is not the keys we talk about in this case (it is not used here).

Details

The cat cmd calls files_container_get with the supplied url (containing the version as a parameter); there the content version and the xorname are extracted from the url. Then get_seq_append_only_data is called with this, and that calls get_adata_range with the version as start and version + 1 as end. That returns a single key and value (from that vec of key-val pairs that I mentioned above).

The files map is then deserialized from this value.

You get the entire files map, as per your example, because it is serialized in its entirety into one single value under a key.

That single value is this:

pub type FilesMap = BTreeMap<String, FileItem>;

And further:

// Each FileItem contains file metadata and the link to the file's ImmutableData XOR-URL
pub type FileItem = BTreeMap<String, String>;

Conclusion

If we’re just replicating the functionality you describe, we’d not use a Map, we’d use a Sequence, and just append new FilesMaps to it. Voilà. Simple.
Because the key in the above example is not used for anything.
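Roughly sketched, under the assumption that a Sequence is essentially an append-only vec (the struct and serialization details here are made up for illustration):

use std::collections::BTreeMap;

pub type FileItem = BTreeMap<String, String>;
pub type FilesMap = BTreeMap<String, FileItem>;

// Each global version of the FilesContainer is one serialized FilesMap
// appended to the Sequence, so `?v=N` becomes a lookup at index N.
struct Sequence {
    entries: Vec<Vec<u8>>,
}

impl Sequence {
    fn append(&mut self, serialized_files_map: Vec<u8>) {
        self.entries.push(serialized_files_map);
    }

    /// The serialized FilesMap as of version `v`
    /// (the whole listing shown by `safe cat ...?v=N`).
    fn at(&self, v: usize) -> Option<&[u8]> {
        self.entries.get(v).map(|e| e.as_slice())
    }
}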

If you want to implement this in a Map instead, then that’s possible as well. But I think this suffices for this specific concern you had.

5 Likes

No, I don’t want that. I just want to know how Maidsafe will reimplement the current features of the safe api.

So, it seems to be OK for FilesContainer, which was implemented with a PublishedSeqAppendOnlyData structure and can be reimplemented as a Sequence.

But the exercise should be done for every data structure used by safe api and the pros and cons of the proposed structures should be carefully assessed.

Another example to explicitly clarify is previous MutableData:

  • MutableData could be used for temporary data. This isn’t the case anymore with the new Map structure because hard_delete and hard_update commands don’t delete previous entries.

  • The new Map structure isn’t well adapted to the previous UnsequencedMutableData, where values are not versioned. Yes, it can manage this degenerate case, but it is a waste of resources for a simple standard map.

All these elements need to be consolidated in a more complete document, especially as:

  • In the past there have been numerous rewrites of the data structures.

  • Yes, the current implementation uses ad hoc structures, but they do their specific job well. As the saying goes: if it isn’t broken, don’t fix it.

Also, some issues you raised could be solved by minor modifications rather than a complete refactoring with still so many unknowns.

One of these small improvements could be your idea of not forcing the caller to pass the expected entry version when mutating a sequenced structure. There is no need to break everything for that. For example, you could change the mutate_entries function of SeqData to accept a general EntryActions instead of a specific SeqEntryActions (in the case of UnseqEntryActions the function would compute the right version of each entry automatically).

2 Likes

It’s a bit drastic to say that it isn’t the case anymore. Map can very much be used for temporary data. The RFC lists that a key history can be deleted. Additionally, the entire structure can be deleted.

It is true that Map is not designed to allow deletion of historical values, and for a reason; we could extend the API with hard_delete_at and hard_update_at for example, but currently that would introduce problems with versioning, and require a change to that design.

If you want to see that change, feel free to suggest a solution. If the versioning issue alone is solved, I think it would be a desirable addition.

Also, it would be great if you wanted to give input on how you would like to amend and adjust things in the RFC, to support things you wish to see. As an RFC, there are many parts that are not fully covered, and you could help out by trying to figure out how the solutions in the RFC could be extended/modified. For example, deleting the key history, should it leave the last seen version? I would think yes, but it is good if everyone can pick things like that up and suggest a way to solve it.

Considering that the versioning is a per-request decision, there isn’t much else to do than to keep track of versions on all these instances. I guess that could be called a waste of resources, but then so could practically everything else that requires resources in order to achieve a goal. We want to achieve the goal of giving the user a per-request choice for concurrency control, and that requires keeping versions.

Well, that view is a bit simplistic I’d say; it leaves a lot of room for interpretation of “broken”.
The RFC explains which things are not good, and how we improve them. There are many levels on which things can be “broken”, introducing different sorts of costs, to developers and to users. It can be runtime complications, it can be code quality, etc. It’s not binary, that either it works or it doesn’t (broken/not broken). All these things are on a continuous spectrum, and there’s always a cost and benefit relation.

The scope of your evaluation seems to not include the things mentioned in the RFC. But these are important aspects, and there’s been a lot of enthusiasm for getting them addressed.


This is a very small portion of the RFC. But regardless; IMO it’s building more confusion on top of an already confusing code base. Now suddenly the sequenced data type can also be not sequenced? This sort of patching is not something I’d do unless really forced to (due to, for example, not having access to the code base, or a big user base that relies on things not changing, etc.).


Again, thanks for your opinions and suggestions @tfa. I’m looking forward to hearing your input on how various things in the RFC that are not yet clear could be made clearer (and it would be great if you have ideas for solutions to specific problems as well, should you find any).

1 Like

Changelog Data Types RFC

2020-01-30

  • Clarified Private | Public scope implications for Map.
  • Added description of possible extensions for deletion scheme in PrivateMap.
2 Likes