[RFC] Data Types Refinement

tfa · February 3, 2020, 4:38pm

For this specific feature we should, because extreme cases are the most useful ones:

I see use cases without tombstones:

Databases where providers want to retain history: for example, daily temperatures of some cities.
Databases where providers don’t want to retain history, but they are forced to because the database is public: for example, a vendor might not be happy to see his previous prices available to consumers because the discount percentage is not as high as promised, but he won’t be able to hide them because they are public.

I see use cases with only tombstones before the current value: private databases whose owners don’t need to retain history, which is the majority of simple databases (and this is why I want to retain MDs, because they are efficient at that).

But I fail to see a use case where users would sometimes want to retain history and at other times not (for the same key). What would be the inconvenience of always retaining history in this case? (Bearing in mind that we are talking about private data, because public data cannot be selectively hidden.)

I don’t deny this. Just also allow a map without history implemented in an efficient way.

Debating if MDs can or cannot be used as a private database is a digression. But nevertheless, I will argue:

Standard SQL databases can easily be implemented with MDs: a table is an MD, its primary key is the key and the record is the value. You can even implement foreign keys with MDs: the foreign key of the source table is the key and the vector of primary keys of the target table is the value. Simple and effective, and many SQL databases are implemented like this.
NoSQL databases can also be implemented: a collection is an MD, a document is a value and its id is the key.
And more generally, any set of mutable objects stored in an MD can be viewed as a database.
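
As a minimal sketch of the mapping described above (not the actual MD API): the `Md` and `FkIndex` aliases and both helper functions are hypothetical names, and a plain `BTreeMap` stands in for an MD.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for an MD: opaque keys mapped to opaque values.
type Md = BTreeMap<Vec<u8>, Vec<u8>>;

/// Hypothetical foreign-key index, itself an MD-like map:
/// foreign key -> vector of primary keys.
type FkIndex = BTreeMap<Vec<u8>, Vec<Vec<u8>>>;

/// A table is a map: the primary key is the key, the serialized record is the value.
fn insert_record(table: &mut Md, primary_key: &[u8], record: &[u8]) {
    table.insert(primary_key.to_vec(), record.to_vec());
}

/// Register one more primary key under the given foreign key.
fn index_foreign_key(index: &mut FkIndex, foreign_key: &[u8], primary_key: &[u8]) {
    index
        .entry(foreign_key.to_vec())
        .or_default()
        .push(primary_key.to_vec());
}
```

A NoSQL collection would use the same `Md` shape, with the document id as key and the serialized document as value.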

Problem is not on disk but in memory. Your proposed Vec representation is very efficient:

immediate computation of its version (= length of vector)
immediate access to the ith element
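
To make these two properties concrete, here is a small sketch assuming a dense vector with one slot per write and `None` as a tombstone. The `History` wrapper and its method names are hypothetical, not taken from the RFC.

```rust
/// Hypothetical names, for illustration only.
type Value = Vec<u8>;

/// Dense history of one entry: one slot per write, `None` marking a tombstone.
struct History(Vec<Option<Value>>);

impl History {
    /// The version is simply the number of slots, so it is computed in O(1).
    fn version(&self) -> u64 {
        self.0.len() as u64
    }

    /// Direct access to the i-th element (0-based) in O(1).
    /// Outer `None`: no such slot. Inner `None`: the slot holds a tombstone.
    fn get(&self, i: usize) -> Option<&Option<Value>> {
        self.0.get(i)
    }
}
```

Both operations rely on the vector keeping one slot per write, tombstones included.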

But if you compact elements, then the performance of these operations will be degraded. And you must keep the efficiency in all programming languages:

Rust for vault and API
C#, JavaScript, C, Java … for client apps.

Removing MDs is such a dramatic change that this must be specified in the RFC.

You cannot remove a feature which is already implemented in an efficient way (MDs) without indicating how you reimplement it without losing too much performance.

(but I would prefer that you just keep MDs).

oetyng · February 3, 2020, 6:47pm

I think that’s a very manageable problem to solve. But it has been specifically requested that Private instances be able to keep history. Like a private Git repo.
So there are things to consider here.

Adding a new type? Not very appealing.
Removing the above feature? Not very appealing either.

Fortunately, in this case I can see a specific use case quite clearly:
Scenario: You are working on something that you intend to share with others, or eventually publish. You want to keep a history, but you might want to be able to not share some mistaken input.

It’s easy to imagine that this is just a tiny fraction of similar use cases.

This is a misrepresentation. Talking of “removing MD” is a very alarmist wording.
Truth is that almost everything about MD is kept, feature-wise (that’s what a refactor means).

The single thing you are pointing out that is not kept is the saving of memory for a private instance, where there’s a lot of change but no need to keep history.
Repeatedly calling that “removing MD” and “dramatic change” is, to say the least, inaccurate. But it could be interpreted in worse ways as well.

And you can’t say that it is not specified in the RFC that now there is a history for all keys, and that removing from private data leaves a trail of Tombstones. It is there, plainly, in black and white.
That is the actual change. It is not dramatic, and it is not “removing MD”. Some perspective, please.

But let’s not get bogged down in that. I have quite clear understanding now of what it is that you wish to see.

Summary

Like I said before. The question boils down to the concern for space.

I agree that there are use cases where people do not need to keep history whatsoever, while at the same time they are frequently updating values, and the instance lifetime might be indefinite.

I also think that this is an optimization that is quite manageable to implement, and as such there is no reason why we shouldn’t. But it’s not very critical to focus on that now, IMO.

But to summarize:

There is a need for a Private Map which keeps history.
Arguably there is also need for one that does not keep history.

The problem to solve is how to allow for this without adding more types. It will be a nice problem to solve. In case you start before I do, feel free to just put it here.

tfa · February 3, 2020, 10:10pm

Yes, this is my concern

And now I understand your reluctance to keep MDs: you don’t want to create a second data structure for the map feature.

Then the solution to avoid this is to implement the set of values of an entry with an enum:

either the whole set of values since the creation of the entry
or only the last value with its version

This way there is only one map structure, direct access remains optimized in the first case and space remains optimized in the second case.

Concretely something like:

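use std::collections::BTreeMap;

/// (Defined somewhere else)
/// Ordering traits are derived so that Key can be used as a BTreeMap key
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Key(Vec<u8>);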
pub type Value = Vec<u8>;

/// Optional value
/// None enum value (also called Tombstone) represents:
/// - either a deleted version
/// - or a previous value that is hidden (only possible in private scope)
pub type OptValue = Option<Value>;

/// Versioned value used when only the last one is retained
pub struct VersionedValue {
    /// Optional value
    value: OptValue,
    /// Version of the optional value
    version: u64,
}

/// Values of an entry
pub enum Values {
    /// The whole set of values since the creation of the entry
    Historical(Vec<OptValue>),
    /// Only the last value is retained
    Pruned(VersionedValue),
}

/// Set of entries in a map
pub type Entries = BTreeMap<Key, Values>;

(to be adapted to your needs)

This also means that there are 3 possible commit transactions for an update:

standard commit (keep previous value)
hard commit (replace previous value with Tombstone)
prune commit (remove all previous values and change the enum value to Pruned)

The second and third commit transactions are only available in private scope.
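
A rough sketch of how the three transactions could act on the `Values` enum, under some assumptions not stated in the thread: the version equals the number of writes so far (the length of the vector in the historical case), and on an already pruned entry the standard and hard commits simply replace the value and bump the version. The method names are hypothetical, and scope checks (private vs. public) are left out.

```rust
type Value = Vec<u8>;
type OptValue = Option<Value>;

#[derive(Debug, PartialEq)]
struct VersionedValue {
    value: OptValue,
    version: u64,
}

#[derive(Debug, PartialEq)]
enum Values {
    Historical(Vec<OptValue>),
    Pruned(VersionedValue),
}

impl Values {
    /// Assumption: the version is the number of writes made to this entry.
    fn version(&self) -> u64 {
        match self {
            Values::Historical(history) => history.len() as u64,
            Values::Pruned(last) => last.version,
        }
    }

    /// Standard commit: keep the previous value and append the new one.
    fn standard_commit(&mut self, new: OptValue) {
        match self {
            Values::Historical(history) => history.push(new),
            // Assumption: nothing to keep on a pruned entry, so just replace.
            Values::Pruned(last) => {
                last.value = new;
                last.version += 1;
            }
        }
    }

    /// Hard commit: replace the previous value with a tombstone, then append.
    fn hard_commit(&mut self, new: OptValue) {
        match self {
            Values::Historical(history) => {
                if let Some(previous) = history.last_mut() {
                    *previous = None;
                }
                history.push(new);
            }
            Values::Pruned(last) => {
                last.value = new;
                last.version += 1;
            }
        }
    }

    /// Prune commit: drop all previous values and switch to `Pruned`.
    fn prune_commit(&mut self, new: OptValue) {
        let version = self.version() + 1;
        *self = Values::Pruned(VersionedValue { value: new, version });
    }
}
```

With this convention the version keeps increasing by one per write regardless of which commit is used, so switching an entry to `Pruned` does not reset its version.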

Note: a better name is still to be found for the second one. What about amend commit? This is used in git when we want to modify the last commit message.

oetyng · February 3, 2020, 10:29pm

Great, I like that setup! I will try it out now.

Prune commit is a nice one.

I think hard / soft commit are quite optimal though, as being existing terms within this domain, together with being very natural antonyms (short, distinct and perfectly clear as opposites to people).

What you suggest is standard / amend. I think it doesn’t quite come up to the same level in terms of clarity etc. But, we can muse further upon this.

On a slight tangent, I was thinking about

git reset --hard HEAD~3 # Go back in time, throwing away changes

This could be quite useful as well in Private scope. So, resetting back to a previous value, throwing away all recent values since.

tfa · February 4, 2020, 1:03pm

For that we could add a depth parameter to amend/hard commit which would indicate the number of entries that are erased by a tombstone.

To go back to the past, the user would pass the old value he wants to put back and the right depth to erase all intermediate values (and maybe + 1 if he wants to hide the same value that was previously present).

In my example, he could erase values MyVal7, MyVal8 and MyVal9 by going from:

To:

with an amend/hard commit having value = MyVal6 and depth = 4.

Other interesting properties of depth parameter:

Depth = 1 is equivalent to current hard commit transaction
Depth = 0 is equivalent to current standard commit

So, we can merge standard commit with amend/hard commit and there remain only 2 commit transactions again:

standard one (with depth parameter)
pruned one
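
A minimal sketch of the merged standard commit, assuming that "erased by a tombstone" means the last `depth` slots are overwritten with `None` in place (so the vector length, and hence the version, is preserved) before the new value is appended. The function name `commit` and the clamping of an oversized depth are assumptions.

```rust
type OptValue = Option<Vec<u8>>;

/// Append `new` after replacing the last `depth` entries with tombstones.
/// depth = 0 behaves like the standard commit, depth = 1 like the hard commit.
/// Assumption: a depth larger than the history tombstones every entry.
fn commit(history: &mut Vec<OptValue>, new: OptValue, depth: usize) {
    let start = history.len().saturating_sub(depth);
    for slot in &mut history[start..] {
        *slot = None;
    }
    history.push(new);
}
```

For instance, on a hypothetical history ending in MyVal6, MyVal7, MyVal8, MyVal9, calling `commit` with value MyVal6 and depth 4 would leave four tombstones in those slots followed by a fresh MyVal6.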
