RFC 54 - Published and Unpublished DataType

Discussion topic for RFC 54.

16 Likes

From Unresolved Questions:

It is not clear if the special OwnerGet RPC is worthwhile enough. The data is stored publicly by the vaults so is retrievable by them anyway. The Network doing extra checks and work to make certain data get-able only by the owner(s) (or keys authorised by the owner(s)) for something that is stored in public by the vaults and can be worked around with not much effort (e.g., published on clearnet etc.) seems like an artificial constraint. Also there isn’t much incentive for the vaults to adhere to this behaviour. Could it in fact, over a period of time, become the norm for the vaults to just return data irrespective of the asker (which is the case now) for GETs and earn rewards for doing such work anyway?
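To make the question concrete, here is a minimal Rust sketch of the kind of check an owner-only GET implies on the vault side. Everything in it is hypothetical and for illustration only: `UnpublishedData`, `owner_get`, `GetError` and the one-byte `PublicKey` are made-up names, not the RFC’s or the vaults’ actual API.

```rust
/// Hypothetical stand-in for a public key (real keys are much larger).
type PublicKey = u8;

/// Hypothetical unpublished data item as a vault might hold it.
struct UnpublishedData {
    owners: Vec<PublicKey>,
    authorised: Vec<PublicKey>,
    payload: Vec<u8>,
}

#[derive(Debug, PartialEq)]
enum GetError {
    AccessDenied,
}

/// Return the payload only if the requester is an owner or a key
/// authorised by the owners; otherwise refuse the request.
fn owner_get(data: &UnpublishedData, requester: PublicKey) -> Result<&[u8], GetError> {
    if data.owners.contains(&requester) || data.authorised.contains(&requester) {
        Ok(data.payload.as_slice())
    } else {
        Err(GetError::AccessDenied)
    }
}
```

Note that nothing in this check stops a vault from skipping it and returning `payload` to anyone, which is exactly the incentive concern raised above.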

Yeah this is tricky!

I reckon the check adds a good ‘extra level’ of security which is certainly imperfect but still probably quite robust in practice.

One thing that niggles me is that more complex ownership structures in the future may cause a lot of friction here.

E.g. time-based locks, hierarchical multisig or oracle-based checks may impose a lot of work on the ownership checking.

This might reduce the usefulness of the datatype or introduce attacks on the verifiers.

I don’t like to over-optimise for merely a ‘maybe’ future problem, but this seems worth further reflection.

My current thinking is the ‘protection’ of OWNER-GET is not enough to justify the possible future difficulties it will bring or the additional network load caused by longer routes.

Still gotta ruminate on it for a while.

9 Likes

I’m wondering if the “Unpublished” part of UnpublishedSequencedMutableData and UnpublishedUnsequencedMutableData isn’t redundant.

The distinction between “Unpublished” and “Published” only applies to AppendOnlyData and ImmutableData, as MutableData will never be published. So “Unpublished Mutable Data” is clearly redundant and can confuse people into thinking there is a published version of the type. On the other hand, it is a bit more explicit.

What do people think? Should the redundancy be removed?

10 Likes

Yes, and as Lionel mentioned, ImmutableData is the opposite: always published. I agree with you here @marcin, let’s remove the redundancy and go explicit.

9 Likes

When you get this to work in Maxwell, this will be an amazingly flexible and powerful structure!!
:clap::clap::pray:

2 Likes

Yup!
I’m also having a feeling there might be some simpler scheme to all of these names. Not sure but

UnpublishedAppendOnlyData
PublishedAppendOnlyData
UnpublishedImmutableData
PublishedImmutableData
SequencedMutableData
UnsequencedMutableData

is a lot of text…

I am wondering if it’s worth the descriptive names here. I myself am very keen on descriptive explicit names. But I think these components are so fundamental that they might deserve something shorter like a unique name, where their use and capabilities are implicit knowledge.

For example: within the event sourcing ubiquitous language, a ‘stream’ is an append only sequence of events. This is perfectly well known and as normal to any ‘event sourcer’ (uhm sorcerer… :joy:) as the use of ‘stream’ in disk IO context, or ‘array’ in data structures context.
So, the appendonly-ness of it isn’t needed in the name when sorcerers ( :wink: ) talk to each other. Or in the names in code.

Just spitballing here:

OpenStream
HiddenStream / ClosedStream

OpenData
HiddenData / ClosedData

Set
SequencedSet

(I haven’t managed to bring it all the way home with this example, I’m just exploring here. Maybe someone can pick it up.)

The SAFENetwork is its own context, and data storage within it is its own sub context. The ubiquitous language of this context allows for this approach, I would say.
We would here be saying: Open means it can be accessed, said a bit more directly (‘published’ is a slightly more technical way of saying the same).
Hidden or Closed means it can not be accessed.

Just saying ‘Data’ establishes that Data on the network is immutable. We don’t need to call it immutable - it is the norm.
Calling it stream is also distinguishing that this is not the actual data, it is rather a stream of pointers to data.

With this, we’re saying: We are creating the language here, we are setting the norm. We decide what this will mean, and since this will be widely used, everyone will follow and it will be established [within this context].

Using these long unwieldy names is short-sighted IMO.

OK, so this is not a complete idea/answer, I’m trying to inch over to a perceived possible something - not yet known.

3 Likes

What does hidden mean? Hidden from whom?

I’ll clarify

ClosedData / HiddenData (merely two examples of names) would replace UnpublishedImmutableData.
‘Closed’ means it is not accessible by others, in the same way as unpublished does. ‘Hidden’ is the same thing, just a slightly different word (and I might say: is closer to what it really is).

The difference here is that we are defining new names, which carry the implicit knowledge of use and capabilities, instead of technically describing what it is and does. This, I am arguing, is supported by the fact that these are such fundamental parts, and assuming widespread use (as we surely are) it is quite natural to do so, as it is usually what happens. We load new words with meaning in our created contexts.

Unpublished Data could still be shared with others though, it’s just not accessible to everyone.

Hence differentiating between something that is Public/Published or not.

Hidden could also be shared with others. Just not accessible by everyone by default.

The act of sharing it is in fact ‘showing’ the address (edit: err… the data). (But less technically expressed)

I’d find a list of types without names (e.g. just type 1, 2 etc.) alongside a description of the properties and uses of each useful.

It might make it easier to see what naming schemes are suited because at the moment these descriptive names are obscuring rather than revealing IMO.

1 Like

I can see why sequencing would be necessary for a BTreeMap, which is the underlying structure of Mutable Data, especially if in the future it is to be published, thus transforming it into Append Only.

Why do we need sequencing for Append Only, the underlying data structure being a Vec?
My only thought is that this may have to do with ownership, which can change over time.
Something like, “At data version X, it was owned by owner(s) Y, and the state was Z”?

Anyway, this would be good to add to the RFC.

I.e., if some user stored a PublishedSequencedAppendOnlyData object at a XOR address X and type tag Y, and another user wants to store UnpublishedUnsequencedAppendOnlyData at the same XOR address X and type tag Y, there will be a conflict and an error will be returned.

This seems problematic to me. Correct me if I’m wrong, but users don’t normally have a choice for XOR address, right? The address is derived from the original file contents? I’ve read “What is self-encryption?” on the FAQs page and watched the video but some things are still unclear to me.

For instance, say:

  1. User Y backs up a directory of files as UnpublishedUnsequencedAppendOnlyData. They are doing this for backup purposes and have no intention of publishing public data on the SAFE network. One of the files is jquery-3.4.1.min.js.
  2. User X decides to create a public website available on the SAFE network.
  3. User X wants to upload the asset files as PublishedSequencedAppendOnlyData (for example, they want to upload jquery-3.4.1.min.js).
  4. User X gets an error. I’m not sure what happens next:
    (a) they aren’t informed why they can’t upload their file,
    (b) the SAFE network discloses that the data exists as someone else’s private UnpublishedUnsequencedAppendOnlyData, or
    (c) they are able to use the file. But if (c) is the case, doesn’t User X run the risk of User Y deleting their data?

What is a “type tag”? I can see that it is an unsigned 64 bit integer, but what is it for and what are valid values for it? Is there some documentation that I should have read prior to reading this RFC?

This is only true for Immutable Data. For the published kind, the address is derived from the content; for the unpublished kind, it is derived from the content and the owner. This means that publishing a chunk that was first created as unpublished is possible.

For all other data structures (Mutable Data and Appendable Data) the address is chosen by the user. For those, if a user wants to publish a chunk they first created as unpublished, they will have to either delete it or change its name.
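A rough Rust sketch of the Immutable Data case, to show why the jquery example does not collide. It assumes a generic hash as a stand-in: the standard library’s `DefaultHasher` and the 64-bit `Address` are used only to keep the example self-contained, and `published_address`, `unpublished_address` and `OwnerKey` are hypothetical names, not the network’s real derivation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a network address (real XOR names are 256-bit).
type Address = u64;

/// Hypothetical stand-in for an owner's public key.
type OwnerKey = [u8; 4];

/// Published Immutable Data: the address depends on the content only,
/// so the same file always lands at the same address.
fn published_address(content: &[u8]) -> Address {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// Unpublished Immutable Data: the address depends on the content and
/// the owner, so each owner's private copy lives at its own address.
fn unpublished_address(content: &[u8], owner: &OwnerKey) -> Address {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    owner.hash(&mut hasher);
    hasher.finish()
}
```

With this scheme, User Y’s unpublished copy of a file and User X’s published copy of the same file end up at different addresses, so neither upload blocks the other.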

6 Likes

Hey @bytes, @tfa perfectly answered most of your questions, so I will chime in on this one:

You can think about this as a file extension: you can store two files with the same name but with different extensions, like report.pdf and report.xls - and they may or may not represent the same thing. Same idea with type tags: you can store different MutableData / AppendOnlyData objects under the same name but with different type tags, and there will be no conflicts.

In addition to that, certain type tags can be treated in a special way on the Vaults side. For example, type tag 0 is reserved for user accounts and they are handled as such both on the Clients and Vaults side.
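The file-extension analogy can be sketched as a toy store keyed by the pair (name, type tag). This is only an illustration under assumptions: `ToyVault`, `Kind`, `StoreError` and the 4-byte `XorName` are hypothetical, and the tag values in the usage note are arbitrary.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a 256-bit XOR name.
type XorName = [u8; 4];

/// Which flavour of data sits at an address (subset, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    PublishedSequencedAppendOnly,
    UnpublishedUnsequencedAppendOnly,
}

#[derive(Debug, PartialEq)]
enum StoreError {
    DataExists,
}

/// Toy vault keyed by (name, type tag), like (file name, extension).
#[derive(Default)]
struct ToyVault {
    entries: HashMap<(XorName, u64), Kind>,
}

impl ToyVault {
    /// Store a new object; fail if the (name, type tag) slot is already
    /// taken, regardless of which kind of data occupies it.
    fn put(&mut self, name: XorName, type_tag: u64, kind: Kind) -> Result<(), StoreError> {
        if self.entries.contains_key(&(name, type_tag)) {
            return Err(StoreError::DataExists);
        }
        self.entries.insert((name, type_tag), kind);
        Ok(())
    }
}
```

Storing under the same name with tags 15000 and 15001 succeeds for both, while a second `put` with the same name and tag returns `StoreError::DataExists` even if the kind differs, which matches the conflict described in the RFC excerpt quoted earlier.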

This RFC is largely based on RFC 47 - Mutable Data, so reading that one might help.

9 Likes

I like the implementations, but the naming is clunky to read. IMO the following renaming scheme flows easier:

Published*Data → Public*Data
Unpublished*Data → Shared*Data
Private*Data → Private*Data
*Sequenced*Data → *Versioned*Data
*Unsequenced*Data → *Data

10 Likes

In the safe-nd crate I see this idea has been implemented and that “sequence” has been shortened to “seq” for even shorter names.

All this is OK and Mutable data naming conventions are logical: there are 2 structs (SeqMutableData and UnseqMutableData) and one trait (MutableData) that groups common functions in both structs.

For appendable data this is a little more complex with published/unpublished sub-cases and so there are 4 structs (SeqAppendOnlyData<PubPermissions>, UnseqAppendOnlyData<PubPermissions>, SeqAppendOnlyData<UnpubPermissions> and UnseqAppendOnlyData<UnpubPermissions>)

But for immutable the naming doesn’t follow these conventions, with 2 structs (UnpubImmutableData and ImmutableData) and one enum (Kind) that encapsulates one of these structs.

“Kind” doesn’t mean anything, and the “Pub” prefix seems to be missing from the second struct to me. Following the same logic of the conventions I would have called:

  • the 2 structs: UnpubImmutableData and PubImmutableData
  • the common enum: ImmutableData
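In Rust the proposed naming might look roughly like the sketch below. The struct fields, variant names and helper methods are assumptions made for illustration; they are not taken from the safe-nd crate.

```rust
/// Unpublished variant; the fields are hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct UnpubImmutableData {
    value: Vec<u8>,
    owner: Vec<u8>,
}

/// Published variant, now carrying the "Pub" prefix; the field is hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct PubImmutableData {
    value: Vec<u8>,
}

/// Common enum wrapping either struct, replacing the name `Kind`.
#[derive(Debug, Clone, PartialEq)]
enum ImmutableData {
    Unpub(UnpubImmutableData),
    Pub(PubImmutableData),
}

impl ImmutableData {
    /// True if this chunk is the published flavour.
    fn is_pub(&self) -> bool {
        matches!(self, ImmutableData::Pub(_))
    }

    /// Access the raw bytes regardless of flavour.
    fn value(&self) -> &[u8] {
        match self {
            ImmutableData::Unpub(data) => &data.value,
            ImmutableData::Pub(data) => &data.value,
        }
    }
}
```

This mirrors the Mutable Data convention described above: two prefixed structs plus one unprefixed common type.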

Wouldn’t this be more sensible?

5 Likes

Yes, it would be, I think. Speed means we will miss some of these points, so nice catch @tfa. Let’s ping @ustulation and @nbaksalyar to comment here. Seems sensible to me though.

4 Likes

Yes, we discussed this problem internally and we’ll be making the naming consistent sometime soon. :slight_smile: Thanks for the suggestions!

7 Likes

The following changes have recently been made to this RFC:

  • Requester field has been removed from the Requests.

  • BLS-PublicKey has been replaced with the PublicKey enum.

  • Missing RPCs have been added for AppendOnly Data.

  • Missing index field has been added for AData owners and permissions manipulation.

  • Common response type will be used for mutations.

5 Likes