RFC 54 - Published and Unpublished DataType

I wonder what the rationale is for tying optimistic concurrency to the data type by dividing the types into Sequenced and Unsequenced.

To me it seems like this could be much simpler and more flexible by skipping that subdivision entirely and just providing an ExpectedVersion parameter. I would say this is standard practice when working with event-sourcing streams.

You can look at my AppendOnlyDb on GitHub, where I’ve implemented it like this.

ExpectedVersion.Any => Always writes.
ExpectedVersion.Specific(u64) => Only writes if version matched.
ExpectedVersion.None => Only writes if empty.

Then we can get rid of some of the cognitive load from the type count explosion, shorten the names, and avoid limiting use cases (I might want a concurrency check sometimes and sometimes not for the same data structure instance; too much is assumed here).
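To make the idea concrete, here is a minimal Rust sketch of how such a parameter could behave. It is only an illustration under my own assumptions: `Stream`, `AppendError`, and the zero-based version numbering are hypothetical names and choices, not part of the RFC or of AppendOnlyDb.

```rust
/// Hypothetical per-call concurrency check; not the RFC's API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExpectedVersion {
    /// Always writes.
    Any,
    /// Only writes if the current version matches.
    Specific(u64),
    /// Only writes if the stream is empty.
    None,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AppendError {
    VersionMismatch {
        expected: ExpectedVersion,
        actual: Option<u64>,
    },
}

/// Minimal in-memory append-only stream (illustrative only).
#[derive(Debug, Default)]
pub struct Stream {
    entries: Vec<Vec<u8>>,
}

impl Stream {
    /// Version of the last entry, or `None` when the stream is empty.
    /// Assumption: versions are zero-based entry indices.
    pub fn version(&self) -> Option<u64> {
        self.entries.len().checked_sub(1).map(|v| v as u64)
    }

    /// Appends `data` if the caller's expectation holds and returns the
    /// version assigned to the new entry.
    pub fn append(
        &mut self,
        data: Vec<u8>,
        expected: ExpectedVersion,
    ) -> Result<u64, AppendError> {
        let actual = self.version();
        let ok = match expected {
            ExpectedVersion::Any => true,
            ExpectedVersion::Specific(v) => actual == Some(v),
            ExpectedVersion::None => actual.is_none(),
        };
        if !ok {
            return Err(AppendError::VersionMismatch { expected, actual });
        }
        self.entries.push(data);
        Ok((self.entries.len() - 1) as u64)
    }
}
```

With this shape the same instance can be written with `ExpectedVersion::Any` in one call and `ExpectedVersion::Specific(n)` in the next, so the choice is made per operation rather than per data type.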

3 Likes

Instead of using Sequenced and Unsequenced as data types, I would much prefer to be able to set the concurrency check on a case-by-case basis, by having it passed in as a parameter.

  • That gives flexibility for the user.
  • Fewer data types.
  • Avoiding these long names.
  • Overall feels cleaner and simpler, yet more powerful.

Since implementation has started and the RFC is still open, I wonder:

Is there an obstacle to doing that, and what are the reasons for not doing it?

1 Like

Final Comment Period

The Published and Unpublished DataType RFC will remain open for the next 10 days to allow any final comments to be made.

Thank you for your contributions! :slightly_smiling_face:

5 Likes

Is there a reason why they can’t be called Public and Private data types? Published and Unpublished is a tongue twister and a difficult naming system to use in a sentence or in whimsical online banter. Written down, the type names become unnecessarily long and unruly.

2 Likes

Or Published and Private?

1 Like

What do you call unpublished data that you share? Private?

Shared data.

Private data is owned by one person/account and not published. When describing the unpublished data type, the RFC describes it as being private. So why not just call it “private data”? Much shorter and sweeter, and self-explanatory.

Same thing for published data. It is described as being public data. So just label it “public data”. Again this is short and sweet and self explanatory.

Data that is unpublished but has multiple owners is shared data. This is how it is described in the RFC. So just call it “shared data”. Short, sweet and self explanatory.

1 Like

Public and Published have the same root. Public is shorter and self explanatory. Published is not self explanatory. Public is the better label/name for the class of data.

Private is more pseudoPrivate though. The vaults can see it, I mean we don’t enforce encryption there. So there is a marketing/PR issue with private, perhaps?

It cannot be successfully published on SAFE though. Public is perhaps simpler; published == public to a great extent. Private/unpublished is a bit more tricky. It is not simple, but we do need to make it simple without making it misleading.

4 Likes

This might be confusing because we already use ‘private data’ for referring to encrypted data which can be published at the same time (e.g.: private/public data in Alpha 2).

2 Likes

Can the distinction between private and secret be useful?

Public
Private (i.e. pseudo-secret)
Secret

2 Likes

I suspect there are 4 things at play here (possibly 3).
Published
Unpublished (but some could be published later; not mutable, only append and static)

Plain/readable
Encrypted (secret)

This is the issue: we can have secret data that is either published or not, and likewise we can have publicly readable data that is published.

The vaults don’t/can’t enforce encryption (ignoring how they store it, etc.; from the user’s perspective they don’t, from the farmer’s perspective they do, but it is the user’s perspective we are discussing), so clients are the ones who decide whether data is secret (encrypted) or not.
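One way to picture the two independent axes is as two separate enums rather than one combined name. This is just a sketch to illustrate the point; `Publication`, `Content`, and `DataDescriptor` are hypothetical names, not types from the RFC.

```rust
/// Axis 1: whether the data is published (illustrative names only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Publication {
    Published,
    Unpublished,
}

/// Axis 2: whether the client encrypted the payload before storing it.
/// Per the discussion above, this is decided by the client, not the vaults.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Content {
    Plain,
    Encrypted,
}

/// A piece of data is described by one value on each axis.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DataDescriptor {
    pub publication: Publication,
    pub content: Content,
}

impl DataDescriptor {
    /// "Secret" here only means the client encrypted the payload.
    pub fn is_secret(&self) -> bool {
        self.content == Content::Encrypted
    }
}

/// Enumerates every combination of the two axes.
pub fn all_combinations() -> Vec<DataDescriptor> {
    let mut out = Vec::new();
    for publication in [Publication::Published, Publication::Unpublished] {
        for content in [Content::Plain, Content::Encrypted] {
            out.push(DataDescriptor { publication, content });
        }
    }
    out
}
```

All four combinations are representable, which is why a single Private/Public label struggles to cover both axes at once.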

4 Likes

Can you elaborate? I have a spreadsheet of my finances, a diary, a file of passwords… stored in my private folder, and the vaults can see it?

If you store that plaintext (from the low-level API) then it is plaintext everywhere. The vaults storing it will have it obfuscated, but in flight the nodes receiving the message will see it, unless you encrypt it yourself. This is where high-level APIs should ensure all data is encrypted, but folk can bypass that in the low-level API easily enough. Vaults don’t really care; it is only data to them and they look after it.

If we ignore data at rest, where vaults will take steps to encrypt on the holders etc. as a farmer protection, then the vaults only see bits and bytes. They cannot enforce encryption, but they can enforce formats (like the reserved types).

Hope that makes sense. The network can be used by a bad app to store data without any encryption, even bypassing self_encryption and so on.

3 Likes

Thanks, I forgot about the difference between low and high level APIs. :grimacing:

1 Like

I agree that ‘published’ and ‘unpublished’ don’t seem right. What about ‘restricted data’ (as in guarded from access)?

2 Likes

This data type is further sub-divided into two categories, Sequenced and Unsequenced. For Sequenced MutableData the client MUST specify the next version number of a value while modifying/deleting keys. Similarly, while modifying the Mutable Data shell (permissions, ownership, etc.), the next version number MUST be passed. For Unsequenced MutableData the client does not have to pass version numbers for keys, but it still MUST pass the next version number while modifying the Mutable Data shell.
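As I read the quoted rule, it could be sketched roughly like this in Rust. This is my own illustration, not code from the RFC: the type and method names are hypothetical, and I am assuming that "next version" means the current version plus one and that entries start at version 0.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Sequenced,
    Unsequenced,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    NoSuchKey,
    MissingVersion,
    InvalidSuccessor { expected: u64 },
}

/// Illustrative stand-in for a Mutable Data instance.
pub struct MutableData {
    kind: Kind,
    shell_version: u64,
    entries: HashMap<String, (Vec<u8>, u64)>,
}

impl MutableData {
    pub fn new(kind: Kind) -> Self {
        Self { kind, shell_version: 0, entries: HashMap::new() }
    }

    /// Inserts a new key at version 0 (assumption).
    pub fn insert(&mut self, key: &str, value: Vec<u8>) {
        self.entries.insert(key.to_string(), (value, 0));
    }

    /// Sequenced: the caller MUST pass the next version of the value.
    /// Unsequenced: `next_version` is ignored.
    pub fn update(
        &mut self,
        key: &str,
        value: Vec<u8>,
        next_version: Option<u64>,
    ) -> Result<(), Error> {
        let kind = self.kind;
        let entry = self.entries.get_mut(key).ok_or(Error::NoSuchKey)?;
        let expected = entry.1 + 1;
        if kind == Kind::Sequenced {
            match next_version {
                None => return Err(Error::MissingVersion),
                Some(v) if v != expected => {
                    return Err(Error::InvalidSuccessor { expected })
                }
                Some(_) => {}
            }
        }
        *entry = (value, expected);
        Ok(())
    }

    /// Shell changes (permissions, ownership, etc.) need the next shell
    /// version for both kinds.
    pub fn mutate_shell(&mut self, next_version: u64) -> Result<(), Error> {
        let expected = self.shell_version + 1;
        if next_version != expected {
            return Err(Error::InvalidSuccessor { expected });
        }
        self.shell_version = expected;
        Ok(())
    }
}
```

Note that the only difference between the two kinds in this sketch is a single branch on the entry path, which is the behaviour a per-call parameter could also express.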

Fundamental to this concept is the version. I wonder why we need to expand the vocabulary by introducing Sequenced and Unsequenced, when the most natural and closely related naming would be Versioned and Unversioned. This is an existing nomenclature that seems suitable.
I believe there is a long-standing pattern in the code base of inventing new words for existing phenomena, or for very closely related concepts. This is not a good habit IMO. It obfuscates the technology, its concepts and principles, for new developers and others. I think stringency and minimalism in the nomenclature are important: keep it small where possible (i.e., if we are talking about versions, use Versioned, not a different word that means the same thing), and keep it close to existing developer nomenclature.

3 Likes

Current structure and nomenclature:

Published / Unpublished

Unpublished

  • Private
    – Encrypted
    – Plain text
  • Shared
    – Encrypted
    – Plain text

Published

– Encrypted
– Plain text



Alternatives (same structure)

Private / Public

Private

  • Secret
    – Encrypted
    – Plain text
  • Shared
    – Encrypted
    – Plain text

Public

– Encrypted
– Plain text


Restricted / Public (1)

Restricted

  • Private
    – Encrypted
    – Plain text
  • Shared
    – Encrypted
    – Plain text

Public

– Encrypted
– Plain text


Restricted / Public (2)

Restricted

  • Secret
    – Encrypted
    – Plain text
  • Shared
    – Encrypted
    – Plain text

Public

– Encrypted
– Plain text

3 Likes

I am not clear on the boundary between low level and high level. I would consider self-encryption as low level, because it is automatically invoked when files are uploaded. The only problem is that small files (< 3KiB) are not encrypted.

I think users should be warned when they are about to upload private small files. This would give them an opportunity to check that these files are not confidential. For example, the dry-run option of the CLI could indicate if a file won’t be encrypted.
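A dry-run check along these lines could be as small as the following Rust sketch. The function names and the message text are hypothetical, and the 3 KiB threshold is simply the figure mentioned above; the real limit in self_encryption may differ.

```rust
/// Threshold taken from the figure mentioned in this post ("< 3KiB").
/// Assumption: files at or above this size are self-encrypted.
const MIN_ENCRYPTABLE_SIZE: u64 = 3 * 1024;

/// Returns true if a file of `size` bytes would be self-encrypted.
pub fn will_be_encrypted(size: u64) -> bool {
    size >= MIN_ENCRYPTABLE_SIZE
}

/// Produces one warning line per file that would be stored unencrypted.
/// `files` is a list of (name, size in bytes) pairs.
pub fn dry_run_warnings(files: &[(&str, u64)]) -> Vec<String> {
    files
        .iter()
        .filter(|(_, size)| !will_be_encrypted(*size))
        .map(|(name, size)| {
            format!(
                "warning: {} ({} bytes) is below {} bytes and would not be encrypted",
                name, size, MIN_ENCRYPTABLE_SIZE
            )
        })
        .collect()
}
```

The CLI would then print these lines during a dry run so the user can review the listed files before uploading.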

2 Likes

I’m also not the biggest fan of Unpublished; however, Restricted seems wrong too, because Published is actually kind of more restricted, as this data can’t be deleted.

Intuitively, I would say the option I like most is splitting Private into Private and Shared, and additionally having Public.