[Pre-RFC] Labelled Data

danda · December 3, 2019, 3:32am

A few thoughts as I skimmed over this thread:

It seems late in the game to be adding such a significant feature that would touch various layers. Maybe for after the MVP?
Manual labelling is very nice for certain apps/domains. Eg, tagging photos (digikam) or emails (gmail) or projects (github) or questions (stackexchange).
It should be opt-in, at least for files. A lot of times people just don’t want to bother when storing/syncing/creating. Yet for an app like stackexchange maybe it makes sense for it to be required.
I suppose the simplest way to integrate with RDF is a triple like:

<resource_url> – ms:label → “tahoe”.

And then we may have other public RDF triples like:
<tahoe_url> – dc:name → “tahoe”.
<tahoe_url> – dc:description → “A lake in the sierras”.
<tahoe_url> – dc:image → <img_url>.

So from that one label string, we can work backward to <tahoe_url> and from there get an image and description of Tahoe.
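A minimal sketch of that backward lookup, with the triples held as plain tuples. The `safe://` URLs stand in for <resource_url>, <tahoe_url> and <img_url> and are made up for illustration, and `ms:label` is just the placeholder predicate from above, not a defined vocabulary.

```python
# Triples as (subject, predicate, object) tuples.
# All URLs here are hypothetical stand-ins for the placeholders in the post.
TRIPLES = [
    ("safe://photos/1.jpg", "ms:label", "tahoe"),
    ("safe://places/tahoe", "dc:name", "tahoe"),
    ("safe://places/tahoe", "dc:description", "A lake in the sierras"),
    ("safe://places/tahoe", "dc:image", "safe://places/tahoe/img"),
]


def subjects_named(triples, name):
    """Return every subject whose dc:name equals the given label string."""
    return [s for s, p, o in triples if p == "dc:name" and o == name]


def describe(triples, subject):
    """Collect all predicate -> object pairs for one subject."""
    return {p: o for s, p, o in triples if s == subject}


def resolve_label(triples, resource):
    """Follow resource -> label string -> named subject -> its properties."""
    labels = [o for s, p, o in triples if s == resource and p == "ms:label"]
    return {
        subject: describe(triples, subject)
        for label in labels
        for subject in subjects_named(triples, label)
    }
```

Note the weak point: the join happens on a bare string, so two different things both named “tahoe” would both match. Pointing the label at <tahoe_url> directly would avoid that.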

Has anyone defined an RDF schema/vocabulary for Safe network usage? In other words, which properties can be applied to which types of resources? Without such, a graph can grow unwieldy quickly — wild and unruly, like the jungle.

Traktion · December 3, 2019, 7:50am

I like this too! Could such a plugin help content to become more searchable too? Given it could add meta data at upload time, it sounds like such plugins could compete to categorize data most appropriately.

krnelson · December 3, 2019, 8:23am

On the topic of app dev usability, disincentivising the use of data silos (aka monolithic apps) and allowing developers to make useful, modular, extensible applications quickly, I recommend a quick read of at least chapter 2 of “The data centric revolution” by Dave McComb:

Don’t let the talk of “enterprise” and “business centric” put you off - the author elsewhere acknowledges that at its heart they are talking about Solid ecosystem (with an enterprise spin) where many app devs operate.

Semantic, Solid and RDF are bandied about often as if they were another high level application framework to tack on later, but taking to heart the low level data-centric design approach required to make the magic happen is a much harder step (at least for “old” devs like me :).

JimCollinson · December 3, 2019, 8:56am

I think there may be a little misunderstanding here about what has driven this proposal, and what it has implications for.

The immediate place your mind goes when you read words like Labels and Tags is the User Interface—which is understandable—and you might be thinking “oh great, so everything I do on SAFE is gonna involve me managing some big tag-cloud now? Do I really have to put labels on everything?”.

But the motivation for this is really about the User Experience as a whole. The UI which is layered on top of all that is a separate thing, and it needn’t have anything to do with labels, chips, tags etc… although it might employ those elements if it fits the task at hand.

Flattening things out like this is intended to avoid the UX being determined by some rigid underlying data structures, rather than the other way around.

JPL · December 3, 2019, 9:27am

That’s some interesting stuff. Maybe drifting slightly off topic, but this is surely relevant (from p5 of the PDF).

It is easy to come up with simple models. People do it all the time. However, usually when people oversimplify, the complexity goes somewhere else. Usually that somewhere else is in application code. Another important resting place for complexity is manual procedures. It is possible (and is actually the norm) to construct procedural workarounds to handle the cases not handled by a system. People often build rule-based systems on top of other systems to handle some of the distinctions omitted from the system.

A little later the authors suggest that an optimum level of complexity for an enterprise system is perhaps 300-500 ‘concepts’ (classes, properties, attributes) because 400 concepts are “within reach of most motivated participants” and can be learned over a weekend. Also most organisations are based around a few hundred concepts. However, many enterprise applications use tens of thousands of concepts, leading to both siloed data and lock-in to that application.

This seems to me (a non-developer) to be a good principle to bear in mind. Hopefully, by using the best of RDF and the like, we can build a platform that is simple enough to use by both devs and end users, but also able to handle the complexity of the real world. I think this will be fundamental to SAFE’s success and definitely worth getting right at an early stage so it can be properly extensible. We have the opportunity to start afresh and avoid the pitfalls of the data silo model - but I’m wondering, how much work will that be?

JimCollinson · December 3, 2019, 9:34am

The thinking behind this concept was actually born out of discussions around designing the UX for app permissions, and how we can make things secure, understandable, but yet relatively low friction. In particular we were asking questions like:

Can we allow an app to freely write the data that it needs to function, without the user having to grant it unnecessary permissions, before knowing if it can be trusted?
How can we let an app view and discover the structure of a user’s data—so files can be found, and selected for instance—without letting it actually read the data?
How do we let a user organise their data in exactly the way they choose, regardless of the app they are using?

Can we let a user be in control of what permissions an app has, and at the same time choose individualised permissions for specific files or folders?
How do we let app devs have freedom to choose the interface, and representation of data, that suits the use case, but yet also have the user structure their data in a way that they understand? And how could we let a user flip between these things at will?
How do we allow data to be in more than one ‘place’, at the same time?
Does the user have a way of easily assimilating who, or what, has access to a given file or folder?
Do they have a way of understanding the same for a given app, or another user?

So, a starting point of apps having containers, with a set of permissions associated, starts off fairly straightforward and understandable, but you very quickly hit its limitations when considering the questions above.

We’re building an MVP at the moment, no question, but we have to raise our gaze a little higher to avoid tying the user (and app devs) up in knots down the line.

krnelson · December 3, 2019, 10:27am

A project with a fair amount of resources tackling very similar questions:

Check down from section “The problem”

Also more over on this thread.

happybeing · December 3, 2019, 10:41am

Very useful questions. A couple that I’m also interested in when I think about app access to my data:

can I monitor what an app is doing, maybe set thresholds or warnings for certain situations?
is it possible to look at the history of what an app has been doing with a particular kind of data?

This applies particularly when the ‘app’ might involve delegating access to others. But even for vanilla apps.

I’m not suggesting this be baked in, more as a way to test what could be built on top of the underlying data management capability. I think monitoring and audit are essential to managing data and access.

joshuef · December 3, 2019, 11:02am

Yeh I think this would definitely need to be customisable in the long run. My thinking about an initial impl was just a simple config file for mime-type maps to build out this idea. We could then expand upon it later.
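To make the mime-type map idea concrete, here is a rough sketch. The mapping, the label names and the `labels_for` helper are all assumptions for illustration; nothing like this exists in the SAFE APIs yet. It guesses the type from the file extension using Python's standard `mimetypes` module.

```python
import mimetypes

# Hypothetical mapping from MIME type prefix to label. In the idea above
# this would live in a user-editable config file rather than in code.
MIME_LABELS = {
    "image/": "photos",
    "audio/": "music",
    "video/": "videos",
    "text/": "documents",
}


def labels_for(filename, mapping=MIME_LABELS):
    """Guess a file's MIME type from its extension and map it to labels."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is None:
        return []  # unknown extension: apply no automatic label
    return [label for prefix, label in mapping.items() if mime.startswith(prefix)]
```

Because the guess is extension-based, a logo or a document scan saved as `.jpg` would still come out as “photos”, which is one reason the mapping would need to be customisable.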

For sure. Indeed it’s pretty minefieldy. The label combo idea is a hack around what we have at the moment (named containers). But there will certainly be a better way of doing this in the long run (or maybe even from the get go… if we do go with this. )

Some of this is required for MVP (ie, some form of indexing of data PUT). So if we’re looking at that it made sense to ask these questions now, before we write APIs for ‘containers’ which may well end up changing / extending under this. I’m loath to be switching up the API if we can avoid it, so if we do want this later, it makes sense to be seeing what we could be doing now (ie, can we get somewhere with minimal effort that means we won’t need to rework APIs later).

It should be opt-out IMO. As noted above and in the OP (I think… if not I need to clarify it), we need this for tracking data applications have PUT, so users can know what apps are up to and what data they actually own. Without this, any PUT that you don’t manually keep track of will be lost as soon as the xorurls are out of memory.

Nothing solid as yet (ahem). We’ve had some vague ideas bandied about, but atm w/ MVP the focus is on building the scaffold, and we can bring in RDF behind the scenes for data later on.

There’s no RDF capability baked into the network yet, but it is planned and wanted for integration.

That said, if someone were to want to spearhead an RDF vocab for the network, I would not be sad about that, nor shy about opining and looking to use it for describing data we’re creating.

Yeh, auto applying RDF to your data based upon some rules you set. Making it easier for any/all apps to get access!

joshuef · December 3, 2019, 11:10am

thanks for the reading @krnelson!

Always good to read more on wasm. Sounds very cool, and as you say in the other thread, something we could well be using down the line for increased web-app security

Diving into the data-centric reading now

happybeing · December 3, 2019, 11:20am

One way to cater for languages is to use a schema to capture the semantics of labels. If a label has a meaning defined by a suitable ontology, this meaning can then be represented (localised) using whatever the best name is in any language rather than by an arbitrary string that can have different meanings according to who translates it.

I don’t know much about ontologies, but expect that file type, data categories and their meaning are pretty well covered, and that some of the people involved with Solid could advise on suitable candidates.

Related to sharing ideas, I’ve posted on the Solid chat and forum inviting the Solid community to take a look at this discussion. They have been grappling with these same issues, so I think it would help both projects to join in on this topic.

loziniak · December 3, 2019, 1:29pm

This RFC has made me think a lot. Lots of questions, and doubts.

What if some app starts to store some photo metadata inside objects serialized into binary MutableData and labels it with “photo”? How will this be useful to other apps? Will they see this data?

Will icons, design projects, document scans, logos, comics also be automatically labelled “photos”? Will it be labelled based on file content, or extension? MIME is available only in Web / Electron, as far as I’m aware. What about other apps (Java, C, FFI + whatever)?

What other labels would be automatically created? What would be a “standard” set of labels?

How do we differentiate labels from folder paths? If we have a folder “abc” labelled “photos” and also a label “abc”, what would “photos/abc/1.jpg” refer to – a file inside the folder or a labelled file?

Why do multi-labels have to be in alphabetical order? They represent the AND operator, and it’s commutative (a AND b == b AND a).

Is it app:safe-cli or app(safe-cli)?

Is it allowed to label two files with the same name with the same label? How would “safe index get photos meInJapan.jpg” work then?

How would you implement automatic mapping of similar labels (mentioned Photo = Foto)? Artificial Intelligence? Dictionaries?

How would multi-label indexes be implemented? If we want “me/photos”, we have to traverse all objects from “me” and for each of them traverse “photos” to find if it’s there, which is highly inefficient. Would multi-indexes be created for all combinations of all labels? Or would apps create the indexes that they need to use? If you uninstall the app, would the index be deleted? What if another app also needs it? Create another structure to track which app uses which index?
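For comparison, one common alternative to both nested traversal and one index per label combination is an inverted index: one set of object ids per label, intersected at query time. This is only a sketch of that general technique, not something the proposal specifies; the `LabelIndex` class and its methods are hypothetical.

```python
from collections import defaultdict


class LabelIndex:
    """One set of object ids per label; multi-label lookups intersect sets."""

    def __init__(self):
        self._by_label = defaultdict(set)

    def add(self, object_id, labels):
        """Record that object_id carries each of the given labels."""
        for label in labels:
            self._by_label[label].add(object_id)

    def find(self, *labels):
        """Return the ids carrying all given labels (order does not matter)."""
        if not labels:
            return set()
        # Start from the smallest set so each intersection stays cheap.
        sets = sorted((self._by_label.get(l, set()) for l in labels), key=len)
        result = set(sets[0])
        for other in sets[1:]:
            result &= other
        return result
```

With this shape, “me/photos” and “photos/me” are the same query, and storage grows with the number of labels per object rather than with the number of label combinations.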

Now it’s possible for an app to pretend to be another app, you just need to connect with a different id, which is just a normal string you can copy inside your code. I think that at some point, apps will access each other’s data just by pretending. If there is a popular photo app called “photo-album”, all photo apps will eventually start putting photos inside this app’s folder. This would work with ZERO effort from MaidSafe team, it’s already implemented in Alpha2. Siloing goes away with that.

I think there are more important problems for the Team to solve, such as immutable, appendable, unpublished etc. data with per-user permissions; this is the basis of the project, and still not ready to use. MaidSafe should focus on things that app devs don’t know how to create. Indexing and labelling is a common problem for app/lib developers, I would leave it to them. Maybe it’s too much to call this feature creep, but my opinion is that you have more important things to think about.

cheers
Maciek

jlpell · December 3, 2019, 1:51pm

Zotero is an academic research tool used for organizing and collating vast numbers of publications. It offers the user a flexible means for organization through the use of labels/tags and a folder structure. The labels/tags are such a useful feature that I couldn’t imagine working without them at this point. I can see a variety of benefits for allowing all data types to have a tag/label field, many of which are listed above.
Good idea @joshuef!

I highly recommend that you take Zotero for a spin. Think of it as a prototype for what you are planning, just not built into the pweb itself yet.

anon34725758 · December 3, 2019, 1:55pm

Hi. I’m a little more familiar with Solid and Linked Data than Safe, but still learning a lot about both. Hope you don’t mind some possibly goofy questions.
How does SAFE use the URL’s in an RDF graph? Does SAFE use CRDT’s? Would the URL’s change when the content changes? Would they be human readable? Thanks in advance!

happybeing · December 3, 2019, 3:21pm

Thanks for joining us, all questions are welcome!

SAFE URI’s can be either human readable or content addressed. The human readable form is very much like a traditional web URI, except that even though the same URI can point to different versions of a file/data, there is an audit trail which means you can always retrieve earlier versions (destinations) of the given URI. This is because a URI is stored in ‘Appendable Only Data’ which means all previous destinations of the URI are available, forever.

So you get both a perpetual web (like the wayback machine) for mutable public data, and for any public immutable data there is a content based address which never changes.
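As a toy illustration of the append-only idea (not the network's actual data type or API; the class and method names are made up), a name can be modelled as a list of targets that only ever grows:

```python
class AppendOnlyName:
    """A name whose target history can only grow, never be rewritten."""

    def __init__(self):
        self._targets = []

    def append(self, target):
        """Point the name at a new target; return the new version number."""
        self._targets.append(target)
        return len(self._targets) - 1

    def resolve(self, version=None):
        """Return the latest target, or the one at a specific version."""
        if not self._targets:
            raise LookupError("name has no targets yet")
        if version is None:
            return self._targets[-1]
        if not 0 <= version < len(self._targets):
            raise LookupError(f"no such version: {version}")
        return self._targets[version]
```

Resolving without a version behaves like a normal web URI, while resolving with a version gives the wayback-style view of earlier destinations.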

anon34725758 · December 3, 2019, 4:31pm

There are 55 posts in this thread, and I’ve only read the first 20 or so, and those not very carefully, so I might have missed something.
Solid uses the Linked Data Platform to organize RDF and non RDF resources, although it only uses basic containers and not direct or indirect containers. Then on top of LDP there will be some validation based on data shapes, which as far as I know still has to be worked out. So they’ve had a while to work out the containment structures in addition to what was done before by others for LDP, as well as what they have yet to do with data shapes. It may be useful to draw on some of that experience. Maybe a comparison of the proposal with LDP would be a good start. Just my impression.

joshuef · December 3, 2019, 4:47pm

thanks for your thoughts @loziniak!

I’d take this as a separate concern to anything to do with labels. Problems here are problems in general, labels or no. There are ideas on this front, but we have yet to focus any of them into an RFC or the like.

This is an issue without labels. If you have a ‘Container’ which is called ‘Photos’ you have the same issue. Apps will have to be aware that they do not own or control data. So they may always be exposed to data they cannot handle. That will be the reality of any system where data is the user’s and not controlled/managed/sanitized by the application necessarily. That’s not exclusive to SAFE nor labels on safe IMO.

As suggested above, auto labelling could be user controlled eventually. And in the short term it could be performed via extensions, for example. All apps would benefit from this as it’d be baked into the SAFE APIs.

What we consider standard… That’s an open question. Totally open to suggestions. But I imagine a simple mapping of extensions to labels to be a starting point.

Multi labels are a hack around the idea of using code from the containers interface for this. So don’t get too hung up on it for the labels concept as a whole. Why alphabetical? Why not? That it could be tricky, as @mav posits, is one reason. Probably there are more. The goal is to make some string (to be used as a key/container name in this hack), and to make labels useful, some/label/i/have and label/some/have/i cannot both be viable. Only one could/should exist. (Otherwise which label do you own etc)
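The point of the ordering is just to get one canonical string per set of labels. A tiny sketch, where `label_key` is a hypothetical helper and “/” as the separator is assumed from the examples above:

```python
def label_key(labels):
    """Build one canonical container key from any ordering of labels."""
    # Deduplicate and drop empty entries, then sort so order never matters.
    cleaned = {label.strip() for label in labels if label.strip()}
    return "/".join(sorted(cleaned))
```

So some/label/i/have and label/some/have/i both collapse to have/i/label/some, and only that one key exists.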

That’s for the hack anyway. (I should probably more clearly label it as such in the OP) We’ve some ideas that may mean we don’t need this, and are currently exploring that. If it seems feasible, then the OP will be updated/expanded (or indeed superseded by a full RFC).

Probably app(safe-cli) as it’s more url friendly. If the former exists that’s my bad and I’ll head up and correct it.

I’d imagine similar to how the filesystem handles it: meInJapan.jpg meInJapan.jpg (1). Open to other suggestions. Same question goes for any index. I’m sure that’s a solved issue though.
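Roughly, the filesystem-style suffixing would look like this. `unique_name` is a made-up helper and the “ (n)” format is only the one from the example above:

```python
def unique_name(name, existing):
    """Return name, or name with a ' (n)' suffix if it is already taken."""
    if name not in existing:
        return name
    n = 1
    # Keep counting until we find a suffix nobody has used yet.
    while f"{name} ({n})" in existing:
        n += 1
    return f"{name} ({n})"
```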

RDF in the end should solve this as @happybeing points out above.

What’s suggested in the OP is one naive option built around current APIs that we could work up. But there are other options. And probably others around container style APIs too. I’m open to more suggestions.

As noted in the OP, there will be many opportunities for improving on efficiency here. This is more about the high level concept, but I welcome any suggestions on improving efficiency for this/any following RFC.

See above re: alphabetising.

If you uninstall an app, is your data deleted? (No. It’s your data, not the app’s)

As for if we should be focussing on this. Well, IMO the more we shine a light on this the more useful it becomes for a raft of things. But also, as stated in the OP, some form of indexing is crucial. So it’s definitely worth exploring before we build out one set of APIs to then chop/change that later.

joshuef · December 3, 2019, 4:48pm

thanks @jlpell! will give it a look now.

mav · December 3, 2019, 10:13pm

Well… not really. It would be ‘laughably simple’ to be able to upload files with no hierarchy, no record in your account, let all the ad-hoc data structures pop up to compensate etc; but leaving out any structure would still not prevent the MVP from working. MVP (to me) is ‘can upload and download files’. Bitcoin wallets had no encryption (!!) for a long time after release and caused serious losses, which to me would seem like an MVP-level feature, but here we are: bitcoin still dominates. The first iphone couldn’t copy/paste. M is for minimum.

Maybe I’m being too backend focused with this opinion. I understand we don’t want to encourage lots of incompatible hierarchies so I don’t hold this opinion very strongly and think this discussion is fruitful regardless of whether it ends up being implemented.

File signatures are sometimes an option, using the magic bytes at the start of the file.
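A small sketch of what that could look like, independent of any platform MIME support. The signature table covers only a few well-known formats and the `sniff` helper is hypothetical:

```python
# Leading "magic bytes" for a few well-known formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def sniff(data):
    """Return a MIME type guessed from the leading bytes, or None."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return None
```

This works the same from any language binding since it only needs the first few bytes of the content, though it still cannot tell a photo from a logo or a document scan.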

danda · December 3, 2019, 10:58pm

Can someone point me towards where MVP requirements are clearly spelled out?

I just read through the roadmap in detail, and I find no mention of MVP there. A quick forum search also did not turn up a definitive document. Are we referencing a ghost?
