[RFC] Data Hierarchy Refinement

Nice work. There is a lot to dissect here after reading through it a few times. The only way for me to give feedback in a coherent manner is to pick through it line by line as a running commentary. Although crude it is the only way I can manage a response in the wee hours of the morning. Here goes:

Good. I would go one step further to require that the “chunk” datastructure is a base unit used in the construction of ALL other datatypes in the hierarchy following an OOP construction by assembly approach, including metadata.
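A rough sketch of what that construction by assembly could look like (type and field names here are purely illustrative, not the RFC’s actual definitions):

```rust
// Illustrative only: every higher-level type is assembled from the same base unit.
struct Chunk {
    bytes: Vec<u8>, // raw content, capped at 1 MiB
}

// Metadata would itself be carried in a chunk...
struct MetadataChunk {
    inner: Chunk,
}

// ...so a stored object is just an assembly of chunks.
struct StoredObject {
    metadata: MetadataChunk,
    content: Vec<Chunk>,
}
```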

Based on this RFC and past discussions in the other thread on Data Types Refinement you have convinced me that the term Blob is an absolutely horrible descriptor for what you are trying to accomplish. More on this below.

It is unclear here if you really mean data “type” vs. an instantiated data “object”.

Just chunk it. Chunk early and chunk often. :cowboy_hat_face:

Interesting insight. I thought that if the local group of 8 did not have enough storage, then the nearest neighbor search radius is expanded to include more than 8 nodes?

I agree that there is an opportunity for improved deduplication here.

I like where this is going. Viewing SAFE as a big hard drive in the sky with analogous operations/functions to a common tried and true filesystem like ext4 or xfs will help speed development IMO since you already have a stable, working and well documented model of what you are trying to accomplish at a grander scale.

If not careful with the definitions this could lead to some circular dependencies since the metadata needs to be stored somewhere too. Consider as an example the EXT4 filesystem where we have data blocks and metadata blocks. Regardless of data block type (meta vs. actual) they are all stored in fixed block sizes on disk (typically 4kiB to match hardware sector size). I view the EXT4 block on disk to be analogous to a SAFE chunk. This indicates that your metadata should ultimately be stored as a “chunk” too if you want to keep the logical consistency and benefits of assembling a well defined object hierarchy.

Specific comments about terminology:

Nice. I like the differentiation here. This could also be generalized to N layers extending from core to boundary. A few synonyms that evoke different imagery for the case of N = 3:

  • “Gateway Nodes”, “System Nodes”, “Kernel/Core Nodes” for a computer reference.
  • “Exterior Nodes”, “Boundary Nodes”, “Interior Nodes” for a spatial reference.
  • “Frontier Nodes”, “Border Nodes”, “Control Nodes” for a geographical/political reference.
  • “Peripheral Nodes”, “Passing Nodes”, “Principal Nodes” for a roles reference.

The term Shell is not used appropriately here, and also later in the document. In computing, the term ‘shell’ is synonymous with a user interface that allows access to an operating system, its programs and services. It is confusing to equate shell terminology with pure data and datatype constructs unless you are specifically building toward a shell program like the SAFE CLI. I do like the simple and self-explanatory definition offered by Client Nodes.

Programming wise, if chunks form the base object in an OOP hierarchy from which other types are assembled, then your metadata should also be stored as chunks. This means that all nodes would store and retrieve chunks, but nodes dedicated to dealing with metadata would store and retrieve metadata chunks, and nodes dedicated to data would store and retrieve data chunks. For this reason I would recommend using the terms Data and Meta Data. To maintain continuity with previously employed terminology I suggest using Data Vaults and Meta Vaults here.

You seem to really like this term, but it is not a good designation for what you are trying to achieve here. This is made more evident by the picture you drew below. Really what you are designating by your Blob and Sequence is an Unordered Set vs. an Ordered Set. The mathematical definition of an Ordered Set is essentially a Sequence (when duplicate entries are allowed), so you’ve got that one. Blob, on the other hand, evokes no intuition of a Set. So why not just keep it simple and call it a Set. You could also extend the Set terminology to a Collection where duplicate entries are allowed.

Everywhere else in the computing world this is called a File. The use of Files in the network is OK. KISS. “Everything is a File.” And like I said earlier, shell is usually reserved for user interaction with programs/services. Later on when SAFE has a computation layer, I could see “ShellNodes” as being the perfect description of an interface to this layer. These future ShellNodes would handle the running of SAFE programs and processes as part of a general SafeOS.

ClientNodes, FileNodes, DataVaults, MetaVaults.

I would be happier if you replaced the term “Shell” with “File” in this large section.

Seems inconsistent to have specialized ChunkSet and chunk_map types. Wouldn’t it be preferable to build these higher level types from lower level ones? That way a chunk_map is instead Map&lt;Chunk&gt;.
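For example, a single generic map could be reused so the chunk-specific type just falls out of it (XorName and the other names below are assumptions, not the RFC’s types):

```rust
use std::collections::BTreeMap;

// Assumed address type; the real network address type would be used instead.
type XorName = [u8; 32];
struct Chunk(Vec<u8>);

// One generic map, reused for everything...
type Map<T> = BTreeMap<XorName, T>;

// ...so chunk_map is just Map<Chunk> rather than a bespoke type.
type ChunkMap = Map<Chunk>;
```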

This is a problem. Nothing in this world should ever have the indignity of being blobified at this high level of abstraction. I suspect that it would only get too large due to the owner history and permission history. Better to change these constructs from a Vec to a Sequence so that all of the #blobification can happen under the hood.

Perfectly logical and follows standard filesystem practice. However, is there a chance for a security exploit here where the refcount could be decremented maliciously?

You forgot one of the best features that is possible with this approach. The chunks can be encrypted again by a subtype of Gateway nodes prior to being sent to a Data Vault (your ChunkNode) and decrypted when retrieved from the vault. These keys would only be known by the Gateway layer, not the Client nor the System Layers.
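Something along these lines, where the cipher trait and node struct are purely hypothetical, would keep the extra key material within the Gateway layer only:

```rust
// Hypothetical cipher trait; the key material behind it is known only to the Gateway layer.
trait VaultCipher {
    fn seal(&self, plaintext: &[u8]) -> Vec<u8>;
    fn open(&self, ciphertext: &[u8]) -> Vec<u8>;
}

struct GatewayNode<C: VaultCipher> {
    cipher: C,
}

impl<C: VaultCipher> GatewayNode<C> {
    // Encrypt a chunk before forwarding it to a Data Vault (ChunkNode).
    fn to_vault(&self, chunk: &[u8]) -> Vec<u8> {
        self.cipher.seal(chunk)
    }

    // Decrypt a chunk retrieved from a Data Vault before returning it inward.
    fn from_vault(&self, stored: &[u8]) -> Vec<u8> {
        self.cipher.open(stored)
    }
}
```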

All nice to see. I also like your data flow diagrams. They really help make things easy to understand. A lot of possibilities here.

12 Likes

Fantastic feedback, @jlpell. Much appreciated :slight_smile:
I’ll be brewing on a response.

7 Likes

I need to correct myself here. Client node is actually ambiguous, since that best describes the client/user computer that is actually connecting to the SAFE Network. So I would instead propose a rename of Client node to Shell node. Shell node is better nomenclature here since, if I understand correctly, these are essentially the network interface that handles client requests and does input/output to the clients.

Another correction is needed here. Given the proposed rename above, these would be more appropriately called Process Nodes or Compute Nodes.

In summary, you end up with Shell Nodes, Process Nodes, and Data Nodes/Vaults. An intermediate File Node, aka Meta Vault, to deal with your metadata needs might also be important to complete the picture and form an analogy to a general purpose computer system with typical levels of pointer indirection. After all, a general world computer is the end goal, is it not?

6 Likes

Yes, the internet cables are simply the connections on the giant motherboard. :slight_smile:


Where we differ on the view of chunks is that I see all chunks as write-once immutable objects.
You can write anything between 1 KiB and 1 MiB, but once written it cannot be changed.
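A minimal sketch of that write-once contract, assuming the 1 KiB–1 MiB bounds above (names illustrative):

```rust
// Write-once chunk: size-checked when created, never mutable afterwards.
const MIN_CHUNK: usize = 1024;        // 1 KiB
const MAX_CHUNK: usize = 1024 * 1024; // 1 MiB

pub struct Chunk {
    bytes: Vec<u8>,
}

impl Chunk {
    pub fn new(bytes: Vec<u8>) -> Result<Self, &'static str> {
        if bytes.len() < MIN_CHUNK || bytes.len() > MAX_CHUNK {
            return Err("a chunk must be between 1 KiB and 1 MiB");
        }
        Ok(Chunk { bytes })
    }

    // Read-only access; there is deliberately no way to change the bytes.
    pub fn bytes(&self) -> &[u8] {
        &self.bytes
    }
}
```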

Metadata, since it is changing (you edit name, you change scope, modify permissions), would not go into chunks for that reason.

It should say “an instance of a data type” :+1:

Not AFAIK.

The metadata is stored on the ShellNodes. It’s a different structure on a different type of nodes. I don’t see that circular dependency that you mention.

Well, the intention is to keep things that change - metadata - separate from things that don’t (and can’t) change - chunks. The point is to separate them, so I don’t see why they should be made chunks with that in mind.
Metadata is a thin layer on top of the bulk of the storage that is chunks. It describes a set of chunks, and who can access them and what basic structure they are part of.
This thin layer can easily be handled and stored by the ShellNodes.

Chunks can only be accessed through this thin layer. So making the thin layer a chunk, short circuits the logic, since you have to access a chunk to get the metadata that is supposed to be the layer hindering access to chunks…

I’ve always preferred to use computer references when building IT stuff. I know all (or most) of them were once references to something out of our natural world. But that was before the new world existed. Now that it does exist, I very much prefer to use that language, and invent/bring in new terms only when absolutely necessary (i.e. there is no real equivalent in the domain, so you’ve invented something).

“Gateway Nodes”, “System Nodes”, “Kernel/Core Nodes” for a computer reference.

So, these I consider the best.

I agree. I basically hitch hiked on an existing use of the term today in MD and AD; it returns all except the data (so the permissions and owners :slight_smile: ). It is basically the same things that we already have, but organised a bit differently.
But again, there are better names than shell for this, I completely agree.

Not if you can only access chunks through metadata, which is one of the major features in this design. How are you going to load that metadata, when it is stored in a chunk?

Sure, you could say that they are stored in metadata chunks, but to me that only confuses things. Which chunk was it now? Better with a clear distinction. The metadata is information about chunks. It has another structure, another purpose, another lifetime. I think that is best modeled by letting it be something other than a chunk. It gets too goopy/muddy to try to fit everything into that one concept. Better to have them clearly defined and separate IMO.

I’m not modelling a set here though. I’m modelling a blob. (The picture isn’t great, I can see how you would get the wrong impression, so I will definitely update it.)
Also I think you look at that picture from the wrong side. The names of the structure are for those accessing it from the left (clients), and it is a layer of abstraction over the right (the sea of chunks).
When you have a binary large object, a bunch of data in no special structure, it is stored to the network as a Blob, and the result is that there is a single pointer (no set of pointers…).
A pointer is meant to point to data. It is not meant to point to a chunk.

A pointer is either of these three:

  • An address to a Shell instance.
  • An address to a single chunk.
  • A map to chunks.

Because your data could fit in one chunk, or it could be split into many chunks, or it could be put in another structure described by a Shell.

So, the pointer is to your data regardless of those underlying details.

  • So, when you have a Blob, you will have a single pointer.
  • When you have a Map, you will have pairs of keys and pointers.
  • With a Sequence, you will have an ordered set of pointers.

And I think this can be made clearer both in the picture and in the text.
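If it helps, here is roughly how I picture it in code; the names and the address type are just placeholders, not the RFC definitions:

```rust
use std::collections::BTreeMap;

type XorName = [u8; 32]; // placeholder for the network address type

enum Pointer {
    Shell(XorName),         // an address to a Shell instance
    Chunk(XorName),         // an address to a single chunk
    ChunkMap(Vec<XorName>), // a map to several chunks
}

// The three structures differ only in how they hold pointers:
struct Blob(Pointer);                   // a single pointer to the data
struct Map(BTreeMap<Vec<u8>, Pointer>); // key -> pointer pairs
struct Sequence(Vec<Pointer>);          // an ordered set of pointers
```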

In a way I like it better than Shell. But I think it comes with its own problems. Connotations that don’t quite apply.
For example, you could store a file to the network, and you could store it “freely floating” as a Blob, or in a special structure, such as in a Map, with a key to reference it. Not sure I think that thing is best described as File.

The Shell is an abstraction or a container, depending on how we see it, over/holding our metadata.
Our metadata, I would say, is of three types (out of the 5 described here):

  • Descriptive metadata - descriptive information about a resource. It is used for discovery and identification. It includes elements such as title, abstract, author, and keywords.
  • Structural metadata - metadata about containers of data and indicates how compound objects are put together, for example, how pages are ordered to form chapters. It describes the types, versions, relationships and other characteristics of digital materials.
  • Administrative metadata - information to help manage a resource, like resource type, permissions, and when and how it was created.

I’m not sure I think this abstraction or container is best described as a File, since it will collide with other notions of File, that are definitely not the same thing (such as the things you have on your local drive, that perhaps end up in a Map).

Metadata could be used, but I think the abstraction / container is more than that (also, it is not a very sleek name). The metadata is what you find within it. And moreover, this abstraction / container is actually the only way you can access the actual data. So it is sort of a gateway, or a shell, to the data, holding all necessary information (i.e. the metadata) to fulfill that role.
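Very roughly, the Shell could then be pictured as a container over those three kinds of metadata (field names below are only illustrative):

```rust
// The Shell as a container over the three kinds of metadata; names are placeholders.
struct Descriptive {
    name: String,
    keywords: Vec<String>,
}

struct Structural {
    // how the parts are put together, e.g. ordered chunk addresses
    layout: Vec<[u8; 32]>,
}

struct Administrative {
    owners: Vec<[u8; 32]>,              // owner public keys
    permissions: Vec<([u8; 32], bool)>, // (public key, can-write) - simplified
    created_at: u64,
}

struct Shell {
    descriptive: Descriptive,
    structural: Structural,
    administrative: Administrative,
}
```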

I wrote this in a reply to happybeing, above, maybe you missed it:

Blobifying is basically about spreading large data as chunks, evenly over the network. Given that the Shell can grow indefinitely, there should be a way to spread that out as well, in a smart way with good operational complexity. I think there could be better ways than blobifying, and have planned to iron that out.

No, the actual structure, when Map or Sequence, can also grow. For every Key:Pointer pair you insert to the Map underlying structure, it grows. For every Pointer you append to a Sequence underlying structure, it grows. A Blob, only holding a single Pointer, would only grow by permissions and owner changes.

Not sure what you mean. Vec (the way mentioned in the RFC) is the Rust data type we implement things with. Sequence is a concept in SAFENetwork.

Well, not more than a security exploit on anything else in the network (accessing someone else’s data, stealing coins or whatever). It all depends on the same security (Elders, node age, consensus etc.). As long as that holds, then only those with correct permissions can do these things. The chunks are only accessed through the Shells; if you don’t have the permissions in there, you cannot access the chunk. When the ShellNodes have verified the request, they contact the ChunkNodes and ask them to send you the chunks. You can’t request the chunks in any other way. So if in the Shell metadata there is no public key with permissions corresponding to your request, then you are not going to get to the chunks.
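In rough code terms (all names hypothetical, only the flow matters), the access path looks something like:

```rust
// All names hypothetical; only the flow matters.
type PublicKey = [u8; 32];
type XorName = [u8; 32];

struct Shell {
    readers: Vec<PublicKey>, // keys with read permission, from the Shell metadata
    chunk_map: Vec<XorName>, // the chunks this Shell guards access to
}

// Run by the ShellNodes: verify the request against the metadata first.
fn handle_read<'a>(shell: &'a Shell, requester: &PublicKey) -> Result<&'a [XorName], &'static str> {
    if shell.readers.contains(requester) {
        // Only now would the ShellNodes ask the ChunkNodes to send the chunks.
        Ok(&shell.chunk_map)
    } else {
        Err("no key with matching permissions in the Shell metadata")
    }
}
```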

This is a cool one. But what problem does it solve?


Looking forward to hearing your ideas :slight_smile:

8 Likes

@oetyng, replying here rather than from the dev forum, but in relation to the conversation there about the decoupling of metadata with the Shell layer: what would happen with metadata embedded in files, thinking exif or similar here? I’m assuming deduplication wouldn’t be effective in these cases, or can the chunking process be somehow ‘tailored’ for these types of files (which seems very complicated to me, but throwing it out there…)?

2 Likes

Shells and metadata

The metadata we talk about here in the context of Shells in this RFC, is very specifically for describing how your uploaded data is organized within SAFENetwork. That is something completely different from exif for example.

Details

So the metadata we’re talking about in the RFC is mostly orthogonal to any file metadata (or any other metadata you can conceive of out in the world). I say mostly, because you can still give it a name, and you can represent a structure that might have some correlation with whatever metadata you’re bringing in from your environment. But that is completely an app developer design decision by then.

Same goes with the metadata in a photo. Your app could just store that photo file to the network as is. The app could also read the metadata out, and store it in some structure, so that it’s easy to read only that info, through traversing the Shells in the network.

If I were to write an app handling photos on SAFENetwork, I would let it extract the exif data, then it would upload the photos into some structure of choice, and the exif data into another structure.
This means that when the app is browsing and searching the photos, it would traverse the structures created for the exif data, and when accessing the actual photos, it would then traverse the structures holding that.
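Sketched as code (the helpers below are placeholders, not an actual API):

```rust
// Purely illustrative; the store_* and extract_* helpers stand in for whatever
// client API the app would actually use.
struct PhotoApp;

impl PhotoApp {
    fn upload(&self, photo_bytes: &[u8]) {
        // 1. Pull the exif out and store it in a structure that is cheap to browse.
        let exif = self.extract_exif(photo_bytes);
        self.store_in_exif_index(exif);
        // 2. Store the photo itself in a separate structure.
        self.store_photo(photo_bytes);
    }

    fn extract_exif(&self, _bytes: &[u8]) -> Vec<(String, String)> {
        Vec::new() // placeholder
    }
    fn store_in_exif_index(&self, _exif: Vec<(String, String)>) {}
    fn store_photo(&self, _bytes: &[u8]) {}
}
```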

Additional info

These lowest level structures of the network can of course be exposed directly in apps, and early on there might be many apps that reflect these structures to the user in some ways.
But as these things progress, I think we would just use a virtual file system for example, that looks and feels just like the file system on your computer today, and you would have no idea that there are things like Shells, or Blobs or Maps or Sequences, (and some day, perhaps not even a SAFENetwork…).

Conclusion

The metadata we talk about in the RFC is about the smallest parts of the storage structure in the network, and how apps want to organize them.
You can use that storage structure to hold any sort of data, such as photos or exif data.

Related questions

Deduplication

With regards to deduplication, that is a bonus effect which applies where possible. So if you have a single app that uploads your photos, it would deduplicate, as it would use the same algo for treating your file (assuming it only uses one). Another app might not. And naturally, unless you send your photos around, the likelihood that someone uploads the exact same photo to the network is extremely low.

6 Likes

Been reading through this some more and have been struck with a few ‘eureka’ moments. Will need to share those another time. A few quick responses/comments below.

Yes true. What is really required is a third object to form a suitable basis for both.

Perhaps they should be the same thing and offer a uniform interface…

Maybe it shouldn’t grow indefinitely… just have a very large max size that you can quantify and make design decisions with.

A typical/standard Vec lives in volatile memory only. Eventually you will need to serialize this to disk for persistent storage. I figured that reuse of a SAFE datastructure could handle this for you automagically.

I recall long conversations past with @neo about reference counting and data deletion. The consensus view was that this was very cumbersome and inefficient. From experience I know that it has some serious performance penalties on local disk operations in Linux when you have lots of hard links to the same file. Do the chunks really need to be deleted? I’m not so sure. The previous standard method of letting a user delete the metadata to a private chunk (aka the “datamap”) but leaving the chunk on the network as garbage is probably fine. Copy on write is as safe as it gets. I know there was some pushback from the community when dirvine asked about this. To some extent I think dirvine was too accommodating in trying to keep the “hive” all happy, and that his first intuition about append only and copy on write is a preferable strategy. This was more of a concern when Safecoin was a unique data object and not a section balance. With the balance method, those concerns are likely unfounded.

It solves the “obfuscation at vaults” mentioned here.

1 Like

Nice! Looking forward to hearing them :slight_smile:

But they have very different interfaces. The only thing that can yield is the Shell model, and it would seem completely contrived to force it into that shape (looking at the file descriptor interface, or what did you have in mind?).
A File mapping use case is rather supposed to be built on top of it, I would say.

Well, strictly it’s not the Shell that is growing indefinitely, but the lifetime of people’s usage.
And in fact, in this proposal, a max size is suggested, which is quantified and used for design decisions (such as how to spread it out and access it in the network).
The main purpose is to solve the use case, which is indefinite lifetime.
Not sure what concrete aims or issues you have in mind.

Ah, yes. Yeah, so for the growing Shell, as I said I have other plans than the blobification, similar to the ever expanding database I posted about on this forum way back. That one builds a tree out of MDs, with O(log n) access time.
So, basically it’s about reusing Map and/or Sequence, like you suggest.

It was certainly not this model that was discussed.
There is a u64 associated with a private chunk, and it is simply incremented on adds and decremented on deletes. There is no other difference to not ref counting.

There is nothing resembling cumbersome or inefficient there, I must say.
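A minimal sketch of that bookkeeping, just to show how little there is to it (the types are placeholders):

```rust
use std::collections::HashMap;

type XorName = [u8; 32]; // placeholder address type

struct ChunkStore {
    refcounts: HashMap<XorName, u64>,
}

impl ChunkStore {
    // On a (re)upload of a private chunk: bump the count.
    fn add(&mut self, name: XorName) {
        *self.refcounts.entry(name).or_insert(0) += 1;
    }

    // On a delete: decrement, and drop the chunk when it reaches zero.
    fn delete(&mut self, name: XorName) {
        if let Some(count) = self.refcounts.get_mut(&name) {
            *count -= 1;
            if *count == 0 {
                self.refcounts.remove(&name);
                // ...the stored chunk bytes would be removed here as well.
            }
        }
    }
}
```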

Completely different thing. The only contention we’d see here is that of deduplication, which might mean many requests to a few nodes for very popular chunks. But that is orthogonal to ref counting, it comes with deduplication. Additionally that’s what caching is supposed to solve.

Considering that this proposal adds a new feature that didn’t exist back then - that chunks are accessed through Shells only - deleting chunkmap is perfectly fine for hindering any rubber-hose, hacked or other illicit access.

But that wasn’t the only argument.

With the very low overhead of refcounting, and the probably quite low rate of occurrence (where naturally the work required is proportional to the number of chunks being deleted, but still a perfectly distributed computation), as well as any potential benefit of, at least in theory, storing less junk, it is a very low hanging fruit to enable it, if only for the ability to let people feel that they can. People are not completely rational, and it’s a lot easier to just be able to say “Yes” when they wonder if they can delete their data. That alone might be orders of magnitude more important for the future of the network than any technical detail.

I did not perceive that discussion to have much to do with safecoin at all actually. It was mostly about key rotation, rubber hose and people’s desire to know that they could delete, as far as I remember.

Ah, this, yeah that’s nice. It should be completely redundant for anything self encrypted, but for data less than 3kb, which currently wouldn’t be self encrypted, it would give protection at rest, even if clients circumvented any client side encryption. (I would think that if they did, then they probably knew the consequences as well, but yep, it would still be the bad apps of course.)

3 Likes

My memory isn’t clear, but the issue mentioned might have been with regard to the operations and checks needed to perform the delete when your refcount gets to zero.

You are not wrong.

3 Likes

May I suggest:

  • Triplet, whose structure is a set of tuples, each consisting of 3 keys, possibly representing two entities and their relationship.

I do recognize this could be stored as a number of Map entries but it may make sense to have it as a separate data type since it would be useful for RDF data and graphs in general and it also doesn’t add much complexity since each entry has the same structure and size.
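A possible shape for it, as a rough sketch (fixed-size keys assumed, names illustrative):

```rust
use std::collections::HashSet;

type Key = [u8; 32]; // fixed-size key, so every entry has the same size

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Triple {
    subject: Key,
    predicate: Key,
    object: Key,
}

// The data type itself is then just a set of these fixed-size entries.
type TripleStore = HashSet<Triple>;
```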

8 Likes

Awesome @JoeSmithJr
As you already saw there, that’s exactly what this is designed to support; extending with new types.

I haven’t thought specifically about RDF (it’s an area I’ve not gone into yet).
I would have to think about how well it would play out implemented this way. @joshuef, @danda, @nbaksalyar and @bochaco might have ideas.

4 Likes

yeah, basically it sounds like a great idea to me. RDF is built of triples, so having a native data store should make RDF ops faster and more space efficient, as compared to XML or other text serializations.

3 Likes

I think having a Triples data type (Shell) could be the ultimate answer to natively supporting RDF, i.e. having effectively a native triplestore. The first thing that comes to my mind is that we avoid the serialisation, and probably the inefficiencies introduced by it.

Nevertheless, some more thoughts/analysis would be needed just to understand how we deal with mutations and their history. It sounds like it could be similar to a Map, but with the object as the key and the value as a two-dimensional array for the predicate and subject; that way we can handle the mutation history of triples, and presumably then have the private and public (and perpetual) flavours of it?.. Not fully sure all this makes sense, but just some thoughts.

I was refreshing my memory and looking up some old research I was doing in relation to this, also playing around with trees, and luckily I found some diagrams I had made. The first one shows how you add pointers to what we are now calling a Shell:

…and a second diagram shows how you build up the tree with the older versions of Shells when they grow, making them immutable chunks (or blobified, as it’s being called here). I just thought it’s worth sharing as it still fits in this discussion:

Back then I was also thinking that instead of serialising the Shell to make it immutable, we can simply have ImmutableShells. That way we can still make use of whatever optimisations are available for accessing the Shell’s entries, while they can still be immutable and even stored at the location calculated from their content, although I guess you’d anyway need to serialise it for calculating the hash.

5 Likes

Agreeing with @bochaco, @danda and @JoeSmithJr here.

Native triple storage will certainly make RDF more ‘first class’ (though we still have to get APIs to deal with that and probably offer some decent serialisation options to folk). But in terms of raw data storage, this is definitely ‘nice-to-have’ :+1:


Edit: Hmmm, thinking on this further (:coffee:). If all the RDF data is triples, we lose some efficiency in terms of dereferencing entries. Or we’re allowing nested triples… but then how to easily reference those nested triplets…

Perhaps this is the reason that a lot of triple storage is done via ‘documents’ (ttl or jsonld), which allows this flexibility.

6 Likes

@bochaco I think native triple storage would have significant benefits for providing an RDF API. For example, typically you can ‘select’ triples with patterns, using a term to match against each of subject, predicate, object. There are also standards for RDF datatypes and APIs in JavaScript, such as RDF/JS (as I’m sure you know). So these could be a part of the API for a triple datatype and eliminate the need to deserialise the data in order to perform the ‘query’ (as you note). Maybe this is exactly what you were thinking, but no harm spelling it out!
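As a rough sketch of that kind of pattern ‘select’ (purely illustrative, not the RDF/JS API):

```rust
#[derive(Clone, PartialEq)]
struct Triple {
    subject: String,
    predicate: String,
    object: String,
}

// Each position of the pattern is either a concrete term or a wildcard.
enum Term {
    Is(String),
    Any,
}

fn matches(term: &Term, value: &str) -> bool {
    match term {
        Term::Is(v) => v == value,
        Term::Any => true,
    }
}

// 'Select' all triples matching the (subject, predicate, object) pattern.
fn select<'a>(store: &'a [Triple], s: &Term, p: &Term, o: &Term) -> Vec<&'a Triple> {
    store
        .iter()
        .filter(|t| matches(s, &t.subject) && matches(p, &t.predicate) && matches(o, &t.object))
        .collect()
}
```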

It may be difficult to do with a more compact form of RDF storage (I forget the name). I suspect not, but it needs to be considered.

3 Likes

I think this is what you were referring to, http://www.rdfhdt.org/technical-specification/, which I haven’t dug into myself, but it seems it also focuses on a separation between metadata and data, so perhaps there is a nice match between that approach and this proposal.

2 Likes

Wouldn’t they be searched by any of their three parts and also by (object, predicate) and (predicate, subject) and (object, subject)? That doesn’t necessarily fit into the original key-value model of Safe. While each triplet can be uniquely identified by the hash of the tuple, there would be a need for 6 separate lookup indexes as well.

I can see two approaches for that:

  • some sort of a document based indexing scheme implemented (and reimplemented again and again) at a higher level of the stack
  • a new type of data lookup built into the core

The first of these has the downside that the index would need to be maintained manually by the owner of the application (with all the costs involved), even if the entries themselves were added by other users. This may or may not be a good thing; I won’t argue for either, just thought to bring it up.

The second would give graphs a more native feel on Safe. On second thought, if index/lookup isn’t handled by the core, I’m not sure the idea has any benefit over just implementing the entire thing in the application layer.

Would that be an issue when the search is distributed over so many nodes? Possibly, or maybe even more so, or maybe not at all. Safe will be very different from centralized databases.
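Just to illustrate what those six indexes would amount to (purely a sketch, nothing Safe-specific):

```rust
use std::collections::HashMap;

type Id = u64;      // some id for a stored triple
type Term = String; // subject, predicate or object

#[derive(Default)]
struct TripleIndexes {
    by_s: HashMap<Term, Vec<Id>>,
    by_p: HashMap<Term, Vec<Id>>,
    by_o: HashMap<Term, Vec<Id>>,
    by_sp: HashMap<(Term, Term), Vec<Id>>,
    by_po: HashMap<(Term, Term), Vec<Id>>,
    by_so: HashMap<(Term, Term), Vec<Id>>,
}

impl TripleIndexes {
    // Every inserted triple has to be registered in all six indexes.
    fn insert(&mut self, id: Id, s: &Term, p: &Term, o: &Term) {
        self.by_s.entry(s.clone()).or_default().push(id);
        self.by_p.entry(p.clone()).or_default().push(id);
        self.by_o.entry(o.clone()).or_default().push(id);
        self.by_sp.entry((s.clone(), p.clone())).or_default().push(id);
        self.by_po.entry((p.clone(), o.clone())).or_default().push(id);
        self.by_so.entry((s.clone(), o.clone())).or_default().push(id);
    }
}
```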

2 Likes

While we’re at it:

I’ve been planning for this.

Do you have some suggestions on particulars of that design?

2 Likes

I’m sorry but I don’t have much. I know it will be necessary for the thing to make sense (why store stuff if we can’t do anything with it) but that’s all.

1 Like

It might be beneficial to aim for supporting RDF* (RDF Star), as it is currently undergoing an accelerated standardisation process and is central to Graph data standardisation. It is a pretty big deal and also a prerequisite for graph databases living on the Safe Network. A high level overview of what it’s all about, plus pointers to the standardisation efforts underway, can be found here.

In this case it would require support for “nested triples” so that extra attributes, like provenance information, can be assigned, which will be especially required on a decentralised, permissionless network like Safe.

Luckily there are quite a few research projects claiming various levels of success at decentralising and distributing triplestore query loads. With luck, one or two of these methods will work well with the strengths of the Safe Network, allowing graphs to have, as you say, “a more native feel on Safe”. A good contender for the killer app.

On the topic of scalable cloud based RDF storage, @oetyng, you might find this paper, “Towards a Scalable Cloud-based RDF Storage Offering a Pub/Sub Query Service”, of interest due to your previous CEP work. In this case we would throw away the “compound event” quadruple hack workaround for plain old RDF limitations and use (soon to be) standardised nested RDF* triples instead.

6 Likes