GitHub code: Indexed and searchable data in SAFENetwork


#1

Code with basic unit tests can be found here: https://github.com/oetyng/SAFE.DataAccess
It is a proof of concept of the algorithms and structures, using in-memory collections.
Next up I’ll add in the SafeApp connections, so that it can run with MockNetwork and Alpha-2.


So, last week I was revisiting the need for a design for efficient storage and access of data in SAFENetwork.

I have talked many times about how this is a topic I’d very much like to see more discussion around, and there are some older topics. @intrz and @neo have provided some nice input, and this time I was reading this nice post by @neo again before writing the code. @intrz posted a set of links to some very interesting papers here. (Honestly, I didn’t base any of this work on those; this is at a much simpler level.)

I think it’s almost half a year since I last coded on this. Back then I crafted a partitioning scheme that connected to a local network, as well as using MockNetwork. Work slowed down as issues indicated that there were still some garbage collection problems in the SafeApp library. But that project was quite investigative in nature, and it has quite a few flaws, both in the basic design idea and in the code in general (it takes many iterations to get it clean and simple).

This time, I was looking to overcome the deficiencies of that solution, and now I have the basis for an indexed and searchable (document-ish) database, though still very rudimentary.

The main differences are:

  • More or less limitless expansion.
  • Only increases as data is added.
  • Indexing on data properties, which enables search queries like SELECT * FROM table WHERE p.Name = 'name'.

Apart from the obvious connection to the actual network, I’ll next be doing the following:

  • Use Sprache for creating a simple query parser.
  • Continue to clean up and refine the structures (improve testability with inversion of control, interfaces, etc.).
  • Test performance in a local network.
  • Add documentation.

I’d be happy to hear opinions on how the indexing structures and so on can be designed better or differently (different approaches can give some nice thought-sparks).

The best way to understand the algos, IMO, is to just read the code, so please go ahead and read, and please feel free to ask questions when it is unclear how something works. (Please remember we are not - and do not strive to be - in the same universe as ACID yet; that is a tough nut to crack.)

A short description:

  • A DataTree is used to store data in a chain of Mds.
  • A head md is used as the top structure for the DataTree.
  • When the head is full, a new Md is inserted at top, pointing to the previous head.
  • The data entry stored in the new head is in the form of a Pointer object.
  • When data is added to the DataTree, it is stored in the form of a Value object.
  • Value objects are stored in the leaves of the tree.
  • Pointer objects are stored in all Mds between the head and the leaves.
  • Info on which types exist in the Database is also stored in its own DataTree.

  • A Database holds any number of DataTrees, for the storage of data objects.
  • Each type of data added to a Database, gets its own DataTree instance.
  • Adding data is done by passing in an object, and the key with which to find it.
  • Indexing on the key happens automatically.
  • Indexes can be created for any public instance property on the object.
  • The Indexer derives from Database, and stores the indices in its own database.
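
The description above can be sketched roughly like this - an illustrative in-memory toy, not the actual C# code; the names are hypothetical and the Md capacity is shrunk from ~1k entries to 3 so the expansions are visible:

```python
# Toy model of the DataTree: Values live in leaves, Pointers in the
# levels above, and a full head is lifted by inserting a new head Md
# that points to the previous one.

MD_CAPACITY = 3  # real Mds hold ~1k entries

class Md:
    def __init__(self, level=0):
        self.level = level      # 0 = leaf (Values), >0 = Pointers
        self.entries = {}

    def is_full(self):
        return len(self.entries) >= MD_CAPACITY

class DataTree:
    def __init__(self):
        self.head = Md(level=0)

    def add(self, key, value):
        if not self._add(self.head, key, value):
            # head is full: insert a new Md at the top, pointing to it
            new_head = Md(self.head.level + 1)
            new_head.entries["ptr-0"] = self.head
            self.head = new_head
            self._add(self.head, key, value)

    def _add(self, md, key, value):
        if md.level == 0:
            if md.is_full():
                return False
            md.entries[key] = value
            return True
        # try the most recently added child first
        children = list(md.entries.values())
        if children and self._add(children[-1], key, value):
            return True
        if md.is_full():
            return False
        # horizontal expansion: hang a fresh subtree off this Md
        child = self._new_subtree(md.level - 1)
        md.entries[f"ptr-{len(md.entries)}"] = child
        return self._add(child, key, value)

    def _new_subtree(self, level):
        md = Md(level)
        if level > 0:
            md.entries["ptr-0"] = self._new_subtree(level - 1)
        return md

    def find(self, key):
        # full scan of the tree; the Indexer exists precisely to avoid this
        stack = [self.head]
        while stack:
            md = stack.pop()
            if md.level == 0:
                if key in md.entries:
                    return md.entries[key]
            else:
                stack.extend(md.entries.values())
        return None
```

With capacity 3, the tenth value lifts the head to a second pointer level; the structure only ever grows as data is added.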

This is all very simple, the utmost MVP so to speak. An indexed and searchable database of decent quality requires a huge amount of work and is really complex. And that is without being distributed, or decentralized… So, there’s plenty of work to go around :slight_smile:


#2

Good to see you are still finding time for this @oetyng. I haven’t thought much about it myself, but am wondering how it might relate to RDF and SPARQL.

@hunterlester has been digging into that area, and been gathering some interesting info about how to do versioned queries on RDF, and wondering how this might be applied to SAFE. One of the things I found interesting, which he dug up, is a paper about OSTRICH. It is very good because it starts with a good review of the different approaches tried so far, categorising them by the different techniques they employ, and explaining the pros and cons, including how to benchmark this. In the end their approach (OSTRICH) is based on HDT (a binary serialisation of RDF) snapshots. Well worth a read for anyone wanting to think about RDF based storage and retrieval.

Anyway, in case it is of interest, here’s that paper, and good luck with SAFE.DataAccess:

https://rdfostrich.github.io/article-jws2018-ostrich/


#3

Such good timing @oetyng, because as I’ve been reading lately I keep thinking, “I wonder what he thinks of this”, given your background.

While searching for papers on storing RDF as a graph, I found this paper about how to store a distributed RDF graph.

Then @nbaksalyar found this piece about how we might query the graph partitions.

There’s so much more to consider but here’s an initial rough brainstorm:

  • Each vault runs a partition of the network graph in-memory
  • Each vault is responsible for archiving their partition of the graph in some kind of format like HDT binary or Hexastore onto our network data types.
    • The archive is solely for use by vaults to construct a graph partition to run in-memory
    • How often would/should archives be updated?
    • How might this change farming economics? Do vaults need to also be rewarded for utilizing RAM, depending on what’s needed to store the graph partition?
  • Perhaps vaults are also the best place to house a SPARQL engine, exposing query operations to other vaults and to clients.
  • Client applications could add triple statements in whatever format, where vaults would then hash separate triple statements, then route them to a graph partition in the network according to the XOR address of the triple statement, which the controlling vault will at some point archive.
    • How would an account be charged for writing to a graph via a client application?
      Clients wouldn’t directly be appending an MD or creating a new ID, since that would be left up to logic running in the vault that determines when to create an archive of its graph partition.
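
The triple-routing step in the brainstorm above might look something like this - a hedged sketch, where the hash choice, the vault addressing, and the function names are all my assumptions rather than actual vault code:

```python
# Content-address a whole triple statement, then pick the owning graph
# partition as the vault whose address is closest in XOR distance.

import hashlib

def triple_address(subject, predicate, obj):
    # hash the serialized triple statement into a network address
    data = f"{subject} {predicate} {obj}".encode()
    return int.from_bytes(hashlib.sha3_256(data).digest(), "big")

def closest_vault(address, vault_addresses):
    # the XOR metric used for routing: smallest a ^ b wins
    return min(vault_addresses, key=lambda v: v ^ address)
```

The same triple always hashes to the same address, so every client routes it to the same partition without any coordination.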

#4

Oh haha, yes… But I have left leaf mounds in my yard that I should have carted away long ago; I had planned to do it this weekend (as well as the previous one, and so on). My neighbors will think I am very lazy :smile: (Well… I am, but not in all ways :slight_smile: )

Mmm, and I have not looked into that much, so I don’t know either.
But from the little I picked up from the discussions, it seems that the RDF implementation will be a layer over MDs in the same way as this code is - but at the network level? So both are ways of organizing the access. Probably quite different, but I will have to read more there.
Thanks for the link.

Aha!

Very interesting things there, I’m quite exhausted right now. But plenty of information for me to dig into, and I’ll happily get into the various aspects there and come back to you with my thinking.

Just shortly: I have been thinking along the same lines there; that the indexing engines can run on the vaults, according to some partitioning scheme, and queries can be xor routed to the relevant vaults, that then execute the queries on the locally stored data.

I will need to get back to this, my brain is mush now :exploding_head::sleeping::upside_down_face::thinking:


Update on SAFE.DataAccess:

  • Added the SAFENetwork MD operations.
  • Refactored to be able to implement the IMd interface from the PoC.
  • Refactored all for async / await.

So, the only in-memory part was Md.cs (implementing IMd).
Now there is also an MdOps class (in SAFE.DataAccess.Network, with a reference to SafeApp.dll), implementing IMd. So there is now the actual SAFENetwork connection. All the other code is unaware of what the underlying storage is.

Next up is to add the SAFENetwork account interactions and top layers that can be used in an example application, and then a whole swathe of testing.


#5

Great to see you continuing your work on this; it sounds like you will be producing a really good system for data storage, or a database. My example was just one of so many ways things have been done, and it was just an outline. It’s good that it helped you to develop your plan.

I am a little busy at the moment to be reading code etc but hope to get to it soon enough.

It is really good to see people diving into various aspects of what the network will provide; it makes me feel confident we will have a working infrastructure running on top of the safenetwork at launch. In effect, the network will launch (land) with its feet running.


#6

I agree that’s most likely and what Maidsafe are working on atm. At the same time, though, we can think about whether your needs, or those of RDF/SPARQL (and therefore Solid), could be better satisfied with other fundamental storage types or features. No harm in thinking about that, although I haven’t been doing so myself - I haven’t delved into these areas enough. But maybe you have?

My own thoughts have just been around ways to handle NFS containers when they get full, but I haven’t designed for that yet. I can see some simple and effective ways of dealing with an MD that gets full of deleted entries when used as a filesystem, but I don’t have code to handle that yet.


#7

I agree with Hunter, it’s really good timing for this :slight_smile: Thanks for your research @oetyng ! I’ll read through the code and will summarise my thoughts.

In a very direct way, I think. Linked Data is indexable and searchable by its very nature and RDF with SPARQL simplifies the task a lot.

Not necessarily, but everything depends on which approach has the best performance and requires the fewest changes in the network code. And there are many options for this: RDF can be stored as a graph in a key-value storage (that is, Mutable Data) - as an example, RDFLib has a back-end storage implementation on top of Berkeley DB, a key-value storage too. But I think tests will also show whether storing RDF in a separate triple-based data type is more efficient.
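
As a rough illustration of the key-value option - purely a sketch, where the entry layout and names are mine, and a real Hexastore keeps all six permutations rather than two:

```python
# Store each triple under a couple of key permutations in a plain
# key-value store (standing in for a Mutable Data's entries), so the
# common lookup patterns become simple prefix scans.

store = {}

def put_triple(s, p, o):
    store[("spo", s, p, o)] = True   # subject-predicate-object order
    store[("pos", p, o, s)] = True   # predicate-object-subject order

def objects_of(s, p):
    # "what does s point at via p?" - scan the SPO permutation
    return [k[3] for k in store if k[:3] == ("spo", s, p)]

def subjects_with(p, o):
    # "who points at o via p?" - scan the POS permutation
    return [k[3] for k in store if k[:3] == ("pos", p, o)]
```

Each extra permutation trades write amplification for another query shape answered without a full scan.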

There’s lots to research and consider, so any extra input helps! :slight_smile:

Yes, with federated queries I’d imagine the query parser / constructor / handler living on the client side, with the client deciding where to send the partitioned queries and how to compose the final result. The individual queries most likely need to be handled by Vaults, because otherwise we’d have to fetch the entire data sets, which is a very inefficient approach.

The network fundamental of “storing data in perpetuity” also simplifies this task in a lot of ways: with the assumption that the data at a given location never changes, it’s possible to cache it at different levels and make the queries very efficient.

Still, there are many unanswered questions that I’m trying to answer, e.g. how to deal with encrypted data. But judging just by the number of research papers on LinkedData/RDF out there, it seems like the community has already thought of everything. :slight_smile: I’ll keep on researching and will keep you posted.


#8

That was some really nice reading! My mind started expanding immediately. I see now how all this can be combined. This is so exciting :slight_smile:
I haven’t read it all yet, but as soon as I can sit down with this I’ll draw up some points.

The details on how these things were done previously are so fascinating. It always amazes me, the ingenuity that was there to solve things that today are not even noticed. Anyway, it added to my ideas for moving away from the consistent hash partitioning of the previous solution.

Ah, but this code that I wrote creates that abstraction. You basically get an object that can handle as much data as you can fit in the network (and pay for).
So, you have this one object (or n of them…), with an id, and you can totally disregard the fact that Mds have 1k entries. This object has as much space as there is in the network. You just add, and fetch by key with O(log n).
Then for deleted entries: if you have a very delete-intensive application, you could add handling so that the entries can be reused, so as not to grow the db unnecessarily fast. But it wouldn’t be needed otherwise.

(Now, the caveat is that with absurdly large data sets, the indexer db for the key values will have so many indirections that it becomes severely impractical. But considering it’s O(log n) access, I think there won’t be enough disk space in the world for anyone to actually get there… Creating new indices could take a long time at far more normal levels, though.)

I just had my eyes opened here when I started reading the links from @hunterlester. I need to consume the rest and then get my thoughts down. I’ll be back :slight_smile:


#9

I will be stealing that for SafenetworkJs at some point :smile:


#10

Example console application with the in-mem db:


#11

Mock routing storage implemented:


#12

So :slight_smile: a few initial thoughts from me here

From dev update:
“This involves storing and indexing RDF triples in Mutable Data, and most importantly, storing triples in the encrypted form.”

So, like it is said in the papers, and as you mention, it is graphs loaded into memory that would be most desirable, for performance.

  • It can sometimes be problematic to keep entire graphs in memory, so how much is loaded?
  • Persistent storage would otherwise be in the network MutableData.
  • Indexing (many different sorts, which are the most essential?)
  • Encryption, and querying over encrypted data, homomorphic encryption then? Are there other ways?

So, about long-term storage and fast retrieval: that is a smaller issue, IMO, than indexing and querying over encrypted data. I only just recently looked at a homomorphic encryption research example from Microsoft (apparently working, but I didn’t test it myself), so I am oblivious as to how available the technique is at the moment.

There are surely many ways to do the storage part, but basically, following the idea of this DataTree structure - increasing the tree vertically as the head gets full, and letting every subsequent level expand horizontally as an MD gets full - you get an infinitely expanding structure that can be scanned. That is quite a good place to start.
Now, as an item is stored, we also index it with the direct address of the resulting leaf MD, so we do not need to scan this structure.
Many things can be done to tweak it. For example, the DataTree will cache the current MD, so as not to traverse the entire tree on every add; once the leaf node is full, the DataTree will again be traversed - once - to invoke the horizontal expansion, and the newly added leaf is cached for subsequent adds.
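
That caching tweak can be sketched like this - illustrative names only, with the stand-in `find_writable_leaf` playing the role of a full root-to-leaf traversal of the DataTree:

```python
# Cache the current leaf and write straight to it; only re-traverse
# (triggering horizontal expansion) when the cached leaf fills up.

MD_CAPACITY = 3  # tiny capacity for illustration

class Leaf:
    def __init__(self):
        self.entries = {}
    def is_full(self):
        return len(self.entries) >= MD_CAPACITY

class Tree:
    def __init__(self):
        self.leaves = [Leaf()]
        self.traversals = 0
    def find_writable_leaf(self):
        # stands in for traversing the DataTree top-down; counted so
        # the saving from caching is visible
        self.traversals += 1
        if self.leaves[-1].is_full():
            self.leaves.append(Leaf())  # horizontal expansion
        return self.leaves[-1]

class CachingWriter:
    def __init__(self, tree):
        self.tree = tree
        self.leaf = tree.find_writable_leaf()
    def add(self, key, value):
        if self.leaf.is_full():
            # traverse once to expand, then cache the new leaf
            self.leaf = self.tree.find_writable_leaf()
        self.leaf.entries[key] = value
```

With capacity 3, ten adds cost four traversals instead of ten; with ~1k-entry Mds the saving is correspondingly larger.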

Indices can probably be stored more efficiently, for example by filling every MD entry up to 1 kB - at the cost of heavier serialization at each write - but also getting a much more compact index DataTree (i.e. fewer MD fetch requests when scanning it to get the results of a query match - which, with path indexing, by definition will be everything in that tree).
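
The packing idea, sketched with a hypothetical serializer (JSON here purely for readability; the 1 kB budget mirrors the entry size mentioned above):

```python
# Batch index keys into serialized blobs of up to ~1 kB each, instead
# of writing one MD entry per key.

import json

ENTRY_BUDGET = 1024  # target bytes per MD entry

def pack_index(keys):
    entries, batch = [], []
    for key in keys:
        candidate = batch + [key]
        if len(json.dumps(candidate).encode()) > ENTRY_BUDGET and batch:
            entries.append(json.dumps(batch))  # flush the full entry
            batch = [key]
        else:
            batch = candidate
    if batch:
        entries.append(json.dumps(batch))
    return entries
```

Scanning the index then costs one MD fetch per hundred-odd keys rather than one per key, at the price of (de)serializing each blob on every write.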

Didn’t fully comprehend HDT binary or get into Hexastore, to see how well they apply. But other than that: yep.

The question of how often an archive is updated: it must be dynamic, IMO. It all depends on the type of data in the graph. So, it is the writer of the data that must decide that. But maybe the question was how often a node checks the persistent storage for changes? Well, I guess that is essentially the same thing as the first question: if these nodes are to be the gateway to querying the data, new data is only available at the rate with which the in-memory data is refreshed.

The querying will require CPU as well. While it’s true that RAM is required, throughput and access to what it holds are governed by how much CPU is dedicated to processing the queries. (Jumping a bit ahead here, connecting to this:)

I would definitely say so, with the obvious complication of encryption, as Nikita touches on.

Yep, maybe.
Trinity partitions by node id, i.e. subject and/or object (as opposed to by triple). Since a small percentage of nodes will have hundreds of thousands of links (which would incur prohibitive network load when querying), while most have very few, they partition differently based on neighbor list size, so as to optimize for these different cases. (Both graph partitioning and updates are out of scope of that paper; it would be nice to know more about how they solve those.)
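
In toy form, that degree-aware placement might look like this - the threshold, hashing, and names are my own illustration, not Trinity’s actual scheme:

```python
# Place ordinary nodes' adjacency lists whole, but shard hub nodes'
# neighbor lists across partitions to spread their query load.

import hashlib

HUB_THRESHOLD = 1000  # stand-in for "hundreds of thousands of links"

def partition_of(node_id, num_partitions):
    h = int.from_bytes(hashlib.sha256(node_id.encode()).digest(), "big")
    return h % num_partitions

def place_edges(node_id, neighbors, num_partitions):
    if len(neighbors) < HUB_THRESHOLD:
        # ordinary node: the whole neighbor list lives with the node
        return {partition_of(node_id, num_partitions): list(neighbors)}
    # hub node: shard the neighbor list by neighbor id instead
    shards = {}
    for n in neighbors:
        shards.setdefault(partition_of(n, num_partitions), []).append(n)
    return shards
```

The point is that a query touching a hub fans out over many partitions instead of hammering the one vault that happens to own the hub.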

Hmm, I initially saw this differently - as the client actually doing the MD update, and then perhaps signalling refresh request to the vault responsible for that partition. What would be the reason not to do so?

Yes, so Trinity processes the queries in parallel, and the machines communicate with each other for a given query. Not that their model must be copied straight off, but if we play with that idea as a starting point, their machines map 1:1 to the Vaults.
A client could have the query planner part, or maybe there could even be some dedicated query planner role, which then directs the query to the appropriate vaults for execution and collects the results, perhaps doing the statistics as well. Maybe, depending on the type of query, we would find that some simpler ones are optimally planned and managed from a client, while others are sent out in parts to the network. There’s a lot to go into there.

And it gives an extremely powerful and interesting new query dimension, if the version is available to index on, with or without timestamp.

To quote hunter again
There’s so much more to consider but here’s an initial rough brainstorm :slight_smile:


#13

I recently read through @happybeing’s reference Three Shifts To Decentralise The Web, specifically the section on “Interfaces become the queries”, with reference to this website: http://linkeddatafragments.org/ which “aims to grow decentralized querying to a Web scale”, reminding me of this thread. It contains interesting related talks and projects that might be a useful reference.