DataStore over AppendableData design

oetyng · February 22, 2019, 3:33pm

Background

Simple, fast and efficient data storage and retrieval is one of my favorite subjects within SAFENetwork scope. As you might know I have been doing some implementations to try and see how the current API can be used for it.

There’s currently a proposal being made for replacing MutableData with AppendableData. The discussion can be followed here: Appendable Data discussion - #282 by Anonymous2020

My previous implementation based on MD will have to adapt, and for that I have come up with the following, which is still a work in progress.

When I was asked to provide some requirements for the DataStore, I thought I’d go through what my thinking is so far with regards to AD in the context of SAFE.DataStore, and present them in it’s own post. I have mostly thought of this in terms of what can be done, with what we have, i.e. it’s been very investigative, and continues to be, and for that reason I haven’t really set out a goal for what should be accomplished.

Anyway, here goes.

EDIT:
As per request of @Southside (thanks for bringing my attention to it!) I’ll include a chart to explain the Big O notation used further down:

(Thanks @draw for looking up the chart.)

Requirements for AppendableData

Indefinitely add new version to a piece of data (this is what emulates mutability of this piece of data, when we use a GetValue() API that returns the value as of latest version.).
O (1) on the above (both an append, as well as a read of latest value).
Fairly fast lookup of previous versions. (I think O log(n) will be required here…)
Writing multiple values in a single commit to a structure (for an AD, that would be like adding a range of values in one request, where each value is appended as a new version).

AD API

ad.Append(value) // always appends, value directly stored in AD
ad.TryAppend(value, expectedVersion) // optimistic concurrency
ad.GetValue() => (value, version) i.e. current version
ad.GetValueAt(version) => value
ad.GetAll() => array of values, where index would represent version
ad.ReadForwardFrom(version) => array of values from [version] to current version
ad.ReadBackwardsFrom(version) => array of values from [version] to version 0
ad.GetRange(version_a, version_b) => order of a and b determines order of returned array // [a b], a != b, a,b > 0,

DataStore Architecture

Explanation of the components

ValueAD and StreamAD are the two different uses of an AD. For ValueAD we only care about the latest value, which we get by calling ad.GetValue(). The history is there, but it is not something we are primarily interested in, so for our purposes it is opaque (hence the 2D symbol). For StreamAD, the whole history of versions is interpreted as entries in a collection, and we have equal interest in all of them (hence the 3D symbol).

Each unit is a combination of two ADs, one acting as ValueAD and one as StreamAD, or in the leaf node level of the trees; AD and ImD. That way the unit has a fix address, but “mutable” content. The StreamAD / ImD holds the properties of the unit. Every time we need to mutate the properties of the unit, we produce a new StreamAD / ImD with the desired properties, and append a pointer for it, to the ValueAD. This way, the entire history of the unit is kept intact, while emulating mutability, since the ValueAD reference is always kept.

Detailed design

Application start

You load the address of your datastore root AD (acting as ValueAD, i.e. we care only about value at current version), from your app container.

This ValueAD contains a pointer to a StreamAD (i.e. an AD where we care about all the values/versions), in the diagram we call it DataStore Registry. Anything related to the DataStore is stored here. One such entry is the TypeStore, which is a registry of what “types” your datastore contains. So you have an “Entity_A” type, and a “Entity_B” type in this application.

In the TypeStore StreamAD, each entry is a pointer to a ValueAD. So the entry representing “Entity_A” type, points to the ValueAD for the Entity_A type. The current value of it, points to a StreamAD which has the following:

The primary index ValueAD address, this ValueAD latest version points to the current StreamAD index structure.
… n Secondary indices ValueAD addresses, these ValueADs latest versions point to their respective current StreamAD index structure.

This is loaded once at start.

Indices

Whenever a new index is created, the address to the new ValueAD is added to the Entity_A StreamAD. In case a type is removed, a new StreamAD is produced without the type, and the ValueAD for the type is appended with the pointer to this new StreamAD.

Writing data

The indices themselves are never appended to, as we want them to be sorted. To solve this we use LSM trees, Log Structured Merge Trees.
An in memory structure is kept, that is inserted to (and queried), and as a certain number of entries has been reached or time passed, it is flushed to network. This is either a simple addition of a new StreamAD (or tree thereof), or in case any smaller or same size tree structure already exist, it is merge sorted it in memory, until no more same size or smaller exist.
The sorted result is then sent to new instances of StreamADs (so, writing a set to each of them would be “the writing multiple values in a single commit”-feat), and new values are appended to the ValueADs that points to these new StreamAD trees (the new structure, containing the sorted entries of the existing trees, and the in memory tree).
This procedure is called compacting.
This will give log(n) number of trees which are holding our data for our specific index, each new half the size of the previous most recent.

As to support reliability, a transaction log in form of a StreamAD is inserted with any change, before it is applied to the in memory structure. In case there is a process failure before flushing, the transaction log can be replayed to rebuild the in memory structure.

Write performance is O log(n).

Accessing data

The indices are tree structures made up of ValueADs pointing to StreamADs, each entry in the StreamADs pointing to a new sub-tree of ValueADs and StreamADs.
(Basically, a ValueAD+StreamAD is like an MD entry pointing to another MD, looking at it with current design).

The leaf level of these trees, via a ValueAD, points to an ImD with the actual data.
The secondary indices point to the very same ValueAD instance as the primary index, so that we only need to update one single place in the entire structure, when the underlying data is modified (i.e. a new ImD produced).
(Remember that these StreamADs representing a level in an index B+Tree, is replaced in its entirety via the compacting process, as to avoid small writes.)

In case of B+Trees, the StreamAD entry also points to previous and next StreamAD at leaf level. As the entries are sorted, range queries can be made by first doing a point query, and then scanning from there on, until the end of the range has been hit.

As mentioned above, for any given query, there will be multiple indices to hit. First the in memory one, and then the most recent persisted, and then the next (which is double the size), and next, and so on. There are actually log(n) of them, each taking O log(n) to query. Using Bloom filters and using them selectively (as they also need to be kept in memory) will improve the query performance (which would otherwise go to O log(n)^2 ) and bring it back closer to O log(n).

Problems

Atomicity

When compacting, we are replacing existing trees with a new. We want this operation to be atomic.
However, the tree consists of multiple ADs, which could support atomicity for multiple values committed to each of them, but would have a lot harder to support it over all of them together.
This brings up the need for coordination of reads and writes to the trees, as to not get data races.
One way is to use locks, but that is very inefficient especially in a distributed environment.
A database called Noria - written in rust, implemented over RocksDb - has an interesting technique to avoid read-write locks. Overview starts right here in this youtube presentation: Rust at speed — building a fast concurrent database - YouTube.
Perhaps this could be utilized in some way, to avoid read-write locks on compactions.

On the other hand, perhaps it will be as simple as just appending a new value at the root level of a tree, and get an atomic switch by that.

Data buildup

LSM trees is not a new concept. It has been used in filesystems for example; Log Structured Filesystems.
The way this works without filling up the hard disks, is that old unused data is garbage collected.
Now, if this principle is to be used on SAFENetwork level, and SAFENetwork is supposed to be a globally used data storage and computing network, then it is fair to say that we will experience the same data build up at micro level (individual computers), and thus also at macro level? The one important difference would be that there will be no garbage collection of unused data in SAFENetwork.

Will the data capacity needs be different just because we amass computers of the world population? I don’t know. Sure a growing network would have good chances of swallowing the data increase. But a mature network? Why would it differ from the needs of individual computers (HDDs/SSDs)?

Southside · February 22, 2019, 3:43pm

I’ll read this in depth later, but my initial comment has to be…

Please consider giving an explanation in non O notation next to every instance above. O notation is great for data scientists but most of us are not and would appreciate a non-shorthand definition to help us better understand and appreciate the hard work that has obviously gone into this proposal.

draw · February 22, 2019, 3:51pm

http://bigocheatsheet.com

neo · February 23, 2019, 1:11am

How would this work with multiple instances of the APP running around the place?

How does the performance change when there are requirements for additional ADs to be created to account for overflowing the AD with appends? Or are you just creating new ADs for each and appending one pointer in an upper AD. EDIT: Ahhh you say " (Remember that these StreamADs representing a level in an index B+Tree , is replaced in its entirety via the compacting process, as to avoid small writes.)"

So then my question applies to the ADs holding links to these B+Tree ADs

At some stage there will be overflowing of ADs into requiring additional ADs to hold the appended links. How will the performance slow down as the network has to read each of these ADs to find the latest (== current) value for the link. Remember the ADs writing scheme is proposed to keep one last pointer so that the next AD in the appends is linked so there is no facility to pointer to the last AD in the one AD’s sequence of appends.

oetyng · February 23, 2019, 5:10am

That’s why I ask for it to be O(1) as part of the requirements.
Getting the latest value of an infinitely growing AD must be constant time for it to be long term feasible.
At most log(n) for being feasible in the above. And then it will degrade performance correspondingly of course.
If that cannot be reached then they need to be swapped out regularly. And for really giant enormous data sizes (haven’t checked how big), we would eventually need to create a new account as to get rid of the previously filled upp app container.
This is a scenario I lay out in the Appendable data discussion topic.

This is from the draft I wrote the other day, didn’t include it in this post:

So, it would work in the way that we will see eventual consistency corresponding to how often we compact, or the need for replicating the in mem table (in which case we are more likely to be using dedicated database servers which clients connect to, instead of writing in each client.).

neo · February 23, 2019, 5:35am

I know this is not in the scope of your discussions here, but to me this is one of the problems with AD concept. After a long time of operating we will be seeing increases in access times for these ADs that eventually grow into multiple ADs holding the data/pointers/links.

My thought was that if you hold the data in memory and are updating that memory and only writing it back out after its too much for memory then the other instances will not see the new data being stored for some time. I imagine a seat reservation system requiring that any updates to the memory indexes/data will require it to be written out to the network.

Then as I read what you wrote I also realised that the other issue is that if one instance is adding a record with index 000111 and a second instance is adding a record with index 000111 then there will be consequential data loss. There would need to be measures in place to deal with this and be just more complexity I guess in the write strategies.

In the API above I am thinking that if a true MD system was used then there would only need to be a couple of other apis for mutate value.

I am guessing to overcome the issue of tracing through the ADs then the sections would need to have a dual pointer store for any AD address. That is the actual AD the address corresponds to (version 0) and the address of the latest version for that desired AD.

oetyng · February 23, 2019, 5:47am

The limit is not “too much for memory” though, which would be too slow in most cases as memory is often quite big, but a set size and / or time passing so that a fairly regular flush occurs.
But it’s a trade off of course, and yet to be seen if it can be sufficiently fast.

Don’t think this leads to data loss. If you have two writers, they see only the data they themselves have written. As soon as one has compacted, the other will need to merge its data with the first one.
The optimistic concurrency API of the ADs (TryAppend(value, expectedVersion) ) makes sure there is no concurrent compaction writing over a newly written one.

That’s what I think as well.

neo · February 23, 2019, 5:57am

So for instance if you had a record with the seats available for an event, then 2 instances could not allocate specific seats and/or would have to read the record then write it back out. If upon writing the expected version is wrong then the instance that saw that would have to fail the attempt and the customer choose another seat. For most relational databases online this check for avail seats is quick and possible to keep real time updates showing seats as they are allocated and the customer experience is good. But for SAFE this has significant network lag times and your “in-memory” lag times.

oetyng · February 23, 2019, 6:04am

Seat reservations (and similar) is an interesting area, because counter to common belief how these work in real world is often not by transactionally reserving the seats in real time.
Instead there is an eventual consistency and if there is over reservations it is solved by the business. (I.e. tough shit, we’ll compensate you with this or that).

But, there will still be business areas needing real time transactions requiring low latency full consistency with multiple writers. Those will not be looking at SAFE as a first option, as SQL on SAFE is probably not among the first things we get (due to these complications).

And this is also why it will be hard to get such businesses (or those who provide a nice customer experience with booking systems relying on real time updates) to move onto SAFE.
The pain of data breach epidemic hitting them will be pushing them towards SAFE, but the latency / performance will be a repellant.

neo · February 23, 2019, 6:44am

Actually there has been changes since transaction rate of online server systems has increased. Now it more depends on industry and type of seat reservations needed. Its now very bad business here to over book seats and they typically use real time bookings. For instance a simple example is cinema bookings where you can see the actual reservations on line and if you stuff your booking then those seats you choose are frozen till they time out after 15 minutes. Our airlines are the same.

Actually I am at a loss to find any examples of the “olden” time seat reservations in our country like we had in my younger days. Not sure if we have legislation here that outlaws over booking as a method to handle bookings. Ticketmaster actually handles some events that sees the occassional large event booked out in minutes without the complaints of overbooking. And within those minutes you get your actual seat (section/row/seat) allocated and ticket with the actual seat specified in PDF form. So they must be using transactions per (set of) seat allocation

I am thinking that a system that allocates one AD per seat available would not have such speed performance issues since a seat allocation requires only one append with the expect version (ver 1 I suppose) and the value for that AD would be customer details. The very small fraction of cent per AD would not be an issue since we have large ticket processing fees of some $3 to $7 here.

oetyng · February 23, 2019, 7:10am

Yep, and if it includes multiple businesses (flight, hotel, car) theres often a Saga handling it, i.e. it goes from one to the other, and if any fails, it issues compensating actions (unbook), all of which looks like a single succeed or fail to end user.

Yes that could be a good way for such.
So when you have a set to make available for bookings, you add those ValueAD addresses to a StreamAD, and that address you insert in some index hierarchy (you want to be able to find it later, so that you have this event/flight/whatever in your db).

Then you share this StreamAD address with the public and they can race to get the index 0 of any ValueAD in it.

So, any indexing of such bookings is done afterwards if they are needed. Not all things need full indexing immediately, and those are better suited to go straight into append only streams (which themselves would have to be pre-allocated into some sort of db structure).

neo · February 23, 2019, 7:15am

Yes if you are crossing businesses. Holiday plan booking of flights+hotels+car is understandable. But here within the one business (eg airline, event booking, cinema, or …) they can do the booking immediately without over bookings.

Actually remembering my trip last year, they actually now have a “hold” system which holds the booking temporary allowing the agent to do the other books in sequence and not try all at once, but they know immediately if any is not possible. With the hold system then once they have all the bookings on hold they can then finalise the bookings without any overbooking which is the major problem.

Yes true, indexing can be done later.

But my thought is that when the seller’s APP sets up for an event it will allocate all the ADs prior to opening for bookings. Thus there is no issue to finding ADs (since someone else may have the AD at the address their algorithm uses) and primary indexing can be pre done. Only customer (etc) indexing would be delayed if needed. Remember the APP runs on the customer’s PC and so all this is done through their Machine so

AD allocations needs to be done by the ticket seller prior to opening the event for bookings
indexing is best done after by the ticket seller so that the customer has less opportunity to hack the system (hacking the booking APP)
encryption of the indexing is wise to prevent hacking/snooping

Anyhow I guess this is getting a little off the central theme you wanted to discuss, so I’ll leave it there since this is more implementation.

oetyng · February 23, 2019, 7:52am

This happens too, but it is 2PC which gives n log n, so those who require higher throughput I’ve seen more recently use Saga.

Exactly, pre allocating the ads, then sharing the collection address (and releasing the key for the entries).
But I always use random MD addresses (and would expect to do the same with AD) as to never have to care for conflicts. So that is why the primary index is important at every level.
Otherwise you can also go by a consistent hashing scheme at various levels, that was used throughout by my first SAFE datastore implementation.

I’m trying to find what problems AD will give for the data store I’m implementing, and how those can be solved, so it’s all good.

oetyng · February 25, 2019, 12:14am

Additional thoughts

Clarification

I have here been investigating how the data storage implementation I currently work on, would need to change as to work with AppendableData instead of MutableData, given a specific solution for append only data storage (LSM trees).

I.e.:

How could an LSM tree backed storage be designed
How would it work if we had AD instead of MD
and
What would I assume of AD functionality for it to hold

Mostly I thought about this because I wanted to know how I should continue the development of it even after such a change, and naturally because it is just as fun to solve this problem as it has been to try solve the problem with MD the first time.

What this is NOT (if anyone thought so), is a proposal for AD or really an opinion at all about AD. It is NOT a proposal for MaidSafe on something they (or anyone else) should implement. Its not a proposal at all. This is just me sharing my work on the data store, and possible solutions I see for it and ways to adapt my code and design to potential changes in SAFENetwork.

Implications of the design

Now of course, in doing that work, it also leads my thoughts further on what the change means in a larger scope than my own implementation.

The data store I design, is not assuming limitations in what SAFENetwork is suitable for. That means: it assumes that it can be used for anything that people use for example MongoDb or Cassandra, etc. etc. for. (LSM trees are used in data stores such as Bigtable, HBase, LevelDB, MongoDB, SQLite4 [5], Tarantool [6], RocksDB, WiredTiger [7], Apache Cassandra, InfluxDB [8] and VictoriaMetrics [9].)

One thing that came up when I was thinking about problems with this data store design, in the larger scope, is exactly this fundamental difference in the existence of garbage collection or not.
LSM trees are implemented as write only, but they are not implemented (as far as I know) on media that cannot be garbage collected.
That is quite a fundamental difference.

I am assuming here that data storage capacity growth, and need for garbage collection, operates at different levels. What makes individual hard drives working with log structured filesystems, not to fill up on day-to-day basis, is not solved by data storage capacity growth (which obviously also happens in parallel out in the world, but is not what in practice avoids the fill up). On a macro level, would the constant upgrades (replacing) of storage media by vaults in the network have this effect? My initial feeling - without having checked the numbers - is that the orders of magnitude are different, so that it wouldn’t work. (Anyone up for some back of the envelope estimations are welcome to come forth!)

The data storage with LSM trees over appendable data design, that I have described above, would give a very large production of waste data.
Now, it is of course one technical approach to emulating mutability in a 100% immutable setting.
It assumes that we for our applications do need to emulate mutability, and that it is a fundamental necessity as to produce software that we can use for everything we need.
It seems to me that it touches on the fundamental issue, that it is not a property of this specific implementation, but indeed a property of any system trying to emulate mutability in 100% immutable setting.
I would be interested in hearing other approaches, ideas and suggestions, as to either show that my suspicions are wrong, or maybe to confirm them, whatever might bring us any step closer to knowing how to best deal with it.

Dynamic, static or something else?

There’s also the possibility to view the SAFENetwork as not supposed to deal with that kind of data handling, i.e. large amounts of dynamic data, and instead more of a static data store.
Technical boundaries would lead to some limitations in use cases, and there’s nothing wrong with that.
A space rocket is very good for taking things out to space, but not very good for plowing or harvesting lands in agriculture. (Well, maybe you could do some controlled shock-wave harvesting with a SpaceX rocket and a joy-stick, piling up crops in nice heaps at the edges of the fields. But I guess there’s a big risk of just a very expensive sea of fire and chaos )
Point being, maybe this advanced rocket shouldn’t be used for everything, as a versatile multi tool, but to solve specific problems. And so, such insights would guide us better in our efforts, with what we try to do. Now, I still hope and wish that the full spectrum of possibilities will still be there, but I don’t demand of the flowers to be trees, or the sun to be the sea. Everything is what it is and the best way to use it is to know things for what they really are. I guess that way of seeing things is a big part of what makes me like exploration and problem solving.

JPL · February 25, 2019, 8:44am

Just for my understanding, is the basic issue here that SAFE (and presumably other decentralised systems) cannot be garbage collected because memory is not pooled? So all data would need to get written to disk?

neo · February 25, 2019, 9:09am

No because garbage cannot be deleted.

oetyng · February 25, 2019, 12:59pm

The problem revolves around deleting data. It becomes very complex to do it safely (meaning: not causing anyone who would expect to find it to later not find it) when data is stored in a network like this, since you need enormous amount of tracking to see who uses it. So a much simpler approach (needing less code, i.e. more robust) is to not allow any deletion.
But, then we don’t have any garbage collection either (i.e. removal of completely worthless, unused data, which would - as far as I can see - be produced when trying to emulate mutability, and which log structured filesystems and databases rely on as to not very quickly run out of space.)

Well, it’s the story of life, where ever you go there’s daemons to fight, they just come in different shapes and colors.

But, if I were to choose, I would prefer immutability as to start building on something simpler, and eventually investigate possibilities to allow mutability in some way.

I can think of much less complicated ways to build an indexed and searchable datastore than the OP example, that would perhaps have OK time complexity for reads and writes.

For example, just make everything streams. If we assume that AD will be based on fix size segments that will be allocated in different sections (like a chain of MDs), then we can optimize for that size, and make a snapshot so that it lands on the next-to-last slot (version), right before the link slot to next segment.
And if we assume that there’s a fast and simple way to get most recent value (version) of an AD when you hold a reference, then you know you just need to load all values from that segment, follow the link on first slot in the segment (pointing to previous segment), and pick the snapshot on the next-to-last-version on that segment, and then you can always restore some state with no more than n = [segment size] apply-operations of events since the snapshot - regardless of the AD version length.

So, every change to a primary/secondary index in the datastore, is appended to such an AD. Let’s say the segments will be 1k entries. At version 0 would be nothing, since there are no previous segments in this AD. At version 1-997 (inclusive upper bound) you would place the serialized data of every change (small changes, so < 100Kb size, which is still plenty for an event, which is more likely to be ~1Kb). At version 998 of the very first segment, you’d insert a snapshot. So what that means is: you’d apply all the 997 events through what ever function models your state, and get your current state. That current state you serialize and put to an ImD, whose datamap goes into version 998. Then version 999 takes the link to next segment.

Anything can be built with this (in principal, but maybe not getting it as efficient as we have become used to, that would be left to see). We wouldn’t avoid the garbage collection problem, but I think the data buildup would be at a slower rate than with OP example, and it is much simpler.

EDIT: disambiguation index / version for clarity

mav · February 25, 2019, 11:12pm

Is this kept by the consumer app (run by the consumer, eg ticket purchaser) or the owner app (run by the owner, eg the ticket seller) or by the vaults?

This is fine in non-SAFE applications because the person operating on the data (ie user or operating system or server administrator or app itself etc) has authority over what is defined as garbage. But on a distributed permanent network, what’s garbage to one person is valuable history to another - there’s no authority over garbage. ‘Unused’ needs to be the subject of a detailed debate here I feel (more broadly than just the context of LSM Trees), and ultimately will form the parameters around what the SAFE network is and isn’t meant to be for.

This is a good point, and depends on how we define mutability and the purpose of that mutation. If the purpose is to stay up to date, AD can emulate it perfectly. But if the purpose is to preserve privacy by deletion, then obviously AD cannot emulate that.

I think the main issue with AD is a performance one - clients need to compromise between staying up to date but not consuming lots of storage space or request bandwidth to achieve that.

My proposal is not very fancy - a binary tree structure containing a list of pointers to IDs (except the first entry is a config for the tree, and the last entry points to the next AD tree structure in case this tree fills up). The config entry contains a ‘predetermined’ depth depending on expected usage and lifetime. (The predetermined depth may be chosen algorithmically, it does not need to be chosen by the app creator).

If I have a very slowly updating datasource like my personal webpage then a single level tree is enough (approx 32K entries per AD). I can update the site 32K times before history spills over into a second AD (I call the spillover event a new ‘epoch’, current users can stop tracking the first epoch but new users will need to traverse from the first epoch to the latest). If I update the webpage once a day on average, this is about 90 years of updates in a single AD.

If I have a very fast updating thing like a database, the tree structure might be set at 10 levels deep. This means clients must fetch 10 AD to get to the latest entry, but it allows 32K ^ 10 updates (that’s insanely huge) before it spills over into a new epoch. The first level will update very rarely, the next level a bit faster, and the lowest levels will update very rapidly. But the user doesn’t need to traverse a linear history of all updates to get to the latest version. They just check if the top layer has changed - no? Second layer changed - no? Third layer changed - yes? Get the latest AD entry and follow the latest values down to the bottom of the tree; now the client is up to date with only 10 AD requests and have skipped potentially over billions of AD of history (which would be pointing to trillions of ID of data).

I think the binary tree structure has a good compromise for users to find the latest without too much storage or requesting. O(n) to traverse epochs, O(log(n)) to traverse within an epoch. It doesn’t even take any planning to get the depth right. If the root structure fills up too fast, the next epoch in the chain may be configured to use a more suitable depth. It can work with the AD API from the original post.

LSM Trees are fascinating but the need for duplicate data + garbage collection seems unsuited to AD.

happybeing · February 26, 2019, 9:33am

Would it be credible for the binary tree depth to be adaptive, ie grow or shrink according to some control input, or would that always effectively mean rebuild from scratch?

oetyng · February 26, 2019, 1:23pm

OK, I need to clarify things that might not have been obvious in the OP.
So, the goal is to construct a datastore with indexed and searchable data.

Part 1

Goal: Indexed and searchable data.

Have a look at this chart to start with:

Comments

This falls under the very first [ NO ] in the diagram. This is not part of the problem I want to solve.

I am saying that this is not related to LSM trees, rather related to emulating mutability over immutable storage.

When we have copied the same data 1000 000 times (and far earlier than that IMO), then it’s quite fair to consider a few of those instances to be “garbage”. That is what I mean at least.
You’ll see further down, exactly what this duplication of data comes from.

The problem

Let’s continue with describing the problem:

The problem I want to solve, is to provide the kind of data storage (and access) that has been talked about during all development of SAFENetwork:

Autonomous cars data
SAFESearch
Token Exchange
Critical infrastructure (powerplants smart electricity grid etc.).
etc.

All of these go all the way down to the last [ YES ] in the chart:

Large amount of data - Check
Access data often or fast - Check
Continuous insert of new data - Check

So, these things seem to need indexes.

The problem we are solving is very similar to that of storing to disk. It is costly (in time or PUTs) to write, and we want to be able to access the data somehow afterwards. This is not a new problem and it has been studied in depth.

First of all, let’s go through a bit about the trees:

Binary trrees are not suitable for for this since they have a very low fanout; only have 2 entries in a node. This gives a rapid increase in tree depth.
BTrees can have n entries,which are stored at every level together with keys. This gives a bit better read performance than B+Trees, but is more complex and requires more merging and splitting of nodes.
B+Trees can have n entries, which are all stored at leaf level. This is less complex to implement, utilizes the capacity of a node better and gives a larger fanout than BTrees and requires less rewrite of the tree as new data is inserted. This is the preferred solution for disks.

OK, so we have found a standard approach to build an index, which is B+Trees.

So, we need to look at why the B+Tree requires mutations.

To search for data, maybe all temperatures within a given range from a component in some powerplant for example, we need range queries.
If we are only indexing on a data id which is a monotonically increasing number, then there will be no problem, since every insert by default will result in ordered data.
But temperature measurements can come in any value. 123, 456, 234, 876, 345. They don’t happen in an ordered fashion. So every time we insert the data into a tree, if we are going to have an ordered tree, we will need to route it to it’s correct part in the tree. (This is what makes it fast to do a point query also, i.e. get where data equals an exact value.) But as new values come in, what once was the correct place for a previous value, will no longer be the correct place as to maintain a sequence. So as the new value should go there, we start re-arranging as to maintain the order.

^That there is mutation.

Part 2

Emulating mutability

The simplest way to do this, is to use MDs, and just rewrite the value in an entry. Done. Problem solved.
But, we are trying to solve the problem within the confines of immutable storage, so AD then.

A way to emulate mutation is to combine the two ways we can use ADs, i.e. as a value with opaque history, or as a collection of values. When combining them, we can store a collection at fix address, and by appending a new pointer, we can emulate that the collection is changing. What is really happening is that we are producing a new collection (new AD), and pointing to it.
Now, this doesn’t need to mean enormous data duplication, since the actual data need not be in that AD collection, only the pointers to it. The data is in ImDs, and that we don’t duplicate. The AD collection with pointers isn’t very big in comparison to the actual data. But still, over time, this will risk building up a lot of duplicated unused and meaningless data.

I’m a fan of event sourcing, so I’m not the least unfamiliar with the notion of the potential value of keeping ALL history (that’s how event sourcing people feel generally). But history is one thing, duplication is another. One copy of the history is good enough. Trillions and quadrillions and so on, of them is not needed.

Speed vs storage overhead

With or without out indexing, we stand before this problem:

For those applications that don’t need indexing, this is the problem with regards to the actual data.
For those applications that will use indexing, this is the problem with regards to the indexes as well.

When we go to the right end (append every change), we are not duplicating any data at all, and have minimal serialization costs when writing (so, high write throughput), at the cost of ever degrading access times (slower and slower queries).
When we go to the left (say, when you update email address on a Person model, you store the entire person model again, with Name, Address, Phone numbers, Birthdate, etc. etc. etc. i.e. constant duplication of data), we get fast access, since you just fetch the Person with one request (especially if you store the address as a hash of a unique value of the Person model, so you just calculate the hash, and access the data). But this comes at the cost of tremendous duplication of data, waste data. There is no need for more than one copy of the Phone or Address data, but we can’t avoid it. Also, writing is slower, since we need to load, deserialize, mutate, serialize and then insert large amounts of unrelated data with every little change.

There is no other dimension to the above, that is the physical reality we are stuck in when we go the immutable path. That is what I say seems to be the case at least, and ask of others to show that it is wrong.
We could cheat of course, and not store the data in the network, but we are assuming here that 100% data storage in the data network is the requirement.

LSM Trees

This is a way to increase write throughput, and is not exactly the same problem as this post covers. It’s a subset of it. If we have large amounts of data to store, then we will need higher throughput.
But LSM Trees are also best served with B+Trees for indexing. The OP described an idea to implement this combination, over AD - and during that work, the more fundamental problem emerged; the three data storage requirements in the flow chart; mutability over immutability, and it’s implications.

Summary

So, if we want to be able to do these things:

Autonomous cars data
SAFESearch
Token Exchange
Critical infrastructure (powerplants, smart electricity grid etc.).

… it seems we have to solve this problem. I.e. efficiently emulating mutability. Or not go immutable. (Or what ever?)

We could also say that No, those things are not what SAFENetwork is supposed to do, i.e. we only want to do things like:

Backup Internet Archive as to ensure it lives on.
Individual websites that we modify now and then.
etc.

But that doesn’t seem to be what most people around here believe or wishes that SAFENetwork will do. It sure isn’t the only thing I wish for it to do.

Decisions, decisions…

I think if the decision is being made to go over to AD, then at least we must assess what that means for what SAFENetwork can do, and try to investigate if and how we can solve the problem that arises with regards to what we have imagined to implement on the network.

Solutions

Now, if anyone out there, has a solution to the problem, I would be very interested in hearing it, so that I can go ahead and implement it.
Because that’s what I want to do!

Topic		Replies	Views
Code: AppendableData as infinitely expanding DataTree, over doubly linked MDs Development	5	1064	March 5, 2019
Database @ Safe in the Published Zone Features development , appendable-data	8	509	December 13, 2022
Appendable Data discussion Features appendable-data , immutable-data , mutable-data	285	8265	July 8, 2019
An Overview of the New Data Types Development	40	2005	October 21, 2020
Code: Indexed and searchable data in SAFENetwork Development	13	2329	February 10, 2019