Code: FileSystem - Unlimited Size (working on Mock Network)

Is this something to think about?

https://github.com/fuchsia-mirror/docs/blob/master/the-book/dotdot.md

5 Likes

IMO, rsync is about as easy as it gets for managing a local folder with a mounted SAFE drive.

1 Like

Yes, rsync is a very useful tool. But inotify serves another purpose: it makes it possible to run rsync automatically and immediately after a file changes, rather than periodically.

5 Likes

@oetyng I’ve been revisiting your old threads and am realizing just how forward thinking you’ve been.
The debate around AppendableData brought me back to this thread in particular.
Why can’t we implement your idea here for an infinitely expanding data structure?

What are your thoughts on file versioning? How would you handle it in the context of this thread?

11 Likes

Hi @hunterlester, nice to hear :slight_smile:

The principle could probably be used.

I am currently investigating various things related to this.

One thing is that I am looking at how to implement SSTables and LSM trees without MutableData, i.e. with something working the way AppendableData does - to the extent that I understand how that would work.

Another thing is that I am working on a BTree as well as a B+Tree for the existing infinitely expanding data structure (IED), as a replacement for the current indexing. (The current implementation is not so write-efficient, and does not allow range queries as is.)

So, there are various things to reconcile:

What I have here, basically forming the IED, is a heap based on MD.
To make this a searchable and practical storage, I need to combine it with indices.
It can then become the basis for a database, or a filesystem.
Now, the IED doesn't need to be a heap, and if MD is being replaced with AD, then it seems more complicated to build a heap (I don't know yet if it is practically doable).

Enter LSM trees.

However, LSM trees also use indexing, for which B+Trees are quite apt. And B+Trees, like these heaps, seem (as far as I can see right now) difficult to implement efficiently without MD (or something like it).
(Another complication is that the SSTables use timestamps to get them reliably merged. If we're assuming writers on more than one machine, the timestamps will not be reliable. Maybe that will have to be the application's problem.)

An AD could well be an SSTable, if we disregard the timestamp issue.
And so an AD in the role of current state could store a map of the current SSTables (ADs). After every merge of the tables, producing ever bigger tables, we append a new map of the current tables to that AD.
So far so good. But there’s still the question of how to construct an index efficiently with AD and ImD as building blocks.

But I am still investigating, only started.

As to my thoughts so far about AD, I was thinking about how I would want it to work so as to be most useful, within the bounds of how I understand it to have been proposed.
I am looking at it as both a simple value with an opaque history, as well as a stream. The API I propose to be useful would be something like:
    ad.AppendInSitu(value)                     // always appends, value directly stored in AD
    ad.AppendToImd(value)                      // always appends, pointer to value stored in AD, value in ImD
    ad.TryAppendInSitu(value, expectedVersion) // optimistic concurrency
    ad.TryAppendImd(value, expectedVersion)    // optimistic concurrency
    ad.GetValue() => (value, version)          // i.e. current version
    ad.GetValueAt(version) => value
    ad.GetAll() => array of values             // where index would represent version
    ad.ReadForwardFrom(version) => array of values   // from [version] to current version
    ad.ReadBackwardsFrom(version) => array of values // from [version] to version 0
    ad.GetRange(version_a, version_b) => array of values // order of a and b determines order of returned array; [a b], a != b, a,b > 0
Possibly even subscribe functionality, so that a new version is pushed. Polling would need a "hard poll" so as not to hit a cache.
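
To make the shape of this concrete, here is that proposal expressed as a C# interface - a sketch only, where all type names (and the event-based subscribe) are my own placeholders, not an actual SAFE API:

    using System;

    // Sketch of the proposed AppendableData API. Names are illustrative.
    public interface IAppendableData
    {
        void AppendInSitu(byte[] value);                           // always appends; value stored directly in the AD
        void AppendToImd(byte[] value);                            // always appends; AD stores a pointer, value goes to ImD
        bool TryAppendInSitu(byte[] value, ulong expectedVersion); // optimistic concurrency
        bool TryAppendImd(byte[] value, ulong expectedVersion);    // optimistic concurrency
        (byte[] Value, ulong Version) GetValue();                  // value at current version
        byte[] GetValueAt(ulong version);
        byte[][] GetAll();                                         // array index == version
        byte[][] ReadForwardFrom(ulong version);                   // [version] .. current
        byte[][] ReadBackwardsFrom(ulong version);                 // [version] .. 0
        byte[][] GetRange(ulong a, ulong b);                       // order of a and b determines order of result
        event Action<byte[], ulong> NewVersion;                    // the possible subscribe functionality
    }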

About how the AD could be built internally:
We would be talking about the principle of the IED as a basis for AD, i.e. at the Rust layer, and that might give some more wiggle room - if the vaults can be allowed to have mutable storage just for internal use, and implement B+Trees for indexing.
MD right now is handled by a close group. Would an AD be as well?
It would be more scalable if an AD was internally built of fixed-length segments that are linked. If we reuse the MDs, remove the mutation feature, and just let the last entry be a link to a new MD, then we have something that could be an AD. It wouldn't efficiently support the API I suggested above, though.
The way I make the IED infinitely expandable is by building a tree of the MDs, which can have a massive fanout: 999 slots each pointing to an MD, which in turn points to 999 more, and so on.
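
As a rough illustration of that fanout (my own sketch, not the actual IED code), a flat entry index maps to a path of slot choices, one per tree level; depth 2 already addresses 999^2 = 998,001 entries, and depth 3 nearly a billion:

    using System;
    using System.Collections.Generic;

    // Illustration only: locating an entry in a tree of MDs with fanout 999.
    static class IedFanout
    {
        const long Fanout = 999;

        // Leaf capacity of a tree with the given depth: 999^depth.
        public static long Capacity(int depth) => (long)Math.Pow(Fanout, depth);

        // The slot to follow at each level (root first) for a given entry index.
        public static IReadOnlyList<long> PathTo(long index, int depth)
        {
            var path = new long[depth];
            for (int level = depth - 1; level >= 0; level--)
            {
                path[level] = index % Fanout; // slot within this level's MD
                index /= Fanout;
            }
            return path;
        }
    }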
There's a bit of a chicken-and-egg problem if we're looking at the current IED as a basis for AD: the IED is not very practical without indexing, and without MD, maybe no indexing? And then no [manageable] IED for AD? Well, that is the nut I want to crack :slight_smile:

On to the file versioning. If we look at how the storage solution is implemented here currently, it just stores data directly in an MD entry. I have already planned to change this so that data is stored in ImD, with only pointers in the MDs.
This way, a file will always be an ImD. My first idea for file versioning would be to always include, in a new ImD (i.e. a new file version), a pointer (well, that would be the datamap) to the ImD holding the previous file version.
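
A minimal sketch of that backwards-linked chain, with names of my own invention:

    // Sketch: each new file version is a new ImD whose content embeds the
    // datamap of the previous version's ImD. Names are illustrative.
    class FileVersion
    {
        public byte[] Content { get; set; }          // the file data for this version
        public byte[] PreviousDataMap { get; set; }  // datamap of the previous version's ImD; null at version 0
    }

Walking the chain backwards from the latest version then gives the full history of the file.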

I have a lot more thinking to do here. I want to be able to say either: "Alright, scrap the MD, I will be able to build these datastore things with an AD this and this way.", or alternatively: "I will not be able to build a datastore with AD and ImD only, please do not scrap the MDs".
The VFS stuff I have designed here is based entirely on this datastore functionality, so I would want to “secure” a way to implement the datastore before saying too much about the VFS.

10 Likes

Hey @oetyng, I second @hunterlester here, really cool stuff. Would you be able to summarise what you would say are your tech requirements for the datastore?
I can guess some reqs ofc, like being able to add new versions to a certain piece of data indefinitely, and being able to do some fairly fast lookup of any version of it. What other things, in summary, would describe your reqs?

5 Likes

Hey @bochaco thanks. Absolutely, let me get back to it, I’m just about to turn off the lights.

(On the AD subject: we could emulate the MD by keeping a root AD with a map of ADs forming the entries. It won't have the transactionality that the current MD has, which was nice for example when creating a new index over existing data, as we could commit a large number of entries in a single commit. But it might have benefits with regards to handling range queries, I think: every new version of the map in the root AD can easily be sorted after a key deletion, since it needs to be updated anyway - which was not the case with MD entries when keeping (key, value) as the md-value and the collection index as the md-key.)
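
A minimal sketch of that emulation, reusing the IAppendableData interface sketched earlier (all names illustrative):

    using System.Collections.Generic;
    using System.Text.Json;

    // Sketch of emulating an MD on top of ADs: the root AD's current version
    // holds the whole sorted map from key to entry-AD address, and every
    // insert/delete appends a new map version. No cross-entry transactions.
    class AdBackedMap
    {
        readonly IAppendableData _root;
        readonly SortedDictionary<string, byte[]> _map = new SortedDictionary<string, byte[]>();

        public AdBackedMap(IAppendableData root) { _root = root; }

        public void Set(string key, byte[] entryAdAddress)
        {
            _map[key] = entryAdAddress;
            _root.AppendToImd(JsonSerializer.SerializeToUtf8Bytes(_map)); // new map version
        }

        public void Remove(string key)
        {
            _map.Remove(key);                                             // stays sorted after deletion,
            _root.AppendToImd(JsonSerializer.SerializeToUtf8Bytes(_map)); // keeping range scans cheap
        }
    }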

4 Likes

Did you ever try reactive programming? It would be nice to have this implemented as an observable.

1 Like

I have mostly thought of this in terms of what can be done, with what we have. It’s been very investigative, and continues to be. For that reason I haven’t really set out a goal for what should be accomplished.

Now, with MD going away, it was the first time I wondered whether I could even get a basic datastore together. But I'm quite sure that ADs can be used to build MDs (as I wrote in the previous post), though with slightly different properties, for better and worse. I'm just beginning to explore what implications that might have, and how to adapt to it.

As for building a very basic datastore, I currently don't think there's anything lacking - depending on what you mean by basic.

Things that would be valuable to have might add to the requirements, if we were to expect them.

Things you can expect from a regular datastore today are:

  • Concurrent writes from multiple clients.
  • Single writes to be atomic.
  • Thousands of reads and writes per second (some even millions).

I think this can be achieved, with various techniques. But that is basically what I am trying to find out.

A database called Noria (written in Rust, implemented over RocksDB) has an interesting technique for avoiding read-write locks.
(The overview starts right here in this YouTube presentation: https://youtu.be/s19G6n0UjsM?t=2000).

LSM trees write to in-memory tables before merging, and we could perhaps at some point use the same technique to avoid locks on the root ADs and the B+Tree structures holding the indices, at the cost of duplicated data for the pointer maps. Otherwise we would perhaps need some distributed locks (and I have been thinking about those as well).

All of these things might add detail to the requirements - details we may not yet be aware of, and that will emerge while developing.

One thing that isn't trivially solved is latency. Write latency for a single writer can be somewhat circumvented with the LSM trees, but as soon as we introduce more clients, their writes will be as delayed as that writer's merging (i.e. persisting the in-memory tables to the network), or alternatively as delayed as consensus can be reached in a cluster to which all clients connect (in case we set out to replicate the in-memory table, which may or may not be a way to build such a cluster).

But I think it is the right direction to focus on write efficiency when choosing algorithms, as the reads can probably be served better with caching.

I feel like I'm at the very beginning of a long-duration effort of both learning and experimenting here, so I'm unfortunately a bit vague with regards to requirements.


I'll try to get a summary together though:

  • Indefinitely add new versions to a piece of data (this is what emulates mutability of that piece of data, when we use a GetValue() API that returns the value as of the latest version).
  • O(1) on the above (both an append, as well as a read of latest value).
  • Fairly fast lookup of previous versions.
  • Writing multiple values in a single commit to a structure (for an AD, that would be like adding a range of values in one request, where each value is appended as a new version).

That way we can do something like this:

You load the address of your datastore root AD (acting as a ValueAD, i.e. we care only about the value at the current version) from your app container.

This AD contains a pointer to an ImD, which has a map of AD addresses (each acting as a StreamAD, i.e. an AD where we care about all the values/versions). Such a StreamAD can for example be a registry of what "types" your datastore contains. So you have an "Appointment" type, or a "TemperatureMeasurement" type, in your application; each instance of it translates into a row, which is stored as an ImD whose datamap is stored as an entry in (possibly) various StreamADs, thereby forming indices.

An entry in this example Registry table (i.e. a StreamAD) is a pointer to a ValueAD. So the entry representing the "Appointment" type points to the ValueAD for the Appointment type. The current value of that ValueAD points to an ImD which holds the following:

  1. The primary index ValueAD address; this ValueAD's latest version points to the current primary StreamAD index structure.
  2. … n secondary index ValueAD addresses; these ValueADs' latest versions point to their respective current StreamAD index structures.

This is loaded once at start.
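
A sketch of what that per-type descriptor (the contents of the ImD) could hold - my naming, not code from the repo:

    using System.Collections.Generic;

    // Sketch of the per-type descriptor stored in an ImD; the type's ValueAD
    // always points at the latest descriptor version. Names are illustrative.
    class TypeDescriptor
    {
        public byte[] PrimaryIndexValueAd { get; set; }           // latest version -> current primary StreamAD index
        public List<byte[]> SecondaryIndexValueAds { get; set; }  // one ValueAD address per secondary index
    }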

Whenever a new index is created, the address of the new ValueAD is added, this is serialized and put into a new ImD instance, and the ValueAD for the Appointment type is appended with a pointer to this new ImD.

You only append to the map of StreamAD index structures (itself a StreamAD), not to the indices themselves, as you want them to stay sorted. So what you do is keep an in-memory table that you insert to (and query). Once a certain number of entries has been reached, or enough time has passed, you merge-sort this in-memory table with a persisted table (a StreamAD) of the same size, and keep merging while any smaller or same-size persisted ones exist. Then you send the sorted result to a new instance of an AD (so this is the writing-multiple-values-in-a-single-commit feature), and append a new value to the ValueAD that points to this new StreamAD (the new table, containing the sorted entries of the previous tables and the in-memory table) - or add a new entry to the map if there was no merge (so, a flush only). This gives log(n) index StreamADs.
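
A rough sketch of that flush-and-merge procedure, with all network operations stubbed out, every name my own invention, and newest-wins deduplication omitted for brevity:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Sketch of the LSM-style flush: merge-sort the in-memory table with any
    // persisted StreamAD tables of the same or smaller size, write the result
    // to a new StreamAD in one multi-value commit, then append the new table
    // map as a new version of the ValueAD.
    class LsmFlusher
    {
        // Persisted tables, each recorded as (entry count, StreamAD address).
        readonly List<(int Size, byte[] Address)> _tables = new List<(int, byte[])>();

        public void Flush(SortedDictionary<string, byte[]> memTable)
        {
            var merged = memTable.ToList(); // already sorted by key
            while (_tables.Count > 0 && _tables.Min(t => t.Size) <= merged.Count)
            {
                var victim = _tables.OrderBy(t => t.Size).First();
                _tables.Remove(victim);
                merged = MergeSorted(merged, ReadStreamAd(victim.Address));
            }
            var address = WriteNewStreamAd(merged); // the single multi-value commit
            _tables.Add((merged.Count, address));
            AppendTableMapToValueAd(_tables);       // new version of the table map
        }

        static List<KeyValuePair<string, byte[]>> MergeSorted(
            List<KeyValuePair<string, byte[]>> a, List<KeyValuePair<string, byte[]>> b)
        {
            var result = new List<KeyValuePair<string, byte[]>>(a.Count + b.Count);
            int i = 0, j = 0;
            while (i < a.Count && j < b.Count)
                result.Add(string.CompareOrdinal(a[i].Key, b[j].Key) <= 0 ? a[i++] : b[j++]);
            result.AddRange(a.Skip(i));
            result.AddRange(b.Skip(j));
            return result;
        }

        // Stubs standing in for the network operations.
        static List<KeyValuePair<string, byte[]>> ReadStreamAd(byte[] address) => new List<KeyValuePair<string, byte[]>>();
        static byte[] WriteNewStreamAd(List<KeyValuePair<string, byte[]>> entries) => Guid.NewGuid().ToByteArray();
        static void AppendTableMapToValueAd(List<(int Size, byte[] Address)> tables) { }
    }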

The indices are tree structures made up of ValueADs pointing to StreamADs, each entry in the StreamADs pointing to a new sub-tree of ValueADs and StreamADs.
Basically, a ValueAD+StreamAD is like an MD entry pointing to another MD.
The leaf level of these trees points to an ImD with the actual data (in the case of the primary index; otherwise to the ValueAD of the primary index, so that only one place needs updating when the underlying data is modified), as well as (in the case of B+Trees) to the previous and next ValueAD at leaf level (remember, the entries are sorted), so that range queries can be made by first doing a point query and then scanning from there until the end of the range is hit. But there will be multiple indices to hit - actually log(n) of them, each taking log(n) to query. Using a Bloom filter will improve the query performance, which would otherwise go to (log(n))^2.
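
A sketch of that lookup path, with a Bloom filter as the cheap guard in front of each table (the interfaces are illustrative placeholders):

    using System.Collections.Generic;

    // Sketch: query a key across the log(n) index tables, consulting a Bloom
    // filter per table to skip tables that definitely do not hold the key.
    interface IBloomFilter { bool MightContain(string key); }

    interface IIndexTable
    {
        IBloomFilter Filter { get; }
        byte[] PointQuery(string key); // O(log n) descent through ValueAD+StreamAD levels
    }

    static class IndexLookup
    {
        public static byte[] Find(IEnumerable<IIndexTable> tables, string key)
        {
            foreach (var table in tables)           // log(n) tables, newest first
                if (table.Filter.MightContain(key)) // skip definite misses cheaply
                {
                    var value = table.PointQuery(key);
                    if (value != null) return value; // newest table wins
                }
            return null;                             // key not present in any table
        }
    }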

That's the first idea so far of how to build a datastore using B+Trees and LSM trees based on AD. It needs a lot more work, obviously :slight_smile: So I'm currently not sure what more would be needed to achieve the above. I'll post this for now as a start.

7 Likes

Hey, I haven’t used observables much. It is more about bringing changes in underlying data to the UI, and I haven’t worked much with UI. But I’ve written a lot of event sourced code, which would be the backend part, the foundation for these streams, over which observables could be implemented.

3 Likes

Now, after having discerned more here above, I can see that file versioning is very simple:
The file name (the full path) will have its fixed ValueAD in the primary index, and each version of it points to an ImD. You have all the versions in that AD.
So, that would be my second idea :slight_smile:
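
In code, reading versions out of such a file AD could be as simple as this (a sketch against the IAppendableData interface sketched earlier; names are mine):

    // Sketch: the full path maps to one AD, every append is a new file
    // version, and each value is the datamap of that version's ImD.
    static class FileVersions
    {
        public static byte[][] AllVersions(IAppendableData fileAd) =>
            fileAd.GetAll();            // index in the array == version

        public static byte[] VersionAt(IAppendableData fileAd, ulong version) =>
            fileAd.GetValueAt(version); // datamap of that version's ImD
    }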

There’s no real distinction between ValueAD and StreamAD, it’s just what they represent to us, i.e. how we use them. It’s just AD all the same.

2 Likes

Update 2019-02-23

The filesystem code was separated from SAFE.DataAccess (which also contained the code that is now in the SAFE.DataStore repo), and the filesystem code can now be found in a new repository: https://github.com/oetyng/SAFE.Filesystem.
Some bug fixes have been made, some more tests are passing, and there has been some refactoring and cleanup of the code.

I have installed Dokan and implemented the IDokanOperations interface (from DokanNet) in a SAFEDrive class.

This means that when you run the application, a drive is mounted and can be found in Windows File Explorer.

Right now it supports creating directories, opening them, and creating new directories within them.

And we'll see a notify icon in the tray.

To use this you need to install Dokan; instructions can be found here:

But the SAFE.Filesystem code is not really mature enough for testing at that level yet. So I'm just putting it there for your information.

How Dokan works

The Dokan library contains a user mode DLL (dokan1.dll) and a kernel mode file system driver (dokan1.sys). Once the Dokan file system driver is installed, you can create file systems which can be seen as normal file systems in Windows. An application that creates file systems using the Dokan library is called a file system application.

File operation requests from user programs (e.g., CreateFile, ReadFile, WriteFile, …) will be sent to the Windows I/O subsystem (runs in kernel mode) which will subsequently forward the requests to the Dokan file system driver (dokan1.sys). By using functions provided by the Dokan user mode library (dokan1.dll), file system applications are able to register callback functions to the file system driver. The file system driver will invoke these callback routines in order to respond to the requests it received. The results of the callback routines will be sent back to the user program.

For example, when Windows Explorer requests to open a directory, a CreateFile request with the Directory option will be sent to the Dokan file system driver, and the driver will invoke the CreateFile callback provided by the file system application. The results of this routine are sent back to Windows Explorer as the response to the CreateFile request. The Dokan file system driver therefore acts as a proxy between user programs and file system applications. The advantage of this approach is that it allows programmers to develop file systems in user mode, which is safe and easy to debug.

I can say that getting started with Dokan is dead simple. You just run the installer, implement the IDokanOperations interface, set it up with a few lines of code in the entry class, and then it's running! Or as they write themselves:

To make a file system, an application needs to implement IDokanOperations interface. Once implemented, you can invoke Mount function on your driver instance to mount a drive. The function blocks until the file system is unmounted.
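
For reference, a minimal sketch of those "few lines" in the entry class - assuming the SAFEDrive class from this repo (which implements IDokanOperations), and noting that the exact Mount overloads vary between DokanNet versions:

    using DokanNet;

    class Program
    {
        static void Main()
        {
            IDokanOperations drive = new SAFEDrive(); // this project's implementation
            drive.Mount("S:\\");                      // blocks until the drive is unmounted
        }
    }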

The devil’s in the details though.

It's been some struggle to understand how exactly the methods on the IDokanOperations interface are supposed to work. It's not running very smoothly, and there are a lot of calls back and forth to the various methods; you have no chance of knowing what is going on and why, or what you are supposed to do.

For example, CreateFile is called 8 times at startup, then 10 times when you enter the drive, and then 27 (!!) times for creating one folder?? It just doesn't make sense. What is the purpose of all that, you wonder.

So the struggle continues :joy: I really hope that I have just not understood it properly, and that it’s good stuff at the core. I’ll have to trawl the community forum and start asking some questions about this.

We have come this far at least! :slight_smile:

For this to work on something other than Windows, I will need to take another approach. But I've just wanted to test and see how well SAFE.DataStore would work as the basis for a filesystem.

20 Likes

Astounding update! This is going to be great!

2 Likes

Is there anything that the technically challenged can do to help at this stage @oetyng or is this still a conversation between you and the MaidSafe devs?

3 Likes

Nothing yet unfortunately, thanks for asking! Right now there's not anything for the MaidSafe devs to do here either. It's a battle between me and Dokan now :slight_smile: I need to look at more examples and dig deeper. I am going to implement the interface again based on another example, and see if I can get it straight.

As soon as it is possible to create files and write to them, I’ll holler and provide some instructions for how this can be tested :slight_smile:

Thanks! I was very happy myself when I saw the drive in the explorer :slight_smile:

9 Likes

This is some impressive work here, I’ve been following your ideas from your first “infinite file structure” thread and am definitely excited to see where you can take it.

7 Likes

I think it'll be very helpful for users to store and retrieve data on the SAFE Network. Also, I see you are using .NET Core 3.0, so basically a cross-platform CLI app can be built and used on different desktop platforms. Maybe different code can be used for mounting the drive based on the OS.

4 Likes

Yes, I think this can be very good.
As far as I can see, Dokan will only work on Windows, and for macOS and Linux I would need to use FUSE. Now, Dokan has a wrapper for FUSE. What this means in practice I'm not 100% sure of yet, but I've assumed it means that if you have code already built on FUSE, you can use the Dokan wrapper to also have it running on Windows. So that's what I meant when I said I will need another approach to run on macOS and Linux: I have code based on Dokan, and there's no FUSE wrapper for Dokan that I know of.

So the best thing would have been to focus on FUSE from the start. But I want to be productive and try things out, so I'm focusing on Windows now (while evolving a general idea for the OS-independent design) and leaving the rest for later :slight_smile:

I have found a much better example to work from, btw, and I can now confirm that my issues from the previous post were due to not getting the earlier implementation right. So I've got a very good foundation now for the next step.

I've started to explore options for performance and reliability. This means some new, interesting approaches to data storage. I will try to make some more progress before going into more detail here, though.

3 Likes

I thought I'd share some feedback which might help, although this thread has been a bit inactive lately. Nice proof of concept, by the way.

The approach of s3backer is sort of similar, but narrower in scope (and filesystem-agnostic). It uses FUSE to create a virtual file of any size you want. The file contents are actually stored in (configurably sized) chunks in an Amazon S3 bucket. That's all it does / needs to do. I'm sure a "safebacker" tool could work the same way.

Linux (and other operating systems) can use a file as a block device, formatting any filesystem on top: xfs, reiserfs, ext3, ntfs, fat, etc. This approach could be worth looking into. I think it would be a lot less work (no need to "reinvent the wheel"/filesystem). There are a lot of mature filesystems out there with experienced and active development teams - work that can be leveraged. I'm not as familiar with Windows, but I'm sure it's possible there too - for example, VeraCrypt lets you create/mount a virtual encrypted disk from a file.

One handy thing about having flexibility in which file system is used is that it satisfies the requirements of users on all sorts of platforms. For example, Windows users who want to back up their files (including the ACL file permissions) can do so to their virtual NTFS drive. Similarly, Linux users who have file names and permissions that NTFS doesn't support aren't stuck having to work around quirks and lose metadata.

4 Likes

Hi @bytes, thanks for the feedback.

This POC was just a first small reconnaissance.
I have since begun a new one with a better starting point.
So this topic is not the current head.

The current project is here: Release: SAFE.NetworkDrive on Windows v.0.1.0-alpha.1

It is now totally different. As it is event sourced, with the actual file system in memory, the data stored to the network is becoming platform-agnostic as well (that is what is being worked on now).

So the filesystems are only projections, and this means you can build on the same stream (same drive) from both Windows and Linux. What compromises, if any, that will lead to in the mapping remains to be seen.

This is a distinct approach from the one you suggest. Check it out.

I started with Windows since that is what I am fluent with, and I have thus made fast progress on the concepts. Linux and macOS will take more time.

10 Likes