Syncer: a caching FUSE based filesystem in Rust

happybeing · June 27, 2020, 11:46am

This could mean terabytes of SAFE files available on your local PC as fast as if they were stored on your local drive.

Syncer is a FUSE based caching filesystem built in Rust using ssh/rsync on a storage server. Designed to provide locally cached access to terabytes of data stored remotely, where you only have a few GB locally, fast as a local disk but more CPU intensive.

Syncer looks MAIDforSAFE. Could be a better SAFE Drive, get in touch if you fancy collaborating or taking this on. I’m not up to speed on Rust but could help out if anyone would like to pick this up.

dirvine · June 27, 2020, 1:04pm

This is as we imagined way back in FUSE days. It could be very very good for us.

happybeing · June 27, 2020, 1:29pm

I’ve reached out to the author. No changes for 18 months except for updating the github build scripts in March this year, so I’m hopeful.

It builds and my immediate impression is good. I note that his brief design is clear - uses hashed chunks which are deduplicated and pushed out to read-only slaves. Have a quick read of DESIGN.md.

The build generates a syncer.d file (Cargo’s dependency list for the binary rather than a daemon) which contains:

/home/mrh/src/fuse/syncer/target/debug/syncer: /home/mrh/src/fuse/syncer/src/backingstore/blobstorage.rs /home/mrh/src/fuse/syncer/src/backingstore/metadatadb.rs /home/mrh/src/fuse/syncer/src/backingstore/mod.rs /home/mrh/src/fuse/syncer/src/backingstore/rsync.rs /home/mrh/src/fuse/syncer/src/config.rs /home/mrh/src/fuse/syncer/src/filesystem/entry.rs /home/mrh/src/fuse/syncer/src/filesystem/mod.rs /home/mrh/src/fuse/syncer/src/filesystem/vclock.rs /home/mrh/src/fuse/syncer/src/lib.rs /home/mrh/src/fuse/syncer/src/main.rs /home/mrh/src/fuse/syncer/src/rwhashes.rs /home/mrh/src/fuse/syncer/src/settings.rs

So it may be interesting to try and include something in ./backingstore for SAFE.

Getting a bit out of my depth now, but that’s where it looks interesting so far.

danda · June 27, 2020, 4:54pm

Thanks for the link. I will check it out.

I have also been brainstorming about integrating with OS filesystem calls, and looked at a couple other rust fuse projects already.

One issue at present is that the FileContainer code/api is oriented around a flat list of files, not a tree structure the way that a filesystem is. This results in some inefficiencies, and extra filtering steps. And we don’t provide an API like a normal filesystem, eg open_dir, read_dir, mkdir, etc, etc.

A couple weeks back I made a quick prototype (in an interpreted language) of a tree based in-mem API that can read from and write out to a flat FileContainer. The serialization is a separate layer, so I also made a serialization as a json tree.
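To make the flat list versus tree idea concrete, here is a minimal Rust sketch (not danda's prototype, which was in another language). The `Node` type, the function names and the convention that an empty link marks an empty directory are all assumptions made for illustration, and the `safe://` strings in the usage note are placeholders.

```rust
use std::collections::BTreeMap;

/// A node in the in-memory tree: a file (holding an opaque link string)
/// or a directory of named children. Hypothetical type for this sketch.
#[derive(Debug, PartialEq)]
enum Node {
    File(String),
    Dir(BTreeMap<String, Node>),
}

/// Build a tree from a flat list of (path, link) entries.
/// Assumption: an entry with an empty link is an empty directory.
fn tree_from_flat(entries: &[(&str, &str)]) -> Node {
    let mut root = BTreeMap::new();
    for (path, link) in entries {
        let parts: Vec<&str> = path.split('/').filter(|p| !p.is_empty()).collect();
        insert_path(&mut root, &parts, link);
    }
    Node::Dir(root)
}

fn insert_path(dir: &mut BTreeMap<String, Node>, parts: &[&str], link: &str) {
    match parts {
        [] => {}
        [name] => {
            let node = if link.is_empty() {
                Node::Dir(BTreeMap::new())
            } else {
                Node::File(link.to_string())
            };
            // Keep an existing entry rather than overwrite it.
            dir.entry(name.to_string()).or_insert(node);
        }
        [name, rest @ ..] => {
            let child = dir
                .entry(name.to_string())
                .or_insert_with(|| Node::Dir(BTreeMap::new()));
            // A file in the middle of a path is a conflict; it is ignored here.
            if let Node::Dir(children) = child {
                insert_path(children, rest, link);
            }
        }
    }
}

/// List the names in the directory at `path`, like a minimal read_dir.
fn read_dir(root: &Node, path: &str) -> Option<Vec<String>> {
    let mut node = root;
    for part in path.split('/').filter(|p| !p.is_empty()) {
        match node {
            Node::Dir(children) => node = children.get(part)?,
            Node::File(_) => return None,
        }
    }
    match node {
        Node::Dir(children) => Some(children.keys().cloned().collect()),
        Node::File(_) => None,
    }
}

/// Serialise the tree back to a flat, sorted list of (path, link) entries.
fn flatten(node: &Node, prefix: &str, out: &mut Vec<(String, String)>) {
    if let Node::Dir(children) = node {
        for (name, child) in children {
            let path = format!("{}/{}", prefix, name);
            match child {
                Node::File(link) => out.push((path, link.clone())),
                Node::Dir(sub) if sub.is_empty() => out.push((path, String::new())),
                Node::Dir(_) => flatten(child, &path, out),
            }
        }
    }
}
```

For example, building a tree from `[("/docs/a.txt", "safe://a"), ("/empty", "")]` and calling `read_dir(&tree, "/")` lists `docs` and `empty`, while `flatten(&tree, "", &mut out)` gives back the flat entries in sorted order.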

After I made that little prototype, I was looking at an in-memory FUSE library in Rust, and thinking about how to integrate it.

It is definitely an area I would like to explore more, perhaps after finishing up the glob() support.

KafkaLee · June 27, 2020, 6:03pm

How is this Syncer maid for SAFE? MaidSafe already has a good caching mechanism, no?

happybeing · June 27, 2020, 6:31pm

@danda I’ve done a review of the code and here are my notes:

Syncer Operation

Syncer creates an on disk filesystem in a local directory which is initialised using:

   syncer init local_dir remote_source max_local_size_MB

The local directory holds:

./config - a configuration file
./data/blobs/ - a directory holding 1MB blobs (content addressable chunks with content hash file name). See blobstorage.rs
./data/nodes/ - a directory holding filesystem metadata stored as a file for each node (inode?). See blobstorage.rs
./metadata.sqlite3 - a file holding the database of metadata for the whole filesystem (directories, files, symlinks etc and their corresponding nodes and blobs)
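As a rough illustration of content addressable blobs, the sketch below splits a file into fixed size pieces and stores each piece under a name derived from its content, so identical pieces are stored once. It is not Syncer's code: the function names are made up, an in-memory map stands in for ./data/blobs/, and `DefaultHasher` is only a dependency-free stand-in for whatever content hash Syncer really uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Blob size matching the 1MB blobs described in the notes.
const BLOB_SIZE: usize = 1024 * 1024;

/// Stand-in content hash. DefaultHasher is NOT cryptographic; it is used
/// here only so the sketch needs nothing beyond the standard library.
fn blob_name(data: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Split `data` into blobs of `blob_size` bytes (must be non-zero), store
/// each under its content hash, and return the ordered blob names.
fn store_file(
    store: &mut HashMap<String, Vec<u8>>,
    data: &[u8],
    blob_size: usize,
) -> Vec<String> {
    data.chunks(blob_size)
        .map(|chunk| {
            let name = blob_name(chunk);
            // Identical content maps to the same name, so it is stored once.
            store.entry(name.clone()).or_insert_with(|| chunk.to_vec());
            name
        })
        .collect()
}

/// Reassemble a file from its ordered blob names, if all blobs are present.
fn load_file(store: &HashMap<String, Vec<u8>>, names: &[String]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    for name in names {
        out.extend_from_slice(store.get(name)?);
    }
    Some(out)
}
```

Calling `store_file(&mut store, &data, BLOB_SIZE)` returns the list of blob names that a node's metadata would need to record in order to rebuild the file later with `load_file`.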

The storage is implemented by a Backingstore which stores blobs and nodes to disk and remote, and retrieves blobs that have been deleted from the local cache from the remote when needed. Synchronisation to and from the remote is done using rsync commands.
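The fetch-on-miss behaviour can be sketched as a size-bounded cache in front of a remote. This is an illustration rather than Syncer's implementation: the `Remote` trait, `BlobCache`, the least-recently-used eviction policy and the `FakeRemote` used for demonstration are all assumptions.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical remote interface; in Syncer this role is played by rsync.
trait Remote {
    fn fetch(&mut self, name: &str) -> Option<Vec<u8>>;
}

/// Fake remote for demonstration: a blob's content is just its name.
struct FakeRemote {
    fetches: usize,
}

impl Remote for FakeRemote {
    fn fetch(&mut self, name: &str) -> Option<Vec<u8>> {
        self.fetches += 1;
        Some(name.as_bytes().to_vec())
    }
}

/// Local blob cache bounded by total bytes, evicting least recently used.
struct BlobCache<R: Remote> {
    remote: R,
    max_bytes: usize,
    used_bytes: usize,
    blobs: HashMap<String, Vec<u8>>,
    order: VecDeque<String>, // front = least recently used
}

impl<R: Remote> BlobCache<R> {
    fn new(remote: R, max_bytes: usize) -> Self {
        BlobCache {
            remote,
            max_bytes,
            used_bytes: 0,
            blobs: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    /// Return a blob, fetching it from the remote on a cache miss.
    fn get(&mut self, name: &str) -> Option<Vec<u8>> {
        if let Some(data) = self.blobs.get(name) {
            let data = data.clone();
            self.touch(name);
            return Some(data);
        }
        let data = self.remote.fetch(name)?;
        self.insert(name, data.clone());
        Some(data)
    }

    /// Mark `name` as most recently used.
    fn touch(&mut self, name: &str) {
        self.order.retain(|n| n.as_str() != name);
        self.order.push_back(name.to_string());
    }

    fn insert(&mut self, name: &str, data: Vec<u8>) {
        self.used_bytes += data.len();
        self.blobs.insert(name.to_string(), data);
        self.touch(name);
        // Evict until within budget, but always keep the newest blob.
        while self.used_bytes > self.max_bytes && self.order.len() > 1 {
            if let Some(old) = self.order.pop_front() {
                if let Some(bytes) = self.blobs.remove(&old) {
                    self.used_bytes -= bytes.len();
                }
            }
        }
    }

    fn is_cached(&self, name: &str) -> bool {
        self.blobs.contains_key(name)
    }
}
```

With `BlobCache::new(FakeRemote { fetches: 0 }, 8)`, reading the same 4 byte blob twice only touches the remote once, and reading a third distinct blob evicts the least recently used one to stay within the 8 byte budget.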

The full contents of ./data/nodes are kept both locally and on the remote server.
All blobs are stored on the remote but only a subset remain in the local cache (in ./data/blobs) and are retrieved on demand when not available in the cache.
Neither the config file nor the database of metadata are saved to the remote, but can be created by cloning the remote to a new location using:

   syncer clone local_dir remote_source max_local_size_MB

Cloning creates a new local directory, with config file and an empty metadata database just like syncer init. The content of ./data/nodes is then retrieved from the remote and used to initialise the ./metadata.sqlite3 database.

Simple SAFE Syncer

A SAFE filestore could be implemented by replacing the rsync command with the SAFE CLI, or rather making a SAFE CLI version of src/backingstore/rsync.rs. This would be incompatible with SAFE Applications wanting to access these files directly on SAFE because the filesystem structure and filenames are not being stored using native SAFE APIs. The SAFE FileContainer will hold a copy of the blob files, rather than the directories and filenames visible to applications using the FUSE mount on the local directory.
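One way to picture the swap is a small function that builds the external command for either tool. This is only a sketch: the `Backend` enum and `transfer_command` are hypothetical names, not part of Syncer, the rsync flags are the ones Syncer is described as using elsewhere in this thread, and the SAFE side assumes the `safe files sync <location> [target]` form.

```rust
use std::process::Command;

/// Which external tool performs the transfer (hypothetical type).
enum Backend {
    Rsync,
    SafeCli,
}

/// Build (but do not run) the command for one source -> target transfer.
fn transfer_command(backend: &Backend, source: &str, target: &str) -> Command {
    let mut cmd = match backend {
        Backend::Rsync => {
            let mut c = Command::new("rsync");
            c.args(&["--quiet", "--timeout=5", "--whole-file"]);
            c
        }
        Backend::SafeCli => {
            let mut c = Command::new("safe");
            c.args(&["files", "sync"]);
            c
        }
    };
    cmd.arg(source).arg(target);
    cmd
}
```

Because the command is only built here, it can be printed with `{:?}` for inspection before anything is executed with `status()`.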

Syncer with SAFE Compatible Metadata

It would be useful for data to be stored on SAFE using the SAFE file APIs and accessible locally as a cached FUSE mount in the manner of Syncer.

This would require the local cache to be created using a chunking, hashing, and datamap system which is compatible with the network version, and for Syncer to get and put individual SAFE chunks in a compatible way. Ideally using the same code, to avoid incompatibility between data saved locally to disk and data put to the network.

I’m not sure how much of Syncer could be utilised in this case. It may be better to implement the FUSE operations directly on top of SAFE compatible file APIs, and provide a mechanism for caching SAFE chunks (for datamaps and other metadata as well as file blobs), which has some features in common with Syncer but is more closely tied to the SAFE core APIs and the SAFE code itself (e.g. best to use the same code to create a chunk which is written to disk, put to SAFE with no need to store metadata locally).

It starts to get complicated though - for example ensuring data is consistent when it might be modified independently. Locally cached chunks could easily become out of step with the latest data on SAFE. This might be an argument for the Simple SAFE Syncer approach, which avoids the issue by ensuring all conflicts are resolved in the locally mounted file system.

Conclusion

Option 1: create a Simple SAFE Syncer because this is relatively easy and would be very useful for certain use cases. The downside is incompatibility and reduced deduplication if files become widely stored both using Simple SAFE Syncer and directly using the SAFE API.

Option 2: look into ways to provide a cached locally mounted file system which is compatible with the SAFE file APIs, and understand the trade-offs in terms of POSIX compliance (e.g. easier using local FS than SAFE APIs), data version conflicts, and compatibility with other SAFE Applications. For example, if other applications are only allowed to read SAFE Syncer data, the ability to use it as high performance local FUSE mount is still a big plus. Or if some way to allow syncing and automatically resolve conflicts can be achieved with fewer limitations.

I grappled with this to build SAFE Drive (a FUSE mounted SAFE filesystem written in Node.js) which is one reason I’ve pestered to have directories as first class objects. Without a way to represent empty directories I had to simulate them with an extra layer.

It wasn’t 100% POSIX but worked pretty well based on the old SAFE NFS. For example you could use the Linux command line to access your SAFE files, and rsync between your SAFE FUSE mount and local file system directories. Using rsync with SAFE Drive was the basis for safegit and this is why I think safegit can be resurrected using the new SAFE CLI safe sync (@southside).

Now we have support for empty directories there’s no need for the extra layer and handling the flat structure is not that hard.

More difficult to support are native FS features such as multiple open file handles, file modes, seek, truncate etc, and some attributes. As we no longer have a SAFE NFS layer some of this may be more difficult, although I’ve not looked into the new API to see. It was also necessary to cache certain metadata and directory listings because of how FUSE works, and to make directory listing fast enough.

I think syncer is well designed, provides one very easy option without SAFE file compatibility, but may also give inspiration for how to do something that is compatible with SAFE APIs. I’d be interested if you have thoughts on that - maybe we sync using real files while caching the inodes locally? Or maybe we go a step further and use a syncer style architecture with SAFE chunks instead of Syncer’s blobs which are quite similar (1MB hashed objects)?

I think there will be various options with different trade offs as I’ve begun to think about in the notes above.

SAFE’s caching is for nodes on the network - so data is not cached on your local disk. Syncer’s caching is on the local disk, and also provides a locally mounted filesystem for use by any application without needing the application to know about SAFE.

danda · June 27, 2020, 8:53pm

thx for this!

The caching aspect of it is the most interesting thing to me, from the notes. It would be nice to get that “for free”.

What I’ve sort of had in mind is an in-mem filesystem, possibly/probably with a unique mount for each FileContainer.

The in mem filesystem would only write to the network periodically, and/or if sync() is called. Ie, one should be able to construct an entire FileContainer locally in mem using various filesystem calls before it actually gets sent to the network.

Likewise, a read of a given path would pull down the entire FileContainer, which is then operated on locally.

All transfers take place using existing SAFE APIs. Higher level APIs such as file_container_put(), file_container_get() could be re-implemented as a thin layer above the new FS layer.

In such a model, the caching is not really necessary but can be an improvement to help prevent against data loss that could occur prior to sync().
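A minimal sketch of that write-back model, assuming a hypothetical `Store` trait in place of the real SAFE APIs: writes only touch memory and mark paths dirty, and nothing reaches the store until `sync()` is called. `MemFs` and `RecordingStore` are illustrative names only.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical network store; stands in for whichever SAFE API is used.
trait Store {
    fn put(&mut self, path: &str, data: &[u8]);
}

/// In-memory filesystem that remembers which paths changed since last sync.
struct MemFs {
    files: BTreeMap<String, Vec<u8>>,
    dirty: BTreeSet<String>,
}

impl MemFs {
    fn new() -> Self {
        MemFs { files: BTreeMap::new(), dirty: BTreeSet::new() }
    }

    /// Write locally only; the network is not touched.
    fn write(&mut self, path: &str, data: &[u8]) {
        self.files.insert(path.to_string(), data.to_vec());
        self.dirty.insert(path.to_string());
    }

    fn read(&self, path: &str) -> Option<&[u8]> {
        self.files.get(path).map(|v| v.as_slice())
    }

    /// Push every dirty file to the store and clear the dirty set.
    /// Returns the number of files uploaded.
    fn sync<S: Store>(&mut self, store: &mut S) -> usize {
        let mut uploaded = 0;
        for path in std::mem::take(&mut self.dirty) {
            if let Some(data) = self.files.get(&path) {
                store.put(&path, data);
                uploaded += 1;
            }
        }
        uploaded
    }
}

/// Test double that records which paths were uploaded.
struct RecordingStore {
    puts: Vec<String>,
}

impl Store for RecordingStore {
    fn put(&mut self, path: &str, _data: &[u8]) {
        self.puts.push(path.to_string());
    }
}
```

Writing the same path twice before `sync()` results in a single upload of the latest content, which is the saving this model is after. Anything written since the last `sync()` exists only in memory, which is the data loss window a local cache could help close.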

I think that ideally, the underlying FS library could be used natively and cross-platform by safe-cli or safe-browser, but could also be exported to other apps via eg fuse.

Those are just a few thoughts for now, probably won’t be able to dive in deeper for another week or so, assuming the team is on board with pursuing such a concept…

happybeing · June 27, 2020, 9:05pm

Your approach stimulated more thinking so just scribbling as I go…

Maybe the SAFE API could include a way to signal locking of a FilesContainer to simplify keeping things in sync. I guess it may not be necessary with versioning, but don’t know enough to think properly about that.

Then while updates are being made locally, the SAFE FC could be locked until it has been updated, and the version checked to see if it has been changed before locking for any further local changes.
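The version check half of this idea is ordinary optimistic concurrency, which can be sketched without any locking. The `Container` type below is a hypothetical in-memory stand-in, not the SAFE FilesContainer API: an update only succeeds if the caller's base version still matches the current one.

```rust
#[derive(Debug, PartialEq)]
enum UpdateError {
    /// Someone else updated the container since `expected` was read.
    VersionConflict { expected: u64, actual: u64 },
}

/// Hypothetical stand-in for a remote container with a version counter.
struct Container {
    version: u64,
    entries: Vec<String>,
}

impl Container {
    /// Replace the entries only if `expected` is still the current version.
    /// Returns the new version on success.
    fn update(&mut self, expected: u64, entries: Vec<String>) -> Result<u64, UpdateError> {
        if self.version != expected {
            return Err(UpdateError::VersionConflict {
                expected,
                actual: self.version,
            });
        }
        self.entries = entries;
        self.version += 1;
        Ok(self.version)
    }
}
```

If two clients both start from version 1, the first `update(1, ...)` succeeds and returns 2, while the second gets a `VersionConflict` and has to fetch the latest state and reapply its local changes before trying again.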

I guess there are many ways to try and skin this cat. I recommend looking at the syncer code. It looks well designed to me, and by someone who knows how to do this stuff, so fairly easy to understand even for a non-rustacean.

Southside · June 27, 2020, 9:06pm

I have been tasked with other stuff for now but my first thoughts were to see where this might lead us and to wait for some consensus before spending any more time on SAFEgit.
I am deep in project documentation right now but later tonight I will try and build from that git repo as you did earlier today.

happybeing · June 27, 2020, 9:08pm

If you have anything useful on safegit please share it and I’ll take a look. Might not be much effort to finish - I just can’t test anything unless I set up a local network which I’d rather not spend time on.

happybeing · June 27, 2020, 9:32pm

A thought in favour of caching SAFE chunks locally but in a Syncer-like system is that the content will be encrypted on the local system.

The metadata could be cleared at any time, but rebuilt from the chunks when the user logs in and mounts the FUSE drive, or on demand as the mounted Files Container is accessed. So the chunks could be safely kept on disk, but metadata held in memory (similar to your proposal) and deleted when the device shuts down.

Am I daft, or would it be quite easy to add the ability to SAFE libs to cache certain chunks locally - all chunks related to a given FilesContainer for example?

dirvine · June 28, 2020, 3:09am

Certainly this is possible and encrypted. For extra safety it would be wise to encrypt the whole local cache as well. I am still thinking about this one, seems feasible and unlikely to reduce security except perhaps knowledge of last time safe was accessed?

happybeing · June 28, 2020, 8:49am

Thanks David, I was concerned about security of SAFE activity and also thought it would be wise to at least make the cache an encrypted area. I’m glad it isn’t a significant issue.

I think as option 1 looks very straightforward I’ll try that. As well as being simple to get working, I think it would work really well as a backup system that preserves every version of every file within moments of any mutation. People could use it like an extra big drive, or copy/rsync to it to do a backup. If anyone wants to do this for themselves, by all means jump in and I’ll support you. I have plenty of other things to keep me happy and busy.

Also, if anyone wants to dig deeper into ways to support a SAFE-files-compatible syncer (option 2) I’m keen to help. Not sure I should lead given zero Rust skills and so many other pies with my finger holes in them!

If nobody else picks these up I may think again, but there’s so much I could be doing, so if you fancy having a go, be my guest and let me know if you’d like my input!

@danda quick question as I’ve forgotten: is there a limit to the size of a FilesContainer or the number of entries it can hold? I’m hoping I can just store all the syncer nodes and blobs in one container and forget about it at least for now. Probably not good in the long run, but I’d like to know if there are any hard limits.

Update - rsync features needed for option 1

The following rsync features are needed from safe files sync in order to be able to modify syncer to use SAFE CLI to implement its backend/remote storage.

All rsync commands begin:

rsync --quiet  --timeout=5 --whole-file

Variations add the following parameters in the given order…

Single file transfer:

file directory

Multiple file transfer:

file1 [... fileN] directory

example:

rsync --quiet --timeout=5 --whole-file /some/path/file1 /some/path/file2 remote:/data/blobs/

Recursive transfer:

-r --exclude=metadata* directory1 directory2

@danda can you confirm the above are supported except as noted here:

--quiet  # I don't see this, can we have it?
--timeout=5 # Not supported, could be useful?
--whole-file # Probably not applicable so no problem
-r --exclude=metadata* # Note the wildcard '*'. I think I can test without that so no hurry but I think this is a very useful feature (and can be specified multiple times).

Just to confirm, can we handle the multiple files example? This would be essential so if not is a blocker for now. I don’t think anything else is a blocker.

Nigel · June 28, 2020, 2:53pm

@anon57419684 you know Rust, correct? Not trying to volunteer you but not sure if you’ve seen this thread yet.

danda · June 28, 2020, 4:26pm

Unfortunately none of the files commands (including sync) support multiple file arguments at present. I would like to add this support because it is necessary for bash brace expansion eg photo{1,2,3}.jpg to work. But it will be a fair amount of work to implement.

You can of course call files sync once for each file.

Also, sync does not presently have an --exclude option.
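Both gaps could be worked around on the calling side. The sketch below expands an rsync style "file1 ... fileN target" transfer into one argument list per file and filters out excluded names before anything is invoked. The function names are hypothetical, `safe://example/blobs/` in the usage note is a placeholder target, and the pattern matching is far simpler than rsync's: only the first `*` in a pattern is treated as a wildcard and it is matched against the file name alone.

```rust
/// Match `name` against `pattern`, where the first `*` (if any) matches
/// any run of characters. A much reduced form of rsync's --exclude rules.
fn glob_match(pattern: &str, name: &str) -> bool {
    match pattern.split_once('*') {
        Some((prefix, suffix)) => {
            name.len() >= prefix.len() + suffix.len()
                && name.starts_with(prefix)
                && name.ends_with(suffix)
        }
        None => pattern == name,
    }
}

/// Build one `files sync <file> <target>` argument list per file,
/// skipping files whose final path component matches an exclude pattern.
fn per_file_args(files: &[&str], target: &str, excludes: &[&str]) -> Vec<Vec<String>> {
    let mut out = Vec::new();
    for &file in files {
        let name = file.rsplit('/').next().unwrap_or(file);
        if excludes.iter().any(|p| glob_match(p, name)) {
            continue;
        }
        out.push(vec![
            "files".to_string(),
            "sync".to_string(),
            file.to_string(),
            target.to_string(),
        ]);
    }
    out
}
```

Each returned list can be passed to `Command::new("safe").args(...)` in turn. For example, `per_file_args(&["/d/blobs/aa", "/d/metadata.sqlite3"], "safe://example/blobs/", &["metadata*"])` yields a single invocation, for the blob only.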

happybeing · June 28, 2020, 4:38pm

I have syncer init working with a local SAFE testnet. It seems to have initialised the SAFE container as expected, so I’m a bit confused because I’m also seeing an error from the safe CLI in the console.

As far as I can see everything has been completed ok (listing the syncer content on safe), but it looks like the ‘safe files sync’ is returning an error code to syncer, as well as issuing the error to the console.

@danda if there was a parameter error like the one below, could the safe files sync still have uploaded to the container or would it abandon before trying? And should the safe CLI return values properly for use by a script?

I’m a bit confused by the console error because the error message suggests the safe files sync is being given a safe URI with /data/blobs/ at the end which I’m not expecting. I need to find a way to see exactly what is being passed to safe CLI!

Good progress though!

syncer init ~/.syncer-test "safe://hnyynyib9mr43r7cthdodtci7kur7rgkc7ayzdkjax1j5dse451d3ft1wgbnc" 1000
FilesContainer synced up (version 1): "safe://hnyynyib9mr43r7cthdodtci7kur7rgkc7ayzdkjax1j5dse451d3ft1wgbnc?v=1"
+  /home/mrh/.syncer-test/data                                                  
+  /home/mrh/.syncer-test/data/blobs                                            
+  /home/mrh/.syncer-test/data/blobs/082ad992fb76871c33a1b9993a082952feaca5e6  safe://hbyyyynwwg1cyjt979cdykr3xz3hmichw7ekqk9hsbcqa4mhuf3x8481hh 
+  /home/mrh/.syncer-test/data/blobs/675e110cbd20023c206bea2a1788c8ab304a7a5d  safe://hbyyyynmme1ei9wisfd9fnnpiqxxtam7se97a4dab9h61mabdpuf771i4t 
+  /home/mrh/.syncer-test/data/metadata.sqlite3                                safe://hbyyyyn3ay4wu4rowxhizia4sxhpjsq319ip758maken19kbyewoo7mny9 
+  /home/mrh/.syncer-test/data/metadata.sqlite3-shm                            safe://hbyyyyn7jpqz5nmbxtq8bsmfpu8iorbxu3owwgbp89fpatpyp3a159fhpn 
+  /home/mrh/.syncer-test/data/metadata.sqlite3-wal                            safe://hbyyyyd8yj4xgo6o4ok9zhmqzjsgwxefwsgr4yjfddqd59rywcni13xdra 
+  /home/mrh/.syncer-test/data/nodes                                            
error: Found argument 'safe://hnyynyib9mr43r7cthdodtci7kur7rgkc7ayzdkjax1j5dse451d3ft1wgbnc/data/blobs/' which wasn't expected, or isn't valid in this context

USAGE:
    safe files sync [FLAGS] [OPTIONS] <location> [target]

For more information try --help

I can probably work around this but will need to write some Rust.

I can test without this but think it’s a useful feature for the todo list.

danda · June 28, 2020, 4:48pm

afaik, parameter validation is done before the operation begins. Is it possible safe-cli is being invoked twice?

I think you need to find where the safe cli command is being invoked and print or log it.

happybeing · June 28, 2020, 5:02pm

I’m trying but my lack of Rust is in the way. Can you modify this code fragment to output the command line before it is run. I’m trying to figure it out but am struggling to understand the docs around Debug, fmt etc.

  // Needs: use std::io::{Error, ErrorKind}; use std::process::Command;
  pub fn run(&self) -> Result<(), Error> {
    for _ in 0..10 {
      let mut cmd = Command::new("safe");
      cmd.arg("files");
      cmd.arg("sync");

      // cmd.arg("--quiet");
      // cmd.arg("--timeout=5");
      // --whole-file is needed instead of --append because otherwise concurrent usage while
      // doing readahead causes short blocks
      // cmd.arg("--whole-file");
      cmd.args(&self.args);
      match cmd.status() {
        Ok(v) => {
          if v.success() {
            return Ok(())
          } else {
            continue
          }
        },
        Err(_) => {},
      }
    }
    Err(Error::new(ErrorKind::Other, "safe files sync failed"))
  }

I think there must be a second safe command being attempted that I was not expecting. If I can echo the command to the console I’ll have a much better idea what is going on. Thanks.

No worries, gotit:

println!("safe files sync {:?}", &self.args);

danda · June 28, 2020, 5:21pm

This appears to be repeating the command 10 times. Is that desired?

“Can you modify this code fragment to output the command line before it is run”

You can try println!("{:?}", cmd);

see: How to get the command behind std::process::Command in Rust? - Stack Overflow

happybeing · June 28, 2020, 5:23pm

It only repeats if the command fails so I guess it’s crude error recovery! Thanks for the tip. I can see what’s happening now.
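For anyone debugging something similar, here is a sketch of the same bounded retry idea with the logging built in, so each attempt and its failure reason show up on the console. `retry` and `run_with_retry` are hypothetical helpers rather than Syncer code; the only standard library behaviour relied on is that `Command` implements `Debug` and that `status()` reports the exit status.

```rust
use std::io::{Error, ErrorKind};
use std::process::Command;

/// Run `attempt` up to `max_tries` times, stopping at the first success.
/// Returns the number of tries used.
fn retry<F>(max_tries: u32, label: &str, mut attempt: F) -> Result<u32, Error>
where
    F: FnMut() -> Result<bool, Error>,
{
    for n in 1..=max_tries {
        println!("attempt {}/{}: {}", n, max_tries, label);
        match attempt() {
            Ok(true) => return Ok(n),
            Ok(false) => eprintln!("  command exited with a failure status"),
            Err(e) => eprintln!("  could not run command: {}", e),
        }
    }
    Err(Error::new(
        ErrorKind::Other,
        format!("{} failed after {} tries", label, max_tries),
    ))
}

/// Retry an external command, logging its full command line each time.
fn run_with_retry(cmd: &mut Command, max_tries: u32) -> Result<u32, Error> {
    let label = format!("{:?}", cmd);
    retry(max_tries, &label, || cmd.status().map(|status| status.success()))
}
```

Keeping the retry loop separate from the command makes it easy to test with a closure that fails a couple of times before succeeding, without having to run the safe CLI at all.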

Not surprisingly the error is because syncer is trying to sync multiple files at once. I was just not expecting it to do that immediately after having synced recursively. I’ve missed something in the code because I didn’t see where that is happening.

Anyway, looks like in theory I can get this working and then try mounting the FS.
