Do We Need New File Types For SAFE?

Continuing the discussion from Self encryption and de-de-duplication :blush:

The conversation below made me wonder if we need new file types for SAFE (e.g. .safedoc, .safesheet, .safeshow, .safevid, etc.). These file types could be used to view and edit files through a SAFE browser setup, and/or be set up to convert easily back and forth between traditional office documents and these documents, which would allow the virtual filesystem to create a new log entry whenever you save, based on the differences between the previous file and the one you are now creating.
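For concreteness, here is a very rough sketch of what a save in such a format might do - the file layout, names and function are made up purely for illustration, nothing like this exists yet:

```python
import difflib
import json
import time


def save_with_log(path, new_text, previous_text):
    """Hypothetical save for a '.safedoc'-style file: write the new content
    and append a log entry describing how it differs from the previous version."""
    diff = difflib.unified_diff(
        previous_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="previous", tofile="current",
    )
    log_entry = {"timestamp": time.time(), "diff": "".join(diff)}
    with open(path, "w") as f:
        f.write(new_text)
    with open(path + ".log", "a") as f:   # append-only change log alongside the document
        f.write(json.dumps(log_entry) + "\n")
```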

Does anyone have any idea how feasible such a project would be, and whether it would solve the problem that @jreighley was pointing out?

3 Likes

Doesn’t .git store all of the logs? Why not store the changes in .git, instead of actual documents?

1 Like

I’ve always wondered how simple text editing would work. Like, how could I open a file with vim and save it without downloading a full local copy?

My thought is that I would have to download the full copy and then re-upload it to the network. At that point the network would take care of versioning, etc.
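Something like this naive flow, where `safe.get_file` / `safe.put_file` are made-up placeholders rather than a real API:

```python
def edit_file(safe, remote_path, edit_fn):
    """Naive read-modify-write: fetch the whole file, edit it locally,
    then upload the whole thing again; the network keeps the versions."""
    data = safe.get_file(remote_path)      # full download
    new_data = edit_fn(data)               # e.g. whatever vim does locally
    safe.put_file(remote_path, new_data)   # full re-upload -> new PUT(s)
```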

1 Like

The network can take care of versioning, but it does so in a way that doesn’t save space.
If you have a 35 KB text file and change 1 byte, it still has to PUT that somewhere.
If you save a file (or update your blog post to fix the typos) five times, and do that day in, day out, it may be more cost-effective to use a different approach for hosting your site (or files).
I think the network is meant for read-mostly operations. It may be possible to make it work for other purposes, but it’s probably not a good idea.
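Rough back-of-the-envelope numbers for that write amplification (purely illustrative):

```python
FILE_SIZE_KB = 35      # small text file
SAVES_PER_DAY = 5
DAYS = 365

# Every save stores a fresh copy, even for a 1-byte change.
stored_kb = FILE_SIZE_KB * SAVES_PER_DAY * DAYS
print(f"{stored_kb / 1024:.1f} MB stored for a {FILE_SIZE_KB} KB file")
# -> roughly 62.4 MB stored over a year, versus 35 KB of live data
```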

This isn’t necessarily the case (again, it’s been discussed on the forum).

The reason is that the example (vim editing) assumes you’re saving the file to a virtual drive, so what happens depends on two things: the implementation of the virtual drive (e.g. SAFEDrive), and the implementation of the SAFE NFS API that the virtual drive uses.

Now the first version might be simple, but even that is likely to hold changes locally until you close the file, rather than writing on every save, for example. I’m speculating, but that seems a sensible implementation.

Someone thinking about these issues might come up with a virtual drive that is cleverer than this - varying behaviour depending on file size, the number and nature of changes, etc.

So as usual, we don’t know the details yet; it will depend on the particular virtual drive and NFS API implementations, and it can be improved in time once we see how things work with the network, and what the user needs and performance parameters are.
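As a purely speculative sketch of that “hold changes until close” behaviour - the `network.put` call is a made-up placeholder, not the actual SAFEDrive or NFS API:

```python
class WriteBackFile:
    """Buffers writes in memory; nothing hits the network until close()."""

    def __init__(self, network, remote_path, initial=b""):
        self.network = network
        self.remote_path = remote_path
        self.buffer = bytearray(initial)
        self.dirty = False

    def write(self, offset, data):
        end = offset + len(data)
        if end > len(self.buffer):
            self.buffer.extend(b"\x00" * (end - len(self.buffer)))
        self.buffer[offset:end] = data
        self.dirty = True                  # local change only, no PUT yet

    def close(self):
        if self.dirty:
            self.network.put(self.remote_path, bytes(self.buffer))  # one PUT per edit session
            self.dirty = False
```

Whether the real layers behave like this is exactly the open question - the point is only that the PUT-per-save cost sits in these layers, not in the network itself.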

1 Like

So are you saying that (in this example) the SAFE NFS API should be integrated into the behavior of the editor itself? (.vimrc)

This could easily be achieved by making use of SAFE-specific file types.

No. I’m saying that saving a file in the editor writes to a virtual drive - the default of which will be SAFEDrive (but anyone could write one).

Such a virtual drive will most easily be written using the SAFE NFS API, so what actually gets written to the network, and when, will depend on how each of these layers behaves.

So you think that this problem can best be solved at the virtual drive level?

Would that require or benefit from the file types discussed above? Would the virtual drive need to know the difference between a working document/spreadsheet/presentation, or would it be a one-size-fits-all approach?

Of course it can’t. If you need to make a PUT (1 PUT) to the network and you don’t, your data is not on the network. It’s that simple. His solution is to avoid writing the file…

And eventually (say, once every x hours), once you’ve modified a bunch of small files, “consolidate” those and pack the changes into the minimal number of new PUTs. If you edited just one file you’ll still spend one Safecoin on the PUT, and during this interval, if you try to access your data from elsewhere, you of course won’t see the new files.
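As a rough sketch of that consolidation idea (the `network.put` call is just a placeholder, not a real API):

```python
import time


class BatchedWriter:
    """Collects small file changes locally and flushes them to the network
    as one consolidated PUT every `interval` seconds."""

    def __init__(self, network, interval=3600):
        self.network = network
        self.interval = interval
        self.pending = {}                  # path -> latest content
        self.last_flush = time.time()

    def write(self, path, content):
        self.pending[path] = content       # only the latest version of each file matters

    def maybe_flush(self):
        if self.pending and time.time() - self.last_flush >= self.interval:
            # Pack all pending changes into a single archive-style PUT.
            self.network.put("changes-archive", self.pending)
            self.pending.clear()
            self.last_flush = time.time()
```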

So, just as I said, if you need to fix a typo on your blog, you’d better update on a schedule and in batches. And for data that needs concurrent access to recent files from multiple locations, no workaround will work well unless your clients can exchange cached content between them using some sort of shared temp store (such as Dropbox ;-)).
Which brings me back to my earlier claim that it’s more productive to focus on read-mostly workloads rather than devise these complex schemes to turn MaidSafe into something it’s not supposed to do (at least not in v1).

Related to this, not so long ago there was another nebulous idea to boot corporate systems from SAFE, without any consideration of how most systems (including file servers) require fairly frequent updates to shared data, or otherwise they can’t work in a meaningful way or risk data loss.

Which is not a solution.

So I agree, begrudgingly. Let’s get the “litestuff” functionality working first.

2 Likes

Hey @smacz, are you really going to just accept @janitor’s presentation of this?!

“His solution is just to avoid writing the file”

:laughing:

Come on.

What I said was that the issue depends on how those intervening layers are implemented, and that to pretend (that’s what @janitor does all the time) that it works in a certain way, and then to come up with criticism on that basis, is flawed.

I also hinted that the kind of problem raised could be addressed - already, or in new implementations - in those layers.

I know for a fact, for example - because it has been discussed by David on this forum - that this issue has been considered and that there were plans to alleviate it within SAFEDrive or the NFS API (I don’t recall which, but think the latter).

@janitor is right to say it’s unlikely there can be a one-size-fits-all solution to this - at least not immediately. Which means it won’t be optimal, certainly not at the start. And there are various ways the different “problem” cases can and will - I have no doubt - be addressed. Some techniques have been mooted in this thread.

I can imagine using different virtual drives for applications that have particular needs - or being able to tell the drive to behave in different ways for certain folders, for example.

A bit like how we can choose different formatting for an HDD, and these have different performance characteristics. Not the same, but analogous.

It isn’t beyond imagination, well mine anyway, that an intelligent drive could detect and alter how it handles network writes on a file-by-file basis, depending on the behavior at the time. But that’s getting advanced - we need to see what the issues and needs are first.

3 Likes

And of course there is temp file usage. Like when disk space was very small and files were usually stored on tape unless they were constantly used.

The tape file was read as needed by the editor program, which only loaded the parts being worked on, with some extra to reduce the wait time for new data to be read. The changes were written to temp files on disk, and the final write was back to the tape in one of two main ways: 1) only the tape blocks that changed were written, with null-data padding allowing minimisation of the “shift up/down effect”, and the directory blocks of the tape were changed to show the new structure of the file on the tape; or 2) the whole file, starting from the first change, was written to the tape, and the directory updated to reflect this.

History has given us a number of ways to resolve this issue. I think that encrypted temp files on local disk may be an option. The client parameters could contain the temp directory options: 1) store on disk, 2) use immutable data (wasteful on PUTs), 3) use SD data blocks, 4) memory, 5) some other scheme.
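Those client parameters might look something like this - a hypothetical sketch only, not an existing option set:

```python
from enum import Enum, auto


class TempStorage(Enum):
    LOCAL_DISK = auto()       # encrypted temp files on local disk
    IMMUTABLE_DATA = auto()   # wasteful on PUTs
    SD_BLOCKS = auto()        # structured data blocks
    MEMORY = auto()           # RAM only


# Illustrative client parameter block:
client_config = {
    "temp_storage": TempStorage.LOCAL_DISK,
    "temp_dir": "/var/tmp/safe-client",   # made-up path
    "encrypt_temp_files": True,
}
```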

Text files are usually small, and a few MB is large; DOC and OSF files are larger, but even there a few MB is usually considered large. Looking at my text and DOC/OSF files, it is unusual to find one of more than 1 MB.

User-edited files of more than 3 MB are not the usual case at this time. Images and other more voluminous file formats may cause more issues, and changes to them are also more likely to involve most of the file.

TL;DR
It is unlikely that any special processing other than using temp files would be needed for text, doc, or osf files, since they will rarely be more than 3 chunks.

It may be a lost cause to try to introduce special processing for images and other voluminous file types, since changes to them involve a large percentage of the file.

Temp files will be the key to editing, and most editors already use them. The focus may be better spent on special processing to allow the client to have the temp directory/files stored according to the user’s wishes and level of security (of data contents and against data loss).

A better way may be to use Structured Data to hold an audit log of unsaved changes rather than saving locally. We can probably learn from modern disk formats.

1 Like

This is option 3) from @neo.

1 Like

Correct.

There are going to be noticeable differences between the SAFE network and one’s own local disk. The major difference is that a local disk is R/W on data blocks, while SAFE is write-once, read-many for the immutable chunks.

The consequence of this difference is that if we wish to treat it as an ordinary disk, there is the expense of additional cost when doing read-modify-write operations on parts of a file: cost as in PUT cost and total chunk storage used. Such is the penalty for security & anonymity (security of storage and of usage).

Otherwise the user can operate as if it were the same, and only notice the extra PUT costs and delays compared to a local disk, which is paid for once, in bulk, as a whole disk of R/W storage blocks.

Now we know that many programs use temp file storage as you edit/modify a file, and some use RAM for the same thing when the file is small enough. So the trick is to take advantage of this to reduce immutable-storage PUTs during the editing process and only do them on user-requested saves; thus the user is in control of when immutable PUTs are done. Some options were given above; most are really just directing where the temp file directory is located. Using SD data would require some sort of special storage mount that makes SD data look like a disk device.
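A rough sketch of that flow - `network.put_chunks` and the size threshold are made up for illustration:

```python
import io
import tempfile

RAM_LIMIT = 1 * 1024 * 1024   # files under ~1 MB edited in memory (arbitrary)


def open_scratch(initial: bytes):
    """Small files are edited in RAM; larger ones in a local temp file."""
    if len(initial) <= RAM_LIMIT:
        return io.BytesIO(initial)
    buf = tempfile.TemporaryFile()
    buf.write(initial)
    buf.seek(0)
    return buf


def save(network, remote_path, scratch):
    """Only an explicit, user-requested save turns the scratch copy into immutable PUTs."""
    scratch.seek(0)
    network.put_chunks(remote_path, scratch.read())
```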

2 Likes

The more I think about it, the more I think that using a combination of /tmp and StructuredData would be enough, for posts and the like.

Anything more immediate than that (e.g. IRC, IM/chat, etc.) would have to be on a different P2P level - think pair programming here. RAM usage would necessarily be higher for these kinds of apps.

I suspect the NFS layer could manage this. At least the PUT part, anyway.

The first upload could be an immutable type, and subsequent changes could be saved as deltas to a structured type. When retrieving the file, it would be reconstructed on the fly.

Periodically, you would probably want to re-PUT a new immutable type base for performance reasons. This could be on demand or when the structured type is fully consumed.
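A rough sketch of that base-plus-deltas idea - `put_immutable`, `get_immutable` and `update_structured` are made-up placeholders, not the real API:

```python
class DeltaFile:
    """First upload is an immutable base chunk; later saves append small patches
    to a structured entry; reads rebuild the file on the fly; compact() re-PUTs
    a fresh immutable base for performance."""

    MAX_DELTAS = 20                                           # arbitrary threshold

    def __init__(self, network, name, content: bytes):
        self.network = network
        self.name = name
        self.base_addr = network.put_immutable(content)       # one PUT for the base
        self.deltas = []                                       # [(offset, data)] in an SD entry

    def save_patch(self, offset: int, data: bytes):
        self.deltas.append((offset, data))
        self.network.update_structured(self.name, self.base_addr, self.deltas)
        if len(self.deltas) >= self.MAX_DELTAS:               # structured entry "consumed"
            self.compact()

    def read(self) -> bytes:
        buf = bytearray(self.network.get_immutable(self.base_addr))
        for offset, data in self.deltas:                      # replay patches over the base
            end = offset + len(data)
            if end > len(buf):
                buf.extend(b"\x00" * (end - len(buf)))
            buf[offset:end] = data
        return bytes(buf)

    def compact(self):
        """Re-PUT the reconstructed file as a new immutable base and reset the chain."""
        self.base_addr = self.network.put_immutable(self.read())
        self.deltas = []
        self.network.update_structured(self.name, self.base_addr, self.deltas)
```

The threshold and the patch format are just one way to do it; the trade-off is extra GETs on read versus fewer immutable PUTs on write.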

2 Likes

What are you trying to solve with this?
You’re proposing to make two changes (pointer and data) for each document change. It’s cheaper to PUT a changed 456 KB presentation into a new chunk than into two new chunks.
Re “on the fly”: you’re GETting 2 or more chunks instead of just one chunk (for docs smaller than the chunk size).
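Putting rough numbers on that comparison (using the figures from this post, purely illustrative):

```python
# Whole-file approach for a doc smaller than one chunk (e.g. 456 KB):
whole_puts_per_save, whole_gets_per_read = 1, 1

# Base + delta approach: a save touches the data and the pointer/structured
# entry, and a read fetches the base chunk plus the delta record.
delta_puts_per_save, delta_gets_per_read = 2, 2

print("PUTs per save:", whole_puts_per_save, "vs", delta_puts_per_save)
print("GETs per read:", whole_gets_per_read, "vs", delta_gets_per_read)
```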