What about a catastrophic event that wipes out millions of nodes?

Summary of what I’m saying:

  • Pre-chunking would increase resiliency with no drawbacks.
  • Pre-chunking will inevitably be a default in all clients regardless.
  • If there is no built-in or MaidSafe-endorsed standard for opt-out pre-chunking, it could get messy.

If anyone convinces me that the above bullet points aren’t accurate, I will edit this post to make a note that I was wrong.

Some interesting stuff:


This is a map of individual devices connected to the internet.

Now, to make things more interesting, look at this animation, which shows how these devices are connected during the day:

(Animation: Carna botnet geovideo, low resolution)
It basically shows how SAFE Network data would move around the globe as devices connect and disconnect.
It shows the data concentration on Earth over the day.
(There is a lot more at play though, considering the kinds of devices, bandwidth, etc.)

I think this latter one is from 2012, so it's a bit old now.

So the geographic distribution of cached data will also be limited by this pattern that we see, i.e. the number of available nodes over the day.
With data chains, the significance of this for older data should be low.

We can also anticipate resiliency increasing as that large area of Africa lights up over the coming decades; the same goes for South America and Asia (which will happen earlier, but as a less dramatic change).


Now that is a very cool way to visualise data movement!


@to7m
I’m still new to the forum and learning about the platform but these are my initial impressions that might address your points:

I think there are drawbacks. Besides introducing performance limits like I mentioned before, I would say that your method would reduce obfuscation and work against the goal of ensuring that the security model is quantum-computing resilient. The SAFE network is intended to stand the test of time. It could be theorized that (eventually) whatever basic encryption technique you choose will be weakened or broken once the attacker has enough computing power and the whole file, but self-encryption helps get around this.

While the minimum number of chunks for self-encryption might be 3, larger files give you much more to work with during the XOR pass, such that it becomes (dare I say) impossible to extract any meaningful data. Telling the adversary that one can link together meaningful info by picking the right set of 3 chunks at a time gives them a lot more to start with than just saying “It’s all random… good luck”. Since the choice of optimal trade-off between obscurity and redundancy is subjective, it makes a lot of sense to just let the network do this automatically based on file size. Also, redundancy can always be improved by adding more storage (which also helps obfuscation); no need to cripple your future-proof encryption scheme for more of the same… right? Pre-chunking might also negatively affect de-duplication.
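To make the codependence point concrete, here is a toy sketch (emphatically not the real self_encryption algorithm, just an illustration of the idea) of why chunks in one group depend on each other: each chunk's obfuscation pad is derived from the hashes of its neighbours, so you need the whole group to undo it.

```python
# Toy illustration of why self-encrypted chunks are codependent.
# This is NOT the real self_encryption algorithm; it only mimics the idea
# that each chunk is obfuscated with material derived from its neighbours,
# so recovering one chunk requires having the others in its group.
import hashlib

def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def xor_with_pad(chunk: bytes, pad_seed: bytes) -> bytes:
    # Stretch a hash-derived pad to the chunk length and XOR it in.
    pad = b""
    counter = 0
    while len(pad) < len(chunk):
        pad += hashlib.sha256(pad_seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(chunk, pad))

def toy_self_encrypt(data: bytes, chunk_size: int = 1024 * 1024) -> list[bytes]:
    chunks = split_into_chunks(data, chunk_size)
    hashes = [hashlib.sha256(c).digest() for c in chunks]
    obfuscated = []
    for i, chunk in enumerate(chunks):
        # The pad for chunk i depends on the hashes of the two preceding
        # chunks (wrapping around), which is what ties the group together.
        pad_seed = hashes[(i - 1) % len(hashes)] + hashes[(i - 2) % len(hashes)]
        obfuscated.append(xor_with_pad(chunk, pad_seed))
    return obfuscated
```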

Not necessarily, if what I mentioned above is true. Your suggestion may be a feature that users would like to play with, but the current default would seem to me to be ideal. Hypothetically, you could tell users that increasing file size reduces resiliency but increases obfuscation, but if the network is already handling the multiple levels of redundancy automatically in the background, this sort of statement would misrepresent the actual data resiliency. I guess the point I am trying to make is that you get both benefits with the current scheme, whereas introducing your concept could sacrifice some key properties while increasing complexity.

It is messy. As @neo mentioned, QuickPar or par2 on Linux try to do this in a standardized way, and they are better than a couple of split and cat commands in a terminal or splitting up a zip or tar archive. I’m not convinced that this would be better than just keeping the files in their native formats and then having specialized apps for specific file formats that could achieve what you are asking (and more) but make it more user friendly. Do one thing and do it well?

Client is the code on your machine that provides the APIs to access the network.

An APP would typically utilise the APIs to access the network. But it can also access the network by incorporating the “client” code inside itself.


Besides introducing performance limits like I mentioned before

I see no reason that chunking in groups of 3 instead of trying to do them all as one big group would introduce performance limits. If anything, the user would be able to decrypt the files in 3MB chunks instead of waiting till the whole thing is downloaded (unless I’ve got the decryption process wrong?), which is essential for things like videos.

your method would reduce obfuscation

I don’t know, but if that’s a concern at 3 chunks, why not just increase the maximum chunk-grouping to something like 20? Surely any technology that can find 20 random compatible chunks would also render most current-method data vulnerable?

the desire to ensure that the security model is quantum computing resilient

I get how the chunk size would be relevant, but how does the number of codependent chunks relate to quantum computing?

larger files can give you much more to work with during the XOR pass such that it becomes (dare I say) impossible to extract any meaningful data

Yep, so with chunk grouping an attacker could successfully reconstruct part of a file, but equally a network chunk could go missing without killing an enormous file. I suspect the latter is more likely.

Since the choice of optimal trade-off between obscurity and redundancy is subjective, it makes a lot of sense to just let the network do this automatically based on file-size.

This makes sense. Instead of the codependent chunks equalling the file size, their sizes could be calculated based on the file size. Take a 2TB file for example: two 1TB files would both have ridiculously high obfuscation, but the chance of the whole file being lost forever suddenly gets squared (significantly reduced). I suspect 3 chunks would be enough obfuscation, but if that’s wrong then any sensible standard for grouping codependent chunks based on file size would make sense to me. Larger files having larger groups of codependent chunks could be good.
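As a rough illustration of what I mean by file-size-based grouping, something like this hypothetical policy (the names, thresholds and growth rule are entirely made up) would let obfuscation grow with file size without ever making the whole file one codependent group:

```python
# Hypothetical policy for choosing a chunk-group size from file size.
# All numbers here are invented for illustration only.
import math

MIN_GROUP_SIZE = 3        # current minimum chunk count for self-encryption
MAX_GROUP_SIZE = 20       # illustrative cap on codependence
CHUNK_SIZE = 1024 * 1024  # 1 MB chunks

def chunk_group_size(file_size_bytes: int) -> int:
    """Pick a codependent-group size that grows slowly with file size."""
    total_chunks = max(1, math.ceil(file_size_bytes / CHUNK_SIZE))
    if total_chunks <= MIN_GROUP_SIZE:
        return total_chunks
    # One extra codependent chunk per order of magnitude of chunk count.
    group = MIN_GROUP_SIZE + int(math.log10(total_chunks))
    return min(group, MAX_GROUP_SIZE, total_chunks)
```

With these made-up numbers, a 2TB file (about 2 million 1MB chunks) would end up with groups of roughly 9 codependent chunks, while a 3MB file would keep the current group of 3.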

Pre-chunking in might also negatively affect de-duplication.

I don’t see how… If anything, having a standard would reduce duplication, because non-standard techniques would be the cause of duplication. It could also reduce duplication in edge cases such as for files that have been truncated.

I guess the point I am trying to make is that you get both benefits with the current scheme, whereas introducing your concept could sacrifice some key properties while increasing complicatedness.

As I can’t see the added complication, the issue I see here is high obfuscation vs high resilience. Going back to the file-size-based-codependent-chunking idea, I think a balance could be found easily. Substantial segmentation and substantial chunk-grouping to get the best of both worlds.

It is messy. As @neo mentioned QuickPar or par2 on linux try to do this in a standardized way and are better than a couple of split and cat commands in a terminal or splitting up a zip or tar archive.

It’s not messy because it doesn’t even exist yet. What’s messy at the moment is the lack of consensus on the best way for users to split/merge files, which wouldn’t be an issue here. The clients would split and merge files anyway as that’s how chunking works. If a MaidSafe standard is described and implemented before the network goes live, then the messiness gets avoided entirely. As the network increases in popularity, some existing file split/merge issues will become redundant as traditional email attachment limits disappear.

specialized apps for specific file-formats that could achieve what you are asking

For things like compression, sure. Turn a .wav into .flac instead of zipping it. In this case though, the end-user would have to deal with so many differences between apps’ file-merge requirements. Much better to have a standard.

I’m still new to the forum and learning about the platform but these are my initial impressions that might address your points

You raise good points and seem to understand the platform pretty well : ) I appreciate the contribution.

Client is the code on your machine that provides the APIs to access the network.

So the client doesn’t really do features, just the standard implementation, and an APP would do the prettier stuff?

I’m just gonna quickly suggest a few terms here:

  • chunk group - a group of chunks, all of which are needed to decrypt the group (under the current system this chunk group would be the entire file)
  • chunk group size - the number of chunks in a chunk group (my proposal is to have a standard but non-mandatory cap for this number)
  • codependent chunks - chunks in the same chunk group
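To make these terms concrete, here is a minimal sketch of how they could fit together (the names and structure are mine, purely illustrative):

```python
# Minimal data model for the terms proposed above (names are hypothetical).
from dataclasses import dataclass

@dataclass
class ChunkGroup:
    """A set of chunks that must all be present to decrypt any of them."""
    chunk_ids: list[str]          # network addresses of the codependent chunks

    @property
    def group_size(self) -> int:  # "chunk group size"
        return len(self.chunk_ids)

@dataclass
class FileMap:
    """A file is an ordered sequence of independent chunk groups."""
    groups: list[ChunkGroup]

    def codependent_with(self, chunk_id: str) -> list[str]:
        """All chunks in the same group as chunk_id (its codependent chunks)."""
        for g in self.groups:
            if chunk_id in g.chunk_ids:
                return [c for c in g.chunk_ids if c != chunk_id]
        return []
```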

Just a few thoughts that come to mind:

I was referring to how extra redundancy, rather than pre-chunking methods (even at an equivalent probabilistic survival rate in a major catastrophe), provides more available copies to pull from during regular use, which can speed things up, reduce latency, ensure greater geographic distribution, and increase obfuscation. SAFE has multiple objectives and so needs Pareto-optimal solutions.

Sorry, two things went wrong with that comment.

  1. I lazily used “quantum computing” as a general term for some super advanced computational cracking system that will/may hypothetically exist at some point in the future.
  2. I started writing about how using larger or variable chunk sizes to balance obfuscation with redundancy might be a simpler way to achieve your goal, but this variability might not work well, or could unleash havoc on XOR addressing… and larger chunk sizes would leak more data if a single one were decoded by the future super AI quantum cracker… this was even more complete speculation on my part, so I decided to delete those parts before sending.

You’re right. I said “might” affect de-duplication. I think your argument is sound, but I don’t know enough about it to go into it deeper without further study.

I guess my point was that the “best” way to split/merge files for maximum data resiliency or ease of data recovery after some lower-probability catastrophe is likely file-format dependent; otherwise you need a general tool like QuickPar/par2, as neo recommended. But doing this in the client essentially makes all the data conform to a single file or archive format, and it seems like that would be a point of weakness from a security perspective. Thus, I don’t think consensus is possible, especially if you try to pull that kind of code into the core, given its effect on other SAFE objectives like launch date and feature completeness.

Let’s say you did try to automate the procedure through a few lines of code that looped over the current core chunking implementation to successively pre-chunk large files by recursively splitting all files by two orders of magnitude until you hit a 1MB final chunk. I think you would soon venture into maps of data maps and such, i.e. messy. Whatever method you come up with now should not be based on anything subjective, and should still be a good idea hundreds of years from now. I’m more than willing to acknowledge that my intuition is wrong on this, and an experienced dev might see an easy way to do it. It’s just that in general the software-partitioning point of view would tell me that you wouldn’t want to bring this kind of thing into the SAFE net “kernel” but rather keep it in “userland”.

Let the user decide… right? It just seems a lot easier to achieve your goal in userland. For example, a few lines of code on the command line and you could split and merge most large video and image files like you want using imagemagick and ffmpeg. A few more lines of code to script it and make it transparent to the user…

I’ll admit that your argument may be a good thing for users to consider when uploading relatively large files, and there may be a point of diminishing returns with regard to obscurity and self-encryption when files become much larger than whatever the present-day average is in the future. I’m still not convinced that automatic pre-chunking of files is necessary considering all the other objectives of SAFE, and I still view the operation as analogous to a user pre-encrypting files they have a high degree of paranoia about. My understanding is that it is easier to increase redundancy in a flexible manner over time, relying on ever-increasing storage resources, than it is to improve security in the future. I see encryption/data security/obscurity taking priority over most other concerns. At this point I’ll need to digest a lot more info from you or other sources in order to add anything more useful to the conversation…

I was referring to how it seems like extra redundancy rather than pre-chunking methods

Extra copies would logically increase resilience and performance at the cost of network storage space.
Smaller chunk group sizes would increase resilience and leave performance and network storage space unaffected.
Both seem necessary to consider for optimum resilience/performance/storage efficiency.
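Rough back-of-envelope numbers for the resilience claim, assuming each chunk independently survives a catastrophe with probability s (a big simplification that ignores replication, caching and geography):

```python
# Rough comparison of how much of a file you can expect to recover after a
# catastrophe, assuming each chunk independently survives with probability s.
def expected_recoverable_fraction(group_size: int, s: float) -> float:
    # A group is recoverable only if all of its codependent chunks survive;
    # with one giant group, group_size equals the total chunk count.
    return s ** group_size

# Example: a ~2 GB file (~2000 chunks), 1% of chunks lost (s = 0.99).
print(expected_recoverable_fraction(2000, 0.99))  # whole file as one group: ~2e-9, essentially all lost
print(expected_recoverable_fraction(3, 0.99))     # groups of 3 chunks: ~0.97, about 97% recoverable
```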

this was even more complete speculation on my part so I decided to delete those parts before sending.

I like the variable chunk size idea, but considering the potential complexity I can understand you deleting it :P Maybe it will be a world-changer though?

But doing this in the client essentially makes all the data conform to a single file or archive format and seems like that would be a point of weakness from a security perspective.

This isn’t based on anything, but it seems to me that the client shouldn’t be aware of the file’s format. Like, I would rather my client had no idea what I’m uploading, just like ‘cat’ doesn’t know what kind of file I’m asking it to output to the terminal. I can’t put my finger on why I feel this way… The files get split anyway for the current method of chunking, and that doesn’t take file format into account.

It would only really be relevant if additional splitting/merging would have to be done (which I’m not suggesting is necessary), and in those cases I suspect the processes used by ‘split’ and ‘cat’ are always going to be the best in all ways.

I think you would soon venture into maps of data maps and such, ie. messy.

That does sound messy and isn’t what I had in mind. However, if I understand it right, the chunk dependency would be in a tree-like form, with core (trunk) chunks mapping out the branch chunks. Trees are pretty hard to knock down, compared to big hoops (the shape of the current system’s structure). The chunks that pose the most liability (as in, the ones that the whole file depends on) could be granted more copies than the rest of the chunks, like how a tree gets thicker closer to the trunk.

I would say that one reference header at the beginning of the first chunk is probably all that’s needed in most cases, but I’m no expert… Any accepted proposal to avoid an entirely codependent hoop-style chunk structure would make me happy.
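A minimal sketch of that tree idea, with the replication factor growing towards the trunk (the names and numbers are invented, not a proposal for actual values):

```python
# Sketch of the tree-shaped dependency idea: chunks nearer the trunk are
# needed by more of the file, so they get a higher replication factor.
from dataclasses import dataclass, field

@dataclass
class ChunkNode:
    chunk_id: str
    depth: int                                   # 0 = trunk
    children: list["ChunkNode"] = field(default_factory=list)

    def replication_factor(self, base: int = 4, bonus_per_level: int = 2,
                           max_depth: int = 3) -> int:
        # Trunk chunks (depth 0) get the most copies; deep branches get the base.
        # With these example numbers: depth 0 -> 10 copies, depth 3+ -> 4 copies.
        return base + bonus_per_level * max(0, max_depth - self.depth)
```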

Let the user decide… right? It just seems a lot easier to achieve your goal in userland. For example, a few lines of code on the command line and you could split and merge most large video and image files like you want using imagemagick and ffmpeg. A few more lines of code to script it and make it transparent to the user…

Yep, but my point is that it should ideally be opt-out, and at least be described fully by an accepted standard.

still view the operation analogous to that of the user pre-encrypting files they have a high degree of paranoia about.

Pre-encryption has to remain in userland for the paranoid users to be content. If it’s in the network, it’s not really pre-encryption, right?

Reducing chunk co-dependence should ideally stay in the client and network code, or at least be a standard with zero obscurity. I think ‘pre-chunking’ was the wrong term for me to use; it was only intended as a warning of what apps would do, rather than a suggestion. My preferred solution is for there to be no pre-chunking, as it would all be merged into every client.

I also note that pre-encryption would take up additional processing time, which is fine; I understand why people might use it anyway.

At this point I’ll need to digest a lot more info from you or other sources in order to add anything more useful to the conversation…

I hope it has felt productive so far(^:

FYI,
Did some quick off-forum reading/scanning the other day and noticed that “map of datamaps” is mentioned as a means of versioning files in SAFE. Not sure if it has been implemented this way, but your idea may already exist in the code to some extent… not sure.

That indicates some kind of ‘pointer’ functionality that this might rely on in the client :D
Standardisation to ensure that everyone’s using it in the same way is still vital for pre-release in my opinion.

Go here and check out the section on MutableData.
https://safe-network-explained.github.io/architecture

“The content of mutable data may point to other mutable data, allowing the creation of chains of mutable data [6] that can be used for many purposes such as version control and branching, verifiable history and data recovery.”

I think you will find that what is described is that the network has the inherent capability to do exactly what you want… It might just take an app developer to implement it in the manner you desire.
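Conceptually, the chain described in that quote could look something like this (the dict below stands in for network storage, and none of these names are the real SAFE client API; it only shows the pointer-chain structure):

```python
# Conceptual sketch of "mutable data pointing to mutable data" for versioning.
from dataclasses import dataclass
from typing import Optional

@dataclass
class VersionEntry:
    data_map_address: str            # where this version's immutable data map lives
    previous_version: Optional[str]  # address of the previous VersionEntry, or None

def append_version(store: dict, head_addr: Optional[str],
                   new_entry_addr: str, data_map_address: str) -> str:
    """Store a new version whose entry points back at the current head."""
    store[new_entry_addr] = VersionEntry(data_map_address, previous_version=head_addr)
    return new_entry_addr  # the new head of the chain

def history(store: dict, head_addr: Optional[str]) -> list[str]:
    """Walk the pointer chain from newest to oldest data map."""
    maps = []
    while head_addr is not None:
        entry = store[head_addr]
        maps.append(entry.data_map_address)
        head_addr = entry.previous_version
    return maps
```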

So if MaidSafe comes up with a standard for version control, involving pointers to previous versions, I guess it wouldn’t take much to adapt that standard with file splitting in mind when that feature starts being tested (:

One thing I don’t think we’ve talked about in this thread, but which also bolsters your desire for “pre-chunking”, is the ability to use multiple threads on a single file. Pre-chunking large files into 4 or 8 pieces, then self-encrypting each piece in parallel, sure would speed things up a bit on typical commodity desktop hardware. Especially if you write an app with a Threadripper or GPU acceleration in mind.
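A rough sketch of that parallelism, where self_encrypt_piece() is just a placeholder for whatever the client actually does per piece, and the piece size is an arbitrary example:

```python
# Sketch: split a large file into pieces and self-encrypt each piece on a
# separate worker. self_encrypt_piece() is a placeholder, not real client code.
from concurrent.futures import ProcessPoolExecutor

PIECE_SIZE = 256 * 1024 * 1024  # e.g. 256 MB pre-chunks (arbitrary choice)

def self_encrypt_piece(piece: bytes) -> bytes:
    ...  # placeholder: run the normal self-encryption over this piece
    return piece

def parallel_self_encrypt(data: bytes, workers: int = 8) -> list[bytes]:
    pieces = [data[i:i + PIECE_SIZE] for i in range(0, len(data), PIECE_SIZE)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Each piece is independent, so the encryption passes can run concurrently.
        return list(pool.map(self_encrypt_piece, pieces))
```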

Good point. Perhaps a tree-like structure of chunk dependency would also have the benefit of working well with parallel processing. One thread for the first (trunk) chunks, then more threads as the later (branch) chunks can be encrypted without taking other branches into account.

Related forum threads:


Yes. When a catastrophic event wipes out millions of nodes: I will worry about my 9.52% of lost video files. Or: I will worry about the millions of lives lost in the same event. EDIT: True: If it’s a computer virus, maybe no lives lost.

This may be true. You could do RAID on top of SAFE:

  • Split your file to 1MB blocks.
  • Group them into groups of a few blocks.
  • Make one or more parity blocks for each group.
  • Store all blocks as separate documents on Safe Network.

You can restore your file if no more than the number of parity blocks is lost from each group of blocks.
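A minimal sketch of that scheme with a single XOR parity block per group, which lets each group survive the loss of any one block (par2/Reed-Solomon would be the way to tolerate more losses per group):

```python
# One XOR parity block per group: any single missing block can be rebuilt.
def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    result = bytearray(blocks[0])
    for block in blocks[1:]:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

def make_parity(group: list[bytes]) -> bytes:
    # All blocks in a group must be the same length (pad the last one first).
    return xor_blocks(group)

def recover_missing(group_with_gap: list, parity: bytes) -> bytes:
    """Rebuild the single missing block (marked as None) from the rest plus parity."""
    present = [b for b in group_with_gap if b is not None]
    return xor_blocks(present + [parity])
```

For example, a group of four 1MB data blocks plus one parity block costs 25% extra storage to survive any single loss, versus 100% extra for a full additional replica.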

So the solution is one of these:

  • Somebody could write an app for this. Tools already mentioned.
  • Safe Network could store chunks like this. Do we know: Why does it not?

The main reason would be that there is no RFC yet; granted, our RFC process has been lacking of late, but that is about to change (new hires). The other thing is that replication/error correction/RAID/Reed-Solomon etc. factors always have a limit. So we can always lose up to the max of any algo, and folk can still say: then why not double up, re-replicate each chunk several times, increase the replication count, add a mix of replication and another scheme, etc.

So it is a very interesting topic with an easy resolution: like group size, the replication factor or similar scheme is just a figure we choose. The limit will always be breached given enough of an argument; it’s an easy argument, just lose one more piece than the algo can handle. The trick is really two things:

  1. Are the pieces lost forever (will the nodes come back online)?
  2. Are there copies available (Archive nodes)?

The list goes on. For the world’s data such as Wikipedia, ancient texts, etc., there are really nice things in design these days, like the laser-etched disks that SpaceX put in the Tesla. Archive nodes handling this data would definitely be feasible. Then, if huge swathes of data can be made tertiary like that, the replication factor of more live data can be increased.

I think the story is a bigger one and likely to change over time. Increasing the replication count is easy if we offload data, which I fully believe will happen. I can see these kinds of new long-term storage becoming widespread, where nearly every computer can hold old immutable data in huge quantities, in ways that withstand the horrors humans may do to the planet. Then SAFE becomes a mechanism for agreeing on and arranging current data that will eventually go tertiary like the rest.

I am babbling a wee bit, but hopefully this expands the area a little to encompass possible future-proofing of data storage, creation and agreement over time.


Hmhmm, if one were to take RAID software and then do the ‘standard’ self-encryption on top of it… then data could be reconstructed even if parts of it are lost, and it might use less storage space all in all than raising the replication count :thinking: (and you don’t really need to request all parts if everything goes well)

Just thinking loud…

Yes, that is what I mean though: RAID 2, 3, 4, 5, 6, 7, 8, etc. or replication count 2, 3, 4, 5, 6, 7, 8, etc. The trade-offs are very similar, and there was even a French chap who did a PhD on it that concluded replication was simpler and just as effective in real-world situations. It is a bit vim/emacs, a debate that could rage through the ages :wink: The bottom line is that there will be a limit to whatever choice is made, and those limits will only allow folk to say “what about limit+1? it breaks, so why not increase the limit?”; either RAID or replication will do that. It’s just a matter of increasing forever, until all nodes hold all data, if that makes sense.
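The rough arithmetic behind “the trade-offs are very similar”, with illustrative numbers only: both approaches tolerate losing m pieces, the erasure code just pays less storage for it at the cost of more complex repair.

```python
# Rough storage-overhead arithmetic for replication vs RAID-style erasure coding.
def replication_overhead(m_losses_tolerated: int) -> float:
    # Need m extra full copies to survive m losses: (1 + m)x storage.
    return 1.0 + m_losses_tolerated

def erasure_overhead(k_data: int, m_parity: int) -> float:
    # A (k, m) code stores k + m pieces for k pieces of data: (k + m) / k x storage.
    return (k_data + m_parity) / k_data

# Example: tolerate 4 lost pieces.
print(replication_overhead(4))   # 5.0x storage
print(erasure_overhead(10, 4))   # 1.4x storage, but repair needs 10 other pieces
```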


Yes. When a catastrophic event wipes out millions of nodes: I will worry about my 9.52% of lost video files. Or: I will worry about the millions of lives lost in the same event. EDIT: True: If it’s a computer virus, maybe no lives lost.

Sure. But if the choice is between millions of lives and files being lost, and the same millions of lives but no files being lost, why would we prefer the former?

Somebody could write an app for this. Tools already mentioned.

Yes, and multiple apps will be written with no unambiguous standard, which will create file duplication on the network. Maybe when the network is closer to release, a splitting standard will be defined, and then anyone using a different method will just have to pay more to upload. Without a standard, anyone uploading non-unique files using a splitting app/plugin will have to pay more.
