What about a catastrophic event that wipes out millions of nodes?


#61

Summary of what I’m saying:

  • Pre-chunking would increase resiliency with no drawbacks.
  • Pre-chunking will inevitably be a default in all clients regardless.
  • If there is no built-in or MaidSafe-endorsed standard for opt-out pre-chunking, it could get messy.

If anyone convinces me that the above bullet points aren’t accurate, I will edit this post to make a note that I was wrong.


#62

Some interesting stuff:


This is a map of individual devices connected to the internet.

Now, to make things more interesting, look at this animation, which shows how these devices are connected during the day:

(Animation: Carnabotnet_geovideo_lowres)
It basically shows how SAFE Network data would move around the globe as devices connect and disconnect, and how the concentration of data on earth shifts over the day.
(A lot more is at play though, considering the kinds of devices, bandwidth, etc.)

I think this latter one is from 2012, so it’s a bit old now.

So the geographic distribution of cached data will also be limited by this pattern we see, i.e. the number of available nodes over the day.
With data chains, the significance of this for older data should be low.

We can also anticipate resiliency increases as that large area of Africa lights up over the coming decades; the same goes for South America and Asia (where it will happen earlier, but as a less dramatic change).


#63

Now that is a very cool way to visualise data movement!


#64

@to7m
I’m still new to the forum and learning about the platform but these are my initial impressions that might address your points:

I think there are drawbacks. Besides introducing performance limits like I mentioned before, I would say that your method would reduce obfuscation and undermine the desire to ensure that the security model is quantum computing resilient. The SAFE network is intended to stand the test of time. It could be theorized that (eventually) whatever basic encryption technique you choose will be weakened or broken once the attacker has enough computing power and the whole file, but self-encryption helps get around this. While the minimum number of chunks for self-encryption might be 3, larger files can give you much more to work with during the XOR pass such that it becomes (dare I say) impossible to extract any meaningful data. Telling the adversary that one can link together meaningful info by picking the right set of 3 chunks at a time gives them a lot more to start with than just saying “It’s all random… good luck”. Since the choice of optimal trade-off between obscurity and redundancy is subjective, it makes a lot of sense to just let the network do this automatically based on file-size. Also, redundancy can always be improved by adding more storage (which also helps obfuscation), so there’s no need to cripple your future-proof encryption scheme for more of the same… right? Pre-chunking might also negatively affect de-duplication.
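
To make the XOR point a bit more concrete, here is a very rough sketch of a self-encryption-style obfuscation pass. It is not MaidSafe’s actual self_encryption code; the `keystream`/`obfuscate` functions and the idea of seeding each pad from the hashes of the other chunks are just placeholders to show why every chunk in a codependent group is needed (as far as I understand it, the real design keeps the original chunk hashes in the data map):

```python
# Simplified, hypothetical sketch of a self-encryption-style XOR pass.
# NOT MaidSafe's actual self_encryption algorithm; it only illustrates why
# every chunk in a codependent group is needed to recover any of them.
import hashlib

def keystream(seed: bytes, length: int) -> bytes:
    """Expand a seed into a pseudo-random pad of the requested length."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def obfuscate(chunks: list[bytes]) -> list[bytes]:
    """XOR each chunk with a pad derived from the hashes of the *other* chunks."""
    hashes = [hashlib.sha256(c).digest() for c in chunks]
    result = []
    for i, chunk in enumerate(chunks):
        # The seed depends on every chunk except this one, so losing any single
        # chunk of the group makes the pad (and the plaintext) unrecoverable.
        # Reversal relies on the pre-obfuscation hashes being stored elsewhere,
        # e.g. in a data map.
        seed = b"".join(h for j, h in enumerate(hashes) if j != i)
        pad = keystream(seed, len(chunk))
        result.append(bytes(a ^ b for a, b in zip(chunk, pad)))
    return result
```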

Not necessarily, if what I mentioned above is true. Your suggestion may be a feature that users would like to play with, but the current default would seem to me to be ideal. Hypothetically, you could tell users that increasing file-size reduces resiliency but increases obfuscation, but if the network is already handling the multiple levels of redundancy automatically in the background, this sort of statement would misrepresent the actual data resiliency. I guess the point I am trying to make is that you get both benefits with the current scheme, whereas introducing your concept could sacrifice some key properties while increasing complexity.

It is messy. As @neo mentioned, QuickPar or par2 on Linux try to do this in a standardized way, and are better than a couple of split and cat commands in a terminal or splitting up a zip or tar archive. I’m not convinced that this would be better than just keeping the files in their native formats and then having specialized apps for specific file-formats that could achieve what you are asking (and more) while making it more user-friendly. Do one thing and do it well?


#65

Client is the code on your machine that provides the APIs to access the network.

An APP would typically utilise the APIs to access the network, but it can also access the network by incorporating the “client” code inside it.


#66

Besides introducing performance limits like I mentioned before

I see no reason that chunking in groups of 3, instead of trying to do them all as one big group, would introduce performance limits. If anything, the user would be able to decrypt the file in 3MB chunks instead of waiting until the whole thing is downloaded (unless I’ve got the decryption process wrong?), which is essential for things like videos.

your method would reduce obfuscation

I don’t know, but if that’s a concern at 3 chunks, why not just increase the maximum chunk-grouping to something like 20? Surely any technology that can find 20 random compatible chunks would also render most current-method data vulnerable?

the desire to ensure that the security model is quantum computing resilient

I get how the chunk size would be relevant, but how does the number of codependent chunks relate to quantum computing?

larger files can give you much more to work with during the XOR pass such that it becomes (dare I say) impossible to extract any meaningful data

Yep, so either an attacker could successfully reconstruct part of a file, or a chunk could go missing on the network without killing an enormous file. I suspect the latter scenario is more likely.

Since the choice of optimal trade-off between obscurity and redundancy is subjective, it makes a lot of sense to just let the network do this automatically based on file-size.

This makes sense. Instead of the group of codependent chunks always spanning the whole file, its size could be calculated from the file size. Take a 2TB file, for example: two 1TB halves would both still have ridiculously high obfuscation, but the chance of the whole file being lost forever gets squared (i.e. drastically reduced). I suspect 3 chunks would be enough obfuscation, but if that’s wrong then any sensible standard for grouping codependent chunks based on file size would make sense to me. Larger files having larger groups of codependent chunks could be good.
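
Back-of-the-envelope sketch of that “squared” claim (my own toy numbers, assuming each chunk group independently becomes unrecoverable with some small probability p):

```python
# Toy numbers, not from any SAFE design doc: assume each chunk group
# independently becomes unrecoverable with probability p after a catastrophe,
# e.g. because one of its chunks lost every copy.
p = 1e-6  # hypothetical per-group loss probability

one_group  = p                  # one codependent group: any loss kills it all
two_groups = p * p              # two independent groups: both must fail
any_loss   = 1 - (1 - p) ** 2   # chance that *some* part of the split file is lost

print(f"single group, whole file lost: {one_group:.1e}")
print(f"two groups,   whole file lost: {two_groups:.1e}")
print(f"two groups,   any part lost:   {any_loss:.1e}")
```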

Pre-chunking might also negatively affect de-duplication.

I don’t see how… If anything, having a standard would reduce duplication, because non-standard techniques would be the cause of duplication. It could also reduce duplication in edge cases such as for files that have been truncated.

I guess the point I am trying to make is that you get both benefits with the current scheme, whereas introducing your concept could sacrifice some key properties while increasing complexity.

As I can’t see the added complexity, the issue I see here is high obfuscation vs high resilience. Going back to the file-size-based codependent-chunking idea, I think a balance could be found easily: substantial segmentation and substantial chunk-grouping, to get the best of both worlds.

It is messy. As @neo mentioned, QuickPar or par2 on Linux try to do this in a standardized way, and are better than a couple of split and cat commands in a terminal or splitting up a zip or tar archive.

It’s not messy because it doesn’t even exist yet. What’s messy at the moment is the lack of consensus on the best way for users to split/merge files, which wouldn’t be an issue here. The clients would split and merge files anyway as that’s how chunking works. If a MaidSafe standard is described and implemented before the network goes live, then the messiness gets avoided entirely. As the network increases in popularity, some existing file split/merge issues will become redundant as traditional email attachment limits disappear.

specialized apps for specific file-formats that could achieve what you are asking

For things like compression, sure. Turn a .wav into .flac instead of zipping it. In this case though, the end-user would have to deal with so many differences between apps’ file-merge requirements. Much better to have a standard.

I’m still new to the forum and learning about the platform but these are my initial impressions that might address your points

You raise good points and seem to understand the platform pretty well : ) I appreciate the contribution.

Client is the code on your machine that provides the APIs to access the network.

So the client doesn’t really do features, just the standard implementation, while an APP would do the prettier stuff?

I’m just gonna quickly suggest a few terms here:

  • chunk group - a group of chunks, all of which are needed to decrypt the group (under the current system this chunk group would be the entire file)
  • chunk group size - the number of chunks in a chunk group (my proposal is to have a standard but non-mandatory cap for this number)
  • codependent chunks - chunks in the same chunk group
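
And a purely hypothetical sketch of how those terms might hang together in code; none of these names or numbers come from the MaidSafe codebase, and the 1MB / cap-of-3 figures are placeholders:

```python
# Hypothetical structures just to pin the terminology down; the names and
# constants are placeholders, not part of any MaidSafe API.
from dataclasses import dataclass, field

CHUNK_SIZE = 1024 * 1024          # placeholder chunk size (1 MB)
DEFAULT_GROUP_SIZE_CAP = 3        # proposed standard (but non-mandatory) cap

@dataclass
class ChunkGroup:
    """A set of codependent chunks: all are needed to decrypt any of them."""
    chunk_ids: list[bytes] = field(default_factory=list)

    @property
    def group_size(self) -> int:
        return len(self.chunk_ids)

@dataclass
class FileMap:
    """A file becomes a sequence of chunk groups instead of one giant group."""
    groups: list[ChunkGroup] = field(default_factory=list)
```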

#67

Just a few thoughts that come to mind:

I was referring to how it seems like extra redundancy rather than pre-chunking methods (while reaching an equivalent probabilistic survival rate in a major catastrophe) provides more available copies to pull from during regular use, which can speed things up, reduce latency, ensure greater geographic distribution, and increase obfuscation. SAFE has multiple objectives and so needs Pareto-optimal solutions.

Sorry, two things went wrong with that comment.

  1. I lazily used “quantum computing” as a general term for some super advanced computational cracking system that will/may hypothetically exist at some point in the future.
  2. I started writing about how using larger or variable chunk sizes to balance obfuscation with redundancy might be a simpler way to achieve your goal, but that this variability might not work well, or could wreak havoc on XOR addressing… and larger chunk sizes would leak more data if a single one is decoded by the future super AI quantum cracker… this was even more complete speculation on my part so I decided to delete those parts before sending.

You’re right. I said “might” affect de-duplication. I think your argument is sound, but I don’t know enough about it to go into it deeper without further study.

I guess my point was that the “best” way to split/merge files for maximum data resiliency or ease of data recovery after some lower-probability catastrophe is likely file-format dependent; otherwise you need a general tool like quickpar/par2 like neo recommended. But doing this in the client essentially makes all the data conform to a single file or archive format and seems like that would be a point of weakness from a security perspective. Thus, I don’t think consensus is possible, especially if you try to pull that kind of code into the core, given its effect on other SAFE objectives like launch date and feature completeness. Let’s say you did try to automate the procedure through a few lines of code that looped over the current core chunking implementation to successively pre-chunk large files by recursively splitting all files by two orders of magnitude until you hit a 1MB final chunk. I think you would soon venture into maps of data maps and such, i.e. messy. Whatever method you come up with now should not be based on anything subjective, and should also be a good idea hundreds of years from now. I’m more than willing to acknowledge that my intuition is wrong on this, and an experienced dev might see an easy way to do it. It’s just that, in general, the software-partitioning point of view would tell me that you wouldn’t want to bring this kind of thing into the SAFE net “kernel” but rather keep it in “userland”.
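
A rough sketch of what that recursive pre-chunking might look like, just to show how quickly it turns into maps of data maps; `store_group` is a stand-in for the core chunking/self-encryption step, not a real SAFE API, and the constants are placeholders:

```python
# Sketch of the recursive pre-chunking idea described above, to show how it
# drifts into "maps of data maps". `store_group` is a stand-in for the core
# chunking/self-encryption step and is not a real SAFE API.
CHUNK_SIZE = 1024 * 1024      # placeholder 1 MB final chunk
SPLIT_FACTOR = 100            # "two orders of magnitude" per level

def store_group(data: bytes) -> dict:
    """Pretend to self-encrypt one blob and return its data map."""
    return {"type": "data_map", "size": len(data)}

def pre_chunk(data: bytes) -> dict:
    # Small enough: hand it to the core chunker directly.
    if len(data) <= SPLIT_FACTOR * CHUNK_SIZE:
        return store_group(data)
    # Otherwise split into SPLIT_FACTOR pieces and recurse: the result is a
    # map that points at further maps, i.e. a map of data maps.
    piece = -(-len(data) // SPLIT_FACTOR)   # ceiling division
    children = [pre_chunk(data[i:i + piece]) for i in range(0, len(data), piece)]
    return {"type": "map_of_maps", "children": children}
```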

Let the user decide… right? It just seems a lot easier to achieve your goal in userland. For example, a few lines of code on the command line and you could split and merge most large video and image files like you want using imagemagick and ffmpeg. A few more lines of code to script it and make it transparent to the user…
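
For instance, something along these lines (a hypothetical userland wrapper around ffmpeg’s segment and concat muxers; the file names and the 10-minute segment length are placeholders):

```python
# Hypothetical userland wrapper around ffmpeg's stream-copy segment/concat
# muxers, along the lines suggested above; names and durations are placeholders.
import subprocess

def split_video(src: str, pattern: str = "part%03d.mp4", seconds: int = 600) -> None:
    """Split a video into ~10-minute pieces without re-encoding."""
    subprocess.run(
        ["ffmpeg", "-i", src, "-c", "copy", "-map", "0",
         "-f", "segment", "-segment_time", str(seconds), pattern],
        check=True,
    )

def merge_video(list_file: str, out: str = "merged.mp4") -> None:
    """Re-join the pieces listed (one `file 'name'` line each) in list_file."""
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_file,
         "-c", "copy", out],
        check=True,
    )
```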

I’ll admit that your argument may be a good thing for users to consider when uploading relatively large files, and there may be a point of diminishing returns with regard to obscurity and self-encryption when files become much larger than whatever the present-day average is in the future. I’m still not convinced that automatic pre-chunking of files is necessary considering all the other objectives of SAFE, and still view the operation as analogous to that of the user pre-encrypting files they have a high degree of paranoia about. My understanding is that it is easier to increase redundancy in a flexible manner through time and rely on ever-increasing storage resources than to improve security in the future. I see encryption/data security/obscurity taking priority over most other concerns. At this point I’ll need to digest a lot more info from you or other sources in order to add anything more useful to the conversation…


#68

I was referring to how it seems like extra redundancy rather than pre-chunking methods

Extra copies would logically increase resilience and performance at the cost of network storage space.
Smaller chunk group sizes would increase resilience and leave performance and network storage space unaffected.
Both seem necessary to consider for optimum resilience/performance/storage efficiency.
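
A toy model of that trade-off, assuming (my assumption, not a SAFE figure) that each stored copy of a chunk is lost independently with probability p in a catastrophe:

```python
# Toy model: each stored copy of a chunk is lost independently with
# probability p; a codependent group dies if any chunk loses every copy.
def p_group_lost(p: float, copies: int, group_size: int) -> float:
    p_chunk_lost = p ** copies
    return 1 - (1 - p_chunk_lost) ** group_size

p = 0.01
print(p_group_lost(p, copies=4, group_size=1000))  # one huge codependent group
print(p_group_lost(p, copies=4, group_size=3))     # small group, same copies
print(p_group_lost(p, copies=8, group_size=1000))  # extra copies instead
```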

this was even more complete speculation on my part so I decided to delete those parts before sending.

I like the variable chunk size idea, but considering the potential complexity I can understand you deleting it :P Maybe it will be a world-changer though?

But doing this in the client essentially makes all the data conform to a single file or archive format and seems like that would be a point of weakness from a security perspective.

This isn’t based on anything, but it seems to me that the client shouldn’t be aware of the file’s format. Like, I would rather my client had no idea what I’m uploading, just like ‘cat’ doesn’t know what kind of file I’m asking it to output to the terminal. I can’t put my finger on why I feel this way… The files get split anyway for the current method of chunking, and that doesn’t take file format into account.

It would only really be relevant if additional splitting/merging would have to be done (which I’m not suggesting is necessary), and in those cases I suspect the processes used by ‘split’ and ‘cat’ are always going to be the best in all ways.

I think you would soon venture into maps of data maps and such, i.e. messy.

That does sound messy and isn’t what I had in mind. However, if I understand it right, the chunk dependency would be in a tree-like form, with core (trunk) chunks mapping out the branch chunks. Trees are pretty hard to knock down compared to big hoops (the shape of the current structure). The chunks that pose the most liability (as in, the ones that the whole file depends on) could be granted more copies than the rest of the chunks, like how a tree gets thicker closer to the trunk.

I would say that one reference header at the beginning of the first chunk is probably all that’s needed in most cases, but I’m no expert… Any accepted proposal to avoid an entirely codependent hoop-style chunk structure would make me happy.
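
A loose sketch of that tree idea, with arbitrary replication numbers, just to illustrate “thicker closer to the trunk”:

```python
# Loose sketch of the tree idea above: trunk chunks map out branch chunks, and
# chunks nearer the trunk get more copies. The replication numbers are arbitrary.
from dataclasses import dataclass, field

@dataclass
class ChunkNode:
    chunk_id: bytes
    children: list["ChunkNode"] = field(default_factory=list)

def copies_for(depth: int, base: int = 4, extra_per_level: int = 2) -> int:
    """More copies closer to the trunk (depth 0), fewer out on the branches."""
    return base + extra_per_level * max(0, 3 - depth)

def assign_copies(node: ChunkNode, depth: int = 0) -> dict[bytes, int]:
    """Walk the tree and plan how many copies each chunk should get."""
    plan = {node.chunk_id: copies_for(depth)}
    for child in node.children:
        plan.update(assign_copies(child, depth + 1))
    return plan
```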

Let the user decide… right? It just seems a lot easier to achieve your goal in userland. For example, a few lines of code on the command line and you could split and merge most large video and image files like you want using imagemagick and ffmpeg. A few more lines of code to script it and make it transparent to the user…

Yep, but my point is that it should ideally be opt-out, and at least be described fully by an accepted standard.

still view the operation as analogous to that of the user pre-encrypting files they have a high degree of paranoia about.

Pre-encryption has to remain in userland for the paranoid users to be content. If it’s in the network, it’s not really pre-encryption, right?

Reducing chunk co-dependence should ideally stay in the client and network code, or at least be standard with zero obscurity. I think ‘pre-chunking’ was the wrong term for me to use; it was only intended as a warning of what apps would do rather than a suggestion. My preferred solution is for there to be no pre-chunking, as it would all be merged into every client.

I also note that pre-encryption would take up additional processing time. That’s fine, though; I understand why people might use it anyway.

At this point I’ll need to digest a lot more info from you or other sources in order to add anything more useful to the conversation…

I hope it has felt productive so far(^:


#69

FYI,
Did some quick off-forum reading/scanning the other day and noticed that “map of datamaps” is mentioned as a means of versioning files in SAFE. Not sure if it has been implemented this way, but your idea may already exist in the code to some extent.


#70

That indicates some kind of ‘pointer’ functionality that this might rely on in the client :D
Standardisation to ensure that everyone’s using it in the same way is still vital for pre-release in my opinion.