Traffic sizes on the SAFE network

Chunks are limited to 1 MB so I thought it’d be interesting to see what the distribution of chunk sizes on the network might be.

Small files between 3 KB and 1 MB are split into three chunks, each one third the size of the file. Files smaller than 3 KB become a single chunk the same size as the file (thanks @fred below for clarifying).

Large files are split into 1 MB chunks and each chunk (except the last) is 1 MB.
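To make those rules concrete, here's a rough sketch of them as code - my own guess at the boundary handling, not the actual self_encryption implementation:

// Rough sketch of the chunking rules described above; the boundary
// handling is a guess, not the real self_encryption code.
fn chunk_sizes(file_size: u64) -> Vec<u64> {
    const KB: u64 = 1024;
    const MB: u64 = 1024 * KB;

    if file_size < 3 * KB {
        // Tiny file: a single chunk the same size as the file.
        vec![file_size]
    } else if file_size <= MB {
        // Small file: three chunks, each roughly a third of the file.
        let third = file_size / 3;
        vec![third, third, file_size - 2 * third]
    } else {
        // Large file: 1 MB chunks, with the remainder in a final smaller chunk.
        let mut chunks = vec![MB; (file_size / MB) as usize];
        if file_size % MB > 0 {
            chunks.push(file_size % MB);
        }
        chunks
    }
}

fn main() {
    println!("{:?}", chunk_sizes(3 * 1024));   // a 3 KB file -> three 1 KB chunks
    println!("{:?}", chunk_sizes(2_621_440));  // a 2.5 MB file -> 1 MB + 1 MB + 0.5 MB
}

So a 3 KB file costs three tiny chunks, while a 2.5 MB file costs two full chunks plus a remainder chunk - which is what produces the hollow middle in the table below.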

Looking at all the files in my $HOME directory, this is the distribution I found:

Gathering Current User HomeDir Stats
Total files: 355401
Files larger than 1 MB: 2377 (29.206469 GB)
Files smaller than 1 MB: 353024 (19.043795 GB)
Total chunks: 768004
Large chunks: 28785
Small chunks: 739219

Chunk Size (KB)    Count
   0-100          672286
 100-200           62070
 200-300            2753
 300-400             570
 400-500             303
 500-600             608
 600-700             229
 700-800             103
 800-900             131
 900-1000            134
1000+              28817

Obviously this will be different for everyone, and this machine does not have many media files so probably doesn’t represent the average user very well (I’d expect more 1000+ size chunks from an average user).

Chunks in the ‘medium’ range (100 to 1000 KB) account for only a small share of the total, about 9% (66,901 of 768,004 chunks). Roughly 90% of chunks come from the very big and the very small files.

So from a network traffic perspective, it looks like it’ll mainly be dealing with (by quantity):

  1. routing messages (should be < 100 KB; large quantity but small size)
  2. small chunks (< 100 KB; also large quantity but small size)
  3. medium chunks (between 100 and 1000 KB)
  4. large chunks (> 1000 KB; should be small quantity but large size)

The list is reversed when ordered by the amount of bandwidth consumed.

The main thing to take away from this, I think, is that latency may be a big bottleneck. At this stage it’s just speculation, but it starts to give some idea of what the ‘shape’ of content traversing the network might be like.

I was surprised how little the ‘middle of the spectrum’ counts toward network activity.

15 Likes

And here I thought that only applied to files less than 3 KB, and that anything over 3 KB is chunked into a minimum of 3 chunks. So a 3 KB file is chunked into 3 chunks of 1 KB each.

6 Likes

Quite right, so the number of chunks would be triple in quantity and a third the size for files <1000KB. Good catch. This even further empties the middle of the range and increases the small chunk count significantly.

1 Like

Latency has always been in the back of my mind with this project; it would be mind-boggling to me to achieve end-to-end encryption across a distributed network AND receive those large files in a short time frame. I had a long-shot hope that compression and other techniques could somehow help improve such latency. It would be awesome if SAFE could be fast enough for digital media streaming, webcam video chats and such. But I fear such media requests from the network may need a decent buffering period when all is said and done at first launch.

1 Like

…it’s just as with torrents… of course latency will be the dominant drawback… but since farmers do have an incentive to have a faster internet connection with low latency, they will most likely have a ping of <20 ms… so even with 10 hops you would be at 200 ms (and if you calculate with 20 hops → still at 400 ms)… the apartment I lived in while studying had a ping of 7… (yes, I admit that 7 is rather extreme… but there are people with that good an internet connection…)

not saying it won’t be a critical point… it is… but I guess it’s something that can work out pretty well and won’t necessarily be a reason for too many headaches at too early a stage… I’m sure the routing team is well aware that each additional callback will have a major impact on latency :slight_smile:

5 Likes

The number of chunks is higher than that.

Firstly, for file objects the data map is not stored directly in a Mutable Data anymore; a pointer to it is stored instead, so that creates at least one additional chunk (see MaidSafe Dev Update - February 23, 2017 - #72 by tfa).

Then one additional MD is created for each file of a safe site (see MaidSafe Dev Update - February 23, 2017 - #85 by tfa).

9 Likes

@tfa: “Then one additional MD is created for each file of a safe site (see MaidSafe Dev Update - February 23, 2017)”
> additionally, there is one service object (MD with tag 15002) generated for each file

I don’t think that’s correct, but I can see where I might be wrong so first let me say what I understand… There is one services MD for each public name. Then for each service created on a public name (eg www), there is one MD to provide an index for immutable files - think of this MD as a container for the files accessed by that service - so that’s like an extra MD per website.

Each uploaded file results in an entry in the container MD, a pointer to the ImmutableData of that file (ie to a datamap).

So MD count is:

  • 1 services MD per public NAME, plus
  • 1 NFS container MD per web service (ie per website)

Each additional file is just the chunks of that file, with a pointer to the data map of the file. So I could be wrong if this means that an immutable file datamap is implemented using an MD per file. That doesn’t make sense to me though, because it would not be an immutable file if that’s the case.

So I don’t think there is an extra MD per immutable file. Just an entry in the container used to represent the files uploaded on a particular service (ie website).

However, the datamap itself is an overhead, and I’m not sure exactly what happens here, other than I don’t believe it can be an MD.

So the datamap is a possible source of confusion. Can somebody who knows the details explain how this is implemented for different sizes of immutable file?

4 Likes

Bit of a braindump on latency, so not a direct reply to the quote above, but the quote adds some context…

For me the risk with latency bottlenecks is it can very easily become an unintended centralizing force.

For example, if safecoin is awarded on a first-to-serve basis, rewards are pretty much only affected by latency, not by bandwidth.

So vaults with high latency never get safecoin, thus leave their section, thus further concentrating low-latency vaults in that section (which is good for performance, right?!). But if the ‘worst case latency’ vault keeps dropping out, eventually the section will be entirely within a single data centre. It’s a bit oversimplified, but latency is a tricky character in the reward scheme. And the distribution of file / traffic sizes seems to indicate that latency, more than bandwidth, is going to be important to consider.
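To make that loop concrete, here’s a toy simulation - all the numbers and rules are invented, it just illustrates the dynamic of ‘fastest in the group wins, least-rewarded vault gives up’:

// Toy model of the latency feedback loop described above. Everything here
// (section size, latency range, group size, eviction rule) is invented
// purely for illustration.
struct Lcg(u64);

impl Lcg {
    // Minimal pseudo-random generator so the sketch needs no external crates.
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
    fn below(&mut self, max: u64) -> u64 {
        self.next() % max
    }
}

fn main() {
    let mut rng = Lcg(42);
    // A section of 50 vaults with latencies between 5 ms and 300 ms.
    let mut latency: Vec<u64> = (0..50).map(|_| 5 + rng.below(295)).collect();

    for round in 1..=200 {
        // Each round: 1000 GETs, each served by a random group of 8 vaults,
        // with the lowest-latency vault in the group taking the reward.
        let mut rewards = vec![0u64; latency.len()];
        for _ in 0..1000 {
            let winner = (0..8)
                .map(|_| rng.below(latency.len() as u64) as usize)
                .min_by_key(|&i| latency[i])
                .unwrap();
            rewards[winner] += 1;
        }
        // The least-rewarded vault gives up and is replaced by a newcomer
        // with a fresh random latency.
        let loser = (0..latency.len()).min_by_key(|&i| rewards[i]).unwrap();
        latency[loser] = 5 + rng.below(295);

        if round % 40 == 0 {
            let mut sorted = latency.clone();
            sorted.sort();
            println!("round {}: median latency {} ms", round, sorted[sorted.len() / 2]);
        }
    }
}

Run it for long enough and the median latency of the surviving vaults drifts downward - which is the centralizing pressure I’m worried about.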

Is latency a resource or not? Is low latency a better resource than high latency? There’s not really a clear answer to me just yet.

Maybe batching will help.

Maybe parallelism will help.

But these are scaling via bandwidth and don’t actually improve latency, which seems unimprovable. Bandwidth improvements can hide the effects of latency but not improve latency itself. Maybe that’s good enough?!

And then there’s the concern of Peter Todd who claims you can never tell if redundancy has been achieved, and his proposal is to use latency measurements to correspond to geographical distribution (which is just one dimension of redundancy). So latency has a lot of interesting properties, some good some bad.

13 Likes

So for some reason I feel like I remember that those holding a requested chunk enter into a lottery together for the awarded Safecoin. I have no reference material to provide but I am going to take a gander. Edit: found this Safecoin Lottery System for Farming - #2 by dirvine

2 Likes

The lottery is about whether there’s a Safecoin reward for a GET, but it is the fastest node that will be rewarded if there is a reward. So @mav is correct there.

@mav I obviously don’t understand this as well as you do, so I thought randomness would mean a section will always be distributed - even if each member ends up being one of the fastest.

So maybe they would gravitate towards the nearest data centre rather than the same data centre?

That’s assuming the local data centre is sufficiently fast to exert enough osmotic pressure (and profitable enough compared to the network average, which is designed to keep lower cost consumer farmers in the game).

Love your thought experiments man - you give us really concrete issues to think about and create some really good opportunities for experiments on the test networks, and I of course am developing the ability to think about and solve these issues.

It’s going to be fun figuring this stuff out for real in love test nets. :slight_smile:

6 Likes

Just thinking out loud here…
Hmhmm especially in the early days it would make sense for a farmer to colocate his rig with the servers from MaidSafe… So… once we have a more or less centralized farming network, breaking out of this doesn’t seem to be incentivised… :thinking:

(Okay, I just wanted to start a calculation that should show the chances would be super low that he wouldn’t be involved in data delivery with at least 2 nodes… but actually he speeds up his competitors too, if he doesn’t slow down others on purpose…)

So for him to get an advantage in delivering data, at least 2 hops in a row need to be inside his data center (and only for him), so the advantage only exists when the node in the delivery chain directly after the vault storing the chunk is also located inside his data center.

Since data is delivered through XOR space, that probability is connected to the probability of having 2 nodes in the same group… so it’s like a long and slow attack to gain a majority in groups, isn’t it…?

… This attack seems to be subsidized by the network, and is something someone in the Telegram channel suggested could be problematic long term as well, when professional farmers optimize the process in ways that cause centralization we didn’t expect :roll_eyes:

Ps: hmhmm, so at first sight this benefit seems to me to be as high as the % of the SAFE network in that one location/area you are looking at… because that’s roughly the probability of having 1 faster hop than one located outside this region… (okay… multiplied by the probability of competing against people outside the region or yourself…)

1 Like

Thanks for the link. A quote from it:

Get request satisfied by farmer. A random (but deterministic) number is calculated (this is the hash of several items such as message_id requester ID and the ID of all the close group to the farmer) this number is called NUMBER.

So we calculate if(NUMBER % FR == 0) { do farming request (try and get a safecoin)

the % means modulo division so if NUMBER divided by FR has no remainder then a farming request happens. So this is the lottery.
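In code the check would look something like this - placeholder types and a placeholder hash, just to show the shape of the lottery, not the real routing code:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch of the farming lottery described in the quote above. The types,
// the hash function and the exact inputs are stand-ins, not the real code.
fn farming_attempt(message_id: u64, requester_id: [u8; 32], close_group: &[[u8; 32]], fr: u64) -> bool {
    // NUMBER = hash(message_id, requester ID, IDs of the farmer's close group)
    let mut hasher = DefaultHasher::new();
    message_id.hash(&mut hasher);
    requester_id.hash(&mut hasher);
    for id in close_group {
        id.hash(&mut hasher);
    }
    let number = hasher.finish();

    // The lottery: a farming request (an attempt to earn a safecoin)
    // only happens when NUMBER divides evenly by the farming rate FR.
    number % fr == 0
}

fn main() {
    let requester = [1u8; 32];
    let group = [[2u8; 32], [3u8; 32]];
    // With FR = 100, roughly 1 in 100 GETs should pass the lottery.
    println!("{}", farming_attempt(12345, requester, &group, 100));
}

Assuming NUMBER is roughly uniform, about one GET in every FR triggers an attempt, so the reward rate can be tuned by adjusting FR.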

The key is the first five words of the quote (‘Get request satisfied by farmer’), which seem to refer to a single farmer, depending on how ‘satisfied’ is interpreted.

  • If ‘satisfied’ is interpreted as ‘returned the chunk to the client’, it means a single farmer.
  • If ‘satisfied’ is interpreted as ‘stores the chunk’, it means all farmers holding the chunk.

I think the second option creates better decentralization incentives, since latency / racing is no longer important. But it also degrades the user experience for exactly the same reason.

There’s a lot still unanswered about how a farm request is actually satisfied / created / validated / finalized etc. But further considering the details of the two definitions of ‘satisfied’ opens a pretty big can of worms. I’m very interested to see how the farm attempt aspect of safecoin evolves.

“It’s going to be fun figuring this stuff out for real in love test nets.”

Great typo :slight_smile:


If anyone else is interested in profiling their files, the tool source code and binary executables are available on GitHub.
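The core of it is only a few lines; this isn’t the actual tool from the repo, just a sketch of the same idea using the chunking rules discussed above and nothing but the standard library:

use std::fs;
use std::io;
use std::path::Path;

// Sketch of a chunk-size profiler along the lines of the linked tool
// (not the actual tool). Walks a directory tree and buckets the chunks
// each file would produce under the splitting rules in this thread.
fn visit(dir: &Path, buckets: &mut [u64; 11]) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        if path.is_dir() {
            // Ignore unreadable subdirectories rather than aborting the scan.
            let _ = visit(&path, buckets);
        } else if let Ok(meta) = entry.metadata() {
            bucket_chunks(meta.len(), buckets);
        }
    }
    Ok(())
}

fn bucket_chunks(file_size: u64, buckets: &mut [u64; 11]) {
    const KB: u64 = 1024;
    const MB: u64 = 1024 * KB;
    let chunk_sizes: Vec<u64> = if file_size < 3 * KB {
        vec![file_size]
    } else if file_size <= MB {
        vec![file_size / 3; 3]
    } else {
        let mut v = vec![MB; (file_size / MB) as usize];
        if file_size % MB > 0 {
            v.push(file_size % MB);
        }
        v
    };
    for size in chunk_sizes {
        // Buckets 0..9 are 100 KB wide; bucket 10 collects everything >= 1000 KB.
        let idx = ((size / (100 * KB)) as usize).min(10);
        buckets[idx] += 1;
    }
}

fn main() -> io::Result<()> {
    let mut buckets = [0u64; 11];
    let home = std::env::var("HOME").unwrap_or_else(|_| ".".into());
    visit(Path::new(&home), &mut buckets)?;
    for (i, count) in buckets.iter().enumerate() {
        if i < 10 {
            println!("{:>4}-{:<4} KB  {}", i * 100, (i + 1) * 100, count);
        } else {
            println!("1000+     KB  {}", count);
        }
    }
    Ok(())
}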

7 Likes

–Minimum of several hops between client and data holder, each way potentially having some different nodes

–Hops are through XOR space, using some data-center nodes, some desktop, etc., located in random locations anywhere in the world, each having different bandwidths and latency themselves. Might it occasionally happen that all nodes of a series of hops are in the same data center? Sure, but likely no one will ever know, especially if multiple data centers are involved, plus lots of home computers.

–Data centers, having high bandwidth and (likely but not certainly) lower latency, will also contribute randomly to the path both ways, as will relay nodes of other capabilities. There is, again, no way to know the path, and paths will constantly change, since each action takes different paths to different endpoints as requests change.

It seems to me that these factors will level the playing field a lot. If my node has more latency but happens to hop through all data-center intermediaries, I’ll likely win, if it’s “first received wins”. To the degree that vaults are being maintained in data centers, overall they may get more safecoin. But then again, they are actually contributing more resources and that’s just fair. The network has no way of distinguishing one node from another in terms of real-world location beyond the first hop. Those contributing home resources at low marginal cost would still be incentivized to use their spare resources for the lesser reward, as opposed to nothing at all, taking advantage of all that nifty, high-powered bandwidth and low latency of the big boys along the way. Likewise, the big boys are forced to pass their data through other, likely slower, nodes along the way, resulting in slower response times.

Even if, in the long run, data centers all over the world winnow out lesser nodes, we’ve still got a bullet-proof network, cheap to use if not free. Unless, of course, the Data Center Union International decides to go on strike one day, perhaps exploding the network, and with it their successful business model. Not likely in a decentralizing world.

3 Likes

I think I’ve read (an explanation by David) that it is delivery of a chunk to the Data Managers that satisfies the ‘Get request satisfied by a farmer’ which @mav quoted, and not delivery to the client (as @fergish assumed).

No takes for this?

1 Like

Thanks. Of course, this makes sense. Doesn’t really change much in terms of my point above, though. Demands for different data in XOR space make the landscape a churn that high capacity nodes in data centers only contribute to.

2 Likes

The datamap itself could be treated differently based on the app’s preference.
For example, if the datamap is tiny (for a small original file), it can be inserted into an already existing MD, i.e. multiple datamaps using only one MD, hence no overhead for each individual immutable file.
If the datamap is bigger, it can be treated as another immutable file and stored as an ImmutableData chunk.

You can even choose not to upload the datamap at all, and keep it somewhere else just as a normal serialized file. :slight_smile:
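Roughly, in (invented) types - just to illustrate the options, not our actual client API:

// Invented types and threshold, purely to illustrate the options above;
// this is not the actual client API.
enum DataMapLocation {
    // Tiny datamap: tucked into an entry of an already existing MD, no extra chunk.
    InlineInMd(Vec<u8>),
    // Bigger datamap: stored as its own ImmutableData chunk, with only the
    // chunk's name recorded in an MD entry.
    AsImmutableData { chunk_name: [u8; 32] },
    // Not uploaded at all: kept locally as an ordinary serialized file.
    LocalFile(String),
}

// Hypothetical size threshold for inlining; a real client could use any rule.
const INLINE_LIMIT: usize = 1024;

fn place_datamap(serialized: Vec<u8>) -> DataMapLocation {
    if serialized.len() <= INLINE_LIMIT {
        DataMapLocation::InlineInMd(serialized)
    } else {
        // In a real client this would be the XOR name of the stored chunk;
        // here it's just a placeholder derived from the bytes.
        let mut name = [0u8; 32];
        for (i, b) in serialized.iter().enumerate() {
            name[i % 32] ^= *b;
        }
        DataMapLocation::AsImmutableData { chunk_name: name }
    }
}

fn main() {
    let tiny = place_datamap(vec![0u8; 100]);
    let big = place_datamap(vec![0u8; 100_000]);
    match (tiny, big) {
        (DataMapLocation::InlineInMd(_), DataMapLocation::AsImmutableData { .. }) => {
            println!("tiny datamap inlined, big datamap stored as its own chunk");
        }
        _ => println!("unexpected"),
    }
}

Under the second option each file costs at least one extra (usually small) chunk for its datamap, on top of the data chunks themselves.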

2 Likes

Yes, but all applications managing files with NFS necessarily store data maps as Immutable Data chunks in the network. For example, this is the case with WHS. I consider this a regression because this wasn’t the case previously (see MaidSafe Dev Update - February 23, 2017 - #72 by tfa).

2 Likes

Ah, I only mentioned how the datamap can be treated, but didn’t say how exactly our current client treats it…
The code explains it: the datamap file itself is stored as ImmutableData, and its name is stored in another MD, allowing you to fetch the content later.

4 Likes

There is of course some randomness in the latency, since the nodes of a section/group are scattered across the globe. I would expect that there will not be a consistent relationship between a vault’s latency to a specific location and success; rather, success depends on the latency to the quorum nodes that vote on which vault is first.

For one request the quorum of nodes might be located closer to USA than the UK so vaults in the USA have an advantage. But then the next request might see the quorum of nodes closer to the UK than USA so the vaults in the UK have an advantage.

And of course the make up of the section will be changing over time so where one vault has an advantage due to its latency on a certain day, the next day might see another vault have an advantage.

Also vaults are not necessarily going to win the router war that goes on in a data centre either. So while a vault might have 12 ms latency to the border router on one packet, it might have 20 ms on the next, simply due to the quantity of packets flowing through the router from the thousands of machines behind that border router, serving data for many purposes.

Basically I am saying that low minimum latency to the border router of a data centre or an ISP is not a guarantee of getting a majority of ‘being the first’. It may help, but is it enough to justify the cost? Will it cause the home user with +2 or +5 ms average latency to not succeed?

2 Likes

Thank you @qi_ma, I hadn’t thought of storing a datamap in different places like that - useful to think about. For now though I’m interested in what happens with safeNfs so…

This I know - my interest is in the overheads involved here for different-sized files, pertinent to @mav’s categories in the OP. So what’s the smallest overhead (0 chunks or 1 chunk?) and how does this change for ever larger files?

1 Like