Testing out the December 2020 testnet, just uploading and downloading some stuff and recording how it goes. The network isn’t stable yet but I wanted to have a go anyway.
Summary
Uploads run at around 3-4 Mbps for small files (up to 5 MB) and slow rapidly from there. 160 MB was the largest file I could upload, and it took 45m.
Downloads are faster than uploads, but only when using `safe cat` (5.5 Mbps); `safe files get` goes very slowly (0.3 Mbps). The download speed seemed quite consistent compared to the upload speed, though.
Setup
Toshiba laptop
Intel(R) Core™ i7-8565U CPU @ 1.80GHz (4 cores, 8 threads with hyperthreading)
MemTotal: 16088828 kB (16 GB) with 980 MB swap
Storage: 467 GB SSD
Node 0.25.18 commit e51fcfcd7399e3b6cbc40ef37af7ccecc80f9665 compiled with musl
```rust
pub const RECOMMENDED_SECTION_SIZE: usize = 10;
pub const ELDER_SIZE: usize = 5;
```
`safe` and `authd` via sn_api 0.17.0 commit f2589e047a5ac68834f6d0d3ce49a32a0e7ddab0, compiled with gcc (musl does not compile)
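For reference, the two builds would look roughly like this; the target triple is an assumption on my part, and this is a sketch rather than the exact commands used:

```bash
# Node: the musl build works (sketch; target triple assumed).
rustup target add x86_64-unknown-linux-musl
cargo build --release --target x86_64-unknown-linux-musl

# sn_api (safe and authd): plain gcc/glibc build, since musl fails here.
cargo build --release
```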
Uploads
Data was uploaded using this command:
```bash
dd if=/dev/urandom of=/tmp/2mb.dat bs=1M count=2; time safe files put /tmp/2mb.dat
```
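Each size was timed individually; a loop along these lines could reproduce the whole sweep (the size list and paths are illustrative, not the exact script used):

```bash
# Sweep over the sizes in the table below (MiB) and time each put.
for size in 1 2 3 4 5 6 7 8 9 10 20 40 80 160; do
  dd if=/dev/urandom of=/tmp/${size}mb.dat bs=1M count=${size}
  echo "=== ${size} MiB ==="
  time safe files put /tmp/${size}mb.dat
done
```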
| upload size (MiB) | PUT time (s) | PUT bandwidth (Mbps) |
| --- | --- | --- |
| 1 | 2.625 | 3.20 |
| 2 | 4.637 | 3.62 |
| 3 | 6.575 | 3.83 |
| 4 | 9.536 | 3.52 |
| 5 | 13.462 | 3.12 |
| 6 | 18.686 | 2.69 |
| 7 | 24.314 | 2.42 |
| 8 | 31.951 | 2.10 |
| 9 | 40.652 | 1.86 |
| 10 | 53.757 | 1.56 |
| 20 | 121.26 | 1.38 |
| 40 | 306.39 | 1.10 |
| 80 | 885.999 | 0.76 |
| 160 | 2775.349 | 0.48 |
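The bandwidth column is consistent with treating the size as MiB and the time as wall-clock seconds:

```bash
# bandwidth (Mbps) = size_MiB * 2^20 * 8 bits / time_s / 10^6
# e.g. the 160 MiB row:
echo "160 * 1048576 * 8 / 2775.349 / 1000000" | bc -l   # ≈ 0.4836 -> 0.48
```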
The detailed timings for uploading 160 MB were:
```
real    46m15.349s
user    45m50.452s
sys     0m7.495s
```
I tried 320 MB, but after 3h it didn't seem to be making any progress (no new data being added to nodes, but CPU still being consumed; see Observations below for more thoughts).
Downloads
Data was downloaded using `safe files get <filescontainer_xorurl>` and `safe cat <data_xorurl>`.
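Each row of the table below was timed roughly like this (the XOR-URLs are placeholders for the ones printed by `safe files put`):

```bash
container="safe://..."   # FilesContainer XOR-URL (placeholder)
file="safe://..."        # direct data XOR-URL (placeholder)

time safe files get "$container"           # resolves via the filescontainer
time safe cat "$file" > /tmp/cat-out.dat   # dumps the raw bytes
```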
Memory consumption began chewing up swap (all memory was consumed) when downloading 20 MB with `safe files get`, so I stopped testing at that point.
| file size (MiB) | get time (s) | get bandwidth (Mbps) | cat time (s) | cat bandwidth (Mbps) |
| --- | --- | --- | --- | --- |
| 1 | 10.828 | 0.77 | 1.548 | 5.42 |
| 2 | 39.153 | 0.43 | 2.946 | 5.69 |
| 3 | 79.941 | 0.31 | 4.574 | 5.50 |
| 4 | 105.17 | 0.32 | 6.061 | 5.54 |
| 5 | 137.227 | 0.31 | 7.678 | 5.46 |
| 6 | 168.656 | 0.30 | 9.858 | 5.11 |
| 7 | 204.177 | 0.29 | 12.451 | 4.72 |
| 8 | 182.854 | 0.37 | 12.296 | 5.46 |
| 9 | 225.214 | 0.34 | 12.973 | 5.82 |
| 10 | 234.98 | 0.36 | 14.422 | 5.82 |
Observations
CPU usage was much higher during download than during upload: fetching data spent most of the time at 100% CPU, while uploading was mostly at around 75-95% CPU.
There was no issue with memory consumption during uploads. Memory use started to climb during the `safe files get` tests, and at the 20 MB mark started using swap (i.e. all memory was full).
`safe files get` finds the file data by using the filescontainer, so it has some initial work to do before the real data can be fetched; `cat` simply dumps the raw data without having to look through the filescontainer first. `safe files get` uses a different safe:// URL than `safe cat`.
The difference between `get` and `cat` is so big I was very dubious, but the sha256sum of both results was identical to that of the uploaded file. The difference is not the initial metadata fetch, since the prompt to overwrite an existing file arrives very fast. `get` does not seem to be simply `cat_filescontainer+cat_filenames`.
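The integrity check was just a hash comparison along these lines (paths are illustrative):

```bash
# All three digests should be identical for a given file size.
sha256sum /tmp/10mb.dat          # the original random data
sha256sum ./10mb.dat             # as fetched by safe files get
sha256sum /tmp/cat-out.dat       # as dumped by safe cat
```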
Each test is only one sample, so there may be some variation which isn't accounted for in these results.
I really like the download progress meter on `safe files get`.
Also some nice info messages from the node, great to see:
```
[src/node/elder_duties/key_section/transfers/mod.rs:325] Computing StoreCost for 1048612 bytes
[src/node/elder_duties/key_section/transfers/mod.rs:328] Got StoreCost 0.004454644
[src/capacity/rate_limit.rs:74] Checking network storage
[src/capacity/rate_limit.rs:78] Total number of adult nodes: 6.0
[src/capacity/rate_limit.rs:79] Number of Full adult nodes: 0.0
[src/capacity/rate_limit.rs:80] Section storage usage ratio: 0.0
```
In the later part of a 3h upload, the CPU load would loop in a pattern of 100% for a while (~10s), then 50% (~30s), then 12.5% (~30s) (i.e. 8 cores, 4 cores, 1 core). It feels like the nodes are doing a long-running thing, with some finishing sooner but blocking until all nodes are done. Then, once all nodes are done, they all start a new loop (maybe a new chunk? No: I watched the total file size in all nodes, as sketched below, and the CPU load pattern did not match any change in the total size of the files). Probably the same loop is happening early in the upload but is much faster, so the pattern isn't obvious.

Why would it be stretching out later in the process? My guess is there's a list somewhere that isn't being cleared properly or sorted optimally, and each new chunk uploaded creates more and more work due to the expanding list of previous chunks…? Maybe a list of errors that, once it gets long enough, becomes very hard to clear? Something list-related, I reckon, but just a hunch.
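The file-size watching mentioned above was just polling disk usage across the node data directories; the path here is an assumption about a local test network layout, not confirmed from the run:

```bash
# Poll the combined on-disk size of all local nodes every 2s,
# while watching CPU load in another terminal (e.g. with htop).
watch -n 2 'du -sh ~/.safe/node/baby-fleming-nodes/'
```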
When the client process was cancelled after 3h, the nodes continued to loop on the CPU cycle. The nodes had to be killed with `safe node killall`.
I saw some `error: Data(DataExists)` errors in the node logs after about 3h of uploading:
```
[src/node/mod.rs:212] Running as Elder: RunAsKeySection(RunAsGateway(FindClientFor(MsgEnvelope { message: CmdError { error: Data(DataExists), id: MessageId(93a334..), correlation_id: MessageId(511d0c..), cmd_origin: Client(0afef0..) }, origin: MsgSender { entity: Entity::Ed25519(5a6ebb..)AdultNode(ChunkStorage), sig: Some(EntitySignature::Node::Ed25519(..)) }, proxies: [] })))
```
Also seeing `error sending client message`:
```
[src/node/elder_duties/key_section/client/client_msg_handling.rs:209] There was an error sending client message on the stream: WriteError(ConnectionClosed(TimedOut))
```
I restarted the network a few times. It seems like a less full network is faster than a more full network, so maybe some stuff isn't getting purged correctly? E.g. it's much slower uploading 8 MB after 1, 2, 3, …, 7 MB were done first than when uploading just 7 MB then 8 MB (31s vs 17s). Another example: uploading 20 MB after 1, 2, 3, …, 9, 10 MB took 121s, but on a fresh network it took 31s. This is a big difference and a good place to investigate further.
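For concreteness, the 8 MB comparison looks roughly like this; `safe node run-baby-fleming` is an assumption here for how the local network was relaunched:

```bash
# Run A (accumulated work): upload 1..7 MiB first, then time an 8 MiB put.
for size in 1 2 3 4 5 6 7; do
  dd if=/dev/urandom of=/tmp/${size}mb.dat bs=1M count=${size}
  safe files put /tmp/${size}mb.dat
done
dd if=/dev/urandom of=/tmp/8mb.dat bs=1M count=8
time safe files put /tmp/8mb.dat   # ~31s here

# Run B (fresh network): restart, then upload only 7 MiB followed by 8 MiB.
safe node killall
safe node run-baby-fleming         # assumed local-network launch command
safe files put /tmp/7mb.dat
time safe files put /tmp/8mb.dat   # ~17s here
```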
Summary 2
Downloading with `cat` is much faster than with `get`.
Uploading gets slower with larger files. The best upload speed was for the 3 MB file, which got 3.8 Mbps. It seems like this should be more consistent; while some part of it is to do with not using a fresh network for each file, the upload rate drops for large files even on a fresh network. A 160 MB file uploaded in about 45m, but a 320 MB file did not complete and was manually aborted after 3h.
Fresh networks are faster. But running every test on a fresh network seems absurd, so it might be worth investigating the inconsistent performance as work accumulates.
Next Steps
Run the tests with greater logging and a better investigation of those logs. It takes quite a long time to get familiar with the patterns and meaningful parts of the logs, but it feels like an essential first step to at least see what info we already have and whether there are useful clues in there. Especially the consensus parts.
Look into why `cat` is so much faster than `get`. This gives a clear codepath to follow and could give a clue to a broader issue along the way.
I'd love to set up a logger on all the lists and maps that are kept by nodes, to see if any of them are growing in a way that doesn't match their purpose. It's a pretty detailed job, but my hunch is this is a good place to look.
See whether the performance issue is more likely to be with the nodes or with authd. I suspect the nodes (because of the CPU usage), but it would be good to confirm.