Profiling node performance

A simple benchmark for alpha-2 so we can compare with later networks.

Uploading and fetching 100 ImmutableData and 100 MutableData took me 6m39s.

This was done by running client_stress_test repeatedly; 6m39s is the average over 4 runs (6m40s, 6m38s, 6m43s, 6m35s).

If anyone else can run the client_stress_test on alpha 2 and put some timings up that’d be cool (but it requires modifying code in client_stress_test so I’m not really expecting it).

3 Likes

Forgive me, I am not sure of a couple of things with the stress test.

Are we talking about 400 MB total transferred (200 MB up & 200 MB down)?

Do you have a figure for just the upload time, or is it upload one chunk then download it?

It looks like the 200 chunks down were sequential (receive one, then request the next). Is that correct?

If it's 400 MB and sequential, then that represents approx 1000 KB/s (1 MB/s).

2 Likes

Yes, it's all sequential: upload one chunk, then download it, then upload the next, etc. I only have timing for the whole process, not split into upload / download / per chunk.

My understanding of the amount of data is:
each MutableData has an empty BTreeMap so is very small (<1 KiB)
each ImmutableData is 4 KiB, generated with rng.gen_iter().take(1024), ie 1024 × u32 (4 bytes each)
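So each iteration puts a tiny MutableData plus a 4 KiB ImmutableData payload. A rough sketch of just the payload construction (using the current rand API rather than the old gen_iter; this is not the actual client_stress_test code):

// Sketch of the test data sizes: an empty BTreeMap for the MutableData
// entries, and 1024 random u32 values (4 bytes each = 4 KiB) for the
// ImmutableData payload. Requires the rand crate.
use rand::Rng;
use std::collections::BTreeMap;

fn main() {
    let entries: BTreeMap<Vec<u8>, Vec<u8>> = BTreeMap::new(); // MutableData entries, empty
    let mut rng = rand::thread_rng();
    let payload: Vec<u32> = (0..1024).map(|_| rng.gen()).collect(); // ImmutableData payload

    println!("MutableData entries: {}", entries.len()); // 0
    let bytes = payload.len() * std::mem::size_of::<u32>();
    println!("ImmutableData payload: {} bytes", bytes); // 4096 bytes = 4 KiB
}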

2 Likes

Testing single vault speed vs other file storage protocols

safe is the CLI client used to upload files to safe_vault v0.19.2.
dd copies raw data from one place to another, with no encryption.
scp copies files over ssh (here, to a local VM).
rsync transfers files remotely and also (like scp) uses ssh to encrypt the data over the wire.

Setup

Locally running on laptop (i7-8565U @ 1.80 GHz)

safe_vault 0.19.2 (2019-09-05)
safe-authenticator 0.3.0 (2019-09-05)
safe-cli 0.5.0 (2019-10-16)

Tests requiring remote upload (scp, rsync) use a VM in VirtualBox with a host-only network adapter.

Tests

1 MiB

dd if=/dev/urandom of=/tmp/onemeg.bin bs=1M count=1
1 MiB in 0.0105095s
scp /tmp/onemeg.bin test@localhost:/tmp/new/onemeg.bin
1 MiB in 0.534s
rsync /tmp/onemeg.bin test@localhost:/tmp/new/onemeg.bin
1 MiB in 0.621s
safe files put /tmp/onemeg.bin
1 MiB in 0.214s

10 MiB

dd ... count=10
10 MiB in 0.111802s
scp ...
10 MiB in 0.569s
rsync ...
10 MiB in 0.625s
safe files put ...
10 MiB in 0.576s

100 MiB

dd ... count=100
100 MiB in 0.645761s
scp ...
100 MiB in 1.301s
rsync ...
100 MiB in 1.544s
safe files put ...
100 MiB in 4.016s

200 MiB

dd ... count=200
200 MiB in 1.17017s
scp ...
2.181s
rsync ...
2.656s
safe files put ...
7.613s

300 MiB

dd ... count=300
300 MiB in 1.78146s
scp ...
3.107s
rsync ...
4.050s
safe files put ...
11.391s

400 MiB

dd ... count=400
400 MiB in 2.3287s
scp ...
4.195s
rsync ...
4.934s
safe files put ...
15.270s

499 MiB

dd ... count=499
499 MiB in 3.10294s
scp ...
4.726s
rsync ...
6.440s
safe files put ...
18.898s

Results

Average upload speed (mean of the per-size throughputs above; see the sketch below)

dd: 1212 Mbps
scp: 582 Mbps
rsync: 468 Mbps
safe: 182 Mbps
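These averages come from converting each (size, time) pair to Mbps, where Mbps = (MiB × 2^20 × 8) / (seconds × 10^6), then taking the mean across the seven sizes. A quick sketch reproducing the safe figure from the timings above:

// Convert each safe `files put` timing to Mbps, then average across sizes.
fn mbps(mib: f64, secs: f64) -> f64 {
    mib * 1024.0 * 1024.0 * 8.0 / 1_000_000.0 / secs
}

fn main() {
    // (size in MiB, time in seconds) for `safe files put`, from the tests above
    let safe_puts = [
        (1.0, 0.214),
        (10.0, 0.576),
        (100.0, 4.016),
        (200.0, 7.613),
        (300.0, 11.391),
        (400.0, 15.270),
        (499.0, 18.898),
    ];
    let avg = safe_puts.iter().map(|&(m, s)| mbps(m, s)).sum::<f64>() / safe_puts.len() as f64;
    println!("safe average: {:.0} Mbps", avg); // ~182 Mbps
}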

No conclusions or insight or concerns. Just some numbers. And of course a chart.


Quirks

You may have noticed the test uses 499 MiB as the highest value. When uploading a 500 MiB file, things get funky.

The first obvious symptom is that the upload now takes 6m15s regardless of whether the file is 500 MB, 5 GB, or any other size over 500 MB.

The log messages indicate timeout errors (from safe-api, ie the client).

[2019-11-03T22:19:42Z INFO  safe_api::api::files] Reading files from /tmp/500meg.bin

[2019-11-03T22:19:42Z DEBUG safe_api::api::files] Metadata for location: Metadata { file_type: FileType(FileType { mode: 33204 }), is_dir: false, is_file: true, permissions: Permissions(FilePermissions { mode: 33204 }), modified: Ok(SystemTime { tv_sec: 1572819581, tv_nsec: 316437266 }), accessed: Ok(SystemTime { tv_sec: 1572819578, tv_nsec: 940467113 }), created: Err(Custom { kind: Other, error: "creation time is not available on this platform currently" }) }

[2019-11-03T22:19:42Z INFO  safe_api::api::files] Processing /tmp/500meg.bin...

... 10s pass

[2019-11-03T22:19:52Z DEBUG quic_p2p::utils] ERROR in communication with peer 127.0.0.1:56281: Connection(ApplicationClosed { reason: ApplicationClose { error_code: 0, reason: b"" } }) - Connection Error: closed by peer: 0. Details: Incoming streams failed

... 2m57s pass

[2019-11-03T22:22:49Z INFO  safe_api::api::files] Skipping file "/tmp/500meg.bin". [Error] NetDataError - Failed to PUT Published ImmutableData: CoreError(Request has timed out - CoreError::RequestTimeout)

[2019-11-03T22:22:49Z DEBUG safe_api::api::safe_client_libs] Putting appendable data w/ type: 1100, xorname: None

[2019-11-03T22:22:49Z ERROR safe_core::connection_manager::connection_group] Unexpected challenge, expected response (Custom("invalid value: integer `32`, expected variant index 0 <= i < 2")).

... 3m0s pass

[2019-11-03T22:25:49Z DEBUG safe_core::event_loop] Exiting Core Event Loop

[2019-11-03T22:25:49Z DEBUG quic_p2p::event_loop] Exiting QuicP2p Event Loop

[2019-11-03T22:25:49Z ERROR safe] safe-cli error: [Error] NetDataError - Failed to PUT Sequenced Append Only Data: CoreError(Request has timed out - CoreError::RequestTimeout)

It looks like 500meg.bin fails to upload because a connection closes after 10s (19:42 → 19:52), then a request timeout error related to that failure shows up after another 2m57s (19:52 → 22:49); or is it 3m7s from the start (19:42 → 22:49)?

A second piece of data (the appendable data) is then PUT, and a strange 'unexpected challenge' error shows up.

I tried to dig deeper but for now it’s diminishing returns.

I didn't find any timeout constant near 6 minutes; I'm guessing the 6m15s is two consecutive failures of 3m each, as per REQUEST_TIMEOUT_SECS (2 × 180s = 360s).

const CONNECTION_TIMEOUT_SECS: u64 = 30;

pub const REQUEST_TIMEOUT_SECS: u64 = 180;

const MAX_ATTEMPTS: usize = 10;

I didn't find anything in the code specifically related to 500 MB other than a changelog message in safe_vault 0.10.3 from back in mid-2016, and since that code is no longer in the codebase I'm certain that particular change is unrelated.

The 499 vs 500 MiB threshold caused the same issue on both my desktop and my laptop, which suggests it's size related rather than time related: if it were time related, the faster desktop would more likely have hit the error at around 550 MB or so.

To reproduce:

$ dd if=/dev/urandom of=/tmp/499meg.bin bs=1M count=499
$ RUST_LOG=debug ./safe_vault
... in second terminal run the authenticator
$ RUST_LOG=debug ./safe_auth --test-coins
$ RUST_LOG=debug ./safe_auth --allow-all-auth --daemon 41805
... in a third terminal run the upload
$ RUST_LOG=debug ./safe auth
$ time RUST_LOG=debug ./safe files put /tmp/499meg.bin
17 Likes

Not checked (sorry, just time), but I think we still have a size limit in self_encryption that needs to be removed.

4 Likes

Oops, looks like it's 1 GB (self_encryption/lib.rs at master · maidsafe/self_encryption · GitHub). But in any case we need to remove the limit.

4 Likes

Maybe some combo of these…?

self_encryption/src/lib.rs

/// The maximum size of file which can be self-encrypted, defined as 1GB.
pub const MAX_FILE_SIZE: usize = 1024 * 1024 * 1024;
/// The maximum size (before compression) of an individual chunk of the file, defined as 1MB.
pub const MAX_CHUNK_SIZE: u32 = 1024 * 1024;
/// The minimum size (before compression) of an individual chunk of the file, defined as 1kB.
pub const MIN_CHUNK_SIZE: u32 = 1024;
/// Controls the compression-speed vs compression-density tradeoffs.  The higher the quality, the
/// slower the compression.  Range is 0 to 11.
pub const COMPRESSION_QUALITY: i32 = 6;

and

self_encryption/src/sequencer.rs

pub const MAX_IN_MEMORY_SIZE: usize = 50 * 1024 * 1024;
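For reference, a 500 MiB file is well under MAX_FILE_SIZE but is 500 chunks at MAX_CHUNK_SIZE and 10× MAX_IN_MEMORY_SIZE. A quick back-of-envelope sketch of that arithmetic (just arithmetic, not a claim about where the failure actually is):

// Back-of-envelope check of a 500 MiB upload against the constants quoted above.
const MAX_FILE_SIZE: usize = 1024 * 1024 * 1024; // 1 GB
const MAX_CHUNK_SIZE: usize = 1024 * 1024; // 1 MB
const MAX_IN_MEMORY_SIZE: usize = 50 * 1024 * 1024; // 50 MB

fn main() {
    let file: usize = 500 * 1024 * 1024; // 500 MiB
    println!("under MAX_FILE_SIZE? {}", file <= MAX_FILE_SIZE); // true
    println!("chunks at MAX_CHUNK_SIZE: {}", file / MAX_CHUNK_SIZE); // 500
    println!("multiples of MAX_IN_MEMORY_SIZE: {}", file / MAX_IN_MEMORY_SIZE); // 10
}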
2 Likes

Yip, we are probably not flushing this out and we should be. That library was a lot simpler but was made a bit more complex. It's not a huge job to clean it up, and these tests really help @mav, thanks for that.
BTW, just added Remove size limits · Issue #253 · maidsafe/self_encryption · GitHub

7 Likes

Glad it’s helpful.

I'm hoping it's just params in self_encryption. I would test, but setting up the layers of dependencies from safe-app > safe_core > self_encryption is a bit much just now; maybe later I'll set this up to test modified self_encryption parameters.

One thing that sorta got me though was this:

ERROR safe_core::connection_manager::connection_group] Unexpected challenge, expected response (Custom("invalid value: integer 32, expected variant index 0 <= i < 2")).

If this is ultimately because of a parameter in self_encryption… that's sorta weird. The error seems very far removed from the source. I know you guys are pumping hard at the moment so I'll look into this later as well, but I'm just flagging that it seemed an odd way for the error to show itself.

3 Likes

This is really good. Just to check something here: was this all compiled with --release? I just ask as debug is about 10x slower with crypto.

Also an interesting thing to think of: when we did self encrypt tests way back we had interesting results. We could write large files much faster than just storing them (copying etc.) to network drives. The reason was that self_encryption wrote many chunks in parallel (to SSD), whereas the stream of the (unencrypted) file could only write one chunk at a time (the whole file, basically).
We should be able to upload in parallel with quic-p2p; well, not really in parallel, but QUIC does stream multiplexing. So we can perhaps get an increase there as well: if a channel is blocking/slow the others can hopefully finish faster, and so on. What I mean is we won't send only down slow channels.
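Conceptually the parallel chunk write is something like this (a toy illustration with made-up file paths, not the actual self_encryption code):

// Write each chunk from its own thread instead of streaming one file
// sequentially; with an SSD the chunk writes can overlap.
use std::{fs, thread};

fn main() -> std::io::Result<()> {
    // eight fake 1 MiB chunks standing in for self-encrypted chunks
    let chunks: Vec<Vec<u8>> = (0..8).map(|i| vec![i as u8; 1024 * 1024]).collect();

    let handles: Vec<_> = chunks
        .into_iter()
        .enumerate()
        .map(|(i, chunk)| thread::spawn(move || fs::write(format!("/tmp/chunk_{}.bin", i), chunk)))
        .collect();

    for h in handles {
        h.join().expect("writer thread panicked")?;
    }
    Ok(())
}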

8 Likes

Yes, we need @lionel.faber and the guys to look as well. @mav I reckon we may have a bug (at least in error reporting) as well as the self_encryption limits.

4 Likes

Good points. Yes it’s compiled with --release

Agreed about the performance. It's very satisfying to see it already so close to the existing protocols; I think it'll be able to reach extremely fast speeds once that becomes the focus.

11 Likes

It'd be good to try that at 0, actually.

2 Likes

Moved to dev topic to discuss this. I think it’s not self_encryption causing this error.

5 Likes

@mav this error is due to the Phase 1 limitation that a particular client can have only a single connection with the vault. A new connection sends a challenge request, and since a connection is already available it returns an error saying Unexpected challenge.
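To picture it, here is a purely hypothetical sketch of that single-connection rule (these types are made up for illustration and are not the real safe_core or quic_p2p API): once a connection is established, only response frames are expected on it, so a fresh challenge there is rejected.

// Illustrative only: on an established connection a Challenge is unexpected.
enum Frame {
    Challenge(Vec<u8>),
    Response(Vec<u8>),
}

fn handle(already_connected: bool, frame: Frame) -> Result<Vec<u8>, String> {
    match (already_connected, frame) {
        (false, Frame::Challenge(c)) => Ok(c), // handshake path: sign and return it
        (true, Frame::Response(r)) => Ok(r),   // normal request/response path
        (true, Frame::Challenge(_)) => Err("Unexpected challenge, expected response".into()),
        (false, Frame::Response(_)) => Err("Unexpected response before handshake".into()),
    }
}

fn main() {
    println!("{:?}", handle(false, Frame::Challenge(b"nonce".to_vec()))); // handshake is fine
    println!("{:?}", handle(true, Frame::Response(b"reply".to_vec())));   // normal traffic is fine
    println!("{:?}", handle(true, Frame::Challenge(b"nonce".to_vec())));  // the error seen above
}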

13 Likes

To check I understand this correctly, are you saying the “Incoming streams failed” error is causing an attempt to reconnect for the second file (the appendable data w/ type: 1100), even though the connection is already available?

I’m just wondering if a) the connection should be closed after the stream failed, or if b) the second file should be trying to reuse the existing connection?

Also, maybe for a later time: is there a way to make this error more meaningful? Maybe that's not possible in this particular case, but I wanted to ask since this is part of the user experience (admittedly not a common user's).

Not really expecting a detailed reply from you, I know this is a very busy time for everyone at maidsafe so please only treat a response as very low priority!

edit: also found the reason for the 500 MB limit; it's in quic-p2p, see this dev forum post for details

9 Likes

Testing out the December 2020 testnet, just uploading and downloading some stuff and recording how it goes. The network isn’t stable yet but I wanted to have a go anyway.

Summary

Uploads run at around 3-4 Mbps for small files (up to 5 MB) and slow rapidly from there. 160 MB was the largest file I could upload, and it took 45m.

Downloads are faster than uploads, but only when using safe cat (5.5 Mbps); safe files get is very slow (0.3 Mbps). Download speed did seem quite consistent compared to upload speed, though.

Setup

Toshiba laptop
Intel(R) Core™ i7-8565U CPU @ 1.80GHz (4 cores, hyperthreaded for 8 processes)
MemTotal: 16088828 kB (16 GB) with 980 MB swap
Storage: 467 GB SSD

Node 0.25.18 commit e51fcfcd7399e3b6cbc40ef37af7ccecc80f9665 compiled with musl
pub const RECOMMENDED_SECTION_SIZE: usize = 10;
pub const ELDER_SIZE: usize = 5;

safe and authd via sn_api 0.17.0 commit f2589e047a5ac68834f6d0d3ce49a32a0e7ddab0 compiled with gcc (musl does not compile)

Uploads

Data was uploaded using this command:

dd if=/dev/urandom of=/tmp/2mb.dat bs=1M count=2; time safe files put /tmp/2mb.dat

upload size (MiB)   PUT time (s)   PUT bandwidth (Mbps)
1                   2.625          3.20
2                   4.637          3.62
3                   6.575          3.83
4                   9.536          3.52
5                   13.462         3.12
6                   18.686         2.69
7                   24.314         2.42
8                   31.951         2.10
9                   40.652         1.86
10                  53.757         1.56
20                  121.26         1.38
40                  306.39         1.10
80                  885.999        0.76
160                 2775.349       0.48

The detailed timings for uploading 160 MB were:

real  46m15.349s
user  45m50.452s
sys   0m7.495s

I tried 320 MB but after 3h it didn’t seem to be making any progress (no new data being added to nodes, but cpu still being consumed, see observations below for more thoughts).

Downloads

Data was downloaded using
safe files get <filescontainer_xorurl>
and
safe cat <data_xorurl>

Memory consumption began chewing into swap (all RAM was consumed) when downloading 20 MB with safe files get, so I stopped testing at that point.

file size (MiB)   get time (s)   get bandwidth (Mbps)   cat time (s)   cat bandwidth (Mbps)
1                 10.828         0.77                   1.548          5.42
2                 39.153         0.43                   2.946          5.69
3                 79.941         0.31                   4.574          5.50
4                 105.17         0.32                   6.061          5.54
5                 137.227        0.31                   7.678          5.46
6                 168.656        0.30                   9.858          5.11
7                 204.177        0.29                   12.451         4.72
8                 182.854        0.37                   12.296         5.46
9                 225.214        0.34                   12.973         5.82
10                234.98         0.36                   14.422         5.82

Observations

CPU usage was much higher during download than upload: fetching data spent most of the time at 100% cpu, while uploading was mostly at around 75-95% cpu.

There was no issue with memory consumption during uploads. I started to see an increase during the safe files get tests, and at the 20 MB mark it started using swap (ie all memory was full).

safe files get finds the file data via the FilesContainer, so it has some initial work to do before the real data can be fetched. cat simply dumps the raw data without having to look through the FilesContainer first. safe files get uses a different safe:// URL than safe cat.

The difference between get and cat is so big I was very dubious, but the sha256sum of both results was identical to the uploaded file. The difference is not the initial metadata fetch, since the prompt to overwrite an existing file arrives very quickly; get does not seem to be simply cat_filescontainer + cat_filenames.

Each test is only one sample so there may be some variation which isn’t accounted for in these results.

I really like the download progress meter on safe files get.

Also some nice info messages from the node, great to see:

[src/node/elder_duties/key_section/transfers/mod.rs:325] Computing StoreCost for 1048612 bytes
[src/node/elder_duties/key_section/transfers/mod.rs:328] Got StoreCost 0.004454644
[src/capacity/rate_limit.rs:74] Checking network storage
[src/capacity/rate_limit.rs:78] Total number of adult nodes: 6.0
[src/capacity/rate_limit.rs:79] Number of Full adult nodes: 0.0
[src/capacity/rate_limit.rs:80] Section storage usage ratio: 0.0

In the later part of a 3h upload the cpu load would loop in a pattern of 100% for a while (~10s), then 50% (~30s), then 12.5% (~30s), ie 8 cores, 4 cores, 1 core. It feels like nodes are doing a long-running task, with some finishing sooner but blocking until all nodes are done; then once all nodes are done they all start a new loop. Maybe a new chunk? No: I watched the total file size across all nodes and the cpu load pattern did not match any change in it. Probably the same loop is happening early in the upload but is much faster, so the pattern isn't obvious. Why would it be stretching out later in the process? My guess is there's a list somewhere that isn't being cleared properly or sorted optimally, and each new chunk uploaded creates more and more work due to the expanding list of previous chunks…? Maybe a list of errors that, once it gets long enough, becomes very hard to clear? Something list-related I reckon anyhow, but just a hunch.
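To make that hunch concrete, here's a toy model (pure speculation about the cause, not node code): if some per-chunk step re-scans a list of everything handled so far, total work grows roughly quadratically with the number of chunks, which would look exactly like fast-at-the-start, crawling-later behaviour.

// Toy model: a per-chunk step that re-walks an ever-growing list makes total
// work grow as roughly n^2 / 2.
fn main() {
    let mut seen: Vec<u64> = Vec::new();
    let mut total_work: u64 = 0;
    for chunk in 0..500u64 {
        total_work += seen.len() as u64; // hypothetical "re-check every earlier chunk" step
        seen.push(chunk);
    }
    println!("work units after 500 chunks: {}", total_work); // 124750
}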

When the client process was cancelled after 3h the nodes continued to loop on the cpu cycle. The nodes had to be killed with safe node killall

I saw some errors in the node logs after about 3h of uploading:

error: Data(DataExists)
[src/node/mod.rs:212] Running as Elder: RunAsKeySection(RunAsGateway(FindClientFor(MsgEnvelope { message: CmdError { error: Data(DataExists), id: MessageId(93a334..), correlation_id: MessageId(511d0c..), cmd_origin: Client(0afef0..) }, origin: MsgSender { entity: Entity::Ed25519(5a6ebb..)AdultNode(ChunkStorage), sig: Some(EntitySignature::Node::Ed25519(..)) }, proxies: [] })))

Also seeing

error sending client message
[src/node/elder_duties/key_section/client/client_msg_handling.rs:209] There was an error sending client message on the stream: WriteError(ConnectionClosed(TimedOut))

I restarted the network a few times. It seems like a less full network is faster than a more full network, so maybe some stuff isn't getting purged correctly? For example, uploading 8mb after 1,2,3,…,7mb had been done first was much slower than uploading just 7mb then 8mb (31s vs 17s). Another example: uploading 20mb after 1,2,3,…,9,10mb took 121s, but on a fresh network it took 31s. This is a big difference and a good place to investigate further.

Summary 2

Downloading with cat is much faster than get.

Uploading gets slower with larger files. The best upload speed was for the 3MB file, which got 3.8 Mbps. It seems like this should be more consistent; while part of it is due to not using a fresh network for each file, the upload rate slows for large files even on a fresh network. A 160MB file uploaded in about 45m, but a 320MB file did not upload and was manually aborted after 3h.

Fresh networks are faster. But running every test on a fresh network seems absurd, so it might be worth some investigation into the inconsistent performance as work accumulates.

Next Steps

Run the tests with greater logging and better investigation of those logs. It takes quite a long time to get familiar with the patterns and meaningful parts of the logs, but as a first step it feels essential to at least see what info we already have and whether there are useful clues in there, especially in the consensus parts.

Look into why cat is so much faster than get. This gives a clear codepath to follow and could give a clue to a broader issue along the way.

I’d love to set up a logger on all the lists and maps that are kept by nodes, see if any of them are growing in a way that doesn’t match their purpose. It’s a pretty detailed job but my hunch is this is a good place to look.

See whether the performance issue is more likely to be with nodes or authd. I suspect nodes (because of the cpu usage) but would be good to confirm.

27 Likes

I need to mark your posts @mav this is really helpful, especially now as we tweak to launch :wink: (tweak you see not twerk :smiley: :smiley: )

11 Likes

My hat's off to you again mav! Great job!

7 Likes

Thanks for your work.

5 Likes