Profiling vault performance

Great post @mav! They always are, but this one is very nice.

I would solve this by identifying the entry point to the vault for the source data and, at the first location where the crypto operation happens, creating a struct that is passed on from then onwards. Might not always be suitable, but in the naïve case, something like so:

{
    source1: [source 1],
    source2: [source 2],
    ...
    value: [computed value]
}

So, another abstraction for the data being handled.

I’m just assuming (I haven’t studied the code just now) that the data today is passed along through function calls and operated on in multiple places along the way, and that this approach would then be less of a change than introducing a cache.
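
For illustration, here is a minimal Rust-style sketch of such a struct (the field names and types are hypothetical, not taken from the actual vault code):

// Hypothetical container passed along in place of the raw sources, so the
// crypto result is computed once at the entry point and reused downstream.
struct ProcessedData {
    source1: Vec<u8>,  // raw input from source 1
    source2: Vec<u8>,  // raw input from source 2
    value: [u8; 32],   // value computed from the sources (e.g. a hash)
}

impl ProcessedData {
    fn new(source1: Vec<u8>, source2: Vec<u8>, value: [u8; 32]) -> Self {
        ProcessedData { source1, source2, value }
    }
}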

7 Likes

We do memoise in some cases and perhaps not in all. It is a good point to check where memoisation is missing and make sure it is maintained. Nice point @mav, thanks again, but really thanks is not enough. We will need to make a charity donation or do something; these posts are pure gold for us.

21 Likes

Exactly what neo said. One of the best features of SAFE is enduring data integrity, with no need for fancy SAS controller cards, enterprise drives or filesystems… hard error rates happen, bits rot, etc.

EDIT: I guess we can’t get around the need for ECC RAM though… something to consider for farming hardware… although there is a library called SoftECC that might help…

4 Likes

Correct since SAFE provides the redundancy built in.

5 Likes

A simple benchmark for alpha-2 so we can compare with later networks.

Uploading and fetching 100 ImmutableData and 100 MutableData took me 6m39s.

This was done using the client_stress_test repeatedly (averaged over 4 runs: 6m40s, 6m38s, 6m43s, 6m35s).

If anyone else can run the client_stress_test on alpha 2 and put some timings up that’d be cool (but it requires modifying code in client_stress_test so I’m not really expecting it).

3 Likes

Forgive me, I am not sure of a couple of things with the stress test.

Are we talking about 400 MB total transferred (200 up & 200 down)?

Do you have the figure for the time to upload, or is it upload one chunk then download it?

It looks like the 200 chunks down were sequential (receive one, then request the next). Is that correct?

If 400 MB and sequential, then that represents approx 1000 KB/s (1 MB/s).

2 Likes

Yes it’s all sequential - upload one chunk, then download it, then upload the next etc. I only have data for the whole process, not split into upload / download / per chunk etc.

My understanding of the amount of data is:
MutableData has an empty BTreeMap, so is very small (< 1 KiB)
ImmutableData is 4 KiB, using rng.gen_iter().take(1024), i.e. 1024 * u32
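
For reference, a minimal sketch of how a ~4 KiB payload like that can be generated (assuming the older rand crate that still provides gen_iter; the actual client_stress_test code may differ):

// Generates 1024 random u32 values, i.e. 4096 bytes before serialisation.
// Assumes rand 0.3/0.4, where Rng::gen_iter still exists.
extern crate rand;
use rand::Rng;

fn random_immutable_payload() -> Vec<u32> {
    let mut rng = rand::thread_rng();
    rng.gen_iter::<u32>().take(1024).collect()
}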

2 Likes

Testing single vault speed vs other file storage protocols

safe is the command to run the client for uploading files to vault v0.19.2.
dd is a tool for dumping data from one place to another, with no encryption.
scp is a tool that uses ssh to secure the files while copying from one place to another (copying to a local VM).
rsync is a tool that transfers files remotely and also (like scp) uses ssh to encrypt the data over the wire.

Setup

Running locally on a laptop (i7-8565U @ 1.80 GHz)

safe_vault 0.19.2 (2019-09-05)
safe-authenticator 0.3.0 (2019-09-05)
safe-cli 0.5.0 (2019-10-16)

Tests requiring remote upload (scp, rsync) use a VM on VirtualBox with a host-only network adaptor.

Tests

1 MiB

dd if=/dev/urandom of=/tmp/onemeg.bin bs=1M count=1
1 MiB in 0.0105095s
scp /tmp/onemeg.bin test@localhost:/tmp/new/onemeg.bin
1 MiB in 0.534s
rsync /tmp/onemeg.bin test@localhost:/tmp/new/onemeg.bin
1 MiB in 0.621s
safe files put /tmp/onemeg.bin
1 MiB in 0.214s

10 MiB

dd ... count=10
10 MiB in 0.111802s
scp ...
10 MiB in 0.569s
rsync ...
10 MiB in 0.625s
safe files put ...
10 MiB in 0.576s

100 MiB

dd ... count=100
100 MiB in 0.645761s
scp ...
100 MiB in 1.301s
rsync ...
100 MiB in 1.544s
safe files put ...
100 MiB in 4.016s

200 MiB

dd ... count=200
200 MiB in 1.17017s
scp ...
2.181s
rsync ...
2.656s
safe files put ...
7.613s

300 MiB

dd ... count=300
300 MiB in 1.78146s
scp ...
3.107s
rsync ...
4.050s
safe files put ...
11.391s

400 MiB

dd ... count=400
400 MiB in 2.3287s
scp ...
4.195s
rsync ...
4.934s
safe files put ...
15.270s

499 MiB

dd ... count=499
499 MiB in 3.10294s
scp ...
4.726s
rsync ...
6.440s
safe files put ...
18.898s

Results

Average upload speed

dd: 1212 Mbps
scp: 582 Mbps
rsync: 468 Mbps
safe: 182 Mbps
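
For anyone checking the arithmetic, a small Rust sketch of how these averages can be reproduced from the timings above (assuming Mbps means 10^6 bits per second and that the per-size speeds are averaged; this reproduces the dd figure):

// Average throughput in Mbps over a set of runs given as (size in MiB, seconds).
fn average_mbps(runs: &[(f64, f64)]) -> f64 {
    let speeds: Vec<f64> = runs
        .iter()
        .map(|&(mib, secs)| mib * 1024.0 * 1024.0 * 8.0 / 1_000_000.0 / secs)
        .collect();
    speeds.iter().sum::<f64>() / speeds.len() as f64
}

fn main() {
    // dd timings from the Tests section above: (size in MiB, elapsed seconds)
    let dd = [
        (1.0, 0.0105095), (10.0, 0.111802), (100.0, 0.645761), (200.0, 1.17017),
        (300.0, 1.78146), (400.0, 2.3287), (499.0, 3.10294),
    ];
    println!("dd: {:.0} Mbps", average_mbps(&dd)); // prints approx 1212 Mbps
}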

No conclusions or insight or concerns. Just some numbers. And of course a chart.


Quirks

You may have noticed the test uses 499 MiB as the highest value. When uploading a 500 MiB file, things get funky.

The first obvious symptom is that the upload now takes 6m15s no matter whether you pick 500 MB or 5 GB or any other size over 500.

The log messages indicate timeout errors (from safe-api, i.e. the client).

[2019-11-03T22:19:42Z INFO  safe_api::api::files] Reading files from /tmp/500meg.bin

[2019-11-03T22:19:42Z DEBUG safe_api::api::files] Metadata for location: Metadata { file_type: FileType(FileType { mode: 33204 }), is_dir: false, is_file: true, permissions: Permissions(FilePermissions { mode: 33204 }), modified: Ok(SystemTime { tv_sec: 1572819581, tv_nsec: 316437266 }), accessed: Ok(SystemTime { tv_sec: 1572819578, tv_nsec: 940467113 }), created: Err(Custom { kind: Other, error: "creation time is not available on this platform currently" }) }

[2019-11-03T22:19:42Z INFO  safe_api::api::files] Processing /tmp/500meg.bin...

... 10s pass

[2019-11-03T22:19:52Z DEBUG quic_p2p::utils] ERROR in communication with peer 127.0.0.1:56281: Connection(ApplicationClosed { reason: ApplicationClose { error_code: 0, reason: b"" } }) - Connection Error: closed by peer: 0. Details: Incoming streams failed

...2m57s pass

[2019-11-03T22:22:49Z INFO  safe_api::api::files] Skipping file "/tmp/500meg.bin". [Error] NetDataError - Failed to PUT Published ImmutableData: CoreError(Request has timed out - CoreError::RequestTimeout)

[2019-11-03T22:22:49Z DEBUG safe_api::api::safe_client_libs] Putting appendable data w/ type: 1100, xorname: None

[2019-11-03T22:22:49Z ERROR safe_core::connection_manager::connection_group] Unexpected challenge, expected response (Custom("invalid value: integer `32`, expected variant index 0 <= i < 2")).

... 3m0s pass

[2019-11-03T22:25:49Z DEBUG safe_core::event_loop] Exiting Core Event Loop

[2019-11-03T22:25:49Z DEBUG quic_p2p::event_loop] Exiting QuicP2p Event Loop

[2019-11-03T22:25:49Z ERROR safe] safe-cli error: [Error] NetDataError - Failed to PUT Sequenced Append Only Data: CoreError(Request has timed out - CoreError::RequestTimeout)

It looks like 500meg.bin fails to upload due to a connection closing after 10s (19:42 -> 19:52), then a request timeout error related to that failure shows up after another 2m57s (19:52->22:49); or is it 3m7s (19:42 -> 22:49)?

A second piece of data (the appendable data) is then PUT, and a strange ‘challenge response’ error shows up.

I tried to dig deeper but for now it’s diminishing returns.

I did find there’s no single timeout constant near 6 minutes; I’m guessing this is two failures of 3m each, as per REQUEST_TIMEOUT_SECS (2 × 180s = 6m0s, which is close to the observed 6m15s).

const CONNECTION_TIMEOUT_SECS: u64 = 30;

pub const REQUEST_TIMEOUT_SECS: u64 = 180;

const MAX_ATTEMPTS: usize = 10;

I didn’t find anything in the code specifically related to 500 MB other than a changelog message in safe_vault 0.10.3 back in mid-2016, and that code is nowhere in the current codebase, so I’m certain that particular change is unrelated.

The 499 vs 500 MiB threshold caused the same issue on both my desktop and laptop, which suggests it’s size related rather than time related, since the desktop is faster than the laptop and would more likely have errored at around 550 MB or so if it were time related.

To reproduce:

$ dd if=/dev/urandom of=/tmp/499meg.bin bs=1M count=499
$ RUST_LOG=debug ./safe_vault
... in second terminal run the authenticator
$ RUST_LOG=debug ./safe_auth --test-coins
$ RUST_LOG=debug ./safe_auth --allow-all-auth --daemon 41805
... in a third terminal run the upload
$ RUST_LOG=debug ./safe auth
$ time RUST_LOG=debug ./safe files put /tmp/499meg.bin
17 Likes

Not checked (sorry, just a lack of time), but I think we still have a size limit in self_encryption that needs to be removed.

4 Likes

Oops, looks like it’s 1 GB (https://github.com/maidsafe/self_encryption/blob/master/src/lib.rs#L204). But in any case we need to remove the limit.

4 Likes

Maybe some combo of these…?

self_encryption/src/lib.rs

/// The maximum size of file which can be self-encrypted, defined as 1GB.
pub const MAX_FILE_SIZE: usize = 1024 * 1024 * 1024;
/// The maximum size (before compression) of an individual chunk of the file, defined as 1MB.
pub const MAX_CHUNK_SIZE: u32 = 1024 * 1024;
/// The minimum size (before compression) of an individual chunk of the file, defined as 1kB.
pub const MIN_CHUNK_SIZE: u32 = 1024;
/// Controls the compression-speed vs compression-density tradeoffs.  The higher the quality, the
/// slower the compression.  Range is 0 to 11.
pub const COMPRESSION_QUALITY: i32 = 6;

and

self_encryption/src/sequencer.rs

pub const MAX_IN_MEMORY_SIZE: usize = 50 * 1024 * 1024;
2 Likes

Yip, we are probably not flushing this out, and we should be. That library was a lot simpler but was made a bit more complex. It’s not a huge job to clean it up, and these tests really help @mav, thanks for that.
BTW, just added https://github.com/maidsafe/self_encryption/issues/253

7 Likes

Glad it’s helpful.

I’m hoping it’s just params in self_encryption. I would test, but setting up the layers of dependencies from safe-app > safe_core > self_encryption is a bit much just now; maybe later I’ll set this up to test modified self_encryption parameters.

One thing that sorta got me though was this:

ERROR safe_core::connection_manager::connection_group] Unexpected challenge, expected response (Custom("invalid value: integer 32, expected variant index 0 <= i < 2")).

If this is ultimately because of a parameter in self_encryption… that’s sorta weird. The error seems very far removed from the source. I know you guys are pumping hard at the moment, so I’m going to look into this later as well, but I’m just flagging that it seemed an odd way for the error to show itself.

3 Likes

This is really good. Just to check something here: was this all compiled in release mode? I just ask as debug is about 10x slower with crypto.

Also an interesting thing to think of: when we did self-encrypt tests way back, we had interesting results. We could write large files much faster than just storing them (copy etc.) to network drives. The reason was that self-encrypt wrote many chunks in parallel (SSD), whereas the stream of the (unencrypted) file could only write one chunk at a time (the whole file, basically).
We should be able to upload in parallel with quic-p2p as well; not truly in parallel, but QUIC does stream multiplexing. So we can perhaps get an increase there too: if a channel is blocking or slow, the others can hopefully finish faster, and so on. We won’t be sending only down slow channels, is what I mean.
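
A minimal sketch of the parallel-chunk idea described above (not the actual self_encryption code; the chunk contents and file paths are placeholders):

use std::fs;
use std::thread;

// Writes each chunk to its own file on a separate thread, so an SSD can
// service the writes concurrently instead of streaming one long file.
fn write_chunks_in_parallel(chunks: Vec<Vec<u8>>) {
    let handles: Vec<_> = chunks
        .into_iter()
        .enumerate()
        .map(|(i, chunk)| {
            thread::spawn(move || {
                fs::write(format!("/tmp/chunk_{}.bin", i), chunk)
                    .expect("chunk write failed");
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("writer thread panicked");
    }
}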

8 Likes

Yes, we need @lionel.faber and the guys to look as well @mav. I reckon we may have a bug (at least in error reporting) as well as the self_encryption limits.

4 Likes

Good points. Yes it’s compiled with --release

Agreed about the performance; it’s very satisfying to see it already so close to the existing protocols. I think it’ll be able to reach extremely fast speeds once that becomes the focus.

11 Likes

It’d be good to try that at 0, actually.

2 Likes

Moved to a dev topic to discuss this. I don’t think it’s self_encryption causing this error.

5 Likes

@mav this error is due to the Phase 1 limitation that a particular Client can have only a single connection with the vault. A new connection sends a challenge request, and since a connection is already available it returns an error saying Unexpected challenge.

13 Likes

To check I understand this correctly, are you saying the “Incoming streams failed” error is causing an attempt to reconnect for the second file (the appendable data w/ type: 1100), even though the connection is already available?

I’m just wondering whether a) the connection should be closed after the stream failed, or b) the second file should be trying to reuse the existing connection?

Also, maybe for a later time: is there a way to make this error more meaningful? It may not be possible in this particular case, but I wanted to ask since this is part of the user experience (admittedly not that of a common user).

I’m not really expecting a detailed reply from you; I know this is a very busy time for everyone at MaidSafe, so please treat a response as very low priority!

edit: also found the reason for the 500 MB limit; it’s in quic-p2p, see this dev forum post for details

9 Likes