Format for storing performance data

tl;dr just wanna make nice graphs and help provide some useful data.

I've been looking at upload speeds and performance, and I have written a wee script to gather some performance data.
The aim being to build a body of test results to inform the devs of immediate problems now and to monitor performance enhancements in the future.
When I think harder about this, I wonder if I should take a step back and ensure that the data captured is in an agreed JSON schema for sharing and analysis by the devs and the brainier members of the community.

Another much bigger point is that a time in seconds on its own is no use without context: filesize, iteration number, delay between iterations, software versions, PUT or GET, CPU type and frequency, RAM fitted, disk type (SSD, SATA), local or remote, connection type (ADSL, cloud, etc.) and a host of other parameters that need to be captured to make each individual point meaningful. Should I be capturing real, sys and user times instead of just the real time? Probably…
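If so, GNU time can emit all three in one pass via its format string - a minimal sketch, where the output filename and the file being uploaded are just placeholders:

# %e = elapsed (real) seconds, %U = user CPU seconds, %S = system CPU seconds
/usr/bin/time -f '{"real": %e, "user": %U, "sys": %S}' -o times.json safe files put ./5.dat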
So which of these parameters should I capture? And although most of the data points will be numeric, would it make more sense to store them all as JSON strings?
Not being a data scientist, I'm going to be trying to learn about data visualisation quickly, probably using Python. However, if there is any point to this, I'd like to use an agreed schema - which may well already exist internally in the project, but I don't know about it yet.

Also, right now I collect only the real time, formatted in seconds, using GNU time 1.7, NOT the bash builtin time command. See man time:
/usr/bin/time -f "\t%e" safe files put

willie@gagarin:~/tmp/testnet-file-uploads-20210703_1345$ /usr/bin/time -V
GNU time 1.7
willie@gagarin:~/tmp/testnet-file-uploads-20210703_1345$ time -V
-V: command not found

real	0m0.458s
user	0m0.039s
sys	0m0.383s

See the difference? The problem here is that GNU time 1.7 is not POSIX, which will perhaps have implications for Windows users, but it does let me format the time in seconds only instead of as 0m0.458s. I suppose the builtin's output could be worked around with regexps/sed, but I'm lazy…
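For what it's worth, the bash builtin can also be persuaded to print seconds only via its TIMEFORMAT variable - a minimal, bash-specific sketch (filenames are placeholders):

# %R prints elapsed real seconds only; the builtin's report goes to stderr,
# so redirect that to pick it up (safe's own errors would land in the same file)
TIMEFORMAT=%R
{ time safe files put ./5.dat > /dev/null; } 2> elapsed.txt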

In post 55 of [Offline] Fleming Testnet v6.2 Release - Node Support - #55 by Vort, @Vort is using Wolfram and C#, and that may be the best route for visualisation, so I'm going to have to bite the bullet and try to learn some C#.


HELP NEEDED!!!
While I wait to get some feedback on the above post, it’s moot unless I can output the important data properly.

Anybody know how I can get the output of the time command here but suppress the output from safe
(/usr/bin/time -f "\t%e" safe files $SAFE_COMMAND $DEST_DIR/$FILESIZE.dat)?

Try this

/usr/bin/time -f "\t%e" safe files $SAFE_COMMAND $DEST_DIR/$FILESIZE.dat > /dev/null

/usr/bin/time writes to stderr whereas safe writes to stdout, so the redirect to /dev/null works well here. Check the -o/--output flag in the /usr/bin/time man page.
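For example, -o can drop the timing straight into a file, which is then easy to pick up in the script afterwards (elapsed.txt is just a placeholder name):

# Timing goes to elapsed.txt; safe's own stdout is discarded
/usr/bin/time -f "%e" -o elapsed.txt safe files $SAFE_COMMAND $DEST_DIR/$FILESIZE.dat > /dev/null
ELAPSED=$(cat elapsed.txt)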


Just need one data point … works? [yes/no] ! :smiley:

If it works, then all the data, in whatever delimited format, will be interesting for spotting what is material to performance.

Yes - as far as it's not affected by other processes and users; I guess cloud servers might have other users sharing the real hardware, etc…
All context might be useful - OS and kernel version - and whether or not the CPU maxes out might make performance non-linear too?

Then the nature of the file being uploaded might be a factor… whether it's purely random or some content that will be de-duplicated client side. Perhaps a purely random file versus a purely non-random one might be interesting; a file that is pure repetition of text we might expect to be as fast as possible, and a purely random one as slow per MB as practical.
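(For instance, a matched pair of test files could be knocked up like this - just a rough sketch, sizes and names arbitrary:)

# Purely random data (shouldn't dedupe) vs pure repetition (should dedupe well)
dd if=/dev/urandom of=random.dat bs=1M count=10
yes "the same line of text over and over" | head -c 10M > repetitive.dat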

Tests of each CLI function would be most interesting too - the response to each type of request seems to be different. Is creating keys significantly different from a put, etc.? Perhaps the most useful thing from this type of performance data will be a sense of what is consistent, and a measure of any performance that is less consistent… the boundaries of performance are important too. Perhaps the network will naturally be more responsive at some times and less at others, relative to how many users are active… initially a lot of activity might push the limits of what is possible; so there might be differences over 24 hours and over 7 days.

tl;dr: it's all interesting - so, the more the better.


So far I have this output - and despite help from @mav I'm still not displaying the time correctly just yet - soon, real soon now

:slight_smile:

{"testrun":" Sun 4 Jul 00:49:20 UTC 2021", "safe_version": "sn_cli 0.31.1", "node_version":"safe_network 0.6.1", "safe_command":"put", "filesize":"5", "no-of-iterations": "5", "delay": "2" , "data":[{
"iteration": "1", "elapsed time": " 
Command terminated by signal 2
	178.01
"iteration": "2", "elapsed time": " 
Command terminated by signal 2
	8.56
"iteration": "3", "elapsed time": " 
Command terminated by signal 2
	5.32
"iteration": "4", "elapsed time": " 
Command terminated by signal 2
	6.42
"iteration": "5", "elapsed time": " 
Command terminated by signal 2
	4.09
"}

Indeed, and getting a body of data that will allow this means I need to make the data collection process extremely simple and as easy as possible for a wide range of users across a similarly wide range of kit. I can parse the relevant CPU info from
cat /proc/cpuinfo
on Linux, but I don't have (easy) access to a Windows box, so I will need help on that as well.

This script is actually adapted from some earlier work of mine in which the user would upload some standard files of varying sizes and then time the downloads. The intent there was to examine what effect deduplication would have on downloading these sets of common chunks. For now the script generates a (supposedly) random file of the desired size from /dev/urandom, which should be good enough for our purposes. I'm not researching nation-state-class encryption, so I think we can forgive the not absolutely random nature of /dev/urandom.


Uncoupling what is client-side de-duplication time from what is network response might be useful too. Could do with more log-file detail for noting timestamps of local CLI actions??


Once I get the thing actually recording the times properly I will push it to the branch at GitHub - safenetwork-community/upload-performance: PUT a series of standard size files to exercise the SAFENetwork testnets

meanwhile PRs very welcome :slight_smile:

EDIT: New updates merged - Please try to break this - and then help me put it back together again…

Thanks @mav - getting places now :slight_smile:

{"testrun":" Sun 4 Jul 12:18:39 UTC 2021", "safe_version": "sn_cli 0.31.1", "node_version":"safe_network 0.6.1", "safe_command":"put", "filesize":"10", "no_of_iterations": "4", "delay": "1" , "data":[{
"iteration": "1", "elapsed time": " 
	15.78
"iteration": "2", "elapsed time": " 
	18.26
"iteration": "3", "elapsed time": " 
	20.36
"iteration": "4", "elapsed time": " 
	22.15
"}

This is my proposed schema for now. I debated adding a field for random vs structured data, but rejected it, as deciding what kind of structured data - text, image, audio - would be too involved for now.

{
    "testrun":" Sun 4 Jul 00:56:50 UTC 2021", 
    "safe_version": "sn_cli 0.31.1", 
    "node_version":"safe_network 0.6.1", 
    "safe_command":"put", 
    "filesize":"5", 
    "no_of_iterations": "3", 
    "delay": "2" , 
    "results":[{
        "iteration": "1", 
        "elapsed time": "174.37",
        "system time": "0.5",
        "user time":"1.0"
        },            
        {
        "iteration": "2", 
        "elapsed time": "192.33",
        "system time": "0.5",
        "user time":"1.0"
        },    
        {
        "iteration": "3",
        "elapsed time": "203.60",
        "system time": "0.5",
        "user time":"1.0"
        }
    ]
}
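A quick way to sanity-check that any captured results file really is valid JSON, assuming jq is installed (the filename is just a placeholder):

jq . results.json > /dev/null && echo "valid JSON"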

Looks good… I expect the filesize is in MB, but it could perhaps be in human format - usually there's a -h flag for that.


It's kB - but it's all tweakable… it just suits this line, cos I'm lazy


dd if=/dev/urandom of=$FILESIZE.dat bs=1024 count=$FILESIZE >/dev/null

Now it's been mentioned, I can simplify that; the actual filename of the test data is irrelevant, and it can/should be deleted at the end of the loop anyway.
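Something like this, perhaps (testdata.dat is just a stand-in name):

# Fixed throwaway name; size still driven by $FILESIZE (in kB). dd chatters on stderr, hence 2>/dev/null
dd if=/dev/urandom of=testdata.dat bs=1024 count=$FILESIZE 2>/dev/null
# ... timed upload of testdata.dat goes here ...
rm -f testdata.dat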

willie@gagarin:~/tmp/upload-tests$ cat results-20210705_0112
{"testrun":" Mon 5 Jul 01:12:06 UTC 2021", "safe_version": "sn_cli 0.31.1", "node_version":"safe_network 0.6.1", "safe_command":"put", "filesize":"20", "no_of_iterations": "4", "delay": "2" , "data":[{
"iteration": "1",
"elapsed time":" 11.27","user time":" 0.58","system time": " 0.02"}
"iteration": "2",
"elapsed time":" 13.04","user time":" 0.71","system time": " 0.02"}
"iteration": "3",
"elapsed time":" 15.09","user time":" 0.82","system time": " 0.04"}
"iteration": "4",
"elapsed time":" 16.91","user time":" 0.94","system time": " 0.03"}
]}

It could be useful to print date -u in every iteration. In case of a performance drop or something unusual, I bet it will be nice to see whether it hit different clients at once or how it spread across the network.
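e.g. something along these lines inside the loop (the field name is just a suggestion):

# ISO 8601 UTC timestamp per iteration, so runs from different clients can be lined up
printf '"timestamp": "%s", ' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"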


For anyone wanting to dive into a bit of benchmarking w/ Rust:

I've whipped up a basic example using criterion; it's Rust, but hopefully it's clear enough that it could be expanded upon: tests: basic benchmarks by joshuef · Pull Request #136 · maidsafe/safe_network · GitHub

If you pull that branch, your basic run (with a network already set up; it could be local or WAN; you just need the node_connection_info file) is cargo bench. There's some simple CLI output there, but more in-depth charts and things are generated in the target/criterion/<benchmark> folder.

I wasn't planning to get this into CI as yet, but am just proposing it for the codebase in case we can work up an expanded benchmark suite that we want to use in the future.


In every iteration? There is already a timestamp at the top of the test run. Tests can of course take several hours, perhaps days, to complete for a large number of iterations and/or very large filesizes, but I don't know how much value this has for identifying bottlenecks/trends. Please convince me otherwise :slight_smile:

As ever, PRs are very welcome indeed, especially to help identify hardware on Windows machines.

Hardware detection - well, CPU model and number of cores for starters - is working now and the output should be valid JSON. Please give safenetwork-community/upload-performance a try if you can run a local test network on Linux.

EDIT: Tacking this on here so as to not totally monopolise the thread
The relatively simple hardware detection we can do is processor model, core count and RAM. The free command gives us a good summary of memory usage, but what vital parameter(s) should we capture? Total and available - or simply just available?

willie@gagarin:~/projects/maidsafe/testnet-scripts/upload-performance$ free -h --mega
              total        used        free      shared  buff/cache   available
Mem:            15G        5.6G        1.9G        262M        8.5G        9.8G
Swap:           32G        708M         31G

Note this is captured before the test run commences. Is there any value in reading this at the conclusion of the test run - or even every iteration - or is this just getting silly now?
Or put the code in the loop but disable it by default?
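Either way, pulling the numbers out is easy enough - a sketch that grabs total and available in MB as plain numbers (field names up for grabs):

# Columns of `free --mega`: total used free shared buff/cache available
read -r MEM_TOTAL MEM_AVAILABLE < <(free --mega | awk '/^Mem:/ {print $2, $7}')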

EDIT: Hardware detection was not working properly. shellcheck.net insulted my pet and claimed my cat was useless. So I changed it as suggested and broke the hardware detection.
Reverted to the “deprecated” way of doing things:

CPU_MODEL=`cat /proc/cpuinfo| grep name|head -n1|cut -c 14-`
^-- [SC2006](https://github.com/koalaman/shellcheck/wiki/SC2006): Use $(...) notation instead of legacy backticked `...`.
^-- [SC2002](https://github.com/koalaman/shellcheck/wiki/SC2002): Useless cat. Consider 'cmd < file | ..' or 'cmd file | ..' instead.

Did you mean:
CPU_MODEL=$(cat /proc/cpuinfo| grep name|head -n1|cut -c 14-)
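For the record, a variant that should keep both SC2006 and SC2002 quiet without losing the value might be:

# grep reads the file directly (no cat); $(...) instead of backticks;
# cut/sed avoid relying on the exact character position of the value
CPU_MODEL=$(grep -m1 'model name' /proc/cpuinfo | cut -d: -f2- | sed 's/^[[:space:]]*//')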

Maybe I will learn rust…
