Fully agree with this, because that was me too. Anyone reading this, please give it a go; it's honestly not that difficult.
Perhaps it will be some time before we get another playground?
Well, it won't be today, I can tell you that.
Agree with this too!
When putting the kids to bed, I enjoy reading about the developments and the enthusiasm you all have when a testnet is live. But joining in myself still feels a bit too complicated.
If we have some easy-to-follow guidelines and a network that runs longer, I would definitely be interested in being part of this history in the making! Also, over time I am getting more and more curious about the uploaded files you are all talking about.
Can’t this be tested with very small nodes? Baby steps?
Sticking with fixed node sizes was a good improvement, IMO. Another big improvement to the testnets would be sticking to uploading a known, “standardized” SN testnet data set with a known file count and known size. For example, consider the following resources:
http://imageprocessingplace.com/root_files_V3/image_databases.htm
When the data is known ahead of time, it becomes easy to reason about the load on uniform/constant/fixed-size nodes and the other CPU/bandwidth resources required to run the test.
Yes, this is what I was trying to do a few months back. I had a script to up/download standard files of fixed size, as you suggest.
I'll link to it later - I'm suffering from delusions of sobriety right now.
They're in an S3 bucket somewhere that just exceeded its free limit. Later…
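In the meantime, here's a minimal sketch of that kind of script (my own rough version, not the one in the bucket; it assumes the `safe` CLI is installed, already connected to the testnet, and that `safe files put <path>` is the upload command in the current release):

```python
# Rough sketch of a fixed-size upload test script. Assumes the `safe` CLI is
# on PATH, already connected to a testnet, and that `safe files put <path>`
# is the upload command in the current release.
import os
import subprocess

FILE_COUNT = 10          # number of test files to generate
FILE_SIZE = 4 * 1024**2  # 4 MB each, matching the proposed standard set
OUT_DIR = "testnet_data"

os.makedirs(OUT_DIR, exist_ok=True)

for i in range(FILE_COUNT):
    path = os.path.join(OUT_DIR, f"standard_{i:04d}.bin")
    with open(path, "wb") as f:
        f.write(os.urandom(FILE_SIZE))  # random content, so chunks don't dedupe
    # Upload and print whatever URL/XOR address the CLI reports, for later GETs.
    result = subprocess.run(["safe", "files", "put", path],
                            capture_output=True, text=True, check=True)
    print(result.stdout.strip())
```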
Using a standard open/free/public-domain dataset is really what we need. In this early phase, prior to a real Safe Network, it will allow many more people to feel comfortable with their participation, knowing that no questionable content will be uploaded to testnets by good-faith participants.
Consider the following testnet scenario (the arithmetic behind the derived figures is sketched in the code below).
Standard testnet data content
- 1000 text files @ 4kB each
- 1000 images @ 4MB each
- 1000 audio clips @ 4MB each
- 1000 video clips @ 40MB each
- Total file count = 4000
- Total database size = 48004 MB
Testnet properties
- CPU cores per node: 2
- RAM per core: 2GB
- chunk replication count: 4
- total network storage required: 193GB
- elders per section: 7
- min nodes per section: 35
- max nodes per section: 105
- target network depth (section prefix length): 3
- target section count: 8
- target final node count: 840
- assigned node vault size: 230MB
- expected time required to PUT all @ 40Mb/s: 12 hours
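For anyone who wants to check or tweak these figures, here's the rough arithmetic in Python (my own sketch; it uses decimal kB/MB/GB and assumes the 40 Mb/s is the aggregate upload rate covering all four replicas, which is why it lands a bit under the 12 h quoted):

```python
# Rough check of the derived figures above, from the base parameters.
KB, MB, GB = 1e3, 1e6, 1e9

# Standard testnet data content
dataset_bytes = (1000 * 4 * KB     # text files
                 + 1000 * 4 * MB   # images
                 + 1000 * 4 * MB   # audio clips
                 + 1000 * 40 * MB) # video clips
print(f"dataset size: {dataset_bytes / MB:,.0f} MB")   # 48,004 MB

# Testnet properties
replication = 4
sections = 8
max_nodes_per_section = 105
nodes = sections * max_nodes_per_section               # 840 nodes

stored_bytes = dataset_bytes * replication
print(f"network storage: {stored_bytes / GB:.0f} GB")  # ~192 GB
print(f"vault size per node: {stored_bytes / nodes / MB:.0f} MB")  # ~229 MB

upload_rate = 40e6 / 8  # 40 Mb/s in bytes per second
hours = stored_bytes / upload_rate / 3600
print(f"PUT time: {hours:.1f} h")  # ~10.7 h, so roughly 12 h with overhead
```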
I do like the idea of a public data set as standard, something we can easily script to pull/pipe directly onto the network (or from a cached local copy so we're not rinsing their hosting). It can even be done as an initial step, then allow splits free rein and set about verifying.
Moving forward, the standard set could be amended and grow in size as larger vault sizes and network depths are tested. (Ex. A=48GB, B=96GB, … Z=128TB)
These standards end up being the known genesis data.
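A rough sketch of the pull-once-into-a-cache-then-upload idea (the URL list is just a placeholder; swap in whichever public dataset gets agreed as the standard, and again this assumes `safe files put` is the upload command):

```python
# Sketch of "pull once into a local cache, then pipe onto the network".
# DATASET_URLS is a placeholder; replace with the agreed standard set.
import os
import subprocess
import urllib.request

DATASET_URLS = [
    # hypothetical entries, replace with the real standard set
    "https://example.org/standard-set/image_0001.jpg",
    "https://example.org/standard-set/image_0002.jpg",
]
CACHE_DIR = "dataset_cache"
os.makedirs(CACHE_DIR, exist_ok=True)

for url in DATASET_URLS:
    local = os.path.join(CACHE_DIR, os.path.basename(url))
    if not os.path.exists(local):
        urllib.request.urlretrieve(url, local)  # only hit their hosting once
    subprocess.run(["safe", "files", "put", local], check=True)
```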
A vault is data every member of a section has a copy of, right?
Would it not be better that, instead of vaults getting bigger, a node gets to join more sections?
That way, if you lost a node, multiple sections would be able to replenish the replacement node.
There are tens of nodes in a section (~70?), but only four copies of each chunk, so not every node in a section has the same data.
The point of nodes holding data in relation to their section is that content addressing allows anyone to know where data needs to be sent to store it, and vice versa (where to ask in order to get the data for a given content hash/XOR address).
The Safe Network Primer has more detail.
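To make the content-addressing point a bit more concrete, here's a toy illustration (my own sketch, not the actual network code): a chunk's address is just the hash of its content, and the nodes responsible for it are the ones closest to that address in XOR distance.

```python
# Toy illustration only (not the real Safe Network code): a chunk's address is
# the hash of its content, and the nodes responsible for storing it are the
# ones whose names are closest to that address in XOR distance.
import hashlib
import os

def xor_name(data: bytes) -> int:
    """Chunk address = content hash, read as a 256-bit number."""
    return int.from_bytes(hashlib.sha3_256(data).digest(), "big")

def closest_nodes(address, node_names, replicas=4):
    """The `replicas` node names nearest to the address in XOR distance."""
    return sorted(node_names, key=lambda name: name ^ address)[:replicas]

# Demo: 70 random node names, one chunk, and the four nodes that would hold it.
node_names = [int.from_bytes(os.urandom(32), "big") for _ in range(70)]
chunk = b"one chunk of a self-encrypted file"
address = xor_name(chunk)
holders = closest_nodes(address, node_names)
print(f"chunk address: {address:064x}")
print("held by:", [f"{n:064x}"[:16] + "..." for n in holders])
```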
This sounds like a great dataset.
I've been poking around the link you provided but haven't come up with anything like this. Currently downloading sample images from COCO (https://academictorrents.com/details/74dec1dd21ae4994dfd9069f9cb0443eb960c962) to see if that's any better. (Edit: it wasn't. ML datasets are not the way to go; all low-res.)
If you have a better split/organised data set like the example, send it my way please!
Maybe this? Unsplash Dataset | The world’s largest open library dataset
Same exercise with 2 or 3 other open source datasets should then give you other file types.
Does the Internet Archive make their data available in bulk? (To get it all in one place.)
I was just trying to download that from some weird links. That's much simpler, ha. How did I not look there?
Bah, it’s all DB files and more faff. I think we just want the images.
I'm calling it a day shortly. If anyone knows how to get at some photos reliably, or knows of similar datasets and can set one up, that would be awesome.
I think Wikidata would be a good one, and from memory there might be raw dumps available.
EDIT:
The latest compressed JSON dump is 72GB, but they do incremental dumps once you have that. There are also various formats and tools to help download and process the data. More here.
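If it helps, here's a rough sketch of how the compressed dump can be streamed without unpacking the whole 72GB first. It assumes the entity dump format of one big JSON array with roughly one entity per line, and the filename is just a placeholder for whatever you downloaded:

```python
# Streaming the compressed Wikidata entity dump line by line, instead of
# unpacking all 72GB first.
import gzip
import json

DUMP = "latest-all.json.gz"  # placeholder path to the downloaded dump

with gzip.open(DUMP, "rt", encoding="utf-8") as f:
    for line in f:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # skip blank lines and the array brackets
        entity = json.loads(line)
        print(entity["id"], entity["type"])  # e.g. "Q42 item"
        break  # demo only: stop after the first entity
```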
I did look at Wikipedia, but it's 54GB, so a bit of a monster to download and poke about in.
I have some tools and can start doing it this weekend. The videos will be the most time-consuming. It may be easier to have 100 videos that are 400MB each.
Is there an equivalent command to export SN_CLI_QUERY_TIMEOUT=240
for increasing the node timeout, so my nodes don't drop out of the joining queue?