Public Datasets on Safe Network

Public datasets are going to have a massive role to play on Safe Network. There’s so much public data: much of it is static (so it suits immutable data upload), none of it is copyrighted, most of it is in a good format, and it’s a perfect candidate for foundations or motivated individuals to voluntarily upload and increase the size of the network.

I feel like this is something that could be turned into a really neat project.

  1. Gather a list of public datasets

  2. Coordinate between uploaders to avoid duplicate costs (there could be a project in this; maybe add a leaderboard to make it exciting)

I think we can start step 1 here in this topic.

Step 2 is perhaps something the BGF or the MaidSafe Foundation might sponsor? Or it could be individuals paying? Or it could be a bounty system?

Some considerations:

  • Try to include some sort of ‘time travel’ feature for datasets that share the same context/format but vary over time

  • Think about doing this in a way that naturally allows archive.org and other archival datasets to participate
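On the ‘time travel’ idea above: since immutable uploads never change, one simple approach (just a sketch, not any existing Safe API) is a small mutable index that maps dated snapshots of a dataset to their immutable addresses. Everything here, the class, the dataset name, and the safe:// addresses, is a hypothetical example:

```python
# Sketch: a "time travel" index for a dataset that is re-uploaded as an
# immutable snapshot each time it changes. All names/addresses are invented.
from bisect import bisect_right

class DatasetTimeline:
    def __init__(self, name):
        self.name = name
        self.snapshots = []  # sorted list of (iso_date, immutable_address)

    def add_snapshot(self, iso_date, address):
        self.snapshots.append((iso_date, address))
        self.snapshots.sort()

    def as_of(self, iso_date):
        """Return the address of the latest snapshot on or before iso_date."""
        i = bisect_right(self.snapshots, (iso_date, "\uffff"))
        return None if i == 0 else self.snapshots[i - 1][1]

timeline = DatasetTimeline("uk-road-safety")
timeline.add_snapshot("2020-01-01", "safe://example-snapshot-2020")
timeline.add_snapshot("2021-01-01", "safe://example-snapshot-2021")

print(timeline.as_of("2020-06-15"))  # safe://example-snapshot-2020
```

Because ISO dates sort lexicographically, a plain sorted list and a bisect lookup are enough; no real date parsing is needed.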

I see a lot of talk about Wikipedia, but I feel it’s not a good fit for the early stages of the network since it changes a lot and very rapidly, and there’s a huge administrative aspect to Wikipedia that isn’t easy to port across compared to public datasets.

Some public datasets I know of:

https://www.data.gov/

https://data.gov.uk/

https://data.gov.au/

https://data.govt.nz/

https://www.archives.gov/

It’s not sexy, but it’s better than uploading random data, and if it’s done well it could be incredibly valuable. But I feel there’s a big challenge in coordinating this effort.

If you know of more public datasets or have ideas about how this could work please let us know.

22 Likes

Canada

https://open.canada.ca/data/en/dataset

1 Like

SETI

1 Like

and

http://runeberg.org/

5 Likes

avoindata.fi/en - All Finnish open data from one place.

1 Like

Great idea @mav ! This data will also be useful for folks wanting to write apps which refer to or represent this sort of data.

3 Likes

Searching “Public Domain” throws up all sorts, though how much of it is structured and accessible in bulk will be a task to work through.
Also perhaps an option to seek out CC0 and similar licences: CC0 - Creative Commons

The Public Domain Information Project https://www.pdinfo.com/ covers music; there are plenty of public domain images too, but again which of those are worth uploading is less clear… I would hope there is public domain art as well, but I expect Google limits what it shows in Google Art despite it being free.

Searching “public domain data sets” throws up:

  • Google Trends.
  • National Climatic Data Center.
  • Global Health Observatory data.
  • Data.gov.sg.
  • Earthdata.
  • Amazon Web Services Open Data Registry.
  • Pew Internet.

Then there is a huge “awesome list”, but each instance listed might be its own rabbit hole… and there’s a question about how to ensure versioning is maintained. If there is a bulkdata.zip, that’s something, but if it’s a warren being continually updated, then it needs a versioning snapshot/diff?
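On the snapshot/diff question: one low-tech approach is to record, per snapshot, a manifest of content hashes and diff the manifests, so only added or changed files need uploading. A minimal sketch, with invented filenames and in-memory file contents:

```python
# Sketch: diff two snapshot manifests (filename -> content hash) so that
# only added or changed files need uploading for the new snapshot.
import hashlib

def manifest(files):
    """Map each filename to a SHA-256 hash of its content."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}

def diff(old, new):
    added   = [n for n in new if n not in old]
    changed = [n for n in new if n in old and new[n] != old[n]]
    removed = [n for n in old if n not in new]
    return added, changed, removed

# Hypothetical example data
snap1 = manifest({"a.csv": b"1,2,3", "b.csv": b"x,y"})
snap2 = manifest({"a.csv": b"1,2,3", "b.csv": b"x,y,z", "c.csv": b"new"})

print(diff(snap1, snap2))  # (['c.csv'], ['b.csv'], [])
```

Since uploads are immutable, “changed” really means uploading a new copy and updating the snapshot index; nothing is overwritten.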


So much data… just looking for a place to put itself :smiley:

and another list at These Are The Best Free Open Data Sources Anyone Can Use

  • World Bank Open Data
  • WHO (World Health Organization) — Open data repository
  • Google Public Data Explorer
  • Registry of Open Data on AWS (RODA)
  • European Union Open Data Portal
  • FiveThirtyEight
  • U.S. Census Bureau
  • Data.gov
  • DBpedia
  • freeCodeCamp Open Data
  • Yelp Open Datasets
  • UNICEF Dataset
  • Kaggle
  • LODUM
  • UCI Machine Learning Repository

and it’s too easy to find too many

4 Likes

Perhaps #offtopic, but there is that kind of public data that is blockchains… evidencing how the network can support a copy might be a powerful message to the techies. If you can best the option they already use, then they will use it…

4 Likes

I have a list of active SPARQL (RDF) endpoints (servers), most of which are either historical data (so static) or largely static. If this task gets going I can dig that out rather than post them all here.

2 Likes

#makearchivingsexyagain I love the idea, very nice thinking.

thesession.org is a massive collection of Irish traditional instrumental dance music, extremely popular around the world for people into that kind of thing. I don’t know if the whole thing can be downloaded or if you’d have to scrape the site (I think that’s the correct term).

www.mutopiaproject.org is a collection of mostly classical music, transcribed into the very pretty LilyPond format, which means PDFs too. Volunteers transcribe old stuff in the public domain using LilyPond so that it’s easily accessible, modifiable, etc. Just over 2,000 entries.

imslp.org has 11 million public domain pages from some 180,000 musical works; mostly, it seems from my use, they have scans of the original documents.

4 Likes

Wikidata (rdf). Having wikidata on SAFE could be very useful once we get proper RDF support. It has data on all kinds of entities and could be useful for many apps. Wikidata is a sister project of Wikipedia, but is a better fit at an earlier stage than Wikipedia itself.

The Linked Open Data Cloud: 1,500 datasets, though not all of them are that useful.

OpenStreetMap is another free dataset that would be useful to create Google Maps competitors etc. Would need some processing to get it in a format that’s usable on SAFE and some custom indexes for SAFE.

MusicBrainz (rdf): dumps are no longer provided; you need to download the MusicBrainz database and server and then run the dump script to create the triples.

In many cases, datasets using RDF can just be updated with a diff from the last data dump.
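The update-by-diff approach works especially well for N-Triples dumps, because each line is one self-contained triple and line order doesn’t matter, so the diff between two dumps is just a set difference. A toy illustration with invented triples:

```python
# Sketch: compute the update between two N-Triples dumps as set differences.
# Each line of an N-Triples file is one independent triple, so a dump diff
# is just (triples added, triples removed). The example triples are invented.

old_dump = {
    '<ex:artist1> <ex:name> "Old Name" .',
    '<ex:artist2> <ex:name> "Stable Name" .',
}
new_dump = {
    '<ex:artist1> <ex:name> "New Name" .',
    '<ex:artist2> <ex:name> "Stable Name" .',
}

added = new_dump - old_dump
removed = old_dump - new_dump

print(sorted(added))    # ['<ex:artist1> <ex:name> "New Name" .']
print(sorted(removed))  # ['<ex:artist1> <ex:name> "Old Name" .']
```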

7 Likes

I would like to thank @mav for the active involvement in my proposal to upload useful data to the Safe Network. Thank you!

Big thanks to everyone else giving proposals as well!

My pledge to upload data to the network will be used for these datasets :dragon:


Privacy. Security. Freedom

2 Likes

Besides the actual discovery+uploading of data, I guess there will be quite a lot of effort on two other aspects:

  • data organizing - most of the work and value of the sites linked above comes from the site itself, which provides the metadata, structure, discovery for the underlying data. This will be a tough challenge since I feel uploading raw data to safe has relatively limited value by itself. It might even be wisest to simply try to get a safe://... mirror link for each data set on these sites if we can, which leads to:

  • Starting relationships and coordination around the data. Since we can upload data to safe network anyway (permissionless network and open license data), this could be a great foot in the door to starting relationships with the data providers, saying ‘we want to come to your party’ rather than leading with ‘do you want to come to our party’. Initially they will probably prefer to enhance their site, rather than migrate to safe network. But later they may want to migrate their site to safe too. It will be interesting to see how the incremental approach goes for building relationships with these data providers.

Great to see so many public resources linked here. It’s really exciting to think of the possibilities. I haven’t tried to estimate sizes or costs for these datasets, could be interesting to know too. Although I noticed if you want to host an imslp mirror it requires 100GB free disk space. Not that large.

11 Likes

This is exactly what I hoped we would be digging into for the launch of the Network, and I think we need to work hard to make it happen; when the time comes, it is perhaps something for the Bamboo Garden Fund to consider. I don’t necessarily mean funding the data upload costs, more building the software to make these things happen sustainably and be super useful.

BTW, I think we can get the Internet Archive on board for some specific data sets, and these things are well worth looking at for priming the pump.

More generally though, I think we should examine the workflow around archiving clearnet websites. How can we facilitate that, and make things easy to do in a collaborative public way, and also how do we make it easy to navigate and traverse on the Safe Network?

7 Likes

Just a complete brainstorm here…

  • What if we had a parallel NRS system that mapped to clearnet domains for the purpose of archiving?
  • Perhaps we could allow each domain to have a public wallet, which people could donate into, and then this could be used each time a page from that domain was archived?
  • Could automatic archives be facilitated? A service that watched for a diff on a previously archived page, and saving the new version automatically?
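The third bullet (watching for a diff on a previously archived page) could be approximated by hashing each fetched page and re-archiving only when the hash changes. A minimal sketch; the URL is made up and the fetch/archive steps are stubbed so nothing touches the network:

```python
# Sketch: archive a page only when its content hash changes since the last
# archived version. The archive callback stands in for a real upload.
import hashlib

last_archived = {}  # url -> hash of last archived content

def maybe_archive(url, content, archive):
    digest = hashlib.sha256(content).hexdigest()
    if last_archived.get(url) == digest:
        return False  # unchanged since last archive, nothing to do
    archive(url, content)
    last_archived[url] = digest
    return True

archived = []
def store(url, content):
    archived.append(url)

print(maybe_archive("https://example.com/page", b"v1", store))  # True
print(maybe_archive("https://example.com/page", b"v1", store))  # False
print(maybe_archive("https://example.com/page", b"v2", store))  # True
```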

Anyone else got any initial thoughts/ideas? Fire away!

5 Likes

How to control the use of the wallets is my major concern.

I would think a webcrawler-type APP that also keeps the tree of the site would work.

People can either be an uploader or donate to an uploader who then uses those donations to upload.

This way any random person can run the application and upload pages not existing on Safe.

The issue I see is how to stop corruption of the data: what is to stop a bad actor from changing the data before it is stored?

I would think it has to be an App run by the site, or an agent of the site, to ensure completeness and/or continued updating. Safe really does not have compute, and has no services unless they are written into the core code. Services like this would need to be application-level stuff.

1 Like

Yeah, it’s how we facilitate it from the commons though.

This is a good point. How can we trust what is archived?

Yeah, I don’t think we can just rely on site owners though. The Internet Archive for example wouldn’t function if they did that. Some people won’t want their content archived, despite the fact it would be for the common good for it to be permanently accessible.

Yeah, I’m not suggesting that. It would likely need to be a clearnet browser bridge scenario.

Maybe a separate multi-sig method of building the indexes, so that if corruption is found then, by the weight of others, it can be removed and the correct one can replace it.

I would have thought it would be an APP that anyone can run

By site owners I mean the Internet Archive site for instance, or the Wikipedia site, etc. Other sites have largely dynamic content, and that becomes an issue: things like languages, searchable databases, etc.

My first thought is KISS: maybe just have a fund managed by a non-profit, which copies the data over (using scripts, etc.).

I’m not sure it needs to be too clever; many an archive/data site works in a similar way on the clearnet. It could raise funds through donations too.

Maybe longer term, PtP could help to fund it, if it was integrated too.

1 Like

Whenever I’ve wanted to get mass amounts of pages from a site, I just use scripts, and those could be written now to convert Wikipedia over to the test nets. Obviously it’s not a good idea to do so on the test networks due to the data and time required.
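For what such a script might look like: a minimal version just maps each page URL to a local file path ready for bulk upload, with the actual download step left out so the sketch stays self-contained. The URLs are only examples:

```python
# Sketch: turn page URLs into local file paths ready for bulk upload.
# A real script would fetch each URL (e.g. with urllib.request.urlopen)
# and write the response body to the derived path before uploading.
from urllib.parse import urlparse
from pathlib import PurePosixPath

def local_path(url, root="mirror"):
    """Derive a local file path from a URL, defaulting to index.html."""
    parts = urlparse(url)
    path = parts.path if parts.path not in ("", "/") else "/index.html"
    return str(PurePosixPath(root) / parts.netloc / path.lstrip("/"))

print(local_path("https://en.wikipedia.org/wiki/Main_Page"))
# mirror/en.wikipedia.org/wiki/Main_Page
print(local_path("https://example.org/"))
# mirror/example.org/index.html
```

Keeping the host and path structure intact means the mirrored tree can be uploaded as-is and still mirror the original site layout.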

1 Like