Public Datasets on Safe Network

I think one way to handle an internet archive on Safe Network is to create a path to work with the Internet Archive. We provide everything needed to encourage or assist them with whatever they are prepared to do at any point (starting from nothing, and showing them the opportunity).

What they are willing or able to do should increase as the path gets clearer and smoother, and the utility increases. They have the technology to do this, and an enormous historic dataset of the web. Safe Network can handle the storage and retrieval aspect at much reduced cost, and give a new angle on perpetuity that will need to be proven but should be attractive to them.

They have systems built to capture the data and know how to structure and retrieve it, so it would be good if, by working together, we each make the most of the other’s strengths.

8 Likes

https://wiby.me/ - I ran into this since seeing this thread. We could reach out to whoever runs the search engine, who must have these sites stored somewhere conveniently. It’s an amazing resource of old 90s-type websites. I can’t find out how many there are, but there’s a link to a Twitter account where the developer is, I guess, contactable.

They have public APIs to help with exports/imports too.

https://archive.org/services/docs/api/
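
As a taste of how little glue would be needed, the Archive’s item metadata endpoint just returns JSON. A minimal sketch against the public `https://archive.org/metadata/<identifier>` endpoint (the identifier below is a placeholder, not a real item):

```python
import requests

# Public Internet Archive metadata API: https://archive.org/metadata/<identifier>
# "example-item" is a placeholder - swap in any real item identifier.
identifier = "example-item"
resp = requests.get(f"https://archive.org/metadata/{identifier}", timeout=30)
resp.raise_for_status()
item = resp.json()

# The response carries the item's metadata plus its file listing - exactly
# the information a mirroring job onto Safe Network would need.
print(item.get("metadata", {}).get("title"))
for f in item.get("files", []):
    print(f.get("name"), f.get("size"))
```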

3 Likes

I didn’t know that, but it makes sense they would, and it gives us a nice route. No need to write, maintain and run our own scrapers, or scratch our heads over how to structure the data and APIs we build for storage and access (well, less scratching anyway). Those are massive wins.

They also give us something good to demo and begin a first-stage collaboration, with clear dividing lines between the IA and SN roles. Then there’s a nice path towards them building use of Safe Network into their systems, first as a parallel system, and migrating in that direction over time.

3 Likes

It’s pretty large though: in 2018 it was 40 petabytes, or 40 million gigabytes, and even larger now. Getting it over to SAFE is going to be a huge and expensive undertaking. Maybe the best way would be to make a prototype with something like a few thousand webpages from the archive and then a frontend to browse it. Once they see it’s working and SAFE has been running a while to prove itself, they could upload the data to SAFE themselves.

5 Likes

That’s what I had in mind but I guess not said as clearly! :grinning_face_with_smiling_eyes:

2 Likes

Just so that it is also in the list here: All pages - Distributed Denial of Secrets

and https://wikileaks.org/

and other similar leak type sites. The problem would of course be, as Mav mentioned, making them searchable and organised. I’m sure there’s plenty of journalists who would be happy to get involved here with this particular type of material.

1 Like

Scihub (whichever incarnation works at present). I think the whole history of scientific publication would be right up our alley.

4 Likes

I was thinking about the Khan Academy content:

They are a non-profit that creates and distributes educational content under a Creative Commons licence. Their content has been broadcast (receive-only) by the Othernet project - not sure if they still do that.

4 Likes

I believe we already might have such a data set from them, from the last time we talked through a collaboration.

So yeah, I think we probably need to spark that collaboration back up now the shape of things is becoming clearer.

4 Likes

Many datasets are also just available as SQL dumps or other formats that wouldn’t be directly usable on SAFE. In these cases they’d have to be converted to JSON, or preferably RDF, to be usable. Getting that done could be a great way to show the maintainers of these datasets how SAFE could be usable. In the end it would be great to get people to store data on SAFE, keeping just frontend proxies for the clearweb use cases. In the beginning much of this may be up to SAFE app developers, uploading the datasets they need for their apps. Uploading OpenStreetMap, for example, essentially requires developing a SAFE mapping app at the same time, to ensure the data is actually in a usable format; otherwise it’s just a backup.
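
The conversion step itself is usually small. A minimal sketch, assuming the SQL dump has already been imported into a local SQLite file, with made-up table and column names:

```python
import json
import sqlite3

# Assumes the SQL dump has been imported into a local SQLite database;
# the table and column names are placeholders for whatever the dataset uses.
conn = sqlite3.connect("dataset.db")
conn.row_factory = sqlite3.Row

records = [dict(row) for row in conn.execute("SELECT id, name, value FROM items")]

# Plain JSON is enough to make the data addressable on Safe;
# adding @context/@id keys would turn it into JSON-LD, i.e. RDF.
with open("items.json", "w") as f:
    json.dump(records, f, indent=2)
```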

A lot of this stuff will be much easier and more usable once SAFE has support for RDF and SPARQL. That’s when you can really start using SAFE as a database.

Stuff like the datasets on Kaggle could be used right at the beginning though. They’re files containing data used for machine learning, all just organized as files and folders. On SAFE you could even have them mounted as a folder and browse through the datasets as if they were already on your computer.

Yeah, that could be very useful, and it’s all available as torrents. It’s huge, so downloading all the torrents is out of reach for most people, but if all the torrents were downloaded, extracted and uploaded to SAFE it would be easily available to anyone. Having all papers available on a virtual drive would also enable data mining of scientific papers on a scale that has never been possible before. Data mining now is either done on just the abstracts of papers, or some subset of papers is downloaded and mined.

A sci-hub SAFE app would also be relatively simple: at the core you just need to convert the index Sci-Hub keeps, mapping clearnet URLs of papers to filenames, into a map of clearnet URLs to Safe XOR URLs.
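
In code that re-keying step is tiny. A toy sketch, where both the index entries and the Safe XOR URLs are made-up placeholders standing in for whatever the real dumps and uploads produce:

```python
# Sci-Hub-style index: clearnet URL (or DOI) -> filename in the torrent dumps.
clearnet_index = {
    "https://doi.org/10.1000/example.1": "10.1000%2Fexample.1.pdf",
    "https://doi.org/10.1000/example.2": "10.1000%2Fexample.2.pdf",
}

# Filled in as each file is uploaded; the XOR URLs here are made-up placeholders.
uploaded = {
    "10.1000%2Fexample.1.pdf": "safe://hyfktce8xq-placeholder-1",
    "10.1000%2Fexample.2.pdf": "safe://hyfktcm2rp-placeholder-2",
}

# The app's lookup table: clearnet URL -> Safe XOR URL.
safe_index = {
    url: uploaded[filename]
    for url, filename in clearnet_index.items()
    if filename in uploaded
}
print(safe_index)
```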

It’s gonna piss off some large publishers though and might make them want to come after SAFE, so perhaps it would be best if it’s not available on SAFE too early?

4 Likes

What makes RDF better? This is still not clear to me - what kinds of formats are best for Safe, to make data useful, organisable and indexable? Any ELI5 appreciated.

1 Like

RDF and SPARQL are on the SAFE roadmap already.

You can actually store RDF as JSON too. RDF is a standardized data model that is useful for sharing data across applications and linking different datasets together.

For example, if you look up an artist in MusicBrainz, you may see a link to Wikipedia, and that links it together with Wikidata. You can then use the query language for RDF, called SPARQL, to ask things of these two datasets almost as if they were one single dataset.

There’s no such standardization with all the random JSON data from APIs and such around the web.
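
As a toy illustration of “RDF stored as JSON”: the sketch below uses rdflib to parse a small, made-up JSON-LD document and run SPARQL over it locally. A real example would link a MusicBrainz artist to its Wikidata entry; the IDs here are placeholders.

```python
from rdflib import Graph

# A tiny JSON-LD document: RDF expressed as JSON. All IDs are placeholders.
doc = """
{
  "@context": {"name": "http://schema.org/name",
               "sameAs": {"@id": "http://schema.org/sameAs", "@type": "@id"}},
  "@id": "https://musicbrainz.org/artist/example",
  "name": "Example Artist",
  "sameAs": "https://www.wikidata.org/entity/Q0000000"
}
"""

g = Graph()
g.parse(data=doc, format="json-ld")  # JSON-LD parsing is bundled in rdflib >= 6.0

# SPARQL runs over the same graph; linking datasets means queries can follow
# sameAs links from one dataset into another.
q = """
SELECT ?name ?other WHERE {
  ?artist <http://schema.org/name> ?name ;
          <http://schema.org/sameAs> ?other .
}
"""
for name, other in g.query(q):
    print(name, other)
```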

2 Likes

Open NASA.

I like these ideas a lot.

  • A parallel NRS - is the point to avoid domain squatting? I wonder how the browser would/should resolve duplicate URLs from both NRSes.

  • Each site having its own public wallet to draw upload costs from is a cool idea. I think the site scraping and uploading could be automated, and for the wallet that seems like some of that magical NAL David has talked about. If the clearnet site changes, then charge the Safe site’s wallet (whose owner is just a faux ID the site itself holds the keys to?), or if it’s empty when an update is needed, publish a message that the site’s archive is behind - maybe the wallet has a small reserve rule for that error message. Basically I think we need NAL and some imagination to make some cool smart wallets that can connect to anything: a site, IoT, a cart, subscription services, budgeting plugins, etc.

  • I think there should be an easy open-source framework for client-side apps that look at clearnet addresses, download and store the content locally in a temp file, allow for some other process (using other clearnet APIs to add metadata, sorting, etc.), and then upload to Safe Network. That way it could be literally any content, and the side process (think of a side-chain plugin in a music DAW) could do some web-page scraping or formatting magic before the upload. Basically I have this idea of the app being an Event Horizon into Safe Network: a black hole that swallows and organizes data before spewing newly organized, linkable, perpetual data into the safeiverse. (A rough sketch of this pipeline follows the list.)
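
A rough sketch of that pipeline, just to show its shape. The `enrich` step is a stub for the side process, and the final step assumes an upload command roughly like the Safe CLI’s `safe files put` - treat that as a placeholder for whatever the real client API turns out to be.

```python
import subprocess
import tempfile
from pathlib import Path

import requests


def enrich(html: str) -> str:
    """Stub for the 'side process': scraping, metadata, formatting, etc."""
    return html


def archive_to_safe(url: str) -> None:
    # 1. Download the clearnet content.
    html = requests.get(url, timeout=30).text

    # 2. Store it locally in a temp file while the side process runs.
    with tempfile.TemporaryDirectory() as tmp:
        page = Path(tmp) / "index.html"
        page.write_text(enrich(html), encoding="utf-8")

        # 3. Upload to Safe Network. Placeholder: assumes a CLI roughly like
        #    `safe files put <path>`; swap in the real client API here.
        subprocess.run(["safe", "files", "put", str(page)], check=True)


archive_to_safe("https://example.com/")
```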

I think the trick is, how do we know the sites or data are provably mirrored? I don’t know that we can. The software or app could alter data, and when you visit the Safe site you’d be seeing a different reality.
So perhaps this app is open-sourced and audited to work as promised, and is the only app with access to uploading to the parallel NRS.

Hopefully I’ve added something riffing off your great ideas. :+1:

To store this on Filecoin, which currently has a price of $0.0007773 per GB per year, and assuming permanent storage is 5x the annual cost, 40 PB would cost about $163K to store. Not so bad considering they pay about $1M in computer expenses each year (see archive.org’s IRS filing, p. 10).

Current default node size is set to 2 GB so if all nodes ran at that setting it would take about 21M nodes to store a single copy of the 40 PB archive.
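
A quick back-of-envelope check of those two figures - assuming “40 PB” is read in binary units, i.e. 40 × 1024² GiB, which is what reproduces the numbers quoted above:

```python
# Filecoin price quoted above, plus the 5x multiplier for "permanent" storage.
price_per_gib_year = 0.0007773
archive_gib = 40 * 1024**2           # 40 PB read as 40 * 1024^2 GiB

storage_cost = archive_gib * price_per_gib_year * 5
print(f"~${storage_cost:,.0f}")      # -> about $163K

# Node count for a single copy at the current 2 GB default node size.
nodes = archive_gib / 2
print(f"~{nodes / 1e6:.0f}M nodes")  # -> about 21M nodes
```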

Interesting to see the practical side of these ideas coming into play.

7 Likes

An interesting feature of Filecoin to note:
“For petabyte-scale datasets and larger, the most sensible solutions often involve shipping data on hard drives. The Filecoin protocol and project has tools and structures to support what’s known as offline data transfers.”

2 Likes

Na, it’s not to do with domain squatting, just purely as a method for archiving clearnet sites on the Safe Network, and folk knowing where to find them.

So something like safe-archive://www.bbc.co.uk or whathaveyou.

I was thinking the wallet and site would effectively be publicly owned, with the funds only to be used for archiving pages for that domain. It can be topped up by anyone.

Then if you are on the clearnet www.bbc.co.uk, some extension or app shows that the page you are visiting hasn’t been archived to Safe yet, and you hit the button to make that happen. If there are enough funds in the wallet, then it’s all good. If not, you get a cost for that page, which you can pay to archive it then and there.
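
The decision flow behind that button could be as simple as the sketch below. Every name in it (`wallet_balance`, `quote_upload_cost`, `archive_page`, `prompt_user_payment`) is a hypothetical stub - none of this API exists yet.

```python
# All of these are hypothetical placeholders; no such Safe APIs exist yet.
def wallet_balance(domain: str) -> float:
    return 0.5   # pretend the public archive wallet for this domain holds 0.5

def quote_upload_cost(page_url: str) -> float:
    return 0.8   # pretend the network quotes 0.8 to store this page

def archive_page(page_url: str, pay_from: str) -> None:
    print(f"archiving {page_url}, paid from the {pay_from} archive wallet")

def prompt_user_payment(page_url: str, amount: float) -> None:
    print(f"wallet short by {amount:.1f}; ask the visitor to pay that to archive {page_url}")

def handle_archive_request(domain: str, page_url: str) -> None:
    funds = wallet_balance(domain)
    cost = quote_upload_cost(page_url)
    if funds >= cost:
        archive_page(page_url, pay_from=domain)          # the public wallet covers it
    else:
        prompt_user_payment(page_url, amount=cost - funds)

handle_archive_request("www.bbc.co.uk", "https://www.bbc.co.uk/news/some-article")
```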

Yeah this is probably the trickiest bit. I don’t know the answer. Just some ideas!

I do take the point about working with the Internet Archive to solve some of these problems. They are trusted to do this already. However, if there is a way to do it in a decentralised way, without the need to trust that authority, so much the better.

But there may be a middle way too.

But would Filecoin provide what they are after in terms of redundancy and the general SLAs they’d be looking for?

Yeah, there will certainly be differences in pricing approach. However, factoring in global accessibility too, Safe might be more appealing for things like the IA.