Safe-Search, bringing content discovery to the SAFE network

Description
I have begun development of a SafeNet crawler and search indexer, which will form the backbone of a SafeNet search engine application allowing users to discover public content.

Motivation
Allowing users to search for content will be advantageous to SafeNet users, especially with the impending release of Safe-CMS: that project enables content creation, and this one will enable content discovery.

Progress so far:
We have created a basic web crawler which follows links it finds on web pages to discover other web pages, domains, etc. At the moment it's crawling the HTTP internet (as prototyping it this way was simpler while we work on CLI libraries for interfacing with the SafeNet). This is rudimentary, of course, but it's a good first step. Here is a little demo:
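
To give a flavour of how that kind of link-following loop works, here is a heavily stripped-down sketch (not our actual prototype code; it assumes the `reqwest` crate with its `blocking` feature, and uses a naive `href` scan instead of a real HTML parser):

```rust
use std::collections::{HashSet, VecDeque};

// Naive link extraction: scan for href="..." attributes. A real crawler
// would use a proper HTML parser and resolve relative links too.
fn extract_links(html: &str) -> Vec<String> {
    html.match_indices("href=\"")
        .filter_map(|(i, m)| {
            let rest = &html[i + m.len()..];
            rest.find('"').map(|end| rest[..end].to_string())
        })
        .filter(|link| link.starts_with("http"))
        .collect()
}

fn main() {
    let mut queue: VecDeque<String> =
        VecDeque::from(["http://example.com/".to_string()]);
    let mut seen: HashSet<String> = HashSet::new();

    // Breadth-first crawl, capped at 100 pages for this sketch.
    while let Some(url) = queue.pop_front() {
        if seen.len() >= 100 || !seen.insert(url.clone()) {
            continue;
        }
        let body = match reqwest::blocking::get(url.as_str()).and_then(|r| r.text()) {
            Ok(body) => body,
            Err(_) => continue, // skip unreachable pages rather than aborting
        };
        println!("crawled {} ({} bytes)", url, body.len());
        for link in extract_links(&body) {
            if !seen.contains(&link) {
                queue.push_back(link);
            }
        }
    }
}
```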

In addition, we've drawn up an application diagram detailing which software already exists and what we still need to build. In simple terms (though we've privately gone into much more depth), we need to build the following:

  • A multi-threaded SafeNet crawler which generates and stores a structured graph, with nodes representing SafeNet pages and directed edges representing links (familiar to those with experience of graph theory)
  • A process which filters, prioritizes and generally manages the queue of uncrawled URLs, as well as maintaining internal lists of recently crawled, redundant and malicious URLs
  • A process which takes the structured graph and calculates the "page-rank" of each node based on content, inherited page-rank from in-edges, etc. (a simplified sketch follows this list), and stores the search indexes on the SafeNet as mutable data, to be consumed by:
  • An app which downloads the compressed indexes (as needed) and allows searching SafeNet websites locally.
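
To make the graph and "page-rank" items concrete, here is a minimal sketch of the idea - link structure only, ignoring the content-based scoring, with illustrative damping and iteration values (the real implementation will be considerably more involved):

```rust
use std::collections::HashMap;

// Adjacency list: each node (page URL) maps to the pages it links to.
type Graph = HashMap<String, Vec<String>>;

/// Simplified PageRank over the link graph: every node starts with equal
/// rank, and on each iteration redistributes its rank evenly along its
/// out-edges, damped by the usual factor.
fn page_rank(graph: &Graph, damping: f64, iterations: usize) -> HashMap<String, f64> {
    let n = graph.len() as f64;
    let mut ranks: HashMap<String, f64> =
        graph.keys().map(|k| (k.clone(), 1.0 / n)).collect();

    for _ in 0..iterations {
        // Everyone keeps the "teleport" share; link shares are added below.
        let mut next: HashMap<String, f64> =
            graph.keys().map(|k| (k.clone(), (1.0 - damping) / n)).collect();
        for (page, out_links) in graph {
            let share = ranks[page] / out_links.len().max(1) as f64;
            for target in out_links {
                // Link targets we never crawled as pages are ignored here.
                if let Some(r) = next.get_mut(target) {
                    *r += damping * share;
                }
            }
        }
        ranks = next;
    }
    ranks
}

fn main() {
    let mut graph: Graph = HashMap::new();
    graph.insert("safe://a".into(), vec!["safe://b".into(), "safe://c".into()]);
    graph.insert("safe://b".into(), vec!["safe://c".into()]);
    graph.insert("safe://c".into(), vec!["safe://a".into()]);

    for (page, rank) in page_rank(&graph, 0.85, 20) {
        println!("{page}: {rank:.4}");
    }
}
```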

Stuff which already exists and will be utilized:

  • The queues will be simple priority-weighted key/value Redis instances (see the sketch after this list)
  • The storage for the graph structure will initially be done in MySQL, but once complexity increases (and the size of the index ramps up as the SafeNet becomes more popular) we'll likely look at a more suitable tool - we're trying to minimize complexity for the original release.
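
As a sketch of how the queue piece might look, a priority-weighted URL frontier maps naturally onto a Redis sorted set, where lower scores pop first. This assumes the `redis` crate, and names like `crawl:frontier` are purely illustrative:

```rust
use redis::Commands;

fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;

    // Enqueue URLs with a priority score (lower score = crawl sooner).
    let _: () = con.zadd("crawl:frontier", "safe://shane", 1.0)?;
    let _: () = con.zadd("crawl:frontier", "safe://some-deep-page", 10.0)?;

    // Pop the highest-priority URL (the member with the lowest score).
    let popped: Vec<(String, f64)> = redis::cmd("ZPOPMIN")
        .arg("crawl:frontier")
        .arg(1)
        .query(&mut con)?;
    if let Some((url, score)) = popped.first() {
        println!("next to crawl: {} (priority {})", url, score);
    }
    Ok(())
}
```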

The indexing and crawling process will (for now) run on the (gigabit-fiber connected) home PC which @shane will be using to run a permanent vault when the network officially goes live.

Cost of updating the public search index will (hopefully) come from the AppDeveloper wallet attached to Safe-CMS, but I'm estimating fairly low costs for storage of the compressed indexes, so if Safe-CMS doesn't make any Safecoin, it's not a big expense.

Estimated release date for this is some time in early-to-mid April. This seems like quite a long way off, but a lot of the work is waiting on @shane to finish up Safe-CMS so he can dedicate extra time to completing this.

Co-authors:
Shane Armstrong (@shane)
Software engineer with six years' experience building massively scalable web applications in Rust and PHP 5/7. Author of Safe-CMS.

Andy Alban (@AndyAlban)
Frontend software developer with experience building React applications and PHP7 server backends.

64 Likes

Great work by the looks of it. Excellent.

Have you considered whether there is any way to crawl sites that are not linked from any other? Basically, these are sites that people visit by word of mouth, or from other such sites forming a closed network. In other words, there is no starting point of links into these sites or this closed set of sites.

In the traditional web, the registrars provide a list of newly registered domain names for search engines to crawl and for others to update their records (or for info purposes).

But in SAFE, users register their (decentralised) names themselves, and no one is notified, nor can the names be discovered. The SAFE DNS is not searchable, since you need to know the name in order to get the SAFE DNS record.

One way I thought of is to define a protocol for users to announce their SAFE DNS names to anyone who wishes to know them. Obviously there would be a simple APP implementing the protocol so that the user can easily do this. Then search engines can pick up on the announcements and add them to their crawlers.

Oh, another question.

Will your crawlers be decentralised, or will they require "servers" (computers) left on 24/7 to do this crawling?

I would love to see it decentralised, with basically whoever uses the search APP also doing some of the crawling work.

20 Likes

We HAVE planned for this, as it happens - great question; this was one of our initial concerns! The plan is to create an MDATA directory for which anyone has public READ/INSERT permissions, so people can advertise their websites in a way where the data won't "belong" to us. I don't want us to build a walled garden where we own all the data and SafeNet users are trapped with us. I'll be building optional support for this into the domain/service creation steps of V2 of Safe-CMS. In essence, this directory will be the very first registrar of the SafeNet (a rough sketch of what an entry might look like is below). :slight_smile:
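
For a rough idea of what one announcement in that public-INSERT directory could look like, here's a purely hypothetical record shape (field names and the exact MDATA layout are still ours to finalise; this assumes the `serde`/`serde_json` crates):

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical shape of one announcement in the public "registrar"
/// MDATA directory: keyed by the announced name, valued by a small
/// self-describing JSON blob like this one.
#[derive(Serialize, Deserialize, Debug)]
struct DomainAnnouncement {
    /// The SAFE DNS name being advertised, e.g. "safe://shane".
    name: String,
    /// Optional human-readable description supplied by the owner.
    description: Option<String>,
    /// Unix timestamp of the announcement, useful for freshness/pruning.
    announced_at: u64,
}

fn main() -> serde_json::Result<()> {
    let entry = DomainAnnouncement {
        name: "safe://shane".into(),
        description: Some("Personal blog, built with Safe-CMS".into()),
        announced_at: 1_520_000_000,
    };
    // This serialised blob is what an app would INSERT into the shared
    // directory, and what the crawler would later read back as a seed.
    println!("{}", serde_json::to_string_pretty(&entry)?);
    Ok(())
}
```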

In terms of compute, the plan is to migrate this to network compute and pay people who opt in for the resources once compute is supported by the SafeNet, but in the interim it'll have to be centralised, and I'll be covering the cost of this personally.

19 Likes

Of course, you could make an APP that people can run which does, say, X amount of crawling. You then keep an account for each user, and once they have earned enough to be paid, you transfer the coin to their wallet.

Or you could make it so that whenever someone does a search, the search APP crawls a couple of pages as well. Thus they effectively pay for the use of the search APP by doing a little bit of crawling.

The plan is to have PtD rewards, where the developer of an application is paid some reward every time people use the APP (based on the number of "GETs" the APP does) - about 10% of what the farmers get for those "GETs".

11 Likes

Yep, I totally agree, but I was wary of attaching conditions to usage of the search app - right now, on the clear web, search is "free" (unless you count viewing ads). Since our only personal costs will be uploading the indexes (which will be fairly small), and the size of the SafeNet is going to be tiny at the start, I filed this under increased complexity.

It's always easy as a developer to make promises you can't keep; Andrew and I take a more pragmatic view of the apps we build: we'll only build something if we're happy we can deliver it.

14 Likes

SAFE needed someone working on search so badly - thanks, guys. I think you'll do really well in terms of PtD rewards! Bravo :clap:

8 Likes

You guys seem to really get it. The SAFE world is lucky to have you!

12 Likes

Is Safe-FS building a search engine for SafeNet?

@AndyAlban this is great news :clap:

I'd like to suggest that you and @Shane consider adopting LinkedData/RDF for content storage and public indexing, and schema.org in content creation.

Using LinkedData widens the separation between app and data - opening the content up to more and different kinds of app, while enhancing the value of the content by adding semantics and RDF rules (for validation, I think).

I suggest that at some point you (and others) get together and talk to the Solid team to get their thoughts on how to do this - one of them now works on the Qwant search engine.

Meanwhile, I'm making a portability layer that will allow SAFE and Solid apps to work on either platform if they use LDP (Linked Data Platform) for storage operations. There's an API ready to look at whenever you want, but I haven't implemented the storage operations yet.

UPDATE: Related to my suggestion that developers consider LinkedData/RDF to increase the interoperability and usefulness of the data created by SAFE apps, this just appeared in the Solid chat: Solid Application Data Discovery

19 Likes

@anon78865233 Hi, I believe they are, but content discovery is such an important feature for the network that it really needs some level of choice and competition, given the massive scope.

Ultimately, I think that our goals and Safe-FS' goals are diametrically opposed: they're building a company around their offerings (which is fine, developers need to eat!), while we're just trying to provide useful tools for the community in a free, unencumbered, public way.

Our indexes will all be public, as will our "domain registry", so the Safe-FS team (and any other team) is more than welcome to consume them, and hopefully the community gets something out of having multiple teams working on this problem in parallel.

20 Likes

Thanks for this - we'll take a look at these resources and see where they fit into / enhance our current plans. :slight_smile:

8 Likes

Nice approach. I can see this type of work ethic making SafeNet a very successful project indeed.

3 Likes

Dynamic duo @AndyAlban @Shane - this is a top application and will definitely drive adoption. As @neo points out, getting to a decentralized model is crucial.

Great work guys.

13 Likes

I would be curious to get your opinion, and @Shane's, on my plan/suggestion for someone (MaidSafe? Us? We? Me?) to philanthropically dictionary-attack the network in order to set up a "public reserve" of single-word and proper-name domains, so that domain squatting is minimized and access to these common-language names is granted to everyone, forever. You can think of it as an analogy to the "public reserves" for plants and wildlife that are set up to protect endangered species from poachers. This would also give your crawler an initial known set of seeds to expand out from. I would say MaidSafe or the MaidSafe Foundation are the preferred entities to actually do this, but if they don't, I think anyone who agrees with the general idea should band together and form a charity/foundation to get it (or something like it) implemented. I suppose the only creative alternative would be some kind of petname system that you might be able to work into SafeCMS and the Safe Crawler.

1 Like

If I'm being truly honest, I'm against any sort of decentralisation attack.

SafeNet is being designed first and foremost for freedom and anti-censorship. If Maidsafe had control over a central domain-name structure, they would also have control over the service domains under it. Since Maidsafe is subject to both British and European law, as well as foreign copyright laws respected by the EU, this would provide a simple method of censoring things: simply threaten to sue Maidsafe until they remove the offending service from their domain.

I think that we really need to break away from this concept of only certain TLDs being valuable or usable. The (not-so-)recent expansion of gTLDs by ICANN is a good example: these days, a ".travel" domain is just as valid and discoverable as a ".com" domain.

3 Likes

That's not how SAFE DNS works. A public ID controls the base safe://shane, or safe://i-am.shane; anyone could have safe://shane.the-guy-who-made-safecms. And the general idea is not something that MaidSafe would be in control of, just something that would initially be set aside for public use - like a wiki page, but more natural/better/safer. But this is off-topic for this thread; I just wanted to mention it briefly.

4 Likes

Really amazing work! So great that we get more and more projects started on SAFE: one group working on identity management, others on an app store and blog tools, and now on search.

10 Likes

This cannot work, by design. I would have the same issues you have if this were possible. In fact, all that can be demanded is that we stop work on it, but that means prison/exile for me, as I will not stop. The teams are remote, and I would hope they would keep going via community funds or some other mechanism; perhaps somebody would release MaidSafeCoin to them - I am pretty sure they would :wink:

In any case, we cannot stop anything on the network. If anybody did see something where we could, then it would be removed. We should all watch for such things, just in case.

19 Likes

Maybe we don't even want to use a unique naming system any longer…?

There is a pretty old topic about a petname system that I like a lot, to be honest.

A petname system obviously only makes sense if you can share your namespace with your friends and connect them - if you can send another person a link that is valid on both ends…

… A large benefit would be that I would be able to visit your blog by going to safe://shane, no matter how many other Shanes are on this planet, and there couldn't be domain squatting…
But yes, it's very different from what people are used to seeing on the internet…

Just thought I'd mention it - maybe it's something you might think about for a few moments.

PS:

You know, we had some interesting proposals back then - I'll just throw in another idea, by @Seneca.

Pretty interesting as well - if you search the forum, we had other ideas too (but those two left the deepest impressions in my memory).

6 Likes

Since most of the work on this project is fairly heavy back-end code, updates will most likely be fewer and further between than on Safe-CMS, but we have a little update here with some mock-ups we've done for the design. We've gone as simple as possible, with no spots for adverts or extraneous information - people will come to the app to find something, and we should make that journey as quick and easy as possible.

This is the home page of the app (after the SAFE authentication cycle is complete; it will use the same pre-loading screen as the Safe-CMS project):

This is the search results page. Obviously you'll get more than four results per page; this is purely an example:

Our next major development update will be in around two weeks, when we will be posting things like application diagrams and details about how the crawler will work: rate limiting, support for robots.txt and sitemap.xml, etc.

Thanks all, @AndyAlban & @Shane.

34 Likes