Safe-Search, bringing content discovery to the SAFE network


#21

Great work! It’s really good to see these new apps being developed by you guys!


#22

You know, rate-limiting shouldn’t be an issue in this environment (except for links leading out of safe://).

It was important in the old internet because it involved being kind to the exclusive resources of a specific server so that spider/robot requests didn’t disrupt other visitors or use up a large amount of a paid resource (bandwidth, etc) quickly.

In safe, the crawling doesn’t impede other users’ ability to access data resources. However, you may still want to rate-limit in order to manage your own network-write costs.


#23

Our rate limiting plans aren’t based on domain, but rather on total daily usage. Rather than blitz through our crawl in 10 minutes, hitting the network in a really bandwidth-intensive way, we plan to amortise our crawls (in the early days while the vault network is still small) across the entire 24-hour pre-index period, so the network doesn’t see any unusual data-usage spikes.
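
To put rough numbers on that: spreading a crawl evenly over the day just means dividing the 24-hour window by the number of requests and pausing that long between fetches. A minimal Python sketch of the idea follows; `crawl_interval`, `paced_crawl` and the `fetch` callback are hypothetical names for illustration, not anything from the actual Safe-Search code.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def crawl_interval(total_requests, window_seconds=SECONDS_PER_DAY):
    """Seconds between request starts so they spread evenly over the window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return window_seconds / total_requests


def paced_crawl(urls, fetch, window_seconds=SECONDS_PER_DAY, sleep=time.sleep):
    """Call fetch(url) for each URL, pacing the calls across the window.

    `fetch` is a caller-supplied function (hypothetical here); `sleep` is
    injectable so the pacing can be tested without actually waiting.
    """
    urls = list(urls)
    if not urls:
        return
    interval = crawl_interval(len(urls), window_seconds)
    for url in urls:
        started = time.monotonic()
        fetch(url)
        elapsed = time.monotonic() - started
        # Only wait for whatever is left of this request's time slot.
        sleep(max(0.0, interval - elapsed))
```

For example, a crawl of 8,640 requests per day works out to one request every 10 seconds, instead of a single burst.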


#24

Oh, that’s a good point. It wouldn’t be demands on specific servers but the network as a whole – especially in earlier stages when the infrastructure is still developing. Good call.


#25

@Shane @AndyAlban and any developers interested in LinkedData / RDF, see the ‘Solid Application Data Discovery’ link I just added to my post above


#26

Missed this thread last week. Nice work! Great to see all the apps coming in recently :tada:

Couple of Qs:

How is the data being stored? What sort of data structure is being used? (As @happybeing noted, is it available as a datatype or in a structure readable by anyone/any app? It would be great if we as a community could come up with some standardized data structures for search.)

Do you have a typetag you’re using for the data (or plan to use one)?


Some other thoughts:

Toootally.

A centralized crawler/uploader/search app shouldn’t really be needed for SAFE, in my (super idealised safe-world) opinion. Is there anything blocking users from doing some crawling and uploading data sets? (eg. using a specific tag / or using a public MD with open read/insert as you suggest, that could enable anyone to crawl, and add to the data set, I think).

This also allows users to own their data that they crawled. Which is another great benefit of SAFE IMO.

Crawling and searching don’t necessarily need to be the same application/site (at least, as @neo points out, is the intention with APP rewards). They could very much be two separate things.

As long as the data is open/available and standardised in some way (which we should figure out just how that would be*), then there’s a lot of flexibility brought in for users / consumers / other devs to build upon (just as search indexes).


Heh, @Shane @AndyAlban sorry for the barrage of Qs here. I’ve been thinking about search on safe / what that might mean for some time (but never got round to tidying up my dabbles :expressionless: ). Great to see work coming out in this area! I’d love to hear what you think on the above.


(* There are some threads on this: indexed data; an idea built around some older network data types, but a lot should still be applicable).


#27

I’m also very interested in playing with this once you publish it @Shane @AndyAlban, it’s really good someone is working on these ideas already!

While reading the posts, the first thing I started wondering is how, in the long term (not necessarily in the first versions), I/users will trust that the results are ranked by relevance and not by any other interests.
As a user of a decentralised network, I wouldn’t like to (and probably won’t) use a search tool if I cannot be sure the results are not being manipulated in some form; it feels like I’m losing part of the decentralisation aspect. Thus, how can I be sure the crawler you run on your servers to populate the network with search indexes and ranks is not running a modified version of the known algorithm? It sounds like the execution of this algorithm should be decentralised somehow, or at least to a certain level. Perhaps the search site could provide the option to run a crawler, as has been suggested above, plus the option to just publish specific sites/files to be indexed.


#28

I’d always been imagining something more personal to affect the results of search. Either some custom params you can tweak, or, (more super idealised) like using web of trust to enhance the results from people you care about. Or people you deem to be well informed. (somewhat like the idea of liquid democracy).

A prominent scientist having more weight / followers on science topics for example. And all they have to do is ‘like’ a site or something for that weight to be carried across into the algo.

While that makes individual page ranking unlikely, that could be left to the sites themselves. So maybe a Stack Overflow site has a lot of stars/likes/whatevers in general, and so for programming it’s got a good ranking, and then our search could pick up the site’s self-made index and search that.
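
The weighting idea above can be sketched roughly like this in Python. Everything here is hypothetical: the `site_scores` function, the shape of the `likes` and `trust` data, and the choice to simply add up trust weights are assumptions for illustration, not an existing web-of-trust API.

```python
from collections import defaultdict


def site_scores(likes, trust, topic):
    """Score sites for one topic by summing the trust of whoever liked them.

    likes: iterable of (person, site) pairs, one per 'like'.
    trust: {person: {topic: weight}}, the searcher's own view of how much
           each person's opinion counts on each topic.
    """
    scores = defaultdict(float)
    for person, site in likes:
        # People the searcher has no opinion about contribute nothing.
        scores[site] += trust.get(person, {}).get(topic, 0.0)
    return dict(scores)
```

So if I give a prominent scientist a weight of 3.0 on science and a stranger 1.0, a site liked by both scores 4.0 for science, while a site liked only by the stranger scores 1.0.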


#29

I will be posting (hopefully) satisfactory answers to this over the next couple of days - I’m working on client work at the moment, but I have seen this and plan to reply soon. :slight_smile:


#30

See my message here in a discussion about embedding thumbnails of SAFE sites on the forum. Is the site crawler you have indexing for search also gathering thumbnails of any kind that could be used to show what a site looks like underneath its link?


#31

@Shane @AndyAlban @bochaco @joshuef, I have also given the question of a ranking system some thought. The barrier to entry will probably be lower on the Safe network than on today’s internet, because you will not need the hardware infrastructure that you need on the internet to compete with Facebook or other large entities. Compare the stock market with the coin market, where the coin market has a lot lower barrier to entry. The lower barrier to entry opens things up for actors that are not serious, or are trying to scam, and so on. A ranking system will therefore probably be beneficial to the Safe network, and I think it would also be good to try to block bots that try to manipulate the ranking system. Solutions could be votes for ranking with a captcha, or time spent on sites, or similar; maybe others have different solutions that will work out great.


#32

@tobbetj have a look at the web of trust that Project Decorum is aiming to implement.

https://www.project-decorum.com/endorsements-sharing/


#33

Definitely agree with this; it was along the lines I was thinking too.

Perhaps a standard crawled data format can be established, then created by the user, so that many search engines can read these at will? Something like a json based index file.
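
Just to make that concrete, here is a rough Python sketch of what writing and reading such a JSON index file could look like. No such standard exists yet as far as this thread goes, so the `CrawlRecord` fields (`url`, `title`, `terms`, `links`), the `version` key and the function names are all made up for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CrawlRecord:
    """One crawled page. All field names are hypothetical."""
    url: str
    title: str
    terms: dict = field(default_factory=dict)  # word -> occurrence count
    links: list = field(default_factory=list)  # outgoing URLs found on the page


def dump_index(records, version=1):
    """Serialise crawl records to a JSON string any search app could read."""
    payload = {"version": version, "pages": [asdict(r) for r in records]}
    return json.dumps(payload, sort_keys=True, indent=2)


def load_index(text):
    """Parse a JSON index string back into CrawlRecord objects."""
    data = json.loads(text)
    return [CrawlRecord(**page) for page in data["pages"]]
```

A site owner (or their CMS) could produce one of these per site, and any search engine could then call something like `load_index` on it without needing to know who did the crawl.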

There must be some sort of format that crawlers already output after going through a site, but I would assume this is proprietary?

Obviously, validating that the results are honest will be a challenge, but presumably a bad search map could result in the site being omitted from searches, etc.


#34

Yeh, I think if search data is PUT by a specific individual and we have web of trust, then this becomes a lot easier. If you put bad data, then your rep is at stake.

Yeh, I think this is needed. I was debating starting a fresh thread for that, but wanted to wait and see what @Shane and @AndyAlban are doing so far. As that’s probably a good starting point for such a thing.


#35

Hi guys, wanted to address these questions and concerns.

Centralisation

First of all, with regards to centralisation of the crawling: I’m against releasing a product which requires users to give something in order to use it. Ultimately, most of the people using the SafeNet will be doing it in a read-only way (and will have no Safecoin) if we plan for the SafeNet to have a large amount of usage. Requiring that users do crawling and upload the results from their machine (even in the background) will have an associated cost, and this model really doesn’t mesh well with my idea of releasing free and unencumbered software.

Second to this, it could be optional, but then I’m relying on the generosity (and willingness) of other people to take part in the indexing program, which isn’t guaranteed. It also increases complexity significantly: creating a distributed index and a confirmation algorithm which achieves quorum, not only for the contents of a webpage but also for the finalised index results, will take a long time. That is time which I believe could be better spent elsewhere.

The simple truth is that building a centralised crawler and indexer is a lot quicker and easier. I could spend the next year of my life working on a decentralised alternative with a bunch of white papers, or I could build 6 more useful products for the users of the network. In my opinion, the network would get more value from more user friendly tools than it would from a search engine with an associated cost.

Don’t get me wrong, I love the idea of free and open software, but I feel like the software can be provably correct without having to introduce all this extra development time, complexity, user cost, etc.

The plan at the moment is to make publicly available both the pre-indexed crawl results (page contents, extracted links, inferred category of the page and website, frequency of update, etc.), and the algorithm used to generate this, as well as the post-index ranking of pages, and the algorithm used to generate this. Since all these will be publicly stored on the network, verifying the neutrality of the crawl / indexes should be trivial work.

I am happy to have the conversation publicly though about the structure of these search indexes so that other programs can use them in the future.

Search algorithms and structure

With regards to how the index will actually be generated, the rough idea is something as follows:

  • When the crawler loads a page, it will extract the text content of the page and remove words from a stop-word list based on the language of the current page.
  • The extracted words will be ordered based on occurrence (and perhaps in a future version, context and sentiment analysis - “ideas” are more useful than words, since many different words can express the same context).
  • These words will form the core of an inverted index, which will be the primary data structure used in the index.
  • Each entry in the inverted index will point to a list of related pages, pulled from a central page-rank-ordered list of tuples, each an ordered pair of a URL and a page-rank.
  • Since multiple words form a search query, the search query will pull out the most relevant pages from each word (again, in the future, context and sentiment analysis would be a nice addition) and order them all based on search rank, weighted against relevancy to the inverted indexes searched.
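
The steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the tiny `STOP_WORDS` set, the function names, and the scoring rule (term count multiplied by page-rank, summed over the query words) are placeholders, not the final Safe-Search design.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list; a real one would be chosen per language.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}


def extract_terms(text):
    """Lower-case the text, split it into words and count the non-stop-words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def build_index(pages):
    """Build an inverted index {word: {url: count}} from {url: page text}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term, count in extract_terms(text).items():
            index[term][url] = count
    return index


def search(query, index, page_rank):
    """Return URLs ordered by summed (term count * page-rank), best first."""
    scores = defaultdict(float)
    for term in extract_terms(query):
        for url, count in index.get(term, {}).items():
            scores[url] += count * page_rank.get(url, 0.0)
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))
```

Usage would look something like `search("network search", build_index(pages), page_rank)`, where `pages` maps each URL to its extracted text and `page_rank` maps each URL to its rank.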

This project is in its infancy (it has only been 2 weeks since announcement and I’m not yet working on this full-time - only once safe-CMS is finished in 2 weeks will I be dedicating significant time to it) and more details will come out soon, especially with regards to documenting the data structures in their entirety, but this is, for me, a good jumping off point.


#36

Thanks @Shane!

Fair enough. It’s hard to argue that a fully decentralised search won’t take waaaayyy longer. And it can be something that we as a community could work towards, so having something workable until then is also suuper valuable :+1:

But excellent that you’re happy to keep indexes open and I think whatever you’ve got will be a great jumping off point for discussing this (as indeed is this thread for me!). If we can make a good start on it, the content produced/indexed will be invaluable for safe-search or any search built for the network down the line. (which is something I find really exciting).


I don’t think this is the case. The idea (as I understand it) is that sharing some space should be easy enough to do in the end that a passive income of some small amount of safecoins should be feasible for most anybody (indeed, eventually farming on phones could be a reality).

That doesn’t really affect much with your stated goal of simple content discovery, at this point :slight_smile: but just to offer another view of how the network may operate.


Thanks for the response, @Shane! I’m looking forward to hearing more and seeing how this all progresses.


#37

No problem with any of this @Shane, centralised is pragmatic for now, as there is no alternative.

In time I can see the decentralised approach taking over for the reasons we are discussing, and I don’t see this as against your open ‘unencumbered’ values. Rather than requiring people to contribute we can devise ways to reward those who crawl and submit useful results, while use of the results can always remain free.

It seems to me that PtP could provide a simple mechanism for this, with little or no extra work for apps that want to crowdsource resources from users. The network would automatically reward those who provided results that others find useful.


#38

My thoughts were just to get the content uploader to crawl their own sites, then upload the crawl results too (say, as part of your CMS). As this act will increase traffic to their site, it would seem worth an extra PUT or two.

Anyway, there is space for all different takes on the problem and the intermediate crawling format would be great to see published. It seems to me that there is space to do things in many ways, some of which will be unique to SAFENetwork.


#39

People here are just so nice! I just love the way e.g. @Shane asks the community about including a simple little icon in the incredibly important Safe CMS, while at the same time apparently working very hard and professionally. Of course we have to be pragmatic. We’re not just hippies with our heads in the clouds. Thank you, @Shane and everybody else. You all put a smile on my face. :slight_smile:


#40

I agree there should be a canonical, centralized search index. At the same time, there’s something that gets me extremely excited about an open format for search indexes, which would allow for alternative indexes, e.g. one that prefers content by marginalized creators, open-source software, non-profits over corporations, etc. (these are all simplified ideas of course). As a first step towards a foundation for decentralized search indexing, would it be that much overhead to standardize the format of the index results this project generates?