Sapp - SAFE Search App

Sapp is a search app for the SAFE network. It will work like a search engine for the web. With Google Search, for example, in the vast majority of cases only the first 10 search results are ever used. Therefore, to simplify the Sapp index, it contains only the (up to) top 100 search results for each query.

The top 100 search results are stored in MutableData objects called RSRs (Ranked Search Results) with the following structure:

id = hash(query)
sreList = ordered list of up to 100 SRE ids

An SRE (Search Result Entry) is an ImmutableData object containing:

id = hash(n + ' ' + query), where n is a sequence number 1, 2, 3…
timestamp = millisecond UTC for when the entry was stored
url = the URL for the search result
text = text snippet for the search entry
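
To make the structures concrete, here is a minimal Python sketch of the RSR and SRE objects. The hash function (SHA3-256), the hex-digest ids and the plain dataclasses are my own assumptions, standing in for the actual MutableData/ImmutableData types:

```python
# Sketch of the RSR and SRE structures described above.
# Assumptions (not specified in the post): SHA3-256 as the hash,
# hex digests as ids, dataclasses standing in for network objects.
import hashlib
from dataclasses import dataclass, field
from typing import List

def rsr_id(query: str) -> str:
    """id = hash(query)"""
    return hashlib.sha3_256(query.encode("utf-8")).hexdigest()

def sre_id(n: int, query: str) -> str:
    """id = hash(n + ' ' + query), where n = 1, 2, 3..."""
    return hashlib.sha3_256(f"{n} {query}".encode("utf-8")).hexdigest()

@dataclass
class SRE:                      # ImmutableData
    id: str
    timestamp: int              # millisecond UTC when the entry was stored
    url: str                    # the URL for the search result
    text: str                   # text snippet for the search entry

@dataclass
class RSR:                      # MutableData
    id: str
    sre_list: List[str] = field(default_factory=list)  # up to 100 SRE ids, ranked
```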

Note that the number of possible SREs for each query is unlimited, so there can be many more search result entries than those listed in the ranked search results.

The ranked search results are pluggable so that one version of the search app can rank SREs based on user votes and another version can rank SREs with a different kind of ranking algorithm, and so on.
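
As an illustration, a pluggable ranking could be a small interface that each version of the app implements differently. The names and the vote-based example below are my own, not part of the original design:

```python
# Sketch of a pluggable ranking interface: each Sapp variant supplies
# its own RankingAlgorithm. VoteRanking is just one example strategy.
from abc import ABC, abstractmethod
from typing import Dict, List

class RankingAlgorithm(ABC):
    @abstractmethod
    def rank(self, sre_ids: List[str]) -> List[str]:
        """Return the SRE ids ordered best-first."""

class VoteRanking(RankingAlgorithm):
    def __init__(self, votes: Dict[str, int]):
        self.votes = votes  # sre_id -> accumulated vote count

    def rank(self, sre_ids: List[str]) -> List[str]:
        # Highest vote count first; unknown entries count as zero.
        return sorted(sre_ids, key=lambda i: self.votes.get(i, 0), reverse=True)
```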

For Sapp to work in practice, a huge number of SREs are needed. Each URL can have thousands (even millions) of possible queries, and each query has its own SRE object. Note also that, for instance, the query “new car models” has a different SRE than the query “car models new”. The idea is that data storage on the SAFE network will be inexpensive, so an enormous number of SREs can be stored in practice, making Sapp feasible.

Anyone can store SREs for any kind of queries and URLs, including spam and false entries. It’s up to the RSR algorithms to rank and select the SREs so as to produce high-quality search results.

17 Likes

Just because most people don’t scroll past page 2 or 3 does not mean nobody looks beyond that. There should be a way to expand the search. Perhaps the better idea would be to allow the user to specify how many search results to show (10, 100, 500, 1,000) and then leave it up to them. Most quick searches would probably be configured for the top 20 or so, maybe 50, but extended searches might go for the top 200 or so. Remember Google pushes a lot down, so sometimes you need to look a few pages deep.

4 Likes

Yes, actually since the RSRs are pluggable, it’s possible to implement all kinds of different sizes. I chose 100 as a value for the first version of Sapp to have something to start with. I read in the RFC for MutableData that each object can contain 1 MB of data. So there is room for a long list of entry ids if needed.

Typically, the user interface will show 10 results on each page, so the client gets the RSR that matches the query and then gets the first 10 SREs. And if the user clicks ‘next page’ or any of the page numbers then the client gets 10 new SREs. So performance-wise there can be thousands of SRE ids in the RSR list and it will still be efficient.
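
In code, the paging could look roughly like this, where get_rsr and get_sre stand in for whatever SAFE client calls actually fetch the objects:

```python
# Sketch of the paging behaviour: fetch the RSR once per query, then
# resolve only the 10 SRE ids needed for the requested page.
import hashlib

PAGE_SIZE = 10

def search_page(query: str, page: int, get_rsr, get_sre) -> list:
    rid = hashlib.sha3_256(query.encode("utf-8")).hexdigest()  # id = hash(query)
    rsr = get_rsr(rid)                          # one MutableData fetch
    start = (page - 1) * PAGE_SIZE
    ids = rsr.sre_list[start:start + PAGE_SIZE]
    return [get_sre(i) for i in ids]            # at most PAGE_SIZE ImmutableData fetches
```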

5 Likes

Why use this rather than a more traditional index, though? You may have to finish the search on the client side, but I think it should be possible to make it work. Then you wouldn’t have to store every single possible query, but instead just every word that exists on any page, and for each word a list of the pages that contain that word. These could then be optimized to work well with SAFE and split into pieces that can be efficiently downloaded and ranked on the client side.
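
Something like the following client-side intersection, where each word’s posting list would live in its own object on the network (the dict here is just a stand-in):

```python
# Sketch of the inverted-index alternative: one posting list per word,
# intersected on the client. Ranking would then happen on the result set.
from typing import Dict, List, Set

def client_side_search(words: List[str], word_index: Dict[str, Set[str]]) -> Set[str]:
    """Return the pages containing every word, intersecting smallest list first."""
    postings = sorted((word_index.get(w, set()) for w in words), key=len)
    result = postings[0].copy() if postings else set()
    for p in postings[1:]:
        result &= p             # keep only pages present in every posting list
    return result
```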

1 Like

The last option I saw for this made a lot of sense: the network notes locations for single words, and then a user has the option to combine the results according to whatever mix of words they want.

Providing for the variety of multiple-word searches that might occur - just in case they do - will, I suspect, run into a multiple-orders-of-magnitude problem, unless you limit it to common search phrases… and then there’s also the question of how frequently, and at what cost, the index would need updating… (I’ve not got my head around MutableData enough to know whether updates are free; I expect free updates on anything won’t be possible without spam risk.)

Still, it’s encouraging to see ideas on this as it’s important there are options to cater for the many ways that people could use the network.

2 Likes

@Anders you should name it X0R0 (said zorro). A play on traversing XOR space and the amazing fictional character we all love, Zorro!!! :smile:

3 Likes

My idea is that a first version of Sapp will be easy to develop. And in the beginning, just a few queries will be needed for many pages. Even one query can sometimes be enough. Take for example a SAFE website about Bitcoin: the query “bitcoin” will cover most searches for its index page.

Then later, as the SAFE network grows and the exponential progress of information technology continues, more and more queries can be added.

I too think you are better off making an index of words rather than storing the exponentially growing set of search phrases.

Let the APP take the search phrase, go to the MDs for each of the words, and come up with a results page.

If you store search phrases, then every new search phrase requires you to crawl through each safesite to check whether it applies. But if you index each word, new search phrases are a breeze in comparison. By indexing words you only have to rescan each safesite occasionally, not every time a new search phrase is entered.

And of course, as the number of safesites grows, new methods will have to be devised, as either method will become too much. But the index of words will survive much longer.

The OP only describes the static case. For updates, an additional MutableData type called SRMD (Search Result Meta Data) will be useful, so that the SREs only contain the id, timestamp and url, with the SRMD containing:

sreId = id for the corresponding SRE
lastUpdated = timestamp for when last refreshed
active = flag for page being active or deleted
rank = numerical ranking value for the page
url = redundant copy of the url, for improved performance
title = title of the page
text = text snippet for the search entry

The SREs are immutable, while the SRMDs are mutable and need to be refreshed now and then to reflect changes in the page and in ranking.
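
As a sketch, with a plain dataclass standing in for the MutableData object:

```python
# The SRMD structure as listed above, as a dataclass sketch.
# A refresh would re-put this MutableData with a new last_updated value.
from dataclasses import dataclass

@dataclass
class SRMD:                    # MutableData, refreshed now and then
    sre_id: str                # id of the corresponding SRE
    last_updated: int          # timestamp of the last refresh
    active: bool               # page active, or deleted
    rank: float                # numerical ranking value for the page
    url: str                   # redundant copy of the URL, for performance
    title: str                 # title of the page
    text: str                  # text snippet for the search entry
```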

It’s the safesite owners and other users who submit URLs to Sapp, together with the queries. So the crawling of pages is outsourced, and it’s the users who choose which queries they want for each page. This makes spam and abuse possible, so it’s the ranking algorithm that needs to demote low-quality and/or inappropriate search results and promote the high-quality ones.

An example of abuse is the query “bitcoin price” submitted together with a URL to a porn page. An example of a high-quality entry is the same query “bitcoin price” submitted together with a URL to a page showing the current bitcoin price. The task of the ranking algorithm is to demote the former entry (the porn page) and promote the latter (the relevant bitcoin price page).

1 Like

In the first version of Sapp, my current idea is to use user voting as the ranking algorithm. One useful thing about the SAFE network is that the app has access to registered users, which makes it easy to prevent double voting.

Both active and passive voting can be used. Passive voting is when a user clicks on a search result. The ranking count is then increased by a small amount, inversely proportional to the click-through statistics for that result position.

Active voting is when a user clicks “up vote” or “down vote” on a search result. An active vote results in a larger increase or decrease of the ranking count than a passive vote.
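
A sketch of how the two kinds of votes might be combined; the position weights and vote sizes below are invented numbers, since only the general principle is described above:

```python
# Illustration of passive vs. active vote weighting. A click on a rarely
# clicked position says more than a click on position 1, so its weight is
# the inverse of the (assumed) expected click-through share.
ACTIVE_VOTE = 10.0

# Rough expected click-through share per result position (assumed values).
EXPECTED_CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}

def passive_vote_weight(position: int) -> float:
    return 1.0 / EXPECTED_CTR.get(position, 0.02)

def apply_click(rank: float, position: int) -> float:
    return rank + passive_vote_weight(position)

def apply_active_vote(rank: float, up: bool) -> float:
    return rank + ACTIVE_VOTE if up else rank - ACTIVE_VOTE
```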

I used to be a fan of nonintrusive ads on the internet. Lately, however, ads on the web have become increasingly annoying to me. Yet I came to think about how ads on Google Search are often still relevant and actually bring increased value to end users.

So sponsored links on Sapp are something that can improve the value for end users when relevant. And I think it’s easy to develop a simple auction function where advertisers can bid on search queries. Google AdWords sets the price to the second-highest bid, or something like that. A similar strategy can be used for Sapp, although with a much simpler implementation: advertisers simply bid with safecoins for, say, three sponsored links via a completely automatic bidding function. The lack of manual oversight makes abuse possible, but it might still be good enough.
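
For illustration, a generalized second-price auction of the kind hinted at could look like the sketch below. Nothing here is specified by Sapp; the slot count and pricing rule are just assumptions:

```python
# Minimal generalized second-price auction: the highest bidders win the
# sponsored slots and each pays the bid of the advertiser ranked below.
def run_auction(bids: dict, slots: int = 3) -> list:
    """bids: advertiser -> bid in safecoins. Returns (winner, price) pairs."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winners = []
    for i, (advertiser, _) in enumerate(ranked[:slots]):
        # Pay the next-highest bid, or your own bid if nobody is below you.
        price = ranked[i + 1][1] if i + 1 < len(ranked) else ranked[i][1]
        winners.append((advertiser, price))
    return winners
```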

I doubt that a simple search app like Sapp will become popular enough to make sponsored links a viable business opportunity, but just to cover more bases I want to include that as a possible future option.

Ads undermine relevance because they are based on the goals and purchasing power of the advertiser, not on relevance. Advertising centralises influence & power.

2 Likes

I often find ads useful on Google Search when I look for commercial products and services. Sponsored links will mostly come up for queries related to commercial activities. And it’s important to clearly indicate on the search result page the difference between sponsored links and search results.

They will be nonintrusive and largely relevant sponsored links - no video ads or anything like that which users are forced to watch. I think it will add value to end users overall. But sponsored links are just a possible future option, definitely not something that will be implemented in the first simple version of Sapp.

These are false expectations IMO.

I’m not saying all ads are bad (some are going to be for stuff you would like to know about) but the overall effect is IMO bad, and we can do better. The reason we have paid advertising is commercial, for the benefit of those trading (ads & goods). They aren’t putting your needs, wants or wellbeing first - why would they? - but their own.

You can see exactly how this operates if you look at Google.com ads over time - they began small, highlighted etc. Over time they have become less easy to distinguish and now dominate page one. Most people think they are getting search results, but when they search for certain things (ones where advertisers are willing to pay), most of what they see are paid links. Same with website ads getting more and more attention-seeking (annoying) over time. Ads are now used for all sorts of other things - they track you across websites, install malware etc.

Nobody builds an advertising service or creates an ad thinking, how can I best help the people reading ads - they are always looking to maximise their influence over your behaviour, to best meet their own goals. Fundamentally that’s about disempowering you, not helping you to be you.

Advertising is not about helping you learn things of benefit to you. We can build systems that are about that, and they can be in tune with SAFEnetwork rather than working against it. If you want to help people: build something that is not commercially motivated to disempower them, but to empower them. It’s hard, but I think we have an opportunity with SAFEnetwork to change the things like this, and search is the place to do it.

3 Likes

@happybeing I agree. Without adblock, Google searches are almost worthless now because of all the adverts and the useless results due to bloat and keyword-fudging by sellers. 9 times out of 10, Google search does not return results that I can use; only after a number of tries, changing the search terms, can I find useful things. One trick I found, if I search without adblock, is to add swear words at the front - that often kills off the adverts and the search engine actually returns something useful.

But adblock works a treat. Or use startpage.com which kills the tracking.

2 Likes

Indeed, I use both. I only go to Google as a last resort.

The RSR MutableData (Ranked Search Results) will contain three lists: the first with the top-ranked pages, a second with upcoming (rising in rank) pages, and a third with randomly selected new pages. The purpose of the two additional lists is to allow new pages to move to the top.

And when the results are presented to the end user, the first five results come from the first list, results 6 to 10 come from lists 1 and 2, and results 11 to 100 come from all three lists.
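
A rough sketch of how the blending could work. The exact interleaving for positions 6-10 and 11-100 is my own guess; the post only fixes which lists feed which positions:

```python
# Blend the three RSR lists into one result order of up to 100 entries.
def blend(top: list, rising: list, fresh: list) -> list:
    results = list(top[:5])                    # positions 1-5: top list only
    # Positions 6-10: mix of lists 1 and 2 (here simply alternating).
    mix = [x for pair in zip(top[5:], rising) for x in pair]
    results += [x for x in mix if x not in results][:5]
    # Positions 11-100: whatever remains from all three lists.
    rest = []
    for x in top + rising + fresh:
        if x not in results and x not in rest:
            rest.append(x)
    return results + rest[:100 - len(results)]
```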

2 Likes

One way of significantly reducing the combinatorial explosion in the number of different queries is to limit the number of search words to three.


[Chart: Average number of search terms for online search queries in the United States as of February 2017]

Queries of at most three words cover 78% of searches in those statistics. And to reduce the number of possible combinations even further, the words can be sorted alphanumerically, so that for example the query “new apps safe” and the query “new safe apps” become the same query. Common words such as “the”, “be”, “to”, “of”, “and”, “a” and “in” can also be removed from the queries, and the same goes for languages other than English.
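
A sketch of this normalisation, assuming a lower-cased query, an example stop-word list, and truncation to three words before sorting:

```python
# Normalise a raw query: lower-case, drop common stop words, keep at
# most three words, and sort them so that word order no longer matters.
STOP_WORDS = {"the", "be", "to", "of", "and", "a", "in"}

def normalize_query(raw: str) -> str:
    words = [w for w in raw.lower().split() if w not in STOP_WORDS]
    return " ".join(sorted(words[:3]))

# normalize_query("new apps safe") == normalize_query("new safe apps")
#   -> "apps new safe"
```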

2 Likes

Another restriction that will improve the quality of the search results is to limit the number of pages each safesite can index for each query. Otherwise one safesite could spam a query with thousands of pages whose only difference is a slight change in the URL parameters, all pointing to the same page.

Therefore only 1 page per domain and unique query will be allowed in the Sapp index. Notice that a safesite can still have an unlimited number of queries indexed per page; the restriction is on the combination of domain and unique query. So, for example, if the domain nofakenews.safe has indexed a page with the query “trump health care”, then it can’t have another page indexed for that particular query.

If a site owner wants to point the query to another page, then the first page has to be removed from the Sapp index by marking the query with a minus sign (-) in this meta tag before reindexing:

<meta name="query" content="-trump health care">

EDIT: It may even be useful to use this meta tag to specify the queries themselves. This prevents others from indexing the page for unwanted queries. As an example, a page about Trump health care can have the following meta tag:

<meta name="query" content="trump health care, trumpcare, health care, health care news">
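
For illustration, here is how an indexer might read this meta tag and enforce the one-page-per-domain-and-query rule, using Python's standard html.parser with a plain dict standing in for the real Sapp index:

```python
# Parse the proposed <meta name="query"> tag and apply the rule that only
# one page per (domain, query) combination may exist in the index.
from html.parser import HTMLParser

class QueryMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.queries = []        # queries to add
        self.removals = []       # queries prefixed with '-' (to deindex)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name") == "query":
            for q in a.get("content", "").split(","):
                q = q.strip()
                (self.removals if q.startswith("-") else self.queries).append(q.lstrip("-"))

def index_page(index: dict, domain: str, url: str, html: str) -> None:
    parser = QueryMetaParser()
    parser.feed(html)
    for q in parser.removals:
        index.pop((domain, q), None)       # deindex the old page first
    for q in parser.queries:
        # Only 1 page per (domain, query); the first page wins until removed.
        index.setdefault((domain, q), url)
```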