Paying vaults to search public data

I’ve been pondering how to do searches of data on the SAFE network, both within an application (e.g. a SAFE website) and more generally.

It occurs to me that vaults can be paid to help out and that this would benefit everyone:

  • network saves resources
  • vaults earn Safecoin for searching public data
  • apps can search large amounts of data very fast

Here’s the scenario, which I’m sure is not new to the team - it seems inconceivable that I’ve thought of this before them - if it’s any good, that is :smile:

For Public Data Only

An App wants to search blog posts for a keyword. It sends a special GET request for every blog document to be searched. The request includes both a Safecoin payment (minted transaction?) and the keyword(s) for the search.
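
In code terms I imagine the special GET looking something like this (a minimal sketch; all field and type names here are invented, the real routing types are up to the team):

```rust
// Illustrative shape of the special search-GET, not actual SAFE routing
// types. The point is that the payment and the keywords travel together
// with the request for each chunk.
struct SearchGet {
    chunk_name: Vec<u8>,   // network address of the chunk to be scanned
    keywords: Vec<String>, // plain-text keyword(s) to match against
    payment: SafecoinTx,   // the Safecoin payment (minted transaction?)
}

struct SafecoinTx {
    amount: u64,
    proof: Vec<u8>, // whatever proof of payment the network settles on
}
```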

Each vault holding a chunk will receive the special GET, but instead of just supplying the data, it first scans it to see if it matches (or could match) the requested search. If it does not, it responds NO-MATCH (without sending the chunk). If it does, it responds MATCH (and sends the chunk).
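
The vault’s side of it could be as simple as this sketch, assuming plain substring matching on the plaintext of a public chunk (a real version would also need a “could match” path for keywords that straddle chunk boundaries):

```rust
// Minimal sketch of the vault-side check. Matching here is naive
// case-insensitive substring search; anything smarter (tokenising,
// stemming, boundary handling) slots in at the same place.
enum SearchResponse<'a> {
    Match(&'a [u8]), // chunk data is returned only on a (possible) match
    NoMatch,         // otherwise just a cheap one-word reply
}

fn handle_search_get<'a>(chunk: &'a [u8], keywords: &[String]) -> SearchResponse<'a> {
    let text = String::from_utf8_lossy(chunk).to_lowercase();
    if keywords.iter().any(|k| text.contains(&k.to_lowercase())) {
        SearchResponse::Match(chunk)
    } else {
        SearchResponse::NoMatch
    }
}
```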

Once the close group looking after the chunk has decided by voting whether the GET is valid, and whether the chunk matches or not, it chooses one of the correctly responding vaults to have a search-farming attempt for the supplied Safecoin. If it succeeds, the Safecoin is delivered to its wallet. I’m not sure how this can be achieved, but I’m assuming it can :smile:
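
Purely as a sketch of the kind of thing I mean (all names made up), the group could derive the pick from data it already agrees on, so every member selects the same vault:

```rust
// Rough sketch: once the close group has voted on the correct answer,
// choose one correctly responding vault for the search-farming attempt.
// Deriving the index from the chunk name keeps the choice consistent
// across the group; a real scheme would use something less gameable.
type VaultId = [u8; 32];

fn pick_reward_candidate(correct_responders: &[VaultId], chunk_name: &[u8]) -> Option<VaultId> {
    if correct_responders.is_empty() {
        return None;
    }
    let seed = chunk_name
        .iter()
        .fold(0usize, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as usize));
    Some(correct_responders[seed % correct_responders.len()])
}
```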

The App only receives data back if it matches or might match. So if it was searching 1,000 documents, and only one matches, all it gets back is the matching document. This will be very much faster for the App, and so worth paying for.

How this is priced is something to be discussed. Maybe the network maintains a price and each App is configured to use this to determine whether or not to pay for a special search-GET.
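
For example, an app might decide with a toy rule like this (every number and name invented; the real price would come from the network):

```rust
// Toy cost comparison: pay the per-document search fee, or fetch every
// document and scan locally?
fn use_search_get(
    docs: u64,
    search_fee_per_doc: u64,    // network's price for a special search-GET
    transfer_cost_per_doc: u64, // effective cost of fetching one document
    expected_hit_rate: f64,     // fraction of documents likely to match
) -> bool {
    let plain_cost = docs * transfer_cost_per_doc;
    let search_cost = docs * search_fee_per_doc
        + ((docs as f64 * expected_hit_rate) as u64) * transfer_cost_per_doc;
    search_cost < plain_cost
}
```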

Any thoughts?

Aren’t the chunks encrypted and sharded? Seems the vault would have no idea whether it matches or not. You would need the other chunks to know if there is a match.

I was thinking the same thing. When you have 100,000 nodes on the network, chances are quite low that when a file is chunked into 100 pieces, multiple pieces of the same file stay at the same Vault. And the Vaults have no clue whether they’re holding part of a text or HTML file, as far as I know. Google doesn’t search “live”; they’ve already indexed all the sites on a lot of keywords. I think the same thing could be done for safe:websites. A spider would go from one link to the other, scanning all the websites and all the links to videos, getting all the meta-data like keywords etc. Vaults get paid in that example because they have to provide the Chunks when the spider is crawling :stuck_out_tongue:.

These may be valid but I’m thinking:

  • public data (so any vault can access the plaintext)
  • plain text matching (maybe this isn’t a good solution, but I think it would work, so improvements or alternatives are welcome of course)

@polpolrene Building a “Google” is an option, and I hope someone will do that, but Google isn’t the only kind of search needed. So just take a WordPress website as the use case. Search is done without a pre-built index, by accessing the document raw data (on the server) and returning the results to the browser. It’s wrapped in an SQL query, but it’s still essentially scanning the content of every page rather than referring to a search index. So that’s more my use case, and I think this could solve the problem of how to do search of a single application’s data (website or database) so long as the data itself is accessible by the vaults. Which for a website is almost always the case.

I would think that public data oughta be encrypted too.

Otherwise it may be censored by powers that be.

Good point. In that case I’m not sure how this could work, but there may be a way to do it even so, just not in the manner I described. Perhaps the app could provide the necessary info in the GET for a vault to access it.

I think for search you should use the StructuredData type. They are not encrypted and have a 100 KB data field (this can be extended by putting in the address to a datamap instead).

The name (address) of a StructuredData element is the hash of its tag_type + identifier. So we take a fixed tag_type for index elements (for example 11), and the identifier becomes equal to the hash of a search keyword. The data field is filled with addresses of public datamaps whose data contain the search keyword.

So if you want to search for a keyword, you take the known tag_type (11), append the hash of your keyword, and hash the result. Then you GET the resulting address to retrieve the list of public files containing your search keyword. If you have two (or more) keywords, you do this once for every keyword and cross-reference the resulting lists. Note that keywords need to be exact or else the hash will be completely different. Any auto-correction of keywords should be done prior to hashing.
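
To make the addressing concrete, here’s a minimal sketch. `DefaultHasher` is only a stand-in for the network’s real hash function, and tag_type 11 is the example value from above:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const INDEX_TAG_TYPE: u64 = 11; // example tag_type for index elements

fn hash_bytes(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

// identifier = hash(keyword); address = hash(tag_type + identifier)
fn index_address(keyword: &str) -> u64 {
    let identifier = hash_bytes(keyword.to_lowercase().as_bytes());
    let mut combined = INDEX_TAG_TYPE.to_be_bytes().to_vec();
    combined.extend_from_slice(&identifier.to_be_bytes());
    hash_bytes(&combined)
}

fn main() {
    // One address per keyword; a multi-keyword search GETs each index
    // element and intersects the datamap lists client-side.
    println!("{:x}", index_address("safecoin"));
    println!("{:x}", index_address("vault"));
}
```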

This is basic search, but having a list of raw datamaps isn’t really useful. You’d have to download and look at every file to see if it’s what you’re looking for. So we could introduce another subtype of StructuredData: the public file summary. Let’s say it has tag_type 12. The data field contains a summary of the contents of the public file in question. Using the datamap address as the identifier directly could lead to squatting, so the uploader (and hopefully permanent owner) encrypts the datamap address with his private key and uses the result as identifier instead. Since the owner’s public key is part of the StructuredData, anyone can decrypt the identifier to retrieve the actual corresponding datamap address.

The index element data field would now contain summary addresses instead of actual datamap addresses. These summaries can be presented in a classic search result list. Clicking on any of the results would decrypt the identifier of the summary, which produces the datamap address. This datamap is then retrieved, then its actual data is retrieved and presented to the client.
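
Putting both halves of that in code, with the private-key encryption as hypothetical stand-in functions (it amounts to an RSA-style signature with message recovery; these are not a real crate API):

```rust
// Hypothetical primitives: a cipher where anyone holding the public key
// can reverse what the private key produced.
fn private_encrypt(_private_key: &[u8], _data: &[u8]) -> Vec<u8> {
    unimplemented!("stand-in for a sign-with-recovery operation")
}
fn public_decrypt(_public_key: &[u8], _data: &[u8]) -> Vec<u8> {
    unimplemented!("stand-in for the matching recovery operation")
}

// Uploader side: a squat-proof identifier for the summary element
// (tag_type 12 in the example above).
fn summary_identifier(owner_private_key: &[u8], datamap_address: &[u8]) -> Vec<u8> {
    private_encrypt(owner_private_key, datamap_address)
}

// Client side: the owner's public key is part of the StructuredData, so a
// clicked search result resolves back to the real datamap address.
fn resolve_datamap_address(owner_public_key: &[u8], identifier: &[u8]) -> Vec<u8> {
    public_decrypt(owner_public_key, identifier)
}
```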

Someone has to create and maintain the index elements as well as the public data summaries. Index elements should probably be created, owned and maintained by a party that considers maintaining the search system its responsibility. Once we have distributed computation this could perhaps be a decentralized autonomous organization on top of SAFE; until then it’ll probably be a private party. The process of indexing could be done by SAFE crawlers or simply on request. Summaries could be created by the original uploaders of public data, by anyone else willing, or also by the crawler.

Next step would be figuring out how to sort (prioritize) search results, ideally not client side but in the data field already. Some sort of popularity measurement would be ideal. Not sure how to pull that off without putting it in core code.

If the flexibility of the StructuredData RFC wasn’t obvious yet, I think this is a pretty good example of a practical application.

Great suggestion. Thanks.

An alternative to the summary of the file could be a word index of the entire file, including the positions of the occurrences, i.e. chunk number and byte position. Then the search result list can display the words surrounding your search keyword, providing better context. In the case of multiple keywords, the closeness (byte distance) of keywords affects its relevance and position in the search results list.
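
A rough sketch of such an index and the closeness measure (byte positions only; a real index would store chunk number and offset per occurrence, as described above):

```rust
use std::collections::HashMap;

// Word index: every (lowercased) word maps to its byte positions in the
// file, so a result list can show the words around each hit.
fn build_word_index(text: &str) -> HashMap<String, Vec<usize>> {
    let mut index: HashMap<String, Vec<usize>> = HashMap::new();
    let mut pos = 0;
    for word in text.split_whitespace() {
        // locate this word's byte offset from the current scan position
        let offset = text[pos..].find(word).map(|i| pos + i).unwrap_or(pos);
        index.entry(word.to_lowercase()).or_default().push(offset);
        pos = offset + word.len();
    }
    index
}

// Smallest byte distance between any occurrences of two keywords; a lower
// value would rank the result higher for a multi-keyword search.
fn closeness(index: &HashMap<String, Vec<usize>>, a: &str, b: &str) -> Option<usize> {
    let (pa, pb) = (index.get(a)?, index.get(b)?);
    pa.iter()
        .flat_map(|x| pb.iter().map(move |y| x.abs_diff(*y)))
        .min()
}
```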

The main challenge of search on SAFE is the effort involved in a proper organization of all the relevant data. Parsing documents and maintaining the distributed index is going to be a major cost. I suppose the best option is to charge SafeCoin for adding or updating a public file in the search system. The author or anyone else who finds the public file important enough can submit the file to be indexed and pay the SafeCoins.

Another model would be charging per search, but then it’d have to be either part of the core (way too complex for core in my opinion) or the system would rely on dedicated servers being live which would receive search requests through SAFE messaging. I don’t think it’d work well: people are not used to paying for search, plus availability of search would rely on a centralized service. In my proposal only the maintenance relies on a centralized service. Search queries themselves are handled only by SAFE and the client. Maintenance outage for a couple of days would only affect recent news searches.
