SAFE Search App

For filtering content, on the NRS thread Jim suggested a refinement of Josh’s idea above of using bookmarks: sites bookmarked by people you are linked to would “bubble up to the top.” Hopefully that wouldn’t throw up too many nasty surprises!

Essentially I think RDF can be seen as a mark-up language that lets text be easily sorted through by, e.g., search software. For example, in the HTML for a site, the word “David” might be labelled as a person’s name (among other things).
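For instance (just to illustrate the idea, nothing SAFE-specific), a page might embed a schema.org snippet like this as JSON-LD in its HTML, which a crawler could pick up to know that “David” refers to a person:

```json
{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "David"
}
```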

For its potential use in the search scenario I found the most obvious explanation on schema.org, but from what I understand schema itself is frowned upon as being much less sophisticated than, for example, SoLiD.

Apologies if I’m stating what you already know, or if I’m getting it wrong. @happybeing might be able to correct me if I’m misrepresenting it!

3 Likes

Ah, I see now. I guess that’s thinking along similar lines to @isntism

1 Like

Thank you, @david-beinn, for your work.

I just have one small comment. There are languages where the “base form” of a word is just one of thousands of possible forms. English has very little inflection and derivation, whereas Finnish has a lot. At some point this will possibly have to be dealt with using some sort of base form generating automaton in order to crawl/search for [cat/cats/cat’s] in other languages. I don’t want to complicate things unnecessarily, but I think it’s good to keep stuff like this in mind.

And a question for those in the know: Is it at all feasible to imagine searching the whole network using regular expressions, or is it obvious it would require too much computing power?

1 Like

I honestly can’t imagine pursuing this app to the point where I’m needing to consider things like that, but it’s a really interesting point.

How do you think, for example, Google deals with the sort of problem you’re talking about? Would those thousands of possible forms be linked somewhere in a database, and a search for ‘cat’ would send the engine to go and find all the possible forms before beginning the search?

In reply to your second question, I’m not ‘in the know’, but I can’t imagine that searching the whole network for ‘cat’ every time is feasible, if that’s what you mean. This is where I think scraping a site for the important information and formatting it to be put in the index in an efficient way would come in.

The only way I can imagine doing that is on some level flipping the logic so that when you search for ‘cat’ you go to a box that already has the details of all the sites that contain the word ‘cat,’ and then cross reference for your other search terms.

On the other hand, maybe Google just takes you to the sites that other people who have searched for ‘cat’ have clicked on in the past, and ignores the rest.

I’m speculating here though; like you, I’d be really interested to hear an account from someone in the know as to how it actually works.

1 Like

As we all know, Google is very secretive. I would like to say “your guess is as good as mine”, but I’m afraid yours is probably better. I’m really just speculating too, and thinking/fantasizing out loud. I imagine Google may crawl sites for the different forms of “cat” in advance and then automatically reduce them to just [cat]. Then, when somebody searches for “cats”, the search is automatically changed to a search for [cat] instead. Really just guessing, though.

1 Like

I did hear at one point (relatively recently) that they were hiring a lot of language experts!

1 Like

:slight_smile: I’m just some chicken guy who sucks at marketing himself or anything else.

1 Like

Oh sorry - I was thinking out loud, off topic, about a

network feature

Tbh I’m not sure anymore why I thought this topic would be a good place to ask this question :thinking:

Yep - that was what I assumed a while ago too - before I realized that if XOR links don’t include the keys to read them, the content must be clear text
… and after asking this question I just assumed that regular names might just be the same…

If you don’t assign encryption to an appendable data and the content is small enough that it is not broken into pieces, it might be stored as clear text in vaults… That’s what I would love to have verified (as not being the case)

2 Likes

You develop the format. JSON is good because it’s flexible and can be extended later on. Specify the required elements.
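For example (the field names here are made up, purely to show what “specify the required elements” could mean), an entry in the index might look something like:

```json
{
  "version": 1,
  "url": "safe://example-site/page1",
  "title": "Example page about cats",
  "terms": ["cat", "pet", "care"],
  "summary": "A short snippet to show in the results list",
  "indexedAt": "2019-06-01T12:00:00Z"
}
```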

2 Likes

That’s a fair point that a format on a technical level would need deciding.

I was thinking more in terms of some kind of structure, e.g. what information goes into the database, and where is it stored so that it is quickly and flexibly accessible?

1 Like

Off-topic reply: I think nothing is stored in the clear; data is fragmented into 1000 fragments, where each fragment goes to one vault, so 1 MB of data is spread as 1000 fragments across 1000 vaults (and I think it’s duplicated twice or thrice)

1 Like

I’m very sure this is not the case

I admit this is not a really current post, and I haven’t seen this mentioned in the self encryption video when skimming through it, but AFAIK self encryption hasn’t been changed for years now

And when the data map + file with no encryption are bundled, they must be there in clear text

(not sure if the self encryption even comes into play when no encryption is being selected)

The fragments description is not correct, as you said.

Although David did say that all data will be encrypted so the vault owner cannot read it. Potentially double encryption, so a vault cannot use tables of hashes to know if a chunk is a chunk of a known file. Not sure how that would be done though

2 Likes

Hmhmm, you’re right, I just remembered that I read it somewhere:

Obfuscating stored data
All data stored within a Vault on an individual’s computer must be entirely encrypted and unreadable

https://safenetwork.tech/roadmap/#obfuscating-stored-data

So yes it’s on the Roadmap but not implemented yet :face_with_monocle:

So I guess it’s not forgotten and will be taken care of before launch :blush::hugs:

5 Likes

I guess they use a stemming algorithm

For the actual search, the basics should be an inverted index. For the word cat, all URLs with the word cat would be listed, and for the word dog all URLs with the word dog would be listed. Then, if you search for “cat dog”, it would fetch the list of URLs containing the word cat and the list of URLs containing the word dog, then combine these to get a list of URLs that contain both the word cat and the word dog. This list would then be ranked with PageRank and lots of other stuff to decide which order the results should be in. Web search engines do lots of optimizations, so the index for cat may not contain all URLs with the word cat, but only ones included by certain criteria, for example.
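A little sketch of the basic idea (made-up URLs, and only very naive normalisation instead of a real stemmer, just to show the shape of it):

```typescript
// Minimal inverted-index sketch: map each (normalised) word to the set of
// URLs that contain it, then intersect the sets for a multi-word query.
type PostingList = Set<string>;

const index = new Map<string, PostingList>();

// Very naive normalisation: lower-case and strip a trailing "'s" or "s".
// A real engine would use a proper stemmer/lemmatiser (see the point about
// Finnish earlier in the thread).
function normalise(word: string): string {
  return word.toLowerCase().replace(/'s$/, "").replace(/s$/, "");
}

function addPage(url: string, text: string): void {
  for (const word of text.split(/\W+/).filter(Boolean)) {
    const term = normalise(word);
    if (!index.has(term)) index.set(term, new Set());
    index.get(term)!.add(url);
  }
}

function search(query: string): string[] {
  const terms = query.split(/\W+/).filter(Boolean).map(normalise);
  if (terms.length === 0) return [];
  // Fetch the posting list for each term and intersect them.
  let results = index.get(terms[0]) ?? new Set<string>();
  for (const term of terms.slice(1)) {
    const next = index.get(term) ?? new Set<string>();
    results = new Set([...results].filter((url) => next.has(url)));
  }
  return [...results]; // ranking (PageRank etc.) would happen here
}

// Example:
addPage("safe://cats-site", "Cats are great pets");
addPage("safe://pets-site", "Dogs and cats as pets");
search("cat dog"); // -> ["safe://pets-site"]
```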

An interesting alternative to an inverted index would be to use neural networks to associate words and concepts with URLs, so you could then search in a way more like how human memory works. If you know some websites about dogs, you may have associated the word dog with certain URLs in your memory, and if a friend asks whether you know some good sites about dogs, you’ll remember the URLs. In biological neural networks this is thought to work a little bit like Hopfield nets. Even Google hasn’t yet figured out a way to make something like this work very well, though.

5 Likes

That’s great, thanks very much for sharing that knowledge, really interesting.

I’d sort of come to the conclusion that’s how it must work, and was quite pleased with myself when you confirmed it, then realised actually, that’s just how an index in a book works, and always has done!

Got me thinking the Perpetual Web adds a whole other layer of complexity to the search challenge, although perhaps simplifies some things too.

I guess this is where RDF could help too, if the containers for ‘cat’ and ‘dog’ and ‘apple’ and ‘Apple’ are defined by the RDF spec it would simplify quite a few problems, though whether it is the best way of doing that I’m still not sure.

Doubt I’ll ever get to the point of implementing anything on this sort of level, but really fascinating to think about, and those are some good places to start Googling from, thanks again.

3 Likes

There are some open source JavaScript search engines that perhaps could be used. If it’s going to be available from a website it needs to be implemented in JavaScript, unless it is implemented in the browser itself using Lucene or something and then exposed as a JavaScript API. Having the browser expose a JavaScript API for web developers to add search to their site could be an interesting option long term. Then indexing and querying could be done by the browser with native code. I don’t think it’s the best place to start though.

There’s a list of JavaScript search engines with benchmarks here:

https://raw.githack.com/nextapps-de/flexsearch/master/test/benchmark.html

The ones with the fastest query performance have slow performance for updating indexes.

I haven’t investigated whether or not any of these would work well on SAFE.

The constraints on the SAFE Network are a bit different from the typical scenario. If the index is large, it has to be split into many parts and the relevant parts downloaded to the client before the query is executed on the client side. Normally, joining parts of a large distributed index is done on servers with high-bandwidth, low-latency connections between servers in a single data center.

With an index for up to maybe a few thousand pages there shouldn’t be much of an issue, but once you get over that the index has to be split up in some way.

A good way to experiment with creating a large index would be to create an index for Wikipedia.

Each time you fetch a part of the index, you have to make the network look up an NRS or XOR address and fetch the data there. This introduces latency, so you don’t want to have to do this too many times. The part of the index that is to be queried also has to be downloaded to the client, so you don’t want that to be too big either.
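As a rough sketch of how that could look, one crude way is to shard the index by the first letter of the term and fetch only the shard a query needs. The naming scheme and the fetchFromSafe helper here are hypothetical, purely to show the shape of a prefix-sharded lookup:

```typescript
// Hypothetical sketch: the index is split into shards keyed by the first
// letter of the (normalised) search term, each shard stored at its own
// NRS name. `fetchFromSafe` stands in for whatever client API ends up
// being used to fetch public data.
declare function fetchFromSafe(nrsUrl: string): Promise<string>;

type Shard = Record<string, string[]>; // term -> URLs containing that term

function shardUrlFor(term: string): string {
  const prefix = term[0].toLowerCase(); // e.g. "c" for "cat"
  return `safe://search-index/shard-${prefix}`; // hypothetical naming scheme
}

async function lookup(term: string): Promise<string[]> {
  // One network round trip per shard; terms sharing a prefix share a fetch.
  const shard: Shard = JSON.parse(await fetchFromSafe(shardUrlFor(term)));
  return shard[term] ?? [];
}
```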

There have also been proposals for search using semantic hashing; search the forum for semantic hashing and you’ll find some threads on that.

2 Likes

The Wikimedia search “Knowledge Engine” might be useful here.

In theory it is (or will be?) open source according to this FAQ; not sure, I’ve been unable to find any repository - might be too early.

1 Like

Hi,

I’ve been having a think and trying to come up with a plan whereby I can get something out in time for people to have fun using Fleming, but which then has a step-by-step route to something a bit more serious.

Below is what I came up with; hopefully it should all be reasonably self-explanatory. The full model is pretty speculative and relies on skills well beyond what I’m ever likely to have (as well as person hours etc. etc.), but hopefully the intermediate one should be fairly doable, especially if other folk were keen to chip in. As I’ve said before, I don’t have much confidence in my knowledge, but I’m just trying to fill a gap that will help people use the network in the early days. There may well be things that make the approach below completely impossible, so don’t be shy to say, because I don’t want to waste my time if I can avoid it!

On that note, the ethos behind what I’m doing is to create something that is open source and transparent, though the nature of an inverted index (which is required for speed) means I don’t think it can be fully decentralised. Happy to hear from anyone who shares these aims and wants to help out.

On a community note it would need at least some support from @maidsafe because it would require a reserved type-tag for Appendable Data.

Plenty more to say if folk are interested, but probably best keep it short for now. All comments welcome!

Edit: You’ll have to click on the diagrams, and even then they’re still pretty small, but hopefully they should just be readable. Sorry, I tried to do SVGs but the forum didn’t like them too well.

Current (Alpha 2) Implementation

Intermediate Implementation

Project Plan

Mock Front Page

Speculative Full Implementation

16 Likes

Got a bit more time today, so I’ve done larger versions of the two more interesting diagrams, though sadly I still can’t get vectors playing ball.

I’ll take it from the likes but lack of comments that everyone thinks this is a great idea and I should press on!!! Or maybe just no one could read the diagrams!!!

Most of this has been put together with only a rough idea of what is possible technically, and there are also some more fluffy decisions going on about who pays, and how site owners make their pages visible that are a little different from the current internet, so if anyone wants to engage on that level, happy to have those discussions too.

One specific question to anyone @maidsafe: having thought about how best to support the PWeb in search, I felt like the option of a “Time Machine” search was maybe the best. To do this I was thinking to update everything on a regulated schedule, on the assumption that it wouldn’t be possible to search for a version of a piece of data by the date itself, e.g. “give me the version that was current on 4.12.2009.” My approach would be “give me the version that is the current version minus 5,256,000 versions” (if the index is updated every minute). To implement this would be quite a lot of hassle though, so it would be nice if it wasn’t necessary! Or any other ideas on how to deal with the PWeb?
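Just to show the arithmetic I mean (assuming a strict one-minute update schedule, which is the whole point of regulating it):

```typescript
// "Time Machine" sketch: if the index is appended to on a fixed schedule
// (here once per minute), the version that was current at some past date
// can be derived from how many update intervals have elapsed since then.
const UPDATE_INTERVAL_MS = 60 * 1000; // one minute between index versions

function versionAt(pastDate: Date, currentVersion: number, now: Date = new Date()): number {
  const versionsBack = Math.floor((now.getTime() - pastDate.getTime()) / UPDATE_INTERVAL_MS);
  return Math.max(0, currentVersion - versionsBack);
}

// e.g. roughly ten years of one-minute updates is about 5,256,000 versions back:
// versionAt(new Date("2009-12-04"), currentVersion)
```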

(Very Speculative)

10 Likes