Thanks very much for your contribution.
Thanks Mark,
That looks like a pretty serious set of tools!
For anyone who’s interested, a quick update on what I’ve been up to regarding this project…
I’ve been working fairly constantly on front end stuff, partly in the absence of a finished SAFE API, and also in the absence of a clear direction to go forward with the more fundamental elements of search. I’ve gone pretty overboard with learning some of that front end/UI/UX stuff, partly in the hope of getting together a bit of a template that I can easily apply to other potential projects, and partly in the hope of gaining some knowledge to pass on in tutorials etc.
However, I have been trying to mull over ideas for what I think would be a good way to approach search to achieve something that can be decentralised in all senses of the word. Personally I think the problems of trying to achieve this are as much social and economic as technical, which is not to say the technical side isn’t extremely difficult in itself.
As I’ve mentioned elsewhere, I’ve really come round to a pretty uncompromising view in the sense that I no longer want to work on something that tries to take advantage of the short term efficiency of (aiming to) hold a single index for the network, even if that index were in some senses distributed.
There are a few reasons for that, but I won’t go into them just now, unless anyone really wants to have that argument!
As some may remember I argued that way myself for a while, just because I couldn’t see even a barely feasible alternative, but in the past couple of weeks I’ve been seeing how it could perhaps work. I’ll go into some of the ideas I’ve been having in another post, in the hope that someone may or may not say they’re completely ridiculous!
So the idea that I’m liking as a starting point at the moment is a very simple one (though it may take a while to explain).
The idea would be that website owners index their own site (possibly using certified software, but as long as it’s in a standard format we can assume it’s in their own interests to be accurate.)
The index for a site will be an inverted index, i.e. the location of any occurrences within the site of, say, the word ‘apple’ will be referenced from the ‘apple’ container. On SAFE, that container would be a network address. The index for a site can run to being quite large, but not large in the sense of the Google index, so that may leave some options for its format, but for now let’s assume it’s a hash table, which is pretty much infinitely scalable. If someone with public name ‘Bob’ owns safe://bob, then in order to find a list of locations on that site containing the word ‘apple’, we go to (hash of (apple+bob)).
This formula would perhaps be stored in a known element of the site. Also stored there would be a list of all the indexed words on that site.
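Just to make the addressing concrete, here’s a rough sketch in Python (the hash function, names and data shapes are placeholders I’ve picked for illustration, not a proposed format):

```python
# Rough illustration only: derive the network address of one word's entry
# in a site's inverted index from hash(word + owner's public name).
import hashlib

def index_entry_address(word: str, owner_name: str) -> str:
    """Address of the index entry for `word` on the site owned by `owner_name`."""
    key = (word.lower() + owner_name.lower()).encode("utf-8")
    return hashlib.sha3_256(key).hexdigest()

# Bob's indexing software would store, at this address, the list of locations
# on safe://bob where 'apple' occurs, e.g. ["safe://bob/orchard", "safe://bob/recipes/pie"]
print(index_entry_address("apple", "bob"))
```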
We expect that Bob, using software, will keep his index up to date. Remember, this may be an index for a news site with tens of thousands of regularly updated pages (or it could be a small, single-page site that’s rarely updated).
The financial cost of this is unlikely to be too onerous for Bob, and he has a strong incentive to do it so that people can navigate his website. However, if a search engine was doing this work for all the sites it indexed, it would be cumulatively very expensive, thus making it difficult for anyone to start a new search engine.
The obvious problem we have here though is that it would take far too long for a search engine to look for (hash of (Apple + PublicName)) for every site in its index, and most of them probably don’t contain the word ‘Apple,’ or at least not in a sense of any importance.
To solve the last part of this problem we must only index important words, but that is a traditional search problem which has been solved before (though it’s far from easy.)
To solve the first part, the search engine simply needs to go to the list of important words a site contains (mentioned earlier as being made available by each site). This can be done ahead of time and held in the index belonging to the search engine itself (although referencing of that index is done directly by the client computer). This index will not be small, but it only references what the search engine recommends, only references sites rather than pages, and does not need to be kept constantly updated. Or at least, it only needs to be updated when a site adds a word it has never used before.
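As a very rough sketch of that ‘index of the indexes’ (the data shapes are entirely made up), the engine just folds each recommended site’s published word list into a single word → sites map ahead of time:

```python
# Sketch only: fold each recommended site's published word list into
# a single word -> sites map, built ahead of time by the search engine.
def build_engine_index(recommended_sites: dict) -> dict:
    """recommended_sites maps a site's public name to its published list of important words."""
    engine_index = {}
    for site, words in recommended_sites.items():
        for word in words:
            engine_index.setdefault(word.lower(), []).append(site)
    return engine_index

engine_index = build_engine_index({
    "bob": ["apple", "orchard", "cider"],
    "alice": ["apple", "pie", "baking"],
})
print(engine_index["apple"])  # ['bob', 'alice']
```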
As an example, I called my search engine Octopus, so let’s use that:
The client wants to search for the word ‘apple’. First they go to (hash of (apple+octopus)). Here they find that, out of the sites Octopus recommends, safe://bob and safe://alice both feature some information about the word ‘apple’. The client computer therefore goes to (hash of (apple+bob)) and then (hash of (apple+alice)), which return some page locations: our search results.
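Putting the two hops together, the client-side flow might be something like this (a dict stands in for the network here; every name and lookup is just a placeholder, not the actual SAFE API):

```python
# Two-hop lookup sketch: engine entry -> candidate sites -> each site's own index entry.
import hashlib

def addr(word, name):
    return hashlib.sha3_256((word.lower() + name.lower()).encode()).hexdigest()

# Pretend network storage: the engine's entry lists sites, each site's entry lists pages.
network = {
    addr("apple", "octopus"): ["bob", "alice"],
    addr("apple", "bob"): ["safe://bob/orchard", "safe://bob/recipes/pie"],
    addr("apple", "alice"): ["safe://alice/blog/apple-day"],
}

def search(word, engine="octopus"):
    results = []
    for site in network.get(addr(word, engine), []):        # hop 1: sites the engine recommends for this word
        results.extend(network.get(addr(word, site), []))   # hop 2: that site's own index entry
    return results

print(search("apple"))  # pages from bob and alice
```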
Note that I use the word ‘recommends’ for the search engine. Ordinarily, a search engine would not be able to afford to recommend things, because it would not result in a large enough user base to justify the cost of maintaining the index (assuming it could even find an income stream from those users.) However, by putting most of the cost of indexing and updating onto the website owners, we leave search engines freer to do their own thing, opening up the possibility of a multiplicity of search engines, rather than a monopoly or oligopoly.
Importantly, this cost to the website owners is not tied to a particular search engine, so the search engines do not owe them anything either. The websites are not the ‘customers’ of the search engines.
The search engine may use a combination of software and real humans to decide what it will or won’t recommend. If (hash of (apple+octopus)) returns too many results, the search engine may need to use its own ratings or relevance system, but this should still be paring down a limited number of results, rather than the billions that Google is dealing with.
Search engines could also pair with each other to expand their range of results, but still giving preferential treatment to the sites they specifically recommend.
Storing keywords in an RDF format or classifying them in a schema may help to better understand search queries and help with stemming etc. which could be very useful in the future as numbers get larger and things need to be more sophisticated.
Hope that makes sense and is not too wildly naive, but let me know if you think it is. Tagging @joshuef, @happybeing, @krnelson and @danda as I know you’ve all had thoughts on this, here and elsewhere. I’m sure others too, will have thoughts…
I like the idea and your logic of pushing the cost out to the owners seems a worthwhile approach. Effectively you are defining a way for owners to publish an index which anyone can use to locate information on their site, and some can use to build things like search engine apps, but not only search apps. It’s not that far from the ideas of the semantic Web, where the content itself is marked up to make the meaning understandable to machines. This I like.
Your site index is like a secondary layer highlighting the most meaningful index terms - which is not far off how site maps used to be created for submission to or crawling by search engines, before the engines became able to do this directly.
I don’t think it’s the case that site owners will be accurate in creating their indexes. The quality of indexes will vary widely, and the aims will align with the site owner’s goals rather than the ways one person or another might want the indexes to be useful. In practice the two sides will interact - as happens currently with search engine algorithms and websites using search engine optimisation (SEO).
On today’s web many owners will do anything to get traffic (or the people building sites for others etc.). Different owners have different motivations, and different site builders will be rewarded for indexes designed for those different goals, so it’s complicated.
I think we can play with this and try to imagine how different kinds of index might be formulated, and be utilised by different index consumers.
Spam
An obvious site strategy is to spam the index with keywords.
Counters to this are:
- force the site to rank the keywords, so the engine can cut off and use only a ‘best’ subset, negating the value of spamming and encouraging the index creator to be selective (see the sketch after this list).
- the engine could keep its own index of metadata about sites and their indexes in an attempt to rank the quality of the indexes so better ones are prioritised. This in turn creates incentive for the site owners to be honest, but how to test quality is tricky and opens up the whole cat and mouse game we see in SEO.
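A tiny sketch of the ‘ranked, best-subset’ counter (the scoring here is just raw counts, not a real relevance measure):

```python
# Sketch: a site publishes only its top-k ranked terms, so stuffed keywords
# can't make it into the published subset without displacing genuine ones.
from collections import Counter

def best_subset(term_counts, k=50):
    """Return the k highest-ranked terms from a site's term counts."""
    return [term for term, _ in term_counts.most_common(k)]

site_terms = Counter({"apple": 40, "cider": 25, "orchard": 18, "cheap-watches": 1})
print(best_subset(site_terms, k=3))  # ['apple', 'cider', 'orchard']
```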
Just some initial thoughts.
Thanks for the thoughts,
What I’m hoping would make a huge difference to the power dynamic is that if there are many search engines, then site owners have to be honest and not spam, otherwise they will just get dropped from the search engines. The idea is that each search engine would be small enough in scale for this to be detected relatively easily. Search engines that want to play cat and mouse with clickbaiters are welcome to do so.
I guess in a way the paring down is similar to site maps, but I think I was imagining something not quite so drastic - somewhere between keywords and full text search (which still obviously needs to exclude a lot of filler words in its indexing, and to understand which words are particularly important to a given page.) An interesting phrase I pulled from here is ‘what is unique’ about a text. This is an area where I think RDF could come in handy to mark up the site indexes to show the context/relationships of words.
A difference with the semantic web though (as far as I understand it anyway) is that we would be taking the mark up out of the page itself, so making it more targeted and much less expensive to query. Developing software for owners to index their sites would also remove a lot of that potential reliance on how well sites are marked up (which of course varies wildly,) and which I know Google complained about from very very early on.
I guess another difference (perhaps) from the semantic web is that I’m relying on the idea of moving up a sort of tree, rather than just having one big interlinked web. Maybe like what I’m suggesting being representative democracy, vs. semantic web as direct democracy. I think that analogy only holds as far as the structure though!
I haven’t managed to dig into it yet, but the Sketch Engine tool I linked to above looks really interesting as a starting point for understanding and processing language.
RDF is just a bulky way of storing data… no reason to get hung up on that.
There’s no reason not to index everything within a site (it makes little difference). Noun phrases are more useful than single words. Not everything is text - indexing media is a different kind of problem.
The trick for a distributed solution might stem from not worrying about knowing everything but having an option to find something. So, nodes more like individuals with limited awareness - if you don’t know, you ask your neighbours. One node could index a certain random subset of sites - not too onerous, and if there are not sufficient answers it could ask for more - a flood of queries to other random nodes. So, you might get a different result set each time, but why should that matter.
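Very roughly, something like this (everything here is invented just to show the shape of it):

```python
# Sketch: each node indexes a random subset of sites; if its own answers are
# too few it asks a couple of random neighbours, so results can differ per query.
import random

class Node:
    def __init__(self, local_index, neighbours):
        self.local_index = local_index  # word -> pages, for this node's random subset of sites
        self.neighbours = neighbours    # other Node instances it can ask

    def query(self, word, want=10, ttl=2):
        results = list(self.local_index.get(word, []))
        if len(results) < want and ttl > 0 and self.neighbours:
            for n in random.sample(self.neighbours, min(2, len(self.neighbours))):
                results.extend(n.query(word, want, ttl - 1))
        return results

a = Node({"apple": ["safe://bob/orchard"]}, [])
b = Node({}, [a])
print(b.query("apple"))  # answered by a neighbour, not by b itself
```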
So, one option might be to see owners tasked with indexing some random sites - hidden from the user, but which include their own site with whatever human input they put into it; that submission validated by the random-site indexing that others will do. If you don’t know who is doing validation on your own submission, then perhaps that would help site owners be honest in their submission. Still, it’s unclear to me what real added value a human brings to indexing, but so much of the above tempts it…
Tasking those nodes with what they need to do, would need thinking through as there needs to be a level of consistency and common understanding of the aim and purpose.
tldr; distribution is about many small tasks and that is different from one big all knowing solution. Which tempts a question of how you get an answer from those nodes… what do they each hold?..
I think some might take exception to that statement! Sometimes I think that way, but then sometimes I really get it. For this sort of context it’s the schema side of things that appeals more immediately to me, but either way it’s not key to what I’m suggesting.
These two things are quite key to what I’m looking at. I’m trying to get away from the idea of knowing everything. The human side is that each search engine only adds to its index what it thinks is worthwhile. The other (more completist) option is to index everything and then try to get an accumulated rating/ranking from users, whether that’s by a PageRank-style system, or a ratings system or whatever. I think that’s actually much more cumbersome, and will not necessarily lead to better results in most cases. It also leads onto the slippery slope of data collection.
I like that you’re encouraging a distributed solution… tempts that any solution could be distributed.
Bias…
It rather depends whose interests you are providing for… the site owner’s or the user’s.
A user seeking to discover might consider the site owner’s opinion a curiosity… which perhaps is just an addition to the site owner also providing content… but there’s a risk in expecting site owners to be competent in every dimension of activity… they might do themselves a disservice by not putting in the required effort to represent their content fairly - and equally, balanced relative to others. Tempts a very small dataset, limited by the imagination of the site owner for what others might be looking for. Bias also follows from ignorance; site owners might miss tricks, and then you have a problem of SEO just because they put the effort in… but is theirs the better content?
My instinct is to play contributors to their strengths… human site owners can spend their time providing content; computers can index quicker and more thoroughly - and notably more accurately… there’s a 5-10% data entry error rate to contend with if owners are inputting data.
Indexing everything is not about ranking or rating… but relevance.
Find [noun phrase] [noun phrase] will not be possible, if the data is not known.
The downside of indexing everything is a fktonne of metadata, so that is another consideration but a computer having been thorough perhaps could pare down to a good subset better than a human.
A distributed solution is not a collection… there is no sum total … it’s like the sum total of all human knowledge held in peoples minds… only exists in the abstract.
The problem perhaps is encouraging each node in the engine to behave in the same way but I’d expect that is possible as some application.
oh and
It’s a matter of what purpose the data is being put to… RDF is useful in some contexts, but if you’re storing data for a computer there are more efficient ways than something that generic, which requires time to consider at each read.
It rather depends whose interests you are providing for… the site owner’s or the user’s.
A user seeking to discover might consider the site owner’s opinion a curiosity
I think you’re maybe misunderstanding what I was suggesting here. Each site holds its own index (which I would prefer to be generated by a computer - I maybe shouldn’t have suggested that was optional in my first post). The search engines hold their own index of the indexes. This is where the opinion comes in - each search engine only chooses to search the indexes of those sites which it recommends. It can afford to do this because it is not trying to be Google, and can also work in collaboration with other search engines.
You’re talking about search engines plural rather than nodes/instances of a certain type of engine… and yet also suggesting generated by a computer - which tempts a consistency. So, I’m not clear who determines what.
Still, the idea of having sites hold an index - relative to a standard, is a good one. The query I would have is that it would be a cluster of metadata - so, would perhaps want to be a single file, rather than many files, just for the cost of that. Perhaps that would be where RDF would be used but with nodes collating a subset of random sites ahead of time as prep for any request they might get from their owner or from their neighbours.
Sorry I mentioned the semantic Web; I did so because I see a similarity, rather than a need for it to be used in this context. Where sites do provide such markup, it can be used to improve the relevance of term matching, but this is not the tricky part.
Aside: RDF isn’t bulky BTW. ‘Bulk’ is a function of the serialisation, and since RDF concentrates meaning it is a way of encoding more information than a keyword index in a given space. See HDT for example.
Anyway, back to this topic. For this we don’t need to consider RDF. Whether we use one form of knowledge representation or another isn’t what’s relevant, but how we generate, store and make use of it.
It’s an interesting and useful discussion to have.
I think @david-beinn, you are choosing to look at what will be possible with MVE, whereas if I understand @davidpbrown, you are looking ahead to what might be done with enhanced features. Just commenting that I think both are interesting ways to think about this.
The idea is the format of the indexes would be relatively uniform, but perhaps with a certain tolerance. They would be generated by the computer belonging to the website owner, using certified software. As a protocol it would need a certain level of adoption to really work, but it’s probably easier than trying to get people to mark up their sites fully and correctly.
This still leaves plenty of scope for differentiated search engines to navigate between these site indexes and produce different results.
If there are multiple search engines operating, at each stage I think there is plenty of incentive to maintain ones own reputation, and therefore for the site owner to be honest/ for the search engine to ensure it returns good results.
So, you have a good idea there…
- build node one
- clone
- profit
Yeah, MVE is a good way of looking at it. My feeling is that if the structure of how data is held is good initially, then a workflow can be established, and a lot of the more sophisticated stuff built on top of that.
Even if a centralised out of the box solution overtook this sort of idea initially, I’d like to think that persuading site owners to run some indexing software instead of worrying about the weird world of SEO might be quite attractive, and there would also be certain benefits for clients as well.
I think there are merits in both of your ideas and I’m wondering if they could be combined in something that could be tried with MVE functionality.
If you have each site index linked, and could go from one to another, you can get a long way in a few hops. I’m not sure how that can be done but I like the idea.
So a bit like a web crawler, but with the indexes already built.
The only real problem I see with this is that someone can start claiming those hashes before the sites do, breaking the entire search.
What if that were the default, but it was also possible for a site to put a link to a list of all those hashes in their robots.txt (assuming we still have those)? Though now that I’m thinking about it, maybe that would open up a new attack vector: search indexes spoofing the location of a site’s search data.
Otherwise I like the idea, especially since it leaves room for communities to create their own search index which they can then let other searchers use.
Thanks @isntism. Firstly, as I mentioned way, way back up the chain, I think this sort of thing would need a type tag dedicated to indexing. One approach to avoid squatting might be that people were only allowed to set up addresses that were the hash of a string ending in their PublicName. For this type of idea I’d feel more comfortable with a dedicated type tag at the network level than I did previously, because it would just be helping people index their own sites (which has no inherent bias) to make them more accessible to a next level of search engines, instead of trying to set up a single network index, which would inevitably have bias built into it in terms of what was ranked higher or lower, even if those rankings were sourced from a public ratings system.
I was thinking about the traversing idea a couple of weeks ago @happybeing. I think the big problem is that it has to be purposeful to have even close to an acceptable level of efficiency. Of all the links in a website (or its index), how do I know which one is going to get me closer to the search term I’m looking for?
The idea I came up with was very semantic-web style in the sense that we would try to understand all search terms first, and they would be classified in a hierarchical schema, to be referenced before commencing each search. Site indexes would be linked to each other by using their indexed words as nodes. So if I came to your site looking for ‘apple’ and you didn’t have that, I would go one step up the tree and ask about terms in the class ‘fruit’, and maybe you don’t have that either, but maybe you do have a node for the class ‘food’, and you say well my best contact for things in the class ‘food’ is safe://sadbeing, and I go there and start the process again.
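For what it’s worth, a rough sketch of how that walk might look (the schema, the ‘referrals’ field and all of the names are invented purely for illustration):

```python
# Sketch: walk up the class tree until a site either has the term
# or can refer us onward to a better contact for that class.
schema_parent = {"apple": "fruit", "fruit": "food"}   # term -> broader class

site_indexes = {
    "happybeing": {"terms": {}, "referrals": {"food": "sadbeing"}},
    "sadbeing":   {"terms": {"apple": ["safe://sadbeing/pips"]}, "referrals": {}},
}

def traverse(term, site, hops=5):
    query = term
    while hops > 0:
        idx = site_indexes[site]
        if term in idx["terms"]:                  # the site has the exact term
            return idx["terms"][term]
        if query in idx["referrals"]:             # site names a better contact for this class
            site, query, hops = idx["referrals"][query], term, hops - 1
        elif query in schema_parent:              # otherwise step up to the broader class
            query = schema_parent[query]
        else:
            return []                             # ran out of classes and contacts
    return []

print(traverse("apple", "happybeing"))            # ['safe://sadbeing/pips']
```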
Ultimately I felt this needed language understanding baked in, and would become very difficult for words with multiple meanings, noun phrases etc., and I thought the idea I outlined yesterday was more promising.
In terms of MVE I think yesterday’s idea would be reasonably easy if we made submitting keywords completely manual initially. I don’t have the skills to do anything else in the foreseeable future anyway! Again, I feel like setting up the workflow would have value anyway.
One approach to avoid squatting might be that people were only allowed to set up addresses that were the hash of a string ending in their PublicName.
Another way would be for the site to have a file or data structure called something like siteindex. Then they can index their site and no one can squat their records. This also removes any need for a network-reserved type. We can simply designate a particular type to be recognised as indexing.
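Roughly (the filename and structure are made up): instead of computing hash(word + name) addresses, everything would live under a fixed name within the site itself, which only the owner can write to:

```python
# Sketch: the index lives at a well-known path inside the site's own data,
# so nobody else can squat on it. 'siteindex' is just a placeholder name.
def siteindex_url(site_name):
    return "safe://{}/siteindex".format(site_name)

# A search engine or client fetches e.g. safe://bob/siteindex and reads the
# word -> pages map from there, rather than computing per-word hash addresses.
print(siteindex_url("bob"))   # safe://bob/siteindex
```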
That’s a useful way of thinking about it actually.
I guess then the tl;dr of my suggestion boils down to helping sites make themselves a back-of-the-book index in a known format (as yet undecided). As opposed to metadata indexes, this index would eventually have sufficient information to enable full-text search, and would also be very quick and easy for a search engine to search. The ultimate purpose would be to make search easier and cheaper, in order to avoid a natural monopoly of search.
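Purely as a guess at the kind of thing that format might contain (it genuinely is undecided, so treat this as one possible shape rather than a proposal):

```python
# Illustrative only: one possible shape for a site's 'back-of-the-book' index.
import json

siteindex_example = {
    "site": "safe://bob",
    "terms": {
        "apple": {"pages": ["/orchard", "/recipes/pie"], "weight": 0.9},
        "cider": {"pages": ["/orchard"], "weight": 0.6},
    },
}
print(json.dumps(siteindex_example, indent=2))
```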
I think it’s a worthwhile experiment and that whether it works out or not we could learn from it. I’m curious to know if similar things have been tried, and what research might have been done on decentralised indexes etc. I imagine there’s a lot to plough through, but asking in the right places might find some real gems.