And that APP gets the app rewards, so they pay to store the index and you get paid for supplying the APP. It also allows people to compete on writing APPs that do the work of writing in the specified format. And of course the format has to allow for future expansion, which can take the form of reserved space, variable-length records, space for pointers to additional records, or whatever.
Hi @neo, just wondering if I could pick your brains a little further on this suggestion, and perhaps @dirvine or one of the engineers might be able to shed some light on the way data types work as well.
My original thinking (of having different addresses for different keywords) was based on the idea that a big index, even for a single site, would obviously be far too large to download and search on the client computer, especially if that process needed to happen many times in a single search.
However, I imagine a Data Map (key-value store) could also serve the purpose, depending exactly how it works. If I use a getKey command, how efficient is this? Is all computing done on the network? Do we need to pull all the chunks from the Data Map together before iterating over them? Or can we just go straight to the Key?
Or did you have any other file-data structures in mind @neo?
One interesting possibility of this approach of a site index is that public data from people’s accounts could be indexed if they so choose, making a network of data rather than just websites. I seem to remember @JimCollinson talking about making people an index of their account somewhere.
My thought was that, whatever data structure you were thinking of, a special/system type is not needed.
When you are indexing, I gather there will be some database style of indexing, and since you are creating the structure, you can do it however you want.
Now if you were thinking of the search term creating the address to search for, then even special/system tags/types will not help.
There are going to be billions of sites, so the chance of collisions is significant and you’d need some sort of collision handling.
I was of the impression that a site will do its own indexing, which saves a lot of the work/cost of crawling the site. By using a specific format/protocol, a network-wide search would simply go to each site and process its index to create a hybrid index, where the network-wide index knows whether the site has that search term (and its scores??). Then, when presenting the results of the search, the app would access each site’s index records.
Forgot to say, the site would have an index “file”, which is a file of pointers to the data structure you make, or a file of terms/pointers.
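As a purely illustrative sketch of what such an index “file” of terms/pointers might look like (all field names, scores, and addresses here are my own assumptions, not any agreed format):

```python
# Hypothetical per-site index "file": a map from search terms to
# pointers (xor addresses) into the site's own data structures.
# Field names and scoring are illustrative assumptions, not a spec.

index_file = {
    "format_version": 1,        # reserved space for future expansion
    "site": "safe://neo",       # hypothetical site name
    "terms": {
        "apple": {
            "score": 0.83,      # within-site importance/relevance
            "pointer": "a3f1",  # stand-in for a 32-byte xor address
        },
        "banana": {
            "score": 0.12,
            "pointer": "9c0e",
        },
    },
}

def lookup(term: str):
    """Return the pointer record for a term, or None if absent."""
    return index_file["terms"].get(term)
```

A network-wide crawler would only need to understand this one format to fold any site’s terms into its hybrid index.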
I think your last paragraph there is summing up what I intended, yes.
Perhaps the difference is that I was imagining scores for importance (within the site), relevance and newness would be kept within the site index, as well as specific location within the site. Therefore the search would have to access that index (amongst potentially many others) before presenting the results. The search engine’s own index would just narrow things down.
Interesting you mention collisions. Intuitively that feels like a possibility, but I was kind of trusting that it wasn’t!
This I suppose is what I was really getting at. What sort of database structures would be appropriate and efficient to live on the network itself? I’m afraid I’m pretty ignorant on this subject.
Have a look at those algorithms that create the image with words of different sizes, showing how often/popular a word/term is on the site. I forget the name. I want to say word cloud, but I really am not sure.
In one sense this seems like my original intention, it’s just that my data structure would be very flat. The ‘file’ just contains a formula, so that I can find out about instances of the keyword within your website by going to (hash of (keyword+neo)), which I was hoping was a truly unique address!!
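The ‘hash of (keyword+neo)’ idea can be sketched in a few lines. This is just an illustration of deriving a deterministic 256-bit address from the keyword and site name; it is not how SAFE actually allocates addresses:

```python
import hashlib

def keyword_address(keyword: str, site: str) -> str:
    """Derive a deterministic 256-bit address from keyword + site name.
    Sketch of the 'hash of (keyword+neo)' formula, not a SAFE API."""
    return hashlib.sha256((keyword + site).encode("utf-8")).hexdigest()

addr = keyword_address("apple", "neo")
# Anyone can recompute the same address, which is both the appeal
# (no lookup table needed) and, as discussed below, the weakness
# (anyone else can compute it too and create data there first).
```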
Other data structures I’ve read about, such as trie structures, seem more designed for within the memory of a computer itself, rather than to live on a network. Like I say though, I don’t really have the background to know where to dig deeper on that sort of thing.
I’ll see what I can find, thanks!
Wasn’t really expecting a prompt reply to be honest, I imagine it’s not the best time of the day in your part of the world!
Collisions will happen because there are people who will just use the formula to create data at those addresses. Unless you can keep some sort of salt secret forever, everything needed is available to other APPs.
You have to generate the address when reading/looking for a search term, and someone can create a data record at that address and stuff you up.
This is another reason why using a programmable-address algo is not a good method for this.
Now if you had your own hobby APP and wanted calculated addresses, then you would pre-purchase a block of addresses that fits the bill. And then still run a minor risk that, while buying them, someone buys one of those addresses for some reason.
Of course you could have the “file” be full of calculated address → data record address mappings. And this is going to be a common thing in SAFE when a desired address cannot be guaranteed to be available.
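A toy model of that indirection, with a local dict standing in for network storage (the `allocate` behaviour is my assumption about falling back to a random address when the calculated one is taken):

```python
# Sketch of the calculated-address -> data-record-address indirection.
# 'taken' and 'allocate' simulate network allocation locally; they are
# illustrative assumptions, not SAFE behaviour.

import hashlib
import secrets

def calculated_address(term: str, site: str) -> str:
    return hashlib.sha256((term + site).encode()).hexdigest()

taken = set()

def allocate(desired: str) -> str:
    """Grant the desired address if free, else a random fallback."""
    if desired not in taken:
        taken.add(desired)
        return desired
    actual = secrets.token_hex(32)  # random 256-bit fallback address
    taken.add(actual)
    return actual

redirects = {}  # calculated address -> data record address actually used
for term in ["apple", "banana"]:
    want = calculated_address(term, "neo")
    redirects[want] = allocate(want)

# If someone had already squatted the calculated address, allocate()
# would hand back a different random address, and the redirect entry
# is what lets readers still find the record.
squatted = allocate(calculated_address("apple", "neo"))
```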
I’m intrigued by this topic but I haven’t really had the time to go into it in any depth (furlough me goddammit!). I get the sense that when someone (someone clever, not me) cracks this, a whole lot of pieces are going to fall into place, not just for search but for other database-type operations too.
There are problems that can be distributed and problems that cannot? I don’t yet know the options for having many nodes answer a focused question of what is available on a topic.
It’s also not obvious to me whether a node on the network can be addressable for tasks that are done outside of the network - services that provide for the network but that are not distributed… are vaults on an xorurl of their own, or some other address that can be called like an IP address?.. Even if there’s a risk of those changing over time, like tracking a static IP address, that would be something. Server-like functions have a role to cover off what cannot (yet?) be done decentralised?
Even within the network, computing doesn’t really happen yet, if I understand correctly - it’s not quite like a server, where you can just ask it to carry out a computing task. However, there are obviously certain commands that you can ask of the network via the API. I’m curious for this purpose as to exactly where those commands are executed, and how - i.e. are they efficient enough to build a fairly large (but simple) database using an off-the-shelf Data Map as it will be (a key-value store, what used to be called Mutable Data)?
The specific question I was asking of @dirvine or anyone else from @maidsafe above was how a key (and its value) is retrieved from within a given Data Map. Does that have to happen on the client’s computer (with the Data Map downloaded), or does the vault where the Data Map is stored retrieve it for us, without having to bother with the rest of the Data Map? If it’s the latter, then how does the network deal with a scenario where the Data Map has been split into chunks, and perhaps encryption is an issue there too? Ultimately, how fast does this happen?
Yes, this is how it works. Data maps can be encrypted etc. and the network should never see them where at all possible. Public data is a data map at some location and the client reads that and retrieves chunks.
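That client-side flow can be modelled locally in a few lines. A dict stands in for network storage, and the map format here (a comma-separated list of chunk addresses) is purely my own simplification; real data maps are encrypted and structured differently:

```python
# Minimal local model of the flow described above: the client fetches
# the data map itself, then fetches each chunk - the network never
# interprets the map. All names and formats here are illustrative.

network = {}  # address -> bytes, standing in for network storage

def put(addr: str, data: bytes) -> None:
    network[addr] = data

def get(addr: str) -> bytes:
    return network[addr]

# A "public data map" here is just a list of chunk addresses
# stored at a known location.
put("chunk-1", b"hello ")
put("chunk-2", b"world")
put("map-addr", b"chunk-1,chunk-2")

def read_file(map_addr: str) -> bytes:
    # All of this work happens on the client, per the answer above.
    chunk_addrs = get(map_addr).decode().split(",")
    return b"".join(get(a) for a in chunk_addrs)
```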
Thanks very much, that’s really helpful to know (although I would have preferred it if my fantasy of the network doing it for us had been true!)
Been thinking about the index database side of this problem and doing a few back-of-an-envelope calculations re. size, but wondered if perhaps there’s a blindingly obvious solution in the form of the NRS system.
In short, a search database on the network will need to support very fast look-up of a large number of entries (probably in the millions at least, even just for a full-text index of a single large site). Hence my initial keenness to make use of the existing hash table (i.e. the basis of the network) by going straight to the address that is the hash of the search term. The probably prohibitive problems of this were pointed out by @neo above.
If we use random addresses though, then we have to reference them from somewhere. Bearing in mind that each address is 32 bytes, the map even to find our initial start point for the search would be pretty large (in the order of megabytes), and would need to be downloaded in order to search through. The only alternative I can think of is some kind of improvised structure (e.g. like a data tree but using addresses as nodes??!!), which also seems a bit of a non-starter, to my somewhat untutored eyes at least.
However, I then realised that just using the NRS system could tick the necessary boxes. So if my search engine is ‘octopus’ and I wanted to index the entries for the word ‘apple’, I would just store those entries at a random xor address and then link that address to the NRS name safe://octopus/index/apple. Nobody else can (presumably) register an NRS name below safe://octopus once I have registered it, so there is no problem of squatting. And there’s a formula there that can be very easily computed by the client, so the client can very quickly reach the required page.
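The scheme can be sketched as follows. The ‘octopus’ engine name comes from the post above; the dict standing in for the NRS map, and the helper names, are hypothetical:

```python
# Sketch of the NRS-based scheme: store each keyword's entries at a
# random xor address, then link that address under a name only the
# search engine controls. The dict stands in for the real NRS map.

import secrets
from typing import Optional

nrs_map = {}  # NRS path -> xor address

def register_keyword(engine: str, keyword: str) -> str:
    addr = secrets.token_hex(32)  # random 256-bit xor address
    nrs_map[f"safe://{engine}/index/{keyword}"] = addr
    return addr

def resolve(engine: str, keyword: str) -> Optional[str]:
    # The client computes the NRS path directly from the search term -
    # no lookup table of addresses needs to be downloaded.
    return nrs_map.get(f"safe://{engine}/index/{keyword}")

apple_addr = register_keyword("octopus", "apple")
```

The squatting problem goes away because only the owner of safe://octopus can add entries under it, while the ‘formula’ (the NRS path itself) stays trivially computable.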
My biggest question right now is that I’m not 100% sure how the NRS system works, in terms of whether it would still be fast enough at looking things up if there were millions of entries in a single NRS map (which from the CLI ReadMe is apparently a publicSequence data type). I imagine a sequence isn’t the most efficient thing to search, but it might be better than the alternatives at this point. Does the vault do the searching for us, or do we need to download and search that public sequence as clients?
As usual, any enlightenment, engagement, criticism gratefully received!
afaik, the latter.
I believe @danda is correct, as currently implemented the whole sequence is fetched to the client and kept in cache memory so that subsequent operations will be fast. In the long term I expect that there will be evolution to help performance for a range of use cases.
I’ve summarised my understanding of the Sequence data type here:
Thanks @danda and @happybeing.
That totally makes sense for any standard type of data. Somehow I hoped that for the NRS map of a site it would make more sense to ‘go straight there’, though. NRS is part of the network, so why can’t we just have a special sequence data type and get the vaults to do the work? (I’m sure there are good answers to that!) Something like a search index or a large site will be changing rapidly, so I imagine the cached version won’t really be of much use.
Currently there are close to 400M domains. If you take this number and multiply it by 32 bytes per entry, you get about 12.8 GB…
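The arithmetic behind that figure:

```python
# Back-of-the-envelope check: 400M domains at 32 bytes per entry.
domains = 400_000_000
bytes_per_entry = 32
total_bytes = domains * bytes_per_entry  # 12,800,000,000 bytes
gigabytes = total_bytes / 1e9            # 12.8 GB
```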
Am I wrong here? How can clients handle this?
An NRS name maps to a Sequence, so when the client accesses a file via a URI it accesses one Sequence. It doesn’t have to access the whole NRS at once.
