Problem:
On the legacy internet we are used to asking a server for a file and having the server locate it for us, using a file system or some other data structure held on that same server. On the Safe Network we do not have this facility in the same way. As a client we must first fetch the file system or other data structure as a file (or files) from a vault, use this ‘location map’ on our own machine to look up the address of the file we are ultimately seeking, and only then fetch that file from the network.
For small data structures this is fine, as it is for cases where the ‘location map’ can be stored natively, but for larger web-based structures the initial download can quickly grow prohibitively large, bearing in mind that each entry needs at least the file name plus its 32-byte address on the network. If we optimistically estimate 50 bytes per entry, a ‘location map’ of 1000 entries already runs to 50 kB. Again, this is fine for some applications, and there may be some optimisations we could make, but for applications needing multiple sequential lookups over very large file sets it is unlikely to match general performance expectations.
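To make the ‘location map’ idea concrete, here is a minimal sketch (hypothetical names, not the actual Safe API) of the kind of client-side lookup structure I mean, along with the rough size arithmetic above:

```rust
use std::collections::HashMap;

/// Hypothetical client-side 'location map': file name -> 32-byte network address.
/// The whole map has to be fetched (or cached) before any local lookup can run.
fn lookup_address(map: &HashMap<String, [u8; 32]>, file_name: &str) -> Option<[u8; 32]> {
    map.get(file_name).copied()
}

fn main() {
    let mut map = HashMap::new();
    map.insert("index.html".to_string(), [0u8; 32]); // placeholder address
    map.insert("about.html".to_string(), [1u8; 32]); // placeholder address

    // Rough size estimate: ~50 bytes per entry (name + 32-byte address).
    println!("~{} bytes for {} entries", map.len() * 50, map.len());

    // Only once the map is held locally can we resolve a name to an address
    // and then fetch the file itself from the network.
    assert!(lookup_address(&map, "index.html").is_some());
}
```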
As far as I understand it, the CRDT Tree Data Type is designed to address this problem for the file-system use case, at least by optimising the structure and, in particular, by making it possible to store an initial download of the file system or ‘location map’ natively and then just update it from a network copy as necessary. However, there are still likely to be use cases where we can’t expect clients to hold a native copy of the map, as well as cases where updates will be frequent and onerous to process.
As anyone who has read my previous posts will know, the application that made me think of this is a web-based search index, which requires millions or billions of words and phrases, each held in a separate file. Such an index would also be updated very regularly, and a truly decentralised model would require multiple indexes, all of which a client should be able to access in a timely manner.
Another example might be the NRS map for a site with tens or hundreds of thousands of pages.
Solutions:
As already mentioned, I believe Maidsafe are currently working on optimising the model for a filesystem, and on making it so that it can be easily and flexibly held natively.
What I am proposing here is that for some use cases it could be possible to bypass having to download a ‘location map’ at all.
The proposal uses a ‘hashing formula’, so it wouldn’t be appropriate for pieces of immutable data or ‘blobs’, which are always held at the address that is the hash of their contents. However, there may well be ways it could be used to speed up looking for immutable data too.
The proposal is based on the fact that we can ‘choose’ the address of a piece of mutable data, regardless of its content (this follows naturally, because the content can change).
For example, we could choose to store the details of my ID on the network at the address that is the 32-byte hash of ‘David’. Everyone would know they could access my details just by hashing ‘David’, which is a pretty fast operation for a computer to do.
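A minimal sketch of that idea, using SHA-256 from the `sha2` crate purely for illustration (the network's actual hash function and mutable-data API may well differ):

```rust
use sha2::{Digest, Sha256};

/// Derive a 32-byte address from a name, e.g. "David".
/// SHA-256 is an illustrative choice, not necessarily what the network uses.
fn address_for(name: &str) -> [u8; 32] {
    let digest = Sha256::digest(name.as_bytes());
    let mut addr = [0u8; 32];
    addr.copy_from_slice(digest.as_slice());
    addr
}

fn main() {
    // Anyone who knows the convention can recompute this address locally,
    // with no 'location map' download required.
    println!("address for 'David': {:02x?}", address_for("David"));
}
```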
The downside of this is that if somebody else called David gets there first, I will never be able to store my details at that address, so we are back to the problem of having to store a pointer to the address where my details actually are.
Although for most use cases the formula would be much less generic than ‘hash of [name]’, the fact remains that the formula needs to be known by everybody in order for them to derive the address, which means anybody can ‘squat’ an address they anticipate will be needed by another application or user in the future.
Proposal:
Since Mutable Data objects on the network have a type tag as well as an address, we could have a specific type tag dedicated to ‘Indexed Data’, with certain rules for the creation of addresses that are enforced by the network.
The rule would be that the first half of the (32-byte) address MUST be a 16-byte hash of the public name of the user creating the address.
The second half of the address can be anything, but the anticipated use case is that it would be a 16-byte hash produced by a known formula, made public by the owner of the address.
That way, if I have a search index containing the words ‘apple’, ‘banana’ and [‘another word’], I can simply program my application to store the files containing information about ‘apple’, ‘banana’ or [‘another word’] at the address that is [hash of (my) publicName] + [hash of ‘another word’]. Anyone using my application can perform the same hashing operation to find that file, but is obviously unable to store anything at an address that begins with the 16-byte hash of my publicName.
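A minimal sketch of that address construction, again truncating SHA-256 to 16 bytes purely for illustration (the publicName `davids_search` and the choice of hash are my own assumptions, not the network's API):

```rust
use sha2::{Digest, Sha256};

/// First 16 bytes of a hash of the input (illustrative; not the network's own hash).
fn hash16(input: &str) -> [u8; 16] {
    let digest = Sha256::digest(input.as_bytes());
    let mut half = [0u8; 16];
    half.copy_from_slice(&digest.as_slice()[..16]);
    half
}

/// Proposed 'Indexed Data' address: hash of the owner's publicName in the
/// first half, hash of the word (or other formula output) in the second.
/// The network would only let the owner of `public_name` create addresses
/// whose first half matches hash16(public_name).
fn indexed_address(public_name: &str, word: &str) -> [u8; 32] {
    let mut addr = [0u8; 32];
    addr[..16].copy_from_slice(&hash16(public_name));
    addr[16..].copy_from_slice(&hash16(word));
    addr
}

fn main() {
    // A client can recompute this for any word without downloading an index
    // of addresses first.
    println!("'banana' lives at {:02x?}", indexed_address("davids_search", "banana"));
}
```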
It is NOT anticipated that there would be any ‘default’ formulas – custom formulas would always have to be provided by the application.
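For instance, one hypothetical application-supplied formula might normalise the word and add an app/version namespace before hashing, so that ‘Apple’ and ‘apple’ resolve to the same entry and a future version of the index gets a fresh address space:

```rust
/// Hypothetical custom formula published by a search-index application.
fn formula(word: &str) -> String {
    format!("search-index:v1:{}", word.to_lowercase())
}

fn main() {
    // The second half of the address would then be the 16-byte hash of this string.
    println!("{}", formula("Apple")); // search-index:v1:apple
}
```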
Issues:
Given that the network always knows the publicName of whoever is creating an address, I can’t see any obvious technical problem in enforcing the basic rule.
There may be a risk of collisions with 16-byte hashes (I don’t know the maths well enough), which could be a genuine problem, but hopefully there are ways to account for this possibility. The number of publicNames, and the number of files each one might wish to store, are both relatively small compared to the number of addresses on the whole network.
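For what it’s worth, a back-of-envelope estimate using the standard ‘birthday’ approximation, p ≈ n² / (2·2^128) for n entries in a 128-bit space (the entry counts below are illustrative assumptions, not network figures), suggests collisions are vanishingly unlikely at these scales:

```rust
/// Birthday approximation for the chance of any collision among n random
/// 16-byte (128-bit) values: p ≈ n^2 / (2 * 2^128). Valid while p is small.
fn collision_probability(n: f64) -> f64 {
    (n * n) / (2.0 * 2f64.powi(128))
}

fn main() {
    // Illustrative entry counts. Even a trillion files under one publicName
    // gives a negligible chance of two of them sharing the same 16-byte half.
    for n in [1e6_f64, 1e9, 1e12] {
        println!("n = {:e}: p ≈ {:e}", n, collision_probability(n));
    }
}
```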
There may well be other dealbreakers that the devs or others know about, but I just thought I’d put this out there as a proposal and see what anyone thinks.
If I’ve said anything false about the network let me know and I will edit this post, or delete it altogether as appropriate.