Proposal: ‘Indexed Data’ Type Tag for Mutable Data

Problem:

On the legacy internet we are used to being able to ask a server to find files for us: the server holds a file system, or some other kind of data structure, and uses it to locate the file on our behalf. On the Safe Network we do not have this facility in the same way. As a client we must first get the file system or other data structure as a file (or files) from a vault, use this ‘location map’ and the computing power of our own machine to look up the address of the file we are ultimately seeking, and then finally fetch that file from the network.

For small data structures this is fine, as it is for cases where the ‘location map’ can be stored natively, but for larger web-based structures the initial file download can quickly grow to prohibitively large sizes, bearing in mind that each entry needs at minimum the file name and its 32 byte address on the network. If we optimistically estimate 50 bytes per entry, a ‘location map’ of 1,000 entries runs to 50 KB. Again, this is fine for some applications, and there may be some optimisations we can make, but for applications needing multiple sequential lookups over very large file sets it is unlikely to match general performance expectations.

As far as I understand the CRDT Tree Data Type is designed to address this problem for the file system use case by at least optimising the structure, and, in particular, making it possible to store an initial download of the file system or ‘location map’ natively, and just update that from a network copy as necessary. However, again there are likely to be use cases where we can’t expect clients to hold a native copy of the map, and also instances where updates will be frequent and onerous to process.

As anyone who has read my previous posts will know, the application that made me think of this is a web-based search index, which requires millions or billions of words and phrases each held in a separate file. Such an index would also be very regularly updated, and a truly decentralised model would require multiple indexes, which a client should be able to access in a timely manner.

Another example might be the NRS map for a site with tens or hundreds of thousands of pages.

Solutions:

As already mentioned, I believe Maidsafe are currently working on optimising the model for a filesystem, and making it so that it can be easily and flexibly held natively.

What I am proposing here is that for some use cases it could be possible to bypass having to download a ‘location map’ at all.

The proposal uses a ‘hashing formula’, so it wouldn’t be appropriate for pieces of immutable data or ‘blobs’, which are always held at the address that is the hash of their contents. However, there may well be ways it could be used to speed up looking up immutable data too.

The proposal is based on the fact that we can ‘choose’ the address of a piece of mutable data, regardless of its content (this follows naturally because the content can change.)

For example, we could choose to store details of my ID on the network at the address that is the 32 byte hash of ‘David’. Everyone would know they could access my details just by hashing ‘David’, which is a pretty fast operation for a computer to do.

The downside of this is that if somebody else called David gets there first, I will never be able to store my details at that address, so we are back to the problem that we must store a pointer to the address where my details are.

Although for most use cases the formula would be much less generic than ‘hash of [name]’, the fact is that the formula needs to be known by everybody in order for them to access the address, which means anybody can ‘squat’ an address they anticipate will be needed by another application or user in the future.

Proposal

Since Mutable Data objects on the network have a type tag as well as an address, we could dedicate a specific type tag to ‘Indexed Data’, with certain rules for the creation of addresses that are enforced by the network.

The rule would be that the first half of the (32 byte) address MUST be a 16 byte hash of the public name of the user who is creating the address.

The second half of the address can be anything, but the anticipated use case would be that it would be a 16 byte hash of a known formula, made public by the owner of the address.

That way, if I have a search index containing the words ‘apple’, ‘banana’ and any other word, I can just program my application to store the file containing information about each word at the address that is [16 byte hash of my publicName] + [16 byte hash of that word]. Anyone using my application can perform the same hashing operation to find that file, but is obviously unable to store anything at an address that begins with the 16 byte hash of my publicName.
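For anyone who prefers to read it as code, here is a minimal sketch of the derivation, using SHA-256 truncated to 16 bytes purely as a stand-in for whatever hash the network actually uses (the names here are illustrative, not real API):

```python
import hashlib

def hash16(text: str) -> bytes:
    """Illustrative 16 byte hash: SHA-256 truncated to its first 16 bytes."""
    return hashlib.sha256(text.encode("utf-8")).digest()[:16]

def indexed_data_address(public_name: str, keyword: str) -> bytes:
    """32 byte address under the proposed 'Indexed Data' type tag.

    First half:  hash of the owner's publicName (the part the network enforces).
    Second half: hash of a formula made public by the owner, here just a keyword.
    """
    return hash16(public_name) + hash16(keyword)

# Anyone who knows the publicName and the formula can recompute the address,
# but only the owner of that publicName could have created data there.
print(indexed_data_address("davidbeinn", "apple").hex())
```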

It is NOT anticipated that there would be any ‘default’ formulas – custom formulas would always have to be provided by the application.

Issues

Given that the network always knows the publicName of someone creating an address, I can’t necessarily see any technical problem in enforcing the basic rule.

There may be a risk of collisions in 16 byte hashes (I don’t know the maths well enough), which could be a genuine problem, but hopefully there should be ways to account for this possibility. The number of publicNames and the number of files each one might wish to store are both relatively small compared to the number of addresses on the whole network.

There may well be other dealbreakers that the devs or others know about, but I just thought I’d put this out there as a proposal and see what anyone thinks.

If I’ve said anything false about the network let me know and I will edit this post, or delete it altogether as appropriate.

9 Likes

There would be a good deal more collisions possible. A 32 byte hash gives 2^256 addresses; 2^255 is 50% of that address space, 2^254 is another 50% loss, and so on. So a 16 byte hash would be a 50% reduction in collision avoidance, compounded 128 times.

A trick we do with some things like this is hash the public name + [other word]. That still resolves and loses no collision avoidance. So you still tell somebody your name + other word and they will get to the right place.
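A rough sketch of that, with SHA-256 standing in for the network's actual hash:

```python
import hashlib

def combined_address(public_name: str, word: str) -> bytes:
    """Hash the public name and the word together into one full 32 byte
    address, so the whole 256 bit space is kept for collision avoidance."""
    return hashlib.sha256((public_name + word).encode("utf-8")).digest()

print(combined_address("davidbeinn", "apple").hex())
```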

Hope that helps.

4 Likes

Yeah, I certainly understand it’s a much higher chance of collisions. My question really is whether it could be good enough for this purpose. 2^128 is still a pretty big number!
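As a rough back-of-envelope (the standard birthday approximation, accidental collisions only, ignoring anyone deliberately crafting collisions):

```python
# Birthday approximation for accidental collisions in a 128 bit hash:
# p ≈ n^2 / 2^129 for n items.
def collision_probability(n: float, bits: int = 128) -> float:
    return n * n / 2 ** (bits + 1)

for n in (1e6, 1e9, 1e12):
    print(f"{n:.0e} items -> p ≈ {collision_probability(n):.1e}")
# Even a trillion addresses gives a probability on the order of 1e-15.
```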

This was my initial thought, but unless the hash is reversible we lose the ability to reserve a set of addresses for a particular account (obviously I’m only talking about policing things on one type tag, not impinging at all on the general freedom of the network).

I’d think it’s relatively easy to cater for collisions in the second half of the address (since they all belong to the same account/app.)

If 2^128 is not enough to avoid collisions by hashing, and there is no satisfactory way to cater for collisions, an alternative might be to assign each account a unique 128 bit number. The problem then would be that the network would have to keep a record of which number belongs to which account, but maybe that’s doable, I don’t know.

I just feel like something along these lines would make the network potentially so much faster in practical use. ‘Location maps’ (I don’t know what else to call them as a generic term) seem like they could be a real bottleneck.

3 Likes

@david-beinn you’re actually touching on something I’ve been thinking about for data indices for users, ie. indexes of all data you store on the network.

If we don’t want to force you to go via some other application which manages this, we’d need deterministic addresses for this, and I too was imagining a hash of PublicKey at a type tag, essentially.


How/what would an ‘unsquattable’ address look like for a given PK (ie. the hash(PK) rule applied), but then with additional data…? From what @dirvine is saying, it sounds like we’d have to expand the address space for that…

So maybe it’s worthwhile thinking whether there are other ways to achieve it. (If we have an unsquattable index available and retrievable for any PK… perhaps we just need an index entry in that… it’s only one lookup more, eg.) But then you’re on to extra APIs for working with that index (which may need to exist anyway…), what would those look like?

3 Likes

Single words are not noun phrases…:thinking:

It’s noun phrases that are the key to recognition and search. For that, noun phrases need to be known and available: word phrases that have meaning beyond what is random. I wonder if this is not raw but application level.

1 Like

Good! I always wonder whether I’m losing my mind with ideas like this! Perhaps it would be more appropriate to italicise ‘you’ rather than ‘store’ in that sentence? Not trying to be picky, just checking I understand you correctly.

I was actually going to post again on this thread anyway, because I had a cursory look into the numbers myself.

Certainly it seems 128 bit hashes aren’t regarded as useful for very much these days, both because of the potential for accidental collisions and the risk of being broken maliciously. One thing I hadn’t realised about collisions is that their likelihood can also depend on the ‘diversity’ of the data being put into them. In terms of malicious intent, it appears very easy to deliberately generate the same hash from a different input, which would obviously allow squatting.

This all sounds brilliant, and possibly more ambitious than my original intention. I guess I wasn’t sure of the potential for the network to have extended or ‘parallel’ address spaces. By ‘with additional data’ do you just mean the ‘apples’ or ‘bananas’ of my example, or were you thinking of some way of trying to include metadata?

In reply to the question “How/what would an ‘unsquattable’ address look like for a given PK?” I did have a very simple, albeit slightly clumsy idea based on my original notion of splitting the 256 bit address into two 128 bit hashes, one for PK and one for [search term] (or any other string.)

Basically you would ‘bagsy’ a hash by storing a copy of your public key at a ‘certificate address’, say, {128 bit hash of PK} + {128 bit 000…}. Every time you tried to store a piece of data at an address beginning with the 128 bit hash of PK, the network would check to see that the PK trying to create the address matched the owner of the certificate address, AND the PK stored at that address.

For accidental collisions where two PKs turned up the same hash, the second one would obviously fail to create a certificate. In that case you could give them a salt to add to the PK for hashing, and store that along with the PK at the new ‘certificate address.’
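A toy sketch of the check, with SHA-256 truncated to 16 bytes standing in for the real hash and a plain dict standing in for the network’s records on this type tag (none of this is real API, just the shape of the rule):

```python
import hashlib

def hash16(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()[:16]

certificates: dict[bytes, bytes] = {}   # certificate address -> registered PK

def certificate_address(pk: bytes, salt: bytes = b"") -> bytes:
    """{128 bit hash of PK (+ optional salt)} + {128 zero bits}."""
    return hash16(pk + salt) + bytes(16)

def register(pk: bytes, salt: bytes = b"") -> bool:
    """'Bagsy' the prefix; fails if a different PK already hashed to it."""
    addr = certificate_address(pk, salt)
    if addr in certificates and certificates[addr] != pk:
        return False           # accidental collision: caller retries with a salt
    certificates[addr] = pk
    return True

def may_store(pk: bytes, target_address: bytes) -> bool:
    """The network only allows creation where the requesting PK owns the prefix."""
    cert = certificates.get(target_address[:16] + bytes(16))
    return cert == pk
```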

Seems so simple there must be an obvious way to attack it, but I haven’t thought of it yet.

1 Like

I’m just using single words as an example. It could be any kind of data that we are hashing to create the address. Words, noun phrases, filenames.

As I see it the point is more about having a fast way of looking things up, not so much what those things are.

Yes, somewhere here I wondered whether having an option to know if a hash exists or not could be very powerful… without those necessarily being bound to an endpoint, though potentially that could be useful too…

Still, there are two different cases: what can be automated by the network at a basic level, and that which requires knowledge… in this case, of what noun phrases are.

There’s potential to create so much dust it becomes glue! So, it needs thinking through… and then challenging what the added value is on the network, rather than in an application, if an application is necessarily required above a certain level.

2 Likes

512bit xor address space anyone?

3 Likes

Yes, check out rainbow tables/attacks on hashed passwords to see how bad that can be.

1 Like

Guess that would double the size of any system that needs to point to the full address though, such as a file system.

For me the emphasis (in my current thinking about the index) is that it’s for the data you store. Your data on the network for that key (unless you opt out on upload) would be referenced in that index (all blobs, all sequences etc). That’s where my thinking was anyway.

I’m trying to think generally about whether a deterministic address for a given key is useful beyond the stored-data-index idea. There’s nothing stopping us opening up things there. It could be an index of indices, eg… what would that look like? What other use cases might there be?

Ah, so here I was meaning that we basically have that from your idea. But just give the whole address over to the PK hash (no extra word inserted), and reserve that tag at the network. That’s how we’d do it for the index at least.

So then, how could that index be more useful… ie, you could be sure that safe://<davidbeinn's PK hash>/data-index has all your data. But perhaps it’d be better to have safe://<davidbeinn's PK hash>/index/data and safe://<davidbeinn's PK hash>/index/search-keywords or what have you existing together in our PK-index.

:+1:

I think if that’s needed in general, then we’d still have the same collision issue, just on a different scale (or?). I think for this, if we allowed one tag to be tied to the PK being used, it may as well take up the whole xorname. But then, maybe what’s stored there can be made to be more flexible :man_shrugging:?

1 Like

I thought there was a post somewhere in the forum about 512 bit hashes being recommended for content-addressable systems anyway (birthday problem and such things…). For just signing, 256 bit would be enough.

edit: hmmm… Why are 256 bits "enough"?

1 Like

Again, this sounds great. It makes me feel as if in even worrying about this issue I’m perhaps missing something as regards structure and flexibility of addresses on the network - ie. am I just proposing a solution for a non-problem?

For example, using the addresses you suggest:

safe://<davidbeinn's PK hash>/index/data and safe://<davidbeinn's PK hash>/index/search-keywords

can we go straight to a piece of data, or at least with one small extra lookup? What happens after that? eg. what about:

safe://<davidbeinn's PK hash>/index/search-keywords/apple

My concern is that those addresses look like they might require an NRS map to resolve, which, if we had eg. millions of /search-keywords/ entries, might be prohibitively large. Am I missing something there?

Until you’ve tried a proof of concept, you might be missing how much work this is for the network to churn, the number of updates to single files… and the cost of that, which makes no sense to me, at least atm.

You might also be missing the reality of natural data, doing 80% is literally trivial but getting a useful solution is a lot harder - a lot more work.

Also, I still cannot make sense of what you’re proposing for finding information between sites… or if the intent is simply for per-site searching. My question would be why would you be looking at a particular site, to then look at its index? Normally, a user has a query and wants to find unknown sites that contain that query - and closely related phrases. The path from query to answer for the user, is not clear here.

If you try a proof of concept on the unSAFE websites, it’s one line of code to do search as a compound of lists… but the time it takes to get meaningful data to search phrases is not trivial, and that seems a problem not obviously addressed above, with anything the network could do. The reference set to create a meaningful dataset for lists is huge and fragmented.

I’d suggest a better approach is to develop something as a proof of concept outside of the network, evidence that it can work, and then apply it to the network if possible… rather than force the network into doing a huge amount of work for no result.

1 Like

I think you’re maybe missing the point of what I’m saying here.

I’m not really talking about search as such in this thread.

I’m talking about getting to large numbers of addresses in an efficient manner, which could be useful for lots of purposes.

The hash table obviously does the technical side of that, but in order to make use of it we need a 32 byte address, which on its own is meaningless. We need somehow to store the meaning of an address. ie. david’s_file = {123…xyz}.

It’s fine to do that as a list or a tree for smaller amounts of addresses, but for larger amounts, that list of pointers could get very large, and therefore unsuitable for a web app.

At least, that’s the problem as I see it.

Edit: I’m not asking the network to churn anything, just to check the ID of the account trying to initially create a piece of Data. After that the network checks the ID of an account trying to modify the data anyway.

4 Likes

I don’t know enough about file types and the limits of what is not possible, but if there are any few characters that can never be the header of a real file, then those would open up free xorurl space, and it would be possible to have [neverchar].[userUID].[as much of the xorurl space as is left] as one xorurl address?

I guess allowing users access to such a feature could be made useful in all sorts of ways.
My hesitation in using it would be both that perhaps it would necessarily want an endpoint… some file written; and the cost related to making use of it. If there was an alt option without endpoints, perhaps that would be a useful reference dataset in some cases.

I wonder if it’ll be the cost of this dust-like action, both in $$ and in work for the network, which needs some clearer use case to persuade that it makes sense??

1 Like

To clarify my thinking, safe://<davidbeinn's PK hash> is a deterministic XorUrl (derived from the PK, at a type tag requiring that PK to PUT data there) pointing to a piece of MutableData.

Prior to this, that MutableData was to be just a list of all data the PK has written (without specifically opting out of indexing), as a record on the network for users’ data.

But we could put anything in that MutableData.

So safe://<davidbeinn's PK hash>/whatever is an NRS-like example. But essentially, we’d get the key whatever from the MD, and its value could be… whatever. In the case you’re looking for, it could be another MD which is used for your search indexing. So one more lookup.

That would allow you to deterministically find any data for a PK that we wanted to make public.
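In rough pseudo-Python (none of these function names are real API, just the shape of the lookups I’m describing):

```python
import hashlib

def pk_index_address(pk: bytes) -> bytes:
    """Deterministic xorname derived from the PK (illustrative derivation only)."""
    return hashlib.sha256(pk).digest()

def resolve(pk: bytes, key: str, fetch_md):
    """Two lookups: the PK's reserved index MD, then whatever its entry points at.

    `fetch_md` stands in for a network GET of a MutableData by xorname.
    """
    index_md = fetch_md(pk_index_address(pk))   # lookup 1: the PK index
    target = index_md[key]                      # e.g. "search-keywords"
    return fetch_md(target)                     # lookup 2: the data itself
```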

Is that useful? Re-reading the OP, I’m not suuure it is.

With the above, you’d be able to calculate the location map for a given PK. But you’d still need to look it up.

Here I think just a SafeId with all your data/indexes, at David or DAvid1231, is fine, without needing something PK specific. Seems like you’re trying to get around the limited namespace for Ids as NRS names. But your safeid could just be at a random xor. It doesn’t have to be an NRS name. It’s less pretty, yes. But there’s no real difference between knowing a xorname and knowing a PK (which is in the end the identifier of an account).


All that said, if there is a useful case for reserved PK deterministic address, beyond just indexing data put by that same PK, it may well be possible to do so.

2 Likes

But you are still just getting the contents of one file for a network name. I don’t get it: creating an index on the DHT seems very expensive, since you need a DHT entry for each entry in your index (and that one would be just a reference to another file). Wouldn’t it be better to use some sort of “embedded” index files that contain multiple index entries in one file and split out for bigger indices (in b-tree fashion)?


edit:

Rather something like a Hash Array Mapped Trie (got that from the im crate). Then when updating the index you would do pretty much the same thing you would do with an im HashMap and atomically update it (using the data version; I don’t know what they are calling it on the Safe Network nowadays).
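As a toy example of the “embedded” index idea (just splitting on a hash prefix rather than a proper HAMT, with a dict standing in for index files stored on the network):

```python
import hashlib

SHARD_PREFIX_LEN = 2   # hex chars of the key hash used to pick a shard file

def shard_name(key: str) -> str:
    """Pick which embedded index file a key lives in, by hash prefix."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:SHARD_PREFIX_LEN]

shards: dict[str, dict[str, str]] = {}   # shard file -> {key: file address}

def insert(key: str, file_address: str) -> None:
    shards.setdefault(shard_name(key), {})[key] = file_address

def lookup(key: str) -> str | None:
    # One fetch of the right shard file, then a local lookup inside it.
    return shards.get(shard_name(key), {}).get(key)

insert("apple", "123…xyz")
print(lookup("apple"))
```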

1 Like

Thanks Josh.

Yeah, I think we’re getting to the nub of what I see as the problem now anyway.

I’m aware that I can create my own index structure at a deterministic address, but again, my problem is that for any large set of pointers, even if I created that structure with efficiency in mind, it’s going to be a pretty big file to download every time we do a lookup. Otherwise, as @urrtag suggests, it means arranging different index files in a tree structure, which requires sequential lookups, the time for which could quickly add up.

I think it’s worth seeing the structure I’m suggesting not as an index in itself, but as data that is by its very nature indexed - by filename.

Obviously unlike a traditional file structure though you can’t change a file’s name or move it around. However, you can of course mutate that file, because it’s Mutable Data that we’re talking about.

I see where you’re coming from here. I think one thing you (and maybe others) are misunderstanding from my original suggestion is in thinking that the file we’re trying to get to is just a pointer to another file. It is the file. We have just made it so we can find it again by creating it at an address that is determined by its name (or some other formula).

For example, maybe I have a million different entities in a database, and there is no way (or else it is not desirable) to categorise them in a tree structure based on a single metric. It makes sense that each entity has its own file.

For simplicity, let’s just store them in a list based on the entity’s name:

entity1 = {123…xyz}
entity2 = {456…abc} and so on.

At 50 bytes per entry that is 50 megabytes, which for a web app is too much to download every time someone wants to query it.

As you mention, we could split that into files and put them in a tree structure based on entity name. That’s certainly an option I’ve considered, but if we’re making a few queries, each of which needs a few sequential lookups to navigate the tree, then the time adds up, especially on slower connections.
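To put very rough numbers on how that adds up (the round-trip figure and fanout are just assumptions for illustration):

```python
ENTRIES = 1_000_000      # entities in the database
FANOUT = 1_000           # index entries held in each file of a tree layout
ROUND_TRIP_MS = 200      # assumed latency per sequential network fetch

# Sequential fetches needed to walk down a tree of index files of this fanout.
tree_lookups, reachable = 0, 1
while reachable < ENTRIES:
    reachable *= FANOUT
    tree_lookups += 1

print(f"tree of index files  : {tree_lookups} lookups ≈ {tree_lookups * ROUND_TRIP_MS} ms per query")
print(f"deterministic address: 1 lookup  ≈ {ROUND_TRIP_MS} ms per query")
```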

The other option is to keep your database on your own machine and essentially set up a server, and we’re back to the old internet model.

Going back to what @joshuef was saying about useful cases, I find it hard to see decentralisation being maintained for large database usage without some kind of system like this.

I feel like I’ve thrown things off a little bit by relating it to a web-search index, but that use case alone would justify this sort of feature in the network. If web-search can’t be decentralised and reasonably performant, then we’re in trouble.

Just seen your edit @urrtag, will check that out when I get a minute.

5 Likes