Proposal: ‘Indexed Data’ Type Tag for Mutable Data

At this point I’m confused. I thought the whole point was to integrate search indexes into the network. Seems like these would need to be public, not private?
The datatype would require a proof of work (PoW) to update the index, perhaps crawl the index and verify… It’s deterministic based on the hash of the search terms… Yes?

Thanks Mark! Just in the middle of something now, but I’ll keep trying!

It sounds like you only want to receive part of your mutable data index back, instead of the whole thing? So from a file with 50 million entries, you want only the entry for apple?

If that’s the case, MutableData currently has this capability (you can get an entry by key).


This is possible, I think, with data at a specific type tag/address being reserved for a PK which hashes to that address. It would need to be implemented, but as I say above, this is my current thinking for data-storage indexing.

When I’m asking about other use cases, search could be one (if deterministic data by PK is required… I’m not sure it is though, @david-beinn ).

Yes, exactly.

That would be perfect, but when I’ve asked about this in the past I’ve been given the impression that ‘getKey’ is not performed by the vault, but by my machine, which would presumably mean downloading the whole piece of MD. The reason given for this (by @dirvine) was that all MD is encrypted so the vault cannot ‘see’ it or therefore ‘work’ on it???

If I’ve misunderstood something there that would be handy!

As it stands vaults grab and send only the entry for the given key for an MD (or Map as they are becoming known).
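
To make the difference concrete, here is a toy sketch in Python. Nothing in it is the real vault or Safe API: `ToyVault`, `get_entry`, `get_all` and `payload_size` are hypothetical names, and the entry format is invented. It only compares how much data would travel back for a single-key request versus fetching the whole Map.

```python
class ToyVault:
    """Hypothetical stand-in for a vault holding a single Map."""

    def __init__(self, entries):
        self._entries = dict(entries)

    def get_entry(self, key):
        # Only the value stored under `key` is returned to the client.
        return self._entries[key]

    def get_all(self):
        # The alternative: send every entry and let the client filter.
        return dict(self._entries)


def payload_size(payload):
    """Rough byte count of what would travel back to the client."""
    if isinstance(payload, dict):
        return sum(len(k) + len(v) for k, v in payload.items())
    return len(payload)


# Invented entries: 1000 filler keys plus the one we actually want.
entries = {f"word{i}".encode(): f"pointer-{i}".encode() for i in range(1000)}
entries[b"apple"] = b"pointer-to-apple"
vault = ToyVault(entries)

one_entry = vault.get_entry(b"apple")
whole_map = vault.get_all()
print(payload_size(one_entry), "bytes vs", payload_size(whole_map), "bytes")
```

With per-key retrieval the response size depends on the one entry, not on how many entries the Map holds.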

Well, that’s good news!

I’ll think it over, but that might pretty much solve the problem!

We do not have data obfuscation at vaults yet. And when we do… I’m not sure if that’ll be applied to Maps. I guess it would have to be, but I’m really not sure of the knock-on effects there.

@dirvine you have more thoughts on this?

I’d wager we’d want to keep the current functionality in place for maps, as it’s definitely useful.

I agree. Data obfuscation will only be for blobs, however I think all mutable types (eventually) hold no data, only pointers to blobs. In any case atm the mutable types can encrypt contents or not. There may be a way vaults still get that encrypted but no plan right now.

Thanks, this sounds more promising.

Got to run off elsewhere now, but like I say I’ll think over the possibilities and see if that will solve my problem.

I still think a way of bypassing pointers altogether might be useful, and I might at least see if I can find a better way of explaining that, but hopefully a bit of action from the vaults might make all the difference!

Aye, I think you want the MD entry, but then if that’s a xorurl, grab that data.

There’s nothing stopping such things happening, I’d say. It’s just about optimising it and figuring out the patterns for how this could be applied to vaults (e.g. do you need to pay extra for such data handling/parsing in general?).

I don’t understand what this adds, if vaults only have sight of a limited random part of the network. Perhaps, if any data such as pointers is visible, then work could be done against it… but if they are just pointers, without any description of what they point to, then what is the value? And if it is clear what they point to, then there are issues about vaults hosting what their users might not want to support. Surely data obfuscation is the simple answer: everything is encrypted, and that keeps it simple. :confused: … or perhaps this is something still being worked through to make it more apparent.

A couple of quick further questions about how this works in terms of speed/efficiency, though maybe we need to wait and see in practice.

If I remember correctly there is no longer an upper limit on the number of keys in a Data Map?

Which gives the possibility of something like the example above: a 50 MB Data Map of a million keys, split into 1 MB chunks. Presumably the vault that has the ‘Chunk Map’ (sorry, I forget the correct terminology) does the getKey operation.

I suppose we still need to get all those chunks in one place, but I guess that happens in parallel and should be pretty quick??

I was using MD as a generic term to include Sequence and Tree types as well. I’m hoping tree might be useful for similar purposes, but with some added benefits?? I’ll wait and see what the API offers, but in principle it could be used in a similar way??

An old thread, “Translation will need safenetwork”, just got bumped and prompted a thought that the above could perhaps cater for translations: single-word indexes that list the equivalent word in other languages, and perhaps some basic data about the part of speech in different languages.

Edit: and the same for phrases would work too. The key difference is that this would be fairly static, where data on indexes for sites would churn.

Given that they can be queried on the network itself I’m feeling pretty reassured that using Data Maps with the standard API will be fast enough initially as a means of storing pointers to lots of files by name/title, and it’s probably the simplest way as well, so not feeling quite the same necessity that inspired this thread.

However, I still think there might be something in the original idea here, so I’m going to try one last time to see if I can make myself a bit clearer!

I’ll start a new post…

It is understood that IMMUTABLE Data is stored at the address that is the hash of its content. There is no getting round the fact that we have to store pointers to find that data - the address is fixed, and in terms of finding it, appears random.

In the case of MUTABLE data types though, we can choose the address they are stored at. This might just be a randomly generated 32 byte hash, in which case we need to store a pointer to it in exactly the same way as above.
ie. We need to get the file that contains the pointer before we can get the address of the file we ultimately need.
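
The two-step lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `network` dict stands in for the whole network, SHA3-256 is assumed as the content hash simply because it yields 32 bytes, and `get`, `put_immutable` and `get_count` are hypothetical names rather than anything from the Safe API.

```python
import hashlib
import os

network = {}      # address -> stored value; a stand-in for the whole network
get_count = 0     # number of GET round trips made so far


def get(address):
    global get_count
    get_count += 1
    return network[address]


def put_immutable(content: bytes) -> bytes:
    # Immutable data: the address is fixed by the content itself.
    address = hashlib.sha3_256(content).digest()
    network[address] = content
    return address


# Store the file we ultimately want.
file_address = put_immutable(b"contents of davids_file")

# Store a mutable index at a randomly chosen 32-byte address.
# The index maps a name to the file's address, i.e. it holds the pointer.
index_address = os.urandom(32)
network[index_address] = {"davids_file": file_address}

# Reading the file by name costs two round trips: index first, then file.
index = get(index_address)
content = get(index["davids_file"])
print(get_count)
```

Note that the client still has to know `index_address` from somewhere before any of this can start.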

However, we can bypass this initial pointer lookup altogether if we choose to store the file (the one we ultimately require) at an address that is the hash of its name. Instead of the lookup operation, we just hash the name of the file we want, and that automatically gives us the address of that file.

In practice obviously, filenames are likely to be duplicated, so perhaps we might add something else into what we’re hashing, to create a known formula where we can again automatically generate the address of the file we need to lookup -

davids_file is at the address that is (hash of {davids_file + [David’s Public Key]})
davids_other_file is at (hash of {davids_other_file + [David’s Public Key]})
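
As a minimal sketch of that formula in Python: SHA3-256 is assumed only because it produces a 32-byte result, `derived_address` is a hypothetical helper, and the public key below is a made-up placeholder rather than a real key.

```python
import hashlib


def derived_address(filename: str, public_key: bytes) -> bytes:
    # hash of {filename + public key}, as in the formula above
    return hashlib.sha3_256(filename.encode("utf-8") + public_key).digest()


# Placeholder bytes standing in for David's public key.
davids_public_key = bytes.fromhex("aa" * 32)

addr_one = derived_address("davids_file", davids_public_key)
addr_two = derived_address("davids_other_file", davids_public_key)
print(addr_one.hex())
print(addr_two.hex())
```

Anyone who knows the filename and the public key can recompute the same address, with no stored pointer involved. A real scheme would probably want a separator or length prefix between the two inputs so that different (filename, key) pairs cannot concatenate to the same bytes.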

By sharing this formula with others we can share access to the files with them, without ever actually storing the addresses.

As long as there are no duplicate filenames within my account, there should be no duplicate addresses, and if people stick to the rules, there should be no duplicate address problems with other accounts either, as Public Keys are unique.

However, there is nothing to stop someone creating an address using my Public Key:

bobs_file is at the address that is (hash of {davids_file + [David’s Public Key]})

ie. If they anticipate that I will want to have a file called davids_file they can squat that address.
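
The squatting problem can be shown with a toy store that accepts the first write to any address and checks nothing about who is writing. Everything here is hypothetical (`store`, `put`, the placeholder key, SHA3-256 as the hash); it only illustrates the first-come-first-served behaviour being described.

```python
import hashlib


def derived_address(filename: str, public_key: bytes) -> bytes:
    return hashlib.sha3_256(filename.encode("utf-8") + public_key).digest()


store = {}  # address -> (who wrote it, content)


def put(address: bytes, writer: str, content: bytes) -> bool:
    # No ownership check at all: the first writer wins the address.
    if address in store:
        return False
    store[address] = (writer, content)
    return True


# David's public key is public, so Bob can compute the same address.
davids_public_key = bytes.fromhex("aa" * 32)
target = derived_address("davids_file", davids_public_key)

bob_ok = put(target, "bob", b"bobs_file contents")
david_ok = put(target, "david", b"davids_file contents")
print(bob_ok, david_ok, store[target][0])
```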

Therefore we would need to have a way of enforcing a scenario where only I could store things at addresses that used my Public Key as part of the hash (remember we’re only talking about doing this on one type-tag, not for all Mutable Data.)

The trouble is, when my Public Key has been hashed together with a filename it just spits out a random address - the hashes generated using my Public Key have nothing in common with each other, so we cannot reserve a set of addresses for my Public Key.

This is why I suggested viewing the 32 byte address as two 16 byte halves. In theory the network could enforce a rule whereby it checks that the first half of a new address is equal to the hash of the Public Key that is trying to create that address.
The second half of the address would then just be the hash of the filename.

In this case, by concatenating the result of two hash operations we can automatically generate the address of a file from its filename, without having to lookup any pointers.
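
A rough Python sketch of the two-halves idea follows. The assumptions are mine: each half is a SHA3-256 digest truncated to 16 bytes, and `half_hash`, `split_address` and `network_accepts` are hypothetical names. A real network would also need proof that the requester actually holds the key (a signature, say), which this sketch skips.

```python
import hashlib


def half_hash(data: bytes) -> bytes:
    # Assumption: a 16-byte half is a truncated SHA3-256 digest.
    return hashlib.sha3_256(data).digest()[:16]


def split_address(filename: str, public_key: bytes) -> bytes:
    # First half identifies the owner, second half identifies the file.
    return half_hash(public_key) + half_hash(filename.encode("utf-8"))


def network_accepts(address: bytes, requester_public_key: bytes) -> bool:
    # The rule the network would enforce on this one type tag:
    # the first half must be the hash of the creator's public key.
    return len(address) == 32 and address[:16] == half_hash(requester_public_key)


# Placeholder keys, not real ones.
davids_key = bytes.fromhex("aa" * 32)
bobs_key = bytes.fromhex("bb" * 32)

address = split_address("davids_file", davids_key)
print(network_accepts(address, davids_key))  # David creating his own address
print(network_accepts(address, bobs_key))    # Bob trying to squat it
```

All of David's addresses on this type tag share the same first 16 bytes, which is what makes the reservation checkable.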

For the sake of simplicity I’ll leave out the technical problems that might be associated with this for now, such as 16 byte hashes being too small. I don’t think they’re necessarily insurmountable.

I feel like the stumbling block for a lot of people is seeing the benefit of this.

It sounds like a lot of trouble to go to just to skip having to look up any pointers (especially given some of the restrictions it imposes, such as not being able to change file names), but as soon as we get into big numbers of files or lookups, it could make a huge difference. It could be the difference between a usable app and an unusably slow app.

Again, bear in mind that in a server all this stuff is happening within one (super) computer that already has all the data within it, and where both software and hardware have been honed over decades for efficiency. In a distributed network like Safe, we’re going to have to work really hard to make everything as efficient as possible to be anywhere close to the performance we expect from centralised systems.

I just see it as another tool that could be useful to developers (possibly essential in some cases), and one that would not necessarily require huge changes to the network.

Instead of a public key could you not use a private key to generate the address? That would eliminate the squatting problem at least, although there may be usability issues with that approach (lose the key and there’s a danger of losing your data).

Yes now. But, if this RFC is adopted, not in the future.

I was thinking something similar such as having to do some signing operation of your pk with your sk first, and then proceed to hash with your pk. That way it can only be hashed by you and your sk isn’t tied to the address in any way.

The primary motivation behind this is to be able to share a set of addresses without passing round a big set of pointers, so everybody needs to have access to the ‘formula’ that creates the address - it needs to be something public.

If we had homomorphic encryption maybe!

Thanks, not found the bit where it says that as such yet, but that RFC seems a good resource for understanding better how mutable data types work under the hood - at first glance it looks like they’re basically just a bunch of pointers anyway which I suppose would make sense. I’ll have a good read through, but might have to be after I get back from work.