Proposal: ‘Indexed Data’ Type Tag for Mutable Data

Yes, somewhere here I wondered whether having an option to know if a hash exists or not could be very powerful… without those checks necessarily being bound to an endpoint, though potentially that could be useful too…

Still, there are two different cases: what can be automated by the network at a basic level, and what requires knowledge… in this case, knowledge of what noun phrases are.

There’s potential to create so much dust it becomes glue! So it needs thinking through… and then challenging what the added value is on the network, rather than in an application, if an application is necessarily required above a certain level anyway.

2 Likes

512bit xor address space anyone?

3 Likes

Yes, look at rainbow-table attacks on hashed passwords to see how bad that can be.

1 Like

I guess that would double the size of anything that needs to point to a full address though, such as a file system.

For me the emphasis (in my current thinking about the index) is that it’s for the data you store. Your data on the network for that key would be referenced in that index (all blobs, all sequences etc.), unless you opt out on upload. That’s where my thinking was, anyway.

I’m trying to think generally about whether a deterministic address for a given key is useful beyond the stored-data-index idea. There’s nothing stopping us opening things up there. It could be an index of indices, e.g.… what would that look like? What other use cases might there be?

Ah, so here I was meaning that we basically get that from your idea. We just give the whole address over to the PK hash (no extra word inserted) and reserve that type tag at the network level. That’s how we’d do it for the index, at least.

So then, how could that index be more useful? I.e., you could be sure that safe://<davidbeinn's PK hash>/data-index has all your data. But perhaps it’d be better to have safe://<davidbeinn's PK hash>/index/data and safe://<davidbeinn's PK hash>/index/search-keywords, or what have you, existing together in our PK index.
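As a sketch only (the tag value and hash choice below are placeholders I’ve made up, not network spec), the reserved-address idea boils down to:

```rust
use sha3::{Digest, Sha3_256};

/// Hypothetical reserved type tag for the PK index; placeholder value.
const PK_INDEX_TAG: u64 = 1_100;

/// A 32-byte XorName-style address computed straight from the PK bytes,
/// so any client can locate the index without any prior lookup.
fn index_address(public_key: &[u8]) -> [u8; 32] {
    let digest = Sha3_256::digest(public_key);
    let mut name = [0u8; 32];
    name.copy_from_slice(digest.as_slice());
    name
}
```

The network would then refuse PUTs at (index_address(pk), PK_INDEX_TAG) unless they are signed by the matching PK.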

:+1:

I think if that’s needed in general, then we’d still have the same collision issue, just at a different scale (or would we?). For this, if we allow one type tag to be tied to the PK being used, it may as well take up the whole xorname. But then maybe what’s stored there can be made more flexible :man_shrugging:?

1 Like

I thought there was a post somewhere on the forum about 512-bit hashes being recommended for content-addressable systems anyway (the birthday problem and such things…). For just signing, 256 bits would be enough.

Edit: hmmm… Why are 256 bits "enough"?
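For a rough sense of the numbers, the standard birthday-bound approximation (the item count below is made up purely for illustration):

```rust
/// P(collision) ≈ 1 - exp(-n² / 2^(b+1)) for n random b-bit names.
fn collision_probability(bits: i32, items: f64) -> f64 {
    1.0 - (-(items * items) / 2f64.powi(bits + 1)).exp()
}

fn main() {
    // Even with 1e24 stored items, 256-bit content addresses look safe:
    println!("{:e}", collision_probability(256, 1e24)); // ≈ 4e-30
    // 512 bits is so far beyond that it rounds to zero in f64:
    println!("{:e}", collision_probability(512, 1e24)); // 0
}
```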

1 Like

Again, this sounds great. It makes me feel as if, in even worrying about this issue, I’m perhaps missing something about the structure and flexibility of addresses on the network. I.e., am I just proposing a solution for a non-problem?

For example, using the addresses you suggest:

safe://<davidbeinn's PK hash>/index/data and safe://<davidbeinn's PK hash>/index/search-keywords

can we go straight to a piece of data, or at least get there with one small extra lookup? What happens after that? E.g. what about:

safe://<davidbeinn's PK hash>/index/search-keywords/apple

My concern is that those addresses look like they might require an NRS map to resolve, which, if we had e.g. millions of /search-keywords/ entries, might be prohibitively large. Am I missing something there?

Until you’ve tried a proof of concept, you might be missing how much work this is for the network to churn through, and how many updates to single files it involves… the cost of that makes no sense to me, at least atm.

You might also be missing the reality of natural data: doing 80% is literally trivial, but getting a useful solution is a lot harder, a lot more work.

Also, I still cannot make sense of what you’re proposing for finding information between sites… or whether the intent is simply per-site searching. My question would be: why would you be looking at a particular site, only to then look at its index? Normally a user has a query and wants to find unknown sites that contain that query, and closely related phrases. The path from query to answer for the user is not clear here.

If you try a proof of concept on unSAFE websites, it’s one line of code to do search as a compound of lists… but the time it takes to attach meaningful data to search phrases is not trivial, and that seems a problem not obviously addressed above by anything the network could do. The reference set needed to create a meaningful dataset for those lists is huge and fragmented.

I’d suggest a better approach is to develop something as a proof of concept outside the network, evidence that it can work, and then apply it to the network if possible… rather than force the network into doing a huge amount of work for no result.

1 Like

I think you’re maybe missing the point of what I’m saying here.

I’m not really talking about search as such in this thread.

I’m talking about getting to large numbers of addresses in an efficient manner, which could be useful for lots of purposes.

The hash table obviously does the technical side of that, but in order to make use of it we need a 32-byte address, which on its own is meaningless. We somehow need to store the meaning of an address, i.e. david's_file = {123…xyz}.

It’s fine to do that as a list or a tree for smaller numbers of addresses, but for larger numbers that list of pointers could get very large, and therefore unsuitable for a web app.

At least, that’s the problem as I see it.
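To make the scale issue concrete, the naive “meaning” table is essentially this (a rough sketch with illustrative names and sizes):

```rust
use std::collections::BTreeMap;

type XorName = [u8; 32];

// Human-readable name -> 32-byte network address. Every lookup means
// fetching and deserialising the whole map first.
fn main() {
    let mut pointers: BTreeMap<String, XorName> = BTreeMap::new();
    pointers.insert("david's_file".into(), [0u8; 32]); // {123…xyz} stand-in

    // ~18 bytes of name + 32 bytes of address per entry means a million
    // entries is on the order of 50 MB on the wire.
    let approx_bytes = pointers.len() * (18 + 32);
    println!("{approx_bytes} bytes for {} entries", pointers.len());
}
```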

Edit: I’m not asking the network to churn anything, just to check the ID of the account trying to create a piece of data in the first place. After that, the network checks the ID of any account trying to modify the data anyway.

4 Likes

I don’t know enough about file types and the limits of what is possible, but if there are any few characters that can never be the header of a real file, then those would open up free xorurl space, and it would be possible to have [neverchar].[userUID].[as much of the xorurl space as is left] as one xorurl address?

I guess allowing users access to such a feature could be made useful in all sorts of ways.
My hesitation in using it would be both that it would perhaps necessarily want an endpoint… some file written; and the cost related to making use of it. If there were an alternative option without endpoints, perhaps that would be a useful reference dataset in some cases.

I wonder whether the cost of this dust-like action, both in $$ and in work for the network, needs some clearer use case to persuade us that it makes sense?

1 Like

To clarify my thinking: safe://<davidbeinn's PK hash> is a deterministic XorUrl (derived from the PK, at a type tag requiring that PK to PUT data there) pointing to a piece of MutableData.

Prior to this, that MutableData was to be just a list of all the data the PK has written (unless you specifically opt out of indexing): a record on the network of a user’s data.

But we could put anything in that MutableData.

So safe://<davidbeinn's PK hash>/whatever is an NRS-like example. But essentially we’d get the key whatever from the MD, and its value could be… whatever. In the case you’re looking at, it could be another MD which is used for your search indexing. So: one more lookup.

That would allow you to deterministically find any data for a PK that we wanted to make public.
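As a sketch of that resolution flow (the Network trait here is a made-up stand-in, not the real safe_client_libs API):

```rust
use std::collections::HashMap;

type XorName = [u8; 32];

/// Made-up minimal client surface, just to show the flow.
trait Network {
    fn get_mdata(&self, addr: XorName) -> Option<HashMap<String, XorName>>;
    fn get_data(&self, addr: XorName) -> Option<Vec<u8>>;
}

/// Two lookups total: the reserved MD at hash(PK), then the target itself.
fn resolve(net: &impl Network, pk_index_addr: XorName, key: &str) -> Option<Vec<u8>> {
    let index = net.get_mdata(pk_index_addr)?; // lookup 1: the PK's reserved MD
    let target = index.get(key)?;              // e.g. key = "whatever"
    net.get_data(*target)                      // lookup 2: whatever it points to
}
```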

Is that useful? Re-reading the OP, I’m not so sure it is.

With the above, you’d be able to calculate the location map for a given PK. But you’d still need to look it up.

Here I think just a SafeId with all your data/indexes, at David or David1231, is fine, without needing something PK-specific. It seems like you’re trying to get around the limited namespace for IDs as NRS names. But your safeid could just be at a random xor address; it doesn’t have to be an NRS name. It’s less pretty, yes, but there’s no real difference between knowing a xorname and knowing a PK (which is, in the end, the identifier of an account).


All that said, if there is a useful case for a reserved PK-deterministic address beyond just indexing the data put by that same PK, it may well be possible to do.

2 Likes

But you are still just getting the contents of one file for a network name. I don’t get it; creating an index on the DHT seems very expensive, since you need a DHT entry for each entry in your index (and each one would be just a reference to another file). Wouldn’t it be better to use some sort of “embedded” index files that contain multiple index entries in one file, splitting out for bigger indices (in b-tree fashion)?


edit:

Or rather, something like a hash array mapped trie (I got that from the im crate). Then, when updating the index, you would do pretty much the same thing you would do with an im hashmap, and atomically update it (using the data version; I don’t know what they’re calling it in the SAFE network nowadays).
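Roughly what that looks like with the im crate; the compare-and-swap on the data version is my assumption of how the network’s versioning would be used, and the names are illustrative:

```rust
use im::HashMap; // persistent HAMT from the `im` crate

type XorName = [u8; 32];

/// Adding one entry to a HAMT-backed index. Structural sharing means only
/// the trie nodes on the changed path are new; untouched subtries are
/// shared, so an on-network version would only re-upload a handful of
/// nodes plus an atomic bump of the data version.
fn add_entry(
    index: HashMap<String, XorName>,
    key: &str,
    target: XorName,
) -> HashMap<String, XorName> {
    index.update(key.to_string(), target)
}
```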

1 Like

Thanks Josh.

Yeah, I think we’re getting to the nub of what I see as the problem now anyway.

I’m aware that I can create my own index structure at a deterministic address, but again, my problem is that for any large set of pointers, even one created with efficiency in mind, it’s going to be a pretty big file to download every time we do a lookup; or else, as @urrtag suggests, it means arranging different index files in a tree structure, which requires sequential lookups, the time for which could quickly add up.

I think it’s worth seeing the structure I’m suggesting not as an index in itself, but as data that is by its very nature indexed - by filename.

Obviously, unlike in a traditional file structure, you can’t change a file’s name or move it around. However, you can of course mutate the file, because it’s Mutable Data that we’re talking about.

I see where you’re coming from here. I think one thing you (and maybe others) are misunderstanding from my original suggestion is in thinking that the file we’re trying to get to is just a pointer to another file. It is the file. We have just made it so we can find it again, by creating it at an address that is determined by its name (or some other formula).
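In code terms the derivation might look something like the following; the SHA3 choice and the concatenation scheme are my illustration, not a settled design:

```rust
use sha3::{Digest, Sha3_256};

type XorName = [u8; 32];

/// The file's network address is computed from (owner PK, chosen name),
/// so anyone who knows both can GET the file directly, with no pointer
/// list or tree walk in between.
fn named_address(owner_pk: &[u8], file_name: &str) -> XorName {
    let mut hasher = Sha3_256::new();
    hasher.update(owner_pk);
    hasher.update(Sha3_256::digest(file_name.as_bytes()));
    let mut name = [0u8; 32];
    name.copy_from_slice(hasher.finalize().as_slice());
    name
}
```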

For example, maybe I have a million different entities in a database, and there is no way (or else it is not desirable) to categorise them in a tree structure based on a single metric. It makes sense that each entity has its own file.

For simplicity, let’s just store them in a list based on the entity’s name:

```
entity1 = {123…xyz}
entity2 = {456…abc}
```

…and so on.

At 50 bytes per entry, that’s 50 megabytes, which is too much for a web app to download every time someone wants to query it.

As you mention, we could split that into files and put them in a tree structure based on entity name. That’s certainly an option I’ve considered, but if we’re making a few queries, each of which needs a few sequential lookups to navigate the tree, then the time adds up, especially on slower connections.

The other option is to keep your database on your own machine and essentially set up a server, and we’re back to the old internet model.

Going back to what @joshuef was saying about useful cases, I find it hard to see decentralisation being maintained for large database usage without some kind of system like this.

I feel like I’ve thrown things off a little bit by relating it to a web-search index, but that use case alone would justify this sort of feature in the network. If web-search can’t be decentralised and reasonably performant, then we’re in trouble.

Just seen your edit @urrtag, will check that out when I get a minute.

5 Likes

Hmmm, wouldn’t that reduce the usability of such an index by a huge amount? Because you’re now limited to putting that file in only one index (the “filename”). (Also: the caching story for everything besides immutable data isn’t that great, and deduplication wouldn’t work either.)

Aren’t clients going to cache data fetched from the network? (At least for immutable data, because ImmutableData is really easy to cache.)


But yeah, some sort of “namespacing” would be nice to have. And as david already said:

use:

```
fileName    = hash(nameSpacedName)
networkName = hash(publicKey + fileName)
```

then push the PUT to networkName, include fileName in that PUT, and let the vaults validate the namespace by calculating the networkName themselves.
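A vault-side check for that could be as simple as this sketch (hash choice and message layout assumed):

```rust
use sha3::{Digest, Sha3_256};

type XorName = [u8; 32];

fn hash_concat(a: &[u8], b: &[u8]) -> XorName {
    let mut hasher = Sha3_256::new();
    hasher.update(a);
    hasher.update(b);
    let mut out = [0u8; 32];
    out.copy_from_slice(hasher.finalize().as_slice());
    out
}

/// The vault recomputes networkName from the PUT's public key and the
/// claimed fileName, and rejects the PUT on any mismatch.
fn validate_namespaced_put(
    put_public_key: &[u8],
    claimed_file_name: &XorName,
    target_network_name: &XorName,
) -> bool {
    hash_concat(put_public_key, claimed_file_name) == *target_network_name
}
```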

I’m not sure this could even work, though; we would need to somehow prevent ordinary PUTs at that network name (someone could put non-namespaced data there), pretty much separating namespaced and non-namespaced data. Adding a prefix and only allowing namespaced PUTs on that prefix wouldn’t work, as it would interfere with routing (the network names would no longer be evenly distributed).

1 Like

A burn of time for a problem not even thought through. :confused:

Again, I see where you’re coming from, and I think your objections are valid, but I don’t think any of them are necessarily deal breakers. It’s quite a specific use case we’d be looking at, but I think a valuable one, and it need not impinge on any other functions of the network.

I think caching can be useful of course, but it can’t be the answer for all applications and use cases.

The last point you make is exactly what I was trying to explore in my original formulation. I don’t know enough about routing to know how much of a problem that would be, but I doubt it would be insurmountable in itself.

2 Likes

I think you’ve identified a very useful feature @david-beinn, and that it is something that requires a new data type.

The operation needs distilling into a specification, and it would help to have a couple of example use cases to demonstrate how it works and assess the benefits compared to making do with what we have atm.

It’s not yet clear to me exactly what you are proposing, but it does feel like a useful advance that could open up a lot of applications, because indexing is so useful. It’s pretty fundamental to much of what we do with computers and the internet.

1 Like

At this point I’m confused. I thought the whole point was to integrate search indexes into the network. It seems like these would need to be public, not private?
The data type would require a PoW to update the index, and perhaps to crawl the index and verify… It’s deterministic based on the hash of the search terms… yes?

1 Like

Thanks Mark! Just in the middle of something now, but I’ll keep trying!

1 Like

It sounds like you only want to receive part of your mutable data index back, instead of the whole thing? So from a file with a million entries, you want only apple?

If that’s the case, MutableData currently has this capability (you can get an entry by key).
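Sketching the shape of that per-key fetch (this trait is hypothetical; the real safe_client_libs call will differ in detail):

```rust
type XorName = [u8; 32];

/// Hypothetical client surface; the point is only that MutableData can
/// return a single entry without shipping the whole map.
trait MutableDataClient {
    /// Fetch the value stored under `key` (e.g. b"apple") in the MD at
    /// (`name`, `type_tag`), leaving the other entries on the network.
    fn get_mdata_value(&self, name: XorName, type_tag: u64, key: &[u8]) -> Option<Vec<u8>>;
}
```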


This is possible, I think, with data at a specific type tag/address being reserved for a PK which hashes to that address. It would need to be implemented, but as I say above, this is my current thinking for data-storage indexing.

When I’m asking about other use cases, search could be one (if deterministic data by PK is required… I’m not sure it is, though, @david-beinn).

2 Likes