Indexed Structured Data

It would be useful to have a Structured Data type that is automatically indexed and searchable.

EDIT: Here’s a GIST with a bunch of questions and ideas about a possible implementation.

It could be used for extremely useful things like semantic search and graph databases. Semantic hashing fits right into the already existing structure of SAFE (it uses XOR metrics for expressing similarity), and much of what we care about can be expressed as a graph database. @dirvine mentioned both deep learning (which is one of the tools to implement semantic tagging) and graphs here when talking about possible ways for search on the SAFE network.

The mandatory encoding for this type could be a binary variety of JSON. Maybe the already existing BSON format could be extended with a specific data type for 256-bit hashes (or whatever the final block address turns out to be), which would then be automatically indexed.

Index entries would be stored/cached by vaults based on their proximity to the key in XOR space (similar to data blocks), but instead of the contents they would only store the address of the block that contained them.

Searches would be of two types:

  • exact, e.g. for references,
  • inexact, for semantic tags, where all records would be returned within the Hamming ball of the specified radius around the search key.

The potential for abuse (e.g. flooding the network with random tags) would require countermeasures, such as one or more of these:

  • indexed blocks could be more expensive depending on the number of indexed hashes they contained,
  • only the first few hashes would be indexed from a block, period,
  • searches could be restricted for records coming from blocks owned by a given user (e.g. a well-known search provider).

EDIT: for clarity’s sake: “references” would not have to be block addresses (though I suggest the key size based on that, for convenience); the idea is just that their exact value matters, so they could be a hash of anything:

  • a URL
  • an (attribute, block address) tuple, like (spouse, 8ae8223f9…992a3)
  • an (attribute, value) tuple, like (author, Mark Twain)
  • an (attribute, language, value) tuple, like (author, Japanese, マーク・トウェイン)
  • you name it
12 Likes

@lightyear any talk of this internally to your knowledge?? This would bring amazingly more simple functionality to the network in what we’ve all come to know of modern apps, almost all allow search.

4 Likes

Not of this in particular, but you probably want to look through todays Dev Update and the corresponding RFCs we are publishing. I think what we are proposing would allow a structure like this to be built from the App-Side already. I am not certain about this (can’t really follow this in detail right now), so please check it yourself and give us your feedback!

5 Likes

Looking forward to it, but I doubt it can be done on the App side because it needs inexact key lookup. Index entries are tiny and they are of fixed size, so vaults could store a lot of them (over a million in 64MB in the simplest case, half of that if we include ownership and attribute type).

TL;DR:

  • Indexed Structured Data as a new SD type
    • actually structured: content must validate against (e.g.) a binary variety of JSON
    • has a field type for an indexed hash or a (type,value) tuple, where both are hashes
  • vaults extract indexed hashes when a block is uploaded or modified
    • the resulting key => document mappings are stored by other vaults (based on the key)
  • indexes are searchable in two ways:
    • exact search:
      • e.g. for block addresses, but also for any hashed value
      • useful for graph relationships and exact attributes (identifiers, book titles, categories)
    • proximity search:
      • for when XOR distance implies similarity (semantic hashes, numerical values EDIT not good for numerical values, sorry; it seems there’s a place for indexed numerals, after all)
      • useful for general search queries

Maybe I’ll need a TL;DR for the TL;DR…

1 Like

@Tim87 maybe you could take a crack at writing up an RFC on github to make the discussion official?

1 Like

I’m not sure about pinning down the technicalities just yet; I would prefer some input from smart people first. But yea, I’m collecting my thoughts for that.

3 Likes

Modern Apps supply their own resources to implement that search. They don’t free-ride on shared resources. For example, when you do a google search, you are only loading Google’s servers. Google is not scanning the entire internet for you in realtime.

I know many have the dream that all SafeNetwork Apps will be serverless. I have a dream that big data app developers pony up some server capacity and don’t make money by freeriding on others. The SafeNetwork still provides them for free, with anonymity, as well as their users anonymity. It also provides them with an encrypted transport layer.

I would be very happy with just this. If a feature is provided that allows clever people to freeride, they will do so, which will probably lead to a worse experience for others.

I don’t know what is a reasonable line between what we can ask the network to provide for us for free, and what we need to provide our own infrastructure to accomplish. I just fear people gaming things to others detriment.

I suppose I’m worrying for nothing, as Structured Data is still charged for, so the economics should sort things out. But if I have structured data that is constantly being searched, and I paid the same as some other guy paid for structured data that is not searched, I have achieved a free-ride of the system. The network is worked much harder for my structured data that is searched, vs others structured data that is not.

3 Likes

@Tom_Carlson It’s difficult for developers to figure out the indexing on a distributed network without (like you said) relying on a server infrastructure. Not trying to sound purist or anything but that solution is no solution and is antithetical to the goals of maidsafe and it’s communities vision. So I think @Tim87 idea would be a great help in completing the ease for developers, much like maidsafe is aiming to do with the tutorial series. No one here wants a “free ride” we want p2p security and p2p solutions, not the same old crap.

1 Like

I think you mistake this idea with something fundamentally different than what SAFE already does; I think it’s just a natural extension. At the lowest level, blocks are found by their IDs. This idea would let a document be found by more than just one ID, that’s all.

Now you can ask:

  • “Who’s Tom Carlson?”

Then you could ask:

  • “Who said Tom Carlson is their friend?”
  • “What are the books Tom Carlson wrote?”

But also:

  • Here’s a book (as an indexing service, run by a person or a group, described it using a semantic hash); what other books are similar to it?
2 Likes

I get it. But we must charge you much more for searchable structured data, so that you don’t get the free ride. Since I can’t predict how popular you will be, and so how many searches the network will do on your behalf, how do I come up with a fair fee for you?

1 Like

I don’t know enough to argue one way or the other. I just have a concern.

The way to put this to bed, one way or the other, would be to quantify the network “work” that needs to occur for the search, vs a standard GET. If it’s 1:1, then I’ll shut up. If it’s 2:1, than we just charge you twice as much. If it’s somehow undefined, we need to be careful.

2 Likes

I just posted a gist with a bunch of notes and comments and questions, one of which is this very question.

Anybody who’s interested, please go check it out.

3 Likes

Traditionally, database indexes are updated at insert time, rather than read time. Updating an index after adding/updating data would seem reasonable as a result.

I suspect you could use structured data types as indexes which reference where the data is stored (e.g. address hash). This would enable searches on any fields as defined by a chosen index. This would be similar in concept to something like MongoDB, where you can store documents and index based on chosen fields.

2 Likes

Well, of course, but I’m confused; did I imply this would work differently?

One could indeed use Structured Data to implement some of it, but we would need a full SD block for every single index entry because you can’t put more in the same document because … you can’t search for the contents! SD blocks come with a lot of processing and metadata overhead, and I’m not sure that would be feasible on the required scale.

More importantly, you simply can’t do stuff like XOR distance queries over semantic hashes just by using SD, but that is required if we want to have search like Google (i.e. I don’t know the exact URL, but I can kind of describe what I’m looking for.)

I keep talking about semantic hashes, and I’m not sure everybody understands why I consider them crucial, so let me explain.

Semantic hashes are the polar opposite to cryptographic hashes. For the latter, the tiniest change in the document should result in a completely different hash, but it’s the opposite for semantic hashes: small changes in the hashed document should result in small changes in the hash itself.


An example for a 4-bit semantic hash is the Myers–Briggs Type Indicator (MBTI), which tries to group people into opposite categories across 4 independent dimensions: if “ISTP” is “0101” (0: introvert, etc) then “ESTJ” is “1100” (1: extrovert, etc), and the smaller the Hamming (XOR) distance between the two, the closer the two types are. This, of course, is a much more rudimentary method to describe a person than, for example, the Big Five personality model with its 5 real-valued dimensions, but it can still be useful.

Another example is a personals site, where you can categorize people in many different ways:

  • male / female
  • likes / hates cats

You can just specify your ideal partner and look for the closest ones (that is, in XOR space) to find your match.


You can do math with semantic hashes:

  • HASH(“male”) + HASH(“royalty”) = HASH(“king”)
  • HASH(“queen”) - HASH(“female”) + HASH(“male”) = HASH(“king”)
  • HASH(“Donald Trump”) - HASH(“sociopath”) = HASH(“wealthy”) + HASH(“orange”) + HASH(“clown”)

On to SAFE now.

Let’s say we are a search provider, and we trained a neural network to extract binary features from images. We end up with stuff like “much of this image is the sky”, “this image has people on it”, but also things that are less accessible to humans, yet are useful for categorizing images. We store the resulting hashes in SD blocks together with the image they refer to, and let the network index them.

We make a corresponding search app so people can actually use our service. It includes a smaller and faster version (though this link is not exactly that, just related) of that same neural network that is still good enough to produce codes to search the index. You take a picture, the app computes the hash, searches the network for close matches, then shows you the results: you are happy because there you have a bunch of similar pictures than the one you just took.

Same thing is possible for textual search: you type a sentence, the app computes the semantic hash, searches for similar hashes (computed and uploaded by the search provider), and there you have the pages that talk about the thing you just typed.

I’m really not sure how else we could do anything even remotely this powerful without this feature.

6 Likes

http://www.cs.utoronto.ca/~rsalakhu/papers/semantic_final.pdf background

4 Likes

You aren’t the only reader, Tim! :slight_smile:

I am not sure what you mean relating to SD. My understanding is that you can modify an SD over and over and can read it any time. This sort of temporal format would seem well suited for generating an index of what is stored elsewhere.

Obviously, any index which isn’t updated when the data it is referencing changes, but a sort of DBMS layer could look after that.

Maybe I misunderstand?

1 Like

Yea I guess you could build a B*tree or similar, and store it in a number of SD blocks.

The thing that goes for a built-in, network-wide index for semantic hashes is that 1) it helps with finding stuff so, just as Google on the internet, it may end up as one of the most used features, 2) it could directly reuse the code that finds / assigns blocks to vaults based on their XOR distance.

2 Likes

I know many have the dream that all SafeNetwork Apps will be serverless. I have a dream that big data app developers pony up some server capacity and don’t make money by freeriding on others. The SafeNetwork still provides them for free, with anonymity, as well as their users anonymity. It also provides them with an encrypted transport layer.

Hi. Interested app developer here. Can I just throw into the wish-list mix that if there were also some of the features from Datomic (Datomic: The most innovative DB you've never heard of (August Lilleaas' blog)) like the event-sourcing/audit-trailing/ability-to-query-in-time bit, that would be totally sweet also.

2 Likes