Semantic search

I don’t usually suggest things that require extensions to the core, but this is one of them.

The idea is about the semantic tagging of documents, and it requires:

  • a new data type (yes, I knowww :scream_cat:)
  • that can be searched in XOR space in a fuzzy way

The idea is based on this paper: Semantic Hashing (Salakhutdinov, Hinton).

What the Heck This Is About

“Dimensionality reduction” is the method where we describe an unlimited number of documents using a limited number of categories, giving each document a weight for every category.

For example, let’s take the categories age (“oldness”) and sex (“maleness” or “femaleness”), and let’s take the people of the world; if we decide that 120 years is 1.00, then a 60yo geezer would get 0.5 on that category. If he’s a dude, he’d get a 1.00 on “maleness” or a 0.00 on “femaleness”, depending on which way we decided to express sex/gender. Heck, you could do both and one can get a 0.9 male, 0.6 female rating at the same time :joy_cat:
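As a minimal sketch of the example above (the names `Person` and `to_weights` are made up for illustration, and the 120-year ceiling is just the convention picked in the text):

```python
from dataclasses import dataclass

MAX_AGE = 120.0  # assumption taken from the text: 120 years maps to 1.00


@dataclass
class Person:
    age_years: float
    is_male: bool


def to_weights(p: Person) -> dict:
    """Map a person onto the two hand-picked categories, each weight in [0, 1]."""
    return {
        "oldness": min(p.age_years / MAX_AGE, 1.0),
        "maleness": 1.0 if p.is_male else 0.0,
    }


geezer = to_weights(Person(age_years=60, is_male=True))
print(geezer)
```

However many people we feed in, each one comes out as the same two numbers, which is the whole point of the reduction.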

The categories can be handcrafted, but they can also be automatically generated (e.g. by neural networks or support vector machines), in which case the model is not readily understood by humans, but it is better suited to similarity search.

“Semantic hashing” is the binary variation of the same idea: age becomes “young” and “old”, and then “young” can be further subdivided into “child” and “teen”, “child” into “baby” and “no longer a baby”, etc.
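One literal reading of that subdivision can be sketched as follows. This is only a hand-built illustration: `subdivide` and `semantic_hash` are hypothetical names, and in the cited paper the bits are learned by a neural network rather than produced by halving a range like this.

```python
def subdivide(weight: float, depth: int) -> str:
    """Split the range 0..1 in half `depth` times.

    Each step emits "0" for the lower half ("young") or "1" for the
    upper half ("old"), then keeps subdividing the half it landed in.
    """
    lo, hi = 0.0, 1.0
    bits = []
    for _ in range(depth):
        mid = (lo + hi) / 2
        if weight < mid:
            bits.append("0")
            hi = mid
        else:
            bits.append("1")
            lo = mid
    return "".join(bits)


def semantic_hash(weights: list, depth: int = 3) -> str:
    """Concatenate the bits of every category into one binary string."""
    return "".join(subdivide(w, depth) for w in weights)


# oldness 0.5 and maleness 1.0, three bits per category
print(semantic_hash([0.5, 1.0]))
```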

The resulting semantic hash is exactly like what you think: a looooong binary number. However, unlike in the case of cryptographic hashes, if the XOR distance between two numbers is small (that is, if XORing them leaves only a few bits set), we can be fairly sure that the two things are also quite similar (unfortunately, it’s not always true in the opposite direction, but it’s good enough). If you bring an example of the kind of thing you’re searching for, we’ll just have to look for everything with a semantic hash within a small enough XOR radius to find similar things.
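To make the radius search concrete, here is a small sketch. It assumes "XOR distance" means the number of bits in which two hashes differ (the Hamming distance), and it uses toy 8-bit values; `hamming_distance` and `within_radius` are illustrative names, not anything that exists in SAFE.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where a and b differ (set bits of a XOR b)."""
    return bin(a ^ b).count("1")


def within_radius(query: int, candidates: list, radius: int) -> list:
    """Keep the candidates whose hash differs from the query in at most `radius` bits."""
    return [c for c in candidates if hamming_distance(query, c) <= radius]


query = 0b10011011
candidates = [0b10011010, 0b10111011, 0b01100100, 0b10011011]

# Prints the three candidates that differ from the query in at most one bit
print([bin(c) for c in within_radius(query, candidates, radius=1)])
```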

And this is where we need help from the SAFE core.

The SAFE Way

In a nutshell:

  • we need data to represent:

    • creator – the one who does the hashing, e.g. “Google” (well, anybody who wants to be the safe google)
    • version – so that the same creator can improve / change their hashing method
    • hash – a 512-bit (OVERKILL FTW) number, e.g. “100110111110011…”
    • document – the “real hash,” the address on the network of the document being described
    • signature – authentication, obv
  • this is a tiny piece of data (well below 1K) and there would be a LOT of it, so it needs to be very cheap (cheaper than SD), with a PtP scheme to offset even that cost: indexing services would provide a valuable (if not crucial) service

  • there needs to be a way to search for (creator, version, hash, distance), where creator and version are exactly matched, and hash is matched within the given XOR distance (“Hamming ball”)

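  • SAFE already searches for hashes in XOR space all the time, so a good part of this is not something new to develop :smiley_cat: (one caveat: as far as I know, routing compares the XOR of two addresses as a number, while a Hamming ball counts the bits that differ, so the lookup is related but not identical)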
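The record and the (creator, version, hash, distance) lookup described in the list above could look roughly like this. Everything here is a hypothetical sketch: `SemanticTag` and `search` are invented names, the linear scan only stands in for whatever the network would really do, signatures are carried but not verified, and the hashes are 4 bits instead of 512.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticTag:
    creator: str        # who did the hashing
    version: int        # version of that creator's hashing method
    semantic_hash: int  # the semantic hash (512 bits in the proposal)
    document: bytes     # network address of the document being described
    signature: bytes    # creator's signature (not checked in this sketch)


def differing_bits(a: int, b: int) -> int:
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def search(tags, creator, version, semantic_hash, distance):
    """Exact match on creator and version, fuzzy match on the hash."""
    return [
        t for t in tags
        if t.creator == creator
        and t.version == version
        and differing_bits(t.semantic_hash, semantic_hash) <= distance
    ]


tags = [
    SemanticTag("alice", 1, 0b1100, b"doc-1", b"sig"),
    SemanticTag("alice", 1, 0b0011, b"doc-2", b"sig"),
    SemanticTag("alice", 2, 0b1100, b"doc-3", b"sig"),
    SemanticTag("bob", 1, 0b1100, b"doc-4", b"sig"),
]
hits = search(tags, "alice", 1, 0b1101, distance=1)
print([t.document for t in hits])
```

Only the first record survives: the second is three bits away, and the other two fail the exact match on version or creator.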


Fantastic to see folk looking into this :thumbsup: . Semantic search is a critical component of the future of the network in my opinion. It will allow computation to use such “data sets” and give us answer engines instead of search engines, and then the ability to link “searches” in a much better way. I feel that many times when we search we are really looking for answers, so this all makes a lot of sense. I look forward to seeing this fleshed out and moving forward. I wish I had more time, but it is good news to see movement in this area now.

[Edit: I should say, perhaps, take a look at some of Wolfram’s ideas and code as well. Some of that is very good; also Cycorp etc.]


I have actually looked into Cyc (and similar things) before; interesting stuff! SAFE could (will) be FreeBase, resurrected :smiley_cat: (well, among other things)

Wolfram is sooooo out there; I’ll have to read some of his stuff, he’s a very original thinker.

RAMBLINGS: Random idea related to ontologies: SAFE could be a huuuuuge triplestore, but again: isn’t SD 1) designed for bigger data, thus too expensive, 2) not content searchable? I mean, storing a triple requires just a few bytes (4 x 512 bits for the hashes of the triple’s three components and the owner: 256 bytes + signature), and you’d want to be able to search for any combination of the components… That could of course be done by making separate entries for the combinations ((A,B), (A,C), (B,C)), but isn’t it too expensive to store a half K in something designed for 100K? Or could the PUT cost depend on size?
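A quick sketch of that arithmetic and of the "separate entries for the combinations" idea. The use of SHA-512 as the 512-bit hash, and the names `triple_record` and `pair_keys`, are my assumptions for illustration only.

```python
import hashlib
from itertools import combinations


def h(data: bytes) -> bytes:
    """512-bit hash of a component (SHA-512 is an assumed stand-in)."""
    return hashlib.sha512(data).digest()


def triple_record(subject: str, predicate: str, obj: str, owner: str) -> bytes:
    """Four 512-bit hashes back to back; the signature would come on top."""
    return b"".join(h(x.encode("utf-8")) for x in (subject, predicate, obj, owner))


def pair_keys(subject: str, predicate: str, obj: str) -> dict:
    """One lookup key per pair of components: (A,B), (A,C), (B,C)."""
    parts = {
        "s": h(subject.encode("utf-8")),
        "p": h(predicate.encode("utf-8")),
        "o": h(obj.encode("utf-8")),
    }
    return {a + b: h(parts[a] + parts[b]) for a, b in combinations("spo", 2)}


record = triple_record("SAFE", "is_a", "network", "some-owner")
keys = pair_keys("SAFE", "is_a", "network")
print(len(record), sorted(keys))
```

So each triple costs 256 bytes before the signature, plus three more small entries if every pair of components should be searchable.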


Semantic search is an important component. A baked-in web of trust is another. And some built-in support for externally uploaded black/whitelists etc. is yet another.
