Indexed Structured Data

I know this thread has been dead for a while and I'm not sure whether it is completely outdated by now, but I have only recently got wind of the maidsafe project and am still orienting myself. Anyway, I am very much interested in this topic, as I have been thinking about how to build semantic/smart search into the safe-network. I have a background in AI and work with NLP and ML, so I can probably contribute a bit on representing words and texts in a meaningful way (word2vec, GloVe). The problem I keep running into is that so far I have not come up with a solution that does not either require this to be part of the underlying data structure or rely on a centralised server. So I am looking for other people to do some brainstorming with. Hopefully people who understand the safe protocols better, since I still have a lot to learn :slight_smile:

p.s.
Not looking for profit, just want to build cool stuff and help this awesome project :wink:

4 Likes

The SD data type no longer exists; it was replaced by MD (mutable data), which is 1 MB in size and has new features.

1 Like

Thank you for the quick reply. I have been looking at the MutableData specs but did not realise it replaces SD. Although I think this does not change a lot in this respect, as far as I can tell it mainly adds more fine-grained permissions. I was thinking of sending a PM to Tim about this, but unfortunately he has not been active since March, so I'm still hoping to find some other members interested in this topic.

2 Likes

You can try on the dev forum too: https://forum.safedev.org/

The indexing strategies described in Schema Agnostic Indexing with Azure DocumentDB are interesting, and some of the ideas there might be useful for apps wanting to use SAFE as a JSON document store. Basically, JSON documents are represented as a tree, and then each path through the tree is used as a key whose value is the id(s). In SAFE these indexes could be stored as MDs, with paths as keys (xor names) and values as a list of ids of all MDs (or JSON in immutable data, even) that contain that specific path.
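Just to make the idea concrete, here is a rough sketch (not working SAFE code) of how an app might flatten a JSON document into its tree paths and write one index entry per path. `putIndexEntry` is a hypothetical helper that would append the document id to the MD whose xor name is derived from the path string:

```ts
// Hypothetical helper: appends docId to the posting list stored in the MD
// whose xor name is derived from the path string.
declare function putIndexEntry(path: string, docId: string): Promise<void>;

// Flatten a JSON value into its tree paths, including leaf values, e.g.
// { author: { name: "Ada" } } -> ["/author", "/author/name", "/author/name/Ada"]
function treePaths(value: unknown, prefix = ''): string[] {
  if (value === null || typeof value !== 'object') {
    return [`${prefix}/${String(value)}`];
  }
  const paths: string[] = [];
  for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
    const path = `${prefix}/${key}`;
    paths.push(path, ...treePaths(child, path));
  }
  return paths;
}

// Index a document: one index entry per path it contains.
async function indexDocument(doc: object, docId: string): Promise<void> {
  await Promise.all(treePaths(doc).map((p) => putIndexEntry(p, docId)));
}
```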

Semantic hashes could in this case make the queries rather more flexible: you'd get an inexact match on the path, and then the app would go through all the matches on the client side to filter out those that are not useful.

3 Likes

I think there is just one thing missing from SAFE to make it possible to experiment with semantic hashing on the application layer, and that is an API call to return MDs close to a hash (by xor distance).

Maybe this API call would only work with MDs with a specific type tag representing a semantic hash, in case there are security implications of opening it up for all MDs.

To get a mutable data handle, the current method is:

window.safeMutableData.newPublic(appHandle: SAFEAppHandle, name: (String | Buffer), typeTag: Number): Promise<MutableDataHandle>

The new API call could be similar, but instead of returning a single MD with a direct match, it would return a list of the closest MDs with the semantic hash type tag.

window.safeMutableData.getCloseByName(appHandle: SAFEAppHandle, name: (String | Buffer), numHits : int): Array<Promise<MutableDataHandle>>
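Usage might then look something like the sketch below. To be clear, getCloseByName is only the proposed call, it doesn't exist in the current API, and the type-tag filtering is assumed to happen on the network side:

```ts
// window.safeMutableData is the SAFE browser DOM API; typed loosely for this sketch.
declare const window: any;

async function nearestMatches(appHandle: string, queryHash: Buffer, numHits: number) {
  // Each returned handle would refer to an MD whose name is close to queryHash
  // in xor distance; the app would then read its entries for document references.
  const handles = await Promise.all(
    window.safeMutableData.getCloseByName(appHandle, queryHash, numHits),
  );
  return handles;
}
```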

4 Likes

I would want this anyhow. MDs are going to be used for all sorts of things and I don't want coin MDs appearing when I want MDs relating to my database or whatever. :wink:

How hard is this going to be?

SAFE doesn’t really keep a master index anywhere and so this could mean the network is doing a lot of searching.

Is there a way to do this?

1 Like

I think the MaidSafe devs have to answer that.

There are many questions here though.

I think storing the index in an MD could work, but it might not have ideal performance. An alternative would be an MD-like structure that is, for example, always stored in memory, has a different size than an MD, etc.; I'm not sure here.

When it comes to the data formats, I think it could make sense for this to be handled on the application layer, with the network only doing the most low-level things necessary. It would then be easy to try out different ways of doing things. The disadvantage is that an app could potentially get data in a format it doesn't know how to handle.

I imagine the basic process could work like this

An app uses some algorithm to compute a semantic hash. For example, for two very similar documents doc1 and doc2 it would compute hash1 = semanticHash(doc1) and hash2 = semanticHash(doc2). Later, a query would be made with some words that appear a lot in these two documents: queryHash = semanticHash(query).

Running the getCloseByName function with queryHash would then return hash1 and hash2 (or the corresponding MutableDataHandles).

What would these two results then contain? That could be up to the application that stored them. It could be the actual documents, or references to the actual documents (either mutable or immutable data), perhaps with some extra metadata for filtering results.
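As a sketch of that flow end to end (the hashing and storage steps are stubbed out as hypothetical helpers; only getCloseByName corresponds to the call proposed above, and even that doesn't exist yet):

```ts
// Hypothetical helpers, for illustration only.
declare function semanticHash(text: string): Buffer;                       // app-side model
declare function storeIndexEntry(hash: Buffer, docRef: string): Promise<void>;
declare function getCloseByName(hash: Buffer, numHits: number): Promise<string[]>;

async function indexAndQuery(): Promise<string[]> {
  // Indexing time: similar documents produce nearby hashes.
  const hash1 = semanticHash('the quick brown fox jumps over the lazy dog');
  const hash2 = semanticHash('a quick brown fox jumped over a lazy dog');
  await storeIndexEntry(hash1, 'safe://docs/doc1');
  await storeIndexEntry(hash2, 'safe://docs/doc2');

  // Query time: hash the query the same way and fetch the nearest entries.
  const queryHash = semanticHash('quick brown fox');
  return getCloseByName(queryHash, 10); // would be expected to surface doc1 and doc2
}
```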

Is it feasible to do it in this way?

3 Likes

Actually, if mutable data is used, I guess it shouldn't be a single hardcoded type tag for a semantic hash, but maybe instead a range of type tags; say semantic hashes could be any type tag between 20 000 and 25 000, and when doing a query you could only look for something in this range. Each type tag would then signify one specific hashing algorithm and maybe also a specific format used for the index.

There are already several different algorithms (neural nets) for calculating a semantic hash, some of which vary in what kind of data they are useful for, and there will be more in the future. To look up something hashed with a specific semantic hashing algorithm, the same algorithm has to be used to hash the query, so it could be useful to specify this via the type tag.
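For illustration (the actual numbers and algorithm names below are made up), the app-side convention could be as simple as a reserved range plus a registry of what each tag means:

```ts
// Assumed reserved range for semantic-hash MDs; not an official allocation.
const SEMANTIC_TAG_MIN = 20000;
const SEMANTIC_TAG_MAX = 25000;

// One tag per hashing algorithm / index format, so queries only ever compare
// hashes that were produced the same way. Entries here are examples only.
const SEMANTIC_ALGORITHMS: Record<number, string> = {
  20000: 'bag-of-words autoencoder, 256-bit codes',
  20001: 'doc2vec + random hyperplanes, 256-bit codes',
};

function isSemanticTag(typeTag: number): boolean {
  return typeTag >= SEMANTIC_TAG_MIN && typeTag < SEMANTIC_TAG_MAX;
}
```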

Isn't getting the hashes close to one specific hash basically what is being done all the time to find a close group anyway? If you want to PUT something on the network, the network will find the nodes closest to the hash/id of the chunk to be stored, and those form the close group. For the semantic hash lookup, the app would send a semantic hash to the network and then, instead of getting the ids of the closest nodes, it would get the ids of the closest MDs with the specified semantic hash type tag.

2 Likes

Would this work?

  1. Find the close group for a hash
  2. Ask each node in that close group which data elements with a semantic hash type tag they have stored/are responsible for/know about, and return those
  3. If you want more results, ask the nodes in the close group about neighbouring nodes they know about, then repeat step 2 for those (see the sketch below)
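
Expressed as a rough client-side sketch with hypothetical routing helpers (none of these primitives exist in the current API; the types and functions are invented purely to illustrate the three steps above):

```ts
type XorName = Uint8Array;               // 32-byte network address
interface NodeInfo { id: XorName; }
interface MdRef { name: XorName; typeTag: number; }

// Assumed routing-layer primitives, not part of any existing SAFE API.
declare function closeGroup(target: XorName): Promise<NodeInfo[]>;
declare function mdsKnownByNode(node: NodeInfo, tagLow: number, tagHigh: number): Promise<MdRef[]>;
declare function neighboursOf(node: NodeInfo): Promise<NodeInfo[]>;

// Expanding search around a semantic hash: start with the close group,
// then widen to neighbours until enough hits are collected.
async function findCloseSemanticMds(
  hash: XorName, tagLow: number, tagHigh: number, wanted: number,
): Promise<MdRef[]> {
  const seen = new Set<string>();
  const results: MdRef[] = [];
  let frontier = await closeGroup(hash);                              // step 1

  while (frontier.length > 0 && results.length < wanted) {
    const next: NodeInfo[] = [];
    for (const node of frontier) {
      const key = Array.from(node.id).join(',');
      if (seen.has(key)) continue;
      seen.add(key);
      results.push(...await mdsKnownByNode(node, tagLow, tagHigh));   // step 2
      next.push(...await neighboursOf(node));                         // step 3
    }
    frontier = next;
  }
  return results.slice(0, wanted);
}
```
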
3 Likes

One thought that strikes me is the workload this would create for the nodes; if done often enough, it would surely represent a major portion of the workload for groups/sections (or whatever they are called now). And really, for what purpose?

I would have thought that if you wanted indexing, you would "roll your own" or use a "standard" library. Searches done through an API like this would pull up so much data that your APP could get swamped anyhow. We are talking about worldwide data here, and even a specific type tag would not necessarily make it much better when potentially millions are using the APP with millions/billions of these JSON trees to search.

2 Likes

Making your own indexes also makes sense in many cases, but semantic hashes might open up some possibilities that wouldn't otherwise be available; then again, perhaps there are other clever ways of doing things.

If you want full-text search in your app and you don't have any crazy amount of documents, it should work fine in a standard Lucene way, I suppose. If your query is "hello world" you would download two indexes with the keys "hello" and "world" and merge them on the client side to get your results. If it's up to a few thousand documents in each index, I imagine it could be quite fast.

What if there were millions of documents containing the word "hello" or the word "world", though? Downloading those indexes and expecting to see the search results immediately would not work. I do think it might be possible to make it work to some extent, by using other ranking factors (PageRank or whatever) and partitioning the index in clever ways, maybe setting it up as some kind of search tree where some nodes are MDs themselves, with keys that specify the range of a subtree so you could jump directly into the right part of the tree. Anyway, just some random thoughts.
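To make the client-side merge concrete, here is a minimal sketch of that "standard Lucene way", assuming a hypothetical fetchPostingList helper that reads the list of document ids from the index MD whose xor name is derived from a word:

```ts
type DocId = string;

// Hypothetical: derive the index MD name from the word and read its entries.
declare function fetchPostingList(word: string): Promise<DocId[]>;

// Client-side AND query: keep only documents that appear in every posting list.
async function search(query: string): Promise<DocId[]> {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  const lists = await Promise.all(words.map(fetchPostingList));
  if (lists.length === 0) return [];
  // Start from the shortest list so the intersection stays small.
  lists.sort((a, b) => a.length - b.length);
  return lists.reduce((acc, list) => {
    const ids = new Set(list);
    return acc.filter((id) => ids.has(id));
  });
}
```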

Anyway, there are a couple of issues with this kind of Lucene-like index that semantic hashing might improve upon. One is that the search time increases with the number of words in your query: if your query is "hello world" you download two indexes; if your query is "hello world, what up" you download four indexes. Another is that the search time increases with the number of documents in each index when there's more than one word in your query, and thus several indexes that need to be merged.

Supposedly semantic hashing can help with this. You have an app that, on the client side, makes a list of each word in a document and the number of times that word appears in the document. This list is passed to a neural network that outputs the semantic hash of the document. When you later want to do a query, you pass in the query as a document and get a semantic hash back; the query should be a couple of words at most anyway, so it should be fast to calculate. You then search for the semantic hashes close to this hash to get search results that should supposedly be comparable to tf-idf (Lucene). In this case the heavy lifting is mostly done at indexing time, provided that searching for close semantic hashes can be done efficiently without swamping the network.
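As a toy stand-in for the neural hash (a real system would use a trained model, but the client-side flow is the same: term counts in, fixed-length hash out), here is a SimHash-style sketch that hashes a term-frequency vector with pseudo-random hyperplanes, so similar documents end up close in Hamming/xor distance:

```ts
import { createHash } from 'crypto';

const HASH_BITS = 256; // match the 256-bit xor address space

// Count how often each word occurs in the text.
function termFrequencies(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    tf.set(word, (tf.get(word) ?? 0) + 1);
  }
  return tf;
}

// Deterministic pseudo-random +/-1 weight for a (word, bit) pair.
function weight(word: string, bit: number): number {
  const digest = createHash('sha256').update(`${word}:${bit}`).digest();
  return digest[0] & 1 ? 1 : -1;
}

// Documents with similar word distributions get hashes that agree on most bits.
function semanticHash(text: string): Uint8Array {
  const tf = termFrequencies(text);
  const hash = new Uint8Array(HASH_BITS / 8);
  for (let bit = 0; bit < HASH_BITS; bit++) {
    let sum = 0;
    for (const [word, count] of tf) sum += count * weight(word, bit);
    if (sum > 0) hash[bit >> 3] |= 1 << (bit & 7);
  }
  return hash;
}
```

A query is hashed with exactly the same function, so it lands near the documents that share its vocabulary.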

2 Likes

You know, one big problem with this: a lot of files will be immutable files, and the network simply cannot do these sorts of hashes because it cannot read the text in those files. It would only work for unencrypted MDs, which may be only a very small portion of the files you want to search. While your APP can read the files because it has the datamap of the immutable files, the network code (API) is ignorant of this. Otherwise someone could abuse this to find data from private files, since the network does not know which immutable files are public or private. Public files have their datamap available via public directories stored on SAFE, and thus the file is public. Private files have their datamaps available only to the user.

I was actually always thinking of the hashes as something that would be done on the client side, especially as there are many different hashing algorithms that could be useful. So basically the APP would use a library to calculate the hash, and the only thing the network would do is calculate the XOR distance between hashes; you'd use a type tag to specify which hashing algorithm was used so that you don't try to compare hashes produced by different algorithms.

3 Likes