The DHT behind Storj

So, I met Gordon, the core DHT developer for Storj, two weekends ago at LibrePlanet. He’s a very nice individual and has seemingly revamped Storj development since joining last November. I had honestly stopped keeping up with Storj for quite some time, but the infrastructure he’s been working on for them is quite interesting, so I thought I’d share it here.

So it’s all based off his library called “kadtools”: https://github.com/kadtools/kad
Its pluggable architecture lets Storj integrate their design on top.

I asked him for some differences from the mainline DHT, which he outlined:

  • Node IDs are a hash of the node’s ECDSA public key
  • RPC messages are signed and verified
  • Message format is JSON-RPC
  • Transport protocol is HTTPS (might sound nutty, but happy to chat
    about why it’s not)
  • File transfer uses WebSockets instead of UTP/LEDBAT
  • Distributed pub/sub system on top of Kademlia
  • Key/Value items in DHT must be content-addressable
  • NAT traversal uses UPnP and HTTP tunneling instead of STUN/TURN
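
To make a couple of those bullets concrete, here is a minimal sketch of what "node ID is a hash of the public key" and "items must be content-addressable" could look like. The hash function (SHA-256 truncated to 160 bits), the ID width, and all function names are assumptions for illustration only; they are not taken from kadtools or Storj.

```python
import hashlib

ID_BITS = 160  # assumed ID width, as in classic Kademlia


def node_id(public_key: bytes) -> int:
    """Derive a node ID by hashing the serialized public key.

    Assumption: SHA-256 truncated to ID_BITS; the real network may use
    a different hash.
    """
    digest = hashlib.sha256(public_key).digest()[: ID_BITS // 8]
    return int.from_bytes(digest, "big")


def content_key(value: bytes) -> str:
    """Key under which a value would be stored: the hash of the value itself."""
    return hashlib.sha256(value).hexdigest()


def is_valid_item(key: str, value: bytes) -> bool:
    """A node can reject any STORE whose key is not the hash of its value."""
    return key == content_key(value)


def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance metric between two IDs."""
    return a ^ b
```

Because the ID is bound to the key pair, a node cannot pick an arbitrary position in the keyspace without also generating a matching key, and signed RPC messages can be checked against the sender's claimed ID.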

Seems like similar goals to webtorrent with perhaps some distinct differences I’m not aware of yet. If y’all have questions or a discussion forms around this, I’ll invite him to respond, etc.

11 Likes

This is much better for Storj to be considering and pursuing for sure.

It would be neat to know how many nodes they expect to be able to reach through UPnP. We have UPnP, but without effective hole punching it seems restrictive.

I assume they are all TCP based?

In terms of WebSockets, they expose IP addresses to all hops. So with normal Kad using iterations, this will expose all IP addresses. Is this something that has been considered?

It would be good if they got on and discussed some of this in terms of pros and cons. It will get them to decentralising away from owned servers faster, I hope.

5 Likes

I am not sure Storj offers sender/destination anonymity (I would love to be corrected here). Also, I think the centralization aspect is with things like their metadisk product, not sure about the core network. I personally am a bit disappointed especially with the metadisk central server thing, but I fear bringing it up again because of the embarrassing politicking done by members of this forum at Storj - THERE IS NO CLOUD: it’s just someone else’s computer.

To the point, the HTTPS does seem a bit nutty. So the receiving user acts as a server performing Diffie-Hellman with some agreed-upon X.509 cert? (Or is it centralized like metadisk and I just misunderstand, thinking it is P2P?) Using a CA-based trust protocol like HTTPS makes it difficult to trust the sender when the CA concept is removed.

Update to provide details on my question about comparison to webtorrent:

  • No single node stores a complete file

  • Shards are encrypted to the “renters” key

  • Incentive model (storage contracts between parties with payments)

  • Cryptographic auditing of data (using precomputed challenges and Merkle trees)

  • Redundancy using Reed-Solomon erasure codes (on roadmap - not yet implemented)
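
For anyone wondering how auditing with precomputed challenges and Merkle trees can work in principle, here is a simplified sketch. It is not Storj's actual scheme: the leaf construction, the choice of SHA-256, and every function name are assumptions. The idea shown is that the renter prepares a fixed set of random challenges before upload, commits to the expected answers in a Merkle root, and can later audit without keeping the shard.

```python
import hashlib
import os


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(leaves):
    """All tree levels, leaves first; an odd level repeats its last node."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]
        levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def merkle_proof(levels, index):
    """Sibling hashes from leaf to root, tagged True if the sibling is on the right."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling > index))
        index //= 2
    return proof


def verify(leaf, proof, root):
    """Recompute the path from a leaf up to the root and compare."""
    node = leaf
    for sibling, sibling_is_right in proof:
        node = h(node + sibling) if sibling_is_right else h(sibling + node)
    return node == root


def precompute_audits(shard: bytes, count: int):
    """Renter side, before upload: secret challenges plus the committed tree."""
    challenges = [os.urandom(32) for _ in range(count)]
    leaves = [h(h(c + shard)) for c in challenges]  # hash of each expected response
    return challenges, merkle_levels(leaves)


def respond(challenge: bytes, stored_shard: bytes) -> bytes:
    """Storing node's side: only computable while it still holds the shard."""
    return h(challenge + stored_shard)
```

An audit then reveals one unused challenge, takes the node's `respond(...)` value, and checks `verify(h(response), merkle_proof(levels, i), root)`. Each challenge can only be used once, which is why they are precomputed in bulk.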

I’ll just go ahead and invite Gordon to this thread. :slight_smile:

3 Likes

Hi @dirvine!

It’s difficult to get an accurate measurement on this because support for UPnP varies so greatly between different service providers and hardware. Geography also seems to have some effect: the success rate of using UPnP to punch out is higher in the US than in Europe (forgive my lack of citation - I’ll dig it up and update this later). Currently, if UPnP fails, the implementation will fall back to using an HTTP tunnel server (this is semi-centralized right now, but will become an addition to the protocol so that any public node can tunnel connections for firewalled or NAT-ed nodes - optionally negotiating a payment channel for the bandwidth).

Currently nodes talk to each other using HTTP(S) - however the underlying framework (kadtools) is transport-agnostic, so ideally in the future, the Storj protocol could be transport-agnostic too.

Right now, WebSockets are only used to open a file transfer channel directly with the node with which data is being stored or retrieved. At the moment I am not focused on anonymity, and I am not sure when or if that’s a problem we aim to solve since the software can run over Tor, I2P, or similar.

Very happy to join the discussion! I think MaidSafe is an awesome project and anything we can learn from each other will be good for everyone.

8 Likes

Excellent, yes there are a couple of differences, but still great to see this direction. There is a ton we can hopefully help with here. Kad as it stands is obviously very good; IPFS etc. have shown that. Where we found issues was with guarantees: not of retrieval, as you can increase the group size (K) to get around retrievability, but of editability/delete, where you must be able to get at all copies of a data element. If you start looking in that direction then Kad starts to become an issue.
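
One way to picture the edit/delete problem is the toy sketch below. It assumes copies of an item live on the K nodes closest to its key by XOR distance, uses made-up 4-bit IDs, and the helper names are hypothetical. When a closer node joins, the K-closest set shifts, so a node that used to be responsible may still hold a copy that a later delete never reaches.

```python
def k_closest(key, node_ids, k):
    """The k node IDs closest to key under the XOR metric."""
    return sorted(node_ids, key=lambda n: n ^ key)[:k]


def stale_holders(key, old_nodes, new_nodes, k):
    """Nodes that were in the k-closest set before churn but are not afterwards."""
    before = set(k_closest(key, old_nodes, k))
    after = set(k_closest(key, new_nodes, k))
    return before - after


key = 0b1000
nodes = [0b1001, 0b1100, 0b0001, 0b1111]
joined = nodes + [0b1010]  # a new node lands closer to the key

print(k_closest(key, nodes, 2))
print(k_closest(key, joined, 2))
print(stale_holders(key, nodes, joined, 2))
```

Raising K helps retrieval because any one surviving copy is enough, but an edit or delete has to find every copy, including the ones on nodes that have drifted out of the closest set.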

We also implemented down-list modifications, beta refresh and others to try to remove route poisoning (due to age, not malice) and again found there were things we could not overcome without a rethink. So these are areas we can probably help with.

The easy way here would be to not require that level of exactness for edit/delete capability. That is a valid route to take, but it may lead to some other edge effects. For sure we can help each other. IPFS can help as well, as they have done very well, and not only with Kad (they are not so concerned with security or privacy just yet, but I think they will be). They have made great advances in usability and delivery though, so collaboration in time will be great. Now if you folks also have an angle to take on board then we will all win for sure.

Welcome to the forum, btw; excellent to see you here. Let’s all move on with real advances now and share what experiences we can. We all win then :wink: superb!

7 Likes

Hi, @cretz!

That is correct, but as I referenced in my reply to @dirvine, Storj can be run over Tor/I2P/etc, so as of right now sender/destination anonymity is not a priority.

After reading that, I’m a little scared too. o.o What’s important to understand about MetaDisk is that it is a centralized bridge into the network. It’s not intended to replace the protocol, but rather to provide a convenient API for developers to use the network. It’s also worth noting that MetaDisk is free software and you may run an instance yourself.

Every node in the network acts as an HTTP(S) server and client to communicate with other nodes. MetaDisk is just another P2P node that also exposes a centralized API for accessing the network.

This is true. Right now, I’m taking a web-first approach so that content can be stored and retrieved P2P in the web browser, but the truth is that it increases the barrier to entry for running a node (since it must be publicly addressable and secured) or in the case of private nodes, decreases security and speed through the use of a public tunnel server. These are problems I’ll be working on solving and are related to my previous reply to @dirvine regarding being transport-agnostic.

2 Likes

With bitcoin we’ve had quite some success with automatic TCP port forwarding using UPnP. We have no statistics on what % of routers support it, though. Many have it enabled by default, and it’s a good option if it can be implemented securely. We recently stopped enabling UPnP usage by default because of a terrible exploitable bug (as well as code quality concerns) in miniupnpc, the library we use.

One good reason to use HTTPS as the transport is difficulty of detection for blocking. Most firewalls will simply let HTTPS connections through, but not a custom protocol. For example, Tor goes to a lot of trouble to masquerade its connections as HTTPS.

4 Likes

That is good to hear. We currently have UPnP in crust and I do keep looking at it and wondering, so it is good to hear bitcoin has seen this as successful to at least some extent. We are really pushing hole punching and random ports as well (our UPnP also uses random ports) and hope that is a success. I think many projects could then use crust (even via the “c” api) to just forget about NAT and connections (we can dream). The goal is to get this to a state like NaCl or Sodium, where it’s good quality and does its job. OS errors are such a PITA to deal with. It’s relatively small as libs go but needs to be switched to epoll etc., which is happening now. Then it will become resource efficient.

Anyhow, good to know UPnP is helpful in the wild.

1 Like