Most importantly, how do you ensure redundancy? Proof of storage is a trivial problem and not what has stopped decentralised data storage so far. "Proof of redundancy" is what we need! What is stopping me from creating thousands of nodes backed by one shared store to increase my profit? Without proof of redundancy, market forces will push towards the cheapest (least redundant) storage.
I visited Maidsafe a few weeks ago and left with very similar concerns. Essentially I’d describe it as the “Google Attack” where the financial incentive to offer data storage gives a company like Google the opportunity to sell nearly 100% of the capacity. Even if they have incentives to be reliable they’re still a single point of failure that can go down at any moment.
The best I could come up with was to use multiple, trusted, geographically distinct auditing nodes. With the limited speed of light you can prove that more than one copy of the data is being held, although again you can't prove the nodes aren't being run by the same organization. Similarly, the auditing nodes are trusted, which means most users will be relying on centralized servers anyway.
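The speed-of-light argument above can be made concrete. The sketch below is my own toy illustration, not any project's actual protocol: two auditors a known distance apart challenge a provider simultaneously; since any single storage location is at least half the auditor separation away from one of the auditors (triangle inequality), a single copy cannot answer both challenges faster than light allows.

```python
C_KM_PER_S = 299_792.458  # speed of light in vacuum, km/s (upper bound on signal speed)

def implies_two_copies(distance_km: float,
                       latency_a_s: float,
                       latency_b_s: float) -> bool:
    """Return True if two measured one-way response latencies are
    physically impossible for a single storage location.

    Auditors A and B are distance_km apart. Any single point is at
    least distance_km / 2 from one of them, so one of the one-way
    latencies must be >= (distance_km / 2) / c. If BOTH latencies
    are below that bound, at least two copies must exist.
    """
    min_latency_single_copy = (distance_km / 2) / C_KM_PER_S
    return (latency_a_s < min_latency_single_copy and
            latency_b_s < min_latency_single_copy)

# Auditors 12,000 km apart: the single-copy bound is ~20 ms one-way.
print(implies_two_copies(12_000, 0.005, 0.004))  # True: both replied in ~5 ms
print(implies_two_copies(12_000, 0.025, 0.004))  # False: one copy could do this
```

Note this only proves two *locations*, not two *owners*, which is exactly the limitation the post describes.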
As for the coins associated with Maidsafe and Storj, there’s no reason to think any of this instant confirmations, low-fees, no-mining stuff is sybil resistant. Sure, it might work if we can rely on the majority of nodes being honest and you not being sybil attacked, but at best that’s certainly a lesser standard of security than Bitcoin offers; at worst it just won’t work.
Perhaps increase the redundancy levels. Storing, say, five copies of every file uses more space, but it's very redundant.
But in theory, if a centralized source provided most of the storage capacity and shut it down as an attack, it might be a problem. It just means you'll need a few more farmers and more redundancy, and most importantly the farming should be autonomous in itself.
Meaning no human operation required. Plug it in somewhere or use a drone. Centralized storage could allow a government to raid and shut down entire sections of the network.
I think you’re missing the point here. The problem is that if a farmer floods the network with accounts, several of those accounts will be assigned by the network to store the same file in order to achieve redundancy. It is most cost-effective for that farmer to store the file only once and have all the relevant accounts serve from that single source. Increasing the number of copies won’t help against this exploit; it only makes the exploit more cost-effective.
In my view, the solution would be to make it impossible for farmers to identify copies of the same file. Since all file parts are encrypted, what about adding a prefix or suffix (different for every copy) to the file data before it is encrypted? The drawback is that when uploading a file to the network, it needs to be encrypted and uploaded multiple times. But farmers wouldn’t be able to match the different copies, since in encrypted form they look completely different from each other.
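The per-copy prefix idea can be sketched in a few lines. This is a toy illustration only: the hash-based stream cipher below stands in for whatever real cipher a network would use and is not production crypto; the point is that a fresh random nonce mixed into each copy makes two encryptions of the same chunk unlinkable.

```python
import hashlib
import os

def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy SHA-256 counter-mode stream cipher (illustration only).
    The keystream depends on the key AND the per-copy nonce."""
    out = bytearray()
    for block in range(0, len(data), 32):
        ks = hashlib.sha256(key + nonce + block.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[block:block + 32], ks))
    return bytes(out)

def make_copy(key: bytes, chunk: bytes) -> bytes:
    """Encrypt one storable copy under a fresh random nonce, so no two
    copies of the same chunk share any recognisable bytes."""
    nonce = os.urandom(16)
    return nonce + keystream_xor(key, nonce, chunk)

def read_copy(key: bytes, stored: bytes) -> bytes:
    nonce, ciphertext = stored[:16], stored[16:]
    return keystream_xor(key, nonce, ciphertext)

key = os.urandom(32)
chunk = b"the same file chunk" * 50
copy_a, copy_b = make_copy(key, chunk), make_copy(key, chunk)
print(copy_a != copy_b)                 # True: farmers cannot match the copies
print(read_copy(key, copy_a) == chunk)  # True: both still decrypt to the chunk
```

The cost is exactly the one the post notes: each copy must be encrypted (and uploaded) separately, and deliberate unlinkability also defeats de-duplication of those copies.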
The number of devices farming (not storage or bandwidth capacity) will significantly outnumber any single organization. The odds of all 4 copies of a piece of data landing in one person's or organization's hands are tiny, I think. If it does become a problem, the network could always increase the number of copies it stores.
That is actually at the core of how the network operates. Farmer nodes follow rules about handling unidentifiable file particles, or they are marginalized and not allowed to participate. Farmers have no discretion. If anything looks like human involvement at the base operation level, the node is not trusted.
The only way I can see a Google attack mattering is early on: putting on a monstrous number of nodes which dwarfed the individual participants, and then withdrawing them all at once. That might result in sufficient data loss to cause network failure. But only early on, before diversity and mass adoption dwarf the effect.
Also remember that geographical location is irrelevant. “Close nodes” are likely spread all over the planet.
EDIT to add: Trusted groups of nodes are spread randomly all over the planet, and those groups direct storage and retrieval of file particles which are stored on yet other nodes, also spread randomly all over the world. The network works constantly to maintain at least 4 copies of each particle.
A large number of nodes which had been behaving as trusted nodes being shut off at once would cause a huge churn event, in which the network would attempt to reorganize itself and recover the requisite number of copies of file particles. It might be possible that all copies of some file chunks were contained on the withdrawn “Google attack” nodes; those particles would be lost. But if even one copy of a particle remained somewhere, the whole data set could be reconstructed, provided the network were able to deal with the size of the churn event.
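The "all copies on withdrawn nodes" risk can be quantified with a back-of-envelope model. Assuming copies are placed on independently random nodes (an assumption of this sketch, not a claim about the real placement algorithm), an attacker withdrawing a fraction p of nodes destroys a chunk only if all of its copies sat on those nodes:

```python
def chunk_loss_stats(attacker_fraction: float,
                     copies: int,
                     total_chunks: int) -> tuple[float, float, float]:
    """Return (per-chunk loss probability, probability of losing at
    least one chunk network-wide, expected number of lost chunks)
    under independent random placement of each copy."""
    p_one_chunk = attacker_fraction ** copies
    p_any_loss = 1 - (1 - p_one_chunk) ** total_chunks
    expected_lost = total_chunks * p_one_chunk
    return p_one_chunk, p_any_loss, expected_lost

# Attacker controlling 30% of an early network, 4 copies, a million chunks:
per_chunk, any_loss, expected = chunk_loss_stats(0.30, 4, 1_000_000)
print(f"{per_chunk:.4f} {any_loss:.4f} {expected:.0f}")  # 0.0081 1.0000 8100
```

This matches both halves of the argument: each individual chunk is very unlikely to be lost, yet across millions of chunks a large early-stage withdrawal is almost certain to destroy some of them, which is why the attack only matters before mass adoption shrinks the attacker's fraction.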
I’m not a techie on this but have worked to get a theoretical grip on the network. Some of the Maidsafe team can comment if I’ve failed to describe this correctly.
Surely Peter Todd did not visit Troon, did he? Or does he mean that he visited the site and/or forums? Anybody have the scoop on this?
Dug it up; it seems he did:
I’ve got no reason to think either Storj or Maidsafe are scams - I’ve met both teams, even spent a few days in Scotland visiting the latter. I’m sure they write perfectly good software. The issue is they’re trying to solve extremely hard problems that may not actually be solvable.
Yes, he did come over; it was weird actually. Ron Gross from Mastercoin said “you should employ this guy as chief scientist”, which I could not understand, so we invited him over while a couple of the guys from the community were here. We did not get into any system detail in depth though, which is a shame. Peter said he had not read any of the docs or code etc., but did offer some attack scenarios on decentralised networks in general. The ISP taking over the entire network, or pretending it is the entire network, is one; it’s in the system docs. Also the Google attack of becoming a huge % of nodes, as well as sybil attacks in general. He did think we had gone overboard with self-encryption, as AES was enough on its own, but I have heard lots of people with that opinion; at least we can be accused of being too secure.
I think Peter’s reason for the visit was not what we had thought at all; he seemed to think it was to evaluate us for an investor, or the “bitcoin wizards” as he called them. I am not sure what the reason actually was. He seemed like a smart enough chap though, so with some further reading it may have been more constructive.
In any case, even without any detailed info he is right that we are solving very hard problems, many of which are now done, measurable and provable; beyond that it would be difficult to estimate what we cannot solve. As for unsolvable problems, well, that’s when I get excited: a great deal of real innovation is solving what was previously unsolvable. I am pretty sure Bitcoin did that.
These problems aren’t unsolvable, just unsolved. They are hard problems which is why so many people sent money to the MaidSafe team. It’s why we are here discussing these problems.
As problems arise the MaidSafe team can rely on the ingenuity and creativity of the community. Highly skilled programmers can be found and hired to implement whatever we come up with.
Honestly why would anyone want to work on easy problems to begin with? It’s boring to work on easy stuff and the fact that building a decentralized Internet is very hard is what will attract so many bright minds to it.
This is a fair point, but a huge part of being an experienced developer is understanding how big a chunk to bite off at once. For example, you might think from WG21 N3425 (Concurrent Unordered Associative Containers for C++) that std::concurrent_unordered_map<K, T> was a done deal, but I’ve spent the last four weeks after hours on trial and error, and I believe I can improve on it (mainly by making erase() safe) at the cost of iteration and size() becoming slow.
Now, that was four weeks of my time, and a fifth is to come as I write a final version. Even very elementary stuff can be extremely tough: the combined resources of Microsoft, Google, Apple, Intel and all the major multinationals hadn’t done better than N3425 before.
Everybody is talking about a “Google attack”; I’m actually more concerned about an “NSA attack”, because they have more storage, money and maybe even smarter people. Maybe the “farmer” setup should be different. Maybe the network should start with an invite scheme, a storage limit, SMS verification, a payment of, say, 10,000 Safecoins to start a farmer’s account, and to top it off a cap of 10,000 Safecoins earned per month. This sounds harsh, but it may be a way to have only really dedicated people farming. The 10,000 Safecoin entry fee could also go to R&D of the Maidsafe system, to make it sybil resistant and “NSA attack” proof.
I think that if you have to pay to become a farmer, you’ll be less likely to launch a sybil attack. It’s true that you’ll be making money temporarily, but you’ll also be paying for the R&D to fight it in the future. If you got a storage limit…
@nikster, good that you started this discussion, because it is really important.
I mentioned this in another thread, but I think people’s real concern is about transparency. If you’re working alone on your own dime, you’re not indebted to anyone. But if you raise money, then you’re indebted under whatever guise the money was raised.
So it’s mostly about how clear the goal, intention, and feasibility of the project is.
It’s as much on the investors to research the feasibility of the project as it is on the project’s creators to make crystal clear what is within their grasp, what is rough prototype or proof-of-concept, and what is still in an undeveloped R&D phase. Especially if any part of the project is a known hurdle in computer science.
The main issue people have is that, at the very foundation of the project, Proof of Resources and Proof of Redundancy are the real technological advancement. Is there a roadmap to solving them that the team believes is feasible, or is there still something not yet prototyped and strictly in R&D? That would make a huge difference to people donating/investing. It’s a huge indicator of risk and return.
They’re concerned whether any of these distributed storage projects are being dishonest about the feasibility of the whole. Did they raise money simply to support the team while they execute a set timeline, or is the money needed for research into a potentially insurmountable issue in computer science? Everyone needs to do the best they can to accurately represent scope.
Making that clear is MASSIVE.
Also, I personally think all of these open-source projects crowdfunding money desperately need to hire communication experts to translate the complex ideas into publicly-consumable bites. A big problem with all these projects is communication. Watching the founders miscommunicate their ideas with bad grammar, spelling errors, or syntactically confusing messages is disheartening because I know they’re just not good at expressing the real depth of their thoughts. And I think tech people need to understand that. Just like they’re good at code and the science behind it, there are people who are experts in communication that aren’t just being manipulative. They’re a tool, like a framework or a library.
I know you are talking in general, @russell, but here is a stab at some data on this from our perspective at least. I think we are probably a bit different due to the time at it and the amount of ‘stuff’ we have around us. So probably agreeing to a great extent, but even we have what I feel are hard issues to deal with in terms of communication. I do see some of these projects now use the term farmers and talk of data shards across machines etc., so that is good I suppose. Although I have seen hashing being touted as encryption, which is a red flag and a shame.
I agree. We have patents describing this since 2006, as well as a high-level description in the systemdocs (http://maidsafe.net/documents); proof of retrievability and redundancy were implicit long before these phrases came about. Proof not only of redundancy but of corruption-free storage is handled as well. It all depends on an accurate DHT, then being able to identify nodes (a PKI-type network), as well as securely encrypted data on a public network.

For self-authentication to operate, the network needs to be anonymous, and that means no people, therefore autonomous. People really miss this point, and it is a huge one: if any system has a login that is not controlled by a fully autonomous system, then it’s simply not private. Then, if you can log in and create an account anonymously (one third of the issue), and can secure data in a way that protects it even if the crypto algorithm breaks, and do so with real-time de-duplication (arguably a nice-to-have feature), then you have another third of the system.

After all that there lies the network and how it operates; it is very like describing a life-form or species in many ways. A truly autonomous system is not sentient, but should be closer to that than any system that requires servers and people. So this part is very difficult to describe.
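On the de-duplication point: Maidsafe's actual self-encryption scheme is more involved than this, so purely as a hedged sketch, here is the basic convergent-encryption idea showing how encryption and network-wide de-duplication can coexist. The key is derived from the content itself, so identical chunks encrypt to identical ciphertexts (de-dupable by address) while remaining unreadable to anyone without the content. The stream cipher is a toy, not production crypto.

```python
import hashlib

def toy_xor(key: bytes, data: bytes) -> bytes:
    """Toy SHA-256 counter-mode keystream cipher (illustration only)."""
    out = bytearray()
    for i in range(0, len(data), 32):
        ks = hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], ks))
    return bytes(out)

def convergent_encrypt(chunk: bytes) -> tuple[str, bytes]:
    """Key derived from the content itself: identical chunks produce
    identical ciphertexts, so the network can store each chunk once."""
    key = hashlib.sha256(chunk).digest()
    ciphertext = toy_xor(key, chunk)
    name = hashlib.sha256(ciphertext).hexdigest()  # network address of the chunk
    return name, ciphertext

name1, ct1 = convergent_encrypt(b"identical chunk" * 100)
name2, ct2 = convergent_encrypt(b"identical chunk" * 100)
print(name1 == name2)  # True: two users uploading the same chunk de-duplicate
```

A user keeps the content-derived keys in their own (encrypted) data map; anyone holding only the named ciphertext learns nothing about the plaintext.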
So we have
1: A ton of papers and citations on google scholar
2: White papers (2 require updating and publishing which are self encryption and PKI)
3: Published papers
4: I wrote over 2000 pages on the whole system and these were presented to the patent office and published (and granted worldwide). The UKPTO hated me for that as it was one of the largest submissions they had ever received apparently. (2007)
5: The code and not only the code but included tests/examples/measurements/attack simulations etc.
6: A github repository of network simulations carried out over three 6-month doctorate research programs (it’s called Simulation and can be set up and run in OverSim)
7: This forum + a ton of answered questions on reddit etc.
8: A load of interviews and press http://maidsafe.net/press
9: Qi and Fraser did video blog posts on the Maidsafe blog back last year showing these very aspects of proof of storage and retrievability, with both voice commentary and transcripts
Almost none of the above is enough, or of good enough quality, as critics generally do not read them AFAIK.
I think, though, it is beyond us to describe it very well. I suppose Google would have had a problem explaining PageRank to their early investors as well; even with a paper, folk would not read it. Same for MapReduce etc.
I used to describe it like this: say you are a chemist/scientist and develop a cancer cure. Do you:
1: Require all investors to understand all the chemicals, and if they cannot, educate them to understand them all and their interactions? You must do this in a short paper or sentence, though.
2: Make sure you are peer reviewed as much as possible, have universities across the world agree with you, get investors who can see and check this, be open, and jab somebody in the arm with your cure after investors have been able to watch all the tests leading up to that ‘launch’? Then do people even need to know what the constituent parts and their interactions are? [PS: peer review is not always good; relativity theory failed at this stage]
So while I agree communications are vital, I also think there is a desire for non-technical people to be educated and to understand new cryptographic concepts in a sentence or headline. I am not sure that is always possible, or even valid. It seems the crypto community wants to find a big phrase and say “we do that!” Then people can repeat the phrase (blockchain clockchain, POW POW POS etc.) without really understanding it.
So a balance between having too much info and a tested/working codebase with proof points and public visibility is the struggle for many. I do fear the quick-script systems when crypto etc. is involved, though. You do not want to mash together things that require implementing security; that’s a recipe for disaster.
So here is proof of retrievability and zero corruption as an example:
A group of nodes with authority (being the closest on the network to the address of the data) create an agreed pseudo-random data element. They send this, encrypted, to the holders of the data via the holders’ closest nodes, who act as arbitrators. The holders then concatenate this random piece to the data they claim to have, perform a cryptographically secure hash on the result, and return it. This result is analysed on receipt to ensure all holders agree; holders out of agreement are de-ranked by their managers (the arbitrators previously mentioned). The code for this is in our public repositories. On complete failure the data can be retrieved, and its very creation from a serialised stream performs a local integrity check (the name of the data must match the hash of the content piece). This is checked at every node along the route.
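The challenge-response at the heart of that paragraph fits in a few lines. This is a minimal sketch of the described mechanism, not the actual Maidsafe code: a fresh agreed nonce is concatenated to the stored data and hashed, so a holder cannot precompute answers without really holding the data, and the managers only compare answers rather than needing the chunk themselves.

```python
import hashlib
import os
from collections import Counter

def holder_response(stored_chunk: bytes, nonce: bytes) -> str:
    """A holder concatenates the agreed random element to the data it
    claims to hold and returns the hash; without the real data it
    cannot produce the right answer for a fresh nonce."""
    return hashlib.sha256(stored_chunk + nonce).hexdigest()

def audit(responses: dict[str, str]) -> list[str]:
    """Managers compare the answers: holders out of agreement with the
    majority are candidates for de-ranking. No manager needs the chunk."""
    majority_answer, _ = Counter(responses.values()).most_common(1)[0]
    return [holder for holder, answer in responses.items()
            if answer != majority_answer]

chunk = b"file chunk contents"
nonce = os.urandom(32)  # agreed pseudo-random element from the close group
responses = {
    "holder-1": holder_response(chunk, nonce),
    "holder-2": holder_response(chunk, nonce),
    "holder-3": holder_response(b"stale or missing data", nonce),
}
print(audit(responses))  # ['holder-3']
```

Because the nonce changes every round, caching a previous answer or storing only the hash of the chunk is useless to a cheating holder.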
OK, I am sure somebody could explain it better, but here is the conversation that follows (remember, all this is documented):
1: How do we know who is close to the address? They could merely claim to be, then ask for a de-rank on an innocent computer.
Answer: needs to go into cryptographically secured addresses. So it’s basically public key authentication.
Immediately goes to the next question:
2: How do I know it’s your public key?
Answer: needs to explain the PKI mechanism, where there are data types that are the public key that hashes to your ID.
Immediately → who gave you the key?
Answer: needs to explain the bootstrap process etc., plus the keychains for revocation and then transfer to a client as part of their keychain, so they can prove they owned this vault. Then this goes way down a rabbit hole into consensus groups and chains of keys for different purposes etc.
3: This can be sybil attacked, and a bunch of other headline stuff.
Answer: needs to go into ranking and how this is calculated and earned.
Immediately: The nodes will go off; how can you not lose files?
Answer: needs to go into replication, and how many torrent-type things show DHTs working at massive scale etc.
Immediately: What if I then just delete the data?
Answer: getting into rank again, which is the measure of proof of resource and retrievability.
Immediately: How are they close? I can make a close node … I can … I can … and on to infinity we go →
This goes on for weeks, when a couple of hours of reading the papers would have answered it, or at least shown where to get the answers.
This has to be repeated every time, with the questions on a different aspect of a huge system.
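The PKI point in that exchange (data types that are the public key hashing to your ID) is one of the few parts that is easy to show. This is a hedged sketch under my own naming, not Maidsafe's actual key handling, and signature verification is omitted: the idea is simply that a node's address is the hash of its public key, so anyone can check the binding with one hash and no certificate authority.

```python
import hashlib

def node_id(public_key: bytes) -> str:
    """Self-validating ID: the network address IS the hash of the key."""
    return hashlib.sha256(public_key).hexdigest()

def claims_check(claimed_id: str, presented_key: bytes) -> bool:
    """Anyone can verify a claimed ID against the presented key with a
    single hash; no registrar or human involvement is needed."""
    return node_id(presented_key) == claimed_id

# 'pubkey' is a stand-in for any serialized public key bytes.
pubkey = b"-----BEGIN PUBLIC KEY----- (any serialized key bytes)"
nid = node_id(pubkey)
print(claims_check(nid, pubkey))             # True: key matches the address
print(claims_check(nid, b"some other key"))  # False: cannot squat the address
```

Messages signed by the matching private key then prove control of that address, which is why "who gave you the key?" bottoms out in the bootstrap and keychain mechanisms rather than in any trusted third party.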
I think the answer is that we have tons of docs, but no technical author, as was previously discussed here. I think the nuts and bolts (proof of anything) need a technical author (we have meetings with some next week, I think). Then the higher-level parts are likely best managed by this community itself. Failing that we could get a PR company involved, but I am very wary of such things to be honest; that’s just me, but I feel perception control etc. is a bad thing. Again though, always open to suggestions and especially help (nudge nudge, Russell).
I was hoping the systemdocs might do the trick, and where they were not clear enough people could type in a comment or suggestion, even if it was “get this written by an adult”. I think, though, that if they at least covered all the points that could be raised it would help; but perhaps getting launched will help even more? Again, part of the 24-day limitation just now. I suppose it is a balance.
I think that there’s no “let’s do this once (system docs, or comprehensive FAQ…) and all community questions will stop” – moment in time. That’s an unrealistic expectation.
The truth is that there are many users with different backgrounds (non-techy, semi-techy, through to cryptography experts), coming in at various points in the process, and there will always be questions. As you said, you are building a very complex, revolutionary, unique system; from the outside looking in, it would be more of a surprise if people didn’t have a myriad of questions for the community (I guess that would mean they don’t care, which is worse) or if they just “got it” instantly by reading a bunch of docs (which would mean it’s a highly simple technology). And I’m not talking about trolls, but noobs with honest noob questions and experts with honest expert questions.
Remember, even the Bitcoin whitepaper, as elegant as it was, didn’t stop the questions. It was the beginning of a discussion – that’s going on up until now.
So I think it comes down to this, which you allude to, David:
Not enough people in the community right now are knowledgeable enough to answer technical questions on a deep level (at least to your level of knowledge) to take the burden off yourself and the team.
For example, on r/bitcoin they have “moronic Mondays”. Because many people are ingrained in the tech and already highly knowledgeable, the community itself confidently answers questions from everybody about anything under the sun, from the complex to the most basic stuff. There’s also an r/bitcoinbeginners, just a friendly sub with folks dedicated to initiating people into the tech and getting them up to speed. I see that @dyamanaka started an AMA, which I think will be very helpful for the community.
The situation right now with Maidsafe is as if Gavin Andresen himself were on reddit, 24/7, answering every question about how the blockchain works. I can see that getting tiring.
Maybe that community expertise will come with time. Maybe it just needs to be more of an organized or siloed effort. But I agree with you. No PR companies please.
In the end, the approach is not “stop the questions, you people - just read the docs!” but rather to absorb it all organically. Absorbing all those questions is equal to absorbing more users, which in the long run is what you want for the success of the project.
Yes, these are the people we need to inform properly and at the right level. I completely agree with your summation. It’s separating real folk from trolls and buzzword-touting egomaniacs that is hard for me. It goes with the turf though; I would choose no other path, but I would get a comfier chair to fall asleep in.
I consider myself pretty tech savvy and learn things quickly. I’m still learning about Project SAFE everyday. I had a 3 hour chat on MaidSafe Mumble yesterday. If my amateur knowledge helped +1 @Ryan so he could share his new knowledge with another +1, that would be helpful.
While it would be nice for the entire community to get their education directly from the source (MaidSafe Team / @dirvine), it would be more productive to get the Network Launched first, then start showing examples, answering questions, teaching new dev pods, etc.
This will not appease everyone, but we have to be practical. MaidSafe does not have special focus departments compared to big companies. I’m worried about you and your Team burning out. If I’m out of line, I apologize. Project SAFE still has my support to the best of my ability.
Cheers Mark. It was really contained in the answer above; what I was (badly) showing is that there is no single paper on each aspect, but the papers together are probably all required. The code is right there, so the likes of Bitcoin or C++ devs should be able to read it. The problem is that it works on an autonomous network, so that needs explaining (and that is a huge job); then it requires an explanation of how data is distributed (randomly, but evenly across the network, until rank moves it, which it does). I think this is the real issue: too many inventions in one system that all interact with each other. So telling somebody a little bit they can understand is really, really hard. It either gets met with an “Oh!”, or for critics they start down the road of “what about X?”, we reply Y, and it goes on forever until you need to explain the whole system; well, that is how I seem to find it anyway. The Data Managers’ job is to manage copies, and this is redundancy; they know it is true because the managers surrounding the data holder can attest to it. I really wish it were easier to explain though.