Tiered Storage in Farming Rig

Hi,

The box I have dedicated for use in MaidSafe farming runs in a VM, & it has 2 virtual disk images available to it, each with vastly different speeds.

One is 32GB on an SSD, & the other is 360GB on a SATAIII HDD connected via USB3. The OS the MaidSafe software will be installed to is on its own 8GB SSD-based virtual disk.

In the event the MaidSafe software bases its benchmarks on the lowest performance score of the box, I’m afraid the SSD’s added quickness will be nullified by the HDD that’s attached.

So my question is: will the MaidSafe software provide a way to account for the tiered storage model present in the box?

If so, my plan is to provide 2 levels of storage to the network: as much SSD space as I can afford to add to the host, with larger allocations provided by cheaper HDDs. A 6TB HDD & a 512GB SSD are about the same price, & I’d like to balance space with speed & provide both, if it’s supported such that this strategy makes sense.

Might I be better off disconnecting the HDD disk to prevent my benchmark from being dragged down by its slowness, & focus my upgrades on providing SSD space exclusively?

Please advise.

Thanks,
-F

He who responds first with a chunk is most likely to earn the most coin. You can go for either volume or latency.

As far as I am aware no benchmarking is done; it’s basically hit ratio. If your farm serves the most chunks to the most people, you earn. Centrally located dedicated servers with SSD-cached ZFS stores would probably be superior to all others.

Niall

All of that has already been discussed on this forum.

Tiering is irrelevant for MaidSafe setups except in the few situations similar to those that @ned14 mentions (and which don’t apply to mom and pop farmers).

One of the big reasons I did the port of the MaidSafe codebase to FreeBSD was to avail of ZFS. Our storage layer is actually very, very simple at present: it literally stores each chunk as a file by opening a new file and blatting the chunk into it. Similarly, to read a chunk we open a handle to its file and blat the chunk into memory. We don’t currently even use memory-mapped i/o; we just use the standard iostreams file i/o built into the C++ standard library.
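For readers who want to picture that storage layer, here is a minimal sketch of "one chunk per file" using only iostreams. It assumes chunks are stored flat in a single directory under their names; `StoreChunk`, `LoadChunk` and `DeleteChunk` are hypothetical names for illustration, not MaidSafe's actual functions.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helpers mirroring the "one chunk per file" layout described
// above; the names are illustrative and are not MaidSafe's actual API.

// Writes the chunk bytes to <dir>/<name>, replacing any existing file.
// No fsync() is issued, so the data may still be sitting in the OS page cache.
bool StoreChunk(const std::string& dir, const std::string& name,
                const std::string& bytes) {
  std::ofstream out(dir + "/" + name, std::ios::binary | std::ios::trunc);
  if (!out) return false;
  out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
  out.close();  // flushes the stream buffer; sets failbit on error
  return !out.fail();
}

// Reads the whole of <dir>/<name> into `bytes`; returns false if it is missing.
bool LoadChunk(const std::string& dir, const std::string& name,
               std::string& bytes) {
  std::ifstream in(dir + "/" + name, std::ios::binary);
  if (!in) return false;
  bytes.assign(std::istreambuf_iterator<char>(in),
               std::istreambuf_iterator<char>());
  return true;
}

// Deletes <dir>/<name>; returns false if the file could not be removed.
bool DeleteChunk(const std::string& dir, const std::string& name) {
  return std::remove((dir + "/" + name).c_str()) == 0;
}
```

Because each chunk is just a file, any tiering or caching has to come from the filing system underneath, which is the point made next.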

As a result, with the present extremely naive storage layer there is no concept of storage tiering, or of anything fancy at all. You basically need a local, non-distributed filing system, and that’s it, so performance is heavily influenced by what a single non-tiered filing system can deliver. Of all the local filing systems presently available, the clear performance leader is ZFS. It can provide truly astonishing synchronous write performance (i.e. write a chunk and fsync it, which murders most filing systems), it makes very good use of any RAM you feed it as a read cache (it’s far, far better at this than ext4 or NTFS), and it self-heals bitrot, which means the SAFE network won’t penalise your node for serving corrupted chunks.
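To make "write a chunk and fsync it" concrete, this is a minimal POSIX sketch of a synchronous chunk write: the call only reports success once `fsync()` has returned. It is an illustration of the I/O pattern a fast ZFS log device accelerates, not MaidSafe code; `StoreChunkDurably` is a hypothetical name.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Hypothetical durable write: the chunk counts as stored only after fsync()
// has asked the kernel to push it to stable storage. Every call pays a full
// round trip to the disk (or to the ZFS log device), which is why this pattern
// is slow on most filing systems.
bool StoreChunkDurably(const std::string& path, const std::string& bytes) {
  int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return false;
  const char* p = bytes.data();
  std::size_t left = bytes.size();
  while (left > 0) {  // write() may write fewer bytes than requested
    ssize_t n = ::write(fd, p, left);
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted by a signal: retry
      ::close(fd);
      return false;
    }
    p += n;
    left -= static_cast<std::size_t>(n);
  }
  bool ok = ::fsync(fd) == 0;  // block until the data is reported durable
  ok = (::close(fd) == 0) && ok;
  return ok;
}
```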

Hence my guess is that, for those looking into big iron farming, the sole choice of OS will be FreeBSD and the sole choice of storage will be ZFS paired with a fast SSD write intent log device. Hence my port to FreeBSD 10, which we now regularly CI test to ensure we don’t break things.

Niall

Thanks for sharing! I think people generally prefer to “cheat” by buffering writes, which can get expensive when a power outage strikes, but for those who can run FreeBSD (or a Linux OS with ZFS; I think there are a few, assuming it’ll be possible to build MaidSafe on them) ZFS would be a great choice.

(FWIW, I think the question was framed to cover both software and “hardware” based tiering solutions such as smart RAID cards and things of that nature.)

Thing is, the SAFE network (currently) probably rewards storage capacity over access speed. At a technical review meeting some months ago I pitched a notion: what if someone bolts super-cheap cloud storage like Hubic (€1/month/1TB), which has enormously high latency, onto a farm of MaidSafe front ends, probably all running on very cheap €2/month VPSs?

As far as I currently know, we don’t actually know whether that would earn more profit than a farm of FreeBSD + ZFS nodes, nor indeed whether it should. The latter would earn more in absolute terms, sure, but it also costs a lot more, especially in electricity, network bandwidth and colocation (the hardware itself isn’t actually all that expensive, relatively speaking). The former has lousy ~2000ms access times, but it is dirt cheap in monthly costs because bandwidth is so cheap (the first 10TB per node per month is free) and electricity and colocation are free.

Right now SAFE doesn’t disambiguate - he who serves first wins. But whether lower latency or having a chunk at all wins overall I just don’t know. I personally suspect that more capacity, even if colder, wins.
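The "he who serves first wins" rule can be modelled as a toy race. This sketch assumes each candidate vault either holds the chunk or not and answers after a fixed latency; `Vault`, `latency_ms` and `FirstResponder` are hypothetical names, and the real network's routing and reward logic is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One candidate vault in the toy model.
struct Vault {
  bool has_chunk;     // a vault without the chunk cannot answer at all
  double latency_ms;  // time until its answer arrives
};

// Returns the index of the winning vault (the fastest one that holds the
// chunk), or -1 if no vault holds it.
int FirstResponder(const std::vector<Vault>& vaults) {
  int winner = -1;
  for (std::size_t i = 0; i < vaults.size(); ++i) {
    if (!vaults[i].has_chunk) continue;
    if (winner < 0 || vaults[i].latency_ms < vaults[winner].latency_ms)
      winner = static_cast<int>(i);
  }
  return winner;
}
```

In this model a slow, cloud-backed vault still wins every request for a chunk that no faster vault holds, which is the intuition behind capacity possibly beating latency.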

I am not a believer in hardware RAID - I have never had anything but trouble with it. ZFS gives you RAID for free, is far more resilient to drive drops, and resilvering is not brittle.

Regarding whether to use ZFS on Linux, well, Lawrence Livermore National Laboratory claims it is production ready, even though it’s still a custom kernel build. I think, though, that if you are dedicating hardware to the SAFE network you won’t hugely care what OS it runs, so long as it is trouble free and especially secure against attack, as people will be trying to nick your coin. That is definitely BSD any day over Linux. I might add that modern BSDs can run native Linux binaries with ease: they emulate Linux when they see a Linux executable.

Niall

I am contracted to MaidSafe, so it’s in the release product.

The port makes no use whatsoever of any BSD-specific functionality. Indeed, the storage backend is presently extremely simple: it literally stores one chunk per file in a directory. The files are read and written using the C++ STL.

In the much longer run, we may make use of error correcting encryption such that cosmic ray bitflips don’t cause unnecessary refetches of chunks whose SHA no longer matches their content. As that’s a 10^-14 event, it isn’t a priority. Besides, a ZFS store scrubs such bitrot anyway.
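The "SHA no longer matches their content" check can be sketched as content verification on read. To keep the example short it uses 64-bit FNV-1a, a tiny NON-cryptographic hash, purely as a stand-in for the SHA digest the network actually uses; `Fnv1a64` and `ChunkIsIntact` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 64-bit FNV-1a. Stand-in only: it is NOT collision resistant, unlike SHA.
std::uint64_t Fnv1a64(const std::string& data) {
  std::uint64_t h = 14695981039346656037ull;  // FNV offset basis
  for (unsigned char c : data) {
    h ^= c;
    h *= 1099511628211ull;  // FNV prime
  }
  return h;
}

// A stored chunk is trusted only if its content still hashes to the digest
// it was stored under; a mismatch (e.g. after a bit flip) means the chunk
// would have to be refetched from another holder.
bool ChunkIsIntact(const std::string& content, std::uint64_t expected_digest) {
  return Fnv1a64(content) == expected_digest;
}
```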

Niall

Interesting, thanks for the clarification. Would it be possible to explain your architecture in a little more detail? E.g., does the STL call go to the MaidSafe-API, Vault or Drive module? Why did you use a ZIL?

Just wondering how easy/difficult it would be to incorporate the MaidSafe client directly into a BSD/ZFS engine either as a PlugIn or Jail? If that is possible, could we then also install the MaidSafe Vault module inside ZFS and present the data pool as a NFS share directly to the MaidSafe Client? ie, run the Farming engine as a self-contained BSD/ZFS module with no need for any other OS or file system?

Neil

I actually don’t know which library does the chunk storage, it isn’t my bit. @Fraser can probably tell you straight away.

Regarding a ZIL, as I mentioned, MaidSafe code has zero awareness of anything filing-system specific. It just reads and writes chunk files in a directory using std::ifstream and std::ofstream. If you configure your ZFS pool with a separate ZIL device, it will of course be used for synchronous writes, but be aware that I don’t believe MaidSafe code ever calls fsync() or opens files with O_SYNC, and therefore the ZIL would not be used.

I can’t think of any reason it wouldn’t run inside a BSD jail. Much of our CI testing is already done in OpenVZ and KVM containers on Linux.

Niall

Thanks for the swift reply … glad to see you’re not dependent upon ZIL. With a large enough SSD pool (8TB) I typically just utilize L2ARC and DRAM for hot data, then have a second large HDD storage pool for cool data, and let ZFS smart caching handle the rest.
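The hot/cool split described here can be sketched as a two-tier store: a small fast tier caching a large slow tier that always holds every chunk. This is an assumption-laden illustration: eviction is plain LRU, whereas ZFS's ARC is an adaptive replacement cache that balances recency and frequency. `TieredStore` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Toy two-tier chunk store: fast tier (think SSD/L2ARC) in front of a slow
// tier (think HDD pool). Plain LRU eviction, NOT ZFS's ARC algorithm.
class TieredStore {
 public:
  explicit TieredStore(std::size_t fast_capacity) : capacity_(fast_capacity) {}

  void Put(const std::string& name, const std::string& bytes) {
    slow_[name] = bytes;  // the slow tier always holds a copy
    auto f = fast_.find(name);
    if (f != fast_.end()) {  // drop any stale cached copy
      lru_.erase(f->second.second);
      fast_.erase(f);
    }
  }

  // Returns true and fills `out` if the chunk exists; `fast_hit` reports
  // whether it was served from the fast tier.
  bool Get(const std::string& name, std::string& out, bool& fast_hit) {
    auto f = fast_.find(name);
    if (f != fast_.end()) {
      lru_.splice(lru_.begin(), lru_, f->second.second);  // mark most recent
      out = f->second.first;
      fast_hit = true;
      return true;
    }
    auto s = slow_.find(name);
    if (s == slow_.end()) return false;
    out = s->second;
    fast_hit = false;
    Promote(name, s->second);  // a read makes the chunk "hot"
    return true;
  }

 private:
  void Promote(const std::string& name, const std::string& bytes) {
    if (capacity_ == 0) return;
    if (fast_.size() == capacity_) {  // evict the least recently used chunk
      fast_.erase(lru_.back());
      lru_.pop_back();
    }
    lru_.push_front(name);
    fast_[name] = std::make_pair(bytes, lru_.begin());
  }

  std::size_t capacity_;
  std::list<std::string> lru_;  // front = most recently used
  std::unordered_map<std::string,
                     std::pair<std::string, std::list<std::string>::iterator>>
      fast_;
  std::unordered_map<std::string, std::string> slow_;
};
```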

Have you by any chance tested your current port on FreeBSD 9.3? … any gotchas?

Neil

Yes, FreeBSD before 10 doesn’t have enough of the prerequisites. In particular, we need a compiler far newer than GCC 4.3, and we need 10’s libfuse implementation.

I’m sure something could be made to compile for FreeBSD < 10, but I can’t see us supporting it.

Niall

Thanks for the heads up … will see if I can get V 10 up and running.
Neil

The Encrypt module reads data and Drive writes data (through Nfs, which uses a client routing object), if that helps. That is for the network side. For a vault, the data-storing persona is the PmidNode, so the physical storage is in the vault’s PmidNode persona: https://github.com/maidsafe/MaidSafe-Vault/tree/next/src/maidsafe/vault/pmid_node

Thanks for the clarification … where do you think the bottleneck will be as the chunks egress the network I/O port and traverse the farming vault? It seems a fairly simple data pipeline for reads and writes to the logical volumes.

Are the I/O and R/W processes all multi-threaded and 64 bit optimized?

Neil

We keep designing to ensure this is per well-behaved node as opposed to per larger device (thereby opposing centralisation, where those who can afford to, can do). I hope this continues.

[Quote]where do you think the bottleneck will be as the chunks egress the network I/O port and traverse the farming vault? It seems a fairly simple data pipeline for Read and Writes to the logical volumes.[/Quote]

My personal best guess is latency. I think it’ll take tens of milliseconds to get data from the drive to the network port. We also copy and scan data far too frequently which pummels the CPU caches, though with the rewrite of RUDP the most egregious of that is fixed.

I mocked up a GPU offload for the crypto and chunking, and while it delivered a 6x to 10x improvement on today’s hardware for large files, next year CPUs gain dedicated hash offload instructions which reduce the benefit very significantly. I had to reluctantly conclude it wasn’t worth the effort, especially as it would add even more latency.

[Quote]Are the I/O and R/W processes all multi-threaded and 64 bit optimized?[/Quote]

Like many new codebases, we used too many threads :( which is the bad influence of Java. Threads are great when you need concurrency in the CPU or the kernel. For anything else they are inferior to coroutines. Don’t get me wrong, most C++ code bases make the same mistake; indeed, Microsoft only just moved to an M:N threading model with WinRT.

Last year MaidSafe took on a number of Boost C++ library engineers, and that created the capacity to start using coroutines. RUDP v2 and Routing v2 will be mostly coroutine based. That will eventually affect the rest of the code base, but we are somewhat limited by the lack of tooling (C++ 17 will have coroutines built in; right now we have to emulate them, as only Microsoft’s compiler supports the proposed features). Indeed, David and I are currently figuring out how best to do that emulation.
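One common way to emulate stackless coroutines without compiler support is a switch statement that jumps back to the last suspension point; Boost.Asio's `coroutine` class uses a similar trick. (Standard coroutines eventually arrived in C++20.) The sketch below is hand-written for illustration, not MaidSafe's emulation; `ChunkSender` and `RoundRobin` are hypothetical names. It shows several tasks interleaved cooperatively on one thread instead of one thread per task.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stackless coroutine emulated with a switch: all state that must survive a
// suspension lives in members, never in locals.
class ChunkSender {
 public:
  explicit ChunkSender(std::size_t chunks) : chunks_(chunks) {}

  // Runs until the next suspension point. Returns true if it produced a chunk
  // index in `out`, false once it has finished.
  bool Resume(std::size_t& out) {
    switch (state_) {
      case 0:
        for (next_ = 0; next_ < chunks_; ++next_) {
          out = next_;
          state_ = 1;
          return true;  // suspend: the caller may now run another coroutine
      case 1:;          // the next Resume() jumps straight back here
        }
        state_ = 2;     // finished; any later Resume() returns false
    }
    return false;
  }

 private:
  std::size_t chunks_;
  std::size_t next_ = 0;
  int state_ = 0;
};

// Cooperative scheduler: resumes each coroutine in turn on ONE thread until
// none of them makes progress, recording every value they yield.
std::vector<std::size_t> RoundRobin(std::vector<ChunkSender>& tasks) {
  std::vector<std::size_t> trace;
  bool progressed = true;
  while (progressed) {
    progressed = false;
    for (auto& t : tasks) {
      std::size_t v;
      if (t.Resume(v)) {
        trace.push_back(v);
        progressed = true;
      }
    }
  }
  return trace;
}
```

With two tasks sending 2 and 3 chunks, the trace interleaves them as 0, 0, 1, 1, 2, all without a second thread.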

As far as 64-bit optimization goes, we should use more memory maps on 64 bit than we do. But it’s no show stopper, and it keeps 64 bit closer to 32 bit for testing. Otherwise, as with all new code bases, everything is 64-bit safe.
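For comparison with the iostreams approach, this is a minimal POSIX sketch of reading a chunk through a read-only memory map, so the bytes are used in place rather than copied through a stream buffer. It is POSIX-only and `MappedChunk` is a hypothetical name; as a simplifying assumption, an empty file is treated as not mapped because `mmap` rejects zero-length mappings.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// RAII wrapper around a read-only mapping of one chunk file.
class MappedChunk {
 public:
  explicit MappedChunk(const std::string& path) {
    int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) return;
    struct stat st;
    if (::fstat(fd, &st) == 0 && st.st_size > 0) {
      size_ = static_cast<std::size_t>(st.st_size);
      void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
      if (p != MAP_FAILED) {
        data_ = static_cast<const char*>(p);
      } else {
        size_ = 0;
      }
    }
    ::close(fd);  // the mapping stays valid after the descriptor is closed
  }
  ~MappedChunk() {
    if (data_) ::munmap(const_cast<char*>(data_), size_);
  }
  MappedChunk(const MappedChunk&) = delete;
  MappedChunk& operator=(const MappedChunk&) = delete;

  bool ok() const { return data_ != nullptr; }
  const char* data() const { return data_; }  // bytes live in the page cache
  std::size_t size() const { return size_; }

 private:
  const char* data_ = nullptr;
  std::size_t size_ = 0;
};
```

On 64-bit systems the address space is large enough to keep many such mappings open at once, which is the main reason mapping is more attractive there than on 32 bit.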

Niall

Coo, seems the WIMP class libraries I built to learn C++ back in 1988-9 also incorporated concurrent task handling using… coroutines! And I never even knew it… bit late for a patent huh ;)

Could you elaborate a little more on the technical descriptors of the ‘well behaved node’ that you hope to see propagated across the farming community? … the concept seems to permeate much of your design philosophy and the underlying architecture of SAFEnet.

Neil

In the code we imagine every line is compromised and run by an attack node. So a well-behaved node is one that follows the rules of the network. Its responses are signed and measured, and failing to follow the rules means being de-ranked. Lose enough rank and the node is considered not worthwhile; its group then disconnects from it and never reconnects.
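The de-ranking rule can be sketched as a small bookkeeping class kept by a group. Everything numeric here is an assumption: the starting rank, the drop threshold and the +1/-5 weights are hypothetical, signature checking is not modelled, and `GroupView` is an illustrative name rather than anything in the codebase.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Toy model of a group tracking the rank of the nodes it talks to.
class GroupView {
 public:
  GroupView(int start_rank, int drop_threshold)
      : start_rank_(start_rank), drop_threshold_(drop_threshold) {}

  // Records one measured response from `node`. Returns false if the node is
  // (or has just become) disconnected.
  bool Record(const std::string& node, bool followed_rules) {
    if (banned_.count(node)) return false;  // never reconnect
    auto it = rank_.emplace(node, start_rank_).first;
    it->second += followed_rules ? 1 : -5;  // hypothetical weights
    if (it->second <= drop_threshold_) {    // not worthwhile any more
      rank_.erase(it);
      banned_.insert(node);
      return false;
    }
    return true;
  }

  bool Connected(const std::string& node) const {
    return banned_.count(node) == 0;
  }

 private:
  int start_rank_;
  int drop_threshold_;
  std::unordered_map<std::string, int> rank_;
  std::unordered_set<std::string> banned_;
};
```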

Will the use of SSD storage significantly reduce that latency factor?

Do you foresee the possibility of the development of coroutine specific ASICs and the introduction of mega node storage farms along similar lines that beset BitCoin with the advent of large hashrate mining contributors with heavy capital investments in low energy cost locations?

I guess my question is related to David’s notion of the ‘well-behaved node’ and how MaidSafe is going to protect the intrinsic distributed nature of SAFEnet and avoid the system being hijacked by large, capital-intensive agribusiness reminiscent of the Enclosure movement of the Agricultural Revolution.

Neil