Object storage prior art and lit review

From March 2018, this is an interesting article by Casey Bisson:


Some extracts that relate to this project (emphasis mine):

A key priority for most storage solutions has been to reduce the cost per unit of storage while meeting increasing availability goals.

all those systems switched from a replication model to an erasure coded storage model to meet those goals.

more recent object storage solutions have eliminated the filesystem from the storage hosts.

[AlibabaCloud Object Storage Service] claims to store three copies

The durability and availability options for Azure Blob Storage range in claimed durability from 11 to 16 nines.

Target throughput for single blob: Up to 60 MiB per second, or up to 500 requests per second

The default method of durability is to replicate objects to multiple OSDs, but Ceph supports a plugin architecture and plugins that provide for erasure coded storage. Erasure coding is specifically recommended to increase storage efficiency while maintaining object durability.

they [DigitalOcean] actively recommend customers use the AWS S3 SDKs to interact with the offering

The [DreamHost] service offers an S3-compatible API to manage objects and closely maps to S3 conventions and semantics.

Most of the comparators in this section use some form of erasure coding.

availability implies durability

Unlike replication, erasure codes can increase reliability without increasing storage costs.

major cloud storage solutions use some form of erasure coding to achieve the levels of availability they offer at the price points they’re offering it at.

In April 2009, Haystack [Facebook’s early photo storage solution] was managing 60 billion photos and 1.5 petabytes of storage, adding 220 million photos and 25 terabytes a week. Facebook [in 2013] stated that they were adding 350 million photos a day and were storing 240 billion photos. This could equal as much as 357 petabytes. … as of February 2014, Facebook stored over 400 billion photos.

The system described in that paper uses object replication for durability, though that has since been replaced with erasure coding (see below).

The architecture assumes objects will rarely be deleted once they are uploaded

Rather than object replication as used in Haystack, f4 uses a combination of Reed-Solomon and XOR coding for durability and availability … settled on a n = 10 and k = 4 Reed-Solomon(n, k) code that allowed objects to survive up to four failures without loss of data or availability, while requiring just 1.4× the original object size for that durability

Availability and durability are then further increased with geographic distribution in another region

The paper outlines four types of failures the authors designed for:

  1. Drive failures, at a low single digit annual rate
  2. Host failures, periodically.
  3. Rack failures, multiple time per year.
  4. Datacenter failures, extremely rare and usually transient, but potentially more disastrous.

Rebuilding [lost data from failed nodes] is a heavyweight process that imposes significant I/O and network load on the storage nodes. Rebuilder nodes throttle themselves to avoid adversely impacting online user requests. Scheduling the rebuilds to minimize the likelihood of data loss is the responsibility of the coordinator nodes.

High storage efficiency means no UPS, no redundant power supplies, and no generators; just rack after rack after rack of storage.

In 2011 they [instagram] described that all photos were stored in S3, and the total amounted to “several terabytes” at the time. … [In December 2012 instagram had to] accommodate “more than 25 photos and 90 likes every second.” … [In 2013] that had grown to “10,000 likes per second at peak.”.

we see [instagram using] Cassandra added alongside Postgres as part of their primary data store as they scaled to meet the demands of “400 million monthly active users with 40b photos and videos, serving over a million requests per second.”

This 2018 post claims Instagram now hosts 40 billion photos (growing 95 million per day) and is taking in 4.2 billion likes per day

a region [for AWS S3] like us-east-1 might have up to six AZs, each with up to eight facilities.

Amazon S3 is designed to sustain the concurrent loss of data in two facilities … by quickly detecting and repairing any lost redundancy.

Wasabi is just like Amazon S3. If Amazon S3 were 6x faster and 1/5th the price.

In Spring 2015 the company [Yahoo] announced it was replacing MObStore with Ceph. The primary driver was cost reductions.

Yahoo stores more than 250 Billion objects and half an exabyte of perpetually durable user content such as photos, videos, email, and blog posts. Object storage at Yahoo is growing at 20-25% annually.

[With Ceph we [Yahoo] use] erasure coding with each object broken down into eight data and three coding fragments.

admitted there are two projects they’re working on:

  • How to tune the performance for large number of small files.
  • Low latency geo-replication.

young photos are a lot more likely to be deleted [than old photos].


Amazing article - really detailed and interesting! Lots there for SAFENetwork to learn from. Thanks for sharing!