Amazon Glacier: Archival storage that's cheap?
Amazon Glacier is a new service from Amazon that offers archival (cold) storage at a cheap, flexible, on-demand price of $0.01/GB/month. Amazon says this is highly durable storage with a durability of 99.999999999% (eleven nines, same as Amazon S3), but retrieval is delayed by several hours (as opposed to instant retrieval in S3, which is designed for 99.99% availability over a year).
Traditionally, cold storage meant tape storage. The last time I personally used tape backup for a server was over a decade ago. Disks have taken over as the backup medium in most companies (except for cases that really need cold storage forever, like CERN's LHC).
Let's do some simple math for a disk-based system.
Take the cheapest 3TB SATA disk, which costs about $100. (An enterprise-class drive would be 3 times that cost.) What's the actual usable storage in this? A "3TB" drive contains 3 trillion bytes of storage, which is about 2.73 binary terabytes (roughly 2794GB) and not really 3 terabytes. And a filesystem will have a few GBs of overhead.
Within a storage pod, we could use Reed-Solomon style encoding to provide solid redundancy with 25% of the space spent on error correction bits. Taking all this into consideration, we get only about 2000GB of usable space per disk. If we want 3 geo-separated replicas, the effective storage per disk goes down to roughly 675GB.
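The capacity arithmetic above can be sketched in a few lines of Python. This is only a rough check using the post's own estimates. The constant and function names are made up for illustration, and filesystem overhead is not modeled, which is part of why the post rounds down to 2000GB.

```python
# Capacity arithmetic for one "3TB" drive, using the post's estimates.
RAW_BYTES = 3 * 10**12        # a "3TB" label means 3 trillion bytes
BYTES_PER_BINARY_GB = 2**30
ECC_OVERHEAD = 0.25           # share of space spent on error correction (from the post)
REPLICAS = 3                  # geo-separated copies (from the post)


def usable_gb_after_ecc(raw_bytes: int, ecc_overhead: float) -> float:
    """Binary GB left on one drive after error-correction overhead."""
    return raw_bytes / BYTES_PER_BINARY_GB * (1 - ecc_overhead)


raw_gb = RAW_BYTES / BYTES_PER_BINARY_GB
after_ecc = usable_gb_after_ecc(RAW_BYTES, ECC_OVERHEAD)
per_replica = 2000 / REPLICAS  # starting from the post's rounded 2000GB figure

print(f"raw: {raw_gb:.0f} GB, after error correction: {after_ecc:.0f} GB")
print(f"effective per drive with {REPLICAS} replicas: {per_replica:.0f} GB")
```

The results land close to the rounded 2000GB and 675GB figures used in the rest of the calculation.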
These drive costs are typically amortized over 3 years. So just the storage cost per GB per month is:
$100 / 36 months / 675GB = $0.004/GB/month.
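As a quick check of that division, here is a minimal sketch. The function name and parameters are hypothetical, and the inputs are the post's estimates rather than measured prices.

```python
def drive_cost_per_gb_month(drive_price_usd: float,
                            amortization_months: int,
                            effective_gb: float) -> float:
    """Amortized drive cost in dollars per effective GB per month."""
    return drive_price_usd / amortization_months / effective_gb


# $100 drive, 3-year amortization, 675GB effective capacity per drive.
drive_cost = drive_cost_per_gb_month(100, 36, 675)
print(f"${drive_cost:.4f}/GB/month")
```

Changing the drive price or the amortization period shows how sensitive the result is to those two guesses.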
Now, the server, power, cooling and space costs have to fit in the remaining $0.006/GB/month. A 60-drive server would cost about $2000 excluding the drives, or about $56/month when amortized over the same 3 years. Space rent would be about $70 per month, and power/cooling another $70/month.
So that's about $200/month for 60 drives' worth of storage.
So the overhead cost per month is ($200/month) / (675GB * 60 drives) = $0.005/GB/month. Added to the $0.004/GB/month for the drives, that's about $0.009/GB/month, just under Glacier's price.
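The server-side overhead can be checked the same way. This sketch assumes the server is amortized over the same 36 months as the drives. All variable names are illustrative, and the dollar figures are the post's rough estimates.

```python
MONTHS = 36                    # amortization period, same as for the drives
DRIVES_PER_SERVER = 60
EFFECTIVE_GB_PER_DRIVE = 675

server_monthly = 2000 / MONTHS                 # $2000 chassis spread over 3 years
overhead_monthly = server_monthly + 70 + 70    # plus space rent and power/cooling

server_gb = EFFECTIVE_GB_PER_DRIVE * DRIVES_PER_SERVER
overhead_per_gb = overhead_monthly / server_gb

# Add the amortized drive cost ($100 per drive) to get the total.
drive_per_gb = 100 / MONTHS / EFFECTIVE_GB_PER_DRIVE
total_per_gb = overhead_per_gb + drive_per_gb

print(f"overhead: ${overhead_monthly:.2f}/month per server")
print(f"overhead: ${overhead_per_gb:.4f}/GB/month")
print(f"total:    ${total_per_gb:.4f}/GB/month")
```

With these inputs the total stays a little under Glacier's $0.01/GB/month, before people and networking costs.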
If we turn off the servers completely we can save more on power. If we build denser servers we can amortize the server cost better.
So it seems to be possible to build a disk-based solution that may work even at smaller scales (relatively speaking: a pod would be a row of 12 racks with ten 4U 60-disk servers per rack = 4.6 petabytes of usable storage).
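The pod size follows from the same per-drive figure. A small sketch, with illustrative names, assuming the 675GB effective capacity per drive and binary units (1PB = 1024 * 1024 GB):

```python
RACKS_PER_POD = 12
SERVERS_PER_RACK = 10
DRIVES_PER_SERVER = 60
EFFECTIVE_GB_PER_DRIVE = 675

drives_per_pod = RACKS_PER_POD * SERVERS_PER_RACK * DRIVES_PER_SERVER
usable_gb = drives_per_pod * EFFECTIVE_GB_PER_DRIVE
usable_pb = usable_gb / 1024**2    # binary petabytes

print(f"{drives_per_pod} drives, {usable_gb} GB, about {usable_pb:.1f} PB usable")
```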
Obviously I've not factored in the cost of the people needed to develop, deploy and maintain such a system. This development effort would not be a trivial investment and would make sense only at large scale and for strategic reasons.
Also, I've not factored in the networking infrastructure needed to provide equal-cost access to all the data, among other things.
So, in conclusion, Amazon Glacier might be the most cost-effective solution for small-scale archival needs.