Wednesday, August 22, 2012

Amazon Glacier: Archival storage that's cheap?

Amazon Glacier is a new service from Amazon that offers archival/cold storage at a cheap and flexible on-demand price of $0.01/GB/month. Amazon says this is highly durable storage with a durability of 99.999999999% (eleven nines, same as Amazon S3), but retrieval is delayed by several hours (as opposed to instant retrieval in S3, which is designed for 99.99% availability over a year).

Traditionally, cold storage meant tape storage. The last time I personally used tape backup for a server was over a decade ago. Disks have taken over as the backup medium in most companies (except for cases that really need cold storage forever, like CERN's LHC).

Let's do some simple math for a disk based system.

Take the cheapest 3TB SATA disk, which costs about $100. (An enterprise-class drive would be 3 times that cost.) What's the actual usable storage in this? A "3TB" drive contains 3 trillion bytes of storage, which is about 2.73 binary terabytes (TiB), not a full 3. And a filesystem will have a few GBs of overhead.
Within a storage pod, we could use Reed-Solomon style encoding to provide solid redundancy with 25% of the space spent on error correction bits. Taking all this into consideration, we get only about 2000GB of usable space per disk. If we want 3 geo-separated replicas, the effective storage per disk goes down to about 675GB.

These drive costs are typically amortized over 3 years. So just the drive cost per GB per month is:
$100 / 36 months / 675GB = $0.004/GB/month.

Now, the server, power, cooling and space costs remain to be accounted for in the remaining $0.006/GB/month. A 60-drive server would cost about $2000 excluding the drive cost, which is about $56/month when amortized over the same 3 years. It would also cost about $70 per month for space rent. Power/cooling would be another $70/month.
So that's about $200/month for 60 drives' worth of storage.
So the cost per month is ($200/month) / (675GB * 60 drives) = $0.005/GB/month.
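
The arithmetic above can be checked with a short script. This is a minimal sketch that uses only the figures quoted in this post; the names are mine, and it assumes the $2000 server is amortized over the same 36 months as the drives.

```python
# Back-of-the-envelope cost model using the figures quoted in the post.
AMORTIZATION_MONTHS = 36         # drives (and, by assumption, servers) written off over 3 years
DRIVE_COST_USD = 100.0           # cheapest 3TB SATA disk
EFFECTIVE_GB_PER_DRIVE = 675.0   # after encoding overhead and 3 replicas
DRIVES_PER_SERVER = 60
SERVER_COST_USD = 2000.0         # server only, excluding drives
SPACE_RENT_PER_MONTH = 70.0
POWER_COOLING_PER_MONTH = 70.0


def drive_cost_per_gb_month() -> float:
    """Amortized drive cost per effective GB per month."""
    return DRIVE_COST_USD / AMORTIZATION_MONTHS / EFFECTIVE_GB_PER_DRIVE


def server_cost_per_gb_month() -> float:
    """Amortized server, space and power cost per effective GB per month."""
    monthly = (SERVER_COST_USD / AMORTIZATION_MONTHS
               + SPACE_RENT_PER_MONTH
               + POWER_COOLING_PER_MONTH)
    return monthly / (EFFECTIVE_GB_PER_DRIVE * DRIVES_PER_SERVER)


if __name__ == "__main__":
    drive = drive_cost_per_gb_month()
    server = server_cost_per_gb_month()
    print(f"drive:  ${drive:.4f}/GB/month")
    print(f"server: ${server:.4f}/GB/month")
    print(f"total:  ${drive + server:.4f}/GB/month")
```

With these inputs the total lands a little under Glacier's $0.01/GB/month, before counting people and networking.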

If we turn off the servers completely we can save more on power. If we build denser servers we can amortize the server cost better.

So it seems possible to build a disk-based solution that may work at even smaller scales (relatively speaking: a pod would be a row of 12 racks with 10 4U 60-disk servers per rack, i.e. 7,200 disks at about 675GB each, or roughly 4.9 petabytes of usable storage).

Obviously I've not factored in the cost of the people needed to develop, deploy and maintain such a system. This development effort would not be a trivial investment and would make sense only at large scale and for strategic reasons.

Also, I've not factored in the networking infrastructure to provide equal-cost access to all data etc.

So, in conclusion, Amazon Glacier might be the best cost solution for small scale archival needs.

Tuesday, August 30, 2011

Stuff to get it right early for a startup

I'll make this a short post. It takes less time to set this up initially and get all of your projects to conform to it than to retrofit it later. And the effort spent on this will pay for itself ten times over in time saved through increased productivity.

  • A source control repository
    • Separate binary file assets (like lots of images, videos etc) from text file assets (like source code) into separate repositories.
    • Use a distributed version control repository, like git.
  • Integrated Code review tool, like gerrit.
  • Integrated Bug database, like bugzilla (it's very customizable and fast) or jira (newer versions are pretty good).
  • Integrated code browser, like opengrok. 
  • Every project should be buildable, preferably using autotools.
    • Even if it's 3rdparty code, never just keep the binary. Always keep the source in good building shape.
    • Also, save the web URL or location from where it was downloaded. There may be a bug fix or an update you want to pick up later.
    • Ensure build is fast. Use distcc and ccache to make it faster.
    • Split the overall code into independent layered one-way dependency projects.
  • Continuous build and deploy and smoke-test setup
    • This is extremely important for a project that's in active development.
    • Ensure smoke test is up-to-date and extensive.
    • Build system itself should be version-controlled. Treat build systems as sacrosanct. Don't install or upgrade packages randomly on this system.
  • Don't be stingy about the hardware for these systems.
  • Backup everything.
For a startup, hiring a build-automation engineer who can do all this stuff correctly and efficiently is probably money well-spent because it will save tons of effort on behalf of your costlier engineers,  architects etc.

If you are building an Internet application, then there are more things to pay attention to, like ensuring your build system can publish to a software distribution system, upgradable builds with build ids and version numbers, automatic dependency management etc. Maybe another quick post on it sometime later.

Wednesday, April 13, 2011

nginx 1.0!

My favorite server, nginx, has hit its 1.0 release. With this release they have made public the svn repo holding the code, with history going all the way back to 2002. The repo is at svn://svn.nginx.org.

Kudos to Igor and his team on this awesome piece of practical software. I see myself continuing to be a fan of both Apache httpd and nginx for a long time to come.

Saturday, October 30, 2010

Thrift for serializing/deserializing objects in Membase

First off, if you haven't heard of Membase, you should check it out. It's an evolution of sorts from memcached.

Typically when you use memcached or membase to store/retrieve key-value data, the value part is not a simple datatype. Instead, it is most likely a serialized representation of some complex application-specific data structure. It's great to set/get a complex data structure with a single remote call like this. But what can become a problem very soon is the performance of the serialize/deserialize operations that need to happen with every set/get.

With PHP, the obvious way to do this is to use the language's builtin serialization facility. Since the serialized format is an ASCII-based format, I would guess that its performance is not optimal (especially for deserialization). Also, one would want to compress the data to reduce the transfer and storage costs. This again adds to the cost of each set/get operation.

I'm looking at one such application which could be optimized to work more efficiently in these areas. I've looked at Google Protocol Buffers. It's very easy to understand and use and has very good documentation. Unfortunately it doesn't have good support for PHP. So I'm now looking at Thrift. Thrift was initially developed by Facebook for use primarily with PHP and other languages. So it has good support for PHP and has comparable performance and functionality to that of protobufs. But its documentation seems to be too sparse.

On Compression
LZO is a more suitable compression algorithm here for reasons of CPU and memory efficiency. When compression is part of web request handling, one has to carefully weigh compressed size against speed.
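
To make the size-versus-speed trade-off concrete, here is a minimal sketch of a serialize-then-compress round trip. It is not the Thrift/LZO setup discussed above: it uses JSON and zlib from the Python standard library as stand-ins (both are my substitutions, since Thrift and LZO need third-party packages), and the sample record is made up.

```python
import json
import time
import zlib


def make_sample_record(n_items: int = 2000) -> dict:
    """Build a made-up nested value of the kind one might cache."""
    return {
        "user_id": 12345,
        "items": [{"id": i, "name": f"item-{i}", "qty": i % 7} for i in range(n_items)],
    }


def encode(value: dict, level: int) -> bytes:
    """Serialize to JSON, then compress with zlib (level 1 = fastest, 9 = smallest)."""
    return zlib.compress(json.dumps(value).encode("utf-8"), level)


def decode(blob: bytes) -> dict:
    """Reverse of encode(): decompress, then parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


if __name__ == "__main__":
    record = make_sample_record()
    raw_size = len(json.dumps(record).encode("utf-8"))
    print(f"uncompressed: {raw_size} bytes")
    for level in (1, 6, 9):
        start = time.perf_counter()
        blob = encode(record, level)
        elapsed_ms = (time.perf_counter() - start) * 1000
        assert decode(blob) == record  # the round trip must be lossless
        print(f"level {level}: {len(blob)} bytes, {elapsed_ms:.2f} ms to encode")
```

Running it on data shaped like your real cache values shows how much extra CPU time each compression level costs per set for the bytes it saves.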

Sunday, April 04, 2010

Need for service with guarantee of security and privacy

Current Situation
Pretty much all of the online communication services we use today, like e-mail and social media sites, are "free" services in the sense that we don't pay them any subscription fee for using the service. And as such the "Terms of Service" are heavily tilted towards the service provider.

In most cases, the only way such a free service provider makes money is by mining the data it collects when we use the service. Every time we use such a service we are inputting some data for query, transmission or storage. Most of the time this is sensitive, confidential, private data, like your contacts and personal messages, that reveals who you are, what you like or don't like, what, when and where you do things etc.

Mining this information to profile users and show them targeted ads, or to do market demographics research and sell that information to marketers, are the most common ways of making money. In such cases, no particular user's data is specifically exposed as it's all aggregate information. So such uses may be acceptable.

But what is scary is unauthorized or accidental data leakage, theft of data by illegal "hackers", or even government-backed agencies getting access to this data to spy on people or to conduct corporate espionage.

What's needed?
I don't know if there will ever be a complete solution to this problem. But, to start with, we need guarantees about privacy and security from service providers. These guarantees should be verified and certified by multiple third-party agencies. They should be scientifically provable. And there should be stringent consequences for breaching a guarantee, whatever the reason may be.

Now, I know such security is difficult and will cost a lot of money. So, it is acceptable to have subscription fees to cover such services.

What is alarming today is that there are absolutely no such services in existence. Even if someone values their privacy and security and is willing to pay subscription fees for it, they have no choice but to use ad-powered free services.

Sunday, February 14, 2010

Flash video sucks!

Summer is almost here. With the rising ambient temperature, my macbook gets hot sooner.
It gets hot even sooner when my browser is open. The reason is the all-pervading Adobe Flash player based ads and video players on web pages.

This has made watching videos on youtube or ted.com an unpleasant experience. If I watch a video in full-screen, I notice both cores on this macbook running at a full 100%. That's horribly wrong when it needs less than 1% to play the same video in a standalone video player like VLC.

This really needs to be fixed. Something is horribly broken here. Is this just me, or is everyone else simply putting up with this problem?

Friday, January 15, 2010

Google finally getting into data backup!?!

With their latest announcement to host any type of file on Google Docs, Google is foraying into the "your data in the cloud, access, organize, share - anytime from anywhere" business that we have been envisioning for a long time (over 4 years now!).

What's interesting is the approach that Google has taken. Instead of the traditional approach of building all the features geared towards this product vision from the ground up and releasing the end product, Google has built seemingly independent products and tested the waters first. And once users have accepted each of those individual pieces reasonably well, they are integrating them all to provide a powerful experience. (Privacy conspiracy theorists may say this is much like boiling a frog in the water slowly!)

Interestingly enough, Google's price of storage per GB seems to be the cheapest at the moment at $0.25/GB/year. But their initial free offering is just 1 GB with a 250MB file size limit. At this price, it seems cheaper than Amazon. And as expected for an end-user product, there are no transfer charges (bandwidth costs). In comparison, Microsoft SkyDrive offers 25 GB of free space with a 50MB file size limit.

Even though Google doesn't have its own backup client that can run on your desktop like traditional backup clients, I'm sure that, given the good data APIs available to third-party developers, many will crop up like mushrooms.

Surely this will change the market for good in the long term. Let's see how the traditional backup companies (including us) will react to this.

Saturday, January 09, 2010

Macbook Pro Battery

One thing I realized with my last Macbook (White) is that putting your macbook to sleep all the time and never shutting it down (especially overnight) is not good for your battery. By doing that I consumed more battery cycles, and now the battery discharge time has come down to just over 2 hours. Last week I got a new macbook pro. And this time I figured out how to make it hibernate (not sleep) upon closing the lid. With that, I've managed to use only 3 battery recharge cycles in the last week.

Here's how to put your macbook to deep-sleep (hibernate):
Put these two lines in your ~/.bash_profile:

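# hibernatemode 0: normal sleep, memory stays powered
alias hibernateoff='sudo pmset -a hibernatemode 0'
# hibernatemode 5: hibernate, memory is written to disk and the machine powers down
alias hibernateon='sudo pmset -a hibernatemode 5'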

And whenever you are about to close your lid (like before you go to bed), just turn hibernate on by invoking hibernateon in a terminal. At other times, when you don't want it to go to deep-sleep, just turn hibernate off. This is useful when you close your lid while moving between meeting rooms etc. in the office.

Also, I realized that these new batteries don't need to be discharged and recharged regularly as they don't have the "memory" problem like the older technology batteries did. So I use battery only when I need to and stay on power adaptor when I can. This way I can keep my battery cycle count low.

Here's my battery info for future (self-) reference:
+-o AppleSmartBattery  
    {
      "ExternalConnected" = Yes
      "TimeRemaining" = 0
      "InstantTimeToEmpty" = 65535
      "ExternalChargeCapable" = Yes
      "CellVoltage" = (4189,4189,4190,0)
      "PermanentFailureStatus" = 0
      "BatteryInvalidWakeSeconds" = 30
      "AdapterInfo" = 0
      "MaxCapacity" = 5573
      "Voltage" = 12568
      "Quick Poll" = No
      "Manufacturer" = "DP"
      "Location" = 0
      "CurrentCapacity" = 5573
      "LegacyBatteryInfo" = {"Amperage"=226,"Flags"=5,"Capacity"=5573,"Current"=5573,"Voltage"=12568,"Cycle Count"=3}
      "BatteryInstalled" = Yes
      "FirmwareSerialNumber" = 9626
      "CycleCount" = 3
      "AvgTimeToFull" = 0
      "DesignCapacity" = 5450
      "ManufactureDate" = 15124
      "BatterySerialNumber" = "xxxxxxxxxxxx"
      "PostDischargeWaitSeconds" = 120
      "Temperature" = 3099
      "InstantAmperage" = 0
      "ManufacturerData" = <000000000000000000000000xxxxxxxxxx000000000000000>
      "MaxErr" = 1
      "FullyCharged" = Yes
      "DeviceName" = "xxxxxxxxx"
      "IOGeneralInterest" = "IOCommand is not serializable"
      "Amperage" = 226
      "IsCharging" = No
      "DesignCycleCount9C" = 1000
      "PostChargeWaitSeconds" = 120
      "AvgTimeToEmpty" = 65535
    }


As this battery is designed to last 1000 cycles, I'm hoping it will give me 6 hours of backup when I need it for a long, long time - at least 3 years.
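
As a rough check on that hope, the numbers in the dump can be turned into a capacity ratio and a cycle budget. This is only a sketch: the constants are copied from the battery info above, the function names are mine, and it assumes the first week's rate of 3 cycles per week simply continues.

```python
# Figures copied from the AppleSmartBattery dump above.
DESIGN_CAPACITY = 5450
MAX_CAPACITY = 5573
CYCLE_COUNT = 3
DESIGN_CYCLE_COUNT = 1000
CYCLES_PER_WEEK = 3  # assumption: the first week's usage pattern holds


def health_percent(max_capacity: int, design_capacity: int) -> float:
    """Current full-charge capacity as a percentage of the design capacity."""
    return 100.0 * max_capacity / design_capacity


def years_until_design_cycles(design_cycles: int, used: int, per_week: float) -> float:
    """Years until the rated cycle count is reached at a constant weekly rate."""
    weeks = (design_cycles - used) / per_week
    return weeks / 52.0


if __name__ == "__main__":
    print(f"capacity vs design: {health_percent(MAX_CAPACITY, DESIGN_CAPACITY):.1f}%")
    years = years_until_design_cycles(DESIGN_CYCLE_COUNT, CYCLE_COUNT, CYCLES_PER_WEEK)
    print(f"years to reach {DESIGN_CYCLE_COUNT} cycles: {years:.1f}")
```

At 3 cycles a week the 1000-cycle rating works out to a bit over 6 years, so 3 years looks comfortable as far as cycle count goes.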

Thunderbird 3: Better search user-experience, not there yet.

For my work e-mail, I used Microsoft Outlook with Exchange server for 2 years and I liked it a lot. The global address book integration, expanding distribution lists and calendar/meeting scheduling features are especially awesome. In my current workplace, we don't have Exchange. And I'm on a macbook. So I have a choice of Apple Mail, Thunderbird or Microsoft Entourage.
I tried Microsoft Entourage and didn't like it - it's nothing like Outlook and its UI looks as if it's been resurrected from the 1970s. And without an Exchange server to connect to, it doesn't have much advantage over the others.

I also tried Apple Mail for a couple of months. Didn't like it either. Although it looks great compared to Entourage or Thunderbird, it isn't great at handling lots of e-mails in lots of IMAP folders. Its search is also lacking in speed.

Then I tried Thunderbird. I've been a big fan of Mozilla for a long time. And being a supporter of open-source (where appropriate!), I decided that I could put up with minor quirks here and there with Thunderbird and would use it as my primary mail client. And so I've been using it for the past 2 years.

Over the years, Thunderbird has improved quite significantly. Its ability to handle a huge number of e-mails in a huge number of IMAP folders is especially great. Its search is also quite fast. Although there have been lots of crashes (as I'm always on beta or even alpha builds, that's expected), the latest Thunderbird 3 release has been quite stable. No crashes so far. So overall I'm happy.

But I think Thunderbird can do much better with just a few minor improvements. Here's my list of low-hanging-fruit enhancements to Thunderbird that can greatly improve its UX.

  • Keyboard accelerators or special keywords (search operators) that map to the search filters in the quick search drop down. This would speed up the search experience in a big way. I've filed this as an enhancement request in the Thunderbird bug tracker: https://bugzilla.mozilla.org/show_bug.cgi?id=538738. Please leave your comment there if you also think it's important.
  • Multiple addresses on a single line in the compose window. The current behavior is annoying when replying to a message that has a lot of recipients. Here's the enhancement request for this one: https://bugzilla.mozilla.org/show_bug.cgi?id=495241
  • In thread view, when a new message arrives and the thread is collapsed, the thread should be shown in bold to indicate there is an unread message hidden there. Otherwise, the user may miss reading the message.
If you are a Thunderbird hacker, please consider working on these. I myself would like to spend time on them. Maybe, with jetpack for Thunderbird, a simple jetpack could get these things done.

Sunday, June 28, 2009

Git: Understanding is key to using it well

I've seen many people struggling with the git manpages or getting confused by the myriad commands in git. If you are one of them, I recommend checking out the material by this cool guy, Scott Chacon.

Specifically, check out this presentation: http://gitcasts.com/posts/railsconf-git-talk
The pdf of that presentation is also available there. The concepts behind how git works internally are very easy to understand, and knowing them helps a lot in making sense of git commands and improves your mileage with git.
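
As one small illustration of how simple those internals are, git names a file's contents (a "blob") by hashing a short header followed by the bytes. The sketch below reproduces that id in Python; the function name is mine, and it assumes the default SHA-1 object format.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object id git assigns to a blob with this content.

    git hashes the header "blob <size>" plus a NUL byte, followed by the raw bytes.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Should match: echo 'hello world' | git hash-object --stdin
    print(git_blob_id(b"hello world\n"))
```

Because the id depends only on the content, two files with identical bytes are stored once, which explains a lot of what commands like git add and git commit do under the hood.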

For users on Windows, tortoisegit has come along quite well. It's almost feature-equivalent to tortoisesvn and has 25 more features relevant to git.