Cloud Data Persistence


Is this a renaissance for the database?

Today we have new problems and new challenges, but the computing power now available gives us alternatives, and that in turn is changing the approaches being taken. The following are examples of the alternatives in the cloud database area.

Physical limitations and computational complexity are driving implementations into new areas that we are not used to. It is similar to when multi-core processors first came out: many thought lots of software would soon take advantage of the extra cores, but that did not happen quickly, and it’s still happening now.

A common solution for a social network with a large database is to shard it (divide the data into chunks spread across multiple servers). But remember that this changes things for the developer, who now has to look things up across multiple shards: you cannot use an SQL JOIN across shards, and the database can no longer guarantee uniqueness and integrity across them. We are also seeing developers resort to unnatural acts to deal with the issues of large data sets, e.g. turning MySQL into a schemaless store (FriendFeed). It is a good solution for them, but now the DB is no longer really a DB.
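To make the "no JOIN across shards" point concrete, here is a minimal sketch of what the application ends up doing itself. The shards are plain in-memory dicts standing in for separate servers, and every name in it (NUM_SHARDS, shard_for, posts_with_author_names) is made up for illustration, not taken from the talk.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count

# Each shard is modelled as two in-memory "tables" instead of a real server.
shards = [{"users": {}, "posts": {}} for _ in range(NUM_SHARDS)]


def shard_for(user_id):
    """Pick the shard that owns a user by hashing the shard key."""
    return shards[zlib.crc32(str(user_id).encode()) % NUM_SHARDS]


def add_user(user_id, name):
    shard_for(user_id)["users"][user_id] = name


def add_post(author_id, text):
    # Posts are stored on the author's shard, next to the user row.
    shard_for(author_id)["posts"].setdefault(author_id, []).append(text)


def posts_with_author_names(author_ids):
    """Application-level 'join': one lookup per author, each possibly on a different shard."""
    result = []
    for uid in author_ids:
        shard = shard_for(uid)
        name = shard["users"].get(uid)
        for text in shard["posts"].get(uid, []):
            result.append((name, text))
    return result


add_user(1, "alice")
add_user(2, "bob")
add_post(1, "hello")
add_post(2, "hi there")
print(posts_with_author_names([1, 2]))
```

Note that nothing here stops two shards from each holding a user with the same name or email address; a uniqueness check of that kind would have to be built in the application too.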

The cloud can be viewed as:

  • Hiding complexity
  • Scalable – elastic resource availability
  • Pay as you go
  • No need to worry about tuning
  • Geographical diversity

It can also be broken into loose types, the *aaS model:

  • SaaS – software as a service
  • PaaS – platform as a service
  • TaaS – tools as a service
  • IaaS – infrastructure as a service
  • ?aaS

It all sounds great, so what’s the catch? Safety, geographical availability and commodity hardware. Also, believe it or not, the speed of light, which in data terms is still slow when transferring data between distant geographical locations.
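A quick back-of-the-envelope calculation shows why the speed of light matters. The figures below are assumptions for illustration only: light in optical fibre is taken to travel at roughly two thirds of its vacuum speed, and 5,570 km is an approximate London to New York great-circle distance. Real routes are longer and add switching delays, so this is a lower bound.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
FIBRE_FACTOR = 2 / 3               # assumption: light in fibre travels at about 2/3 of c


def min_round_trip_ms(distance_km):
    """Lower bound on round-trip time over fibre, ignoring routing and processing."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBRE_FACTOR)
    return 2 * one_way_s * 1000


# Assumed distance: roughly London to New York.
print(round(min_round_trip_ms(5_570), 1), "ms")
```

That works out at a bit under 56 ms per round trip before any real work is done, so a request that needs several sequential round trips between continents is slow no matter how fast the servers are.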

Two alternatives to the relational model that can cope with massive datasets in the cloud:

Google BigTable

Data tables are sharded into tablets, and each tablet is served by a single tablet server; each tablet server can hold around 1000 tablets. The tablet servers have a master, and if the master is removed the system will still work for a limited period.

  • Distributed store
  • Hundreds of terabytes
  • Effectively a big sorted map
  • Column keys grouped into column families
  • Data is versioned
  • Fast, scalable and transactional
  • Metadata is also stored in the same way, in tablets located via a root metadata tablet
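The "big sorted map" idea in the list above can be sketched in a few lines. This is a toy, single-process model and not how BigTable is implemented: the class name TinySortedTable, its methods and the example row and column names are all invented here. It only illustrates keys of the form (row, family:qualifier, timestamp) kept in sorted order, with several versions per cell.

```python
import bisect


class TinySortedTable:
    """Toy model of a sorted map: (row, "family:qualifier", timestamp) -> value."""

    def __init__(self):
        self._keys = []    # kept sorted; the timestamp is negated so newest sorts first
        self._values = {}

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if the cell is empty."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan_rows(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row, in key order."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1


table = TinySortedTable()
table.put("com.example/index", "contents:html", 1, "<p>old</p>")
table.put("com.example/index", "contents:html", 2, "<p>new</p>")
table.put("org.example/home", "contents:html", 1, "<p>other</p>")

print(table.get("com.example/index", "contents:html"))
print(list(table.scan_rows("com.", "com/")))
```

Because the keys are sorted by row, a contiguous range of rows is a contiguous slice of the map, which is roughly what makes it natural to cut a table into tablets by row range.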

You cannot use BigTable directly yourself, but there are some open source alternatives: Hypertable and Apache HBase. BigTable is also available via Google App Engine; you need to use Python and there is something in between (although the speaker had not worked out what it is), but you still get the benefits of BigTable in a roundabout way.

Amazon Dynamo

The Voldemort and Cassandra projects use this idea.

  • Distributed key value store
  • Designed for high availability – tolerates network partitions and server failures without effect
  • Decentralized – no master
  • Data replicated via consistent hashing
  • Multi-node reads and writes for redundancy
  • Objects versioned for consistency
  • Uses a vector clock to disambiguate between the servers’ versions of the same object
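Two of the ideas in the list above, consistent hashing and vector clocks, are easier to follow with a small sketch. This is a simplified illustration under my own assumptions, not Dynamo's actual code: the Ring class, the use of MD5, the number of virtual nodes and the compare function are all choices made for the example.

```python
import bisect
import hashlib


def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class Ring:
    """Consistent-hash ring: a key is stored on the first n distinct nodes clockwise of its hash."""

    def __init__(self, nodes, vnodes=8):
        # Each node gets several points on the ring (virtual nodes) to even out the load.
        self._points = sorted((_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))

    def preference_list(self, key, n=3):
        start = bisect.bisect(self._points, (_hash(key),))
        owners = []
        for step in range(len(self._points)):
            node = self._points[(start + step) % len(self._points)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == n:
                break
        return owners


def descends(a, b):
    """True if clock a has seen every update that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    """Classify two vector clocks (dicts of node -> update counter)."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"  # concurrent writes: the application has to reconcile them


ring = Ring(["A", "B", "C", "D"])
print(ring.preference_list("user:42"))
print(compare({"A": 2, "B": 1}, {"A": 1, "B": 1}))
print(compare({"A": 2}, {"B": 1}))
```

The point of the ring is that removing a node only moves the keys that node owned; everything else stays where it was. The point of the clocks is that "conflict" is a detectable state rather than silent data loss.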

And for a lighter touch, the smaller alternatives:

Amazon SimpleDB

  • Tabular store
  • Domains, which are like tables and contain items
  • Schemaless
  • Auto-indexing
  • Eventually consistent
  • No cross-domain joins
  • Queries limited to 250 items
  • Everything is a string
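"Everything is a string" has a practical consequence: comparisons and sorting are lexicographic, so numbers have to be encoded before they are stored. Below is a minimal sketch of the usual zero-padding plus offset trick. The width and offset are arbitrary assumptions for the example, and encode_int and decode_int are hypothetical helper names, not part of any SimpleDB library.

```python
WIDTH = 10        # assumed fixed width of every encoded number
OFFSET = 10**9    # assumed offset that shifts negative numbers above zero


def encode_int(n):
    """Encode an integer so that string order matches numeric order."""
    shifted = n + OFFSET
    if not 0 <= shifted < 10**WIDTH:
        raise ValueError("number out of range for this encoding")
    return str(shifted).zfill(WIDTH)


def decode_int(s):
    return int(s) - OFFSET


values = [5, -3, 40, 100]

# Naive string sort puts "100" before "40" before "5".
print(sorted(str(v) for v in values if v >= 0))

# Encoded strings sort in the same order as the numbers they represent.
print([decode_int(s) for s in sorted(encode_int(v) for v in values)])
```

The same idea applies to dates: a fixed-width format such as ISO 8601 sorts correctly as text.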

Microsoft’s Azure SQL Services – in test

  • Non-relational – really an XML document store
  • Containers which have entities
  • Queries through LINQ
  • REST and SOAP interfaces

Apache CouchDB – looks pretty good for JavaScript apps.

  • Document store in JSON
  • REST API: GET, PUT, POST
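As a rough sketch of what "JSON documents over REST" looks like from a client, the snippet below builds the three kinds of request using only the Python standard library. It assumes a CouchDB instance on its default port on localhost and a database called notes; both are assumptions for illustration, as are the helper names. The requests are only constructed here; the send function would actually issue one against a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5984"  # assumption: local CouchDB on its default port
DB = "notes"                         # hypothetical database name


def _json_request(url, method, doc=None):
    data = json.dumps(doc).encode("utf-8") if doc is not None else None
    headers = {"Content-Type": "application/json"} if doc is not None else {}
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def put_doc(doc_id, doc):
    """PUT /{db}/{id}: store a document under an id chosen by the caller."""
    return _json_request(f"{BASE_URL}/{DB}/{doc_id}", "PUT", doc)


def get_doc(doc_id):
    """GET /{db}/{id}: fetch a document as JSON."""
    return _json_request(f"{BASE_URL}/{DB}/{doc_id}", "GET")


def post_doc(doc):
    """POST /{db}: store a document and let the server pick the id."""
    return _json_request(f"{BASE_URL}/{DB}", "POST", doc)


def send(request):
    """Issue the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


req = put_doc("first-note", {"title": "Cloud Data Persistence", "tags": ["cloud", "db"]})
print(req.get_method(), req.full_url)
```

Updating an existing document also means including its current _rev value in the body, which is how CouchDB detects conflicting writes.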

Other things to watch:

