Relatively speaking, scaling web servers is the easiest part of any scalability strategy. Any number of cloud providers can give you the capability to easily add nodes to your web farm, but eventually the database is going to become a bottleneck. If you’re starting your product or service today and you expect to need significant database scalability, you can design for that from the beginning by picking the right vendor, provider, and architecture. However, if you’re an existing company with a typical RDBMS that has grown in size and complexity over the years, your options are much more limited. You might be asking yourself: what can the cloud do for me?
Under Lock and Key
The allure of the cloud, once again, is the idyllic freedom from resource constraints. In our case, we have essentially one physical SQL Server 2005 central database serving all our applications and storing approximately 60 GB of data. We make heavy use of stored procedures and T-SQL, as well as linked servers and replication at various integration points with other subsystems. We’re locked in.
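That lock-in is easy to see in the T-SQL itself. As a hypothetical illustration (the server, database, and table names below are all made up, not from our actual system), a stored procedure that joins a local table to a linked server via a four-part name ties the code directly to the physical topology:

```sql
-- Hypothetical sketch of the kind of coupling described above:
-- a stored procedure joining a local table to a table reached
-- through a linked server. Move either database and this breaks.
CREATE PROCEDURE dbo.GetOpenOrdersWithStock
AS
BEGIN
    SELECT o.OrderId, o.CustomerId, s.QuantityOnHand
    FROM dbo.Orders AS o
    JOIN WAREHOUSE01.InventoryDb.dbo.Stock AS s   -- four-part linked-server name
        ON s.ProductId = o.ProductId
    WHERE o.Status = 'Open';
END
```

Multiply that pattern across hundreds of procedures and the cost of relocating any one database becomes clear.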
There are numerous strategies to increase database scalability, and perhaps performance as well. The most obvious is to throw more hardware at it. Right now we’re running on 4 CPUs and 18 GB of RAM, so we’ve hardly pushed the limits of this approach. The major drawbacks, if you could call them that, are that we’d have to build for peak load, and that one giant database gets harder and harder to back up and restore in a reasonable time frame. The obvious advantage is that I don’t have to change a thing; the applications, code, and architecture are all unaffected. Although there is a theoretical limit to how far this approach will take us, it’s really a question of whether capacity will grow faster than our need for it, where the bottlenecks surface, and whether this is the most economically justifiable approach.
Let’s assume that we do reach a point where the hardware approach is too expensive, either because our peaks are too infrequent and orders of magnitude heavier than our typical load, or because competitors, thanks to a better architecture, are simply able to achieve the same scale much more cheaply.
I go to work, like an architect
One obvious approach would be to break up the monolith: create more boundaries between the applications and start carving the database up along those boundaries. By doing so we could split our one monolithic database into three or four separate databases, which could be moved off to different physical (or virtual) database servers. Integration between them could then be maintained via service-oriented architecture (SOA) at the application level and/or replication at the database layer. That’s all well and good, but it’s no small undertaking.
While we’re at it, by splintering our database and re-architecting, we could additionally look beyond the RDBMS at some of the NoSQL solutions that are known to scale and perform better for certain types of applications. Perhaps we could run our eCommerce site on something like RavenDB. I think some strong arguments can be made for the advantages therein. But of course now we are changing not only our data and application architecture but also our code, tools, skill sets, and integration points as well. This is starting to look like a mountain of work; what else can I do?
Another approach would be to use a combination of tuning, archiving, and table partitioning to squeeze more performance and scalability out of our current database. Archiving and partitioning could also provide a similar ability to spread the database across multiple servers through federation. This is no small amount of effort either, but at least it’s mostly contained within the database layer, making it more practical than a complete re-architecture. Still, I can’t help but wonder: can the cloud free me from this burden?
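As a rough sketch of what the partitioning option looks like in SQL Server 2005 (the table, boundary values, and filegroup choice here are illustrative, not our real schema), you define a partition function over a key such as order date, map it to a scheme, and create the table on that scheme:

```sql
-- Illustrative only; names and boundary dates are made up.
-- Split rows by year so older years can be archived or switched out.
CREATE PARTITION FUNCTION pfOrdersByYear (DATETIME)
AS RANGE RIGHT FOR VALUES ('2009-01-01', '2010-01-01', '2011-01-01');

-- Map every partition to one filegroup for simplicity; in practice
-- each partition could live on its own filegroup/disk.
CREATE PARTITION SCHEME psOrdersByYear
AS PARTITION pfOrdersByYear ALL TO ([PRIMARY]);

CREATE TABLE dbo.Orders
(
    OrderId   INT      NOT NULL,
    OrderDate DATETIME NOT NULL,
    Total     MONEY    NOT NULL
) ON psOrdersByYear (OrderDate);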
Nothing but shattered dreams
Ideally, I’d want to just plunk my existing database on some service in the cloud and have all of my scalability problems handled for me. I could look at Amazon RDS, Xeround, or SQL Azure, to name a few who claim auto-scaling. I believe that auto-scaling generally means the seamless allocation of additional resources, and a large pool of resources from which to draw. But that’s not much different than just adding more resources to my existing physical database server. What I’m looking for from the cloud is some form of auto-sharding that would handle the partitioning and federating aspects of splitting my one database across many, and back again, without any code changes. It looks like that might actually be a possibility if mine were a MySQL database, but alas, I won’t be plunking my SQL Server database down on Amazon RDS (MySQL and Oracle) or Xeround (MySQL). The only realistic chance of us doing anything close would be with SQL Azure. However, it only takes a little research to discover that sharding and federation are far from ‘auto’ in SQL Azure, as they require significant architectural changes which can involve restructuring the database.
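To give a sense of why it’s far from ‘auto’, here is roughly what working with SQL Azure Federations looks like, as I understand the preview documentation (federation name, key, and values below are hypothetical). The crux is that every data-access path in the application must first route its connection to the federation member that owns the key it needs:

```sql
-- Rough sketch based on the SQL Azure Federations preview; all names hypothetical.
-- Create a federation keyed on a customer id range.
CREATE FEDERATION Orders_Federation (CustId BIGINT RANGE)
GO
-- Route this connection to the member holding CustId 12345.
-- This routing step is the code change 'auto' would have spared us.
USE FEDERATION Orders_Federation (CustId = 12345)
    WITH RESET, FILTERING = ON
GO
SELECT OrderId, Total FROM dbo.Orders WHERE CustId = 12345;
```

On top of the routing, the schema itself has to be restructured so that every federated table carries the distribution key, which is exactly the kind of invasive change I was hoping the cloud would handle for me.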
A Nuo hope
It looks like if I want anything near auto-sharding, then I’ll need to consider migrating to a cloud database service and/or a NoSQL database that supports it. But if I’m going to go through all the trouble of migrating, why not look at something more radically different: a distributed database like NuoDB. This promises me freedom from all of these scalability concerns, as well as being a “Zero DBA” solution. Oh yeah, but it’s a beta product, so those are still just promises.
I want it all and I want it now
As I mull these options I can’t escape the fact that achieving scale is going to cost me in time, money, effort, and complexity. There are undoubtedly technologies that will help me achieve greater scalability, and many of those can take advantage of cloud resources to reach theoretically infinite limits. Nevertheless, right now I don’t see an easy path, and I definitely don’t need infinite capacity. That leaves me with two questions to ponder in later posts: how much scale do I need, and when do I need it by?