Why the cloud? Or, making the shift, a real-life story. Part 2

Migrating applications from physical servers to a new set of virtual machines on Amazon Web Services (AWS) has proved to be a rewarding experience. It was more expensive than expected, but it has performed well and has given us a means of scaling up rapidly, horizontally and/or vertically, when required.

My company has made a fundamental shift towards cloud computing and specifically towards Amazon Web Services (AWS) over the last 12 months. This is the second part of a two-part article (read Part 1 here) where I cover our experience migrating to the cloud.

In Part 1, I discussed our planning phase and test deployments, our first two steps, and in Part 2, I’m going to cover the rest: parallel running, going live, and updating our documentation and processes.

COMMISSIONING and then PARALLEL RUNNING

As described in Part 1, we decided to use AWS reserved instances for our hosting (see Part 1 for full details). Of course, we couldn’t simply switch on our application, go live, and be done with it. The prototype worked, but what about the ‘proper’ application and data? The next step was to introduce the finalized architecture slowly. We had decided on ‘live’, ‘stage’, and ‘dev’ environments, as recommended by our solution architect. Each environment would have its own elastic load balancers and at least two standard instances behind each load balancer. That was six instances in total. The cost of a micro instance during our test was $0.35 per hour, compared with the $0.48 per hour price of a standard instance (confusingly classed as ‘large’ by Amazon). When we ran six of them, the hourly cost shot up by an alarming 168%!

DATA, DATA, how it vexes me, DATA!

I work for a mapping company, so we have lots of data, and one particular challenge was to get this data into the cloud and onto the AWS instances. Two of our instances were our mapping servers, with all the data and server software installed on each.

A number of options were available and we considered them all:

1. Upload everything to Amazon Simple Storage Service (S3).
2. Use the AWS import/export service.
3. Install an FTP client/server and use that to transfer the data.
4. Use the Remote Desktop Protocol (RDP) copy function.

Using S3

S3 is a fast, inexpensive cloud storage offering from Amazon, which they use themselves. You store things in a conceptual folder called a ‘bucket’. As with the other AWS products, you pay only for what you use. Storage costs are listed on the Amazon pricing page, with additional charges for bundles of requests to and from the S3 bucket.

We had several hundred gigabytes to move into the cloud. We decided to use the very useful Mozilla Firefox add-on ‘S3Fox’ to move our data into our S3 bucket. We then installed S3Fox on each of the instances so that we could download the data. We took timings and, not surprisingly, downloading the data from an AWS S3 bucket to an AWS instance was by far the fastest leg. The bottleneck was our own internet connection speed (though not too shabby at 60 Mbps) and any limitations on our upload bandwidth.

There were a few gotchas, though, all to do with file sizes.

First, the S3 bucket was limited to objects no larger than 1TB in size (edit: This has now been raised to 5TB, see: http://aws.typepad.com/aws/2010/12/amazon-s3-object-size-limit.html) and this posed a number of problems for us, since our spatial data was already in the terabyte range. So, we had to chunk up the data prior to transfer. However, S3Fox itself appeared to be limited to files no bigger than 50GB. Anything bigger and it would fail. We suspected that this had more to do with the 32-bit operating systems we were using. This compounded our problem and forced us to reduce the file size of each chunk to no more than 50GB. For a 1.2TB file, it meant we had to create about 20 files, which we would upload sequentially to the S3 bucket in the cloud, then download each chunk to our selected instance before we stitched it back together again. The tools we used for this included 7zip (a fantastic tool) and some Linux tools.
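The chunk-and-stitch step can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling we used (that was 7zip and some Linux tools); the function names and the numbered-suffix naming scheme are assumptions made for the example. Counting in decimal units and ignoring compression, a 1.2TB file cut into 50GB pieces comes to 24 chunks, close to the ‘about 20’ quoted above.

```python
import shutil
from pathlib import Path

BLOCK = 1024 * 1024  # copy in 1 MiB blocks so a 50GB chunk never sits in memory


def chunks_needed(total_bytes, chunk_bytes):
    """Number of chunks required to hold total_bytes (ceiling division)."""
    return (total_bytes + chunk_bytes - 1) // chunk_bytes


def split_file(source, chunk_dir, chunk_size):
    """Split source into numbered files of at most chunk_size bytes each."""
    source = Path(source)
    chunk_dir = Path(chunk_dir)
    chunk_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    with open(source, "rb") as src:
        while True:
            block = src.read(min(BLOCK, chunk_size))
            if not block:
                break  # end of the source file, no more chunks to start
            # Hypothetical naming scheme: <original name>.0000, .0001, ...
            path = chunk_dir / f"{source.name}.{len(chunks):04d}"
            with open(path, "wb") as dst:
                written = 0
                while block:
                    dst.write(block)
                    written += len(block)
                    remaining = chunk_size - written
                    if remaining <= 0:
                        break  # this chunk is full
                    block = src.read(min(BLOCK, remaining))
            chunks.append(path)
    return chunks


def join_files(chunks, target):
    """Stitch the chunks back together, in the order given, into target."""
    with open(target, "wb") as dst:
        for path in chunks:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, dst, BLOCK)


if __name__ == "__main__":
    GB = 10**9
    print(chunks_needed(1200 * GB, 50 * GB))
```

Splitting would run on the local server before upload and joining on the instance after download; comparing a checksum of the original and the reassembled file is a sensible final check.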

The entire upload process to move our data from our local servers to the AWS instance via our S3 bucket took about a week to complete. Once the data was successfully stitched together, we copied the contents to a separate Elastic Block Store (EBS) volume and attached it to our mapping server instance as a new drive, volume E:, since the default instance only came with a 35GB C: drive and a 100GB D: drive, both too small for our needs. We also made a copy of E:, to be attached to the second of our mapping server instances, which was a lot easier than downloading the data a second time.

AWS import/export service

Obviously, AWS anticipated that some of their customers would need to move large amounts of data into and out of AWS, so they offer a service where one can use a portable storage device to load directly into the AWS high-speed network. We liked this idea, as the cost of a 2TB drive is now comfortably under $200.00. However, the service isn’t consistent: the AWS import/export service only supports the import of files from the portable drive into Amazon EBS for the US Regions. Everywhere else, the import/export puts data into your Amazon S3 bucket. This obviously cuts out the slowest part of the process (upload time), but the fiddly stitching together of multiple files still remained – we were keen to cut out the need to chunk up the data and reassemble the bits into a coherent file.

Using the FTP client/server

This was tried and it worked fine; however, it did require a change in the AWS security group to allow FTP through, as well as extra software installation and configuration. Our aim was to keep the instances as lean and secure as possible and installing FTP software would increase the vulnerability footprint.

Using the RDP copy and paste function

This also worked but only for small (and I mean, small) files. It was unsuitable for the size of files we were using.

At this point it was prudent for us to make a snapshot of this instance, as losing it would set us back severely. We then discovered another gotcha. With the new 1TB EBS volume attached, taking a snapshot of the entire instance would 1) take a long time and 2) most likely fail. Rather perturbed, we tried again with the same result. After a quick chat with an AWS Solution Architect (we had the benefit of knowing someone at AWS rather than going through Customer Support), it was recommended that we first dismount the new volume under Windows Disk Management, then detach the volume through the AWS dashboard, and then back up the instance and the EBS volume separately.

This worked but it added another step to an ever-growing work list.
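For readers who would rather script this than click through the dashboard, the AWS half of the procedure might look like the sketch below. It assumes today's boto3 library, which is not what we used at the time, and the function name, IDs, and labels are hypothetical. The Windows step (taking the volume offline in Disk Management) still has to happen on the instance first.

```python
def backup_instance_and_volume(ec2, instance_id, volume_id, label):
    """Detach the big data volume, then back up instance and volume separately.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    The volume must already be offline inside Windows Disk Management.
    """
    # Step 1: detach the data volume and wait until AWS reports it as free.
    ec2.detach_volume(VolumeId=volume_id, InstanceId=instance_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Step 2: image the instance, now without the large data volume.
    # Note: by default create_image reboots the instance while imaging it.
    image = ec2.create_image(InstanceId=instance_id, Name=f"{label}-instance")

    # Step 3: snapshot the data volume on its own.
    snapshot = ec2.create_snapshot(
        VolumeId=volume_id, Description=f"{label}-data"
    )
    return image["ImageId"], snapshot["SnapshotId"]


# Example call (placeholder IDs, requires boto3 and AWS credentials):
#   import boto3
#   backup_instance_and_volume(
#       boto3.client("ec2"), "i-0123456789abcdef0",
#       "vol-0123456789abcdef0", "mapping-server-1")
```

Passing the client in as a parameter keeps the function easy to dry-run against a stub before pointing it at a real account.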

Performance testing – does the application rock?

Despite the issues we had faced, everything was still on track and all the instances were created and ready to roll. It was at this point, with the architecture on a semi-live footing, that we ran through a very involved performance testing process. We created a test harness both on a dedicated cloud server (for cloud-to-cloud testing) and on another server within our own LAN environment. The latter test harness would probably be a more accurate reflection of real-life usage, as the test would need to navigate through network traffic and the firewall before hitting the internet.

Will the cloud server really be as fast as people claim?

The results were very pleasing. On average, the speed of response was less than one second and the AWS infrastructure easily handled a big jump in concurrent users, from one to 100 – we didn’t even look at the auto-scale functionality.
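To give a flavour of what such a harness does, here is a minimal sketch that ramps the number of simultaneous users and records the average response time at each level. It is not our actual harness: the function names and the intermediate levels (10 and 50) are assumptions, and `fetch` stands for whatever callable makes one request to the application under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def timed_call(fetch):
    """Run one request and return how long it took, in seconds."""
    start = time.perf_counter()
    fetch()
    return time.perf_counter() - start


def average_response_time(fetch, concurrent_users):
    """Fire concurrent_users requests at once and average their durations."""
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        futures = [
            pool.submit(timed_call, fetch) for _ in range(concurrent_users)
        ]
        return mean(f.result() for f in futures)


def ramp(fetch, levels=(1, 10, 50, 100)):
    """Map each concurrency level to its average response time in seconds."""
    return {users: average_response_time(fetch, users) for users in levels}


# Example (hypothetical URL, run from the machine hosting the harness):
#   from urllib.request import urlopen
#   results = ramp(lambda: urlopen("https://example.com/").read())
#   for users, seconds in results.items():
#       print(f"{users:>3} users: {seconds:.3f}s average")
```

Running the same script from a cloud server and from inside the LAN gives the two viewpoints described above.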

Security

After that test, we went through an involved round of security testing, from brute force hacking of the website (we warned AWS first in case they responded!) to social engineering, where we had some of our developers try to hack into our email accounts in order to create sophisticated phishing attempts. While I do not want to get into too much detail, the security is only as strong as your personnel. AWS itself was perfectly fine.

BETA Program

With testing complete, it was time for a limited beta-program, as from our experience there’s no testing quite like user testing. This was opened to a select number of our clients and the feedback was invaluable – they also discovered some bugs in the application itself that we had failed to notice. The beta-program should have run for a couple of months only, but we extended it by another three months with the intention that it would overlap with the go-live date. This was because we had some features in beta that would not make it to release due to legal issues.

GOING LIVE

The go-live date arrived and it was actually a bit of an anti-climax. With a marketing campaign kicking off and users signing up, the site smoothly transitioned from beta to live. The usage patterns were very pleasing with a steady upward curve in registrations and users.

REFLECTIONS ON THE PROCESS

We replaced a bunch of physical servers with a new set of virtual machines and we’re now running the live production servers 24/7, just as if they were physical servers, so we’re paying a constant fee each month. With a physical machine, the initial cost is usually up-front (the capital cost) and the cost and value of the machine drops over time, sometimes quite rapidly. Conversely, the requirements of ever newer applications and software place increased demands on hardware. The cloud servers, we’ve been informed, have a rolling program where the underlying hardware is constantly updated by the provider. Of course, one cannot control this upgrade process and if you really need the latest and greatest hardware, you cannot go out and get it!

There is a misconception that I want to tackle, though I am not a myth buster. The cloud is touted as being cheap and, for some users, it is indeed a cheap and powerful resource. You treat computers and computing power like a utility, turning them on and off whenever you need and only pay for what you need – an excellent idea I cannot fault. However, for a live production system such as the one I am working on, that doesn’t actually apply. Extending the utility analogy further, our usage is akin to electricity that powers our life-support system; there’s no compelling reason to switch it off to save money.

This makes calculating the annual cost quite easy. The standard Amazon Machine Images (AMI) we use are the large ones, currently priced at $0.48/hour, which works out at $11.52 a day or $4204.80 a year. The jump from a headline-grabbing price of $0.48 per hour, which doesn’t sound like much, to a more eye-watering $4204.80 per year can be disconcerting for those tracking the money closely. Now multiply this by the number of instances one has in one’s production environment, staging environment, and maybe a development environment, and the cost starts to hit the level that makes those in charge of budgets uncomfortable. Count the cost of the Elastic Block Store (EBS) for all your extra data, throw in an enterprise load balancer and lots of snapshots (you do want backups, right?), and a few hundred gigabytes’ worth of traffic to and from your AWS cloud, and the cost just keeps going up.
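The arithmetic is easy to check with a short sketch. It assumes a 365-day year and an instance that never stops, and it covers instance hours only; EBS storage, load balancers, snapshots, and traffic come on top. The function name is made up for the example.

```python
HOURLY_RATE_USD = 0.48  # large instance price quoted in this article
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def running_cost(hourly_rate, instances=1):
    """Return (daily, yearly) cost in USD for always-on instances."""
    daily = hourly_rate * HOURS_PER_DAY * instances
    yearly = daily * DAYS_PER_YEAR
    return round(daily, 2), round(yearly, 2)


if __name__ == "__main__":
    for count in (1, 6):
        daily, yearly = running_cost(HOURLY_RATE_USD, count)
        print(f"{count} instance(s): ${daily:,.2f}/day, ${yearly:,.2f}/year")
```

For a single instance this gives $11.52 a day and $4,204.80 a year; for the six instances across our three environments it gives $69.12 a day and $25,228.80 a year.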

My Bugbears

1. CloudWatch has no SMS notification, except in the US East Region.

2. Ideally, the import/export service where large files can be loaded into AWS would be made available for the EBS, but this is only the case in the US East Region. Those of us in Europe have to load large files to S3 and then download. This isn’t an ideal situation, but at least this method gets the data onto S3 without it being constrained by a slow upload speed from your own network. The movement of files between S3 and an EC2 instance is actually quite fast. So please, AWS – can we in Europe have the ability to import/export direct to an EBS volume?

3. It would be great if some sort of quota/cap on usage were available from within the AWS dashboard. For example, for a test system there might be a number of instances required. If one could give the instances an expiry date or a cap on total running time, it would ensure that costs could be controlled more easily. We once had a consultant who forgot that he had a large instance running and didn’t realize until the credit card bill came in a month later!

However, the advantages of rapid scaling up horizontally and/or vertically mean that responses to architectural changes can be fast, safe, and efficient. If you have an excellent technical architecture, a clear understanding of what you are delivering, and support processes in the form of configuration management, change management, and release management, then you’re good to go.

CONCLUSION

The cloud is here to stay, at least with my company and our parent company. It is definitely more expensive than we had first expected, because many of our system architects were fooled into expecting the cost savings of an ephemeral system, which do not apply to a production system that runs around the clock.