I worked for 15 years as a DBA and IT Manager in the insurance and financial markets. When I first started out, ‘chaotic’ would be the best word to describe our approach to the design, development and ongoing maintenance of our business applications and data. I remember vividly the “Wild West” days when, with only a database backup as a fallback policy, I’d make ad-hoc database changes directly to production with little to no testing. I learned the hard way that processes that can survive for short periods, with a slice of luck and when working on relatively small systems and in small teams, become untenable as the team, systems and business ambitions grow.
It would be nice to be able to say that, as the companies I worked for grew and matured, I witnessed a smooth technical and cultural change that allowed us to continually refine and optimize our IT systems in a planned and structured way. In truth, it was a lot messier than that. Some of the sensible models that exist for continual process improvement, such as the Capability Maturity Model Integration (CMMI) from Carnegie Mellon University, are barely on the radar of many organizations. My experience in the industry was characterized by periods of ‘heroic endeavour’, in response to very short-term priorities and targets. IT improvements were funded only when there was little alternative, rather than as part of a longer-term strategic plan.
Nevertheless, in fits and starts, we did move gradually and sometimes painfully towards more managed and optimized systems, where database or application changes could be made relatively quickly, and with less disruption, lower risk and fewer surprises. By the time I left the finance industry, our testing was automated, deployments were “1-click” and a semblance of proper IT governance was a reality, with the development team more generally accountable to the business as a whole, for the changes they wished to make.
“I’ve put all the files in this folder”
In my early days as a database developer-cum-reluctant DBA (early 2000), I was working in Financial Services, developing and managing some in-house trading software for a small hedge fund.
The trading system had an Access frontend and a SQL Server database. Change management was rudimentary. There was no version control of any kind. A “release” entailed shouting to the other developer, who was really a trader, to inform him that “I’ve put all the files in this folder.” All changes were made live to the production system, after a brief sanity check. It was quite a small operation, but the trading system was still dealing with a few hundred million dollars of people’s money.
Later, this hedge fund management firm merged into another larger hedge fund that had far greater assets under management (about $5 billion) but if anything an even more chaotic approach to change management.
Their key business functions were defined in Excel files. The system used SendKeys macros to send instructions to a DOS-based investments management system. We produced basic reports using some Java components to run queries against this DOS database. Nothing was source controlled. All changes were direct to production. There was no formal issue tracking beyond erratic entries into a central spreadsheet.
Every attempt to introduce a change to a report would involve hacking the Java code, and then running it in production to see what the new report looked like, hopefully while nobody else was trying to look at it. The Java code was badly written and it was very easy to make a mistake and blow everything up, at which point, I simply had to revert to my original file system copy and try again.
My first task was to migrate the database to SQL Server and replace the unreliable SendKeys macros with a COM interface. Along the way, I managed to get the Java code into source control, or at least into Visual SourceSafe (VSS). The database code was less of a priority at that point, since making any schema changes wasn’t a realistic prospect.
Eventually, I built a dedicated data mart and moved some of the main business functions out of Excel. I had devised a new schema for the data mart, and began making more database changes, so I started scripting out the database schema each evening and saving it to VSS. I also managed to acquire a second server. It wasn’t called anything as grand as “staging”, but it was at least a place where I could break things with impunity before pushing changes to production. I started using a schema comparison tool (SQL Compare) to push changes between the two environments.
Rapid Business Expansion, No IT Planning
At this time, the firm had some success in winning new business and shifted from being a specialist hedge fund into supporting a broader range of product types. It was a period of rapid business expansion, achieved without giving a great deal of strategic thought to the support functions, including IT.
The business wanted a new trading platform that would drive and support their expanding business interests. They wanted new functionality, and they wanted it delivered quickly. The IT team grew, and we were expected to undertake much more ambitious projects.
Our rudimentary and manual change management processes, which showed evidence of strain even when the development team was relatively small, began to crack. There was no automated workflow around any of our processes. It all relied on people firstly remembering to check their changes into source control, and then getting my “okay” before anything went into production. Inevitably, as the team grew, along with the pressure to deliver, mistakes crept in.
We simply had to make some overdue improvements to our processes. We acquired a third environment, so we now had Development, UAT/QA (depending on what was required at that time) and production. We moved from VSS to Subversion, which was a big leap forward in terms of the ease with which we could make concurrent changes. We also started using SQL Source Control to commit database changes directly from SSMS, as well as CruiseControl as a build server and JetBrains’ YouTrack for issue tracking, in place of Excel.
Thanks to these improvements in our development and deployment processes, we managed to deliver some decent bits of functionality for the business, but progress was often frustratingly slow, because we were working against deeper problems.
Deeper data architecture problems
The right tools will help make your application and database change management process more reliable and predictable, but it’s still fundamentally hard to deliver changes quickly and efficiently, if the underlying architecture isn’t well-designed, flexible or adaptable. These inadequacies in our data architecture, and a lack of time for any forward planning caused us a lot of pain.
In terms of IT budgets, the focus always tends to be on short-term tactical objectives. The business wants a new piece of functionality to support a current objective, and needs it as quickly as possible. This meant there was never time to tackle fundamental problems with the underlying data architecture. There was resistance to the idea that the data architecture was part of the “software stack”. Combine this attitude with artificial deadlines (“I want this by Tuesday”) and it meant that tactical decision-making was the norm. Over time, this actively subverts good architecture.
Each new application tended to be a ‘stovepipe’ or ‘silo’, with its own database and its own definition of a data model. It was hard to adapt the schema for these applications to introduce new features, and even harder to integrate the data from the various silos when we needed a more unified view of the business data.
The core problem was that there was no unified understanding of an underlying data model for the business. In the data model, we had to represent important business entities such as “asset classes” and “risk categories”. If there’s no central definition of these entities, what they mean and how they are defined, then you simply can’t correlate data from each silo.
At one point we embarked on a project to upgrade our compliance engine, which would test proposed trades against a relatively simple set of rules (“The exposure of this fund in US dollars cannot exceed 80% of its total value”). Each customer would have slightly different rules.
The compliance engine had to integrate data from numerous “stovepipe” databases, perform the test and feed the result into the central trading system. We were having to manipulate our schemas a fair bit in order to support these new applications, but without any lead time or forward planning, we ended up creating a lot of cross-database and linked server dependencies, which made it truly painful to move changes between environments.
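The rule evaluation itself is conceptually simple; the hard part was integrating the data. A minimal sketch of the idea in Python (the class, function names, rule format and figures are all invented for illustration; the real engine pulled its positions from several databases):

```python
from dataclasses import dataclass

@dataclass
class Position:
    currency: str
    value_usd: float  # position value, converted to a common currency (USD)

def exposure_breach(positions, currency, max_fraction):
    """True if exposure to `currency` exceeds `max_fraction` of total fund value."""
    total = sum(p.value_usd for p in positions)
    exposure = sum(p.value_usd for p in positions if p.currency == currency)
    return exposure > max_fraction * total

def check_proposed_trade(positions, trade, rules):
    """Apply a proposed trade to the current positions, then test every rule.
    Returns the list of rules the trade would breach (empty means compliant)."""
    new_positions = positions + [trade]
    return [rule for rule in rules
            if exposure_breach(new_positions, rule["currency"], rule["max_fraction"])]
```

Each customer's slightly different rules then reduce to a different `rules` list, e.g. `[{"currency": "USD", "max_fraction": 0.8}]` for the "exposure in US dollars cannot exceed 80% of total value" rule.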
We took the pain, and stayed late every night for months, and when things went wrong we fought the fires. We took that pain over and over again for a long time. We knew what the pain was, we knew what the solution was, but we weren’t empowered to step back. We needed to overhaul the underlying data model; we needed to introduce an abstraction over the database, but senior management felt that the right way to run a project was to lean on everybody aggressively to “get it over the line”, with a promise for time to look into the “other stuff” later. Of course, that time was rarely found.
Lack of testing
Over this period, a lack of automated testing during development became a major problem. When failures happened, the first sight of the problem would be through the front end, and if it didn’t report it very clearly, then you really just had to go through all the connection strings, all the permission sets, and logs, and try to track down the issue.
I recall one project where we requested a more test-driven approach to the development. The outside consultancy presented two prices to the CEO, one with testing and one without. The CEO chose the one without. It seems surprising now, but it wasn’t at the time.
We’re now well versed in the case for test-driven development, but 10 years ago it was not the norm. Testing still meant someone firing up the application and clicking around; that’s what people understood it to mean.
The management perspective was to try to hire decent developers who didn’t make that many mistakes. It’s a different mind-set. If you think application faults are blunders by people who should be better at writing software, then the answer is shout at them more, shout louder, lean on them, fire them if you have to, hire in some new ones then shout at them to make it clear that they’re not allowed to make mistakes.
Pain pushing changes between environments
Connection and permission-related failures were a constant pain when promoting changes from one environment to another. We had segregated permission sets for our three environments, but moving anything from one environment to another was still manual and painful. I lost count of the number of times a stored procedure had been introduced, but the associated role hadn’t been given the correct permissions, and so it would fail.
By necessity, I spent quite a lot of time smoothing out the process of promoting changes up through the environments. I built up a collection of scripts that I could rerun relatively quickly, but a lot of the stages were still manual and it burnt a lot of my time as the reluctant DBA.
Pushing changes down the environments was even more tedious and error prone. Developers would request a copy of production or a copy of the current UAT for their environments, and I’d stumble constantly into permissions issues, environmental inconsistencies (different numbers of servers, different linked server definitions), and other ad-hoc complications related to a specific request, such as to introduce subsets of changes from UAT that hadn’t yet been deployed to production.
Another difficulty we had was imposing the discipline of not making direct changes to the Test/QA environment. We wanted testers to feed bugs back down to development, so they could be fixed, and then the new code tested and promoted back up to UAT/QA from source control. However, it didn’t often happen this way. Our change management processes weren’t as slick or automated as we needed them to be; they took time, and the developers were always busy. As a result, testers would demand privileged access to UAT/QA in order to make changes and see whether the change that they thought would make their test pass really did make their test pass. Of course, there was always a risk that without careful oversight, which of course eats up time, direct changes to these environments would not make it back into the “canonical source” in source control.
Data architecture redesign
Eventually, we embarked on a project that more or less forced us to rethink our application and data architecture.
The organization wanted to automate the main middle office operations and functions, and for Foreign Exchange trading one of the main functions is rebalancing. By the nature of Foreign Exchange, the value of a client’s assets in each given currency changes all the time, due to market movement. Rebalancing meant bringing the exposure to market risk in a particular area up or down. Exposure can be very volatile, so we needed to be able to rebalance quickly, often by making very large foreign exchange trades into the market. At month’s end there would be hundreds of rebalancing trades to be implemented at exactly 4:00 PM on the last day of the month. It was very stressful, and these rebalancing calculations were being done on spreadsheets!
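The arithmetic behind a single rebalance is straightforward, even if the operational pressure was not. A hedged sketch of the calculation (the function name, dict shapes and tolerance parameter are my own invention, not the production spreadsheet logic):

```python
def rebalance_trades(current_values, target_weights, tolerance=0.0):
    """Given current market values per currency and target portfolio weights,
    return the buy (+) / sell (-) amount needed in each currency.
    Differences within `tolerance` (a fraction of total value) are ignored."""
    total = sum(current_values.values())
    trades = {}
    for ccy, weight in target_weights.items():
        target_value = weight * total
        diff = target_value - current_values.get(ccy, 0.0)
        if abs(diff) > tolerance * total:
            trades[ccy] = diff
    return trades
```

A portfolio worth 60 USD / 40 EUR with 50/50 target weights, for example, yields a sell of 10 in USD and a buy of 10 in EUR; the real systems then had to turn hundreds of such differences into market trades at exactly 4:00 PM.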
We hired some consultants to help build the main components of that application, and at around the same time we also had some consultants in to review our IT infrastructure and processes, and to find the root cause of some of the recurring problems. They came back with a lot of recommendations. Thankfully, one of those was to implement a new data architecture.
An experienced data architect came in to help us design and build a data architecture that could service the business properly. It was a ground-up, analytical approach. He interviewed users and IT, enumerated systems and business functions, came up with a logical data model for the business, which we then reviewed and tested against real business functions, before coding it into a schema.
With a unified data model in place, we were able to introduce new business functionality at a much-improved rate, adopting a service-oriented approach, based on Windows Presentation Foundation (WPF) with a WCF data service, for the Rebalancing project, and all future projects.
With the help of a consultant development team, we also managed for the first time to introduce a proper package of automated tests, running on TeamCity. We deployed from TeamCity into each environment. It meant that the TeamCity projects were relatively complex, with a lot of environment variables determining what was going to happen after a build. However, with the new services architecture, application developers could build a WCF service and WPF application, hit a button to run the test suite and, if it passed all the tests, push into the test environment, and from there into the UAT environment.
On the downside, despite sorting out the data model, the database was still out of sync with all these positive testing and deployment changes, and the new, smooth application processes exposed how relatively backward the database processes were. Database deployments or upgrades were still a relatively manual process, driven by SQL Compare, and in that respect we still suffered from a lot of our old problems.
ORMs and the “stateful versus migrations” debate
We gradually built up a more complete layer of reusable services for all of our applications. In many cases, a service required a dedicated database that would store only the data required to support that particular business function.
The application team wanted to adopt a “migrations” approach to managing database changes in sync with application and service changes, automating database migrations through their ORM (Object-Relational Mapping) tool, which was NHibernate.
In this approach, the application team developed the data model in the ORM and a member of the DBA team would then vet the schema to make sure it was consistent in its basic definitions with our underlying data model for the business as a whole.
Once the data was in its final state, the application or service was required to feed completed records into our central, relational data store, where we had in place auditing and constraints to ensure all the data conformed to the rules for that particular business function.
During development, as the team committed each database modification, the ORM generated a migration script describing that change, which was stored in the VCS. The ORM would automatically run the required scripts to generate a new build, or to advance a database from one defined state to another (e.g. build 11 to build 14).
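The mechanism can be sketched with SQLite standing in for SQL Server; the table names, scripts and version numbers here are invented for illustration (NHibernate generated and ran the real scripts for us):

```python
import sqlite3

# Ordered migration scripts, one per committed change (invented examples).
MIGRATIONS = {
    1: "CREATE TABLE trade (id INTEGER PRIMARY KEY, amount REAL);",
    2: "ALTER TABLE trade ADD COLUMN currency TEXT;",
    3: "CREATE TABLE audit (id INTEGER PRIMARY KEY, trade_id INTEGER);",
}

def current_version(conn):
    # The applied version is recorded in the database itself.
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    return conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0

def migrate(conn, target):
    # Advance the database from its recorded state to `target` by running
    # each intermediate script, in order, exactly once.
    for version in range(current_version(conn) + 1, target + 1):
        conn.executescript(MIGRATIONS[version])
        conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    conn.commit()
```

Advancing from build 2 to build 3 then runs only script 3, while the same scripts, run in sequence against an empty database, produce any chosen build from scratch.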
This worked reasonably well in these cases where there was a clean 1:1 relationship between application and database, and they could change in lockstep. Unfortunately, the application developers were convinced this was the only way ever to do any kind of change control on databases and wanted to apply it everywhere.
However, we also had databases that were shared between multiple applications, and had different reporting responsibilities. We argued that they could not possibly be under the jurisdiction of one particular service or application, and so began an awful lot of bun fights about ORM versus non-ORM, and use of migrations versus simply storing in the VCS the current state of each database object. In the end, we settled on a two-solution approach where application-specific databases were built using ORM and managed entirely under that solution, but domain databases, those that supported multiple applications, were treated differently.
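The "state" approach works from snapshots of each object's current definition rather than a history of changes; a deployment is whatever set of statements reconciles the two states. A toy sketch of the comparison step (an illustration of the idea only, not how SQL Compare is implemented; the object names are invented):

```python
def diff_schemas(source, target):
    """Compare two 'state' snapshots of a schema, each a mapping of
    object name -> definition, and return the change types needed to
    make `target` match `source`."""
    changes = []
    for name, definition in source.items():
        if name not in target:
            changes.append(("CREATE", name, definition))   # new object
        elif target[name] != definition:
            changes.append(("ALTER", name, definition))    # changed object
    for name in target:
        if name not in source:
            changes.append(("DROP", name, None))           # removed object
    return changes
```

The appeal for shared domain databases is that no single application "owns" the change history; any two snapshots can be reconciled, whichever team produced them.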
A better development model
When dealing with domain databases, multiple projects would request changes to that database. Initially, we had only a single shared database development environment for each domain database and we encountered issues around not being able to freeze an environment without impacting other projects, and not being able to progress an environment, to support a request from one project, without impacting another.
To address these issues, we switched to a “swim lane” model of development with a separate database development server for each project team. We had four development environments, DevA, DevB, DevC and DevIntegration. In A, B and C, developers could work on features specific to their project. This gave us the ability to freeze and unfreeze and progress our environments separately.
We tried to avoid branching as far as possible, because of the pain of merging, especially for the databases. Occasionally, we needed to implement a feature that cut across project work, and touched multiple services and apps, forcing us to branch so as not to lose in-flight work. Other than that, we would simply analyze the current set of work and the likelihood of one project’s changes “breaking” another project, when we pushed the changes to DevIntegration. Generally, we found the answer was no, and we could take steps to mitigate any likely problems we did foresee.
Each team committed daily during development; every night we would build all the databases against a copy of production, and run tests against that and against DevIntegration.
Automated database deployment
By this stage we had four separate development environments, plus test, staging, UAT and production. We also had seven or eight services, three applications, and half a dozen databases.
I had a central spreadsheet where I tried to keep track of which version of each service, application, or database was in any given environment at any time, and when a new version had been introduced, but it was becoming time-consuming and error prone. Also, we were still suffering a lot of the pain described earlier when trying to promote changes between environments.
This situation improved dramatically once we’d adopted a proper release management tool (Red Gate Deployment Manager, now retired and replaced by Octopus Deploy, with SQL Release) that would help us manage deployments for both the application and the database. Through a single interface, we now had visibility into exactly what had been deployed where and when, and we eventually arrived at single-click deployments between development and test environments, and then all the way up to production.
We resolved the age-old problem with permission failures, when moving changes from one environment to the next, by separating out the permissions for each environment into a set of scripts and environment variables. We used a Visual Studio (VS) solution for this so that, for example, on a deployment to Test or UAT, we’d tear down the whole permission set for a database and reapply it afresh from the current VS solution, stored in source control.
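A simplified sketch of that tear-down-and-reapply idea, generating the T-SQL for one environment (the role names, permissions and the PERMISSIONS structure are invented for illustration; ours lived as scripts and variables in the VS solution):

```python
# Hypothetical per-environment permission definitions, kept in source control.
PERMISSIONS = {
    "Test": {"app_role": ["EXECUTE"], "report_role": ["SELECT"]},
    "UAT":  {"app_role": ["EXECUTE"], "report_role": ["SELECT"],
             "auditor_role": ["SELECT"]},
}

def permission_script(environment):
    """Build the T-SQL that drops and recreates every role for one
    environment and reapplies its grants, so no permission is ever missed."""
    statements = []
    for role, grants in PERMISSIONS[environment].items():
        statements.append(
            f"IF DATABASE_PRINCIPAL_ID('{role}') IS NOT NULL DROP ROLE [{role}];")
        statements.append(f"CREATE ROLE [{role}];")
        for grant in grants:
            statements.append(f"GRANT {grant} TO [{role}];")
    return "\n".join(statements)
```

Because the whole permission set is rebuilt from the same definitions on every deployment, the "stored procedure exists but the role was never granted EXECUTE" class of failure disappears: the grant is either in the definitions, and therefore everywhere, or nowhere.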
After a lot of work, I was able to fit multiple development styles (migrations and state-based, for example) into that single deployment interface and we were in a much happier place. We also adopted a scheme where only privileged users could deploy to specific environments. So, for example, in order to deploy to production, you had to log in as a special user, whose password was known to only two people, and all the other environments would be greyed out. We would perform these production deployments in front of the people responsible for signing off that change.
IT governance
When I first started out, there was no governance, no real business oversight of what IT did or how. Over time, however, this began to change, especially in the period following the consultant review.
In practice, this meant that there was now a “gated” approval process for each development effort. A candidate piece of work would require a business case and a change control committee would need to approve the project into the current workload.
During development and test, we’d have a small “release gate”, for release of a finished feature to UAT. We went through many iterations of the release gate criteria before we arrived at something that was useful to all parties, without causing too much delay or drag. Essentially, it was a wiki page, which we created from a template and which answered questions such as: What’s changing and why? What’s the risk? Who’s affected? How are you communicating with them? How are you performing testing in UAT? What is the worst case scenario? How will you roll back, if necessary? This review process had to involve non-technical people, so for UAT that might just be business analysts and the head of operations.
For a release to production, the gate was higher and the review process involved a compliance officer, the CEO, anyone who you could get into a room, but the more senior the better. For each significant change, such as a software upgrade, or database schema changes to support new functionality, and possibly data changes as well, we’d present along with the wiki page various ‘artefacts’ to help describe what was changing and why and how we’d tested it. This might include test coverage reports from TeamCity or the SQL comparison report from SQL Compare, and so on.
As a technical manager, it’s easy for me to review the tests, the code, and the schema and data changes and say “this is all correct”, but I’m only signing off at a certain level. The compliance officer is responsible for making sure we’re observing the regulatory rules, as well as the client-specific rules that the software was designed to implement. They’re ultimately responsible, not the technical team. The technical team would have to do a lot of education of the non-technical team in order to get them to understand what they were approving.
The increased visibility we had provided into our deployment process helped with this. For example, the compliance officer could see exactly what had been deployed to UAT and for how long, which gave greater confidence that the change had been tested thoroughly.
We also spent a long time developing ways to communicate simply what was often complex new functionality, such as that required to perform rebalance calculations. This was as much a problem of describing the domain language as the actual software change.
Rather than just describe a change to the “Rebalance” software, or why we needed to introduce a new abstraction to the logical model and therefore a new object to the schema, we had to explain in business terms why these existed and how we intended to drive the required behavior through the software.
Software development for a specific business domain often requires evolving and reinforcing a domain-specific language, so we had to introduce and communicate changes to the business users as we developed the systems.
If all of this hadn’t been documented and explained well ahead of time, then the final sign off would be quite difficult.
When I started out with databases, neither I nor the company I worked for had an idea of the extent of the techniques and processes that contribute to an ordered and repeatable management of database development. In any organization, it takes time, and the necessary components have to be introduced step-wise. There is no single product or other silver bullet that allows databases to be developed quickly, and for features to be added predictably in line with the requirements of the business. Database Lifecycle Management (DLM) requires cultural changes, the learning of techniques, the introduction and integration of a range of tools and a different way of working with the business. For a business that depends for its existence on rapid, robust and scalable systems in a fast-changing industry that is subject to regulatory supervision, the benefits are survival and a good chance of growth.