Planning for Successful Data Management

It is the data, in particular, that sets Database Lifecycle Management apart from the mainstream of application delivery. Data entities, and the way that organizations understand and deal with them, have their own lifespan. If we neglect the management of data, we risk disaster for the organizations that use it. If we take data management seriously, managing the databases that hold it becomes a lot easier.

At the heart of Database Lifecycle Management (DLM) is the management of the data itself, and this is, of necessity, a continuous task in any organization. Like several DLM activities, it entails responsibilities across all the individual applications, databases and reporting systems that use the data.

Many organizations struggle to implement and maintain proper strategies for managing data, through the various stages of its lifecycle, across all projects. Failure can lead to data that is inconsistent and hard to analyze reliably, possibly due to there being several inconsistent copies of that data in various parts of an organization. It can also lead to data being unavailable when it’s required, being used illegally or falling into unauthorized hands. In all these cases, costs can spiral dramatically. If a company hasn’t got a complete grasp of the data and the way it is processed, mergers and take-overs become particularly difficult. Also, an organization will have no accurate way of knowing how much it will cost to implement and maintain the data systems that it needs in order to support and expand its business.

How, then, do we start to put in place formal structures and strategies for managing business data, so that an organization can ensure the availability, security and integrity of that data at all times, and so that all relevant roles within the organization consistently understand their data responsibilities?

One useful practical strategy is to tie specific data planning and management requirements into the established phases of DLM. This article focuses on the planning that is required in the early phases of a new database project in order to ensure the successful management of the data, as part of the DLM activity. Although data management planning is considered to be mainly part of the governance process, it is also an essential aspect of both delivery and operations, and this article explains the main responsibilities of each of these roles.

The article will also review, briefly, the ongoing data responsibilities of each IT role for databases in active production use, as well as for ‘legacy’ database systems, where the data has reached the limits of its useful life and the team needs to deal with secure archiving or disposal.

What’s the payback for data planning and management?

Unless an organization plans properly for data, including its capture, storage, analysis, transport, security, availability and so on, it has no way of knowing how much the management of the data is going to cost, relative to the perceived business value of that data. With proper data planning, it will understand more fully the type and volume of data that it will need to manage, the type of storage systems that will be required and the sort of security measures that need to be in place. It will also develop a much fuller understanding of the required data-flow between applications, and the reporting and analytical requirements that will help it derive maximum business value from that data. If disaster strikes, in the form of a security breach or application failure, the impact is minimized if prior planning and data documentation are in place.

What is the payback for operations staff? They will know what sort of backup, recovery, availability and security plans they need to have in place for each class of data stored. They will know what data needs offsite backup, after what period data can be archived to “tier 2” storage, and so on, and they can plan for that. Sensitive data may need to be encrypted both at rest and when sent across the network; again, this needs upfront planning. Support should be much easier if the support team has a map of the flow of data through the organization, since they can then, if necessary, immediately ascertain the consequences of a failure at any point in the dataflow network.

What is the payback for developers? The biggest one is that there will be far fewer nasty surprises at deployment time, such as having to perform massive rewrites because their chosen approach to security fails to comply with legal requirements. It will also mean that a lot of the work of understanding the application’s business domain will be unnecessary, as much of the groundwork will already have been completed. Naming conventions will be more consistent, and it is quite usual to find that much of the data model already exists. It will be easier to determine which applications will require data feeds, and the likely formats required, and the same is true of the sources of some of the data that an application requires.

Managing data through the data lifecycle

Not only must there be a common, and managed, understanding of the business entities and the data that is associated with them, but business data itself must be managed throughout its useful life, from its initial creation to its eventual retirement. This becomes particularly important once data is used by more than one application; up to that point, its management is generally seen as the responsibility of the ‘owning’ application.

‘Corporate’ data usually transcends the life of the individual applications and databases that use it. This means that it is a business-level responsibility to make a clear statement about the requirements for the availability, integrity and confidentiality of all the accumulated data and information used, as well as how it is stored and preserved throughout its life.

The management of this so-called ‘corporate’ data is designed to encourage the use, improvement, monitoring, maintenance, assessment, management and protection of data throughout the organization, for the entire life of the data.

The lifecycle of the data, by which I mean the information itself, covers:

  1. Creation and receipt – the various means by which data is generated and enters the system, such as internal data entry via a CRM application, or from an external source
  2. Distribution – who, internally or externally, needs to see the data
  3. Use – the various business, analytical, and reporting uses of the data
  4. Maintenance – on-going support for the availability, security, recoverability and integrity of the data
  5. Disposal – strategies for archiving, and ultimately disposing of, data that has served its immediate business purpose

There are other categorization schemes for information, but the Information Lifecycle Management (ILM) scheme above is the simplest.

Managing Data within Database Lifecycle Management

The maintenance of a corporate data model and the management of the information lifecycle are best subsumed under the DLM activity within any organization, purely because there is no logical point of separation. Within DLM, it is the governance activity that is almost entirely responsible for ensuring that the key activities of the information lifecycle are initiated at the right time. Personal data, for example, must be disposed of within a set period that varies between legislative areas. Failing to do so is illegal, and companies can be prosecuted as a consequence.

William Brewer described each of the six stages of the database lifecycle (New, Emerging, Mainstream, Contained, Sunset, Prohibited) in his article, What is Database Lifecycle Management (DLM)?, but here we’re not going to break it down in such a fine-grained fashion. Instead, we’ll take a high-level view of the various data responsibilities of governance, delivery and operations, through the following broad phases of the ‘life’ of a database:

  • New databases – covering the phases from initial conception, through application/database design and development, up to the point the database first moves to a production system, in pilot form. During this part of the lifecycle, there will often be rapid rounds of database development and upgrades.
  • Active databases – covering the phases from the first production deployment, in pilot form, through to general production use, where data is being actively created, distributed, used and maintained. These databases are subject to frequent bug fixes, upgrades, new feature deployments, and so on.
  • Legacy databases – covering databases in the final stages of the lifecycle, from databases kept in production only for limited or legal use, with no active development support, through to those scheduled for retirement, and on to the final nail in the database coffin, where further use is prohibited and the database must be retired.

A well planned, proactive approach to data management, through each phase of DLM, will ensure that an organization has the data management processes in place to guarantee that data is always valid, available, consistent and secure.

One of the essential goals of DLM is to systematize or refine team processes such as database version control, build, continuous integration, and automated deployments such that these tasks can be performed effectively without reliance on tribal memory or one individual’s knowledge. It formalizes those processes so that they become consistent, repeatable and reliable across teams.

Likewise, as an integral part of DLM, we need to use data management processes that allow any IT role to understand their responsibilities with regard to the data, across all projects, and for all applications and databases that capture, store, manage, use or analyze that data, throughout its ‘natural lifespan’.

The Data Responsibilities of Governance, Delivery and Operations

In his article, Planning for a Successful Database Lifecycle, William Brewer explained why a successful application “depends on close teamwork between the three key activities, Governance, Delivery and Operations”. Broadly, and briefly, the responsibilities of the relevant IT team for each of these activities, through the database lifecycle, are as follows:

  • Governance – planning and project management, in order to ensure that the application continues to provide what the business needs, conforms to the requirements of any regulatory frameworks, and meets organizational standards and service level agreements for security, availability, recovery time and data retention, to name just a few.
  • Delivery – implementing all application design, development, release and deployment processes, and ensuring the timely delivery to the customer of an application that meets all agreed business requirements.
  • Operations – implementing systems and procedures for security architecture, access control, routine maintenance (bugs, hotfixes or improvements), support, monitoring, resilience and more.

Sometimes the application or database lifecycle is a close match for the data lifecycle; in other words, the data has little ongoing value or use beyond the life of the application that produces it. In such cases, it can seem as if the delivery and operations teams bear the brunt of the responsibility for the data management.

However, in other cases, data is scarily permanent. A bank, for example, will have a complete history of its customers’ data, even if it has used several applications to host it. As noted in the introduction to this article, this is why data management is often considered more akin to a business operation than an IT function. Much of the data lifecycle of long-lasting data is maintained by governance processes; delivery teams will have only a transient effect on, or interest in, these processes.

Data responsibilities when Planning New Databases

The article, Planning for a Successful Database Lifecycle, summarizes many of the broader database planning activities that are required at the very start of a database project, so here we focus exclusively on data responsibilities of governance, delivery and operations relating directly to the security, compliance, retention, resilience and availability of that data.

Governance data responsibilities when planning new databases

There is a lot of planning work that needs to be done at the very start of a new database project, primarily by governance, but in close collaboration with the delivery and operations teams. Some of the necessary steps are:

  • Identify the main business needs and therefore set business priorities for the data.
  • Provide an overall data architecture.
  • Identify all enterprise-wide data and processes shared by applications, and plan data interfaces and feeds.
  • Ensure the participation of all the necessary IT expertise within the organization for security, compliance and so on.
  • Agree a resilience plan that includes High Availability (HA) and Disaster Recovery (DR), in conjunction with the Operations activity.

Data classification

An important responsibility of governance for a new database project is to establish a Data Classification System (DCS).

Data can be classified in several ways: by its structure, whether it is structured, tabular, hierarchical or unstructured, or by the nature of the values themselves, such as geographical, chronological, qualitative, quantitative, discrete or continuous. However, what matters most for the data lifecycle is the type of security and access control the data requires, and whether it is subject to the organization’s compliance framework.

This classification affects both structured and unstructured data. The DCS will, in turn, determine who can view, or modify, the various classes of data, as well as other security questions such as which data may need to be encrypted during storage or transport and which data may need to be stored offsite in accordance with the relevant level of security.

The main classes of data are as follows:

  1. Highly sensitive company and confidential data – where disclosure could have legal and financial consequences, such as staff and clients’ personal details
  2. Sensitive company-specific data – the disclosure of which could have an adverse effect on operations, such as details of client and supplier contracts
  3. Data relating to the activities of a company – which is not meant for public scrutiny, such as sales figures or organizational structures
  4. Public information – which is freely available, such as price lists, contact details or any data used for publicity and brochures.

Written guidelines and procedures for classifying data should define its classes and categories, such as financial, corporate, or personal, as well as specifying the various roles and responsibilities of employees relating to data stewardship. A classification scheme needs to be clear and easy for all employees to apply.
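
To make such a scheme concrete, it can help to record the classes, and the handling rules attached to each, in a simple structure that applications and scripts can consult. The following Python sketch is purely illustrative: the class names follow the four categories above, but the handling rules shown (encryption, offsite backup) are assumptions that each organization would define for itself in its DCS.

```python
from enum import Enum
from dataclasses import dataclass

class DataClass(Enum):
    HIGHLY_SENSITIVE = 1   # personal/confidential data; disclosure has legal consequences
    SENSITIVE = 2          # company-specific data, e.g. contract details
    INTERNAL = 3           # company activity data not meant for public scrutiny
    PUBLIC = 4             # freely available information

@dataclass(frozen=True)
class HandlingRule:
    encrypt_at_rest: bool
    encrypt_in_transit: bool
    offsite_backup_allowed: bool

# Illustrative handling rules only; the real rules come from the organization's DCS.
HANDLING_RULES = {
    DataClass.HIGHLY_SENSITIVE: HandlingRule(True, True, False),
    DataClass.SENSITIVE:        HandlingRule(True, True, True),
    DataClass.INTERNAL:         HandlingRule(False, True, True),
    DataClass.PUBLIC:           HandlingRule(False, False, True),
}

def rules_for(data_class: DataClass) -> HandlingRule:
    """Look up the handling rule for a given class of data."""
    return HANDLING_RULES[data_class]
```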

Having documented the DCS, the governance team needs to work closely with delivery and operations to formulate an appropriate plan for storing and handling each class of data. An organization will generally adopt a relevant legislative framework, describing the current standards and legal requirements in its industry, for each class of data.

Data security and access requirements

The data classification work will inform subsequent security and access policies for the different classes of data, as well as decisions regarding the physical storage architecture (see below).

Data Storage Policies (DSP) are a set of rules and procedures designed to manage and control data in an enterprise. This set of rules also ensures that data is used for the purpose for which it was intended, e.g. to make profits in a business, and protects the interests of customers. The policies are designed to protect sensitive data and secure it, so that it is not given or released, inadvertently or otherwise, to unauthorized people, companies or organizations (see “data classification” above). In addition, the policies should cover misuse by individuals, such as malicious alteration or physical and digital manipulation.

DSPs can vary in scope, from how data is collected and stored, through to the set of applications that govern and control all aspects of the data. Internal as well as external data, such as data gathered from the web, should be included in the storage policies. A DSP is becoming ever more important because the growth in data makes it increasingly difficult for administrators to cope using routine maintenance tasks alone, e.g. backups and disaster recovery provisions.

A few selected rules to consider for protecting data while in storage are:

  • Authorized access and security controls – allow only authorized people to view, change, manipulate and update data, according to their level of security clearance
  • Audit trails and audit logs – monitor and check audit trails or logs regularly to ensure access controls have not been breached
  • Data hashing – to ensure data integrity, a hash value is generated: a fixed-length string of text or numbers that represents a larger volume of data as a short code. Comparing the stored hash with one recomputed later indicates whether the data has been altered in transit (see the sketch after this list)
  • Data encryption – the means of securing data, whether in transit or in the database. Encrypting sensitive or relevant data makes it less likely to fall into the wrong hands, be misused or misappropriated.
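
As a minimal sketch of the hashing rule above: the system writing or sending the data records a hash of it, and a later integrity check recomputes the hash and compares. This example uses Python’s standard hashlib with SHA-256; the choice of algorithm and the byte-string interface are assumptions for illustration, not a prescription.

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hash of a block of data as a hex string."""
    return hashlib.sha256(data).hexdigest()

# On storing or sending the data, record its hash.
original = b"2016-04-01,ACME Ltd,12500.00,GBP"
stored_hash = sha256_of(original)

# On retrieval or receipt, recompute and compare to detect alteration.
received = b"2016-04-01,ACME Ltd,12500.00,GBP"
if sha256_of(received) != stored_hash:
    raise ValueError("Data integrity check failed: content has been altered")
```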

Storage Architecture

Most organizations recognize a 5-tier structure when classifying application data according to the required storage architecture. For example, a tier 1 application relies on transactional data that is accessed frequently, every day, often by many people. The storage architecture for tier 1 applications needs to reflect the need for that data to be constantly available and returned quickly. This structure also covers less frequently accessed or less valuable data (tiers 2 and 3), disaster recovery architecture (tier 4) and offline storage (tier 5).

During data planning, it’s the responsibility of governance, in collaboration with operations, to classify the application according to its storage architecture requirements, and also to define the time period after which data can be partitioned off to cheaper, lower-tier storage, which DR plan must be supported, and so on.
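
A plan of this kind often ends up expressed as a simple policy or configuration that operations can act on. The sketch below shows one hypothetical way to record, per tier, an availability target and the age at which data may be demoted to the next tier; the tier descriptions, targets and day counts are invented for illustration, not figures from the article.

```python
# Hypothetical storage-tier policy; the availability targets and ages are illustrative only.
STORAGE_TIERS = {
    1: {"use": "frequently accessed transactional data", "availability": "99.99%",     "demote_after_days": 90},
    2: {"use": "less frequently accessed data",          "availability": "99.9%",      "demote_after_days": 365},
    3: {"use": "rarely accessed, lower-value data",      "availability": "99%",        "demote_after_days": 1825},
    4: {"use": "disaster recovery copies",               "availability": "on failover", "demote_after_days": None},
    5: {"use": "offline/archive storage",                "availability": "on request",  "demote_after_days": None},
}

def target_tier(age_in_days: int) -> int:
    """Return the lowest-numbered online tier whose demotion age the data has not yet exceeded."""
    for tier in (1, 2, 3):
        if age_in_days <= STORAGE_TIERS[tier]["demote_after_days"]:
            return tier
    return 3  # moving data to the DR/offline tiers is a separate, planned decision
```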

Physical storage location

The initial data classification work will also determine the physical storage location of the data, such as whether it can be stored locally, in cloud-based storage, or requires secure off-site storage.

  • Local storage – The data is simply stored in a local data center or other suitable holding repository, such as local servers or other shared hard drive storage, but not on individual local PCs, as these are not shared storage devices.
  • Data Warehouse (DW) – Traditionally, a data warehouse is located on the main server of the organization but recently cloud-based storage is becoming more prevalent in some organizations.
  • Cloud storage – This is online storage where the organization’s data is located off-site, potentially on several servers and even at several locations. The Cloud, unfortunately, has yet to be widely accepted as a safe storage location, due to uncertainty about the legislative framework in the location where the data is physically held, the need to rely on unknown security arrangements, and the ease with which huge quantities of data can be accidentally lost.

Data storage and transfer format

Besides policies relating to the physical storage location of data, governance needs to determine in which format the data will arrive, and the formats to be used for storage, transfer between systems and presentation.

By developing a proper understanding of the nature of the data involved, they can plan the correct type of database in which to store the data. This makes it possible to use graph databases, document databases and other specialized database systems for unstructured or semi-structured data, relational databases for transaction processing, and OLAP/Cubes for business intelligence. In other words, they can correctly define the requirements of a heterogeneous data platform, if necessary, rather than relying on a single platform. This can make the task of planning and costing this sort of system much easier.

During data planning, governance also needs to establish the basic storage format for each type of data, in terms of how it is stored (data type, unit of measure), transferred and represented. Some data, such as times, is only meaningful in a particular time zone; money only has meaning when associated with a currency. If storing speeds for automobiles, it needs to be decided whether these will be recorded in KPH or MPH. The planning process will also need to specify the checks and reconciliation processes that are required to make sure that any data corruption is detected quickly.

The big problems of data storage arise when there is no planning or understanding of the distinction between the representational version of data, the transfer version and the machine or binary version. How should the system handle dates, for example? They should be stored in a format that a machine can easily understand and on which it can rapidly perform simple calculations: essentially, a number representing an offset from a known point in time, together with any associated time zone. However, the date would be transferred in an ISO standard format, and represented in a way that anyone can understand.
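
As an illustration of those three versions of the same value, the sketch below stores a timestamp as a machine-friendly number (seconds from the Unix epoch, in UTC), transfers it as an ISO 8601 string, and formats it for human display. The choice of epoch and the display format are conventional assumptions, not requirements from the article.

```python
from datetime import datetime, timezone

# Machine/storage form: an offset from a known point in time (here, the Unix epoch, in UTC).
event = datetime(2016, 4, 1, 14, 30, tzinfo=timezone.utc)
stored = event.timestamp()                         # 1459521000.0

# Transfer form: an ISO 8601 string, unambiguous between systems.
transferred = event.isoformat()                    # '2016-04-01T14:30:00+00:00'

# Presentation form: rendered for a human reader.
displayed = event.strftime("%d %B %Y, %H:%M %Z")   # '01 April 2016, 14:30 UTC'

# A round trip back from the transfer format should preserve the stored value.
assert datetime.fromisoformat(transferred).timestamp() == stored
```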

Data Quality

By ensuring data quality, an organization is able to trust its data, so that it can make intelligent decisions to improve its performance, based on that data.

There must be a common and consistent data language for everyone in an organization. This will contribute greatly to overall data quality. For example, how does the organization define and identify a “customer”? The answers to such questions are not necessarily straightforward, and the organization certainly needs to avoid situations where one part of the business considers a customer to be an “account”, and another an “individual person”.

Data planning for a new database must not only establish a single, unified definition of each entity about which information must be stored, and the origin, or “canonical source”, of the data for each attribute of that entity, but also stipulate a consistent way to link all entity data.

Data will originate from various internal sources, processes, machines, instruments and sensors. Disciplines such as Master Data Management will help join all of the company’s information and data into a coherent whole. They help prepare the data, thereby ensuring that the organization has a consistent, accurate overview of all its information and can depend on the business decisions taken, based on the analysis of that data.

Governance, during planning, should provide a set of high-level requirements for effective control of data quality within each IT process, agree on objectives, and design procedures to measure data quality. This may entail the use of commonly referenced best practices and guidelines, such as COBIT, ISO/IEC 38500 and the Six Sigma methodology, as well as various tools for mapping, profiling, cleansing and monitoring data.
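
Measuring data quality ultimately comes down to concrete checks. The sketch below shows the sort of simple profiling such procedures might perform, counting missing values and duplicate keys in a batch of records; the record layout and the customer data are hypothetical, invented for illustration.

```python
from collections import Counter

def profile(records, key_field):
    """Small data-profiling check: report missing values per field
    and duplicate values of the nominated key field."""
    missing = Counter()
    keys = Counter()
    for rec in records:
        for field, value in rec.items():
            if value in (None, ""):
                missing[field] += 1
        keys[rec.get(key_field)] += 1
    duplicates = {k: n for k, n in keys.items() if n > 1}
    return {"missing_values": dict(missing), "duplicate_keys": duplicates}

# Hypothetical customer records with one missing email and a duplicated id.
customers = [
    {"customer_id": 101, "name": "A. Smith", "email": "a.smith@example.com"},
    {"customer_id": 102, "name": "B. Jones", "email": ""},
    {"customer_id": 101, "name": "A. Smith", "email": "a.smith@example.com"},
]
print(profile(customers, "customer_id"))
```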

Auditing and Compliance

Every country has different rules, directives and laws governing the security, storage and handling of corporate, financial and personal data. Regulatory compliance sets out the aims that organizations should meet to ensure they are fully aware of the relevant laws and regulations, take the necessary action, and have the data management methods in place to comply with them.

The International Organization for Standardization (ISO) produces guidelines designed to provide a common framework for how compliance and risk should operate together. In addition, a lot of regulation comes from European Union (EU) legislation. Various areas are controlled by different bodies, such as the FCA (Financial Conduct Authority) and the Information Commissioner’s Office, to name just two.

The governance process must translate the appropriate compliance framework governing storage and handling of sensitive data, into a simple set of checklists that delivery can use to ensure that all the work is compliant.

The compliance rules for handling ‘sensitive’ data will also feed into the plans for implementing the auditing requirements needed to ensure, and prove, that the data has not been accessed or modified by unauthorized people or processes. An audit can range from a full-scale analysis of business systems and methods to monitoring the system or administration log files that record changes and amendments.
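
At the simplest end of that range, an audit trail is just an append-only record of who changed what, and when, which can later be checked. The sketch below illustrates the idea with a plain JSON-lines file; the log location and record fields are assumptions for illustration, and a production system would more likely use the database engine’s own auditing or change-tracking features.

```python
import json, getpass
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("audit.log")   # hypothetical location for an append-only audit trail

def record_change(table: str, key, column: str, old, new) -> None:
    """Append a single data amendment to the audit trail."""
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "by": getpass.getuser(),
        "table": table, "key": key,
        "column": column, "old": old, "new": new,
    }
    with AUDIT_LOG.open("a") as log:
        log.write(json.dumps(entry) + "\n")

record_change("Customer", 101, "CreditLimit", 5000, 7500)
```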

Data Retention

Complying with the legal requirements of data retention is difficult and complicated. Data retention laws sometimes require data owners and service providers to keep large quantities of user data for much longer than their business operations require or, paradoxically, sometimes demand its deletion while it is still in active use.

Compliance laws like the CAN-SPAM Act and the Fair Credit Reporting Act in the U.S. demand that organizations confer on an individual the “right to be forgotten.” However, other laws require longer data retention periods, even when this runs counter to the individual’s wishes. The resulting tension with privacy rights is a real legal challenge to sort out. Generally, organizations will adopt a ‘framework’ that interprets the entire range of legislation and specifies in practical terms what is appropriate. This means that a compliance expert need only check actual practices against the framework.
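
In practice, the adopted framework is usually boiled down to a retention schedule per class of data that delivery and operations can implement and audit against. The sketch below shows one way such a schedule might be recorded and checked; the data classes and retention periods are invented for illustration and must come, in reality, from the organization’s own legal and compliance framework.

```python
from datetime import date, timedelta
from typing import Optional

# Illustrative retention schedule only; real periods are set by the compliance framework.
RETENTION_DAYS = {
    "financial_transactions": 7 * 365,
    "customer_personal_data": 2 * 365,
    "marketing_consent_logs": 365,
}

def due_for_disposal(data_class: str, created: date, today: Optional[date] = None) -> bool:
    """True if data of this class, created on 'created', has exceeded its retention period."""
    today = today or date.today()
    return (today - created).days > RETENTION_DAYS[data_class]
```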

Data Resilience

Data resilience is what allows continued, uninterrupted operation despite problems such as equipment and power failures, or any other faults or interruptions. Resilience ensures that the system maintains its operational ability.

Data resilience forms part of an enterprise’s data architecture. It is also often incorporated into disaster-planning and disaster-recovery considerations, which are closely linked to data protection. The task of governance is to make sure that actual operational practice conforms to these documented expectations.

Delivery responsibilities when planning new database projects

In planning a database application, the delivery activity will work with governance to determine what data is entirely domain-specific, and what data is already part of the organization’s existing data model. They will need to establish in broad terms what the requirements of the application will be for existing data. This will determine several requirements that will affect the development effort, especially where there are special security, logging and reporting requirements. The delivery team plans how they will best meet these requirements.

Furthermore, they plan, in conjunction with both governance and operations, any structures within the organization that support a service-oriented architecture (SOA), or any specific architectural changes required to support the application.

Operations responsibilities when planning new database projects

During the planning phase, the operations team will act in an advisory capacity to both governance and delivery. They will address High Availability (HA), Disaster Recovery (DR), security and architectural planning, as well as costing.

In planning the disaster recovery strategy, governance needs to:

  • Understand what is at risk;
  • Know what data has been stored;
  • Evaluate vulnerable points objectively;
  • Assess all the storage and the disaster recovery environment;
  • Perform tests to ensure that data and environments are secure;
  • Run internal security assessments to highlight environmental vulnerabilities.

As different applications and systems require differing levels of protection, protection tiers and availability, these areas need to be objectively analyzed and prioritized. The economic and operational advantages should be balanced against the requirement to protect data and applications adequately.

Ongoing Data responsibilities for Active Databases

Once a database is in production, either in pilot form, or in general use, governance, operations and delivery must cooperate to ensure that data management processes are implemented as agreed, and are then monitored, improved and adapted where necessary, so that all databases continue to meet the requirements for data availability, quality, and security.

Governance, broadly, will need to ensure that all information management processes are implemented as per the specifications, and also amended in the light of changes in the business (such as after an acquisition), or changes to the application or data (e.g. after a new deployment).

They will perform checks on data entry mechanisms, data quality, access control, auditing and logging, change control processes, and more. They will need to work with the delivery team to ensure that all data handling, security, logging and reporting mechanisms in production databases continue to comply with any updated compliance framework.

Operations will have data responsibilities relating to correcting reported data quality issues, implementing the agreed backup and data protection plan, managing data growth, and managing the storage architecture, including migrating data between tiers as required and running regular performance and compliance checks.

Ongoing Data responsibilities for Legacy Databases

Towards the end of a database’s lifecycle, it may be in ‘contained’ production use, to support specific, limited business functionality. Governance will continue to monitor for changes that are required as a result of security concerns or legislative changes, but any activity of the delivery team will be limited to changes that spring from these concerns. All ongoing operations such as maintenance and monitoring will be done as a routine, relying on documentation and a Central Management Server archive.

If there are plans to retire (‘sunset’) a legacy database, there is additional work to be done in ensuring that any replacement system can provide equivalent functionality, implementing plans for migrating any required data over to the replacement database, and planning for the enforcement of appropriate data archive and purge procedures for the legacy system.

Archiving involves copying or transferring the data to a secure holding environment, where it is stored in case it is required at a later date. This storage is usually for a lengthy period, for example in legal situations where the law requires retention of data or documents for a stipulated time, such as five or even ten years.

Purging, or disposal, marks the final exit, or death, of the value of the data. Its use has become redundant and so, generally after the archiving stage, it is purged from the system.
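
A minimal sketch of the archive-then-purge sequence, assuming a simple file-based store: the data is copied to the archive location, the copy is verified with a hash before the original is destroyed, and the purge is confirmed by checking that nothing remains at the source. Real systems would work against databases and managed archive platforms rather than flat files, and the paths here are hypothetical.

```python
import hashlib, shutil
from pathlib import Path

def archive_then_purge(source: Path, archive_dir: Path) -> Path:
    """Copy a file to the archive, verify the copy, then purge the original."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / source.name

    shutil.copy2(source, target)                       # archive step

    # Verify the archived copy before destroying anything.
    if hashlib.sha256(source.read_bytes()).hexdigest() != \
       hashlib.sha256(target.read_bytes()).hexdigest():
        raise IOError(f"Archive verification failed for {source}")

    source.unlink()                                    # purge step
    assert not source.exists(), "Residual data remains after purge"
    return target
```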

Governance will need to:

  • Plan for the termination of the service, checking that all users are prepared for its eventual withdrawal
  • Establish the data formats, retention time and archival rules.
  • Establish how any archived data will be hosted, as appropriate for the classification of that data, and which methods of access, encryption and storage are permitted.

Governance has a critical security, regulatory and compliance role to play in the purge step. It has to prove that the purge phase has been completed to exact requirements and that there is no residual post-purge data. Appropriate system checks, controls and user protocols should be in place, proving that total data elimination has effectively and successfully taken place.

Operations will be responsible for costing and implementing the data archiving and purging plan, as established by governance. For the duration of the data retention period, operations must run periodic checks that offsite/backup storage media can still be read.

Operations will also be responsible for arranging the data purging process, and will need to prove that the purge phase has been completed to the exact requirements set by governance and also, that there is no residual post-purge data.

Conclusion

DLM takes a view of data right across the organization that is using it. It might be expected that data is entirely the responsibility of the application that uses it, but this has never been the case in any enterprise. Most offices rely on whole chains of applications and processes that pass data between them. This means that the creation, use and disposal of data has to be supervised from a central point. If data needs to be changed, it must be easy to determine the original source of the data, and then amend it in such a way that it stays altered. To do this successfully requires certainty as to the original source, the ownership, and the way that the data feeds into downstream systems.

If data is lost, corrupted or stolen, the losses can be enormous. The impact can be minimized in any organization that understands the nature of its data, and the processes that act on that data, especially when it keeps to a sensible set of data strategies that are shared and understood. By adopting a sensible approach to DLM and data governance, an organization can save a great deal of time and money in recovering from adversity, and will be able to be quicker and more effective in responding to business change.