The Logical Data Warehouse – Towards a Single View of All the Data

What is wrong with the Enterprise Data Warehouse? Quite a lot, it seems. If you take the narrow view that the struggle is simply one of accommodating and interrogating huge quantities of data, then initiatives such as the Virtual Data Warehouse and Logical Data Warehouse make sense. But what about data quality, security, access control, archiving, retention, privacy, and regulatory compliance?

For years, the traditional enterprise data warehouse (EDW) has been the mainstay of comprehensive business intelligence (BI) solutions, providing a central repository for data that everyone could turn to for that one source of truth. But in today’s world of information overload, the EDW can barely keep pace with the explosive volumes and diversity of big data, causing IT professionals to look elsewhere for BI solutions that are highly flexible and scalable, while able to respond to the growing demand for real-time analytics.

In the traditional EDW platform, data might come from transactional databases, line-of-business (LOB) applications, customer relationship management (CRM) systems, enterprise resource planning (ERP) systems, or any number of other sources. Before the data was loaded into the warehouse, it was cleansed and transformed in whatever ways were necessary to ensure its reliability, consistency, and accuracy across the enterprise. Extract, transform, and load (ETL) operations were precise and highly refined, providing a stable and predictable environment from which to access the data.
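
As a minimal sketch of that ETL pattern, the following Python example extracts raw rows from a hypothetical staging table, cleanses and conforms them, and loads them into an equally hypothetical warehouse table (the table names, columns, and cleansing rules are illustrative only):

```python
import sqlite3

def extract(staging_conn):
    # Extract: read raw customer rows from a hypothetical staging table.
    cur = staging_conn.execute(
        "SELECT id, name, country, revenue FROM staging_customers")
    return cur.fetchall()

def transform(rows):
    # Transform: trim whitespace, standardize country codes, reject bad rows.
    country_map = {"usa": "US", "united states": "US", "uk": "GB"}
    cleansed = []
    for id_, name, country, revenue in rows:
        if not name or revenue is None:
            continue  # drop rows that fail basic quality checks
        code = (country or "").strip().lower()
        cleansed.append((id_, name.strip(),
                         country_map.get(code, code.upper()), float(revenue)))
    return cleansed

def load(warehouse_conn, rows):
    # Load: write the conformed rows into the warehouse dimension table.
    warehouse_conn.executemany(
        "INSERT OR REPLACE INTO dim_customer (id, name, country, revenue) "
        "VALUES (?, ?, ?, ?)", rows)
    warehouse_conn.commit()

if __name__ == "__main__":
    # Both databases, and the tables inside them, are assumed to exist.
    load(sqlite3.connect("warehouse.db"),
         transform(extract(sqlite3.connect("staging.db"))))
```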

With an EDW infrastructure in place, data scientists and information workers had a centralized platform from which they could perform complex analytics and generate informative reports that let viewers filter and drill into the data. An effectively designed EDW would, in theory, contain a comprehensive set of historical data at a level of granularity fine enough to yield insights that justify the investment in time and resources necessary to implement and maintain such a system.

Long live the enterprise data warehouse

At one time, the EDW could host enough data to provide decision-makers with the information they needed to understand trends and drive business strategies. Data would be transformed and loaded into the warehouse, often through nightly batch feeds, and made available through an assortment of BI tools, providing a handful of analysts with a rich repository of information they could poke and prod and ponder at their leisure. But those days are long gone.

In today’s climate, data volumes are expanding at staggering rates, due in no small part to Web 2.0 and the accompanying cloud services, social networks, mobile devices, and Internet of Things (IoT), all of which fall under the big data umbrella, or more precisely, the big, varied, dispersed, mostly unstructured data free-for-all that has come to define the information age in all its massive glory.

Not only is this morass of information growing at speeds unimagined 20 years ago, but it also comes scattered across global silos in an insanely wide range of formats, and with the expectation that it should be accessible, meaningful, and ready to be consumed by the latest self-service BI product racing over the horizon with its eye on real-time analytics of multi-sourced, multi-typed, constantly multiplying data.

It’s not surprising that, in the face of this data overload, the traditional EDW should fall short: flounder, falter, crumble to its knees. The one-source/one-truth promise that defined the EDW in its heyday is being strong-armed into giving way to the feeds and sensors and scanners and social channels and RFID tags and countless other data generators and stores that have been multiplying from sea to shining sea.

The EDW offered a central repository of cleansed and trusted and well-structured data, and although such a repository still serves a vital purpose, the world in which we live is made up mostly of raw, messy, unstructured data, and with all that growing messiness comes an equally growing interest in making sense of it, deriving value from it, making decisions based upon it. To do so, however, requires solutions that are flexible, nimble, affordable, and relatively painless to implement, everything the EDW is not.

Implementing and maintaining a traditional EDW platform takes a significant amount of planning and investment, with careful thought to how data will be extracted and transformed and loaded, along with what resources will have to be dedicated to make it all happen and keep it all happening. Changes can be difficult to implement and disruptive to business. Modifications to source applications can wreak havoc. As for the EDW projects themselves, by the time they’re implemented, they often no longer meet the current business needs. That’s not to say the traditional EDW should be cast aside, but rather that, when it comes to big data, the EDW has met its match.

All hail the logical data warehouse

Organizations looking to take control of this onslaught of information are turning to other solutions to meet their data needs, either in addition to or instead of the traditional EDW. Quite often this means turning to a logical architecture that abstracts the inherent complexities of the big data universe. Such an approach embraces mixed environments through the use of distributed processing, data virtualization, metadata management, and other technologies that help ease the pain of accessing and federating data.

Dubbed the logical data warehouse (LDW), this virtual approach to a BI analytics infrastructure originated with Mark Beyer while he was participating in Gartner’s Big Data, Extreme Information and Information Capabilities Framework research in 2011. According to the blog post “Mark Beyer, Father of the Logical Data Warehouse, Guest Post,” Beyer believes that the way to approach analytical data is to focus on the logic of the information, rather than the mechanics:

This architecture will include and even expand the enterprise data warehouse, but will add semantic data abstraction and distributed processing. It will be fed by content and data mining to document data assets in metadata. And, it will monitor its own performance and provide that information first to manual administration, but then grow toward dynamic provisioning and evaluation of performance against service level expectations. This is important. This is big. This is NOT buzz. This is real.

After Beyer introduced the concept of the LDW, he and his Gartner colleagues fleshed out the idea, and in the end, identified seven major components that define the LDW platform:

  • Repository management – The implementation of EDW data repositories to support specific use cases where the highest quality data standards must be maintained, such as those required for compliance and regulatory matters. The more valuable the data, the more likely it will need to reside in the EDW, assuming that data sizes are not prohibitive.
  • Data virtualization – A single view of data from distributed sources, regardless of the type or location or whether structured, semi-structured, or unstructured. The data remains within the source systems and can include Hadoop clusters, relational databases, NoSQL databases, cloud services, data lakes, file servers, social networks, or any number of systems.
  • Distributed processing – An approach to data querying and analytics that pushes the processing down to the source system where the data resides. If a query spans multiple data sources, each system can process its own chunk of data, with the results from all systems aggregated into a unified set (a minimal sketch of this pattern, combined with data virtualization, appears after this list).
  • Metadata management – A system for maintaining metadata across all classes of data services in order to facilitate distributed processing and data virtualization. The metadata can also be used to support data quality, data governance, and master data management.
  • Taxonomy/ontology resolution – A system for relating data asset taxonomy with use-case ontology in order to effectively combine data from multiple sources. The metadata derived from this process can help to locate data assets across the available data stores, as well as support auditing and service-level agreement (SLA) services.
  • Auditing and performance services – A system for collecting statistics about the performance of the other LDW components as well as for maintaining preferences from connected users and applications.
  • SLA management – A system for tracking the expectations of connected applications and users. The system monitors the SLA performance in relation to the auditing statistics and from there makes recommendations or automatically optimizes an operation.
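
To make the interplay between data virtualization and distributed processing a little more concrete, here is a minimal Python sketch of a federated query with pushdown. The source classes, the revenue-by-region query, and the Hadoop client are hypothetical assumptions for illustration, not any particular vendor’s API:

```python
from concurrent.futures import ThreadPoolExecutor

class RelationalSource:
    """A transactional database that aggregates its own sales locally."""
    def __init__(self, conn):
        self.conn = conn
    def revenue_by_region(self):
        # Pushdown: the database does the GROUP BY before returning anything.
        cur = self.conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region")
        return dict(cur.fetchall())

class HadoopSource:
    """A cluster that runs the same aggregation through a query engine."""
    def __init__(self, client):
        self.client = client  # hypothetical Hive/Spark client wrapper
    def revenue_by_region(self):
        # Pushdown: only the summarized result travels back over the wire.
        return self.client.run(
            "SELECT region, SUM(amount) FROM web_sales GROUP BY region")

def federated_revenue_by_region(sources):
    # Virtualization: callers see one logical answer, not N source systems.
    totals = {}
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(lambda s: s.revenue_by_region(), sources):
            for region, amount in partial.items():
                totals[region] = totals.get(region, 0) + amount
    return totals
```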

Although data virtualization and distributed processing are listed as separate components, the technologies are often merged. For example, Microsoft’s PolyBase, which has been added to SQL Server 2016, enables access to Hadoop clusters from within the database schema structure, providing a virtualized view of the data while pushing the processing to the cluster where the data resides.
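
As a rough illustration of what this looks like from the consumer’s side, the following Python sketch queries a hypothetical PolyBase external table (dbo.WebClicks) alongside an ordinary SQL Server table via pyodbc. It assumes the external table has already been defined over files in a Hadoop cluster with CREATE EXTERNAL TABLE; the connection string, table, and column names are placeholders:

```python
import pyodbc

# Placeholder connection string; PolyBase queries are just T-SQL to the client.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 13 for SQL Server};"
    "SERVER=myserver;DATABASE=Sales;Trusted_Connection=yes;")

sql = """
SELECT c.Region, COUNT(*) AS Clicks
FROM dbo.WebClicks AS w      -- hypothetical external table backed by Hadoop
JOIN dbo.Customers AS c      -- ordinary SQL Server table
  ON c.CustomerID = w.CustomerID
GROUP BY c.Region;
"""

# SQL Server decides how much of the work to push down to the cluster.
for region, clicks in conn.cursor().execute(sql):
    print(region, clicks)
```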

Then there’s Denodo, a full-fledged data virtualization solution that, like PolyBase, pushes the processing down to the system where the data resides, whether a transactional database, an EDW solution, or a Hadoop cluster.

As important as data virtualization and distributed processing are to the LDW, a complete implementation requires most, if not all, of the components identified by Gartner to support self-service BI, predictive analytics, and real-time decision-making across a wide spectrum of data sources. That’s not to say that the LDW must be implemented in a specific way, but rather that it needs these components to form a logical whole, no matter how they’re implemented.

The LDW strives to provide a single view of all the data, without having to move it from its original silo, making it possible to query one or more sources of data with the same ease you would expect when querying a relational database. Only by approaching data at the logical level can you achieve the type of flexibility and scalability required in this era of big data.

The LDW label

The challenge with a technology trend such as the LDW, and with the type of wide-net definition offered by Gartner, is that the concept can cause a fair amount of confusion. Like the IoT, the LDW begs for a concise definition that the uninitiated can point to for a clear picture of what an implemented LDW structure might look like. Is it a product such as SQL Server? A cloud service such as Salesforce? Or is it more like a database abstraction layer or virtualization technology? The latter is a tempting comparison because the LDW is often referred to as a virtual data warehouse (VDW), although it’s also been labeled a data layer and a data lake, along with a number of other names.

But it’s the VDW label that’s particularly problematic because of its somewhat checkered past. The VDW has actually been around for a while, and like the LDW, the VDW came with a trunkful of promises that it would at long last fulfill the EDW mission of unifying data into a common, though virtual, repository.

Unlike the LDW, however, the VDW was concerned primarily with relational databases, rather than boatloads of big data silos. By stringing together multiple databases, the VDW promised quick and easy implementation, without worrying about all those irksome integration details that the conventional EDW required. The data remained in its respective store, disparate applications could be virtually tethered, and the resource-intensive implementation associated with the typical EDW could be avoided.

Unfortunately, the VDW came with its own assortment of challenges, with performance proving a particularly vulnerable casualty. Imagine a query attempting to access multiple databases simultaneously. Response times could vary. Caching could be inconsistent. And one down system could bring the entire operation to a halt.

Perhaps an even bigger issue was that, despite the hype, the VDW approach failed to address one of the biggest challenges of the traditional EDW: the need to cleanse all that data. Syncing multiple databases, each with its own version of the truth, could turn the most rudimentary query into a quagmire of unpredictable and unreliable analytics. At some point, the data needs to be cleansed to derive meaningful information, regardless of where that happens.

There are plenty of other issues with the VDW, of course, but the point is that we should be cautious about imposing the VDW label on the LDW, and we should hope that the LDW can avoid all those VDW pitfalls.

But the LDW has also been compared to the data lake, a repository for storing massive amounts of unstructured data, usually within a Hadoop infrastructure. A data lake can support all types of data and has the capacity to transform it and define data structures as they’re needed. Google and Yahoo were at the forefront of the data lake movement, but since then even Microsoft has joined the fun with its Azure Data Lake service, now in public preview. According to Microsoft, you’ll be able to use the service to store and analyze data of any type or size.

Along with other technologies, Azure Data Lake is built on Hadoop YARN (Yet Another Resource Negotiator), the cluster resource management layer of the Hadoop 2 framework. YARN decouples resource management from the MapReduce processing engine, allowing multiple third-party engines to use Hadoop as a common standard for accessing data while taking advantage of Hadoop’s linear-scale storage and processing.
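
As a small illustration of that shared-cluster model, the sketch below queries the YARN ResourceManager’s Cluster Applications REST API (on its default web port) to list every application currently running on the cluster, whichever engine submitted it. The hostname is a placeholder:

```python
import json
from urllib.request import urlopen

# Placeholder ResourceManager host; 8088 is the default web UI/REST port.
RM_APPS_URL = "http://resourcemanager.example.com:8088/ws/v1/cluster/apps"

with urlopen(RM_APPS_URL) as response:
    payload = json.load(response)

# One cluster, many engines: applicationType distinguishes MapReduce,
# Spark, Tez, and other frameworks sharing the same YARN resources.
for app in (payload.get("apps") or {}).get("app", []):
    print(app["applicationType"], app["name"], app["state"])
```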

The data lake can be particularly useful for handling the distributed processing component of the LDW. In fact, because the data lake can process data so efficiently and at such a relatively low cost, the LDW platform can push much of its data cleansing and transformation work to the data lake, even for data that resides in the EDW. Such operations would, of course, have to be weighed against the cost of moving the data, but the potential benefit is certainly there.
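
What such a pushed-down cleansing job might look like is sketched below in PySpark, running against files already in the lake; the paths, column names, and cleansing rules are hypothetical, chosen only to illustrate the pattern:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-cleansing").getOrCreate()

# Raw events land in the lake as JSON files (hypothetical path).
raw = spark.read.json("/datalake/raw/clickstream/")

curated = (
    raw.dropDuplicates(["event_id"])                              # remove duplicate events
       .filter(F.col("user_id").isNotNull())                      # reject rows missing a key
       .withColumn("event_date", F.to_date(F.col("event_time")))  # conform the date format
)

# Write the cleansed, conformed data back for the warehouse or BI tools.
curated.write.mode("overwrite").parquet("/datalake/curated/clickstream/")
```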

Hadoop 2 and the YARN framework make it possible to access, process, and federate data more efficiently than ever, yet the data lake is not an LDW, nor is it a replacement for the EDW. A data lake can be and often is part of a larger LDW solution, but the data lake is only one component, though a significant one at that.

That said, differentiating the LDW from the VDW or the data lake does not give us a concrete picture of the LDW. In fact, arriving at that picture is no easy task because the LDW, when taken as a whole, is as much a concept or effort as it is a physical implementation. There is good reason that the word logical plays so prominently in the LDW name.

Perhaps the best way to view the LDW is as a logical structure that’s defined by the sum of its parts, with those parts being the EDW, cloud services, Hadoop clusters, data lakes, and other elements, some of which include their own capacity to virtualize data and distribute processing. However, these elements alone don’t necessarily complete the LDW puzzle. For that, we might need to turn to additional products.

For example, ThoughtWeb now offers Enterprise Analytics Studio, a software solution for centrally managing, designing, and building the enterprise LDW. The solution can leverage both structured and unstructured data, organize and transform that data, and handle SLA management as well as taxonomy/ontology resolution.

MarkLogic also provides an LDW solution, which it bills as a searchable enterprise data layer that presents a unified view of various data silos. The MarkLogic solution includes a NoSQL database, a metadata catalog and repository, web services, and tools for connecting to remote data sources. It can also ingest high volumes of data, transform and aggregate data, and deliver it to multiple applications.

Even Cisco has gotten into the game with its Data Virtualization Platform, which reportedly enables every LDW component, including repository management, distributed processing, and of course data virtualization.

Despite how complete such solutions might seem, they are not in themselves the entire LDW platform, but rather the components that power the system in order to make all the data play nicely together. Nor should any one of these solutions suggest an absolute view of what the LDW picture should look like. There is no one architecture that defines how the LDW should be put together. It is changeable and adaptable and malleable, essential ingredients in the big data soup.

The world of big data

As technologies such as Hadoop’s YARN, Microsoft’s PolyBase, and the Denodo data virtualization platform continue to proliferate, along with solutions such as those coming out of Cisco, ThoughtWeb, and MarkLogic, the ability to knit disparate systems together into an LDW platform will continue to grow. Indeed, with more data than ever looming on the horizon, what choice do we have but to virtualize and distribute and orchestrate from a logical platform?

Still, the LDW brings with it a number of questions that must be answered. How do we ensure that the data is secure, with access properly controlled? How do we handle the historical data necessary for long-range analysis? What about privacy, compliance, and regulatory issues? And how do we deal with the data inconsistencies that exist among the silos? Do we disregard data quality altogether?

Until these questions are answered, the LDW could risk the same fate as the VDW. However, if these issues can be satisfactorily addressed, without bringing performance to its knees, the LDW promises to be an important tool for enterprises trying to get a grip on the influx of big data. The question then is what to do with all this new-found information that’s been magically dropped at our doorsteps.