Azure Data Lakes

A data lake is a large repository that holds data for 'big data' analytic workloads in its original format. Azure Data Lake adds Data Lake Analytics and Azure HDInsight on top of such a store. Although the tools for big data analysis are there, the platform will require new skills and a heightened attention to data governance if it is to appeal to the average enterprise.

In April 2015, Microsoft announced the forthcoming Azure Data Lake service, an enterprise-wide repository that can store data in any format, regardless of size or structure. Since then, Microsoft has upped the ante on its data lake offering, moving beyond a basic storage service to a fully-realized platform that supports distributed analytics and HDInsight clusters, along with a secure and scalable store.

The data lake is a relatively new concept in big data circles. At its most basic, the data lake is a large storage repository – usually based on Apache Hadoop – for holding massive quantities of data in its original format. The data can come from any number of heterogeneous sources and be ingested into the repository in its raw form, without the overhead of transforming or micro-managing the data.

The data lake serves as an alternative to multiple information silos typical of enterprise environments. The data lake does not care where the data came from or how it was used. It is indifferent to data quality or integrity. It is concerned only with providing a common repository from which to perform in-depth analytics. Only then is any sort of structure imposed upon the data.

As the popularity of the data lake grows, so too does the number of vendors jumping into data lake waters, each bringing its own idea of what a data lake entails. While any data lake solution will have at its core a massive repository, some vendors also roll in an analytics component or two, which is exactly what Microsoft is planning to do. As the following figure shows, the Azure Data Lake platform comprises three primary services: Data Lake Store, Data Lake Analytics, and Azure HDInsight.


Data Lake Store provides the repository necessary to persist the influx of data, and Data Lake Analytics offers a mechanism for picking apart that data. Both components are now in public preview. Microsoft has also rolled HDInsight into the Data Lake mix, a service that offers a wide range of Hadoop-based tools for additional analytic capabilities. To facilitate access between the storage and analytic layers, the Data Lake platform leverages Apache YARN (Yet Another Resource Negotiator) and WebHDFS-compatible REST APIs.

Azure Data Lake Store

Microsoft describes Data Lake Store as a “hyper-scale repository for big data analytic workloads,” a mouthful, to be sure, but descriptive nonetheless. The service will let you store data of any size, type, or ingestion speed, whether originating from social networks, relational databases, web-based applications, line-of-business (LOB) applications, mobile and desktop devices, or a variety of other sources. The repository provides unlimited storage without restricting file sizes or data volumes. An individual file can be petabytes in size, with no limit on how long you keep it there.

Data Lake Store uses a Hadoop file system to support compatibility with the Hadoop Distributed File System (HDFS), making the data store accessible to a wide range of Hadoop-friendly tools and services. Data Lake Store is already integrated with Data Lake Analytics and HDInsight, as well as Azure Data Factory; however, Microsoft also plans eventual integration with services such as Microsoft’s Revolution-R Enterprise, distributions from Hortonworks, Cloudera, and MapR, and Hadoop projects such as Spark, Storm, and HBase.

To protect the data, Microsoft makes redundant copies to ensure durability and promises enterprise-grade security, based on Azure Active Directory (AAD). The AAD service manages identity and access for all stored data, providing multifactor authentication, role-based access control, conditional access, and numerous other features.

To use AAD to protect data, you must first create AAD security groups in the Azure Portal to facilitate role-based access control. Next, you assign the security groups to the Data Lake Store account, which controls access to the repository for portal and management operations. The next step is to assign the security groups to the access control lists (ACLs) associated with the repository’s file system. Currently, you can assign access control permissions only at the repository level, but Microsoft plans to add folder- and file-level controls in a future release.

Data Lake Store supports POSIX-style permissions exposed through the WebHDFS-compatible REST APIs. The WebHDFS protocol makes it possible to support all HDFS operations, not only read and write, but also such operations as accessing block locations and configuring replication factors. In addition, WebHDFS can use the full bandwidth of the Hadoop cluster for streaming data.
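As a sketch of what those REST calls look like, the following Python builds WebHDFS-style URLs for hypothetical store paths. The account name and the azuredatalakestore.net endpoint are illustrative assumptions; the /webhdfs/v1/&lt;path&gt;?op=&lt;OPERATION&gt; pattern follows the WebHDFS convention.

```python
# Sketch: forming WebHDFS-style REST URLs against a Data Lake Store account.
# The account name "mystore" and the endpoint pattern are illustrative
# assumptions, not a documented contract.

def webhdfs_url(account: str, path: str, operation: str) -> str:
    """Build the REST URL for a single WebHDFS operation."""
    host = f"https://{account}.azuredatalakestore.net"
    return f"{host}/webhdfs/v1/{path.lstrip('/')}?op={operation}"

# A read-side operation (list a directory) and a metadata operation:
list_url = webhdfs_url("mystore", "/clickstream/2015", "LISTSTATUS")
stat_url = webhdfs_url("mystore", "/clickstream/2015/day01.csv", "GETFILESTATUS")
```

In a real client, these URLs would be issued as authenticated HTTP requests; the point here is only the shape of the protocol.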

Data Lake Store also implements a new file system, AzureDataLakeFilesystem (adl://), for directly accessing the repository. Applications and services capable of using the file system can realize additional flexibility and performance gains over WebHDFS. Systems not compatible with the new file system can continue to use the WebHDFS-compatible APIs.
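To illustrate, an adl:// URI is an ordinary hierarchical URI and can be taken apart with standard tooling; the account name and path below are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical adl:// URI addressing a file in the repository.
uri = "adl://mystore.azuredatalakestore.net/clickstream/2015/day01.csv"

parsed = urlparse(uri)
print(parsed.scheme)   # the adl scheme marks direct file-system access
print(parsed.netloc)   # the store account's endpoint
print(parsed.path)     # the file's path within the repository
```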

Azure Data Lake Analytics

Data Lake Analytics represents one of Microsoft’s newest cloud offerings, appearing on the scene only within the last couple of months. According to Microsoft, the company built Data Lake Analytics from the ground up with scalability and performance in mind. The service provides a distributed infrastructure that can dynamically allocate or de-allocate resources so customers pay for only the services they use.

As with similar cloud platforms, Data Lake Analytics users can focus on the business logic, rather than on the logistics of how to implement systems and process large data sets. The service handles all the complex management tasks so customers can develop and execute their solutions without worrying about deploying or maintaining the infrastructure to support them.

Data Lake Analytics is also integrated with AAD, making it easier to manage users and permissions, and with Visual Studio, providing developers with a familiar environment for creating analytic solutions.

A solution built for the Data Lake Analytics service is made up of one or more jobs that define the business logic. A job can reference data within Data Lake Store or Azure Blob storage, impose a structure on that data, and process the data in various ways. When a job is submitted to Data Lake Analytics, the service accesses the source data, carries out the defined operations, and outputs the results to Data Lake Store or Blob storage.

Azure Data Lake provides several options for submitting jobs to Data Lake Analytics:

  • Use Azure Data Lake Tools in Visual Studio to submit jobs directly.
  • Use the Azure Portal to submit jobs via the Data Lake Analytics account.
  • Use the Data Lake SDK job submission API to submit jobs programmatically.
  • Use the job submission command available through the Azure PowerShell extensions to submit jobs programmatically.
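As a rough sketch of the programmatic route, the following Python assembles the kind of request a job submission call might carry. The endpoint pattern and payload fields here are illustrative assumptions rather than the documented API contract, which the Data Lake SDK wraps for you.

```python
import json
import uuid

# Illustrative sketch only: the azuredatalakeanalytics.net endpoint pattern
# and the payload fields below are assumptions for demonstration.

def build_job_request(account: str, job_name: str, script: str):
    """Assemble a (url, json_body) pair for submitting a U-SQL job."""
    job_id = str(uuid.uuid4())
    url = f"https://{account}.azuredatalakeanalytics.net/jobs/{job_id}"
    body = json.dumps({
        "jobId": job_id,
        "name": job_name,
        "type": "USql",
        "properties": {"script": script},
    })
    return url, body

url, body = build_job_request(
    "myanalytics", "DailyClicks",
    'OUTPUT @rows TO "/out.csv" USING Outputters.Csv();')
```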

The U-SQL difference

A job is actually a U-SQL script that instructs the service on how to process the data. U-SQL is a new language that Microsoft developed for writing scalable, distributed queries that analyze data. An important function of the language is its ability to process unstructured data by applying schema on read logic, which imposes a structure on the data as you retrieve it from its source. The U-SQL language also lets you insert custom logic and user-defined functions into your scripts, as well as provides fine-grained control over how to run a job at scale.

U-SQL evolved out of Microsoft’s internal big data language SCOPE (Structured Computations Optimized for Parallel Execution), a SQL-like language that supports set-oriented record and column manipulation. U-SQL is a type of hybrid language that combines the declarative capabilities of SQL with the extensibility and programmability of C#. The language also incorporates big data processing concepts such as custom reducers and processors, as well as schema on reads.

Not surprisingly, U-SQL has its own peculiarities. Keywords such as SELECT must be all uppercase, and the expression language within clauses such as SELECT and WHERE uses C# syntax. For example, a WHERE clause equality operator takes two equal signs, and a string value is enclosed in double quotes, as in WHERE Veggie == "tiger nut".
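Pulled together, a minimal U-SQL script might look like the following sketch; the file paths and column names are hypothetical, but the EXTRACT/SELECT/OUTPUT pattern and the C#-style expressions are the conventions at work.

```
// Hypothetical input file and columns; schema is imposed on read.
@veggies =
    EXTRACT Veggie string,
            Price decimal
    FROM "/input/veggies.csv"
    USING Extractors.Csv();

// C# expression syntax: == for equality, double-quoted strings.
@matches =
    SELECT Veggie,
           Price
    FROM @veggies
    WHERE Veggie == "tiger nut";

OUTPUT @matches
    TO "/output/matches.csv"
    USING Outputters.Csv();
```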

In addition, the U-SQL type system is based on C# types, providing tight integration with the C# language. You can use any C# type in a U-SQL expression. However, you can use only a subset of C# types to define rowset columns or certain other schema objects. The usable types are referred to as built-in U-SQL types and can be classified as simple built-in types or complex built-in types. The simple built-in types include your basic numeric, string, and temporal types, along with a few others, and the complex ones include map and array types.

You can also use C# to extend your U-SQL expressions. For example, you can add inline C# expressions to your script, which can be handy if you have a small set of C# methods you want to use to process scalar values. In addition, you can write user-defined functions, aggregators, and operators in C# assemblies, load the assemblies into the U-SQL metadata catalog, and then reference the assemblies within your U-SQL scripts.
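As a sketch, inline C# expressions sit directly inside the query; the rowset, columns, and assembly name here are hypothetical.

```
@veggies =
    EXTRACT Veggie string,
            Price decimal
    FROM "/input/veggies.csv"
    USING Extractors.Csv();

// Inline C# methods and the conditional operator process scalar values.
@tagged =
    SELECT Veggie.ToUpper() AS VeggieName,
           (Price > 5.0m ? "premium" : "standard") AS Tier
    FROM @veggies;

OUTPUT @tagged
    TO "/output/tagged.csv"
    USING Outputters.Csv();

// Larger helpers live in C# assemblies registered in the metadata catalog,
// then pulled in with: REFERENCE ASSEMBLY MyHelpers;
```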

Data Lake Analytics executes a U-SQL job as a batch script, with data retrieved in a rowset format. If the source data are files, U-SQL schematizes the data on extract. However, the source data can also be U-SQL tables or tables in other data sources, such as Azure SQL Database, in which case it does not need to be schematized. In addition, you can define a U-SQL job to transform the data before storing it in a file or U-SQL table. The language also supports data definition statements such as CREATE TABLE so you can define metadata artifacts.
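For instance, a hypothetical DDL sketch might define a managed table up front; the database, table, and column names are invented for illustration.

```
CREATE DATABASE IF NOT EXISTS VeggieDb;
USE DATABASE VeggieDb;

// Hypothetical table; managed U-SQL tables declare a clustered index
// and a distribution scheme as part of the definition.
CREATE TABLE IF NOT EXISTS Veggies
(
    Veggie string,
    Price decimal,
    INDEX idx_veggie CLUSTERED (Veggie ASC)
        DISTRIBUTED BY HASH (Veggie)
);
```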

Any data that you process with a U-SQL script must be transformed into a rowset. From there, you can combine the data with other rowsets or use U-SQL query expressions to transform the data further.

Azure HDInsight

Although HDInsight is not a new cloud service, its integration into the Data Lake platform is a new development, one that coincided with the introduction of Data Lake Analytics. HDInsight is a fully managed Hadoop cluster service that supports a wide range of analytic engines, including Spark, Storm, and HBase. The service has been updated to take advantage of Data Lake Store in order to maximize security, scalability, and throughput. In addition, HDInsight now supports managed clusters on Linux, along with the Windows-based clusters that have been the service’s mainstay.

HDInsight provides a software framework for deploying and provisioning Hadoop clusters in the cloud. The framework uses the Hortonworks Data Platform (HDP) Hadoop distribution to manage, analyze, and report on big data, providing a highly available and reliable environment for running Hadoop components, including Pig, Hive, Sqoop, Oozie, Ambari, Mahout, Tez, and ZooKeeper. In addition, HDInsight can integrate with BI tools such as Excel, SQL Server Analysis Services, and SQL Server Reporting Services.

HDInsight also provides cluster configurations for the following services:

  • Hadoop: HDFS data storage with a simple MapReduce programming model that supports parallel processing and analytics.
  • HBase: NoSQL database built on Hadoop, providing random access and strong consistency for large sets of structured and semi-structured data.
  • Storm: Distributed, real-time computational service for quickly processing data streams.

HBase is modeled after Google Bigtable and provides random access to large quantities of data while maintaining strong consistency. HBase scales linearly to handle petabytes of data across thousands of nodes and can integrate with other systems to provide batch processing, data redundancy, and other services.

Microsoft implements HBase as a managed cluster integrated into the Azure environment, making it possible for HDInsight to leverage the HBase scale-out architecture for automated table sharding and failover. HBase uses enhanced in-memory caching for reads and writes.

With Apache Storm, you can create distributed analytic solutions. Storm provides a fault-tolerant, open-source computational system that supports real-time data processing as well as the ability to replay data that was not successfully processed the first time around. Storm runs as a managed service that can potentially deliver 99.9% uptime. The service supports a mix of programming languages, including Java, C#, and Python, and can be integrated with other Azure services, such as Event Hubs, SQL Database, Blob storage, and DocumentDB.

HDInsight also supports Apache Spark, an open-source framework that offers in-memory parallel processing. The Spark processing engine provides sophisticated big data analytics through its in-memory computational capabilities. However, you can cache data either in-memory or in SSDs attached to the cluster nodes.

Spark includes connectors for Event Hubs and BI tools such as Power BI and Tableau and comes with Anaconda libraries preinstalled. The service is particularly suited to interactive data analysis, iterative machine learning, and real-time analytics, whether working with structured, semi-structured, or unstructured data, and it can scale from terabytes to petabytes on demand.

Apache YARN

If Azure Data Lake manages to deliver on its promises, much of the success will be attributable to Apache YARN, a relatively new application management framework for processing data in Hadoop clusters, such as those in Data Lake Store. YARN has been instrumental in transforming Hadoop from a single-use batch processing platform to a multi-use system that not only supports batch processing, but also stream, online, and interactive SQL processing, while improving scalability and cluster utilization.

By all accounts, YARN is capable of providing Azure Data Lake with the resource management capabilities necessary to deliver consistent operations and data security and governance across the Hadoop clusters in the Data Lake Store. YARN can dynamically allocate resources for clusters that expand to thousands of nodes and manage petabytes of data.

YARN includes three major components for managing Hadoop resources:

  • ResourceManager: Provides distributed management for applications running on the Hadoop clusters and serves as the ultimate resource arbitrator. ResourceManager includes a scheduler that allocates resources to the various applications.
  • NodeManager: Per-machine agent that launches application containers, monitors their resource usage, and reports usage information to ResourceManager.
  • ApplicationMaster: Framework-specific entity that negotiates resources for containers, tracks their status, and monitors their progress. ApplicationMaster also works with NodeManager components to run and monitor tasks.

Data Lake Analytics and HDInsight both use YARN for resource management, allowing multiple analytic engines to run side-by-side. For example, U-SQL can run alongside Spark and Storm, all three processing data that resides in the store. YARN brings with it technologies such as resource reservation and work-conserving preemption, with additional tools for addressing resource allocation and utilization.

The data lake promise

With Azure Data Lake, Microsoft has thrown its weight fully behind the recent data lake frenzy. The three primary components (Data Lake Store, Data Lake Analytics, and HDInsight) represent a new wave of big data storage and analytic solutions. Of course, the only proven system in all this is HDInsight, and even then, only on Windows. Data Lake Store, Data Lake Analytics, the U-SQL language, and even YARN are all too new to say with any certainty that such a system will work as Microsoft promises.

Like many big data trends, the devil is indeed in the details. The data lake offers a one-size-fits-all solution for dumping in data from wherever it might originate and in whatever raw form it happens to be. Although the data lake can deliver an organization from its multitudes of information silos, there is nothing to prevent it from becoming little more than a data dumping ground.

Under such circumstances, issues such as data quality and governance can easily take a back seat. Sure, Azure Data Lake promises enterprise-grade security and eventual file-level access control, but is that enough to protect privacy and guarantee security across such a vast store of varied data? Great care will have to be taken to ensure full and comprehensive governance. Until that can be assured, organizations will have to pick wisely what data they dump into their lakes.

Still, when it comes to Azure Data Lake, Microsoft appears to be pulling out the big data guns. Together, Data Lake Analytics and HDInsight will offer analysts and data scientists a wide range of tools for delving into an almost unlimited amount of data. Yet it’s too soon to tell how well a single storage structure will work for multiple giant analytic projects, when compared to working on systems built for specific purposes. There is also the question of how many individuals within an organization will have the skills necessary to utilize such vast stores of unstructured and semi-structured data. And how will multiple projects be managed to avoid costly duplicate analytic efforts?

Despite these concerns, Microsoft is demonstrating a serious commitment to solving the big data dilemma, and Azure Data Lake could prove a valuable resource for making sense of this information storm. A year from now, we’ll have a much better sense of whether Microsoft’s data lake effort can deliver on its promises. Of course, a year from now, the idea of a data lake might be as dried up as the Microsoft Zune.