Microsoft Azure Stream Analytics

Azure Stream Analytics aims to extract knowledge structures from continuous ordered streams of data by real-time analysis. These streams might include computer network traffic, social network data, phone conversations, sensor readings, ATM transactions or web searches. It provides a ready-made solution to the business requirement to react very quickly to changes in data and handle large volumes of information. Robert Sheldon explains its significance.

For many organizations, capturing and storing event data for later analysis is no longer enough. They’re looking for ways to better utilize the vast amounts of data pouring in from devices and systems across the cyber landscape. Simply beefing up their current systems is not the answer. They want stream-processing solutions that can support real-time analytics.

An organization willing to invest the resources necessary to build such a solution can potentially save over the long term by reducing repetitive tasks, supporting operational systems more effectively, and having access to critical information for dynamic strategizing. With the Internet of Things (IoT) looming large, the desire for real-time analytics grows in direct proportion to the expected influx of streaming data.

But implementing a real-time stream analytics solution is no small task. It must be scalable and fault tolerant, while ensuring low latency and high availability. Above all, the system must make the data readily accessible so it can be correlated, filtered, and aggregated in order to provide meaningful insights into the abundance of information.

To ease the complexities of implementing such a system, Microsoft now offers Azure Stream Analytics, a fully managed cloud service that provides complex event processing over streaming data. The service can handle millions of events per second while correlating them across multiple streams. Because Stream Analytics is a cloud service, an organization can implement a real-time stream processing solution with relative ease and little in the way of upfront costs. But it must be done in conjunction with other Azure services. Only by implementing them together can an organization realize a complete stream analytics solution in the Azure cloud.

What is Stream Analytics?

Stream Analytics is an event processing engine that can ingest events in real time, whether from one data stream or multiple streams. Events can come from sensors, applications, devices, operational systems, websites, and a variety of other sources. Just about anything that can generate event data is fair game.

Stream Analytics provides high-throughput, low-latency processing, while supporting real-time stream computation operations. With a Stream Analytics solution, organizations can gain immediate insights into real-time data as well as detect anomalies in the data, set up alerts to be triggered under specific conditions, and make the data available to other applications and services for presentation or further analysis. Stream Analytics can also incorporate historical or reference data into the real-time streams to further enrich the information and derive better analytics.

Stream Analytics is built on a pull-based communication model that utilizes adaptive caching with configured size limits and timeouts. The service also adheres to a client-anchor model that provides built-in recovery and check-pointing capabilities. In addition, the service can persist data to protect against node or downstream failure.

To implement a streaming pipeline, developers create one or more jobs that define a stream’s inputs and outputs. The jobs also incorporate SQL-like queries that determine how the data should be transformed. In addition, developers can adjust a number of a job’s settings. For example, they can control when the job should start producing result output, how to handle events that do not arrive sequentially, and what to do when a partition lags behind others or does not contain data. Once a job is implemented, administrators can view the job’s status via the Azure portal.
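To make the out-of-order setting concrete, the following is a minimal Python sketch of one possible policy: buffer arriving events for a configurable tolerance interval, emit them in timestamp order, and drop events that arrive later than the tolerance allows. The function name and the drop-late behavior are illustrative assumptions, not Stream Analytics' actual implementation.

```python
def reorder(events, tolerance):
    """Emit (timestamp, payload) events in timestamp order, buffering up
    to `tolerance` time units to absorb out-of-order arrival. Events that
    show up later than the tolerance allows are dropped (one possible
    late-arrival policy; adjusting their timestamps is another)."""
    buffer, emitted = [], []
    watermark = None
    for ts, payload in events:
        watermark = ts if watermark is None else max(watermark, ts)
        if ts < watermark - tolerance:
            continue  # arrived too late: drop it
        buffer.append((ts, payload))
        buffer.sort()
        # anything older than the watermark minus the tolerance is safe to emit
        while buffer and buffer[0][0] < watermark - tolerance:
            emitted.append(buffer.pop(0))
    emitted.extend(sorted(buffer))  # end of stream: flush what remains
    return emitted

stream = [(1, "a"), (3, "c"), (2, "b"), (10, "e"), (4, "d")]
print(reorder(stream, tolerance=5))  # (4, "d") arrives beyond the tolerance
```

Note that (2, "b") arrives after (3, "c") but is still emitted in order, because it falls within the tolerance; (4, "d") arrives after the watermark has advanced to 10 and is discarded.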

Stream Analytics supports two input types, stream data and reference data, and two source types, Azure Event Hubs and Azure Blob storage. Event Hubs is a publish-subscribe data integration service that can consume large volumes of events from a wide range of sources. Blob storage is a data service for storing and retrieving binary large object (BLOB) files. The following table shows the types of data that Stream Analytics can handle and the supported sources and input formats for each.

| Input type | Supported sources | Supported formats | Size limits |
| --- | --- | --- | --- |
| Stream | Event Hubs, Blob storage | JSON, CSV, Avro | N/A |
| Reference | Blob storage | JSON, CSV | 50 MB |

A Stream Analytics job must include at least one stream input type. If Blob storage is used, the file must contain all events before they can be streamed to Stream Analytics. The file is also limited to a maximum size of 50 MB. In this sense, the stream is historical in nature, no matter how recently the file was created. Only Event Hubs can deliver real-time event streams.

Reference data is optional in a Stream Analytics job and can come only from Blob storage. Reference data can be useful for performing lookups or correlating data in multiple streams.

Once a job has the input it needs, the data can be transformed. To facilitate these transformations, Stream Analytics supports a declarative SQL-like language. The language includes a range of specialized functions and operators that let developers implement everything from simple filters to complex aggregations across correlated streams. The language’s SQL-like nature makes it relatively easy for developers to transform data without having to dig into the technical complexities of the underlying system.
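As a rough illustration of what such a query computes, the Python sketch below reproduces the shape of a windowed aggregation: grouping events into fixed, non-overlapping time windows and averaging a value per device, much as an ASA query might with a GROUP BY over a tumbling window. The field names and window size are invented for the example; a real job would express this in the SQL-like language, not Python.

```python
from collections import defaultdict

def tumbling_average(events, window_seconds):
    """Group (timestamp, device_id, temperature) events into fixed,
    non-overlapping windows of `window_seconds` and average the
    temperature per device within each window."""
    sums = defaultdict(lambda: [0.0, 0])
    for ts, device, temp in events:
        window_start = ts - ts % window_seconds  # align to window boundary
        key = (window_start, device)
        sums[key][0] += temp
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sorted(sums.items())}

events = [
    (0, "dev1", 20.0), (3, "dev1", 22.0),   # window [0, 5)
    (6, "dev1", 30.0), (7, "dev2", 18.0),   # window [5, 10)
]
print(tumbling_average(events, window_seconds=5))
# {(0, 'dev1'): 21.0, (5, 'dev1'): 30.0, (5, 'dev2'): 18.0}
```

The appeal of the declarative language is that the developer states only the grouping and the aggregate; the service handles the buffering, windowing, and distribution that this sketch glosses over.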

The last piece of the job puzzle is the stream output. Stream Analytics can write the query results (the transformed data) to Azure SQL Database or Blob storage. SQL Database can be useful if the data is relational in nature or supports applications that require database hosting. Blob storage is a good choice for long-term archiving or later processing. A Stream Analytics job can also send data back to Event Hubs to support other streaming pipelines and applications.

According to Microsoft, Stream Analytics can scale to any volume of data, while still achieving high throughput and low latency. An organization can start with a system that supports only a few kilobytes of data per second and scale up to gigabytes per second as needed. Stream Analytics can also leverage the partitioning capabilities of Event Hubs. In addition, administrators can specify how much compute power to dedicate to each step of the pipeline in order to achieve the most efficient and cost-effective throughput.
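The partitioning idea is simple to sketch: events carry a partition key, and a deterministic function maps each key to one partition, so all events for a given key (say, a single device) land on the same partition and can be processed in order and in parallel. The hash choice below is an illustrative assumption; Event Hubs' actual key-to-partition mapping is internal to the service.

```python
import hashlib

def partition_for(key, partition_count):
    """Deterministically map a partition key to a partition index, so all
    events sharing a key land on the same partition. MD5 is used here only
    for a stable, platform-independent hash; it is not Event Hubs' algorithm."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % partition_count

keys = ["device-1", "device-2", "device-1"]
assignments = [partition_for(k, 4) for k in keys]
assert assignments[0] == assignments[2]  # same key, same partition
print(assignments)
```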

The Azure real-time analytics stack

Stream Analytics was designed to work in conjunction with other Azure services. Data flowing into and out of a Stream Analytics job must pass through those services. The following diagram provides a conceptual overview of how the Azure layers fit together and how data flows through them to provide a complete stream analytics solution.

[Diagram: the Azure real-time analytics stack, from data sources through Event Hubs and Blob storage, to Stream Analytics, to SQL Database and Blob storage, and on to presentation and consumption]

The top layer shown in the figure represents the starting point. These are the data sources that generate the event data. The data can come from just about anywhere, whether a piece of equipment, mobile device, cloud service, ATM, aircraft, or oil platform. Any device, sensor, or operation that can transmit event data qualifies. The data source might connect directly to Event Hubs or Blob storage or go through a gateway that connects to Azure.

Event Hubs can ingest and integrate millions of events per second. The events can be in various formats and stream in at different velocities. Event Hubs persists the events for a configurable period of time, allowing the events to support multiple Stream Analytics jobs or other operations. Blob storage can also store event data and make it available to Stream Analytics for operations that rely on historical data. In addition, Blob storage can provide reference data to support operations such as correlating multiple event streams.

The next layer in the Azure stack is where the actual analytics occur. Stream Analytics provides built-in integration with Event Hubs to support seamless, real-time analytics and with Blob storage to facilitate access to historical event data and reference data.

In addition to Stream Analytics, Azure provides Machine Learning, a predictive analytics service for mining data and identifying trends and patterns across large data sets. After analyzing the data, Machine Learning can publish a model that can then be used to generate real-time predictions based on incoming event data in Stream Analytics.

Also at the analytics layer is HDInsight Storm, an engine similar to Stream Analytics. However, unlike Stream Analytics, Storm runs on dedicated HDInsight clusters and supports a more diversified set of languages. Stream Analytics provides a built-in, multi-tenant environment and supports only its SQL-like language. In general, Stream Analytics is more limited in scope but makes it easier for an organization to get started. Storm can ingest data from more services and is more expansive in scope, but requires more effort. This, of course, is just a basic overview of the differences between the two services, so be sure to check Microsoft resources for more information.

From the analytics layer, we move to what is primarily the storage layer, where data can be persisted for presentation or made available for further consumption. As noted earlier, Stream Analytics can send data to SQL Database or Blob storage. SQL Database is a managed database as a service (DBaaS) and can be a good choice when you require an interactive query response from the transformed data sets.

Stream Analytics can also persist data to Blob storage. From there, the data can again be processed as a series of events, used as part of an HDInsight solution, or made available to large-scale analytic operations such as machine learning. In addition, Stream Analytics can write data back to Event Hubs for consumption by other applications or services or to support additional Stream Analytics jobs.

The final layer in the analytics stack is presentation and consumption, which can include any number of tools and services. For example, Power Query in Excel includes a built-in native connection for accessing data directly from Blob storage, providing a self-service, in-memory environment for working with the processed event data. Another option is Power BI, which offers a rich set of self-service visualizations. With Power BI, users can interact directly with the processed event data in SQL Database. In addition, a wide range of applications and services can consume data from Blob storage, SQL Database, or Event Hubs, providing almost unlimited options for presenting the processed event data or consuming it for further analysis.

Putting Stream Analytics to Work

Stream Analytics, in conjunction with Event Hubs, provides the structure necessary to perform real-time stream analytics on large sets of data. It is not meant to replace batch-oriented services, but rather to offer a way to handle the growing influx of event data resulting from the expected IoT onslaught. Most organizations will still have a need for traditional transactional databases and data warehouses for some time to come.

Within the world of real-time analytics, the potential uses for services such as Stream Analytics are plentiful. The following table describes some of the possibilities.

| Usage | Description | Examples |
| --- | --- | --- |
| Connected devices | Monitor and diagnose real-time data from connected devices such as vehicles, buildings, or machinery in order to generate alerts, respond to events, or optimize operations. | Plan and schedule maintenance; coordinate vehicle usage and respond to changing traffic conditions; scale or repair systems. |
| Business operations | Analyze real-time data in dynamic environments in order to take immediate action. | Provide stock trade analytics and alerts; recalculate pricing based on changing trends; adjust inventory levels. |
| Fraud detection | Monitor financial transactions in real time to detect fraudulent activity. | Correlate a credit card's use across geographic locations; monitor the number of transactions on a single credit card. |
| Website analytics | Collect real-time metrics to gain immediate insight into a website's usage patterns or application performance. | Perform clickstream analytics; test site layout and application features; determine an ad campaign's impact; respond to degraded customer experience. |
| Customer dashboards | Provide real-time dashboards to customers so they can discover trends as they occur and be notified of events relevant to their operations. | Respond to a service, website, or application going down; view current user activity; view analytics based on data collected from devices or operations. |
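As a toy illustration of the fraud-detection scenario, the sketch below flags a card used in two different cities closer together in time than travel would plausibly allow. The field names, cities, and threshold are invented for the example; a production job would express this kind of geographic correlation as a streaming query over live transaction feeds.

```python
def flag_impossible_travel(transactions, min_gap_minutes=60):
    """Flag cards seen in two different cities with less than
    `min_gap_minutes` between transactions: a crude stand-in for the
    geographic correlation a real-time fraud query would perform."""
    last_seen = {}  # card -> (minute, city)
    alerts = []
    for minute, card, city in sorted(transactions):
        if card in last_seen:
            prev_minute, prev_city = last_seen[card]
            if city != prev_city and minute - prev_minute < min_gap_minutes:
                alerts.append((card, prev_city, city))
        last_seen[card] = (minute, city)
    return alerts

txns = [
    (0, "card-42", "London"),
    (10, "card-42", "Tokyo"),   # 10 minutes later, different city
    (15, "card-7", "Paris"),
]
print(flag_impossible_travel(txns))  # [('card-42', 'London', 'Tokyo')]
```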

These scenarios are, of course, only a sampling of the ways an organization can reap the benefits of stream-processing analytics. Services such as Stream Analytics can also translate into significant savings, depending on the organization and type of implementation.

Currently, Microsoft prices Stream Analytics by the volume of data processed and the number of streaming units used to process it, at a per-hour rate. A streaming unit is a unit of compute capacity (a blend of CPU, memory, and throughput), with a maximum throughput of 1 MB/s. Stream Analytics imposes a default quota of 12 streaming units per region and charges no start-up or termination fees. Customers pay only for what they use, based on the following pricing structure:

  • Data volume: $0.001/GB
  • Streaming unit: $0.031/hour
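Since billing has only these two dimensions, a rough monthly estimate is simple arithmetic. The sketch below uses the rates quoted above, which reflect pricing at the time of writing; check the current Azure pricing page before relying on the numbers.

```python
def monthly_cost(gb_per_hour, streaming_units,
                 data_rate_per_gb=0.001, unit_rate_per_hour=0.031,
                 hours=730):
    """Estimate a month of Stream Analytics charges from the two billed
    dimensions: data volume processed and streaming-unit hours. Default
    rates are those quoted in the article, not necessarily current."""
    data_cost = gb_per_hour * hours * data_rate_per_gb
    compute_cost = streaming_units * hours * unit_rate_per_hour
    return round(data_cost + compute_cost, 2)

# 2 GB/hour of data through 3 streaming units for a 730-hour month:
print(monthly_cost(gb_per_hour=2, streaming_units=3))
```

Notice that at these rates the streaming-unit hours, not the data volume, dominate the bill for most workloads.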

The Stream Analytics pricing structure makes it possible for organizations large and small to spin up analytics operations in no time at all. Be aware, however, that there is more to cost than just the Stream Analytics subscription rate. Stream Analytics cannot operate in a vacuum and requires other Azure services. In addition, an organization will still need to invest in the resources necessary to set up the services, write the queries, implement the mechanisms to get the data from the devices to Azure, and take a number of other steps to implement the solution. That said, when compared to setting up such a solution in-house, the Azure approach could still end up a lot cheaper, at least in the short term. Be sure to project expenses over the long haul when doing your cost analysis, which is no easy task in the age of IoT.

Despite the issue of costs, Stream Analytics represents a larger trend that’s occurring across the industry. The plethora of data that IoT-driven economics promises to deliver is massive and important enough for companies such as Microsoft to start offering ways to handle all that data, a part of the IoT equation that until recently has remained missing. Whether services such as Stream Analytics will be able to meet the perceived demands is yet to be seen. Also yet to be seen is whether all the hype surrounding real-time streaming analytics will pay off. In certain sectors, it could prove a game changer. For others, it may turn out to be simply another headache for IT.