Real-time data ingestion has become essential for modern analytics and operational intelligence. Organizations across industries need to process data streams from IoT sensors, financial transactions, and application events with minimal latency. Snowflake offers two robust approaches to meet these real-time data needs: Snowpipe for near-real-time file-based streaming and Direct Streaming via Snowpark API for true real-time data integration.
This guide explores both options in depth, providing detailed implementations with explanation of code parameters, performance comparisons, and practical recommendations to help you choose the right approach for your specific use case.
Understanding Snowflake’s Real-time Options
Snowflake provides two distinct technologies for real-time data processing, each with its own strengths and optimal use cases. This section introduces Snowpipe and Snowpark and explains how each is used to load real-time data into Snowflake.
Snowpipe: Near-Real-Time File-Based Streaming
Snowpipe is Snowflake’s serverless solution for automating data loading from cloud storage. It continuously monitors designated external stages (cloud storage locations) and loads new files as they arrive without manual intervention.
Key characteristics of Snowpipe include:
- Serverless architecture: Snowpipe operates without requiring warehouse resources, making it highly scalable and cost-effective.
- Low latency: Typically processes files within 1-2 minutes after they arrive in the external staging location. This latency primarily depends on file detection time and processing overhead rather than file size.
- Cost model: You pay only for the data you load on a per-file basis, eliminating idle compute costs.
- Best suited for: Micro-batch processing scenarios where files of moderate size (typically under 100MB) arrive continuously, and near-real-time ingestion (within minutes) is acceptable.
Snowpark API Direct Streaming: True Real-Time API Integration
For use cases requiring true real-time data processing, Snowflake offers Direct Streaming through the Snowpark API. This approach bypasses file staging altogether and writes data directly into Snowflake tables.
Key characteristics of Direct Streaming include:
- API-based integration: Leverages the Snowpark API to stream data directly from applications without intermediate file storage.
- Very low latency: Achieves sub-second to seconds latency, making it suitable for time-sensitive applications.
- Cost model: Based on compute resources used during ingestion, requiring an active warehouse.
- Best suited for: Direct application integration where immediate data visibility is critical.
Common Real-Time Use Cases and Recommended Tools
Different business scenarios have varying latency and throughput requirements. Understanding these needs helps determine which Snowflake streaming option is most appropriate.
IoT Sensor Data
IoT deployments can generate massive volumes of time-series data from distributed sensors. For example, manufacturing equipment might report temperature, vibration, and pressure readings every few seconds.
- Data characteristics: High-frequency sensor readings, often in JSON format with nested structures containing sensor IDs, timestamps, and multiple reading values.
- Latency requirements: Often tolerant of 1-2 minute latency for general monitoring, but may need second-level responsiveness for critical alerts.
- Volume considerations: Can range from kilobytes to gigabytes per minute depending on sensor count and sampling frequency.
- Recommendation: Snowpipe works well for most IoT scenarios where data is collected in small batch files. Direct Streaming is preferable when immediate alerting on sensor conditions is required.
Financial Transactions
Financial systems require rapid processing of transactions for fraud detection, real-time balance updates, and compliance monitoring.
- Data characteristics: Structured transaction records containing account information, amounts, timestamps, and metadata.
- Latency requirements: Often needs second-level or better latency for fraud detection and customer-facing applications.
- Volume considerations: Can fluctuate dramatically during peak hours versus off-hours.
- Recommendation: Direct Streaming via Snowpark is often preferred due to stricter latency requirements and the need for immediate visibility into transaction patterns.
Application Tracking Events
Modern applications generate streams of events tracking user interactions, system performance, and business processes.
- Data characteristics: Semi-structured event logs with event types, timestamps, user identifiers, and contextual information.
- Latency requirements: Varies widely—some operational dashboards can tolerate minute-level delays, while user experience optimization might need faster feedback.
- Volume considerations: Typically follows user activity patterns with predictable peaks.
- Recommendation: Either approach can work well depending on specific requirements. Snowpipe is simpler to implement for basic analytics, while Snowpark Direct Streaming enables real-time user experience optimization.
Snowpipe Implementation
In this section we will look at the steps to implement a Snowpipe solution for ingesting IoT sensor data arriving as JSON files in a cloud storage bucket. We’ll walk through each step of the process with detailed explanations of the code, parameters, and configurations.
Setting Up Your Target Table
First, we need to create a table to store our sensor data. The sensor data will arrive as JSON files, each containing one or more readings.
The table structure needs to accommodate both the core sensor readings and metadata about the ingestion process. That metadata makes debugging much easier if more than a few invalid readings show up:
```sql
CREATE OR REPLACE TABLE sensor_readings (
    sensor_id            VARCHAR(50),
    reading_timestamp    TIMESTAMP_NTZ,
    temperature          FLOAT,
    humidity             FLOAT,
    pressure             FLOAT,
    metadata             VARIANT,
    file_name            VARCHAR(200),
    ingestion_timestamp  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
```
The table design accommodates the core sensor readings using datatypes based on the nature of each measurement: temperature as FLOAT to handle decimal precision across potentially wide temperature ranges, humidity as FLOAT for percentage values that require decimal precision, and pressure as FLOAT for precise atmospheric or system pressure measurements.
The design also includes a VARIANT column for flexible metadata storage, file tracking information to assist with troubleshooting, and the time of file ingestion, captured in the ingestion_timestamp column defined with DEFAULT CURRENT_TIMESTAMP().
The VARIANT data type is the recommended way to bring in semi-structured data from outside sources like this. As in some other database management systems, it can hold values of any type, which makes it particularly useful when business-critical files arrive in varying formats. For more information, check out the Snowflake documentation on semi-structured data types and this piece from Nimbus Intelligence that explores the VARIANT data type.
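As a quick illustration, nested attributes can be pulled back out of a VARIANT column with path notation. The sketch below assumes hypothetical metadata fields (firmware_version and a nested location.site value); your actual field names will depend on what the devices send:

```sql
-- Hypothetical example: extract nested attributes from the VARIANT metadata column
SELECT
    sensor_id,
    metadata:firmware_version::STRING AS firmware_version,
    metadata:location.site::STRING    AS site
FROM sensor_readings
LIMIT 10;
```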
Defining Your File Format
Now, we need to tell Snowflake how to interpret the incoming files. For our IoT scenario, we’ll say that the devices use JSON format:
```sql
CREATE OR REPLACE FILE FORMAT sensor_json_format
    TYPE = JSON
    STRIP_OUTER_ARRAY = TRUE
    ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE;
```
Note: Snowflake supports other file formats such as CSV and Avro; you can find details in Snowflake's official documentation.
The parameters here are important:
- TYPE = JSON: Specifies that our incoming files are in JSON format.
- STRIP_OUTER_ARRAY = TRUE: Many IoT systems output JSON as arrays of objects. This parameter tells Snowflake to treat each object in the array as a separate row.
- ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE: This makes the loading process more resilient to JSON schema variations, allowing some fields to be missing without failing the entire load.
If a failure does occur at the record level, the system will skip the problematic record and continue loading valid ones. These errors can be viewed later using the VALIDATE function, which we’ll cover in the error handling section.
Creating External Staging
Next, we need to define a location where our files will be stored before Snowflake ingests them. This is called an external stage:
```sql
CREATE OR REPLACE STAGE sensor_stage
    URL = 's3://your-bucket/sensor-data/'
    CREDENTIALS = (AWS_KEY_ID = 'your_key' AWS_SECRET_KEY = 'your_secret')
    FILE_FORMAT = sensor_json_format;
```
The command specifies an S3 bucket location where your sensor data files will be uploaded, provides the necessary credentials for Snowflake to access the S3 bucket, and associates our previously defined file format with this stage.
Note that you can also use Azure Blob Storage or Google Cloud Storage by adjusting the URL and credential parameters accordingly.
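For example, an equivalent stage over Azure Blob Storage might look like the sketch below. The account name, container, and SAS token are placeholders, and in production a storage integration is generally preferable to inline credentials:

```sql
CREATE OR REPLACE STAGE sensor_stage_azure
    URL = 'azure://youraccount.blob.core.windows.net/sensor-data/'
    CREDENTIALS = (AZURE_SAS_TOKEN = 'your_sas_token')
    FILE_FORMAT = sensor_json_format;
```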
Setting Up Cloud Event Notifications
To enable automatic ingestion, we need to configure Snowflake to receive notifications when new files arrive in the external stage. This step is optional but recommended for true automation.
Without cloud event notifications, you would need to either manually execute the COPY command each time new files arrive or set up a scheduled task to run the pipe at regular intervals, which is less efficient and could lead to increased latency:
```sql
CREATE OR REPLACE NOTIFICATION INTEGRATION sensor_events
    ENABLED = TRUE
    TYPE = QUEUE
    NOTIFICATION_PROVIDER = AWS_SQS
    AWS_SNS_TOPIC_ARN = 'arn:aws:sns:region:account:topic'
    AWS_SQS_ARN = 'arn:aws:sqs:region:account:queue'
    DIRECTION = INBOUND;
```
This integration connects Snowflake to an AWS SQS queue that will receive notifications about new files. It also configures the integration to listen for inbound messages. Note that you need to have already set up the appropriate SNS topic and SQS queue in AWS for this to work.
For Azure or Google Cloud, you would use different provider-specific parameters.
Creating and Configuring the Snowpipe
Now we can create the actual Snowpipe that will automatically load our data:
```sql
CREATE OR REPLACE PIPE sensor_data_pipe
    AUTO_INGEST = TRUE
    ERROR_INTEGRATION = 'PIPE_ERRORS'
AS
COPY INTO sensor_readings (
    sensor_id,
    reading_timestamp,
    temperature,
    humidity,
    pressure,
    metadata,
    file_name
)
FROM (
    SELECT
        $1:sensor_id::VARCHAR,
        TRY_TO_TIMESTAMP($1:timestamp::STRING),
        $1:readings:temperature::FLOAT,
        $1:readings:humidity::FLOAT,
        $1:readings:pressure::FLOAT,
        $1:metadata::VARIANT,
        metadata$filename
    FROM @sensor_stage  -- the STAGE object set up earlier
)
ON_ERROR = 'CONTINUE'
PATTERN = '.*\.json'
VALIDATION_MODE = 'RETURN_ERRORS';
```
Let’s break down this more complex statement:
- AUTO_INGEST = TRUE: This enables automatic ingestion triggered by cloud storage notifications. When a new file arrives and a notification is sent, Snowpipe will automatically start loading the file.
- ERROR_INTEGRATION = 'PIPE_ERRORS': This specifies where error information should be stored for troubleshooting.
- COPY INTO: This is the core loading command, specifying:
  - The target table and columns
  - A nested SELECT that extracts and converts values from the JSON structure
  - The JSON path notation ($1) to access nested fields
  - Type casting to ensure data type consistency (::VARCHAR, ::FLOAT)
  - TRY_TO_TIMESTAMP() to handle potential timestamp parsing errors gracefully
  - metadata$filename to capture the source file name automatically
- ON_ERROR = 'CONTINUE': This tells Snowpipe to continue processing even if some records have errors, rather than failing the entire file.
- PATTERN = '.*\.json': This regular expression limits processing to only files with a .json extension.
- VALIDATION_MODE = 'RETURN_ERRORS': This enables detailed error reporting for failed records.
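Once the pipe exists, a quick sanity check is to confirm it is registered and inspect its state. The sketch below uses two standard commands: SHOW PIPES exposes the notification_channel column (the SQS ARN your S3 event notifications should target for auto-ingest), and SYSTEM$PIPE_STATUS reports the pipe's execution state:

```sql
-- The notification_channel column holds the SQS ARN used for auto-ingest
SHOW PIPES LIKE 'sensor_data_pipe';

-- Returns a JSON document including executionState and pendingFileCount
SELECT SYSTEM$PIPE_STATUS('sensor_data_pipe');
```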
Implementing Error Handling
Proper error handling is crucial for production pipelines. Let’s create a table to store error information:
```sql
CREATE OR REPLACE TABLE pipe_error_log (
    error_timestamp  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    pipe_name        STRING,
    error_message    STRING,
    file_name        STRING,
    raw_data         VARIANT
);
```
Next, we'll create a stored procedure to capture and log errors from our Snowpipe. You can manually view errors at any time by querying the VALIDATE function directly, but this stored procedure automates the process and provides a historical record of all errors:
```sql
CREATE OR REPLACE PROCEDURE handle_pipe_errors()
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
try {
    // Query pipe status
    var errorsQuery = `
        SELECT FILE_NAME, ERROR_MESSAGE, REJECTED_RECORD
        FROM TABLE(VALIDATE(sensor_readings,
            JOB_ID => SYSTEM$PIPE_STATUS('sensor_data_pipe')))
    `;

    var stmt = snowflake.createStatement({sqlText: errorsQuery});
    var result = stmt.execute();

    // Log errors
    while (result.next()) {
        var insertQuery = `
            INSERT INTO pipe_error_log (
                pipe_name,
                error_message,
                file_name,
                raw_data
            )
            SELECT
                'sensor_data_pipe',
                '${result.getColumnValue(2)}',
                '${result.getColumnValue(1)}',
                PARSE_JSON('${result.getColumnValue(3)}')
        `;
        snowflake.execute({sqlText: insertQuery});
    }

    return "Error handling completed";
} catch (err) {
    return "Error in error handling: " + err.message;
}
$$;
```
This JavaScript stored procedure queries the pipe status using the SYSTEM$PIPE_STATUS
function. It then uses the VALIDATE
table function to get detailed error information. The procedure iterates through any errors and inserts them into the error log table. Both the error messages and the raw records that failed are captured by the procedure.
You would typically schedule this procedure to run periodically using Snowflake Tasks to ensure errors are captured consistently.
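A minimal sketch of such a task is shown below; the task name, warehouse, and 15-minute schedule are illustrative choices rather than requirements:

```sql
CREATE OR REPLACE TASK capture_pipe_errors_task
    WAREHOUSE = monitoring_wh        -- any warehouse the task may use (placeholder name)
    SCHEDULE  = '15 MINUTE'
AS
    CALL handle_pipe_errors();

-- Tasks are created suspended, so resume it to start the schedule
ALTER TASK capture_pipe_errors_task RESUME;
```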
Monitoring Pipe Status
Regular monitoring is essential for operational reliability. Here’s a query to check recent pipe activity:
```sql
SELECT
    file_name,
    status,
    row_count,
    error_count,
    first_error_message,
    last_load_time
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'SENSOR_READINGS',
    START_TIME => DATEADD('HOUR', -1, CURRENT_TIMESTAMP())
));
```
This query uses the COPY_HISTORY table function to return valuable operational metrics, including the files processed within the last hour, the success or failure status for each individual file, the row and error counts, and timing information for performance monitoring.
Data in the output might look similar to these values:
| file_name               | status           | row_count | error_count | first_error_message      | last_load_time      |
|-------------------------|------------------|-----------|-------------|--------------------------|---------------------|
| sensor_20250409_12.json | LOADED           | 1428      | 0           | NULL                     | 2025-04-09 12:15:23 |
| sensor_20250409_13.json | LOADED           | 1542      | 12          | Invalid timestamp format | 2025-04-09 13:03:45 |
| sensor_20250409_14.json | LOAD_IN_PROGRESS | NULL      | NULL        | NULL                     | NULL                |
Implementation: Direct Streaming with Snowpark API
For use cases requiring true real-time data processing with sub-second latency, Snowflake’s Direct Streaming via the Snowpark API is the preferred approach. Let’s implement a complete Python solution for streaming sensor data directly into Snowflake.
Setting Up the Python Environment
First, ensure you have the necessary libraries installed (for example, the snowflake-snowpark-python package with pandas support). This is a one-time setup that persists across application restarts; you won't need to reinstall these packages each time you run your application:
```python
from snowflake.snowpark.session import Session
from snowflake.snowpark.functions import col
import pandas as pd
import json
import time
```
Creating a Streaming Function
Here's a Python function that implements direct streaming to Snowflake. It is a lot of code to follow, but it is described after the listing. Unlike the Snowpipe implementation, however, this is all the configuration you need to get started with Snowpark's Direct Streaming functionality:
```python
def stream_to_snowflake(connection_params, data_generator):
    """
    Stream data directly to Snowflake using the Snowpark API.

    Parameters:
    - connection_params: Dictionary containing Snowflake connection parameters
      (account, user, password, role, warehouse, database, schema)
    - data_generator: An object with a get_next_batch() method that returns
      batches of data as pandas DataFrames
    """
    # Create Snowpark session
    session = Session.builder.configs(connection_params).create()

    try:
        # Create target table if it does not already exist
        session.sql("""
            CREATE TABLE IF NOT EXISTS realtime_events (
                event_id VARCHAR,
                event_timestamp TIMESTAMP_NTZ,
                event_type VARCHAR,
                payload VARIANT,
                processing_timestamp TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP
            )
        """).collect()

        # Main streaming loop
        while True:
            # Get a batch of events from the data source
            events = data_generator.get_next_batch()

            # If no events, wait briefly and continue
            if events is None or events.empty:
                time.sleep(0.1)
                continue

            # Write the batch to Snowflake; write_pandas appends to the
            # existing table by default
            try:
                session.write_pandas(
                    events,
                    "realtime_events",
                    quote_identifiers=False,  # match the unquoted table/column names above
                    chunk_size=10000,         # adjust based on your data volume
                )
            except Exception as e:
                print(f"Error writing batch: {str(e)}")
                # Implement retry logic here, for example:
                # time.sleep(1)
                # retry_counter += 1
                # if retry_counter > max_retries:
                #     # handle terminal failure (e.g., write to a dead letter queue)
    finally:
        # Ensure the session is closed properly
        session.close()
```
Understanding the Key Components
Let’s break down the important elements of this implementation:
- Snowpark Session: The Session object establishes the connection to Snowflake and provides the API for data operations. The connection parameters should include your account identifier, credentials, warehouse, database, and schema.
- Table Creation: The code creates the target table named realtime_events if it doesn't already exist. The schema includes:
  - event_id: A unique identifier for each event
  - event_timestamp: When the event occurred at the source
  - event_type: Classification of the event
  - payload: A VARIANT column to store the full event data. The payload column needs to contain a valid JSON object with all the event details. For IoT data, this would typically include the complete sensor reading with any additional metadata. The exact structure is flexible since VARIANT columns can store any valid JSON, but your application must ensure the data is properly formatted before sending it. Unlike the Snowpipe approach, where the JSON structure is mapped to table columns during loading, here you'll need to structure your data appropriately in your application code.
  - processing_timestamp: Automatically records when the data was received by Snowflake (the query sketched after this list uses this column together with event_timestamp to estimate ingestion lag)
- Pandas Integration: The Snowpark API's write_pandas method provides efficient batch loading of pandas DataFrames into Snowflake tables.
- Batching Strategy: The chunk_size parameter (set to 10000 in this example) controls the balance between throughput and latency:
  - Smaller chunks reduce latency but increase overhead
  - Larger chunks improve throughput but increase latency
  - The optimal value depends on your data characteristics and latency requirements
- Error Handling: The code includes basic error catching with placeholders for implementing retry logic, which is essential for production systems.
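Because the table records both when an event occurred (event_timestamp) and when Snowflake received it (processing_timestamp), a simple query can approximate end-to-end ingestion lag. This sketch runs against the realtime_events table defined above; the 15-minute window is an arbitrary choice:

```sql
-- Approximate end-to-end ingestion lag for recent events
SELECT
    event_type,
    AVG(DATEDIFF('second', event_timestamp, processing_timestamp)) AS avg_lag_seconds,
    MAX(DATEDIFF('second', event_timestamp, processing_timestamp)) AS max_lag_seconds
FROM realtime_events
WHERE processing_timestamp >= DATEADD('minute', -15, CURRENT_TIMESTAMP())
GROUP BY event_type;
```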
Performance Comparison
When choosing between Snowpipe and Direct Streaming, it’s important to understand their performance characteristics across different dimensions. Here’s a detailed comparison based on real-world implementations:
Latency
Snowpipe typically processes data with a latency of 1-2 minutes from the time a file arrives in the external stage. This latency comes from several components. Cloud storage events usually take 10-30 seconds to propagate through the notification system.
The Snowpipe queue processing adds another 30-60 seconds before the file begins loading. The actual data loading time varies with file size but is generally efficient. This latency remains relatively consistent across file sizes up to about 100MB, making it predictable for capacity planning.
Direct Streaming via Snowpark can achieve latencies as low as a few seconds or even sub-second in optimal conditions. Network latency between your application and Snowflake forms the baseline. Batch size configuration significantly impacts performance; smaller batches reduce latency but increase overhead.
Warehouse size and current load directly affect processing speed, as does data volume and complexity. For time-sensitive applications like fraud detection or real-time alerting, Direct Streaming’s lower latency is often critical for business operations.
Scalability
Snowpipe offers tremendous scalability without manual intervention, though this comes at a price based on the volume of data processed. Its serverless architecture automatically scales to handle incoming file volume without requiring you to provision additional resources, but costs will scale proportionally with data volume. There are no practical limits on concurrent files being processed, making it ideal for variable or unpredictable workloads. You don’t need to provision or manage compute resources, which simplifies operations.
Snowpipe works equally well with both consistent and bursty workloads, automatically adjusting to incoming data volume. Note that while this process itself scales automatically, your source systems will still need appropriate capacity to generate and transfer the files to your cloud storage.
Snowpark Direct Streaming’s scalability is limited by the warehouse resources allocated. It requires an active warehouse with sufficient resources to handle the incoming data stream. During high-volume situations, you may need to manually scale up the warehouse to prevent bottlenecks. Direct Streaming is more prone to backpressure during volume spikes, potentially causing data buffering in your application. It needs careful capacity planning for peak loads to ensure sufficient resources are available when needed. For unpredictable or highly variable data volumes, Snowpipe’s serverless model often provides better scalability without manual intervention.
Cost Efficiency
Snowpipe uses a pay-per-file pricing model where charges are based on the number of files processed and rows loaded. When no files are being ingested, you incur no costs, making it efficient for intermittent data flows.
Costs scale linearly with data volume, making budgeting more predictable. The serverless nature eliminates idle resource costs that might otherwise accumulate. This pricing model is particularly advantageous for variable workloads or development environments.
Snowpark Direct Streaming costs are based on warehouse compute time. Charges accrue continuously as long as the warehouse is running, regardless of whether data is actively flowing. This makes it less efficient for sporadic or low-volume data streams where the warehouse might sit idle much of the time.
Costs can be optimized by adjusting warehouse size to match throughput needs and implementing suspension settings for periods of inactivity. For 24/7 streaming applications, this approach can lead to potentially higher costs unless carefully managed. For continuous, high-volume streams, Direct Streaming can be more cost-effective if the warehouse is properly sized and utilized, while for intermittent or low-volume streams, Snowpipe usually offers better cost efficiency.
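As one illustration of the suspension settings mentioned above, a dedicated streaming warehouse can be configured to suspend itself quickly when the stream goes idle; the warehouse name and size below are placeholders:

```sql
-- Hypothetical dedicated streaming warehouse, sized for the expected throughput
CREATE OR REPLACE WAREHOUSE streaming_wh
    WAREHOUSE_SIZE = 'SMALL'
    AUTO_SUSPEND   = 60     -- suspend after 60 seconds of inactivity
    AUTO_RESUME    = TRUE;  -- wake automatically when the stream writes again
```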
Implementation Complexity
Snowpipe requires more initial setup effort. You'll need to configure an external stage to hold your files before ingestion. Setting up cloud notification integration adds another layer of complexity to enable automatic triggering. However, once everything is configured, operation is as simple as dropping properly formatted files into the cloud storage location.
File management considerations, such as naming conventions and archiving strategies, must be addressed. Comprehensive error handling and monitoring infrastructure is necessary to ensure reliable operations. Despite this initial complexity, Snowpipe offers excellent operational simplicity once configured.
Snowpark Direct Streaming has a simpler implementation from an infrastructure perspective. It uses a standard API integration pattern familiar to most developers without requiring intermediate file storage. There's no file management overhead since data flows directly from your application to Snowflake. Of course, having less infrastructure to manage on the Snowflake side means more responsibility shifts to your client applications.
The streaming paradigm is often more familiar to application developers who may already be using similar patterns elsewhere in their systems. It integrates more naturally with existing application code, particularly for systems already generating events or messages. For teams with strong application development skills but less cloud infrastructure experience, Direct Streaming often provides an easier path to implementation.
Error Recovery
Snowpipe offers robust error handling at the file level. When issues occur, failed files can be reprocessed entirely without having to track individual records. Detailed error information is preserved for troubleshooting through Snowflake’s validation functions. It’s straightforward to implement dead-letter mechanisms for failed files by moving them to a separate location for later analysis or reprocessing.
Snowpipe includes built-in retry mechanisms for transient issues, improving reliability without custom code. The file-based approach creates natural transaction boundaries that simplify recovery.
Snowpark Direct Streaming requires custom error handling at the record level. Your application must implement its own retry logic for failed operations, including appropriate backoff strategies. It requires careful transaction management to ensure consistency during error conditions. You’ll need to develop dead-letter queues for failed records that cannot be processed after multiple attempts.
There is generally more responsibility on the application side for ensuring data isn’t lost during failures. Snowpipe generally provides more robust out-of-the-box error recovery, while Direct Streaming requires more custom implementation for equivalent reliability.
Best Practices and Recommendations
Based on extensive implementation experience, here are detailed recommendations for implementing real-time data streaming in Snowflake effectively:
When to Choose Snowpipe
Snowpipe is typically the better choice when:
- You’re working with file-based data sources. If your data already arrives as files (logs, exports, etc.), Snowpipe provides the simplest integration path without adding an unnecessary transformation layer.
In fact, if you have the flexibility to architect your data flow, designing a file-based transfer approach can provide advantages in terms of simplicity, reliability, and the ability to reprocess historical data when needed. This is especially true for workloads that don't require true real-time processing.
- Minute-level latency is acceptable. For many analytical use cases, the 1-2 minute latency of Snowpipe is perfectly adequate and doesn't justify the additional complexity of Direct Streaming.
- You need serverless scalability. If your data volume is highly variable or unpredictable, Snowpipe’s serverless architecture eliminates capacity planning challenges.
- Cost efficiency is a priority. For intermittent data streams or cost-sensitive implementations, Snowpipe’s pay-per-file model often results in lower overall costs.
Snowpipe Best Practices
To get the most from Snowpipe, follow these implementation best practices:
- Optimize file sizes. Aim for file sizes between 10MB and 100MB for the best balance of throughput and latency. Extremely small files (under 1MB) create unnecessary overhead, while very large files (over 500MB) can lead to longer processing times. For IoT applications that generate many small readings, consider implementing a buffer or aggregator service that combines readings into optimally sized batches before writing to your cloud storage. For example, you might buffer sensor readings for a short time period (e.g., 1-5 minutes) or until a target file size is reached before creating a new file in your stage.
- Implement comprehensive error handling. Use the VALIDATE table function and periodic error checking procedures to ensure no data is silently lost. Store the raw content of failed records to facilitate troubleshooting and reprocessing.
- Set up monitoring and alerting. Create dashboards that track key metrics like file counts, latency, error rates, and data volumes. Configure alerts for anomalies such as rising error rates or processing delays.
- Plan for reprocessing. Maintain the ability to reload historical files if needed. This might involve preserving files in the stage or maintaining a separate archive of raw files.
- Consider partitioning strategies. Organize your external stage with a logical folder structure (e.g., by date or source) to facilitate management and potential selective reprocessing, as sketched below.
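Building on the last two points, a date-based folder layout makes both inspection and targeted replays straightforward. The prefix below is illustrative; note that ALTER PIPE ... REFRESH only queues files under the prefix that have not already been loaded, and only considers files staged within the last seven days:

```sql
-- Inspect what has landed for a single day
LIST @sensor_stage/2025/04/09/;

-- Queue any not-yet-loaded files under that prefix for ingestion
ALTER PIPE sensor_data_pipe REFRESH PREFIX = '2025/04/09/';
```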
When to Choose Direct Streaming
Direct Streaming is typically the better choice when:
- You need minimal latency. For operational use cases like fraud detection or real-time decision support, Direct Streaming’s lower latency can be business-critical.
- Your data comes directly from applications. When your data originates from your own applications rather than files, Direct Streaming eliminates the unnecessary step of file creation and staging.
- You need immediate data consistency. If your use case requires immediate visibility of new data for decision-making, Direct Streaming provides the most current view.
- You have continuous, high-volume streams. For steady, high-throughput scenarios where a warehouse would be running continuously anyway, Direct Streaming can offer better performance.
Direct Streaming Best Practices
To implement Direct Streaming effectively:
- Size your warehouse appropriately. Monitor CPU utilization and queue times to determine the optimal warehouse size (a monitoring query is sketched after this list). Consider using a dedicated warehouse for streaming to isolate it from other workloads.
- Implement backpressure handling. Your application should detect when it’s sending data faster than Snowflake can process it and adapt accordingly, either by throttling the source or by buffering data temporarily.
- Build robust error handling. Implement comprehensive retry logic with exponential backoff for transient failures. Create a dead-letter mechanism for records that consistently fail to load.
- Use efficient batching. Experiment with different batch sizes to find the optimal balance between throughput and latency for your specific use case. Monitor latency at different percentiles (p50, p95, p99) to understand the full performance profile.
- Implement circuit breakers. Design your application to detect sustained failures and temporarily pause streaming attempts to prevent resource exhaustion during outages or maintenance windows.
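To support the warehouse-sizing point above, Snowflake's WAREHOUSE_LOAD_HISTORY table function gives a quick view of running versus queued load on the streaming warehouse; the warehouse name below is a placeholder:

```sql
SELECT
    start_time,
    avg_running,
    avg_queued_load
FROM TABLE(INFORMATION_SCHEMA.WAREHOUSE_LOAD_HISTORY(
    DATE_RANGE_START => DATEADD('hour', -1, CURRENT_TIMESTAMP()),
    WAREHOUSE_NAME   => 'STREAMING_WH'
));
```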
General Streaming Best Practices
Regardless of which method you choose:
- Validate data quality. Implement validation checks both before sending data to Snowflake and after loading. Consider creating data quality metrics that track error rates, schema compliance, and business rule violations.
- Plan for disaster recovery. Document procedures for recovering from failures, including how to backfill data if there’s an extended outage.
- Implement comprehensive logging. Maintain detailed logs of all streaming operations, including timestamps, volumes, and error conditions, to facilitate troubleshooting and performance optimization.
- Design for exactly-once semantics. If your use case requires it, implement mechanisms to prevent duplicate data or to make processing idempotent; one such pattern is sketched after this list.
- Consider hybrid approaches. Some organizations implement both methods: Direct Streaming for critical, low-latency data and Snowpipe for higher-volume background data, leveraging the strengths of each approach.
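As one possible pattern for the exactly-once point above (the staging table and key column are illustrative): land incoming batches in a staging table, then MERGE into the final table on a unique event key so that replayed batches do not create duplicates.

```sql
-- Deduplicate on event_id so replays are idempotent
MERGE INTO realtime_events AS t
USING realtime_events_staging AS s
    ON t.event_id = s.event_id
WHEN NOT MATCHED THEN
    INSERT (event_id, event_timestamp, event_type, payload)
    VALUES (s.event_id, s.event_timestamp, s.event_type, s.payload);
```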
Wrapping Up
Snowflake offers powerful options for real-time data streaming that can accommodate a wide range of latency requirements, data volumes, and architectural patterns. By understanding the strengths and limitations of both Snowpipe and Direct Streaming, you can make informed decisions that balance performance, cost, and implementation complexity.
For most file-based use cases where minute-level latency is acceptable, Snowpipe provides a robust, serverless solution with minimal operational overhead. For applications requiring true real-time processing with second or sub-second latency, Direct Streaming via the Snowpark API offers the necessary performance at the cost of some additional implementation complexity.
Whichever approach you choose, following the best practices outlined in this guide will help ensure a reliable, scalable, and cost-effective real-time data pipeline in Snowflake.