{"id":107644,"date":"2025-09-23T12:00:00","date_gmt":"2025-09-23T12:00:00","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=107644"},"modified":"2025-09-19T07:46:13","modified_gmt":"2025-09-19T07:46:13","slug":"iot-meets-oltp-how-to-handle-backfilling-challenges-in-real-time-systems","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/iot-meets-oltp-how-to-handle-backfilling-challenges-in-real-time-systems\/","title":{"rendered":"IoT Meets OLTP: How to Handle Backfilling Challenges in Real-Time Systems"},"content":{"rendered":"\n<p>This is the true story of a 64-core <a href=\"https:\/\/www.microsoft.com\/en-gb\/sql-server\" target=\"_blank\" rel=\"noreferrer noopener\">SQL Server<\/a> brought down by poor assumptions about its data. A <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/learn\/effective-clustered-indexes\/\" target=\"_blank\" rel=\"noreferrer noopener\">clustered index<\/a> designed for neat, sequential IoT inserts was overwhelmed when the real readings arrived late, out of order, and in bulk. The same risk lurks in any high-write system with unpredictable insert patterns. This article shows what can go wrong and how to avoid it. 
In essence, it answers the question: how do you manage IoT data in <a href=\"https:\/\/en.wikipedia.org\/wiki\/Online_transaction_processing\" target=\"_blank\" rel=\"noreferrer noopener\">OLTP (online transaction processing)<\/a> systems?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-background\">The Background<\/h2>\n\n\n\n<p>This isn\u2019t a hypothetical lab setup: it describes what can go wrong at a manufacturing site running a high-end <a href=\"https:\/\/en.wikipedia.org\/wiki\/Relational_database\" target=\"_blank\" rel=\"noreferrer noopener\">relational database management system (RDBMS)<\/a> to record IoT data.<\/p>\n\n\n\n<p>A manufacturing company rolled out over 10,000 IoT sensors across its factories to track temperature, pressure, humidity, acoustic, and vibration data. The data was streamed into an RDBMS, <a href=\"https:\/\/www.microsoft.com\/en-gb\/sql-server\" target=\"_blank\" rel=\"noreferrer noopener\">SQL Server<\/a> in this case, not because it was cutting-edge, but because it was already part of their stack: their MES (manufacturing execution system) ran on it, they had additional licenses and elite-class support from <a href=\"https:\/\/www.microsoft.com\/en-gb\/\" target=\"_blank\" rel=\"noreferrer noopener\">Microsoft<\/a>, and the cost of switching to open-source systems was too high.<\/p>\n\n\n\n<p>The table design seemed deceptively straightforward: a <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/learn\/primary-key-primer-for-sql-server\/\" target=\"_blank\" rel=\"noreferrer noopener\">clustered primary key<\/a> on (SensorId, ReadingTime), as is commonly used in <a href=\"https:\/\/www.clarify.io\/learn\/time-series-data\" target=\"_blank\" rel=\"noreferrer noopener\">time series<\/a> workloads. 
But this key ordering quietly assumes that inserts arrive in sequence, an assumption that often fails in OLTP systems, and they didn\u2019t have a DBA involved during design. Everything looked fine in tests with 1,000 sensors or fewer, but problems arose as the number of sensors increased.<\/p>\n\n\n\n<p>However, it wasn\u2019t just the increase in sensors that caused problems: many of these sensors suffered from power issues, poor GSM connectivity, and delayed data transmission in production. The result was backfilled inserts with old timestamps arriving hours or days late, and often in considerable bulk. This had an immediate business impact: temperature gauges often showed &#8216;stale&#8217; values despite sensor updates, vibration alerts sometimes failed to trigger on abnormal machinery, and pressure graphs occasionally displayed gaps even though the sensors were online.<\/p>\n\n\n\n<p>These out-of-order inserts hit earlier parts of the clustered index and triggered massive <a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/how-to-identify-the-source-of-page-splits-in-a-database\/\" target=\"_blank\" rel=\"noreferrer noopener\">page splits<\/a>. With no fill factor adjustments or fragmentation checks in place, the system began to struggle. <a href=\"https:\/\/learn.microsoft.com\/en-us\/sql\/relational-databases\/databases\/tempdb-database?view=sql-server-ver17\" target=\"_blank\" rel=\"noreferrer noopener\">TempDB<\/a> usage climbed, CPU hit 90%, and insert latency shot up. Even with a 64-core server, the database couldn\u2019t keep up.<\/p>\n\n\n\n<p>Eventually, we looked deeper, monitored closely, and realized what was happening. The clustered index wasn\u2019t the primary problem: the problem was the assumption that inserts would always be sequential. 
With no room left on pages, every backfill caused internal reshuffling, page splits, and performance degradation.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-architecture\">The Architecture<\/h2>\n\n\n\n<p>The design of the end-to-end data pipeline was fairly straightforward on paper, as shown below:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"716\" height=\"1024\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-716x1024.png\" alt=\"The design of the end-to-end data pipeline.\" class=\"wp-image-107645\" style=\"width:706px;height:auto\" srcset=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-716x1024.png 716w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-210x300.png 210w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-768x1099.png 768w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-1073x1536.png 1073w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image.png 1125w\" sizes=\"auto, (max-width: 716px) 100vw, 716px\" \/><\/figure>\n\n\n\n<p>It began with individual IoT sensors that were deployed across industrial equipment, continuously measuring parameters such as temperature, vibration, pressure, and humidity.<\/p>\n\n\n\n<p>These sensors didn\u2019t talk to SQL Server directly. Instead, their readings flowed through an Edge Device Gateway, a middleware layer responsible for temporarily buffering and pre-processing incoming data. This layer played a crucial role: it could store up to one month of telemetry data locally, which became invaluable when connectivity was lost or sensors went offline. 
It\u2019s this buffer that allowed delayed data to be backfilled into the system later, although the immediate value of that data for triggering alerts on abnormal readings was lost.<\/p>\n\n\n\n<p>From the middleware, data was pushed into staging tables within the database: dynamically created tables named like <code>sensor_n_UUID<\/code>. These were lightweight, per-sensor structures used to isolate incoming data before merging it into the main <code>Telemetry<\/code> table.<\/p>\n\n\n\n<p>These staging tables served two purposes:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>In real time, they handled small, continuous batches of data every few seconds or minutes.<\/li>\n\n\n\n<li>When receiving backfill data from the Edge Device Gateway, they accumulated much larger datasets, sometimes several days\u2019 or even months\u2019 worth of readings.<\/li>\n<\/ul>\n<\/div>\n\n\n<p>While the real-time flow remained efficient due to the small payload size and regular purging of staging tables, the backfill operations became a ticking time bomb. When massive, unordered inserts from these tables hit the clustered index on (<code>SensorId<\/code>, <code>ReadingTime<\/code>), they caused page splits, TempDB spills, and CPU spikes. Later, we\u2019ll discuss the severe impacts these had.<\/p>\n\n\n\n<p>Once the telemetry was ingested into the main table, continuous aggregates and enrichment jobs ran periodically, computing moving averages and trend deltas and preparing enriched datasets for downstream consumers. 
Finally, this processed data powered analytics dashboards and machine learning pipelines for equipment health prediction, anomaly detection, and more.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-first-signs-of-disaster-what-we-saw-in-real-time\">First Signs of Disaster: What We Saw in Real-Time<\/h2>\n\n\n\n<p>There were several obvious signs that something was wrong, visible in alerts and notifications:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>The CPU was pinned at 90%+ on a 64-core machine<\/li>\n\n\n\n<li>TempDB log file growth alerts (50 GB+ in an hour)<\/li>\n\n\n\n<li>The page splits\/sec graph looked like a hospital alarm (1,000 \u2192 5,000+)<\/li>\n\n\n\n<li>Dashboard queries failing with <code>Timeout expired<\/code> errors<\/li>\n\n\n\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Extract,_transform,_load\" target=\"_blank\" rel=\"noreferrer noopener\">ETL<\/a> job times increasing sharply, with many jobs stuck while loading from staging tables<\/li>\n<\/ul>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\" id=\"h-where-things-started-breaking\">Where Things Started Breaking<\/h2>\n\n\n<div class=\"block-core-list\">\n<ol start=\"1\" class=\"wp-block-list\">\n<li>The clustered index on (SensorId, ReadingTime) meant that mid-range inserts during backfilling caused non-sequential writes.<\/li>\n\n\n\n<li>The large Payload column bloated rows, drastically reducing the number of rows stored per page and causing massive page splits of up to 1,000 per second.<\/li>\n\n\n\n<li>TempDB usage soared as data was inserted from the Sensor_n_UUID staging tables into the telemetry table in an effort to preserve late-arriving data.<\/li>\n\n\n\n<li>CPU shot up to 90%+, even on a 64-core machine, because so many inserts into the telemetry table ran continuously in parallel.<\/li>\n<\/ol>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\" id=\"h-simulating-the-complete-workload\">Simulating The Complete Workload<\/h2>\n\n\n\n<p>I recreated the full environment on an 
8-vCPU, 64 GB RAM <a href=\"https:\/\/azure.microsoft.com\/en-us\/products\/virtual-machines\" target=\"_blank\" rel=\"noreferrer noopener\">Azure Virtual Machine (VM)<\/a> running SQL Server 2019 Evaluation Edition (feature-wise, the same as Enterprise Edition). While the original client system ran on a monstrous 64-core SQL Server 2019 Enterprise instance, replicating that exact scale wasn\u2019t feasible: Azure\u2019s 64-vCPU machines are significantly more expensive and generally out of reach for individual users like me.<\/p>\n\n\n\n<p>But even with modest specs, the setup was more than sufficient to faithfully reproduce the bottlenecks, offering a realistic glimpse into how page splits, TempDB overload, and CPU spikes emerge when backfilled telemetry data clashes with suboptimal indexing, just as happened at the manufacturing company in this example. It also allowed me to run tests to assess the best solution to the problem.<\/p>\n\n\n\n<p>For the simulation, I modeled 100 sensors instead of the full 10,000+ used in production. This wasn\u2019t just to save on data generation time, but also to keep the workload manageable while still triggering the same structural failures: backfilled inserts, bloated payloads, and fragmentation chaos. 
The goal wasn\u2019t to match the client\u2019s scale byte-for-byte; it was to mirror the architectural flaw and its side effects.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-create-main-table-telemetry\">Create Main Table Telemetry<\/h3>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >CREATE TABLE dbo.Telemetry (\n    SensorId INT NOT NULL,\n    ReadingTime DATETIME NOT NULL,\n    Value FLOAT NULL,\n    Payload VARCHAR(MAX) NULL,\n    CONSTRAINT PK_Telemetry PRIMARY KEY CLUSTERED (SensorId, ReadingTime)\n);\n<\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-simulating-reality-data-every-10-seconds-with-gsm-glitches\">Simulating Reality: Data Every 10 Seconds\u2026 with GSM Glitches<\/h3>\n\n\n\n<p>In the live systems, data was expected every 10 seconds; however, for various reasons, including SIM expiry, GSM (Global System for Mobile Communications) outages, or tower handshakes, data sometimes arrived late. To address this, the application stored incoming data in temporary staging tables, with one table per sensor (e.g., sensor_1_UUID, sensor_2_UUID), which were then backfilled into the <code>Telemetry<\/code> table in parallel.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >-- One staging table per sensor: same columns as dbo.Telemetry, no indexes\nCREATE TABLE dbo.sensor_1_UUID123 (\n    SensorId INT NOT NULL,\n    ReadingTime DATETIME NOT NULL,\n    Value FLOAT NULL,\n    Payload VARCHAR(MAX) NULL\n);\n-- dbo.sensor_2_UUID234 and the rest follow the same pattern\n<\/pre><\/div>\n\n\n\n<p>New tables were created during each backfill operation as data was pushed from edge devices, using a new hash suffix to reduce direct load on the main telemetry table. Since simulating the edge devices themselves wasn\u2019t possible in this context, the next section creates the per-sensor hash tables directly while maintaining the same logic. 
Whenever new sensor data entered the system, a corresponding sensor_n_UUID table was generated.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-simulating-the-work-load\">Simulating the Work Load<\/h2>\n\n\n\n<p>To get to the bottom of the issue, I needed to recreate the entire environment in a controlled test setup and then simulate their page splits, TempDB usage and CPU hike issues on the basis of known patterns. The real system had over 10,000 industrial IoT sensors deployed across manufacturing units, each transmitting a steady stream of telemetry: temperature, acoustic signatures, pressure, humidity, and device health stats. These payloads were not lean \u2014 they were firmware-packed strings including battery level, device version, GPS location, CRC checksums, and metadata used for downstream analytics and alerting.<\/p>\n\n\n\n<p>However, this data didn\u2019t always arrive in real time due to GSM network blackouts, expired SIMs, and device reboots. Sometimes it came as backfilled records hours or even days later. Now, with a clustered index on (SensorId, ReadingTime) and inserts arriving out-of-order, SQL Server wasn\u2019t happy.<\/p>\n\n\n\n<p>To mirror this behavior, I set up a table named <code>Telemetry<\/code> with the same clustered index structure, added a Payload column of type <code>VARCHAR(MAX)<\/code>, and generated synthetic data for 100 sensors covering an entire month. Each sensor had its own &#8220;missing&#8221; time windows to simulate real-life network disruptions. These windows were later used to insert backfilled data \u2014 just as happened in production.<\/p>\n\n\n\n<p>In real deployments, the client routed late-arriving data via dynamically created staging tables (named like sensor_28_3ABF1298&#8230;) before merging it into the main table. 
My simulation does the same.<\/p>\n\n\n\n<p>The goal is to reproduce the exact insertion pattern that triggered thousands of subtle but expensive page splits:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >DECLARE @StartDate DATETIME = '2020-01-01';\nDECLARE @EndDate DATETIME = '2020-01-31';\nDECLARE @SensorId INT = 1;\n\n-- Log of each sensor's simulated outage window (used later for backfilling)\nIF OBJECT_ID('dbo.MissingDataLog') IS NULL\n    CREATE TABLE dbo.MissingDataLog (\n        SensorId INT NOT NULL,\n        MissingMonthStart DATE NOT NULL,\n        MissingMonthEnd DATE NOT NULL\n    );\n\n-- 100 sensors, matching the simulated fleet described above\nWHILE @SensorId &lt;= 100\nBEGIN\nDECLARE @MissingStart DATE = NULL, @MissingEnd DATE = NULL;\nIF @SensorId BETWEEN 10 AND 13\nBEGIN\n    SET @MissingStart = '2020-01-04';\n    SET @MissingEnd = '2020-01-07';\nEND\nELSE IF @SensorId BETWEEN 18 AND 25\nBEGIN\n    SET @MissingStart = '2020-01-09';\n    SET @MissingEnd = '2020-01-15';\nEND\nELSE IF @SensorId BETWEEN 20 AND 60\nBEGIN\n    SET @MissingStart = '2020-01-05';\n    SET @MissingEnd = '2020-01-11';\nEND\nELSE IF @SensorId BETWEEN 28 AND 42\nBEGIN\n    SET @MissingStart = '2020-01-18';\n    SET @MissingEnd = '2020-01-25';\nEND\nELSE IF @SensorId BETWEEN 42 AND 57\nBEGIN\n    SET @MissingStart = '2020-01-11';\n    SET @MissingEnd = '2020-01-15';\nEND\nELSE IF @SensorId BETWEEN 10 AND 30\nBEGIN\n    SET @MissingStart = '2020-01-25';\n    SET @MissingEnd = '2020-01-29';\nEND\nELSE IF @SensorId BETWEEN 28 AND 55\nBEGIN\n    SET @MissingStart = '2020-01-02';\n    SET @MissingEnd = '2020-01-07';\nEND\nELSE IF @SensorId BETWEEN 33 AND 50\nBEGIN\n    SET @MissingStart = '2020-01-10';\n    SET @MissingEnd = '2020-01-25';\nEND\nELSE IF @SensorId BETWEEN 45 AND 70\nBEGIN\n    SET @MissingStart = '2020-01-17';\n    SET @MissingEnd = '2020-01-25';\nEND\nELSE IF @SensorId BETWEEN 80 AND 97\nBEGIN\n    SET @MissingStart = '2020-01-10';\n    SET @MissingEnd = '2020-01-27';\nEND\n\nDECLARE @UUIDSuffix NVARCHAR(36) = CONVERT(NVARCHAR(36), NEWID());\nDECLARE @SensorTableName NVARCHAR(128) = N'sensor_' + CAST(@SensorId AS NVARCHAR) + '_' + @UUIDSuffix;\n\nDECLARE @CreateSQL NVARCHAR(MAX) = '\n    IF OBJECT_ID(''' + @SensorTableName + ''') IS NULL\n    
CREATE TABLE ' + QUOTENAME(@SensorTableName) + ' (\n        SensorId INT NOT NULL,\n        ReadingTime DATETIME NOT NULL,\n        Value FLOAT NULL,\n        Payload VARCHAR(MAX) NULL\n    )';\nEXEC sp_executesql @CreateSQL;\n\nDECLARE @Current DATETIME = @StartDate;\nWHILE @Current &lt; @EndDate\nBEGIN\n    IF (@MissingStart IS NULL OR @Current &lt; @MissingStart OR @Current &gt;= @MissingEnd)\n    BEGIN\n        DECLARE @Payload NVARCHAR(MAX) = 'Battery=92%;GPS=27.1,78.3;CRC=0xAB12F3';\n\n        INSERT INTO dbo.Telemetry (SensorId, ReadingTime, Value, Payload)\n        VALUES (@SensorId, @Current, RAND() * 100, @Payload);\n\n        DECLARE @InsertSensorSQL NVARCHAR(MAX) = '\n            INSERT INTO ' + QUOTENAME(@SensorTableName) + '\n            (SensorId, ReadingTime, Value, Payload)\n            VALUES (' + CAST(@SensorId AS NVARCHAR) + ', ''' +\n            CONVERT(NVARCHAR, @Current, 120) + ''', ' +\n            CAST(RAND() * 100 AS NVARCHAR) + ', ''' + @Payload + ''')';\n        EXEC sp_executesql @InsertSensorSQL;\n    END\n\n    SET @Current = DATEADD(SECOND, 10, @Current);\nEND\n\nIF @MissingStart IS NOT NULL\nBEGIN\n    INSERT INTO dbo.MissingDataLog (SensorId, MissingMonthStart, MissingMonthEnd)\n    VALUES (@SensorId, @MissingStart, @MissingEnd);\nEND\n\nSET @SensorId += 1;\n \nEND<\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-backfilling-the-data-behavior-and-witnessing-page-splits\">Backfilling the Data Behavior and Witnessing Page Splits<\/h2>\n\n\n\n<p>The following script simulates how real-world IoT sensor backfilling triggers page-splits in SQL Server.<\/p>\n\n\n\n<p>In production, telemetry data from sensors often doesn&#8217;t arrive on time due to GSM issues, power loss, or firmware-level buffering. When the network reconnects, the device pushes old data into the database \u2014 sometimes minutes, hours, or even days late. 
This is known as backfilling.<\/p>\n\n\n\n<p>Now here\u2019s the catch: if your clustered index is on (SensorId, ReadingTime), and the late data falls into an older time range, SQL Server must insert it into already-filled pages. If those pages don\u2019t have enough space, a page split is triggered, thereby causing extra I\/O, CPU spikes, TempDB usage, and sometimes query stalls. <\/p>\n\n\n\n<p>The query below simulates exactly this kind of backfill for one sensor:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >DECLARE @SensorId INT = 32; \nDECLARE @StartDate DATETIME = '2020-01-10';\nDECLARE @EndDate DATETIME = '2020-01-27';\n-- Step 1: Find the sensor-specific staging table \n\nDECLARE @SensorTable NVARCHAR(128); \n-- latest table in case of multiple\nSELECT TOP 1 @SensorTable = name FROM sys.tables WHERE name LIKE 'sensor_' + CAST(@SensorId AS NVARCHAR) + '_%' ORDER BY create_date DESC; \n\n -- Sanity check\n\n IF @SensorTable IS NULL BEGIN RAISERROR('Sensor staging table not found for SensorId = %d', 16, 1, @SensorId); RETURN; END\n\n-- Step 2: Delete overlapping data from main table (optional, for clean re-run) \nDELETE FROM dbo.Telemetry WHERE SensorId = @SensorId AND ReadingTime BETWEEN @StartDate AND @EndDate;\n-- Step 3: Build and execute dynamic insert \n\nDECLARE @SQL NVARCHAR(MAX) = ' INSERT INTO dbo.Telemetry (SensorId, ReadingTime, Value, Payload) \nSELECT SensorId, ReadingTime, Value, Payload FROM ' + QUOTENAME(@SensorTable) + ' WHERE ReadingTime BETWEEN @Start AND @End ORDER BY SensorId, ReadingTime;';\n-- Step 4: Execute with parameters \nEXEC sp_executesql @SQL, N'@Start DATETIME, @End DATETIME', @Start = @StartDate, @End = @EndDate;<\/pre><\/div>\n\n\n\n<p>The maximum page split was 995 for the three-sensor backfilling:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"604\" 
src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-1-1024x604.png\" alt=\"The maximum page split was 995 for the three-sensor backfilling.\" class=\"wp-image-107646\" srcset=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-1-1024x604.png 1024w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-1-300x177.png 300w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-1-768x453.png 768w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-1.png 1125w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><br>The maximum page split then skyrocketed to 1900-2000 for six sensors backfilling:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"614\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-2-1024x614.png\" alt=\"The maximum page split then skyrocketed to 1900-2000 for six sensors backfilling.\" class=\"wp-image-107647\" srcset=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-2-1024x614.png 1024w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-2-300x180.png 300w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-2-768x461.png 768w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-2.png 1125w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><br>And finally, the maximum page split further increased to 2933 for nine sensors backfilling:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"592\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-3-1024x592.png\" alt=\"And finally, the maximum page split further increased to 2933 for nine sensors backfilling.\" 
class=\"wp-image-107649\" srcset=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-3-1024x592.png 1024w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-3-300x173.png 300w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-3-768x444.png 768w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-3.png 1125w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-alternate-way-to-simulate-the-same-page-split-effect\">Alternate Way to Simulate the Same Page Split Effect<\/h2>\n\n\n\n<p>If you don\u2019t want to pull from the stage table (i.e sensor_n_uuid), but still want to trigger the same pattern of page splits, you can directly insert bloated rows into the <code>Telemetry<\/code> table using a loop. Here&#8217;s a quick way to simulate the same behavior and page-splits:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >DELETE FROM dbo.Telemetry\nWHERE SensorId = 32  AND ReadingTime BETWEEN '2020-01-10' AND '2020-01-27';\nDECLARE @Current DATETIME = '2020-01-18';\nDECLARE @EndDate DATETIME = '2020-01-25';\nDECLARE @SensorId INT = 32;\nDECLARE @Payload NVARCHAR(MAX); \nWHILE @Current &lt; @EndDate\nBEGIN\n    -- Generate ~8000 character payload (simulate full page fill pressure)\n    SET @Payload = REPLICATE('X', 8000); \n    INSERT INTO dbo.Telemetry (SensorId, ReadingTime, Value, Payload)\n    VALUES (@SensorId, @Current, RAND() * 100, @Payload);\nSET @Current = DATEADD(SECOND, 10, @Current);\nEND<\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-high-tempdb-usage\">High TempDB Usage<\/h2>\n\n\n\n<p>To load sensor data back into the <code>Telemetry<\/code> table, we followed a dynamic insert pattern like this:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >DECLARE 
@SQL NVARCHAR(MAX) = '\nINSERT INTO dbo.Telemetry (SensorId, ReadingTime, Value, Payload)\nSELECT SensorId, ReadingTime, Value, Payload\nFROM ' + QUOTENAME(@SensorTable) + '\nWHERE ReadingTime BETWEEN @Start AND @End\nORDER BY SensorId, ReadingTime;';<\/pre><\/div>\n\n\n\n<p>This query seems harmless: it\u2019s a simple <code>INSERT<\/code> &#8230; <code>SELECT<\/code> with an <code>ORDER BY<\/code> to maintain clustering order. But the catch lies in the source table: sensor_n_UUID.<\/p>\n\n\n\n<p>In most production patterns, staging tables don\u2019t carry indexes. They\u2019re short-lived, used for quick loads, and often dropped or truncated after the merge, so adding indexes is skipped for simplicity and performance reasons. While that works for unordered appends, it hurts badly when you\u2019re inserting into a clustered index that depends on order.<\/p>\n\n\n\n<p>Without an index on (SensorId, ReadingTime) in the staging table, SQL Server must sort the entire result set to satisfy the <code>ORDER BY<\/code> before inserting, and that sort spills into TempDB.<\/p>\n\n\n\n<p>We tracked TempDB usage during backfilling via the following queries:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >-- TempDB usage by the backfilling write session's current task\nSELECT session_id,\n       request_id,\n       user_objects_alloc_page_count,\n       user_objects_dealloc_page_count,\n       internal_objects_alloc_page_count,\n       internal_objects_dealloc_page_count,\n       total_allocated_kb = (user_objects_alloc_page_count + internal_objects_alloc_page_count) * 8,\n       total_deallocated_kb = (user_objects_dealloc_page_count + internal_objects_dealloc_page_count) * 8\nFROM sys.dm_db_task_space_usage\nWHERE session_id = 54; -- replace with the actual SPID\n\n-- Total TempDB usage by the backfilling session\nSELECT session_id,\n       total_allocated_kb = (user_objects_alloc_page_count + internal_objects_alloc_page_count) * 8,\n       
total_deallocated_kb = (user_objects_dealloc_page_count + internal_objects_dealloc_page_count) * 8 FROM sys.dm_db_session_space_usage WHERE session_id = 54<\/pre><\/div>\n\n\n\n<p>As seen below, the query showed that a single session (session_id = 54) consumed over 1.7 GB of TempDB, entirely from sorting unindexed data during a backfill insert:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"387\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-4-1024x387.png\" alt=\"The query showed that a single session (session_id = 54) consumed over 1.7 GB of TempDB, entirely from sorting unindexed data during a backfill insert.\" class=\"wp-image-107650\" srcset=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-4-1024x387.png 1024w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-4-300x113.png 300w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-4-768x290.png 768w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-4.png 1125w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Now imagine if this wasn\u2019t just the one sensor \u2014 the system was backfilling data from dozens of sensors in parallel, each running its own session. If each session consumes a similar amount of TempDB, the load multiplies rapidly, overwhelming TempDB with internal sort spills, leading to disk I\/O pressure, latch contention, and severe slowdowns. This is how one innocent-looking insert pattern can quietly choke a high-performance SQL Server.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-observing-the-impact-on-cpu\">Observing the Impact on CPU<\/h2>\n\n\n\n<p>Many analytic queries were running against this table during normal operations \u2014 including complex windowed aggregates like moving averages, deltas, and payload parsing. 
The query below is one such example of an analytic query in IoT-based environments:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >SELECT t.SensorId,\n       t.ReadingTime,\n       t.Value,\n       LEN(t.Payload) AS PayloadLength,\n       AVG(t.Value) OVER (PARTITION BY t.SensorId ORDER BY t.ReadingTime\n                          ROWS BETWEEN 50 PRECEDING AND CURRENT ROW) AS MovingAvg,\n       LEAD(t.Value, 1) OVER (PARTITION BY t.SensorId ORDER BY t.ReadingTime)\n           - LAG(t.Value, 1) OVER (PARTITION BY t.SensorId ORDER BY t.ReadingTime) AS ValueDelta,\n       CHARINDEX('GPS=', t.Payload) AS GpsStartPos,\n       DENSE_RANK() OVER (PARTITION BY t.SensorId ORDER BY t.ReadingTime DESC) AS RankByTime,\n       COUNT(*) OVER (PARTITION BY t.SensorId) AS TotalPointsForSensor\nFROM dbo.Telemetry t\nWHERE t.ReadingTime BETWEEN '2020-01-10' AND '2020-01-27'\nORDER BY t.SensorId, t.ReadingTime\nOPTION (MAXDOP 8);<\/pre><\/div>\n\n\n\n<p>On a healthy table, it runs fine. But due to massive page splits and bloated row sizes from backfills, logical reads ballooned, and each query had to scan more pages than necessary. CPU utilization spiked, and even basic reporting saw latency. 
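<\/p>\n\n\n\n<p>To confirm that the physical structure, rather than the query itself, is at fault, it helps to inspect fragmentation and page density directly. The following check is a minimal sketch using <code>sys.dm_db_index_physical_stats<\/code>, assuming the <code>dbo.Telemetry<\/code> table from this article:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >-- Fragmentation and page density of the Telemetry indexes\nSELECT i.name AS IndexName,\n       ps.avg_fragmentation_in_percent,\n       ps.avg_page_space_used_in_percent,\n       ps.page_count\nFROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Telemetry'),\n                                    NULL, NULL, 'SAMPLED') AS ps\nJOIN sys.indexes AS i\n    ON i.object_id = ps.object_id\n   AND i.index_id = ps.index_id;<\/pre><\/div>\n\n\n\n<p>High fragmentation combined with low page density is the signature of split-heavy inserts: the same rows occupy far more pages, and every scan pays for it.<\/p>\n\n\n\n<p>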
This wasn\u2019t because the analytics were expensive \u2014 it was because the physical structure of the table collapsed underneath massive page splits.<\/p>\n\n\n\n<p>In our simulated environment, this query gave 72% usage without any other load, though we were using parallelism:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"887\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-5-1024x887.png\" alt=\"In our simulated environment, this query gave 72% usage without any other load, though we were using parallelism.\" class=\"wp-image-107651\" srcset=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-5-1024x887.png 1024w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-5-300x260.png 300w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-5-768x666.png 768w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2025\/09\/image-5.png 1125w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-fixes-that-brought-it-all-back-under-control\">The Fixes That Brought It All Back Under Control<\/h2>\n\n\n\n<p>We didn\u2019t rely on a single fix. Instead, we addressed the immediate symptoms and redesigned parts of the system for long-term stability.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-changed-primary-key-to-a-surrogate-identity-key\">Changed Primary Key to a Surrogate Identity Key<\/h3>\n\n\n\n<p>The original clustered primary key on (SensorId, ReadingTime) was well-intentioned but unsuitable for a backfilled, out-of-order insert workload. Every late insert had to go back into already full index pages, triggering page splits. 
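<\/p>\n\n\n\n<p>The counter behind the graphs shown earlier can be sampled directly from <code>sys.dm_os_performance_counters<\/code>. The snippet below is a small sketch of that approach: <code>Page Splits\/sec<\/code> is a cumulative counter, so reading it twice and dividing by the interval gives the actual rate:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >-- Page Splits\/sec is cumulative; sample twice and take the difference\nDECLARE @before BIGINT, @after BIGINT;\n\nSELECT @before = cntr_value\nFROM sys.dm_os_performance_counters\nWHERE counter_name = 'Page Splits\/sec';\n\nWAITFOR DELAY '00:00:10';\n\nSELECT @after = cntr_value\nFROM sys.dm_os_performance_counters\nWHERE counter_name = 'Page Splits\/sec';\n\nSELECT (@after - @before) \/ 10.0 AS PageSplitsPerSec;<\/pre><\/div>\n\n\n\n<p>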
We replaced it with a sequential surrogate key using an identity column:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >CREATE TABLE dbo.Telemetry (\n    TelemetryId BIGINT IDENTITY(1,1) PRIMARY KEY CLUSTERED,\n    SensorId INT NOT NULL,\n    ReadingTime DATETIME NOT NULL,\n    Value FLOAT NULL,\n    Payload VARCHAR(MAX) NULL -- retained during initial fixes\n);<\/pre><\/div>\n\n\n\n<p>This change ensured all inserts remained sequential regardless of the timestamp, immediately improving write throughput and reducing page splits on the main <code>Telemetry<\/code> table to almost zero.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-created-indexes-on-staging-tables\">Created Indexes on Staging Tables<\/h3>\n\n\n\n<p>When loading backfilled data from per-sensor staging tables, we were running <code>ORDER BY<\/code> SensorId, ReadingTime against tables that had no indexes. That caused massive sort operations and heavy TempDB spills. To fix this, we added clustered indexes on staging tables from the application backend:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >CREATE CLUSTERED INDEX IX_sensor_1_UUID123 \nON sensor_1_UUID123 (SensorId, ReadingTime);<\/pre><\/div>\n\n\n\n<p>This reduced TempDB I\/O significantly and helped us avoid the 1\u20132 GB allocation spikes per session.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-long-term-architectural-fixes\">Long-Term Architectural Fixes<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-removed-payload-from-main-table\">Removed Payload from Main Table<\/h3>\n\n\n\n<p>One of the heaviest design issues was the use of <code>VARCHAR(MAX)<\/code> payloads embedded directly in the main table. These long strings significantly reduced the number of rows per page and made every insert more expensive. 
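<\/p>\n\n\n\n<p>The damage is easy to quantify: in <code>DETAILED<\/code> (or <code>SAMPLED<\/code>) mode, <code>sys.dm_db_index_physical_stats<\/code> reports the average record size, from which the number of rows that fit on an 8 KB page follows directly. A quick sketch, assuming the <code>dbo.Telemetry<\/code> table from this article:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >-- Average record size and approximate rows per page at the leaf level\nSELECT ps.avg_record_size_in_bytes,\n       8096 \/ ps.avg_record_size_in_bytes AS ApproxRowsPerPage, -- 8096 usable bytes per page\n       ps.page_count\nFROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Telemetry'),\n                                    1, NULL, 'DETAILED') AS ps\nWHERE ps.index_level = 0;<\/pre><\/div>\n\n\n\n<p>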
In the long-term fix, we removed this field entirely from the primary OLTP table and moved the data to file-based object storage. We then replaced the column with a pointer:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >-- Updated schema\nCREATE TABLE dbo.Telemetry (\n    TelemetryId BIGINT IDENTITY(1,1) PRIMARY KEY CLUSTERED,\n    SensorId INT NOT NULL,\n    ReadingTime DATETIME NOT NULL,\n    Value FLOAT NULL,\n    PayloadPath VARCHAR(500) NULL\n);<\/pre><\/div>\n\n\n\n<p>This required minor changes in the application layer to retrieve payloads only when necessary \u2014 usually in reporting or investigation workflows.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Established Index and Statistics Maintenance<\/h3>\n\n\n\n<p>We added weekly index rebuilds and statistics updates across the heavy-write tables to keep performance consistent. As we were running SQL Server&#8217;s Enterprise edition, we could perform online rebuilds at regular intervals:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >ALTER INDEX ALL ON dbo.Telemetry REBUILD WITH (ONLINE = ON, FILLFACTOR = 80);\nUPDATE STATISTICS dbo.Telemetry WITH FULLSCAN;<\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-implemented-purging-policies\">Implemented Purging Policies<\/h3>\n\n\n\n<p>To control table size and long-range scan costs, we defined a purging policy that deleted telemetry older than 90 days. 
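<\/p>\n\n\n\n<p>A single 90-day purge statement can hold long locks and balloon the transaction log, so in practice the delete is worth batching. A minimal sketch (the 5,000-row batch size is an arbitrary starting point, not a recommendation):<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >-- Purge in small batches until no qualifying rows remain\nDECLARE @rows INT = 1;\nWHILE @rows > 0\nBEGIN\n    DELETE TOP (5000) FROM dbo.Telemetry\n    WHERE ReadingTime &lt; DATEADD(DAY, -90, GETDATE());\n    SET @rows = @@ROWCOUNT;\nEND;<\/pre><\/div>\n\n\n\n<p>Each batch commits separately, which keeps log growth bounded and lets concurrent inserts breathe between batches.<\/p>\n\n\n\n<p>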
After such massive deletes we made sure to rebuild the indexes and update the statistics:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \" >DELETE FROM dbo.Telemetry\nWHERE ReadingTime &lt; DATEADD(DAY, -90, GETDATE());<\/pre><\/div>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-planned-for-partitioning\">Planned for Partitioning<\/h3>\n\n\n\n<p>To further isolate hot and cold data zones, we began designing monthly range-based table partitions.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Stress Testing via PowerShell<\/h2>\n\n\n\n<p>We simulated parallel inserts using PowerShell jobs:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:powershell decode:true \" >1..50 | ForEach-Object {\n    Start-Job -ScriptBlock {\n        Invoke-Sqlcmd -ServerInstance 'MyServer' -Database 'simple_talk' `\n                      -Query \"INSERT INTO ... SELECT FROM sensor_x ORDER BY SensorId, ReadingTime\"\n    }\n}\n<\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-outcome\">The Outcome<\/h2>\n\n\n\n<p>Page splits dropped from 3,000\/sec to under 5\/sec, CPU stabilized at around 30% at peak, TempDB contention almost disappeared, and insert latency fell by 80%.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-what-have-we-learned\">What Have We Learned?<\/h2>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>Payloads don\u2019t belong in clustered indexes<\/li>\n\n\n\n<li>Avoid mid-key inserts on composite primary keys<\/li>\n\n\n\n<li>Use <code>FILLFACTOR<\/code> wisely, but don\u2019t depend on it alone; as our simulation showed, page splits can have other causes<\/li>\n\n\n\n<li>Monitor page splits\/sec, <code>runnable_tasks_count<\/code>, and TempDB read\/write latency<\/li>\n<\/ul>\n<\/div>\n\n\n<p>In hindsight, the system didn\u2019t fail because of SQL Server, IoT, or page splits. 
It failed because of a gap between how the data was expected to behave and how it actually behaved. The database followed its instructions perfectly, but those instructions were based on assumptions that didn\u2019t survive contact with reality.<\/p>\n\n\n\n<p>This isn\u2019t just an IoT cautionary tale. Any system that experiences high write volumes, irregular insert patterns or underestimated payloads is vulnerable to the same silent sabotage. What saved the day wasn\u2019t a single clever fix. It was a willingness to revisit core design choices, monitor the right metrics, simulate the system in a test cell, and respond to the story the data was telling.<\/p>\n\n\n\n<p>Databases are remarkably obedient \u2014 and that&#8217;s both their strength and their danger.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Learn how IoT backfilling impacts OLTP systems and discover strategies to optimize performance for real-time data processing.&hellip;<\/p>\n","protected":false},"author":346249,"featured_media":105778,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143523,143524],"tags":[4168,4170,4150,4151],"coauthors":[159333],"class_list":["post-107644","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-databases","category-sql-server","tag-database","tag-database-administration","tag-sql","tag-sql-server"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/107644","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/346249"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/
simple-talk\/wp-json\/wp\/v2\/comments?post=107644"}],"version-history":[{"count":2,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/107644\/revisions"}],"predecessor-version":[{"id":107654,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/107644\/revisions\/107654"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media\/105778"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=107644"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=107644"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=107644"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=107644"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}