RBAR: ‘Row By Agonising Row’

Remi Gregoire describes the vice of RBAR Database Programming, 'Row By Agonising Row', and illustrates how the effect of RBAR can sometimes be felt only years after an application is released, when the database supporting the application grows.

RBAR is an acronym for ‘Row By Agonising Row’. It was coined by Jeff Moden, a very regular poster on SQLServerCentral.com.

If, like me, you have not had the valuable experience of working in a Very Large Database (VLDB) environment, you may have done some ‘Row By Agonising Row’ (RBAR) programming without seeing too many bad results. But as time goes by, and as the databases grow larger, the consequences of this style of coding become increasingly felt: you start getting help calls about the applications because they are getting slower by the minute, or even, apparently, not working at all.

RBAR is a consequence of coding in a strictly procedural way, rather than in a set-based way. It is different from poor coding; it is the result of adopting a mindset in which one always has to tell the computer, step by step, how to do something rather than, in a set-based approach, merely specifying the result one is aiming for. With relational databases, one tells the database what one wants, not how to do it, row by agonising row.

RBAR in Front-end code

When I first joined a company, my very first job was to do programming work on projects that used Access as the front-end and SQL Server as the back-end. I found several examples of RBAR processing in the front-end code. There was almost no back-end code on the server, except for views and stored procedures that did very little.

The team of programmers were all beginners, and they believed that they had programmed a system that worked correctly. It did the job they wanted it to do, in a timely manner, and gave the output they needed. That, however, was a few years, and 500 000 inserts, before my arrival. The basic logic in every function was something like this:

Open all the tables that we may, or may not, need to read from or write to; download all columns and all rows; then loop through the lot to do the job.
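Sketched in T-SQL cursor form for illustration (the original was client-side recordset code, and dbo.WorkOrders and its columns are hypothetical names), the pattern amounts to this:

    -- Fetch every row of the table, then filter and act on it
    -- one row at a time. The filter that belongs in a WHERE clause
    -- is buried in an IF inside the loop.
    DECLARE @OrderID int, @CustomerID int;

    DECLARE all_rows CURSOR LOCAL FAST_FORWARD FOR
        SELECT OrderID, CustomerID
        FROM   dbo.WorkOrders;            -- every row, no filter

    OPEN all_rows;
    FETCH NEXT FROM all_rows INTO @OrderID, @CustomerID;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        IF @CustomerID = 42               -- the filtering that should be a WHERE clause
            UPDATE dbo.WorkOrders
            SET    Reviewed = 1
            WHERE  OrderID = @OrderID;

        FETCH NEXT FROM all_rows INTO @OrderID, @CustomerID;
    END;

    CLOSE all_rows;
    DEALLOCATE all_rows;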

The code certainly worked when there were 1 000 rows in the database tables. Now, with nearly a million rows, the loops inside the loops inside the loops, nested 19 deep, took more and more time, both on the server and in the application, and took their toll on the nerves of the people who used it. The application was being forced to process the data in a particular way that was fine for a small data set but ridiculous for a large one.

My solution back then (four or five years ago) was to reduce to a minimum the number of rows downloaded from the tables on the server. Since I did not want to meddle too much with this critical part of the application, I stopped my optimisation there. This minor alteration made the application run almost ten times faster. I thought that this was not bad for a junior and, since everyone else was happy, I didn't push the envelope any further.

Now fast-forward three or four more years, and 400 000 more inserts into the various tables. The performance problem is as bad as, if not worse than, it was when I first joined.

A simple example may give you an idea. We have a ‘Bill of Work’ system where four buttons are present on a form: ‘Bill hours’, ‘Bill the rest’, ‘Print’, and ‘Cancel’. The ‘Cancel’ button just changes a few columns in the base table and deletes a few rows in a handful of other tables. Simple. However, because of the way the system was designed, it took eleven seconds to run this process on one of the larger Bills of Work, and around six seconds for the rest. This was a part of the application I had never touched at all, but I clearly remember that, when I first joined, this process took no more than a second. I would now consider even that to be extremely slow, but nobody at the time saw it as a problem.

I got back into this code last week to see what I could do with this button. The first thing I noticed was that six recordsets were opened at the very start of the function, and that those recordsets downloaded four of the biggest tables in the database. This part, by itself, took around five seconds.

There was a two-way branch in the function's logic, depending on which flags were set in the Bill. Whichever branch was taken, three, and only three, of the tables could be used when running the function; you would never need all six. I therefore moved each table download into its respective IF statement, which shaved a few seconds off the time. I then started to analyse the code. Four of the six tables were opened merely to delete rows. The code looped through all the rows, one row at a time, checking whether the ID matched the ID of the current Bill of Work; when it did, it deleted the row. Seeing this, I deleted about 100 lines of code and inserted four lines of my own, which read something like this:
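Along these lines, with hypothetical table names standing in for the real ones, and @BillID holding the ID of the current Bill of Work:

    -- One set-based DELETE per table replaces the whole row-by-row loop
    DELETE FROM dbo.BillHours   WHERE BillOfWorkID = @BillID;
    DELETE FROM dbo.BillParts   WHERE BillOfWorkID = @BillID;
    DELETE FROM dbo.BillCharges WHERE BillOfWorkID = @BillID;
    DELETE FROM dbo.BillNotes   WHERE BillOfWorkID = @BillID;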

(The logic was more complex than that, but you get the idea.)

After I deleted the code that downloaded those four tables, I made another test run and was pleased to see that the code now ran in less than two seconds. "That's great, a five-fold increase in speed. Now let's go for ten-fold." I changed the rest of the code so that the updates were done in a single batch rather than row by row, which shaved about another 1.5 seconds off the remaining time.
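The batched updates boil down to single statements of this shape, again with hypothetical names:

    -- One UPDATE touches every affected row at once,
    -- instead of one UPDATE per row from inside a loop
    UPDATE dbo.BillLines
    SET    Billed = 0
    WHERE  BillOfWorkID = @BillID;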

Having done that, I moved all that logic into a stored procedure on the server, and changed the front-end code to call that stored procedure. After a quick check of the execution plans and, as a result, the addition of two more indexes, the procedure now runs in less than 0.1 seconds, and that includes auditing and a re-query of the bill of work to display the changes. My final version takes about four to five times fewer lines of code to do exactly the same thing. And, now that the code is server-side, a change to this procedure will no longer require a redeployment of the application.
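The end result has roughly this shape; the procedure, table and index names are hypothetical, and the auditing and re-query steps are elided:

    CREATE PROCEDURE dbo.CancelBillOfWork
        @BillID int
    AS
    BEGIN
        SET NOCOUNT ON;

        -- flag the bill itself as cancelled
        UPDATE dbo.BillOfWork
        SET    Status = 'Cancelled'
        WHERE  BillOfWorkID = @BillID;

        -- set-based deletes of the dependent rows
        DELETE FROM dbo.BillHours   WHERE BillOfWorkID = @BillID;
        DELETE FROM dbo.BillParts   WHERE BillOfWorkID = @BillID;
        DELETE FROM dbo.BillCharges WHERE BillOfWorkID = @BillID;
        DELETE FROM dbo.BillNotes   WHERE BillOfWorkID = @BillID;

        -- auditing and the re-query of the bill of work would go here
    END
    GO

    -- the two supporting indexes suggested by the execution plans
    CREATE INDEX IX_BillHours_BillOfWorkID ON dbo.BillHours (BillOfWorkID);
    CREATE INDEX IX_BillParts_BillOfWorkID ON dbo.BillParts (BillOfWorkID);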

The conclusion is that SQL Server was designed to work with sets of data. It is more efficient to delete all rows where FK = 10 in a single statement than to delete them one at a time (first the row where PK = 1, then the one where PK = 2, and so on). The set-based approach will be faster than RBAR over 99.99% of the time. I know of only a couple of exceptions where RBAR is, at worst, as fast as set-based and, at best, outruns the hell out of it.
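In code, the difference looks like this (dbo.Child, PK and FK are placeholder names):

    -- Set-based: one statement handles every matching row
    DELETE FROM dbo.Child WHERE FK = 10;

    -- RBAR: one statement, and often one round trip, per row
    DELETE FROM dbo.Child WHERE PK = 1;
    DELETE FROM dbo.Child WHERE PK = 2;
    -- ...and so on, for every matching row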

RBAR in a trigger

So what is RBAR in a trigger? Let's say that you want to set a flag to true on all newly inserted rows. The RBAR code could look something like this:
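Something like this, using a hypothetical dbo.Employee table with an IsProcessed flag:

    CREATE TRIGGER trg_Employee_SetFlag ON dbo.Employee
    AFTER INSERT
    AS
    BEGIN
        DECLARE @ID int;

        -- pick up the ID from the inserted pseudo-table
        -- (this silently assumes a single-row insert)
        SELECT @ID = EmployeeID FROM inserted;

        UPDATE dbo.Employee
        SET    IsProcessed = 1
        WHERE  EmployeeID = @ID;
    END
    GO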

So, happy in the knowledge that it seems to have worked, you go home and have a good night. The next morning you come in to find the boss waiting in your parking space. He's fuming because the code you put into production is not working. After arguing that you tested the code and that it was working yesterday, you ask him to show you the query he ran.
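He shows you a single INSERT that adds many rows at once; something of this shape, with a hypothetical staging table standing in for his source data:

    -- One INSERT statement, many rows: the trigger fires once, not once per row
    INSERT INTO dbo.Employee (EmployeeID, LastName, FirstName)
    SELECT EmployeeID, LastName, FirstName
    FROM   dbo.EmployeeImport;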

To your dismay, you see that only one row, the last, was updated by the trigger. No matter how many rows that INSERT statement supplies, only one of them, the last, will ever be flagged. You go back and check your work and realise that, since you only set the variable once and do one update, you can't possibly update more than one row. So you now come up with this solution:
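A version that walks the inserted rows one at a time, sketched with the same hypothetical names:

    ALTER TRIGGER trg_Employee_SetFlag ON dbo.Employee
    AFTER INSERT
    AS
    BEGIN
        DECLARE @ID int;

        DECLARE ins CURSOR LOCAL FAST_FORWARD FOR
            SELECT EmployeeID FROM inserted;

        OPEN ins;
        FETCH NEXT FROM ins INTO @ID;

        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- one read of inserted and one UPDATE for every single row
            UPDATE dbo.Employee
            SET    IsProcessed = 1
            WHERE  EmployeeID = @ID;

            FETCH NEXT FROM ins INTO @ID;
        END;

        CLOSE ins;
        DEALLOCATE ins;
    END
    GO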

This works perfectly, you think, and so you push it into production. Now, six months later, you again find your boss waiting in your parking spot. Your company has recently merged with a huge corporation with over 100 000 employees, and the new company has decided to merge its data into your system and use your systems from now on. However, when they run a statement to insert all of their employees' information, the server hangs for several minutes without any response. They kill the query and try again, and again, with the same results.

You retest your trigger with a few rows and it seems to work fine. However, you realise that, in order to update a single row, the server first has to do a SELECT from INSERTED (which is actually read from the logs, and is a costly operation), and then an update. This works fine for a few rows, but with an insert of 100 000 rows the server actually has to run over 200 000 queries to produce the correct results and finish the insert. I will not go into all the other administrative pitfalls here (such as file growth or tempdb), but there are quite a few to avoid.

This insert operation is not small by any standards, but it should not take several minutes, or even several seconds, to run. You do some research and find a way to do the whole job in a single UPDATE statement, which should speed things up a lot:
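The set-based version, again with the same hypothetical names, joins the table to the inserted pseudo-table and updates all the new rows in one statement:

    ALTER TRIGGER trg_Employee_SetFlag ON dbo.Employee
    AFTER INSERT
    AS
    BEGIN
        -- one UPDATE, however many rows the INSERT supplied
        UPDATE e
        SET    IsProcessed = 1
        FROM   dbo.Employee AS e
               INNER JOIN inserted AS i
                       ON i.EmployeeID = e.EmployeeID;
    END
    GO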

Again, you see that the results are correct with this version of the trigger. But what about the speed? I will enable each version of the trigger in turn and run a ten-row insert, and I'll let you judge the difference in the execution plans:
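A simple harness for the comparison might look like this; run the same insert once with each version of the trigger in place (dbo.EmployeeImport is the hypothetical staging table from earlier):

    -- Show I/O and timing figures for everything the insert does,
    -- including the work done inside the trigger
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    INSERT INTO dbo.Employee (EmployeeID, LastName, FirstName)
    SELECT TOP (10) EmployeeID, LastName, FirstName
    FROM   dbo.EmployeeImport;

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;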

Unfortunately, even with a modest insert of 10 000 rows, the final run time is just unacceptable. In a VLDB environment, this trigger would have been a bottleneck for the application right from the very start because, even for a single row, the trigger still has to do two operations instead of one (if you don't count declaring and destroying the variable as an operation). Now, if you have the courage, or a very powerful machine, I will let you try the RBAR trigger at a million rows. I won't attempt it on my machine, because it would take a few days to run (assuming I didn't run out of RAM first).

RBAR Bad, Set-based good!

To conclude, I have given you a little taste here of what RBAR processing can do to a server. I used an archaic PIII 1.3 GHz with 512 MB of RAM for these tests. Even though it is a ten-year-old machine, it is still powerful enough to run fairly complex statements at a very good speed, and I'm sure most of you will get much better results than I did, even on your personal notebooks. However, I feel I have clearly made my point. I often see questions about this subject on the forums of SQLServerCentral.com, and I just can't seem to get the message across that any type of RBAR is often a very bad idea if you want any kind of performance out of your application. The RBAR concept applies to any looping technique, cursors included, wherever the code sits, whether on the server or on a client machine, as was the case for me in my first job; here I'm referring to client code that interacts with the server from within the loop. I'm not saying, by any means, that looping should be banned from programming, but rather that programming a database generally requires you to specify accurately the result you want, rather than telling the database how to go about getting it, row by agonising row.