A Knights Tale

There are so many lessons to be learned from the story of Knight Capital losing nearly half a billion dollars as a result of a deployment gone wrong.

The Knight Capital Group (KCG N) was an American global financial services firm engaging in market making, electronic execution, and institutional sales and trading. According to the recent order (File No.3.15570) against Knight Capital by U.S. Securities and Exchange Commissionâ, Knight had, for many years used some software which broke up incoming “parent” orders into smaller “child” orders that were then transmitted to various exchanges or trading venues for execution. A tracking ‘cumulative quantity’ function counted the number of ‘child’ orders and stopped the process once the total of child orders matched the ‘parent’ and so the parent order had been completed. Back in the mists of time, some code had been added to it which was excuted if a particular flag was set. It was called ‘power peg’ and seems to have had a similar design and purpose, but, one guesses, would have shared the same tracking function. This code had been abandoned in 2003, but never deleted. In 2005, The tracking function was moved to an earlier point in the main process. It would seem from the account that, from that point, had that flag ever been set, the old ‘Power Peg’ would have been executed like Godzilla bursting from the ice, making child orders without limit without any tracking function. It wasn’t, presumably because the software that set the flag was removed.

In 2012, nearly a decade after ‘Power Peg’ was abandoned, Knight prepared a new module to their software to cope with the imminent Retail Liquidity Program (RLP) for the New York Stock Exchange. By this time, the flag had remained unused and someone made the fateful decision to reuse it, and replace the old ‘power peg’ code with this new RLP code. Had the two actions been done together in a single automated deployment, and the new deployment tested, all would have been well. It wasn’t.

To quote…

“Beginning on July 27, 2012, Knight deployed the new RLP code in SMARS in stages by placing it on a limited number of servers in SMARS on successive days. During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added. Knight had no written procedures that required such a review.” (para 15)

“On August 1, Knight received orders from broker-dealers whose customers were eligible to participate in the RLP. The seven servers that received the new code processed these orders correctly. However, orders sent with the repurposed flag to the eighth server triggered the defective Power Peg code still present on that server. As a result, this server began sending child orders to certain trading centers for execution. Because the cumulative quantity function had been moved, this server continuously sent child orders, in rapid sequence, for each incoming parent order without regard to the number of share executions Knight had already received from trading centers. Although one part of Knight’s order handling system recognized that the parent orders had been filled, this information was not communicated to SMARS.” (para 16)

SMARS routed millions of orders into the market over a 45-minute period, and obtained over 4 million executions in 154 stocks for more than 397 million shares. By the time that Knight stopped sending the orders, Knight had assumed a net long position in 80 stocks of approximately $3.5 billion and a net short position in 74 stocks of approximately $3.15 billion. Knight’s shares dropped more than 20% after traders saw extreme volume spikes in a number of stocks, including preferred shares of Wells Fargo (JWF) and semiconductor company Spansion (CODE). Both stocks, which see roughly 100,000 trade per day, had changed hands more than 4 million times by late morning. Ultimately, Knight lost over $460 million from this wild 45 minutes of trading.

Obviously, I’m interested in all this because, at one time, I used to write trading systems for the City of London. Obviously, the US SEC is in a far better position than any of us to work out the failings of Knight’s IT department, and the report makes for painful reading. I can’t help observing, though, that even with the breathtaking mistakes all along the way, that a robust automated deployment process that was ‘all-or-nothing’, and tested from soup to nuts would have prevented the disaster. The report reads like a Greek Tragedy. All the way along one wants to shout ‘No! not that way!’ and ‘Aargh! Don’t do it!’. As the tragedy unfolds, the audience weeps for the players, trapped by a cruel fate.

All application development and deployment requires defense in depth. All IT goes wrong occasionally, but if there is a culture of defensive programming throughout, the consequences are usually containable. For financial systems, these defenses are required by statute, and ignored only by the foolish. Knight’s mistakes weren’t made by just one hapless sysadmin, but were progressive errors by an IT culture spanning at least ten years. One can spell these out, but I think they’re obvious. One can only hope that the industry studies what happened in detail, learns from the mistakes, and draws the right conclusions.