A while ago, James King of Redgate spoke to Tony Madonna, the Microsoft Platform Lead and SQL Enterprise Architect at BMW about the growing SQL Server estate at the car manufacturer, and how and why monitoring such a large estate is so important.
The conversation was recorded and the video above is on the Redgate YouTube channel. This is an edited transcript to remove pauses and repetition, and make the conversation accessible to anyone.
After the introductions, James asked Tony about the challenge of keeping on top of an expanding server estate.
Tony: It’s very difficult, especially since like any other company, we do more with less, so it gets to a point where you tend to go into reactive mode more than proactive mode. That was one of the reasons we engaged with Redgate.
We started to ask how we could be a little bit more proactive and get some better insights into systems that are having a problem and react in real time, to reduce the amount of business impact and downtime. That was one of the initial streams in getting us together and ever since then it’s been a great partnership.
James: You mentioned the trigger there, trying to be more proactive rather than reactive. What were you guys doing before?
Tony: Before, it was pretty much a typical database shop. So we had certain monitors and we obviously used Microsoft tools like SCOM and other things to do monitoring. We basically waited for a problem and then reacted to the problem. We always tried to be on the front end of patching and planning the infrastructure to the point where machines and systems were well allocated. We had a good distribution of databases across the infrastructure and machines weren’t overwhelmed or oversubscribed.
Doing that manually for maybe 30, 40, 50 machines is okay, but when you start getting into hundreds of servers that are operated globally in different time zones it becomes very difficult. So obviously you have to engage some third-party product or write your own and we didn’t really want to go down that road.
James: I guess not. It’s probably more difficult maintaining something like that than anything. When you talk about hundreds of servers, how many servers do you have and geographically how are you set up? Because obviously it’s a big operation.
Tony: I don’t have a firm number off the top of my head. We made a shift a few years back from being more virtual to being more physical because of licensing costs. So with the increase in footprint of newer hardware and better processing power, we were able to consolidate things down into standard Windows clusters and SQL clusters. I would say, ballpark, we have probably 400 to 500 servers running globally, but the server class is quite large, we’re talking multi-terabytes of RAM, so it’s a pretty big footprint.
James: That’s not a small estate to look after at the end of the day. How many people are looking after that?
Tony: We actually only have five internal people and we have a partner provider too. I would say cumulatively, globally, we are probably looking at about a dozen, maybe 15.
James: Wow, that’s pretty impressive with the size of estate. But I guess it comes back to this doing more with less, right?
Tony: Right, exactly. And trying to be as efficient as you can with what you have. We not only develop the solutions and try to stay in check with Microsoft’s release, we’re also part of the Microsoft Alpha program. We work with Microsoft on a lot of things that we do and we also automate our installation routines so that everything is standardized. So there’s a significant amount of scripting that we do on the front end to package the solution, to make sure all the scenarios look the same, and all the servers are set up the same way.
Lately there’s also been an increase in security issues and other concerns that carry an overhead and refocus us. So we don’t have the luxury of spending as much time on the operational side as we do on the dev side. We try to balance that as well as we can to be able to deliver a consistent, uniform environment with the highest amount of uptime. Thankfully, we’ve been very blessed as far as not having any type of catastrophic events for a while, knock on wood.
James: You mentioned you have to keep on top of patching. Does the speed of releases with Microsoft mean having a holistic view across your estate is much more important to you?
Tony: It does. One of the things that we’ve battled for a long time is just pure inventory, keeping track of things as they come and go and making sure everything stays up-to-date. We have our own internal security folks that validate Microsoft patches and other patches and mandate the application on certain timeframes and whatnot. It works well in conjunction with our third-party provider to be able to go in and schedule these types of things and try to keep Patch Tuesday, as Microsoft so lovingly calls it, a sort of monthly ritual.
James: Well at least you’re keeping on top of this stuff. Coming back to the monitoring side of things, you went from nothing to monitoring your estate of 500 servers. That’s a big estate. Was implementing a new tool quite a tricky thing to do?
Tony: Actually in all honesty, with Redgate, it was pretty easy. Obviously being global, we split regionally just to be able to coordinate across time zones a little bit easier, as well as contain WAN traffic. So we placed Redgate SQL Monitor servers per region and then essentially encapsulated those regional servers directly into those boxes.
I have to say it was a very easy configuration and setup. We did over-provision the servers to make sure we had enough capacity to store data for about three months minimum, and also collectively make sure there was no bottleneck with the Redgate server to be able to aggregate it. That way, if there was a problem and we needed real-time data, we could jump on and get good data slices as fast as we could.
It was quite honestly one of the easier things that we had to do. It was a little bit time-consuming obviously because we had to get all the servers configured, but thankfully the product is great and we don’t have to touch the servers from the product’s perspective. One of my number one rules of thumb is that we don’t run any agents on our SQL boxes, so the fact that Redgate is agentless is a great thing. We don’t want to have anything go rogue and cause any type of issues with our database servers.
So as a whole, again, it’s been great. It’s one of the few things that we don’t have to babysit a lot, I would say.
James: That’s good to hear. With the rolling out, you mentioned it was quite time-consuming. Did you have to do it bit by bit and then build it up over time, or did you try and load as much of it on as quickly as possible?
Tony: We basically separated each region, gave it to a point person and had them load the baseline per region all at once. Once we had those baselines, we went in and did some configuration. As things changed in inventory, we added servers, we subtracted servers, we just maintained them on an ad hoc basis.
We tend to run in spurts where, especially if we’re running a new release, we’ll get larger amounts of servers and new implementations, or we’ll deprecate something old like SQL Server 2008 R2. So we get ebbs and flows of hardware in and, as long as you keep up, it tends to be pretty easy to manage and maintain. Like I said, we’ve been pretty good so far.
James: How about from the cultural perspective? Obviously one of the hard things when you get a new tool is getting people to use the tools that are available to them. Is that ever a challenge for you?
Tony: It was a little bit of a challenge. The initial deployment of the tool was for us to use internally in operations and then we did allow restricted use for our heavy users. They’re users with very high capacity, high transactional databases who want to get their own datasets from them. It took about a year for them to remember ‘Oh hey, I have a tool here that I can use.’
We have other tools that will do system class information, but we didn’t have anything that would actually be SQL-based information. So when we rolled it out, it was always ‘Don’t forget you have Redgate SQL Monitor’, and now it’s sort of become second nature. If there is a problem, usually the client is on SQL Monitor before they even call it into us and escalate it as being some sort of an issue. So it’s actually an invaluable tool for those who take the time to learn it.
James: I guess it’s slightly more proactive than you were initially planning, if they’re already looking at it before you get the ticket.
Tony: It is, but that’s actually a good thing, because they know their application more than we do. We provide them the database functions on the servers but we don’t know what their database application is doing, because we don’t have a view into their data. That said, when they call us and say ‘Hey, we’re having a blocking issue’ or ‘We’re having a lag issue and we think it’s this’, it really comes down to us acting as an escalation group and giving our professional opinion on it. That’s opposed to spending an hour trying to figure out that someone forgot to put an index on the table.
James: That’s good. It’s nice to be able to give people the power to solve their own problems as well sometimes, right?
Tony: Right, and again, they generally don’t use it without relaying it to us. So you don’t have guys sitting there watching their applications because they like to see the pretty charts and graphs. But normally they will go in there and say ‘Hey, we’re experiencing something different, something is just not quite right’. It’s like when you wake up in the morning and you don’t feel right. You take an aspirin or ibuprofen and then move on your merry way. Sort of the same kind of concept – we provide them the ibuprofen, they take it, and if it doesn’t fix it they call us.
James: Are there any other unexpected benefits you had from this? Was there anything else that you thought ‘Oh, actually I didn’t expect that’?
Tony: Not so much unexpected but we do use it to be able to scale servers appropriately. So as applications increase in capacity, transactional volumes or user counts, we can do a little bit of projective analysis with the heuristics that we store. Then we’re able to say ‘Okay, this server is probably not going to cut it for longer than another six months, so let’s plan on moving this to a new cluster, new hardware or something that’s a little bit more robust’. That way we don’t experience any type of either downtime or degraded performance for the client base.
So that’s been a really good feature of it too. It’s also helped us to evaluate some of the third-party patches that come out that are known for posing possible disruptions to systems, like the Spectre patches and stuff like that. We were able to do pre- and post-patch analysis so that if, say, Microsoft or Intel state this patch is going to degrade performance by X, we can see if it’s a true statement. You can sort of get your own litmus test based on your environment, as opposed to accepting the generic span of 8% to 25% degradation. So that’s helped us too and been an unexpected plus.
James: That’s good to hear. Going into the future, are there things you think will change the way you use the tool, or is there an environment you expect to change that the tool will help you with?
Tony: I think the one thing that we don’t use the tool for that has a lot of power is the custom monitoring option. I think once we get into a more proactive mode where we have some time, or customers request certain things, we’d like to use the tool more to script custom monitoring components and be able to run some triggers specifically for customer requests.
We’ve done a few here and there but it hasn’t been something that we’ve really been able to dedicate time to, so it’s been ad hoc. Again, a customer requests it, we go in and we will do it. But I know based on the Redgate forums and the community that there are a lot of people doing a lot of things that are very effective in improving how you get data through the system, and that’s something that we’d like to probably get a little bit more in-depth with over time.
James: I guess the sky’s the limit with that sort of stuff. It’s just how creative you can be and, like you say, getting the time to do it. With regards to things like the cloud, is the cloud a big thing on your radar in the future?
Tony: The cloud beats me over the head every day. Yeah, it’s a very big movement to get more things into the cloud, to essentially offload building of new data centers or increasing our current capacity. One of our initiatives over the next 12 months is to start putting some things into the cloud and doing some testing. A lot of low-hanging fruit initially, just simple single-user databases or small tests for products, to get the people and the users in there, so we can leverage the cloud for what it should be used for.
For me, it’s more of an up/down kind of scenario, so I could spin something up pretty quickly, use it for a period of time, break it down and only have to pay for what I’ve used. That’s opposed to what often happens now, where we’ll spin up some machines for testing or PSE and then, three months later, the customer finally tells us ‘Oh, we don’t need that anymore’ when they actually finished using it weeks ago. In the meantime we’ve tied up resources and space well in excess of what we needed. So that’s a plus that will hopefully come shortly.
James: I guess it’s about keeping a closer eye on these things and making sure you’re not wasting money, time or resources. This ‘more with less’ theme is pretty prevalent with you guys, so not tying up resources you just don’t need is probably a big thing. From the future perspective as well, where do you see monitoring going?
Tony: I think eventually it’s going to get a little bit more in line with AI practices. So monitoring will get to the point where someone will write some sort of open source kernel that will actually seed itself. As it sees things change, it will be able to differentiate if it’s a problem, an anomaly, or a trend, so that you won’t have to waste time troubleshooting things that really are just computer idiosyncrasies.
I think anybody that is a consumer of compute always expects this thing to be running 24/7. Those of us that have been in the industry for a while realize it’s still a machine, right? So it still has errors, it still overheats, it has component failure and a whole bunch of other things. To maintain that level of reliability, you can be proactive from the human perspective, but proactive from the monitoring perspective, where it is sort of self-seeding on its own, would be a nice thing. And I think eventually that will come. I’m not a big proponent, like Mark Cuban is, of the idea of AI and robotics taking over the world. The reality is, the more we can apply data science and the AI component to analyzing what we have, the more efficiently it will allow us to run and more effectively deliver the solutions that we have.
James: It’s something I firmly believe in. Automation isn’t going to take our jobs, it’s just going to allow us to elevate to the next level of what we need to do.
Tony: I think it’s going to give us more headaches, because at the end of the day somebody has to do the automation, right? And again it’s a computer, it’s a piece of code, it will break. I’m really not convinced my life will get any easier.
James: Maybe not for a while at least. This has been great, Tony. I’ve had a good insight into what you’ve been up to for a while, and I think this is really interesting stuff. People don’t understand the level you’re working at these days at BMW. From my perspective, a car has almost become software on wheels, so I think this gives a really nice insight into what you have to deal with and how you got from A to B. So I really appreciate your time.