A MySQL story: Can system engineers solve database issues using system tools?

A story about how a system engineer solved a MySQL performance issue written by Rajeshkumar Sasidharan.

I attempted to avoid a dry presentation because this is a technical subject. As a result, I reasoned, why not depict it as a story? After all, everyone enjoys stories, and we’ve all heard plenty of fancy fiction tales from our grandmothers and elders. Those who don’t want to read stories can just read the bolded words, which are entirely technical.

6:00 p.m. 8:00 p.m.

Outside, a brisk breeze blew, and the entire city had forgotten about summer in favor of the sea breeze… especially those who lived near the beach. This narrative begins near the ocean… On the top floor of one of the city’s tallest buildings, there is a spacious boardroom that seats 25 people and is completely air-conditioned. Around ten people sat on each side, while a senior person who was leading the meeting sat in the center and observed both sides. On the one side, sat members of the service provider(SP) that hosts and administers the customer’s infrastructure. On the other side sat his company’s (social networking platform) information technology and application team handles the applications and services.

A cartoon image showing several people seated at a conference table. The manager is seated at the head of the table

(All images sourced from https://www.pngegg.com/)

The plot is of this story not centered on the team senior member. Rather than that, it is built around the system engineer, who is based on the service provider side. The protagonist of the story is a service engineer we’ll call Z.

Z recalled how chilly it was outside when he initially entered the building, but how tense it was within the boardroom due to the discussion. Z began to focus on the customers who were arguing. “We did the MySQL query on the replica server’s, and it produced results promptly, therefore MySQL is operating great, and reading is also extremely fast…so there are no issues with MySQL,” a customer representative stated. He began by displaying the commands and outputs to the rest of the board room.

******************************************************************************

Select * from user LIMIT 1000;( for eg; Returning the first 1000 rows from a table user)

Show processlist; ( no issues)

******************************************************************************

Following that, one of the SP engineers presented data gathered during his testing of the customer’s infrastructure. He began by displaying the IOPS graph from the SQL lun/storage partition’s side and declaring that everything was OK, and that consumption was far below the allotted capacity. As a result, he believed that a large number of underutilized IOPS were still accessible for utilization in applications. He also performed read/write tests on the Replica SQL server’s OS level and discovered that it was operating fine. He recorded and provided documentation for the SAR data (used to collect performance data from Linux) throughout the experiment.

******************************************************************************

  • sar -u Shows CPU use
  • sar -P ALL Shows CPU usage
  • sar -r Memory usage
  • sar – S Display of memory
  • sar -b monitors overall IO activity;
  • sar -d monitors individual disk activity.

******************************************************************************

Furthermore, the lun use graph plainly showed that it was not completely utilized. Even when doing read/write tests at the operating system level, the graph shows only small spikes, suggesting that the storage bandwidth is not being used to its full potential. As a result, the storage provider claimed that this was not a storage issue, and that consumers should stay away from it completely.

*************** Utilization Reports from storage based on ****************************

  • Disk level
  • Raid level ( Redundant array of independent disks)
  • Lun level (Logical unit number)
  • Volume level
  • Processor Level
  • Cache utilization

******************************************************************************

A cartoon image showing several people watching a presentation by an engineer

A member of client-side asked, “Why don’t you configure RW (Read/Write) allocation as an 80/20 ratio, rather than the present 50/50 in the storage?”

“If even 50% of storage capacity is not fully utilized, why should we consider increasing it to 80/20 here?” asked the SP engineer. The conversation deteriorated from here into an escalating argument.

Z considered the environment in the boardroom to be quite hot. He focused on why he was currently sitting in this room and when/how/why this situation arose. To grasp this, we must slightly rewind this scenario.

Two months ago…

A cartoon image showing a man happy about some news

“It’s really a huge milestone for our business. We will be migrating and hosting a Client from one of the Big Cloud Providers to our managed hosting services. If we successfully complete it then, we can get huge orders in the future!” exclaimed the head of the company where Z worked. The entire company was happy about this new deal. The agreement mentioned managing only customer infrastructure and OS. Applications would be handled by the customer’s IT and application teams. Thus, all departments within Z’s firm were involved, including sales, marketing, accounting, solution architecture, engineering, provisioning, and project management. The company began acquiring and deploying hardware, internet circuits, and operating systems. The SP procured high-end servers, storage, and networking equipment from prominent industry manufacturers for this customer.

Once all necessary hardware and operating systems were obtained and installed, the servers were handed over to the customer’s application development team for application deployment.

AN- Before I begin, I’d like to clarify the types of services offered by providers.

Datacenter Hosting provider: Customers can purchase all or a portion of the rack space on which their servers, appliances, or network devices are placed. The total cost of ownership (TCO) and all capital expenditures (CAPEX) are included (one time).

Managed hosting service: the provider will acquire the appropriate hardware and host the clients’ operating systems and applications in its data center. Customers are required to sign a contract committing them to the service for a period of at least three or four years. The penalty is based on a violation of the service’s terms of service. Although there is no TCO, the model is based on OPEX. Monthly subscription payments that cover the cost of the service and associated equipment

Cloud hosting Service: Pay-per-use enables you to pay only for the resource/service based on usage. There is no TCO or Capex, Only OPEX – Monthly. Nowadays, cloud service providers provide attractive pricing structures in exchange for a long-term commitment.

Present-day evening

Z had to go for the evening, so he finished all his work and packed his laptop. A company Top level manager immediately phoned him and demanded to meet. Z got up from his seat, and, on the way, he thought that the Top level manager would ask some technical doubt or compatibility details about new requirements.

Alas, when he entered his Office, he was met with a serious-faced Top level manager. The Top level manager said, “Z, I’m glad you are here; we need your help, and we hope you can solve this problem.”

An image showing a seated man with his head in his hands

Z was nervous at this point, unsure how he could resolve the issue without first determining what the issue was.

“What is going on?” Z questioned.

“Are you aware that we received a significant order a few days ago for social networking company hosting and management?” the Top level manager inquired.

“Yes, we implemented the hardware and operating system and handed it over to the customer for application deployment,” Z stated with a nod.

The customer is having replication issues with MySQL; the database contains terabytes of data that was downloaded from their cloud provider and imported into the Master MySQL node, but replication to other replica nodes is not occurring at the desired rate. They initially suspected the network, but our network experts conducted an analysis and detected no faults, proving that the network is running as expected. Now the customer is claiming that the replication nodes lun/mount point is causing the issues. No one has addressed this issue, and it has been dragging on for two days. If this issue is not resolved, then the customer will continue to use their existing cloud provider for some time longer; thus, we may incur a loss not only in terms of not utilizing the new hardware and services, but it also affects our credibility.

Z shook his head in recognition of the gravity of the issue and remarked, “I’ll inform my manager and then go.”

“It was already agreed upon with your boss, so please inform him and look after the customer. Also notify me of any changes” the business leader grinned.

Z notified his boss and traveled to the Customer’s site to meet with their IT and application development teams. Z had to visit the customer with all customer-facing staff, including business managers, account managers, TAMs, and sales reps, as well as technical heads and engineers due to the scale of the order. As a result, a party of about ten individuals set out to visit a customer location. True, the bulk of corporate customer meetings will have only one or two engineers; the balance will be comprised of management.

Later that evening, 8:30 p.m.

Z found himself in the meeting room where the never-ending conflict occured. Z understood the Customer’s point of view and knew they were speaking from the facts they have acquired, but the SP engineers have also presented adequate data to establish that their infrastructure was in good condition. This does not, however, answer the customer’s issue because it demonstrates that replication occurs very slowly in all Secondary Replicas.

Z made an important point here: the customer are shifting from cloud to managed hosting services, which implied that the customer’s infrastructure components such as hardware, network, security, and storage operating system are all controlled by a service provider. Obviously, customer-side system administrators/information technology workers lacked full privileged access to the operating system, leading them to assume they are insecure, and the SP cannot expect full support from them until they understand the managed hosing service benefits. As a result, Z began searching the group for members of the DB team and identified one man. He smiled at him and was greeted with a pleasant smile in return. Z decided that he would be the one to support him moving ahead with this quest.

Z began conversing with everyone and expressed his appreciation for their assistance in supplying vital information that would enable him to work on that issue. Z was aware that there is a MySQL problem, but he continued to declare during the meeting that he will work on the storage and operating system sides. Nonetheless, he sought assistance from one MySQL member on the client team in the event of a necessity.

As a result, the client’s senior person assigned one of his MySQL engineers to assist Z, the same person who was smiling at him by happenstance. The database engineer, V, took Z to the cafeteria and gave him coffee. “Managing a MySQL database is a challenging task, and you are absolutely remarkable,” Z commented. V grinned in response to the compliment, because he was responsible for the management of multiple terabytes of MySQL data.

Z began collaborating with V on the operating system for the Replication servers. He explained the method to V while he worked on it. “There is no harm in being acquainted with system tools,” Z asserted.

An image showing two men at a desk. One is using a computer

He began by checking the system’s essential statistics using the SAR (System Activity Report).

**** Yum install sysstat ***This will aid with the SAR installation, in case it doesn’t exist.

He instantly began writing enormous files to the SQL Lun/Data partition, despite the fact that all system parameters appeared to be OK.

dd is a powerful and useful utility to convert and copy files, that comes pre-installed on Unix and Unix-like operating systems.

******************************************************************************

dd <if=source> of=destination>

dd if = /dev/sda of = /dev/sdb

******************************************************************************

Even later that evening: 11pm

Everything was operating normally, and they had not noticed any issues. As the night wore on, the strain on Z’s shoulder became more intense. V was really cooperative, yet he refused to accept that MySQL could be the culprit. He showed Z the identical query output, but the issue remained unresolved.

Z stepped out of the room and sat on a sofa, while analyzing the problem’s nature.

Storage is a monster that is still in development and underutilized, therefore, it will not be a problem. The MySQL servers are running normally, and no issues have been detected. He realized that if they did not address this issue soon, the firm may incur a loss as a result of the client failing to utilize the service on time, and the customer will most likely have to download a huge quantity of SQL database from the cloud provider again.

An image of a man thinking about how to solve a problem

He thought “Oh my god, why am I debugging like other engineers?”

Z wondered why his company sent him here. Think, think, think, think outside the box. Suddenly, something clicked in Z’s head: he knew what to test, but he took a step back and decided not to inform the database engineer about the problem now.

Z logged in and began working on the Database SQL servers at the operating system level; despite his competence with MySQL administration, he lacks access to client MySQL servers.

Z’s face lit up, and his eyes glowed now that he had a thought. “How did you spend your time outside?” V questioned, a smile on his face. “You seem to be positively bursting with vigor!” Z logged into Mysql Server and immediately began working with the Linux command-line tool strace.

“strace -p 3456 “ ( process id of mysql)

It began by displaying Fcntl, write, lseek, write, pread, and read… everything is rather slow due to the data being read line by line…

*******************example output –not real capture*******************************

9260 lseek(22, 0, SEEK_SET) = 0

9260 write(22, “15\xxxxxxxx.002936\n621015866\n173″…, 84) = 84

9260 read(9, 0x1d1a34b0, 16384) = -1 EAGAIN (Resource temporarily unavailable)

9260 fcntl(9, F_SETFL, O_RDWR) = 0

9260 read(9, <unfinished …>

9261 <… pread resumed> “\2\226\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\25\221<\0\0\0\0\0\0\0\0″…, 1024, 156095488) = 1024

9261 pread(181,”\2\251\0\0\0\3\25\224\320\0\0\0\3\25\225\f\0\0\0\3\25\225H\0\0\0\3\25\225t\0 0″…, 1024, 156102656) = 1024

9261 pread(181, “\2\251\0\0\0\3\25\214d\0\0\0\3\25\214\220\0\0\0\3\25\214\314\0\0\0\3\25\214\370\0\0″…, 1024, 156103680) = 1024

9261 pread(181, <unfinished …>

******************************************************************************

Z had now identified the root cause of the problem, which he had seen before with MySQL infrastructure. Even though he was confident about the MYSQL issue, Z simply phoned his mentor, a very senior socket level C programmer, and informed him of all he discovered. Z’s mentor confirmed that Z was right.

An image of two men at a desk. One is writing while the other explains something

With that information, Z informed V about the problem in MySQL. V didn’t admit it and said the same thing about how the problem was in MySQL. But now Z said the issue is not in your replica Server .. issue is with your Primary SQL server. V appeared taken aback and inquired, “How could you know, considering that you did not have access to SQL?”

Z began by displaying the output of the strace and describing each step in detail, including the delay associated with data reading.

“What does it mean when it indicates that the reading is slow?” V was perplexed. “When I query master, it appears as though everything is in order.”

Z explained to him that while there is no delay when querying the master with a limited number of records, there is a delay when attempting to read a record from a corrupted area.

Finally, at Z’s request, V agreed to repair the Master SQL table.

MySQL Repair can be done in multiple ways.

****************************************************************************

  1. Using mysqlscheck :

mysqlcheck -u<USERNAME> -p<PASSWORD> –databases <DB-NAME>

  1. Manual Repair inside database

Login to mysql and then execute “ REPAIR TABLE TABALENAME [OPTIONS]

  1. Using myisamchk for myisam Tables

Run myisamchk ‘–recover’option to recover MyISAM table – default option.

  • myisamchk –recover tablename

Run myisamchk –safe-recover option is slower than the default recovery option

  • myisamchk –safe-recover tablename

**************************************************************************

After the MySQL master table repair, database replication from Primary to Secondaries occurred at a rapid pace, showing that they were now fully utilizing their managed storage provider’s beast storage and network performance.

Once V had validated, the replication process took between 3 and 5 hours if the sync occured in this manner.

Even though it was after 12:30 a.m., Z called his Top level manager and manager to update them on the issue.

V also contacted his boss to inform him of the wonderful news.

Everyone was satisfied and returned home with a joyful disposition.

An image showing a happy man in a suit

For addressing this challenge, Z received an award from his firm.

Z’s learnings from the issue :

*****************************************************************************

1. When importing/exporting Mysql databases from a Cloud provider, always follow their best practices to avoid issues.

2. As service providers, we must continually remind ourselves that the consumer is always correct; even if something is erroneous, we must provide sufficient proof and justification to persuade and educate them.

3. Rather than attempting to remedy the problem’s impact, try to find the problem’s root cause.

******************************************************************************