{"id":95127,"date":"2022-11-26T18:19:30","date_gmt":"2022-11-26T18:19:30","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=95127"},"modified":"2022-11-26T18:20:55","modified_gmt":"2022-11-26T18:20:55","slug":"backing-up-mysql-part-5-big-data","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/backing-up-mysql-part-5-big-data\/","title":{"rendered":"Backing Up MySQL Part 5: Backing Up Big Data"},"content":{"rendered":"<p><a href=\"https:\/\/www.statista.com\/statistics\/871513\/worldwide-data-created\/\">According to research made by Statista in 2022<\/a>, the amount of data that is being worked with globally is expected to reach more than 180 zettabytes. Part of that is because more and more data is being introduced to the online world \u2013 but another part has its roots in software development. There are approximately 24.3 million software developers in the world (source: <a href=\"https:\/\/www.inapps.net\/how-many-software-developers-are-in-the-world\/\">InApps<\/a>), and with the number increasing every year, there\u2018s bound to be a lot of work with data.<\/p>\n<p>The one thing that has always helped software developers work with data are databases \u2013 databases are not all made equal though. Some are said to be a fit for working with bigger data sets (MongoDB comes to mind), some are fit for timescale data (think TimescaleDB), some are a good fit for other use cases.<\/p>\n<p>However, MySQL is MySQL \u2013 the RDBMS has been here for ages, and undeniably, it\u2018s here to stay. 
As with every database management system, MySQL has its upsides and downsides \u2013 one of the most notable downsides of MySQL is that <em>many people think it\u2019s not a fit for working with big data<\/em>.<\/p>\n<p>However, that&#8217;s not exactly true &#8211; MySQL powers some of the biggest websites in the world, from Facebook (for Facebook, MySQL is one of the main database management systems of choice, as the social network stores almost all of its data in it) to <a href=\"https:\/\/breachdirectory.com\/blog\">some of the biggest &amp; fastest data breach search engines in the world, letting people secure themselves from identity theft.<\/a><\/p>\n<p>MySQL is a powerful but complex beast, and in this blog, we will tell you how to back up big data sets stored within it.<\/p>\n<h2><span style=\"font-weight: 400\">The Infrastructure of MySQL<\/span><\/h2>\n<p>Before working with bigger data sets inside MySQL, you should familiarize yourself with its infrastructure. The most frequently used storage engine specific to MySQL is InnoDB, together with its counterpart XtraDB developed by Percona. Both InnoDB and XtraDB have a couple of properties unique to themselves:<\/p>\n<ol>\n<li>The functionality of InnoDB is centered around the InnoDB buffer pool \u2013 a memory area that caches table data and indexes as they are accessed. The bigger it is, the faster queries that touch cached data (<code>SELECT<\/code>, <code>LOAD DATA INFILE<\/code>) will complete.<\/li>\n<li>InnoDB supports the ACID model, guaranteeing data atomicity, consistency, transaction isolation, and durability. 
Note that even though InnoDB offers full ACID guarantees by default, it can also be configured to exchange some of those guarantees for speed (and vice versa.)<\/li>\n<li>InnoDB has a couple of acquaintances \u2013 while InnoDB data is stored in the ibdata1 file, InnoDB also has a couple of log files \u2013 <code>ib_logfile0<\/code> and <code>ib_logfile1<\/code> \u2013 that store the redo log. Redo logs record how to re-apply changes made by committed transactions during crash recovery, while undo logs \u2013 which record how to roll uncommitted changes back \u2013 live in the system tablespace (ibdata1) or in dedicated undo tablespaces. The log files can typically be found in the MySQL data directory \u2013 for example, in the <code>\/var\/lib\/mysql\/mysql*.*.**\/data<\/code> directory if Linux is in use, or in the <code>\/bin\/mysql\/mysql*.*.**\/data<\/code> directory if you find yourself using Windows (*.*.** represents the MySQL version): <br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/11\/d-redgate-blogs-may-2022-backing-up-big-data-sets.png\" alt=\"\" width=\"433\" height=\"280\" \/><\/li>\n<li>All of the settings relevant to InnoDB can be configured in my.cnf.<\/li>\n<\/ol>\n<p>Now, as far as backups are concerned, we don\u2019t need to know much about the internals of InnoDB (or any other MySQL storage engine for that matter) <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/mysql\/backing-up-mysql-part-1-mysqldump\/\">because we can simply take them using mysqldump<\/a>, <a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/backing-up-mysql-part-4-mydumper\/\">mydumper<\/a>, or through phpMyAdmin by heading over to the Export tab:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/11\/d-redgate-blogs-may-2022-backing-up-big-data-sets-1.png\" alt=\"\" width=\"423\" height=\"203\" \/><\/p>\n<p>All of those tools will provide us with logical backups that can be used to restore 
data whenever we please. However, for those of us who work with bigger data sets, none of those tools is a great option \u2013 because the <code>INSERT<\/code> statements contained in the backups they produce come with a lot of overhead, developers frequently find themselves needing to dip their toes in other waters instead.<\/p>\n<h2><span style=\"font-weight: 400\">Backing Up Big Data<\/span><\/h2>\n<h3><span style=\"font-weight: 400\">Logical Backups<\/span><\/h3>\n<p>In order to back up data sets that are bigger in size, developers need to keep a couple of things in mind:<\/p>\n<ol>\n<li>First off, if a logical backup is being taken, we should avoid creating a backup filled with <code>INSERT INTO<\/code> statements \u2013 <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/mysql\/introducing-the-mysql-insert-statement\/\">this article should provide you with a little more information about what these are<\/a>, but the main reason such statements are to be avoided is the overhead MySQL has to shoulder for each of them. In order to execute a single <code>INSERT<\/code> statement, MySQL has to check permissions, open the necessary tables, initialize the query, acquire the locks needed for the query to execute, run the query to completion, close the tables, and perform additional clean-up operations. Each step might be relatively quick in the grand scheme of things, but multiplied across hundreds of millions of rows, the overhead adds up fast.<\/li>\n<li>Working with big data sets would require a dedicated server, or at the very least, a VPS. 
This one has probably already been taken care of (<a href=\"https:\/\/www.plesk.com\/blog\/business-industry\/big-data-hosting-provider\/\">you did investigate your hosting provider before choosing one, right<\/a>?), but nonetheless, choosing the right server for our big data project is always a necessity.<\/li>\n<li>How many databases are we backing up? How many tables are inside of those databases? Do all of those need to be backed up? You would be surprised how often we can limit the amount of data that is being backed up by simply thinking ahead.<\/li>\n<li>Did we work with bigger data sets before? Past experience teaches us more lessons than we might imagine \u2013 and if we don\u2019t have much experience, we can also just ask around: if our colleagues are well-versed in the field, the answers to the most pressing questions shouldn\u2019t be too far away.<\/li>\n<\/ol>\n<p>Once the basics are covered and we\u2019ve decided what we\u2019re backing up, it\u2019s time to move on. The best way to restore a logical backup of a bigger data set is the <code>LOAD DATA INFILE<\/code> query \u2013 since the query comes with many of its own bells and whistles, it\u2019s significantly faster than replaying ordinary <code>INSERT<\/code> statements:<\/p>\n<ul>\n<li>MySQL needs little overhead to parse the query.<\/li>\n<li>Users can easily skip lines or columns.<\/li>\n<li>Users can easily specify the columns to load the data into at the end of the query.<\/li>\n<\/ul>\n<p>It\u2019s important to note that MySQL offers two ways to use <code>LOAD DATA<\/code> to load data into our infrastructure \u2013 <code>LOAD DATA LOCAL<\/code> is the other variant. The local option only works if the <code>local_infile<\/code> option hasn\u2019t been set to 0 \u2013 it can be enabled within my.cnf underneath the <code>[mysql]<\/code> and <code>[mysqld]<\/code> headings. 
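To see why the flat, delimiter-separated files that <code>SELECT ... INTO OUTFILE<\/code> produces and <code>LOAD DATA INFILE<\/code> consumes are so much cheaper to process than per-row <code>INSERT<\/code> statements, consider a minimal Python sketch. It merely mimics the file format \u2013 the rows, the table shape, and the pipe delimiter are all fabricated for illustration; MySQL itself is not involved:

```python
import csv
import io

# Hypothetical rows we want to "export" - SELECT ... INTO OUTFILE would
# produce a file in this same flat, delimiter-separated shape.
rows = [
    (1, "alice", "alice@example.com"),
    (2, "bob", "bob@example.com"),
]

# Write a pipe-delimited file, mimicking FIELDS TERMINATED BY '|'.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
writer.writerows(rows)
dump = buffer.getvalue()

# Reading the file back requires no SQL parsing, no per-statement permission
# checks, and no per-statement lock acquisition - just one split per line,
# which is why LOAD DATA INFILE outpaces millions of individual INSERTs.
restored = [tuple(line.split("|")) for line in dump.splitlines()]
print(restored)
```

The same idea scales to hundreds of millions of rows: the per-row cost stays a simple string split instead of a full statement execution.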
The <code>LOAD DATA LOCAL<\/code> query is widely considered to be insecure because the file is read by the client and sent over to the server \u2013 a compromised server could, in theory, ask the client to transfer any file the client can read. The local option also:<\/p>\n<ul>\n<li>Invokes certain security functions of MySQL that must be adhered to.<\/li>\n<li>Changes the behavior of MySQL when searching for the file to import.<\/li>\n<li>In some cases (when dealing with errors, etc.), has the same effect as <code>IGNORE<\/code> does.<\/li>\n<\/ul>\n<p>The most important security considerations are explained <a href=\"https:\/\/dev.mysql.com\/doc\/mysql-security-excerpt\/5.7\/en\/load-data-local-security.html\">here<\/a>.<\/p>\n<p>The main reason <code>LOAD DATA INFILE<\/code> works so well with bigger data sets, though, is that the files it consumes only contain the data that is being backed up, separated by a specific character (\u201c:\u201d, \u201c|\u201d, \u201c,\u201d, Tab, etc.) \u2013 MySQL can load the data straight into a specific column without performing any additional operations and knows where each value \u201cends\u201d as well. A basic query looks like so:<\/p>\n<p><code>LOAD DATA INFILE 'file.ext' INTO TABLE demo_table [options];<\/code> where:<\/p>\n<ul>\n<li><code>file.ext<\/code> is a file bearing any extension other than SQL (.txt, .csv, etc.);<\/li>\n<li><code>demo_table<\/code> defines the table that the data will be loaded into.<\/li>\n<li>After the file and the table are defined, we can choose to define certain options:\n<ul>\n<li><code>FIELDS TERMINATED BY<\/code> will make MySQL treat a certain character as the column separator (useful when the fields in our file are delimited by something other than the default Tab character): <br \/>\n<code>LOAD DATA INFILE 'file.ext' INTO TABLE demo_table FIELDS TERMINATED BY '|';<\/code><\/li>\n<li><code>CHARACTER SET<\/code> will make MySQL use a specific character set when importing the data. 
This feature can be very important when loading data that comes from a different geographical place than the existing data (e.g. if our data is from England and we\u2019re importing data from Korea, China, Russia, etc.) \u2013 it will help us avoid errors when importing such data sets \u2013 learn more about charsets <a href=\"https:\/\/dev.mysql.com\/doc\/refman\/8.0\/en\/charset.html\">here.<\/a><br \/>\n<code>LOAD DATA INFILE 'file.ext' INTO TABLE demo_table CHARACTER SET latin1;<\/code><\/li>\n<li><code>PARTITION<\/code> lets MySQL insert data into specific partitions \u2013 that\u2019s very useful when working with bigger partitioned data sets \u2013 the partitions can be specified like so (<code>p1<\/code>, <code>p2<\/code>, and <code>p3<\/code> define the partitions): <br \/>\n<code>LOAD DATA INFILE 'file.ext' INTO TABLE demo_table PARTITION (p1, p2, p3);<\/code><\/li>\n<li><code>IGNORE n LINES<\/code> (or <code>IGNORE n ROWS<\/code> \u2013 the two are interchangeable) can be used to skip a certain number of lines or rows at the beginning of the file \u2013 this feature can be very useful when importing big data sets \u2013 perhaps we only want to import rows starting from row #25527? 
The example below skips the first 25,526 lines (<code>ROWS<\/code> would work just as well): <br \/>\n<code>LOAD DATA INFILE 'file.ext' INTO TABLE demo_table IGNORE 25526 LINES;<\/code><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>All of the options can be found in <a href=\"https:\/\/dev.mysql.com\/doc\/refman\/8.0\/en\/load-data.html\">the MySQL documentation<\/a>.<\/p>\n<p>As far as backups of big data are concerned, the most popular options are as follows:<\/p>\n<ol>\n<li><code>FIELDS TERMINATED BY<\/code><\/li>\n<li><code>PARTITION<\/code><\/li>\n<li><code>CHARACTER SET<\/code><\/li>\n<li><code>IGNORE n LINES|ROWS<\/code><\/li>\n<\/ol>\n<p>Each of those options has its own upsides (see above), and when combined with an exceptionally large data set (think 100M rows and above), they make the <code>LOAD DATA INFILE<\/code> operation even more powerful than before.<\/p>\n<p>To take a backup that can later be restored using the <code>LOAD DATA INFILE<\/code> statement, on the other hand, we\u2019d need to use the <code>SELECT [columns] INTO OUTFILE<\/code> statement:<\/p>\n<p><code>SELECT * FROM demo_table INTO OUTFILE 'backup.txt' [options];<\/code><\/p>\n<p><a href=\"https:\/\/dev.mysql.com\/doc\/refman\/8.0\/en\/load-data.html\">The options remain the same<\/a>. That\u2019s it!<\/p>\n<p>Another consideration worth mentioning is the <code>IGNORE<\/code> keyword: it makes MySQL turn errors raised while loading the data into warnings and carry on \u2013 think rows that don\u2019t have enough columns, rows terminated with the wrong character, duplicate-key conflicts, etc. \u2013 to make use of the keyword in <code>LOAD DATA INFILE<\/code> operations, specify it like so:<\/p>\n<p><code>LOAD DATA INFILE 'path\/file.txt' IGNORE INTO TABLE demo_table [options];<\/code><\/p>\n<p>Note that <code>IGNORE<\/code> is a modifier of <code>LOAD DATA<\/code> (and of some other statements, such as <code>INSERT<\/code>) \u2013 it cannot be specified in a <code>SELECT ... INTO OUTFILE<\/code> statement, so it\u2019s applied on the restore side rather than when the backup is taken.<\/p>\n<p>The <code>IGNORE<\/code> keyword has more use cases than we might think \u2013 can you imagine how much time is saved by turning thousands or even millions of errors into warnings?<\/p>\n<h2><span style=\"font-weight: 400\">Physical Backups<\/span><\/h2>\n<p>As far as physical backups are concerned, backing up big data sets is not that different from backing up smaller ones \u2013 here\u2019s what we have to consider:<\/p>\n<ol>\n<li>We can either take a hot or cold backup (a hot backup is taken while the database is still running, while a cold backup is taken while the database is down.)<\/li>\n<li>We need to consider whether we will use tools or take the backup manually (the tools that come to mind in this space include MySQL Enterprise Backup and others.)<\/li>\n<\/ol>\n<p>If you decide to take a hot backup, most of the time you will be advised to use <a href=\"https:\/\/dev.mysql.com\/doc\/mysql-enterprise-backup\/8.0\/en\/\">mysqlbackup (MySQL Enterprise Backup)<\/a> as it provides good results with minimal downtime.<\/p>\n<p>If we decide on a cold backup, the following steps should help:<\/p>\n<ol>\n<li>Shut down MySQL.<\/li>\n<li>Copy all of the files relevant to your storage engine of choice \u2013 if you\u2019re using InnoDB, take a backup of ibdata1, ib_logfile0, and ib_logfile1 and store them somewhere safe (if you\u2019re using MyISAM, simply take a backup of the .MYD and .MYI files in your MySQL data directory and stop here.)<\/li>\n<li>If you\u2019re using InnoDB, take a copy of all of the tables in the data directory of MySQL (start from the data directory \u2013 often <code>\/var\/lib\/mysql<\/code> \u2013 if you find yourself using Linux; if you\u2019re using Windows, follow the path given below): <br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"\" 
src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/11\/d-redgate-blogs-may-2022-backing-up-big-data-sets-2.png\" alt=\"\" width=\"594\" height=\"178\" \/><\/li>\n<li>Finally, take a copy of my.cnf \u2013 on Linux, the file is typically found in <code>\/etc\/mysql\/<\/code> or <code>\/etc\/<\/code>, depending on the distribution \u2013 if you find yourself using Windows, the file is called my.ini and it can be found here (keep in mind that taking a copy of the file can be useful for reviewing old settings as well):<\/li>\n<\/ol>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/11\/d-redgate-blogs-may-2022-backing-up-big-data-sets-3.png\" alt=\"\" width=\"503\" height=\"222\" \/><\/p>\n<p>For those who are interested, the settings within the file will most likely look like so &#8211; if you want to, feel free to explore the file, then come back to this blog:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2022\/11\/my.cnf-parameters.png\" alt=\"\" width=\"384\" height=\"155\" \/><\/p>\n<p>Here\u2019s why performing all of these steps is a necessity:<\/p>\n<ol>\n<li>Shutting down MySQL will ensure that nothing interferes with our backup operation.<\/li>\n<li>The backup of ibdata1 will preserve the InnoDB system tablespace, while backing up ib_logfile0 and ib_logfile1 will preserve the redo logs.<\/li>\n<li>Taking a copy of the .frm and .ibd files is necessary because, with <code>innodb_file_per_table<\/code> enabled (the default since MySQL 5.6.6), the actual table data is kept out of ibdata1 \u2013 an .ibd file contains the data and indexes of a table, while an .frm file contains its metadata (in MySQL 8.0, .frm files were replaced by the transactional data dictionary.)<\/li>\n<li>Taking a copy of my.cnf (or my.ini if we find ourselves using Windows) is an absolute necessity as well because the file contains all of the settings relevant to all storage engines in 
MySQL. That means that even if we are using a legacy storage engine like MyISAM and want to continue running the engine on the server that we\u2019re loading the backup onto for some reason (MyISAM provides faster <code>COUNT(*)<\/code> operations without a <code>WHERE<\/code> clause because it stores the exact row count in its table metadata \u2013 InnoDB does not), we can.<\/li>\n<li>Finally, make sure to store everything in a safe and accessible place \u2013 doing so allows us to recover data whenever we need to.<\/li>\n<\/ol>\n<p>To restore a hot backup taken by using MySQL Enterprise Backup, once again refer to our blog; to restore a cold physical backup taken by following the steps above, follow the steps below (the steps are specific to InnoDB):<\/p>\n<ol>\n<li>While MySQL is still down, copy your my.cnf file over to your server and adjust the settings within the file to your liking.<\/li>\n<li>Place the per-database folders copied from the data directory during the backup back into the data directory.<\/li>\n<li>Place the ibdata1 file and the log files back into the data directory of MySQL (see above for an example.)<\/li>\n<li>Start MySQL.<\/li>\n<\/ol>\n<p>To recover your MyISAM data, simply copy the data folder over to your server. No further action is necessary because MyISAM, unlike InnoDB, doesn\u2019t keep additional data in shared tablespace or log files.<\/p>\n<p>To take and restore physical backups of InnoDB, we can also use PITR (Point-In-Time Recovery) \u2013 some say that the PITR method is significantly easier, but that depends on the person. 
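At its core, the cold-backup and restore procedure above is nothing more than copying files while mysqld is stopped. The simulation below sketches that copy step in Python against a mock data directory \u2013 the directory layout, database name, and file contents are fabricated for the demo, and in practice you would copy the real data directory with your tool of choice:

```python
import pathlib
import shutil
import tempfile

# A mock MySQL data directory - the file names follow the InnoDB layout
# described above, but the directory and its contents are fabricated.
root = pathlib.Path(tempfile.mkdtemp())
datadir = root / "data"
(datadir / "demo_db").mkdir(parents=True)
for name in ("ibdata1", "ib_logfile0", "ib_logfile1"):
    (datadir / name).write_bytes(b"\x00" * 16)  # system tablespace + redo logs
(datadir / "demo_db" / "demo_table.ibd").write_bytes(b"\x00" * 16)  # per-table data

# The cold-backup copy step: with mysqld stopped, copy the whole data
# directory (tablespace, redo logs, and per-database folders) somewhere safe.
backup_dir = root / "backup"
shutil.copytree(datadir, backup_dir)

# Restoring is the same copy in reverse: put the files back, then start MySQL.
print(sorted(p.name for p in backup_dir.rglob("*")))
```

Note that the simulation skips the crucial real-world precondition: the server must actually be shut down first, or the copied files may be inconsistent.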
Anyway, <a href=\"https:\/\/dev.mysql.com\/doc\/refman\/8.0\/en\/point-in-time-recovery.html\">the MySQL manual should walk you through all of the necessary steps<\/a>.<\/p>\n<h2><span style=\"font-weight: 400\">Summary<\/span><\/h2>\n<p>Backing up big data sets in MySQL is not the easiest thing to do \u2013 however, with knowledge of how the main storage engine (InnoDB) works internally and of the queries we can run to make our backup process faster, we can overcome every obstacle in our way.<\/p>\n<p>Of course, as with everything, there are certain nuances, so <a href=\"https:\/\/dev.mysql.com\/doc\/refman\/en\/\">make sure to read all about them in the MySQL documentation chapters<\/a> covering the features you need to employ \u2013 come back to this blog for more information about databases and the backups within them, and until next time!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This blog will guide you through backing up big data sets in MySQL with ease. 
Have a read!&hellip;<\/p>\n","protected":false},"author":339547,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2],"tags":[5564,4168,5854],"coauthors":[146040],"class_list":["post-95127","post","type-post","status-publish","format-standard","hentry","category-blogs","tag-backups","tag-database","tag-mysql"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/95127","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/339547"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=95127"}],"version-history":[{"count":8,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/95127\/revisions"}],"predecessor-version":[{"id":95140,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/95127\/revisions\/95140"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=95127"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=95127"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=95127"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=95127"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}