{"id":97571,"date":"2023-08-21T12:01:11","date_gmt":"2023-08-21T12:01:11","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=97571"},"modified":"2024-04-16T19:21:45","modified_gmt":"2024-04-16T19:21:45","slug":"learning-postgresql-with-grant-introducing-vacuum","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/postgresql\/learning-postgresql-with-grant-introducing-vacuum\/","title":{"rendered":"Introducing VACUUM: Learning PostgreSQL With Grant"},"content":{"rendered":"<p><em><strong>This is part of an ongoing series of post by Grant as he takes his vast knowledge of SQL Server and applies it to adding PostgreSQL and shares it with you so you can skip learn from his triumphs and mistakes. For more you can go to the <a href=\"https:\/\/www.red-gate.com\/simple-talk\/collections\/series-learning-postgresql-with-grant\/\">Learning PostgreSQL with Grant<\/a> series home page<\/strong><\/em><\/p>\n\n<p>While there are many features within PostgreSQL that are really similar to those within SQL Server, there are some that are unique. One of these unique features is called <code>VACUUM<\/code>. In my head, I compare this with the <code>tempdb<\/code> in SQL Server. Not because they act in any way the same or serve similar purposes. They absolutely do not. Instead, it\u2019s because they are both fundamental to behaviors within their respective systems, and both are quite complex in how they work, what they do, and the ways in which we can mess them up.<\/p>\n<p><code>VACUUM<\/code> is a complex, deep, topic, so this article will only act as an introduction. I\u2019ll have to follow up with more articles, digging into the various behaviors of this process. However, let\u2019s get started. <code>VACUUM<\/code>, and the very directly related, <code>ANALYZE<\/code>, are vital processes in a healthy PostgreSQL environment. Most of the time these will be running in an automated fashion, and you\u2019ll never deal with them directly. However, since these processes are so important, I am going to introduce them now.<\/p>\n<h2>The PostgreSQL VACUUM Process<\/h2>\n<p>At its core, <code>VACUUM<\/code> is pretty simple. PostgreSQL doesn\u2019t actually, physically, remove the data when you issue a <code>DELETE<\/code> statement. Instead, that data is logically marked as deleted internally and then won\u2019t show up in queries against the table. For an <code>UPDATE<\/code> statement, a new row is added and the old row is logically marked as deleted. As you can imagine, if nothing is done, your database will eventually fill your disk (unless you define a <code>TABLESPACE<\/code> for the tables and limit its size, and that\u2019s another article). The first function then of <code>VACUUM<\/code> is to remove those rows from the table. That\u2019s it. Nice and simple.<\/p>\n<p>Well, of course it\u2019s not that simple.<\/p>\n<p><code>VACUUM<\/code> has a second behavior called <code>ANALYZE<\/code>. The <code>ANALYZE<\/code> process examines the tables and indexes, generating statistics, and then stores that information in a system catalog (system table) called <code>pg_statistic<\/code>. In short, <code>VACUUM<\/code> <code>ANALYZE<\/code> is the PostgreSQL equivalent of <code>UPDATE STATISTICS<\/code> in SQL Server.<\/p>\n<p>I told you that <code>VACUUM<\/code> was both complex and integral to the behavior of PostgreSQL. Without it you not only will fill your drives, but you won\u2019t have up to date statistics. There\u2019re even more behaviors wrapped up within the <code>VACUUM<\/code> process, but we\u2019re not going to cover them all here. In fact, we\u2019re not even going to go very deep into the two standard behaviors, cleaning up your tables and maintaining your statistics, because each of these are very deep topics all on their own. We are going to go over the basics of how these processes work and why you need to pay attention to them.<\/p>\n<h2>VACUUM<\/h2>\n<p>Making VACUUM work is very simple. This command will ensure that the space is retrieved from all tables:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">VACUUM;<\/pre>\n<p>While the space from the removed rows is reclaimed for reuse, the actual size of your database won\u2019t shrink. The exception to this is when there are completely empty pages at the tail end of the table. In that case, you can see the space being completely reclaimed.<\/p>\n<p>The PostgreSQL equivalent to <code>SHRINK<\/code> would be to run <code>VACUUM<\/code> like this:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">VACUUM (FULL);<\/pre>\n<p>This command will rebuild all the tables in the database into new tables. That comes with significant overhead and will most certainly cause blocking while the data is being moved. This will also cause significant IO on the system. However, it\u2019ll remove every empty page, reclaiming space for the operating system. Again, similar to <code>SHRINK<\/code>, running this regularly is not considered a good practice. In fact, <a href=\"https:\/\/www.red-gate.com\/simple-talk\/author\/ryan-booz\/\">Ryan Booz<\/a>, who kindly did some technical edits on this article says, \u201crunning this <strong>at all<\/strong> is not considered a good practice.\u201d The core issue is that while running this command, the standard automated <code>VACUUM<\/code> processes are blocked, possibly setting you up to need to run this process again, then again, then\u2026 Well, you get the point.<\/p>\n<p>You can also target specific tables when running <code>VACUUM<\/code> manually:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">VACUUM radio.antenna;<\/pre>\n<p>You can even specify a list of tables:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">VACUUM radio.antenna, radio.bands, radio.digitalmodes;<\/pre>\n<p>In either of these cases, instead of accessing every table in the database to which I have permissions, only the table or tables listed will go through the <code>VACUUM<\/code> cleanup process.<\/p>\n<p>To really see what\u2019s happening, we can take advantage of an additional parameter, <code>VERBOSE<\/code>. I\u2019m going to load up a table with some data and then remove that data. Then, we\u2019ll run <code>VACUUM<\/code>:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">INSERT INTO radio.countries \r\n(country_name)\r\nSELECT generate_series(1,15000,1);\r\n\r\nDELETE FROM radio.countries \r\nWHERE country_id BETWEEN 3 AND 12000;\r\n\r\nVACUUM (VERBOSE) radio.countries;<\/pre>\n<p>The results are as follows (yours may vary some, but should be similar):<\/p>\n<p><code>vacuuming \"hamshackradio.radio.countries\"<\/code><\/p>\n<p><code>finished vacuuming \"hamshackradio.radio.countries\": index scans: 1<\/code><\/p>\n<p><code>pages: 0 removed, 81 remain, 81 scanned (100.00% of total)<\/code><\/p>\n<p><code>tuples: 11998 removed, 3004 remain, 0 are dead but not yet removable<\/code><\/p>\n<p><code>removable cutoff: 1305, which was 0 XIDs old when operation ended<\/code><\/p>\n<p><code>new relfrozenxid: 1304, which is 3 XIDs ahead of previous value<\/code><\/p>\n<p><code>index scan needed: 64 pages from table (79.01% of total) had 11998 dead item identifiers removed<\/code><\/p>\n<p><code>index \"pkcountry\": pages: 77 in total, 58 newly deleted, 65 currently deleted, 7 reusable<\/code><\/p>\n<p><code>avg read rate: 12.169 MB\/s, avg write rate: 12.169 MB\/s<\/code><\/p>\n<p><code>buffer usage: 729 hits, 3 misses, 3 dirtied<\/code><\/p>\n<p><code>WAL usage: 388 records, 0 full page images, 96719 bytes<\/code><\/p>\n<p><code>system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s<\/code><\/p>\n<p>I\u2019m not 100% on everything going on here, as this series says, learning with Grant. However, there is easily spotted information. 11,998 tuples removed with 3004 remaining. You can also see the pages for the <code>pkcountry<\/code> index where there was 77 pages, but 58 were deleted and 7 are reusable. On top of all that, you get the performance metrics at the end for just how long everything took and the I\/O involved. This is all useful stuff.<\/p>\n<p>For anyone following along in the series, if you want to clean up your table after this little test, here are the scripts I used:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">DELETE FROM radio.countries \r\nWHERE country_id &gt; 2;\r\n\r\nALTER TABLE radio.countries \r\n    ALTER COLUMN country_id RESTART WITH 3;<\/pre>\n<p>I could probably run <code>VACUUM<\/code> on the table again to see more results.<\/p>\n<p>Now, there are simply metric tonnes more details on everything <code>VACUUM<\/code> does and how it does it. However, these are the basics. Let\u2019s shift over and look at <code>ANALYZE<\/code> for a minute.<\/p>\n<h2>ANALYZE<\/h2>\n<p>One thing that PostgreSQL shares with SQL Server is the use of statistics on tables as a means of row estimation within the query optimization process. And, just like SQL Server, these statistics can get out of date as the data changes over time. While there is an automated process to handle this (more on that later), you may find, just like SQL Server, that you need to intervene directly. So, built in to the <code>VACUUM<\/code> process is the ability to update statistics through the <code>ANALYZE<\/code> parameter:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">VACUUM (ANALYZE);<\/pre>\n<p>Just as with the <code>VACUUM<\/code> command at the start, this will examine all the tables that I have access to within the database in question and run <code>ANALYZE<\/code> against them.<\/p>\n<p>Interestingly, you can run <code>ANALYZE<\/code> as a separate process. This will do the same thing as the preceding statement:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">ANALYZE;<\/pre>\n<p>You can run the commands separately primarily as a mechanism of control and customization. The actions performed are the same. To see this in action, I want to look at the radio.countries table and the statistics there, after running ANALYZE to be sure that it reflects the two rows in the table:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">VACUUM (ANALYZE) radio.countries;\r\n\r\nSELECT\r\n\tps.histogram_bounds\r\nFROM\r\n\tpg_stats AS ps\r\nWHERE\r\n\tps.tablename = 'countries';<\/pre>\n<p>Now, as with SQL Server, there\u2019s a whole lot to statistics. I\u2019m just displaying the histogram here so we can see why kind of data might be in it. The results are here:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"432\" height=\"154\" class=\"wp-image-97572\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/07\/word-image-97571-1.png\" \/><\/p>\n<p>I\u2019m going to rerun the data load script from above, and then immediately look at the statistics in <code>pg_stats<\/code> again. Since there is an automatic <code>VACUUM<\/code> process (more on that later) that runs about once a minute by default, I want to see the stats before they get fixed by an automated <code>ANALYZE<\/code> process:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">INSERT INTO radio.countries \r\n(country_name)\r\nSELECT generate_series(1,15000,1);\r\n\r\nSELECT\r\n\tps.histogram_bounds\r\nFROM\r\n\tpg_stats AS ps\r\nWHERE\r\n\tps.tablename = 'countries';\r\n\r\nVACUUM (ANALYZE) radio.countries;\r\n\r\nSELECT\r\n\tps.histogram_bounds\r\nFROM\r\n\tpg_stats AS ps\r\nWHERE\r\n\tps.tablename = 'countries';<\/pre>\n<p>The first result set (not pictured) from <code>pg_stats<\/code> is exactly the same as the figure above. This is because they automated <code>VACUUM<\/code> process hasn\u2019t run <code>ANALYZE<\/code> yet and I didn\u2019t do a manual <code>ANALYZE<\/code>. Then, of course, I do the <code>ANALYZE<\/code> and the results of the histogram change to this:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-97573\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/07\/a-screenshot-of-a-computer-description-automatica-5.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated with low confidence\" width=\"744\" height=\"118\" \/><\/p>\n<p>It just keeps going from there, out to the width of the values in the histogram for the table (again, another article we\u2019ll be covering in the future).<\/p>\n<p>I can also take advantage of the <code>VERBOSE<\/code> parameter to see what\u2019s happening when <code>ANALYZE<\/code> runs. This time I\u2019ll just run the <code>ANALYZE<\/code> command though:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">DELETE FROM radio.countries \r\nWHERE country_id BETWEEN 3 AND 12000;\r\nANALYZE (VERBOSE) radio.countries;<\/pre>\n<p>And the output is here:<\/p>\n<p><code>analyzing \"radio.countries\"<\/code><\/p>\n<p><code>\"countries\": scanned 81 of 81 pages, containing 3004 live rows and 11998 dead rows; 3004 rows in sample, 3004 estimated total rows<\/code><\/p>\n<p>You can see that it\u2019s now scanned a smaller set of rows to arrive at a new set of statistics and a new histogram. You can also see the deleted rows in the output. I ran this separately so that it didn\u2019t do both a <code>VACUUM<\/code> and <code>ANALYZE<\/code>. This is how you can break these things down and take more direct control.<\/p>\n<p>I\u2019ve hinted at it several times throughout the article. There is an automatic <code>VACUUM<\/code> process that we need to discuss, the autovacuum deamon.<\/p>\n<h2>Autovacuum<\/h2>\n<p>Enabled by default, there is a deamon process within PostgreSQL that will automatically run both <code>VACUUM<\/code> and <code>ANALYZE<\/code> on the databases on your system completely automatically. The process is pretty sophisticated and highly customizable, so you can make a lot of changes to the base behavior.<\/p>\n<p>Basically, autovacuum runs against every database on the server. The default number of threads that can be operating at the same time is 3, set through <code>autovacuum_max_workers<\/code>, which you can configure. It launches every 60 second by default through the <code>autovacuum_naptime <\/code>value, also configurable. You\u2019re going to see a pattern, most of the settings can be configured.<\/p>\n<p>Then, there\u2019s a threshold to determine if a given table will go through the <code>VACUUM<\/code> and <code>ANALYZE<\/code> processes. <code>VACUUM<\/code> has to exceed the <code>autovacuum_vacuum_threshold<\/code> value, the default of which is 50 tuples, or rows. It\u2019s a little more complicated than that because there\u2019s a calculation involving the <code>autovacuum_vacuum_insert_threshold<\/code>, which has a default of 1,000 tuples, which is then added to the <code>autovacuum_vacuum_insert_scale_factor<\/code>, by default, 20% of the rows of a given table. This value is then multiplied by the number of tuples in the table. All of that lets us know which tables will get the <code>VACUUM<\/code> process run against them. ou can see the formula laid out <a href=\"https:\/\/www.postgresql.org\/docs\/current\/routine-vacuuming.html&quot; \\l &quot;AUTOVACUUM\">here in the documentation<\/a>.<\/p>\n<p><code>ANALYZE<\/code> is similar. The <code>autovacuum_analyze_threshold<\/code>, 50 tuples by default, is calculated against the <code>autovacuum_analyze_scale_factor<\/code>, 10% of the table, and the number of tuples to arrive at the analyze threshold value.<\/p>\n<p>All of these settings can be controlled at the server level, or, at the table level, allowing a high degree of control over exactly how both your automatic <code>VACUUM<\/code> and your automatic <code>ANALYZE<\/code> operate. You may find, similar to statistics updates in SQL Server, that the automated processes need to be either adjusted, or augmented with the occasional manual update. As stated earlier, statistics in PostgreSQL provide the same kind of information to the optimizer as they do in SQL Server, so getting them as right as possible is quite important.<\/p>\n<h2>Conclusion<\/h2>\n<p>As I said at the beginning, the <code>VACUUM<\/code> process is a very large, involved, topic. I\u2019ve only scratched the surface with this introduction. However, the basics are there. We have an automatic, or manual, process that cleans out deleted tuples. Then, we have an automatic, or manual, process ensuring that our statistics are up to date. While taking control of these processes and adjusting the automated behaviors, or running them manually is relatively straightforward, knowing when and where to make those adjustments is a whole different level of knowledge.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>While there are many features within PostgreSQL that are really similar to those within SQL Server, there are some that are unique. One of these unique features is called VACUUM. In my head, I compare this with the tempdb in SQL Server. Not because they act in any way the same or serve similar purposes&#8230;.&hellip;<\/p>\n","protected":false},"author":221792,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[53,143534],"tags":[158977,158976,159066],"coauthors":[6785],"class_list":["post-97571","post","type-post","status-publish","format-standard","hentry","category-featured","category-postgresql","tag-learningpostgresqlwithgrant","tag-planetpostgresqlgrantfritchey","tag-postgresql-101-webinar-sidebar"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/97571","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/221792"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=97571"}],"version-history":[{"count":4,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/97571\/revisions"}],"predecessor-version":[{"id":102168,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/97571\/revisions\/102168"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=97571"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=97571"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=97571"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=97571"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}