{"id":98424,"date":"2023-10-09T22:35:24","date_gmt":"2023-10-09T22:35:24","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=98424"},"modified":"2024-03-07T11:43:13","modified_gmt":"2024-03-07T11:43:13","slug":"postgresql-indexes-what-they-are-and-how-they-help","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/featured\/postgresql-indexes-what-they-are-and-how-they-help\/","title":{"rendered":"PostgreSQL Indexes: What They Are and How They Help"},"content":{"rendered":"<p>In the <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/postgresql\/what-is-an-execution-plan-and-how-to-find-it\/\">previous blog<\/a> in this series, we learned how to produce, read and interpret execution plans. We learned that an execution plan provides information about access methods, which PostgreSQL use to select records from a database. Specifically, we observed that in some cases PostgreSQL used sequential scan, and in some cases index-based access.<\/p>\n<p>It might seem to be a better idea to talk about indexes before we discussed execution plans, but query plans are a good place to start the performance journey and work our way in! This blog is going to be about indexes, why we need them, how they can help, and how they can make things worse.<\/p>\n<h2>What is an index?<\/h2>\n<p>What is an index? One might assume that any person who works with databases knows what an index is. However, a surprising number of people, including database developers and report writers and, in some cases, even DBAs, use indexes, even <strong>create<\/strong> indexes, with only a vague understanding of what indexes are and how they are structured. Having this in mind, let\u2019s start with defining what is an index.<\/p>\n<p>Since there are many different index types in PostgreSQL (and new index types are constantly created) we won\u2019t focus on structural properties to produce an index definition. Instead, we define an index based on its usage. A data structure is called an index if it is:<\/p>\n<ul>\n<li>A redundant data structure<\/li>\n<li>Invisible to an application<\/li>\n<li>Designed to speed up data selection based on certain criteria<\/li>\n<\/ul>\n<p>Let\u2019s discuss each of the bullet points in more detail.<\/p>\n<p>Redundancy means that an index can be dropped without any data loss and can be reconstructed from data stored in tables (the <code>postgres_air<\/code> database dump we will use with the examples comes without indexes, but you can build them after you restore the database. Full instructions are in the appendix of this article to create the database you will need if you wish to follow along with the examples.)<\/p>\n<p>Invisibility means that a user cannot detect if an index is present or absent when writing a query other than the time the query takes to process. That is, any query produces the same results with or without an index.<\/p>\n<p>Finally, an index is created with the hope (or confidence) that it improves performance of a specific query or (even better!) several queries. Although index structures can differ significantly among index types, the speed-up is achieved due to a fast check of some filtering conditions specified in a query. Such filtering conditions specify certain restrictions on table attributes.<\/p>\n<p>Figure 1 shows how an index can speed up the access to the specific table rows.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1346\" height=\"1078\" class=\"wp-image-98425\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/a-table-with-blue-lines-and-black-text-descriptio.png\" alt=\"A table with blue lines and black text\n\nDescription automatically generated with medium confidence\" \/><\/p>\n<p><strong>Figure 1. Index access<\/strong><\/p>\n<p>The right part of Figure 1 shows a table, and the left represents an index that can be viewed as a special kind of a table. Each row of the index consists of an index key and a pointer to a table row. The value of an index key usually is equal to the value of a table attribute. The example in Figure 1 has airport code as its value; hence, this index supports search by airport code.<\/p>\n<p>For this particular index, all values in the <code>airport_code<\/code> column are unique, so each index entry point to exactly one row in the table. However, some columns, like the column <code>city<\/code> in the <code>airport<\/code> table, can have the same value in multiple rows. If this column is indexed, the index must contain pointers to all rows containing this value of an index key. That is, a key may be logically associated with a list of pointers to rows rather than a single pointer.<\/p>\n<p>Figure 1 illustrates how to reach the corresponding table row when an index record is located; however, it does not explain why an index row can be found much faster than a table row. Let\u2019s take a closer look and find out.<\/p>\n<h2>Index Types<\/h2>\n<p>PostgreSQL supports a lot of types of indexes.<\/p>\n<p>For this article, I am going to only discuss B-Tree indexes because they are the most common index structure you will typically use. There are other types of indexes you can add to tables, but the B-Tree is the most common and what I will focus on for this article.\u00a0<\/p>\n<p>The basic structure of a B-tree is shown in Figure 2, using airport codes as the index keys. The tree consists of hierarchically organized nodes that are associated with blocks stored on a disk.<\/p>\n<p>Records in all blocks are ordered, and at least half of the block capacity is used in every block.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1501\" height=\"1018\" class=\"wp-image-98426\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/a-diagram-of-a-diagram-description-automatically.png\" alt=\"A diagram of a diagram\n\nDescription automatically generated\" \/><\/p>\n<p><strong>Figure 2. An Example of a B-Tree<\/strong><\/p>\n<p>The leaf nodes (shown in the bottom row in Figure\u00a02) contain index records exactly like those in Figure\u00a01; these records contain an index key and a pointer to a table row. Non-leaf nodes (located at all levels except the bottom) contain records that consist of the smallest key (in Figure\u00a02, the lowest alphanumeric value) in a block located at the next level and a pointer to this block.<\/p>\n<p>Any search for a key <code>K<\/code> starts from the root node of the B-tree. During the block lookup, the largest key <code>P<\/code> not exceeding <code>K<\/code> is found, and then the search continues in the block pointed to by the pointer associated with <code>P<\/code> until the leaf node is reached, where a pointer refers to table rows. The number of accessed nodes is equal to the depth of the tree. Of course, the key <code>K<\/code> is not necessarily stored in the index, but the search finds either the key or the position where it could be located.<\/p>\n<p>B-Tree indexes provide the widest variety of uses (for example, single row and ordered range lookups), and can be built for any data type for which we can define ordering (otherwise called <a href=\"https:\/\/www.cs.purdue.edu\/homes\/hosking\/m3\/reference\/ordinal.html\">ordinal types<\/a>), this is where I will start in this series.<\/p>\n<h2>When Indexes are Useful<\/h2>\n<p>Based on what we learned about indexes so far, would we say that indexes can help to speed up <strong>any<\/strong> query? Absolutely not! Whether indexes will be useful or not for a specific query, mostly depends on what type of query we are talking about. It should be no surprise that indexes are most useful for <strong>short queries<\/strong>.<\/p>\n<p>But wait \u2013 we need to produce one more definition here! Which queries are considered short? Does it depend on the size of a query? The code of the below query is very short; does it mean it\u2019s a short query?<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT \r\n     d.airport_code AS departure_airport,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0a.airport_code AS arrival_airport\r\nFROM\u00a0\u00a0airport a,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0airport d;<\/pre>\n<p>Spoiler alert \u2013 it is not! This query produces a Cartesian product of two copies of the airport table, generating a domain of all possible flights.<\/p>\n<p>The next query is \u201clonger\u201d:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT \r\n       f.flight_no,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0f.scheduled_departure,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0boarding_time,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0p.last_name,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0p.first_name,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0bp.update_ts as pass_issued,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0ff.level\r\nFROM flight f\r\n\u00a0\u00a0\u00a0JOIN booking_leg bl ON bl.flight_id = f.flight_id\r\n\u00a0\u00a0\u00a0JOIN passenger p ON p.booking_id=bl.booking_id\r\n\u00a0\u00a0\u00a0JOIN account a ON a.account_id =p.account_id\r\n\u00a0\u00a0\u00a0JOIN boarding_pass bp \r\n       ON bp.passenger_id=p.passenger_id\r\n\u00a0\u00a0\u00a0LEFT OUTER JOIN frequent_flyer ff \r\n       ON ff.frequent_flyer_id=a.frequent_flyer_id\r\nWHERE f.departure_airport = 'JFK'\r\n   \u00a0\u00a0\u00a0AND f.arrival_airport = 'ORD'\r\n    \u00a0\u00a0AND f.scheduled_departure BETWEEN\r\n                \u00a0\u00a0'2023-08-05' AND '2023-08-07'<\/pre>\n<p>However, it is a short query because the result contains just a handful of rows.<\/p>\n<p>Maybe, it\u2019s the number of rows in the output that matters? Wrong again! Look at the next query:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT \r\n   AVG(flight_length),\r\n   AVG(passengers)\r\nFROM (SELECT \r\n         flight_no,\r\n         scheduled_arrival -scheduled_departure \r\n                                 AS flight_length,\r\n         COUNT(passenger_id) passengers\r\n\u00a0     FROM flight f\r\n\u00a0\u00a0\u00a0   JOIN booking_leg bl ON bl.flight_id = f.flight_id\r\n\u00a0\u00a0    JOIN passenger p ON p.booking_id=bl.booking_id\r\nGROUP BY 1,2) a<\/pre>\n<p>This query produces the output of exactly one row, however, it is a long query because we need to go through all records in flight, booking_leg and passenger tables to obtain the result.<\/p>\n<p>Now, we are ready for a definition:<\/p>\n<p>A query is short when the number of rows needed to compute its output is small, no matter how large the involved tables are. Short queries may read every row from small tables but read only a small percentage of rows from large tables.<\/p>\n<p>How small is a \u201csmall percentage\u201d? It depends on system parameters, application specifics, actual table sizes, and possibly other factors. Most of the time, however, it means less than 10%. ratio of rows that are retained to the total rows in the table. This ratio is called <strong>selectivity<\/strong>.<\/p>\n<p>Figure 3 presents a graph which shows a dependency between selectivity of the query and a cost of data access with and without indexes.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"340\" height=\"290\" class=\"wp-image-98427\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/a-graph-with-colored-lines-description-automatica.png\" alt=\"A graph with colored lines\n\nDescription automatically generated\" \/><\/p>\n<p>Figure 3. Cost vs Selectivity graph<\/p>\n<p>Note, that PostgreSQL provides two different index-based access methods both of which will be covered in this series of blog posts.<\/p>\n<p>The graph in Figure 3 makes it evident why indexes are most useful for the short queries. For low \u2013 selectivity queries, we need to read a small number of rows, thus, the extra cost of additional access to the index pays off. The lower is a query selectivity, the more efficient indexes are. For queries with high selectivity, the cost of additional access to index quickly makes it more expensive than sequential scan.<\/p>\n<h2>Execution plans for short queries<\/h2>\n<p>When we optimize a short query, we know that in the end, we select a relatively small number of records. This means that the optimization goal is to reduce the size of the result set as early as possible. If the most restrictive selection criterion is applied in the first steps of query execution, further sorting, grouping, and even joins will be less expensive. Looking at the execution plan, there should be no table scans of large tables. For small tables, a full scan may still work.<\/p>\n<p>You can&#8217;t select a subset of records quickly from a table if there is no index supporting the corresponding search. That&#8217;s why short queries require indexes on larger tables for faster execution. If there is no index to support a highly restrictive query, one needs to be created. It might seem easy to make sure that the most restrictive selection criteria are applied first; however, this isn\u2019t always straightforward.<\/p>\n<p>Let\u2019s look at the following query:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT * \r\nFROM flight\r\nWHERE departure_airport='LAX'\r\n  AND update_ts BETWEEN '2023-08-13' AND '2023-08-15'\r\n  AND status='Delayed'\r\n  AND scheduled_departure \r\n          BETWEEN '2023-08-13' AND '2023-08-15'<\/pre>\n<p>Can you tell which selection criteria will be the most restrictive? Not without<\/p>\n<p>Delayed status might be the most restrictive, because ideally, on any given day, there are many more on-time flights than delayed flights.<\/p>\n<p>In our training database, we have a flight schedule for six months, so limiting it by two days might not be very restrictive. On the other hand, usually the flight schedule is posted well in advance, and if we are looking for flights where the timestamp of the last update is relatively close to the scheduled departure, it most likely indicates that these flights were delayed or canceled.<\/p>\n<p>Another factor that may be taken into consideration is the popularity of the airport in question. <code>LAX<\/code> is a popular airport, and for Listing 5-1, a restriction on <code>update_ts<\/code> will be more restrictive than on <code>departure_airport<\/code>.<\/p>\n<p>Let\u2019s look at the execution plan!<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1432\" height=\"379\" class=\"wp-image-98428\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/a-screenshot-of-a-computer-description-automatica-4.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>We can see that PostgreSQL query planner took advantage of three indexes on the flight table and performed bitmap index scan &#8211; a special kind of index-based data access which I will cover in detail in the next post. In short, the query planner uses three indexes: f<code>light_departure_airport, flight_update_ts<\/code> and<code> flight_scheduled_departure<\/code>. Scanning these three indexes helps to identify the blocks which can potentially contain the records which satisfy the search criteria, and then examines these blocks to recheck whether the conditions are indeed met.<\/p>\n<p>What is important for us now is that none of the existing indexes had a clear advantage over others. Does it mean we didn\u2019t build the most restrictive index? Possibly that\u2019s true, we will keep experimenting with additional indexes.<\/p>\n<p>What if we will restrict the time interval even further?<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT * \r\nFROM flight\r\nWHERE departure_airport='LAX'\r\n  AND update_ts \r\n       BETWEEN '2023-08-13' AND '2023-08-14'\r\n  AND status='Delayed'\r\n  AND scheduled_departure \r\n         BETWEEN '2023-08-13' AND '2023-08-14'<\/pre>\n<p>Do you think the execution plan will stay the same or will it change?<\/p>\n<p>Actually, it will!<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1432\" height=\"327\" class=\"wp-image-98429\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/a-screenshot-of-a-computer-description-automatica-5.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>Now, the conditions on both <code>update_ts<\/code> and <code>scheduled_departure<\/code> became significantly more restrictive than on departure airport, so only these two indexes will be used.<\/p>\n<p>How about changing the filtering on <code>departure_airport<\/code>? If we change it to <code>FUK<\/code>, the airport criterion will be more restrictive than selection based on update_ts and execution plan will change dramatically:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1429\" height=\"160\" class=\"wp-image-98430\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/word-image-98424-6.png\" \/><\/p>\n<p>As you can see, if all the search criteria are indexed, there is no cause for concern, PostgreSQL will choose the optimal execution plan. But if the most restrictive criterion is not indexed, the execution plan may be suboptimal, and likely, an additional index is needed.<\/p>\n<h2>Performance overhead<\/h2>\n<p>The performance improvement does not come for free. As an index is redundant, it must be updated when table data are updated. That produces some overhead for update operations that is sometimes not negligible. However, many database textbooks overestimate this overhead. Modern high-performance DBMSs use algorithms that reduce the cost of index updates, so usually, it is beneficial to create several indexes on a table.<\/p>\n<p>Did you have fun exploring indexes in PostgreSQL? Would you like to learn more about indexing? Look for the next up in this series.<\/p>\n<h2>Appendix \u2013 Setting up the training database<\/h2>\n<p>The instructions for installing the base database can be found in the Appendix of the <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/postgresql\/what-is-an-execution-plan-and-how-to-find-it\/\">first article in this series<\/a>. If you are starting here, use those instructions to download the database.<\/p>\n<p>The experiments in this article require you to create the following additional indexes on your copy of the postges_air database:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SET search_path TO postgres_air;\r\n\r\nCREATE INDEX flight_arrival_airport \r\n     ON flight\u00a0\u00a0(arrival_airport);\r\n\r\nCREATE INDEX booking_leg_flight_id \r\n     ON booking_leg\u00a0\u00a0(flight_id);\r\n\r\nCREATE INDEX flight_actual_departure \r\n     ON flight\u00a0\u00a0(actual_departure);\r\n\r\nCREATE INDEX boarding_pass_booking_leg_id \r\n     ON boarding_pass\u00a0\u00a0(booking_leg_id);\r\n\r\nCREATE INDEX booking_update_ts \r\n     ON booking  (update_ts);<\/pre>\n<p>Don\u2019t forget to run <code>ANALYZE<\/code> on all the tables for which you build new indexes:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk \">ANALYZE  flight;\r\n\r\nANALYZE  booking_leg;\r\n\r\nANALYZE  booking;\r\n\r\nANALYZE boarding_pass;<\/pre>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the previous blog in this series, we learned how to produce, read and interpret execution plans. We learned that an execution plan provides information about access methods, which PostgreSQL use to select records from a database. Specifically, we observed that in some cases PostgreSQL used sequential scan, and in some cases index-based access. It&#8230;&hellip;<\/p>\n","protected":false},"author":341426,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143523,53,143534],"tags":[159028,159066],"coauthors":[158986],"class_list":["post-98424","post","type-post","status-publish","format-standard","hentry","category-databases","category-featured","category-postgresql","tag-planetpostgresqlhenriettadombrovskaya","tag-postgresql-101-webinar-sidebar"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/98424","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/341426"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=98424"}],"version-history":[{"count":4,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/98424\/revisions"}],"predecessor-version":[{"id":100385,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/98424\/revisions\/100385"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=98424"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=98424"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=98424"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=98424"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}