{"id":110087,"date":"2026-05-29T12:00:00","date_gmt":"2026-05-29T12:00:00","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=110087"},"modified":"2026-05-13T16:03:17","modified_gmt":"2026-05-13T16:03:17","slug":"how-data-scientists-work-with-postgresql-from-python-and-how-to-write-queries-that-dont-break-at-scale","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/postgresql\/how-data-scientists-work-with-postgresql-from-python-and-how-to-write-queries-that-dont-break-at-scale\/","title":{"rendered":"How data scientists work with PostgreSQL from Python (and how to write queries that don\u2019t break at scale)"},"content":{"rendered":"\n<p><strong>In this article,<\/strong> <strong>learn how PostgreSQL powers data science workflows, including query execution, performance optimization, indexing, data retrieval, and more.<\/strong><\/p>\n\n\n\n<p>There\u2019s a part of the data science stack that rarely gets discussed. Not because it\u2019s unimportant, but because it\u2019s already been decided long before you arrive. Somewhere upstream, engineers chose a <a href=\"https:\/\/db-engines.com\/en\/ranking\" target=\"_blank\" rel=\"noreferrer noopener\">relational database<\/a>. In many cases, they chose <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/postgresql\/postgresql-basics-getting-started-with-psql\/\" target=\"_blank\" rel=\"noreferrer noopener\">PostgreSQL<\/a>. And, since it often becomes a &#8216;background&#8217; system, it&#8217;s very easy to underestimate its impact.<\/p>\n\n\n\n<p>It&#8217;s seen as just a place to pull data from before the \u201creal work\u201d begins in <a href=\"https:\/\/www.red-gate.com\/simple-talk\/development\/dotnet-development\/10-reasons-python-better-than-c-sharp\/\" target=\"_blank\" rel=\"noreferrer noopener\">Python<\/a>. It sits beneath notebooks, pipelines, and dashboards, rarely drawing attention unless anything breaks. That perception, however, is misleading. Yes, PostgreSQL stores data. Yet it also decides <em>how<\/em> that data is accessed, <em>how<\/em> it\u2019s shaped, and in many cases, <em>how<\/em> it\u2019s computed.<\/p>\n\n\n\n<p>PostgreSQL is not designed to optimize for shortcuts. It prioritizes correctness: validating data strictly, enforcing <a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/working-within-constraints\/\" target=\"_blank\" rel=\"noreferrer noopener\">constraints<\/a>, and adhering closely to <a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/what-is-sql\/\" target=\"_blank\" rel=\"noreferrer noopener\">SQL<\/a> standards &#8211; even when doing so introduces additional overhead. That can make it appear slower in trivial cases, but it&#8217;s precisely this design that allows it to behave predictably under load.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-dual-purpose-backbone-of-postgresql\">The dual-purpose backbone of PostgreSQL<\/h2>\n\n\n\n<p>PostgreSQL occupies a unique position in the data ecosystem because it supports both <a href=\"https:\/\/www.red-gate.com\/de\/blog\/what-is-data-warehousing\/\" target=\"_blank\" rel=\"noreferrer noopener\">transactional (OLTP) and analytical (OLAP) workloads<\/a> within the same system. This hybrid capability (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Hybrid_transactional\/analytical_processing\" target=\"_blank\" rel=\"noreferrer noopener\">HTAP<\/a>) means that the same database handling user transactions is also responsible for reporting, feature generation, and <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/oracle-databases\/introduction-to-analytic-functions-part-1-2\/\" target=\"_blank\" rel=\"noreferrer noopener\">analytical queries<\/a>.<\/p>\n\n\n\n<p>That dual role introduces constraints, since analytical queries compete with production workloads instead of running in isolation. A poorly written <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/postgresql\/eager-aggregation-in-sql-queries\/\" target=\"_blank\" rel=\"noreferrer noopener\">aggregation<\/a> or <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/postgresql\/understanding-sql-join-types\/\" target=\"_blank\" rel=\"noreferrer noopener\">join<\/a> can impact application latency just as much as inefficient transactional logic. It&#8217;s exactly why indexing strategy and <a href=\"https:\/\/www.red-gate.com\/simple-talk\/resources\/books\/simple-talk-published-books\/sql-server-execution-plans\/\" target=\"_blank\" rel=\"noreferrer noopener\">execution planning<\/a> are crucial to responsible system design.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1007\" height=\"413\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/04\/image.gif\" alt=\"An image showing a graph of PostgreSQL's support of both OLTP and OLAP workloads.\" class=\"wp-image-110088\"\/><figcaption class=\"wp-element-caption\"><em>PostgreSQL supports both OLTP and OLAP workloads<\/em><\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-data-scientist-s-entry-point\">The data scientist\u2019s entry point<\/h2>\n\n\n\n<p>Many workflows still begin with <a href=\"https:\/\/en.wikipedia.org\/wiki\/Comma-separated_values\" target=\"_blank\" rel=\"noreferrer noopener\">comma-separated values (CSV)<\/a> exports. While convenient during early experimentation, this approach introduces long-term problems. Data becomes stale, transformation logic is lost outside <a href=\"https:\/\/www.red-gate.com\/simple-talk\/devops\/database-devops\/database-version-control\/\" target=\"_blank\" rel=\"noreferrer noopener\">version control<\/a>, and reproducibility depends on manual steps that are difficult to track.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-explaining-the-postgresql-direct-query-mandate\">Explaining the PostgreSQL direct query mandate<\/h2>\n\n\n\n<p>Direct querying of PostgreSQL changes this dynamic substantially: the SQL query becomes a reproducible, executable definition of how data is constructed rather than just a retrieval mechanism. Data scientists operate on the source of truth, retrieving <em>exactly<\/em> the required subset of data, <em>exactly<\/em> when needed.<\/p>\n\n\n\n<p>In production environments, this extends to <a href=\"https:\/\/www.qwak.com\/post\/feature-engineering-pipeline\" target=\"_blank\" rel=\"noreferrer noopener\">feature generation pipelines<\/a>, dashboards, and inference systems. They all depend on live queries so, at that point, the database is no longer upstream of the system &#8211; it&#8217;s actively participating in computation.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-risk-of-breakage-at-scale\">The risk of breakage at scale<\/h2>\n\n\n\n<p>A script that works well with ten thousand rows can completely fail with one hundred million, simply as queries that used to return results quickly start to slow down as the planner selects inefficient paths. Python processes can run out of memory after retrieving more data than they can handle. Connections build up until the database simply can&#8217;t take on any more. <\/p>\n\n\n\n<p>Large results accumulate the network, making data transfer the main expense. None of these problems show up early &#8211; they are, in fact, often found (too) late. Hence, recognizing <em>how<\/em> these failures occur is the first step to preventing them.<\/p>\n\n\n\n<section id=\"my-first-block-block_83c5b8111be328021f9a2f11bdde51b8\" class=\"my-first-block alignwide\">\n    <div class=\"bg-brand-600 text-base-white py-5xl px-4xl rounded-sm bg-gradient-to-r from-brand-600 to-brand-500 red\">\n        <div class=\"gap-4xl items-start md:items-center flex flex-col md:flex-row justify-between\">\n            <div class=\"flex-1 col-span-10 lg:col-span-7\">\n                <h3 class=\"mt-0 font-display mb-2 text-display-sm\">Simple Talk is brought to you by Redgate Software<\/h3>\n                <div class=\"child:last-of-type:mb-0\">\n                                            Take control of your databases with the trusted Database DevOps solutions provider. Automate with confidence, scale securely, and unlock growth through AI.                                    <\/div>\n            <\/div>\n                                            <a href=\"https:\/\/www.red-gate.com\/solutions\/overview\/\" class=\"btn btn--secondary btn--lg\" aria-label=\"Discover how Redgate can help you: Simple Talk is brought to you by Redgate Software\">Discover how Redgate can help you<\/a>\n                    <\/div>\n    <\/div>\n<\/section>\n\n\n<h2 class=\"wp-block-heading\" id=\"h-python-drivers-amp-execution-explained\">Python drivers &amp; execution, explained<\/h2>\n\n\n\n<p>Python interacts with PostgreSQL through a small set of well-established drivers, each operating at a different level of abstraction. <a href=\"https:\/\/pypi.org\/project\/psycopg2\/\" target=\"_blank\" rel=\"noreferrer noopener\">Psycopg2<\/a> offers stability and direct control, <a href=\"https:\/\/www.psycopg.org\/psycopg3\/\" target=\"_blank\" rel=\"noreferrer noopener\">Psycopg3<\/a> introduces improved type handling and async capabilities, and <a href=\"https:\/\/www.sqlalchemy.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">SQLAlchemy<\/a> provides abstraction and <a href=\"https:\/\/learn.microsoft.com\/en-us\/dotnet\/framework\/data\/adonet\/sql-server-connection-pooling#:~:text=In%20practice%2C%20most%20applications%20use,optimization%20technique%20called%20connection%20pooling.\" target=\"_blank\" rel=\"noreferrer noopener\">connection pooling<\/a>.<\/p>\n\n\n\n<p>The choice reflects workload characteristics. Batch pipelines prioritize simplicity and determinism, while high-concurrency systems benefit from asynchronous execution and pooling.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-the-execution-lifecycle\">The execution lifecycle<\/h3>\n\n\n\n<p>Regardless of the driver, execution follows the same path once a query is issued. Calling <code>cursor.execute()<\/code> triggers a multi-stage process inside PostgreSQL.<\/p>\n\n\n\n<p>The database parses the SQL, the planner selects an execution strategy, and the executor runs that plan while streaming results back to the client. The planning phase is where PostgreSQL decides <em>how<\/em> to run the query, whether to use an index, and in what order operations should occur.<\/p>\n\n\n\n<p>The planner is the critical piece, as its decisions depend on how accurately it can estimate data distribution. If statistics are outdated, the planner may choose inefficient strategies such as scanning entire tables instead of using indexes, or selecting the wrong join algorithm. <\/p>\n\n\n\n<p>Hence, the same query can behave differently across environments, depending on how accurate the planner\u2019s assumptions are.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"986\" height=\"221\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/04\/image-3.gif\" alt=\"An image showing a graph of how the same query can behave differently across environments.\" class=\"wp-image-110091\"\/><figcaption class=\"wp-element-caption\"><em>Result returned to client<\/em><\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-parameterized-queries-in-postgresql-what-are-they\">Parameterized queries in PostgreSQL &#8211; what are they?<\/h3>\n\n\n\n<p>Constructing SQL queries through <a href=\"https:\/\/en.wikipedia.org\/wiki\/String_interpolation\" target=\"_blank\" rel=\"noreferrer noopener\">string interpolation<\/a> introduces both security risks <em>and<\/em> performance inefficiencies. Parameterized queries prevent <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/database-administration-sql-server\/sql-injection-how-it-works-and-how-to-thwart-it\/\" target=\"_blank\" rel=\"noreferrer noopener\">SQL injection<\/a> and allow PostgreSQL to reuse execution plans. For high-frequency queries, this saves real latency.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-efficient-retrieval-patterns-streaming-vs-loading-which-is-best\">Efficient retrieval patterns (streaming vs. loading) &#8211; which is best?<\/h3>\n\n\n\n<p>As datasets grow, retrieval strategy becomes critical. Client-side cursors that load entire result sets into memory quickly become infeasible. Server-side cursors enable streaming, allowing rows to be processed incrementally.<\/p>\n\n\n\n<p>This shift changes how pipelines are designed. Instead of loading data and then processing it, systems operate on streams of results. Memory usage becomes predictable and pipelines can handle datasets that exceed available resources. This pattern is particularly useful in <a href=\"https:\/\/www.sqlshack.com\/an-overview-of-etl-and-elt-architecture\/\" target=\"_blank\" rel=\"noreferrer noopener\">ETL (extract, transform, load)<\/a> pipelines and long-running analytical jobs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-how-to-avoid-the-notebook-trap\">How to avoid the &#8216;notebook trap&#8217;<\/h2>\n\n\n\n<p>Many pipelines start life inside a notebook. Data is loaded into memory, transformed using Python, and iterated on quickly. This works well for small datasets but introduces inefficiencies at scale.<\/p>\n\n\n\n<p>The most common issue is <a href=\"https:\/\/www.geeksforgeeks.org\/graphql\/what-are-over-fetching-and-under-fetching\/\" target=\"_blank\" rel=\"noreferrer noopener\">over-fetching<\/a>. Queries such as <code>SELECT<\/code> * retrieve entire tables, shifting filtering and aggregation into Python. This not only increases memory usage but also bypasses the database\u2019s optimized execution engine.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-production-grade-query-design\">Production-grade query design<\/h3>\n\n\n\n<p>Scaling requires shifting relevant computation back into SQL. The database has access to indexes, optimized join strategies, and buffered <a href=\"https:\/\/www.lenovo.com\/gb\/en\/glossary\/what-is-io\/\" target=\"_blank\" rel=\"noreferrer noopener\">I\/O (input\/output)<\/a>. Queries should be written to return <em>only<\/em> the necessary data, already transformed and aggregated.<\/p>\n\n\n\n<p>More importantly, performance must be approached systematically. Bottlenecks can&#8217;t be reliably predicted &#8211; they must be measured &#8211; and fixing one bottleneck often reveals another. Plus, assumptions about performance frequently prove incorrect under load.<\/p>\n\n\n\n<p>Effective systems follow a disciplined approach:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>Measure current performance<\/li>\n\n\n\n<li>Identify the bottleneck<\/li>\n\n\n\n<li>Apply a targeted change<\/li>\n\n\n\n<li>Measure again<\/li>\n<\/ul>\n<\/div>\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"769\" height=\"480\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/04\/image-1.gif\" alt=\"An image showing a graph detailing a disciplined approach to approaching performance systematically. \" class=\"wp-image-110089\"\/><figcaption class=\"wp-element-caption\"><em>The disciplined approach<\/em><\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-query-performance-optimization-in-postgresql\">Query performance optimization in PostgreSQL<\/h2>\n\n\n\n<p>When queries slow down, PostgreSQL\u2019s <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/postgresql\/what-is-an-execution-plan-and-how-to-find-it\/\" target=\"_blank\" rel=\"noreferrer noopener\">execution plan<\/a> is the primary diagnostic tool. <code>EXPLAIN<\/code> reveals the planner\u2019s expectations, while <code>EXPLAIN ANALYZE<\/code> shows what actually happens during execution.<\/p>\n\n\n\n<p>Query performance is heavily influenced by how PostgreSQL executes joins and scans. Sequential scans read entire tables and are efficient only for small datasets. <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/performance-sql-server\/identifying-and-solving-index-scan-problems\/\" target=\"_blank\" rel=\"noreferrer noopener\">Index scans<\/a> reduce the search space but depend on appropriate indexing.<\/p>\n\n\n\n<p>Join strategies introduce another layer of complexity. <a href=\"https:\/\/www.geeksforgeeks.org\/dbms\/nested-loop-join-in-dbms\/\" target=\"_blank\" rel=\"noreferrer noopener\">Nested loop joins<\/a> can perform well for small datasets but degrade rapidly with large inputs. <a href=\"https:\/\/www.educative.io\/answers\/what-is-a-hash-join-in-sql\" target=\"_blank\" rel=\"noreferrer noopener\">Hash joins<\/a> are efficient but require sufficient memory. <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/learn\/showplan-operator-of-the-week-merge-join\/\" target=\"_blank\" rel=\"noreferrer noopener\">Merge joins<\/a> depend on sorted data and can be optimal in specific scenarios. When the planner selects the wrong strategy (often due to inaccurate statistics), performance can degrade dramatically.<\/p>\n\n\n\n<section id=\"my-first-block-block_b8febf9880a40bf5736ae6075fcf140d\" class=\"my-first-block alignwide\">\n    <div class=\"bg-brand-600 text-base-white py-5xl px-4xl rounded-sm bg-gradient-to-r from-brand-600 to-brand-500 red\">\n        <div class=\"gap-4xl items-start md:items-center flex flex-col md:flex-row justify-between\">\n            <div class=\"flex-1 col-span-10 lg:col-span-7\">\n                <h3 class=\"mt-0 font-display mb-2 text-display-sm\">Get started with PostgreSQL &#8211; free book download<\/h3>\n                <div class=\"child:last-of-type:mb-0\">\n                                            &#8216;Introduction to PostgreSQL for the data professional&#8217;, written by Grant Fritchey and Ryan Booz, covers all the basics of how to get started with PostgreSQL.                                    <\/div>\n            <\/div>\n                                            <a href=\"https:\/\/www.red-gate.com\/hub\/books\/introduction-to-postgresql-for-the-data-professional\/\" class=\"btn btn--secondary btn--lg\" aria-label=\"Download your free copy: Get started with PostgreSQL - free book download\">Download your free copy<\/a>\n                    <\/div>\n    <\/div>\n<\/section>\n\n\n<h3 class=\"wp-block-heading\" id=\"h-optimization-strategies-indexing-views-partitioning\">Optimization strategies (indexing, views, partitioning)<\/h3>\n\n\n\n<p>Indexing remains a primary optimization tool, but it is not one-size-fits-all. By reducing the amount of data that needs to be scanned, indexes significantly speed up retrieval operations.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/mysql\/mysql-index-overviews-b-tree-indexes\/\" target=\"_blank\" rel=\"noreferrer noopener\">B-Tree indexes<\/a> handle most equality and range queries, while <a href=\"https:\/\/www.sqlpassion.at\/archive\/2026\/01\/12\/gin-indexes-in-postgresql\/\" target=\"_blank\" rel=\"noreferrer noopener\">GIN indexes<\/a> are suited to multi-valued data such as <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/mysql-vs-postgresql-json-data-type\/#:~:text=JSONB%20stands%20for%20JSON%20Binary\" target=\"_blank\" rel=\"noreferrer noopener\">JSONB<\/a> or arrays. <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/postgresql\/index-types-in-postgresql-learning-postgresql-with-grant\/#brin:~:text=this%20introduction.-,BRIN,-The%20Block%20Range\" target=\"_blank\" rel=\"noreferrer noopener\">BRIN indexes<\/a> trade precision for size and are particularly effective on large, naturally ordered datasets like time-series logs. The effectiveness of an index depends on how closely it matches the access pattern of the query.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"485\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/04\/image-2-1024x485.gif\" alt=\"\" class=\"wp-image-110090\" srcset=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/04\/image-2-1024x485.gif 1024w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/04\/image-2-300x142.gif 300w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/04\/image-2-768x364.gif 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><a href=\"https:\/\/aws.amazon.com\/what-is\/materialized-view\/\" target=\"_blank\" rel=\"noreferrer noopener\">Materialized views<\/a> provide a way to precompute expensive queries, while <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/postgresql\/postgresql-partitioning-the-most-useful-feature-you-may-never-have-used\/\" target=\"_blank\" rel=\"noreferrer noopener\">partitioning<\/a> limits the amount of data scanned during execution by splitting a table. Aggregations over long histories, <a href=\"https:\/\/www.appcues.com\/blog\/cohort-analysis\" target=\"_blank\" rel=\"noreferrer noopener\">cohort analysis<\/a>, and derived metrics often fall into this category. Together, these strategies shift workloads from runtime computation to precomputed or <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/t-sql-programming-sql-server\/sql-and-scope\/\" target=\"_blank\" rel=\"noreferrer noopener\">scoped<\/a> operations.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-mvcc-and-vacuum-in-postgresql-what-are-they\">MVCC and VACUUM in PostgreSQL &#8211; what are they?<\/h3>\n\n\n\n<p>PostgreSQL uses <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/postgresql\/multi-version-concurrency-control-mvcc-in-postgresql-learning-postgresql-with-grant\/\" target=\"_blank\" rel=\"noreferrer noopener\">Multi-Version Concurrency Control (MVCC)<\/a>, so updates create new row versions instead of overwriting existing ones. This allows for concurrent reads and writes without locking.<\/p>\n\n\n\n<p>Over time, these old versions accumulate as dead tuples and must be cleaned up by <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/postgresql\/learning-postgresql-with-grant-introducing-vacuum\/\" target=\"_blank\" rel=\"noreferrer noopener\">VACUUM<\/a>. If not, tables and indexes bloat, leading to slower queries. Autovacuum handles this automatically, but long-running transactions can block cleanup, causing gradual performance degradation.<\/p>\n\n\n\n<p>At scale, performance isn\u2019t just about efficient queries, it&#8217;s about maintaining the system so it stays efficient over time.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"277\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/04\/image-4-1024x277.gif\" alt=\"A graph showing VACUUM in PostgreSQL.\" class=\"wp-image-110092\" srcset=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/04\/image-4-1024x277.gif 1024w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/04\/image-4-300x81.gif 300w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/04\/image-4-768x208.gif 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><em>VACUUM in PostgreSQL<\/em><\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-advanced-feature-engineering-in-sql\">Advanced feature engineering in SQL<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-window-functions-what-are-they\">Window functions &#8211; what are they?<\/h3>\n\n\n\n<p><a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/t-sql-programming-sql-server\/introduction-to-t-sql-window-functions\/\" target=\"_blank\" rel=\"noreferrer noopener\">Window functions<\/a> provide context-aware computations such as rolling averages and lag-based features. Although highly effective, they come at a cost. These operations often require sorting or partitioning data, which can consume significant memory and may spill to disk if not properly configured.<\/p>\n\n\n\n<p>By performing these computations directly in SQL, data scientists can avoid unnecessary data movement and leverage PostgreSQL\u2019s execution engine.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-explaining-common-table-expressions-ctes\">Explaining Common Table Expressions (CTEs)<\/h3>\n\n\n\n<p><a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/postgresql\/recursive-ctes-transforming-and-analyzing-data-in-postgresql-part-3\" target=\"_blank\" rel=\"noreferrer noopener\">CTEs<\/a> improve readability by breaking queries into logical steps. This allows complex transformations to be articulated and maintained effortlessly. By breaking down the process, debugging becomes simpler, and the query becomes readable and easily extendable.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-analytical-aggregations-explained\">Analytical aggregations, explained<\/h3>\n\n\n\n<p>PostgreSQL also supports advanced aggregation techniques that go beyond simple grouping. Features like conditional aggregation and multi-level grouping enable complex summaries to be computed in a single query. This reduces redundant scans and improves efficiency, particularly for reporting and feature pipelines.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-infrastructure-and-connection-management-in-postgresql\">Infrastructure and connection management in PostgreSQL <\/h2>\n\n\n\n<p>PostgreSQL uses a process-based architecture where each connection consumes system resources. At scale, excessive connections can lead to memory pressure and degraded performance. The failure mode is subtle. Queries begin to queue, new connections are refused, and the system appears to hang &#8211; or fail &#8211; intermittently.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-how-connection-pooling-works-in-postgresql\">How connection pooling works in PostgreSQL<\/h3>\n\n\n\n<p>Connection pooling mitigates this by reusing existing connections. Instead of creating a new connection for each request, a pool of connections is maintained and shared across the application. Tools like <a href=\"https:\/\/www.pgbouncer.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">PgBouncer<\/a> handle this at the infrastructure level, while libraries like SQLAlchemy manage it within the application. In both cases, the goal is to stabilize resource usage and avoid spikes that the database cannot absorb.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-extensibility-and-data-types-in-postgresql-explained\">Extensibility and data types in PostgreSQL, explained<\/h3>\n\n\n\n<p>PostgreSQL\u2019s extensibility allows it to handle diverse data types and workloads. JSONB supports semi-structured data, while extensions like <a href=\"https:\/\/postgis.net\/\" target=\"_blank\" rel=\"noreferrer noopener\">PostGIS<\/a> enable spatial analysis. These features allow the database to adapt to evolving requirements without introducing additional systems.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-in-summary-it-s-not-about-what-tools-are-used-it-s-how-they-re-used\">In summary: it&#8217;s not about <em>what<\/em> tools are used &#8211; it&#8217;s <em>how<\/em> they&#8217;re used<\/h2>\n\n\n\n<p>At small scale, working with PostgreSQL from Python feels straightforward. Most things just <em>work<\/em>, and performance isn\u2019t something you think about much. As things grow, however, that changes. Decisions that didn\u2019t matter before start having a visible impact on how queries are written, how data is fetched, and how the database is maintained in the background.<\/p>\n\n\n\n<p>The difference between fragile pipelines and reliable systems is sometimes not the tools used, but how they&#8217;re used. PostgreSQL is an execution engine with its own constraints and capabilities, so it&#8217;s important to understand its internals and design around them. <\/p>\n\n\n\n<p>Put simply, that&#8217;s what enables data science workflows to move from experimentation to production without breaking under load.<\/p>\n\n\n\n<section id=\"faq\" class=\"faq-block my-5xl\">\n    <h2>FAQs: How data scientists work with PostgreSQL from Python (and how to write queries that don\u2019t break at scale)<\/h2>\n\n                        <h3 class=\"mt-4xl\">1. What role does PostgreSQL play in data science?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"145\" data-end=\"230\">It stores, processes, and shapes data &#8211; acting as both a database <em>and<\/em> execution engine.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">2. Why is PostgreSQL often overlooked?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"272\" data-end=\"339\">It runs in the background, so it\u2019s mistaken for <em>just<\/em> a data source.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">3. Can PostgreSQL handle both OLTP and OLAP?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"387\" data-end=\"450\">Yes, it supports both transactional and analytical workloads (HTAP).<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">4. Why avoid CSV-based workflows?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"487\" data-end=\"555\">They create stale, non-reproducible data &#8211; unlike direct SQL queries.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">5. How does PostgreSQL execute queries?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"598\" data-end=\"675\">It parses, plans, and executes queries, with the planner driving performance.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">6. Why do queries fail at scale?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"711\" data-end=\"785\">Poor query design, outdated stats, and large data volumes cause slowdowns.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">7. What are parameterized queries?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"823\" data-end=\"899\">Secure, reusable queries that improve performance and prevent SQL injection.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">8. Streaming vs loading data &#8211; what\u2019s better?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"946\" data-end=\"1018\">Streaming is more scalable since it avoids loading all data into memory.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">9. What is the &#039;notebook trap&#039;?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"1053\" data-end=\"1124\">Over-fetching data and doing heavy processing in Python instead of SQL.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">10. How do you optimize PostgreSQL queries?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"1170\" data-end=\"1233\">Use <code>EXPLAIN ANALYZE<\/code>, indexes, and iterative performance tuning.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">11. Why is indexing important?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"1266\" data-end=\"1319\">Indexes reduce scan time and speed up data retrieval.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">12. What is MVCC and VACUUM?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"1350\" data-end=\"1426\">MVCC enables concurrency. <code>VACUUM<\/code> cleans up old data to maintain performance.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">13. Why use connection pooling?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"1460\" data-end=\"1525\">It prevents resource overload from too many database connections.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">14. What enables PostgreSQL scalability?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"1568\" data-end=\"1642\">Efficient queries, proper indexing, and continuous performance monitoring.<\/p>\n            <\/div>\n            <\/section>\n\n\n\n<section id=\"my-first-block-block_0f5d9b14435381dcf43df451aee53e7e\" class=\"my-first-block alignwide\">\n    <div class=\"bg-brand-600 text-base-white py-5xl px-4xl rounded-sm bg-gradient-to-r from-brand-600 to-brand-500 red\">\n        <div class=\"gap-4xl items-start md:items-center flex flex-col md:flex-row justify-between\">\n            <div class=\"flex-1 col-span-10 lg:col-span-7\">\n                <h3 class=\"mt-0 font-display mb-2 text-display-sm\">Enjoying this article? Subscribe to the Simple Talk newsletter<\/h3>\n                <div class=\"child:last-of-type:mb-0\">\n                                            Get selected articles, event information, podcasts and other industry content delivered straight to your inbox.                                    <\/div>\n            <\/div>\n                                            <a href=\"https:\/\/www.red-gate.com\/simple-talk\/subscribe\/\" class=\"btn btn--secondary btn--lg\" aria-label=\"Subscribe now: Enjoying this article? Subscribe to the Simple Talk newsletter\">Subscribe now<\/a>\n                    <\/div>\n    <\/div>\n<\/section>","protected":false},"excerpt":{"rendered":"<p>In this article, learn how PostgreSQL powers data science workflows, including query execution, performance optimization, indexing, data retrieval, and more.&hellip;<\/p>\n","protected":false},"author":346892,"featured_media":105920,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143523,143526,53,143534,146042],"tags":[4483,4168,4170,158978,5021,4150],"coauthors":[159382],"class_list":["post-110087","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-databases","category-development","category-featured","category-postgresql","category-python","tag-data","tag-database","tag-database-administration","tag-postgresql","tag-python","tag-sql"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/110087","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/346892"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=110087"}],"version-history":[{"count":6,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/110087\/revisions"}],"predecessor-version":[{"id":110105,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/110087\/revisions\/110105"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media\/105920"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=110087"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=110087"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=110087"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=110087"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}