postgresql - postgresql mirror

	Commit message (Collapse)	Author	Age
*	Add cross-type comparisons to contrib/btree_gin.	Tom Lane	24 hours
\| \| \| \| \| \| \| \| \| \| \| \| \|	Extend the infrastructure in btree_gin.c to permit cross-type operators, and add the code to support them for the int2, int4, and int8 opclasses. (To keep this patch digestible, I left the other datatypes for a separate patch.) This improves the usability of btree_gin indexes by allowing them to support the same set of queries that a regular btree index does. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> Discussion: https://postgr.es/m/262624.1738460652@sss.pgh.pa.us
*	Refactor subtype field of AlterDomainStmt	Michael Paquier	37 hours
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	AlterDomainStmt.subtype used characters for its subtypes of commands, SET\|DROP DEFAULT\|NOT NULL and ADD\|DROP\|VALIDATE CONSTRAINT, which were hardcoded in a couple of places of the code. The code is improved by using an enum instead, with the same character values as the original code. Note that the field was documented in parsenodes.h and that it forgot to mention 'V' (VALIDATE CONSTRAINT). Author: Quan Zongliang <quanzongliang@yeah.net> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/41ff310b-16bd-44b9-a3ef-97e20f14b709@yeah.net
*	Support multi-line headers in COPY FROM command.	Fujii Masao	38 hours
\| \| \| \| \| \| \| \| \| \| \|	The COPY FROM command now accepts a non-negative integer for the HEADER option, allowing multiple header lines to be skipped. This is useful when the input contains multi-line headers that should be ignored during data import. Author: Shinya Kato <shinya11.kato@gmail.com> Co-authored-by: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Yugo Nagata <nagata@sraoss.co.jp> Discussion: https://postgr.es/m/CAOzEurRPxfzbxqeOPF_AGnAUOYf=Wk0we+1LQomPNUNtyZGBZw@mail.gmail.com
*	Add InjectionPointList() to retrieve list of injection points	Michael Paquier	45 hours
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This routine has come as a useful piece to be able to know the list of injection points currently attached in a system. One area would be to use it in a set-returning function, or just let out-of-core code play with it. This hides the internals of the shared memory array lookup holding the information about the injection points (point name, library and function name), allocating the result in a palloc'd List consumable by the caller. Reviewed-by: Jeff Davis <pgsql@j-davis.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Discussion: https://postgr.es/m/Z_xYkA21KyLEHvWR@paquier.xyz Discussion: https://postgr.es/m/aBG2rPwl3GE7m1-Q@paquier.xyz
*	Add GetNamedDSA() and GetNamedDSHash().	Nathan Bossart	2 days
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Presently, the dynamic shared memory (DSM) registry only provides GetNamedDSMSegment(), which allocates a fixed-size segment. To use the DSM registry for more sophisticated things like dynamic shared memory areas (DSAs) or a hash table backed by a DSA (dshash), users need to create a DSM segment that stores various handles and LWLock tranche IDs and to write fairly complicated initialization code. Furthermore, there is likely little variation in this initialization code between libraries. This commit introduces functions that simplify allocating a DSA or dshash within the DSM registry. These functions are very similar to GetNamedDSMSegment(). Notable differences include the lack of an initialization callback parameter and the prohibition of calling the functions more than once for a given entry in each backend (which should be trivially avoidable in most circumstances). While at it, this commit bumps the maximum DSM registry entry name length from 63 bytes to 127 bytes. Also note that even though one could presumably detach/destroy the DSAs and dshashes created in the registry, such use-cases are not yet well-supported, if for no other reason than the associated DSM registry entries cannot be removed. Adding such support is left as a future exercise. The test_dsm_registry test module contains tests for the new functions and also serves as a complete usage example. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Florents Tselai <florents.tselai@gmail.com> Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Discussion: https://postgr.es/m/aEC8HGy2tRQjZg_8%40nathan
*	Show sizes of FETCH queries as constants in pg_stat_statements	Michael Paquier	3 days
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Prior to this patch, every FETCH call would generate a unique queryId with a different size specified. Depending on the workloads, this could lead to a significant bloat in pg_stat_statements, as repeatedly calling a specific cursor would result in a new queryId each time. For example, FETCH 1 c1; and FETCH 2 c1; would produce different queryIds. This patch improves the situation by normalizing the fetch size, so as semantically similar statements generate the same queryId. As a result, statements like the below, which differ syntactically but have the same effect, will now share a single queryId: FETCH FROM c1 FETCH NEXT c1 FETCH 1 c1 In order to do a normalization based on the keyword used in FETCH, FetchStmt is tweaked with a new FetchDirectionKeywords. This matters for "howMany", which could be set to a negative value depending on the direction, and we want to normalize the queries with enough information about the direction keywords provided, including RELATIVE, ABSOLUTE or all the ALL variants. Author: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/CAA5RZ0tA6LbHCg2qSS+KuM850BZC_+ZgHV7Ug6BXw22TNyF+MA@mail.gmail.com
*	Control ctype behavior internally with a method table.	Jeff Davis	3 days
\| \| \| \| \| \| \| \| \| \| \| \| \|	Previously, pattern matching and case mapping behavior branched based on the provider. Refactor to use a method table, which is less error-prone. This is also a step toward multiple provider versions, which we may want to support in the future. Reviewed-by: Andreas Karlsson <andreas@proxel.se> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/2830211e1b6e6a2e26d845780b03e125281ea17b.camel%40j-davis.com
*	Silence valgrind about pg_numa_touch_mem_if_required	Tomas Vondra	3 days
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When querying NUMA status of pages in shared memory, we need to touch the memory first to get valid results. This may trigger valgrind reports, because some of the memory (e.g. unpinned buffers) may be marked as noaccess. Solved by adding a valgrind suppresion. An alternative would be to adjust the access/noaccess status before touching the memory, but that seems far too invasive. It would require all those places to have detailed knowledge of what the shared memory stores. The pg_numa_touch_mem_if_required() macro is replaced with a function. Macros are invisible to suppressions, so it'd have to suppress reports for the caller - e.g. pg_get_shmem_allocations_numa(). So we'd suppress reports for the whole function, and that seems to heavy-handed. It might easily hide other valid issues. Reviewed-by: Christoph Berg <myon@debian.org> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aEtDozLmtZddARdB@msg.df7cb.de Backpatch-through: 18
*	pgflex: propagate environment to flex subprocess	Peter Eisentraut	4 days
\| \| \| \| \| \| \| \| \| \|	Python's subprocess.run docs say that if the env argument is not None, it will be used "instead of the default behavior of inheriting the current process’ environment". However, the environment should be preserved, only adding FLEX_TMP_DIR to it. Author: Javier Maestro <jjmaestro@ieee.org> Discussion: https://www.postgresql.org/message-id/flat/CABvji06GUpmrTqqiCr6_F9vRL2-JUSVAh8ChgWa6k47FUCvYmA%40mail.gmail.com
*	Fix whitespace	Peter Eisentraut	4 days
\|
*	Align log_line_prefix in CI and TAP tests with pg_regress.c	Michael Paquier	5 days
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	log_line_prefix is changed to include "%b", the backend type in the TAP test configuration. %v and %x are removed from the CI configuration, with the format around %b changed. The lack of backend type in postgresql.conf set by Cluster.pm for the TAP test configuration was something that has been bugging me, beginning the discussion that has led to this change. The change in the CI has come up during the discussion, to become consistent with pg_regress.c, %v and %x not being that useful to have. Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/aC0VaIWAXLgXcHVP@paquier.xyz
*	Stamp HEAD as 19devel.	Joe Conway	5 days
\| \| \| \|	Let the hacking begin ...
*	Sync typedefs.list with the buildfarm.	Tom Lane	2025-06-15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Our maintenance of typedefs.list has been a little haphazard (and apparently we can't alphabetize worth a darn). Replace the file with the authoritative list from our buildfarm, and run pgindent using that. I also updated the additions/exclusions lists in pgindent where necessary to keep pgindent from messing things up significantly. Notably, now that regex_t and some related names are macros not real typedefs, we have to whitelist them explicitly. The exclusions list has also drifted noticeably, presumably due to changes of system headers on the buildfarm animals that contribute to the list. Unlike in prior years, I've not manually added typedef names that are missing from the buildfarm's list because they are not used to declare any variables or fields. So there are a few places where the typedef declaration itself is formatted worse than before, e.g. typedef enum IoMethod. I could preserve the names that were manually added to the list previously, but I'd really prefer to find a less manual way of dealing with these cases. A quick grep finds about 75 such symbols, most of which have never gotten any special treatment. Per discussion among pgsql-release, doing this now seems appropriate even though we're still a week or two away from making the v18 branch.
*	Revert function to get memory context stats for processes	Daniel Gustafsson	2025-05-23
\| \| \| \| \| \| \| \| \|	Due to concerns raised about the approach, and memory leaks found in sensitive contexts the functionality is reverted. This reverts commits 45e7e8ca9, f8c115a6c, d2a1ed172, 55ef7abf8 and 042a66291 for v18 with an intent to revisit this patch for v19. Discussion: https://postgr.es/m/594293.1747708165@sss.pgh.pa.us
*	Replace deprecated log_connections values in docs and tests	Melanie Plageman	2025-05-22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	9219093cab2607f modularized log_connections output to allow more granular control over which aspects of connection establishment are logged. It converted the boolean log_connections GUC into a list of strings and deprecated previously supported boolean-like values on, off, true, false, 1, 0, yes, and no. Those values still work, but they are supported mainly for backwards compatability. As such, documented examples of log_connections should not use these deprecated values. Update references in the docs to deprecated log_connections values. Many of the tests use log_connections. This commit also updates the tests to use the new values of log_connections. In some of the tests, the updated log_connections value covers a narrower set of aspects (e.g. the 'authentication' aspect in the tests in src/test/authentication and the 'receipt' aspect in src/test/postmaster). In other cases, the new value for log_connections is a superset of the previous included aspects (e.g. 'all' in src/test/kerberos/t/001_auth.pl). Reported-by: Peter Eisentraut <peter@eisentraut.org> Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Discussion: https://postgr.es/m/e1586594-3b69-4aea-87ce-73a7488cdc97%40eisentraut.org
*	Doc: add pre-branch task to run src/tools/copyright.pl.	Tom Lane	2025-05-18
\| \| \| \| \| \| \| \| \| \|	It's common for some files with last year's copyright date to sneak into the tree between early January (when we normally run copyright.pl) and feature freeze. Immediately before branching the new release is an ideal time to fix the stragglers, so add a note about it to the RELEASE_CHANGES checklist. Discussion: https://postgr.es/m/CALa6HA4_Wu7-2PV0xv-Q84cT8eG7rTx6bdjUV0Pc=McAwkNMfQ@mail.gmail.com
*	Refactor ChangeVarNodesExtended() using the custom callback	Alexander Korotkov	2025-05-07
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	fc069a3a6319 implemented Self-Join Elimination (SJE) and put related logic to ChangeVarNodes_walker(). This commit provides refactoring to remove the SJE-related logic from ChangeVarNodes_walker() but adds a custom callback to ChangeVarNodesExtended(), which has a chance to process a node before ChangeVarNodes_walker(). Passing this callback to ChangeVarNodesExtended() allows SJE-related node handling to be kept within the analyzejoins.c. Reported-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAMbWs49PE3CvnV8vrQ0Dr%3DHqgZZmX0tdNbzVNJxqc8yg-8kDQQ%40mail.gmail.com Author: Andrei Lepikhov <lepihov@gmail.com> Author: Alexander Korotkov <aekorotkov@gmail.com>
*	Revert "Refactor ChangeVarNodesExtended() using the custom callback"	Alexander Korotkov	2025-05-03
\| \| \| \| \| \| \| \|	This reverts commit 250a718aadad68793e82103282247556a46a3cfc. It shouldn't be pushed during the release freeze. Reported-by: Tom Lane Discussion: https://postgr.es/m/E1uBIbY-000owH-0O%40gemulon.postgresql.org
*	Refactor ChangeVarNodesExtended() using the custom callback	Alexander Korotkov	2025-05-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	fc069a3a6319 implemented Self-Join Elimination (SJE) and put related logic to ChangeVarNodes_walker(). This commit provides refactoring to remove the SJE-related logic from ChangeVarNodes_walker() but adds a custom callback to ChangeVarNodesExtended(), which has a chance to process a node before ChangeVarNodes_walker(). Passing this callback to ChangeVarNodesExtended() allows SJE-related node handling to be kept within the analyzejoins.c. Reported-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAMbWs49PE3CvnV8vrQ0Dr%3DHqgZZmX0tdNbzVNJxqc8yg-8kDQQ%40mail.gmail.com Author: Andrei Lepikhov <lepihov@gmail.com> Author: Alexander Korotkov <aekorotkov@gmail.com>
*	pg_restore cleanups	Andrew Dunstan	2025-04-16
\| \| \| \| \| \| \| \| \| \| \| \| \|	. remove unnecessary oid_string list stuff . use pg_get_line_buf() instead of open-coding it . cleaner parsing of map.dat lines Reverts 2b69afbe50d add new list type simple_oid_string_list to fe-utils/simple_list Author: Álvaro Herrera <alvherre@kurilemu.de> Author: Andrew Dunstan <andrew@dunslane.net> Discussion: https://postgr.es/m/202504141220.343fmoxfsbj4@alvherre.pgsql
*	Fix recently introduced typos	Daniel Gustafsson	2025-04-11
\| \| \| \| \| \| \| \| \|	This fixes typos in docs and comments introduced during the v18 development cycle, to keep them from ending up in backbranches. Author: Jacob Brazeal <jacob.brazeal@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/CA+COZaCgGua25f2hSrjrDLJcJJAHkwoKgTTqUy-wyL1=64JNjw@mail.gmail.com
*	Introduce file_copy_method setting.	Thomas Munro	2025-04-08
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It can be set to either COPY (the default) or CLONE if the system supports it. CLONE causes callers of copydir(), currently CREATE DATABASE ... STRATEGY=FILE_COPY and ALTER DATABASE ... SET TABLESPACE = ..., to use copy_file_range (Linux, FreeBSD) or copyfile (macOS) to copy files instead of a read-write loop over the contents. CLONE gives the kernel the opportunity to share block ranges on copy-on-write file systems and push copying down to storage on others, depending on configuration. On some systems CLONE can be used to clone large databases quickly with CREATE DATABASE ... TEMPLATE=source STRATEGY=FILE_COPY. Other operating systems could be supported; patches welcome. Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Ranier Vilela <ranier.vf@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
*	Add function to get memory context stats for processes	Daniel Gustafsson	2025-04-08
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds a function for retrieving memory context statistics and information from backends as well as auxiliary processes. The intended usecase is cluster debugging when under memory pressure or unanticipated memory usage characteristics. When calling the function it sends a signal to the specified process to submit statistics regarding its memory contexts into dynamic shared memory. Each memory context is returned in detail, followed by a cumulative total in case the number of contexts exceed the max allocated amount of shared memory. Each process is limited to use at most 1Mb memory for this. A summary can also be explicitly requested by the user, this will return the TopMemoryContext and a cumulative total of all lower contexts. In order to not block on busy processes the caller specifies the number of seconds during which to retry before timing out. In the case where no statistics are published within the set timeout, the last known statistics are returned, or NULL if no previously published statistics exist. This allows dash- board type queries to continually publish even if the target process is temporarily congested. Context records contain a timestamp to indicate when they were submitted. Author: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Discussion: https://postgr.es/m/CAH2L28v8mc9HDt8QoSJ8TRmKau_8FM_HKS41NeO9-6ZAkuZKXw@mail.gmail.com
*	Speedup child EquivalenceMember lookup in planner	David Rowley	2025-04-08
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When planning queries to partitioned tables, we clone all EquivalenceMembers belonging to the partitioned table into em_is_child EquivalenceMembers for each non-pruned partition. For partitioned tables with large numbers of partitions, this meant the ec_members list could become large and code searching that list would become slow. Effectively, the more partitions which were present, the more searches needed to be performed for operations such as find_ec_member_matching_expr() during create_plan() and the more partitions present, the longer these searches would take, i.e., a quadratic slowdown. To fix this, here we adjust how we store EquivalenceMembers for em_is_child members. Instead of storing these directly in ec_members, these are now stored in a new array of Lists in the EquivalenceClass, which is indexed by the relid. When we want to find EquivalenceMembers belonging to a certain child relation, we can narrow the search to the array element for that relation. To make EquivalenceMember lookup easier and to reduce the amount of code change, this commit provides a pair of functions to allow iteration over the EquivalenceMembers of an EC which also handles finding the child members, if required. Callers that never need to look at child members can remain using the foreach loop over ec_members, which will now often be faster due to only parent-level members being stored there. The actual performance increases here are highly dependent on the number of partitions and the query being planned. Performance increases can be visible with as few as 8 partitions, but the speedup is marginal for such low numbers of partitions. The speedups become much more visible with a few dozen to hundreds of partitions. With some tested queries using 56 partitions, the planner was around 3x faster than before. For use cases with thousands of partitions, these are likely to become significantly faster. Some testing has shown planner speedups of 60x or more with 8192 partitions. Author: Yuya Watari <watari.yuya@gmail.com> Co-authored-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrey Lepikhov <a.lepikhov@postgrespro.ru> Reviewed-by: Alena Rybakina <lena.ribackina@yandex.ru> Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Tested-by: Thom Brown <thom@linux.com> Tested-by: newtglobal postgresql_contributors <postgresql_contributors@newtglobalcorp.com> Discussion: https://postgr.es/m/CAJ2pMkZNCgoUKSE%2B_5LthD%2BKbXKvq6h2hQN8Esxpxd%2Bcxmgomg%40mail.gmail.com
*	Add pg_buffercache_numa view with NUMA node info	Tomas Vondra	2025-04-07
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduces a new view pg_buffercache_numa, showing NUMA memory nodes for individual buffers. For each buffer the view returns an entry for each memory page, with the associated NUMA node. The database blocks and OS memory pages may have different size - the default block size is 8KB, while the memory page is 4K (on x86). But other combinations are possible, depending on configure parameters, platform, etc. This means buffers may overlap with multiple memory pages, each associated with a different NUMA node. To determine the NUMA node for a buffer, we first need to touch the memory pages using pg_numa_touch_mem_if_required, otherwise we might get status -2 (ENOENT = The page is not present), indicating the page is either unmapped or unallocated. The view may be relatively expensive, especially when accessed for the first time in a backend, as it touches all memory pages to get reliable information about the NUMA node. This may also force allocation of the shared memory. Author: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
*	Fix memory leaks in px_crypt_shacrypt().	Tom Lane	2025-04-06
\| \| \| \| \| \| \| \|	Per Coverity. I don't think these are of any actual significance since the function ought to be invoked in a short-lived context. Still, if it's trying to be neat it should get it right. Also const-ify a constant and fix up typedef formatting.
*	Relax ordering-related hardcoded btree requirements in planning	Peter Eisentraut	2025-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There were several places in ordering-related planning where a requirement for btree was hardcoded but an amcanorder index could suffice. This fixes that. We just need to do the necessary mapping between strategy numbers and compare types and adjust some related APIs so that this works independent of btree strategy numbers. For instance, non-btree amcanorder indexes can now be used to support sorting and merge joins. Also, predtest.c works independent of btree strategy numbers now. To avoid performance regressions, some details on btree and other built-in index types are still hardcoded as shortcuts, but other index types now have access to the same features by providing the required flags and callbacks. Author: Mark Dilger <mark.dilger@enterprisedb.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
*	add new list type simple_oid_string_list to fe-utils/simple_list	Andrew Dunstan	2025-04-04
\| \| \| \| \| \| \| \|	This type contains both an oid and a string. This will be used in forthcoming changes to pg_restore. Author: Andrew Dunstan <andrew@dunslane.net>
*	Use streaming read I/O in autoprewarm	Melanie Plageman	2025-04-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Make a read stream for each valid fork of each valid relation represented in the autoprewarm dump file and prewarm those blocks through the read stream API instead of by directly invoking ReadBuffer(). Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Co-authored-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru> (earlier versions) Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> (earlier versions) Reviewed-by: Matheus Alcantara <mths.dev@pm.me> (earlier versions) Discussion: https://postgr.es/m/flat/CAN55FZ3n8Gd%2BhajbL%3D5UkGzu_aHGRqnn%2BxktXq2fuds%3D1AOR6Q%40mail.gmail.com
*	Add nbtree skip scan optimization.	Peter Geoghegan	2025-04-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Teach nbtree multi-column index scans to opportunistically skip over irrelevant sections of the index given a query with no "=" conditions on one or more prefix index columns. When nbtree is passed input scan keys derived from a predicate "WHERE b = 5", new nbtree preprocessing steps output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys. That is, preprocessing generates a "skip array" (and an output scan key) for the omitted prefix column "a", which makes it safe to mark the scan key on "b" as required to continue the scan. The scan is therefore able to repeatedly reposition itself by applying both the "a" and "b" keys. A skip array has "elements" that are generated procedurally and on demand, but otherwise works just like a regular ScalarArrayOp array. Preprocessing can freely add a skip array before or after any input ScalarArrayOp arrays. Index scans with a skip array decide when and where to reposition the scan using the same approach as any other scan with array keys. This design builds on the design for array advancement and primitive scan scheduling added to Postgres 17 by commit 5bf748b8. Testing has shown that skip scans of an index with a low cardinality skipped prefix column can be multiple orders of magnitude faster than an equivalent full index scan (or sequential scan). In general, the cardinality of the scan's skipped column(s) limits the number of leaf pages that can be skipped over. The core B-Tree operator classes on most discrete types generate their array elements with the help of their own custom skip support routine. This infrastructure gives nbtree a way to generate the next required array element by incrementing (or decrementing) the current array value. It can reduce the number of index descents in cases where the next possible indexable value frequently turns out to be the next value stored in the index. Opclasses that lack a skip support routine fall back on having nbtree "increment" (or "decrement") a skip array's current element by setting the NEXT (or PRIOR) scan key flag, without directly changing the scan key's sk_argument. These sentinel values behave just like any other value from an array -- though they can never locate equal index tuples (they can only locate the next group of index tuples containing the next set of non-sentinel values that the scan's arrays need to advance to). A skip array's range is constrained by "contradictory" inequality keys. For example, a skip array on "x" will only generate the values 1 and 2 given a qual such as "WHERE x BETWEEN 1 AND 2 AND y = 66". Such a skip array qual usually has near-identical performance characteristics to a comparable SAOP qual "WHERE x = ANY('{1, 2}') AND y = 66". However, improved performance isn't guaranteed. Much depends on physical index characteristics. B-Tree preprocessing is optimistic about skipping working out: it applies static, generic rules when determining where to generate skip arrays, which assumes that the runtime overhead of maintaining skip arrays will pay for itself -- or lead to only a modest performance loss. As things stand, these assumptions are much too optimistic: skip array maintenance will lead to unacceptable regressions with unsympathetic queries (queries whose scan can't skip over many irrelevant leaf pages). An upcoming commit will address the problems in this area by enhancing _bt_readpage's approach to saving cycles on scan key evaluation, making it work in a way that directly considers the needs of = array keys (particularly = skip array keys). Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Aleksander Alekseev <aleksander@timescale.com> Reviewed-By: Alena Rybakina <a.rybakina@postgrespro.ru> Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
*	Make derived clause lookup in EquivalenceClass more efficient	Amit Langote	2025-04-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Derived clauses are stored in ec_derives, a List of RestrictInfos. These clauses are later looked up by matching the left and right EquivalenceMembers along with the clause's parent EC. This linear search becomes expensive in queries with many joins or partitions, where ec_derives may contain thousands of entries. In particular, create_join_clause() can spend significant time scanning this list. To improve performance, introduce a hash table (ec_derives_hash) that is built when the list reaches 32 entries -- the same threshold used for join_rel_hash. The original list is retained alongside the hash table to support EC merging and serialization (_outEquivalenceClass()). Each clause is stored in the hash table using a canonicalized key: the EquivalenceMember with the lower memory address is placed in the key before the one with the higher memory address. This avoids storing or searching for both permutations of the same clause. For clauses involving a constant EM, the key places NULL in the first slot and the non-constant EM in the second. The hash table is initialized using list_length(ec_derives_list) as the size hint. simplehash internally adjusts this to the next power of two after dividing by the fillfactor, so this typically results in at least 64 buckets near the threshold -- avoiding immediate resizing while adapting to the actual number of entries. The lookup logic for derived clauses is now centralized in ec_search_derived_clause_for_ems(), which consults the hash table when available and falls back to the list otherwise. The new ec_clear_derived_clauses() always frees ec_derives_list, even though some of the original code paths that cleared the old ec_derives field did not. This ensures consistent cleanup and avoids leaking memory when large lists are discarded. An assertion originally placed in find_derived_clause_for_ec_member() is moved into ec_search_derived_clause_for_ems() so that it is enforced consistently, regardless of whether the hash table or list is used for lookup. This design incorporates suggestions by David Rowley, who proposed both the key canonicalization and the initial sizing approach to balance memory usage and CPU efficiency. Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> Tested-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Tested-by: Amit Langote <amitlangote09@gmail.com> Tested-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAExHW5vZiQtWU6moszLP5iZ8gLX_ZAUbgEX0DxGLx9PGWCtqUg@mail.gmail.com
*	Change SQL-language functions to use the plan cache.	Tom Lane	2025-04-02
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the historical implementation of SQL functions (if they don't get inlined), we built plans for all the contained queries at first call within an outer query, and then re-used those plans for the duration of the outer query, and then forgot everything. This was not ideal, not least because the plans could not be customized to specific values of the function's parameters. Our plancache infrastructure seems mature enough to be used here. That will solve both the problem with not being able to build custom plans and the problem with not being able to share work across successive outer queries. Aside from those performance concerns, this change fixes a longstanding bugaboo with SQL functions: you could not write DDL that would affect later statements in the same function. That's mostly still true with new-style SQL functions, since the results of parse analysis are baked into the stored query trees (and protected by dependency records). But for old-style SQL functions, it will now work much as it does with PL/pgSQL functions, because we delay parse analysis and planning of each query until we're ready to run it. Some edge cases that require replanning are now handled better too; see for example the new rowsecurity test, where we now detect an RLS context change that was previously missed. One other edge-case change that might be worthy of a release note is that we now insist that a SQL function's result be generated by the physically-last query within it. Previously, if the last original query was deleted by a DO INSTEAD NOTHING rule, we'd be willing to take the result from the preceding query instead. This behavior was undocumented except in source-code comments, and it seems hard to believe that anyone's relying on it. Along the way to this feature, we needed a few infrastructure changes: * The plancache can now take either a raw parse tree or an analyzed-but-not-rewritten Query as the starting point for a CachedPlanSource. If given a Query, it is caller's responsibility that nothing will happen to invalidate that form of the query. We use this for new-style SQL functions, where what's in pg_proc is serialized Query(s) and we trust the dependency mechanism to disallow DDL that would break those. * The plancache now offers a way to invoke a post-rewrite callback to examine/modify the rewritten parse tree when it is rebuilding the parse trees after a cache invalidation. We need this because SQL functions sometimes adjust the parse tree to make its output exactly match the declared result type; if the plan gets rebuilt, that has to be re-done. * There is a new backend module utils/cache/funccache.c that abstracts the idea of caching data about a specific function usage (a particular function and set of input data types). The code in it is moved almost verbatim from PL/pgSQL, which has done that for a long time. We use that logic now for SQL-language functions too, and maybe other PLs will have use for it in the future. Author: Alexander Pyhalov <a.pyhalov@postgrespro.ru> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com> Discussion: https://postgr.es/m/8216639.NyiUUSuA9g@aivenlaptop
*	Introduce a SQL-callable function array_sort(anyarray).	Tom Lane	2025-04-01
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Create a function that will sort the elements of an array according to the element type's sort order. If the array has more than one dimension, the sub-arrays of the first dimension are sorted per normal array-comparison rules, leaving their contents alone. In support of this, add pg_type.typarray to the set of fields cached by the typcache. Author: Junwang Zhao <zhjwpku@gmail.com> Co-authored-by: Jian He <jian.universality@gmail.com> Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://postgr.es/m/CAEG8a3J41a4dpw_-F94fF-JPRXYxw-GfsgoGotKcjs9LVfEEvw@mail.gmail.com
*	amcheck: Add gin_index_check() to verify GIN index	Tomas Vondra	2025-03-29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Adds a new function, validating two kinds of invariants on a GIN index: - parent-child consistency: Paths in a GIN graph have to contain consistent keys. Tuples on parent pages consistently include tuples from child pages; parent tuples do not require any adjustments. - balanced-tree / graph: Each internal page has at least one downlink, and can reference either only leaf pages or only internal pages. The GIN verification is based on work by Grigory Kryachko, reworked by Heikki Linnakangas and with various improvements by Andrey Borodin. Investigation and fixes for multiple bugs by Kirill Reshke. Author: Grigory Kryachko <GSKryachko@gmail.com> Author: Heikki Linnakangas <hlinnaka@iki.fi> Author: Andrey Borodin <amborodin@acm.org> Reviewed-By: José Villanova <jose.arthur@gmail.com> Reviewed-By: Aleksander Alekseev <aleksander@timescale.com> Reviewed-By: Nikolay Samokhvalov <samokhvalov@gmail.com> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com> Reviewed-By: Kirill Reshke <reshkekirill@gmail.com> Reviewed-By: Mark Dilger <mark.dilger@enterprisedb.com> Reviewed-By: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/45AC9B0A-2B45-40EE-B08F-BDCF5739D1E1%40yandex-team.ru
*	amcheck: Move common routines into a separate module	Tomas Vondra	2025-03-29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Before performing checks on an index, we need to take some safety measures that apply to all index AMs. This includes: * verifying that the index can be checked - Only selected AMs are supported by amcheck (right now only B-Tree). The index has to be valid and not a temporary index from another session. * changing (and then restoring) user's security context * obtaining proper locks on the index (and table, if needed) * discarding GUC changes from the index functions Until now this was implemented in the B-Tree amcheck module, but it's something every AM will have to do. So relocate the code into a new module verify_common for reuse. The shared steps are implemented by amcheck_lock_relation_and_check(), receiving the AM-specific verification as a callback. Custom parameters may be supplied using a pointer. Author: Andrey Borodin <amborodin@acm.org> Reviewed-By: José Villanova <jose.arthur@gmail.com> Reviewed-By: Aleksander Alekseev <aleksander@timescale.com> Reviewed-By: Nikolay Samokhvalov <samokhvalov@gmail.com> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Mark Dilger <mark.dilger@enterprisedb.com> Reviewed-By: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://postgr.es/m/45AC9B0A-2B45-40EE-B08F-BDCF5739D1E1%40yandex-team.ru
*	Use streaming read I/O in heap amcheck	Melanie Plageman	2025-03-27
\| \| \| \| \| \| \| \| \| \| \| \| \|	Instead of directly invoking ReadBuffer() for each unskippable block in the heap relation, verify_heapam() now uses the read stream API to acquire the next buffer to check for corruption. Author: Matheus Alcantara <matheusssilv97@gmail.com> Co-authored-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/flat/CAFY6G8eLyz7%2BsccegZYFj%3D5tAUR-GZ9uEq4Ch5gvwKqUwb_hCA%40mail.gmail.com
*	aio: Add io_method=io_uring	Andres Freund	2025-03-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Performing AIO using io_uring can be considerably faster than io_method=worker, particularly when lots of small IOs are issued, as a) the context-switch overhead for worker based AIO becomes more significant b) the number of IO workers can become limiting io_uring, however, is linux specific and requires an additional compile-time dependency (liburing). This implementation is fairly simple and there are substantial optimization opportunities. The description of the existing AIO_IO_COMPLETION wait event is updated to make the difference between it and the new AIO_IO_URING_EXECUTION clearer. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
*	pg_overexplain: Additional EXPLAIN options for debugging.	Robert Haas	2025-03-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There's a fair amount of information in the Plan and PlanState trees that isn't printed by any existing EXPLAIN option. This means that, when working on the planner, it's often necessary to rely on facilities such as debug_print_plan, which produce excessively voluminous output. Hence, use the new EXPLAIN extension facilities to implement EXPLAIN (DEBUG) and EXPLAIN (RANGE_TABLE) as extensions to the core EXPLAIN facility. A great deal more could be done here, and the specific choices about what to print and how are definitely arguable, but this is at least a starting point for discussion and a jumping-off point for possible future improvements. Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviweed-by: Andrei Lepikhov <lepihov@gmail.com> (who didn't like it) Discussion: http://postgr.es/m/CA+TgmoZfvQUBWQ2P8iO30jywhfEAKyNzMZSR+uc2xr9PZBw6eQ@mail.gmail.com
*	Introduce PG_MODULE_MAGIC_EXT macro.	Tom Lane	2025-03-26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This macro allows dynamically loaded shared libraries (modules) to provide a wired-in module name and version, and possibly other compile-time-constant fields in future. This information can be retrieved with the new pg_get_loaded_modules() function. This feature is expected to be particularly useful for modules that do not have any exposed SQL functionality and thus are not associated with a SQL-level extension object. But even for modules that do belong to extensions, being able to verify the actual code version can be useful. Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Yurii Rashkovskii <yrashk@omnigres.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/dd4d1b59-d0fe-49d5-b28f-1e463b68fa32@gmail.com
*	Detect and Log multiple_unique_conflicts type conflict.	Amit Kapila	2025-03-24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce a new conflict type, multiple_unique_conflicts, to handle cases where an incoming row during logical replication violates multiple UNIQUE constraints. Previously, the apply worker detected and reported only the first encountered key conflict (insert_exists/update_exists), causing repeated failures as each constraint violation needs to be handled one by one making the process slow and error-prone. With this patch, the apply worker checks all unique constraints upfront once the first key conflict is detected and reports multiple_unique_conflicts if multiple violations exist. This allows users to resolve all conflicts at once by deleting all conflicting tuples rather than dealing with them individually or skipping the transaction. In the future, this will also allow us to specify different resolution handlers for such a conflict type. Add the stats for this conflict type in pg_stat_subscription_stats. Author: Nisha Moond <nisha.moond412@gmail.com> Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Discussion: https://postgr.es/m/CABdArM7FW-_dnthGkg2s0fy1HhUB8C3ELA0gZX1kkbs1ZZoV3Q@mail.gmail.com
*	meson: Flush stdout in testwrap	Andres Freund	2025-03-19
\| \| \| \| \| \| \| \|	Otherwise the progress won't reliably be displayed during a test. Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/kx6xu7suexal5vwsxpy7ybgkcznx6hgywbuhkr6qabcwxjqax2@i4pcpk75jvaa Backpatch-through: 16
*	aio: Add io_method=worker	Andres Freund	2025-03-18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The previous commit introduced the infrastructure to start io_workers. This commit actually makes the workers execute IOs. IO workers consume IOs from a shared memory submission queue, run traditional synchronous system calls, and perform the shared completion handling immediately. Client code submits most requests by pushing IOs into the submission queue, and waits (if necessary) using condition variables. Some IOs cannot be performed in another process due to lack of infrastructure for reopening the file, and must processed synchronously by the client code when submitted. For now the default io_method is changed to "worker". We should re-evaluate that around beta1, we might want to be careful and set the default to "sync" for 18. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
*	Make it possible for loadable modules to add EXPLAIN options.	Robert Haas	2025-03-18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Modules can use RegisterExtensionExplainOption to register new EXPLAIN options, and GetExplainExtensionId, GetExplainExtensionState, and SetExplainExtensionState to store related state inside the ExplainState object. Since this substantially increases the amount of code that needs to handle ExplainState-related tasks, move a few bits of existing code to a new file explain_state.c and add the rest of this infrastructure there. See the comments at the top of explain_state.c for further explanation of how this mechanism works. This does not yet provide a way for such such options to do anything useful. The intention is that we'll add hooks for that purpose in a separate commit. Discussion: http://postgr.es/m/CA+TgmoYSzg58hPuBmei46o8D3SKX+SZoO4K_aGQGwiRzvRApLg@mail.gmail.com Reviewed-by: Srinath Reddy <srinath2133@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Sami Imseih <samimseih@gmail.com>
*	aio: Add core asynchronous I/O infrastructure	Andres Freund	2025-03-17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The main motivations to use AIO in PostgreSQL are: a) Reduce the time spent waiting for IO by issuing IO sufficiently early. In a few places we have approximated this using posix_fadvise() based prefetching, but that is fairly limited (no completion feedback, double the syscalls, only works with buffered IO, only works on some OSs). b) Allow to use Direct-I/O (DIO). DIO can offload most of the work for IO to hardware and thus increase throughput / decrease CPU utilization, as well as reduce latency. While we have gained the ability to configure DIO in d4e71df6, it is not yet usable for real world workloads, as every IO is executed synchronously. For portability, the new AIO infrastructure allows to implement AIO using different methods. The choice of the AIO method is controlled by the new io_method GUC. As of this commit, the only implemented method is "sync", i.e. AIO is not actually executed asynchronously. The "sync" method exists to allow to bypass most of the new code initially. Subsequent commits will introduce additional IO methods, including a cross-platform method implemented using worker processes and a linux specific method using io_uring. To allow different parts of postgres to use AIO, the core AIO infrastructure does not need to know what kind of files it is operating on. The necessary behavioral differences for different files are abstracted as "AIO Targets". One example target would be smgr. For boring portability reasons, all targets currently need to be added to an array in aio_target.c. This commit does not implement any AIO targets, just the infrastructure for them. The smgr target will be added in a later commit. Completion (and other events) of IOs for one type of file (i.e. one AIO target) need to be reacted to differently, based on the IO operation and the callsite. This is made possible by callbacks that can be registered on IOs. E.g. an smgr read into a local buffer does not need to update the corresponding BufferDesc (as there is none), but a read into shared buffers does. This commit does not contain any callbacks, they will be added in subsequent commits. For now the AIO infrastructure only understands READV and WRITEV operations, but it is expected that more operations will be added. E.g. fsync/fdatasync, flush_range and network operations like send/recv. As of this commit, nothing uses the AIO infrastructure. Later commits will add an smgr target, md.c and bufmgr.c callbacks and then finally use AIO for read_stream.c IO, which, in one fell swoop, will convert all read stream users to AIO. The goal is to use AIO in many more places. There are patches to use AIO for checkpointer and bgwriter that are reasonably close to being ready. There also are prototypes to use it for WAL, relation extension, backend writes and many more. Those prototypes were important to ensure the design of the AIO subsystem is not too limiting (e.g. WAL writes need to happen in critical sections, which influenced a lot of the design). A future commit will add an AIO README explaining the AIO architecture and how to use the AIO subsystem. The README is added later, as it references details only added in later commits. Many many more people than the folks named below have contributed with feedback, work on semi-independent patches etc. E.g. various folks have contributed patches to use the read stream infrastructure (added by Thomas in b5a9b18cd0b) in more places. Similarly, a lot of folks have contributed to the CI infrastructure, which I had started to work on to make adding AIO feasible. Some of the work by contributors has gone into the "v1" prototype of AIO, which heavily influenced the current design of the AIO subsystem. None of the code from that directly survives, but without the prototype, the current version of the AIO infrastructure would not exist. Similarly, the reviewers below have not necessarily looked at the current design or the whole infrastructure, but have provided very valuable input. I am to blame for problems, not they. Author: Andres Freund <andres@anarazel.de> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Co-authored-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Antonin Houska <ah@cybertec.at> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
*	aio: Basic subsystem initialization	Andres Freund	2025-03-17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit just does the minimal wiring up of the AIO subsystem, added in the next commit, to the rest of the system. The next commit contains more details about motivation and architecture. This commit is kept separate to make it easier to review, separating the changes across the tree, from the implementation of the new subsystem. We discussed squashing this commit with the main commit before merging AIO, but there has been a mild preference for keeping it separate. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
*	Apply more consistent style for command options in TAP tests	Michael Paquier	2025-03-17
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit reshapes the grammar of some commands to apply a more consistent style across the board, following rules similar to ce1b0f9da03e: - Elimination of some pointless used-once variables. - Use of long options, to self-document better the options used. - Use of fat commas to link option names and their assigned values, including redirections, so as perltidy can be tricked to put them together. Author: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://postgr.es/m/87jz8rzf3h.fsf@wibble.ilmari.org
*	pg_noreturn to replace pg_attribute_noreturn()	Peter Eisentraut	2025-03-13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We want to support a "noreturn" decoration on more compilers besides just GCC-compatible ones, but for that we need to move the decoration in front of the function declaration instead of either behind it or wherever, which is the current style afforded by GCC-style attributes. Also rename the macro to "pg_noreturn" to be similar to the C11 standard "noreturn". pg_noreturn is now supported on all compilers that support C11 (using _Noreturn), as well as GCC-compatible ones (using __attribute__, as before), as well as MSVC (using __declspec). (When PostgreSQL requires C11, the latter two variants can be dropped.) Now, all supported compilers effectively support pg_noreturn, so the extra code for !HAVE_PG_ATTRIBUTE_NORETURN can be dropped. This also fixes a possible problem if third-party code includes stdnoreturn.h, because then the current definition of #define pg_attribute_noreturn() __attribute__((noreturn)) would cause an error. Note that the C standard does not support a noreturn attribute on function pointer types. So we have to drop these here. There are only two instances at this time, so it's not a big loss. In one case, we can make up for it by adding the pg_noreturn to a wrapper function and adding a pg_unreachable(), in the other case, the latter was already done before. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/flat/pxr5b3z7jmkpenssra5zroxi7qzzp6eswuggokw64axmdixpnk@zbwxuq7gbbcw
*	Fix incorrect handling of subquery pullup	Richard Guo	2025-03-13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When pulling up a subquery, if the subquery's target list items are used in grouping set columns, we need to wrap them in PlaceHolderVars. This ensures that expressions retain their separate identity so that they will match grouping set columns when appropriate. In 90947674f, we decided to wrap subquery outputs that are non-var expressions in PlaceHolderVars. This prevents const-simplification from merging them into the surrounding expressions after subquery pullup, which could otherwise lead to failing to match those subexpressions to grouping set columns, with the effect that they'd not go to null when expected. However, that left some loose ends. If the subquery's target list contains two or more identical Var expressions, we can still fail to match the Var expression to the expected grouping set expression. This is not related to const-simplification, but rather to how we match expressions to lower target items in setrefs.c. For sort/group expressions, we use ressortgroupref matching, which works well. For other expressions, we primarily rely on comparing the expressions to determine if they are the same. Therefore, we need a way to prevent setrefs.c from matching the expression to some other identical ones. To fix, wrap all subquery outputs in PlaceHolderVars if the parent query uses grouping sets, ensuring that they preserve their separate identity throughout the whole planning process. Reported-by: Dean Rasheed <dean.a.rasheed@gmail.com> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/CAMbWs4-meSahaanKskpBn0KKxdHAXC1_EJCVWHxEodqirrGJnw@mail.gmail.com
*	Add connection establishment duration logging	Melanie Plageman	2025-03-12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add log_connections option 'setup_durations' which logs durations of several key parts of connection establishment and backend setup. For an incoming connection, starting from when the postmaster gets a socket from accept() and ending when the forked child backend is first ready for query, there are multiple steps that could each take longer than expected due to external factors. This logging provides visibility into authentication and fork duration as well as the end-to-end connection establishment and backend initialization time. To make this portable, the timings captured in the postmaster (socket creation time, fork initiation time) are passed through the BackendStartupData. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Guillaume Lelarge <guillaume.lelarge@dalibo.com> Discussion: https://postgr.es/m/flat/CAAKRu_b_smAHK0ZjrnL5GRxnAVWujEXQWpLXYzGbmpcZd3nLYw%40mail.gmail.com
*	Modularize log_connections output	Melanie Plageman	2025-03-12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Convert the boolean log_connections GUC into a list GUC comprised of the connection aspects to log. This gives users more control over the volume and kind of connection logging. The current log_connections options are 'receipt', 'authentication', and 'authorization'. The empty string disables all connection logging. 'all' enables all available connection logging. For backwards compatibility, the most common values for the log_connections boolean are still supported (on, off, 1, 0, true, false, yes, no). Note that previously supported substrings of on, off, true, false, yes, and no are no longer supported. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/flat/CAAKRu_b_smAHK0ZjrnL5GRxnAVWujEXQWpLXYzGbmpcZd3nLYw%40mail.gmail.com