aboutsummaryrefslogtreecommitdiff
path: root/src/backend
Commit message (Collapse)AuthorAge
* Fix INITCAP() word boundaries for PG_UNICODE_FAST.Jeff Davis2025-04-21
| | | | | | | | | | | | Word boundaries are based on whether a character is alphanumeric or not. For the PG_UNICODE_FAST collation, alphanumeric includes non-ASCII digits; whereas for the PG_C_UTF8 collation, it only includes digits 0-9. Pass down the right information from the pg_locale_t into initcap_wbnext to differentiate the behavior. Reported-by: Noah Misch <noah@leadboat.com> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250417135841.33.nmisch@google.com
* Use the same cmd_context throughout a walsender's lifetime.Tom Lane2025-04-21
| | | | | | | | | | | | | | | | | | | | | | | | | exec_replication_command created a cmd_context to work in and then deleted it on exit. This is pretty dangerous because some replication commands start/finish transactions. In the wake of commit 1afe31f03, that could lead to re-selecting a CurrentMemoryContext that's already been deleted, leading to hilarity such as a memory context that is its own parent. To fix, let's make the cmd_context persist across exec_replication_command calls; instead of deleting it, we'll just reset it each time. In this way it retains the same identity and there's no problem if transaction abort restores it as the working context. It probably even saves a few microseconds to do this. This fix also ensures that exec_replication_command returns to the caller (PostgresMain) with the same context active that had been when it was called (probably MessageContext). The previous coding could get that wrong too. Reported-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAO6_XqoJA7-_G6t7Uqe5nWF3nj+QBGn4F6Ptp=rUGDr0zo+KvA@mail.gmail.com
* MemoryContextCreate: assert parent is valid and different from node.Tom Lane2025-04-21
| | | | | | | | | | | | | The case of "node == parent" might seem impossible, since we just allocated the new node. But it's possible if parent is a dangling reference to a recently-deleted context. In fact, given aset.c's habit of recycling contexts, it's actually rather likely if that's so. If we'd had this assertion before, it would have simplified debugging a recently-identified walsender issue. Reported-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAO6_XqoJA7-_G6t7Uqe5nWF3nj+QBGn4F6Ptp=rUGDr0zo+KvA@mail.gmail.com
* Fix a few more duplicate words in commentsDavid Rowley2025-04-21
| | | | | | | | Similar to 84fd3bc14 but these ones were found using a regex that can span multiple lines. Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvrMcr8XD107H3NV=WHgyBcu=sx5+7=WArr-n_cWUqdFXQ@mail.gmail.com
* Fix a few duplicate words in commentsDavid Rowley2025-04-21
| | | | | | | These are all new to v18 Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvrMcr8XD107H3NV=WHgyBcu=sx5+7=WArr-n_cWUqdFXQ@mail.gmail.com
* Comment on need to MarkBufferDirty() if omitting DELAY_CHKPT_START.Noah Misch2025-04-20
| | | | | | | | | | | | | | | | Blocking checkpoint phase 2 requires MarkBufferDirty() and BUFFER_LOCK_EXCLUSIVE; neither suffices by itself. transam/README documents this, citing SyncOneBuffer(). Update the DELAY_CHKPT_START documentation to say this. Expand the heap_inplace_update_and_unlock() comment that cites XLogSaveBufferForHint() as precedent, since heap_inplace_update_and_unlock() could have opted not to use DELAY_CHKPT_START. Commit 8e7e672cdaa6bfec85d4d5dd9be84159df23bb41 added DELAY_CHKPT_START to heap_inplace_update_and_unlock(). Since commit bc6bad88572501aecaa2ac5d4bc900ac0fd457d5 reverted it in non-master branches, no back-patch. Discussion: https://postgr.es/m/20250406180054.26.nmisch@google.com
* Avoid ERROR at ON COMMIT DELETE ROWS after relhassubclass=f.Noah Misch2025-04-20
| | | | | | | | | | | | | | | | Commit 7102070329d8147246d2791321f9915c3b5abf31 fixed a similar bug, but it missed the case of database-wide ANALYZE ("use_own_xacts" mode). Commit a07e03fd8fa7daf4d1356f7cb501ffe784ea6257 changed consequences from silent discard of a pg_class stats (relpages et al.) update to ERROR "tuple to be updated was already modified". Losing a relpages update of an ON COMMIT DELETE ROWS table was negligible, but a COMMIT-time error isn't negligible. Back-patch to v13 (all supported versions). Reported-by: Richard Guo <guofenglinux@gmail.com Reported-by: Robins Tharakan <tharakan@gmail.com> Discussion: https://postgr.es/m/CAMbWs4-XwMKMKJ_GT=p3_-_=j9rQSEs1FbDFUnW9zHuKPsPNEQ@mail.gmail.com Backpatch-through: 13
* Fix issue with ORDER BY / DISTINCT aggregates and FILTERDavid Rowley2025-04-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1349d2790 added support so that aggregate functions with an ORDER BY or DISTINCT clause could make use of presorted inputs to avoid an implicit sort within nodeAgg.c. That commit failed to consider that a FILTER clause may exist that filters rows before the aggregate function arguments are evaluated. That can be problematic if an aggregate argument contains an expression which could error out during evaluation. It's perfectly valid to want to have a FILTER clause which eliminates such values, and with the pre-sorted path added in 1349d2790, it was possible that the planner would produce a plan with a Sort node above the Aggregate to perform the sort on the aggregate's arguments long before the Aggregate node would filter out the non-matching values. Here we fix this by inspecting ORDER BY / DISTINCT aggregate functions which have a FILTER clause to see if the aggregate's arguments are anything more complex than a Var or a Const. Evaluating these isn't going to cause an error. If we find any non-Var, non-Const parameters then the planner will now opt to perform the sort in the Aggregate node for these aggregates, i.e. disable the presorted aggregate optimization. An alternative fix would have been to completely disallow the presorted optimization for Aggrefs with any FILTER clause, but that wasn't done as that could cause large performance regressions for queries that see significant gains from 1349d2790 due to presorted results coming in from an Index Scan. Backpatch to 16, where 1349d2790 was introduced Author: David Rowley <dgrowleyml@gmail.com> Reported-by: Kaimeh <kkaimeh@gmail.com> Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAK-%2BJz9J%3DQ06-M7cDJoPNeYbz5EZDqkjQbJnmRyQyzkbRGsYkA%40mail.gmail.com Backpatch-through: 16
* Fix typos and grammar in the codeMichael Paquier2025-04-19
| | | | | | | | The large majority of these have been introduced by recent commits done in the v18 development cycle. Author: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/9a7763ab-5252-429d-a943-b28941e0e28b@gmail.com
* Rename injection points used in AIO testsMichael Paquier2025-04-19
| | | | | | | | | The format of the injection point names used by the AIO code does not match the existing naming convention used everywhere else in the code, so let's be consistent. These points are used in test_aio. Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Discussion: https://postgr.es/m/Z_yTB80bdu1sYDqJ@paquier.xyz
* Make levels 1-based in pg_log_backend_memory_contexts()David Rowley2025-04-18
| | | | | | | | | | | | | | | | | Both pg_get_process_memory_contexts() and pg_backend_memory_contexts have 1-based levels, whereas pg_log_backend_memory_contexts() was using 0-based levels. Align these. This results in slightly saner behavior from MemoryContextStatsDetail() in regards to the max_level. Previously it would stop at 1 level before the maximum requested level rather than at that level. Reported-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Author: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Author: David Rowley <drowleyml@gmail.com Reviewed-by: Melih Mutlu <m.melihmutlu@gmail.com> Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Discussion: https://postgr.es/m/395ea5d4fe190480efa95bf533485c70@oss.nttdata.com
* Suppress "may be used uninitialized" warnings from older compilers.Tom Lane2025-04-17
| | | | | | | | | | The "children" list won't be used until "got_children" has been set true, but older compilers don't get that; about half a dozen buildfarm animals are warning about this. Issue added by 11ff192b5. While here, improve slightly-shaky grammar in comment. Discussion: https://postgr.es/m/2057835.1744833309@sss.pgh.pa.us
* Cache typlens of a SQL function's input arguments.Tom Lane2025-04-17
| | | | | | | | | | | This gets rid of repetitive get_typlen calls in postquel_sub_params, which show up as costing a few percent of the runtime in simple test cases (more with more parameters). In combination with the preceding patches, this gets us most of the way back down to the amount of per-call overhead that functions.c had before commit 0dca5d68d. There are some more things that could be done, but this seems like an okay place to stop for v18.
* Make SQLFunctionCache long-lived again.Tom Lane2025-04-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | At this point, the only data structures we allocate directly in fcontext are the SQLFunctionCache struct itself, the ParamListInfo struct, and the execution_state array, all of which are small and perfectly capable of being re-used across executions of the same FmgrInfo. Hence, let's give them the same lifespan as the FmgrInfo. This step gets rid of the separate SQLFunctionLink struct and makes fn_extra point to SQLFunctionCache again. We also get rid of the separate fcontext memory context and allocate these items directly in fn_mcxt. For notational simplicity, SQLFunctionCache still has an fcontext field, but it's just a copy of fn_mcxt. The motivation for this is to allow these structures to live as long as the FmgrInfo and be re-used across calls, restoring the original design without its propensity for memory leaks. This gets rid of some per-call overhead that we added in 0dca5d68d. We also make an effort to re-use the JunkFilter and result slot. Those might need to change if the function definition changes, so we compromise by rebuilding them if the cached plan changes. This also moves the tuplestore into fn_mcxt so that it can be re-used across calls, again undoing a change made in 0dca5d68d.
* Split some storage out to separate subcontexts of fcontext.Tom Lane2025-04-17
| | | | | | | | | | | | | | | | Put the JunkFilter and its result slot (and thence also some subsidiary data such as the result tupledesc) into a separate subcontext "jfcontext". This doesn't accomplish a lot at this point, because we make a new JunkFilter each time through the SQL function. However, the plan is to make the fcontext long-lived, and that raises the possibility that we'll need a new JunkFilter because the plan for the result-generating query changes. A separate context makes it easy to free the obsoleted data when that happens. Also, instead of always running the sub-executor in fcontext, make a separate context for it if we're doing lazy eval of a SRF, and otherwise just run it inside CurrentMemoryContext.
* Make functions.c mostly run in a short-lived memory context.Tom Lane2025-04-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, much of this code ran with CurrentMemoryContext set to be the function's fcontext, so that we tended to leak a lot of stuff there. Commit 0dca5d68d dealt with that by releasing the fcontext at the completion of each SQL function call, but we'd like to go back to the previous approach of allowing the fcontext to be query-lifespan. To control the leakage problem, rearrange the code so that we mostly run in the memory context that fmgr_sql is called in (which we expect to be short-lived). Notably, this means that parsing/planning is all done in the short-lived context and doesn't leak cruft into fcontext. This patch also fixes the allocation of execution_state records so that we don't leak them across executions. I set that up with a re-usable array that contains at least as many execution_state structs as we need for the current querytree. The chain structure is still there, but it's not really doing much for us, and maybe somebody will be motivated to get rid of it. I'm not though. This incidentally also moves the call of BlessTupleDesc to be with the code that creates the JunkFilter. That doesn't make much difference now, but a later patch will reduce the number of times the JunkFilter gets made, and we needn't bless the results any more often than that. We still leak a fair amount in fcontext, particularly when executing utility statements, but that's material for a separate patch step; the point here is only to get rid of unintentional allocations in fcontext.
* Minor performance improvement for SQL-language functions.Tom Lane2025-04-17
| | | | | | | | | | | | | | | | Late in the development of commit 0dca5d68d, I added a step to copy the result tlist we extract from the cached final query, because I was afraid that that might not last as long as the JunkFilter that we're passing it off to. However, that turns out to cost a noticeable number of cycles, and it's really quite unnecessary because the JunkFilter will not examine that tlist after it's been created. (ExecFindJunkAttribute would use it, but we don't use that function on this JunkFilter.) Hence, remove the copy step. For safety, reset the might-become-dangling jf_targetList pointer to NIL. In passing, remove DR_sqlfunction.cxt, which we don't use anymore; it's confusing because it's not entirely clear which context it ought to point at.
* Assert lack of hazardous buffer locks before possible catalog read.Noah Misch2025-04-17
| | | | | | | | | | | | | | | | | | | | | Commit 0bada39c83a150079567a6e97b1a25a198f30ea3 fixed a bug of this kind, which existed in all branches for six days before detection. While the probability of reaching the trouble was low, the disruption was extreme. No new backends could start, and service restoration needed an immediate shutdown. Hence, add this to catch the next bug like it. The new check in RelationIdGetRelation() suffices to make autovacuum detect the bug in commit 243e9b40f1b2dd09d6e5bf91ebf6e822a2cd3704 that led to commit 0bada39. This also checks in a number of similar places. It replaces each Assert(IsTransactionState()) that pertained to a conditional catalog read. No back-patch for now, but a back-patch of commit 243e9b4 should back-patch this, too. A back-patch could omit the src/test/regress changes, since back branches won't gain new index columns. Reported-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/20250410191830.0e.nmisch@google.com Discussion: https://postgr.es/m/10ec0bc3-5933-1189-6bb8-5dec4114558e@gmail.com
* Another unintentional behavior change in commit e9931bfb75.Jeff Davis2025-04-16
| | | | | | Reported-by: Noah Misch <noah@leadboat.com> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250412123430.8c.nmisch@google.com
* Improve comment in regc_pg_locale.c.Jeff Davis2025-04-16
| | | | | | Reported-by: Noah Misch <noah@leadboat.com> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20250412123430.8c.nmisch@google.com
* Improve comments for estimate_multivariate_ndistinct()David Rowley2025-04-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | estimate_multivariate_ndistinct() is coded to assume the caller handles passing it a list of GroupVarInfos with unique 'var' fields over the entire list. 6bb6a62f3 added code which didn't ensure this and that could result in estimate_multivariate_ndistinct() erroring out with: ERROR: corrupt MVNDistinct entry This occurred because estimate_multivariate_ndistinct() first searches for a set of stats that match to at least two of the given GroupVarInfos and then later assumes that the MVNDistinctItem.items array of the best matching stats will have an entry for those two columns. If the GroupVarInfos List contained a duplicate entry then the same column could be matched to twice and that could trick the code into thinking we have >= 2 columns matched in cases where only a single distinct column has been matched. This could result in a failure to find the correct MVNDistinctItem in the stats as the array containing those never contains an item for single columns. Here we make it more clear that the function needs a distinct set of GroupVarInfos and also tidy up a few other comments to make things a bit easier to follow. Author: David Rowley <drowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvocZCUhM9W9mJ39d6oQz7ePKoqFnao_347mvC-A7QatcQ@mail.gmail.com
* Sync declarations and definitions of two new tablecmds.c functions.Tom Lane2025-04-16
| | | | | | | | | | | | Buildfarm member drongo complained because the definitions of these functions used "const Oid foo" where the forward declarations just had "Oid foo". (I'm a bit surprised that drongo seems to be the only complainant.) I chose to fix this by removing the "consts" because (a) I'm generally not a fan of using const that way, and (b) it was a minority usage even within these two functions, let alone compared to the rest of our code base. Oversight in commit eec0040c4, so no need for back-patch.
* Elide not-null constraint checks on child tables during PK creationÁlvaro Herrera2025-04-16
| | | | | | | | | | | | | | | | | | | | | | | | | We were unnecessarily acquiring AccessExclusiveLock on all child tables when "ALTER TABLE ONLY sometab ADD PRIMARY KEY" was run on their parent table, an oversight in commit 14e87ffa5c54. This caused deadlocks during pg_restore of partitioned tables. The reason to acquire the AEL was that we need to verify that child tables have the involved columns already marked as not-null; but if the parent table has an inheritable not-null constraint, then all children must necessarily be in the correct state already, so we can skip the check, which avoids acquiring the lock. Reorder the code so that it works that way. This doesn't change things in the case where the constraint doesn't exist, but that case is of lesser importance because it doesn't occur during parallel pg_restore. While at it, reword some errmsg() and add errhint() to similar cases in related but not adjacent code. Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/67469c1c-38bc-7d94-918a-67033f5dd731@gmx.net Discussion: https://postgr.es/m/2045026.1743801143@sss.pgh.pa.us Discussion: https://postgr.es/m/1280408.1744650810@sss.pgh.pa.us
* Fix an incorrect check in get_memoize_pathRichard Guo2025-04-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Memoize typically marks cache entries as complete after fully scanning the inner side of a join. However, in the case of unique joins, we skip to the next outer tuple as soon as the first matching inner tuple is found, leaving no opportunity to scan the inner side to completion. To work around that, we mark cache entries as complete after fetching the first matching inner tuple in unique joins. This approach is only safe when all of the join's restriction clauses are parameterized; otherwise, there is no guarantee that reading just one tuple from the inner side is sufficient. Currently, we check for this by verifying that the number of clauses in ppi_clauses is no less than the number of the join's restriction clauses. However, this check isn't entirely reliable, as ppi_clauses includes join clauses available from all outer rels, not just the current outer rel. This means the check could pass even if a restriction clause isn't parameterized, as long as another join clause, which doesn't belong to the current join, is included in ppi_clauses. To fix this, we explicitly check whether each restriction clause of the current join is present in ppi_clauses. While we're here, remove the XXX comment from the modified code, as it's not justified; in certain cases, it's not possible to move a join clause to the inner side. This is arguably a bugfix, but no backpatch given the lack of field reports. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Discussion: https://postgr.es/m/CAMbWs4-8JPouj=wBDj4DhK-WO4+Xdx=A2jbjvvyyTBQneJ1=BQ@mail.gmail.com
* Fix failure for generated column with a not-null domain constraint.Tom Lane2025-04-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a GENERATED column is declared to have a domain data type where the domain's constraints disallow null values, INSERT commands failed because we built a targetlist that included coercing a null constant to the domain's type. The failure occurred even when the generated value would have been perfectly OK. This is adjacent to the issues fixed in 0da39aa76, but we didn't notice for lack of testing a domain with such a constraint. We aren't going to use the result of the targetlist entry for the generated column --- ExecComputeStoredGenerated will overwrite it. So it's not really necessary that it have the exact datatype of the generated column. This patch fixes the problem by changing the targetlist entry to be a null Const of the domain's base type, which should be sufficiently legal. (We do have to tweak ExecCheckPlanOutput to accept the situation, though.) This has been broken since we implemented generated columns. However, this patch only applies easily as far back as v14, partly because I (tgl) only carried 0da39aa76 back that far, but mostly because v14 significantly refactored the handling of INSERT/UPDATE targetlists. Given the lack of field complaints and the short remaining support lifetime of v13, I judge the cost-benefit ratio not good for devising a version that would work in v13. Reported-by: jian he <jian.universality@gmail.com> Author: jian he <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACJufxG59tip2+9h=rEv-ykOFjt0cbsPVchhi0RTij8bABBA0Q@mail.gmail.com Backpatch-through: 14
* Fix GIN's shimTriConsistentFn to not corrupt its input.Tom Lane2025-04-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0f21db36d made an assumption that GIN triConsistentFns would not modify their input entryRes[] arrays. But in fact, the "shim" triConsistentFn that we use for opclasses that don't supply their own did exactly that, potentially leading to wrong answers from a GIN index search. Through bad luck, none of the test cases that we have for such opclasses exposed the bug. One response to this could be that the assumption of consistency check functions not modifying entryRes[] arrays is a bad one, but it still seems reasonable to me. Notably, shimTriConsistentFn is itself assuming that with respect to the underlying boolean consistentFn, so it's sure being self-centered in supposing that it gets to do so. Fortunately, it's quite simple to fix shimTriConsistentFn to restore the entry-time state of entryRes[], so let's do that instead. This issue doesn't affect any core GIN opclasses, since they all supply their own triConsistentFns. It does affect contrib modules btree_gin, hstore, and intarray. Along the way, I (tgl) noticed that shimTriConsistentFn failed to pick up on a "recheck" flag returned by its first call to the boolean consistentFn. This may be only a latent problem, since it would be unlikely for a consistentFn to set recheck for the all-false case and not any other cases. (Indeed, none of our contrib modules do that.) Nonetheless, it's formally wrong. Reported-by: Vinod Sridharan <vsridh90@gmail.com> Author: Vinod Sridharan <vsridh90@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAFMdLD7XzsXfi1+DpTqTgrD8XU0i2C99KuF=5VHLWjx4C1pkcg@mail.gmail.com Backpatch-through: 13
* Harmonize function parameter names for Postgres 18.Peter Geoghegan2025-04-12
| | | | | | | | | | Make sure that function declarations use names that exactly match the corresponding names from function definitions in a few places. These inconsistencies were all introduced during Postgres 18 development. This commit was written with help from clang-tidy, by mechanically applying the same rules as similar clean-up commits (the earliest such commit was commit 035ce1fe).
* Fix recently introduced typosDaniel Gustafsson2025-04-11
| | | | | | | | | This fixes typos in docs and comments introduced during the v18 development cycle, to keep them from ending up in backbranches. Author: Jacob Brazeal <jacob.brazeal@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/CA+COZaCgGua25f2hSrjrDLJcJJAHkwoKgTTqUy-wyL1=64JNjw@mail.gmail.com
* Fix race with synchronous_standby_names at startupMichael Paquier2025-04-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | synchronous_standby_names cannot be reloaded safely by backends, and the checkpointer is in charge of updating a state in shared memory if the GUC is enabled in WalSndCtl, to let the backends know if they should wait or not for a given LSN. This provides a strict control on the timing of the waiting queues if the GUC is enabled or disabled, then reloaded. The checkpointer is also in charge of waking up the backends that could be waiting for a LSN when the GUC is disabled. This logic had a race condition at startup, where it would be possible for backends to not wait for a LSN even if synchronous_standby_names is enabled. This would cause visibility issues with transactions that we should be waiting for but they were not. The problem lasts until the checkpointer does its initial update of the shared memory state when it loads synchronous_standby_names. In order to take care of this problem, the shared memory state in WalSndCtl is extended to detect if it has been initialized by the checkpointer, and not only check if synchronous_standby_names is defined. In WalSndCtlData, sync_standbys_defined is renamed to sync_standbys_status, a bits8 able to know about two states: - If the shared memory state has been initialized. This flag is set by the checkpointer at startup once, and never removed. - If synchronous_standby_names is known as defined in the shared memory state. This is the same as the previous sync_standbys_defined in WalSndCtl. This method gives a way for backends to decide what they should do until the shared memory area is initialized, and they now ultimately fall back to a check on the GUC value in this case, which is the best thing that can be done. Fortunately, SyncRepUpdateSyncStandbysDefined() is called immediately by the checkpointer when this process starts, so the window is very narrow. It is possible to enlarge the problematic window by making the checkpointer wait at the beginning of SyncRepUpdateSyncStandbysDefined() with a hardcoded sleep for example, and doing so has showed that a 2PC visibility test is indeed failing. On machines slow enough, this bug would cause spurious failures. In 17~, we have looked at the possibility of adding an injection point to have a reproducible test, but as the problematic window happens at early startup, we would need to invent a way to make an injection point optionally persistent across restarts when attached, something that would be fine for this case as it would involve the checkpointer. This issue is quite old, and can be reproduced on all the stable branches. Author: Melnikov Maksim <m.melnikov@postgrespro.ru> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/163fcbec-900b-4b07-beaa-d2ead8634bec@postgrespro.ru Backpatch-through: 13
* Add code comment explaining ins_since_vacuum and aborted insertsDavid Rowley2025-04-11
| | | | | | | | | | | | | | | | | | | | | Sami complained that there's a discrepancy between n_mod_since_analyze and n_ins_since_vacuum, as the former only accounts for committed changes and the latter tracks committed and aborted inserts. Nobody seemed overly concerned that this would cause any concerning issues. The repercussions, from what I can tell, are limited to causing an autovacuum to trigger for inserts sooner than it otherwise might. For typical ratios of commits to aborts, it's unlikely to ever be noticed. Fixing things to make it so n_ins_since_vacuum only displays committed inserts would require an additional field in PgStat_TableCounts, which does not quite seem worthwhile at this stage. This commit just adds a comment with some details to mention that we know about it, which will hopefully prevent repeat discussions. Reported-by: Sami Imseih <samimseih@gmail.com> Author: David Rowley <drowleyml@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/CAApHDvpgV3a-R2EGmPOh0L-x3pHbZpM3y4dySWfy+UqUazwDQA@mail.gmail.com
* Improve various new-to-v18 appendStringInfo callsDavid Rowley2025-04-11
| | | | | | | | | Similar to 8461424fd, here we adjust a few new locations which were not using the most suitable appendStringInfo* function for the intended purpose. Author: David Rowley <drowleyml@gmail.com Discussion: https://postgr.es/m/CAApHDvqJnNjueb=Eoj8K+8n0g7nj_AcPWSiCj5RNV4fDejAfqA@mail.gmail.com
* Rename global variable backing DSA areaDaniel Gustafsson2025-04-10
| | | | | | | | | | | | The global variable backing the DSA area for Memory Context stats reporting had a too generic name, rename to be more descriptive. Independently reported by Peter and Laurenz. Author: Daniel Gustafsson <daniel@yesql.se> Reported-by: Peter Eisentraut <peter@eisentraut.org> Reported-by: Laurenz Albe <laurenz.albe@cybertec.at> Discussion: https://postgr.es/m/d51172bd4e7f4b07a18a0288ca1b1c28a71a5f6a.camel@cybertec.at Discussion: https://postgr.es/m/25095db5-b595-4b85-9100-d358907c25b5@eisentraut.org
* Remove useless check for negative result of ip_addrsize().Tom Lane2025-04-10
| | | | | | | | | | | | By inspection, ip_addrsize() can't return a negative result. (If it could, we'd have way bigger problems elsewhere.) So delete useless check in network_send(). Most C compilers are probably perfectly capable of removing this code by themselves, but it's confusing/misleading. Bug: #18889 Reported-by: Daniel Elishakov <dan-eli@mail.ru> Discussion: https://postgr.es/m/18889-73d4f19e953a629e@postgresql.org
* Fix data loss in logical replication.Amit Kapila2025-04-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Data loss can happen when the DDLs like ALTER PUBLICATION ... ADD TABLE ... or ALTER TYPE ... that don't take a strong lock on table happens concurrently to DMLs on the tables involved in the DDL. This happens because logical decoding doesn't distribute invalidations to concurrent transactions and those transactions use stale cache data to decode the changes. The problem becomes bigger because we keep using the stale cache even after those in-progress transactions are finished and skip the changes required to be sent to the client. This commit fixes the issue by distributing invalidation messages from catalog-modifying transactions to all concurrent in-progress transactions. This allows the necessary rebuild of the catalog cache when decoding new changes after concurrent DDL. We observed performance regression primarily during frequent execution of *publication DDL* statements that modify the published tables. The regression is minor or nearly nonexistent for DDLs that do not affect the published tables or occur infrequently, making this a worthwhile cost to resolve a longstanding data loss issue. An alternative approach considered was to take a strong lock on each affected table during publication modification. However, this would only address issues related to publication DDLs (but not the ALTER TYPE ...) and require locking every relation in the database for publications created as FOR ALL TABLES, which is impractical. The bug exists in all supported branches, but we are backpatching till 14. The fix for 13 requires somewhat bigger changes than this fix, so the fix for that branch is still under discussion. Reported-by: hubert depesz lubaczewski <depesz@depesz.com> Reported-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Tested-by: Benoit Lobréau <benoit.lobreau@dalibo.com> Backpatch-through: 14 Discussion: https://postgr.es/m/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com Discussion: https://postgr.es/m/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com
* Update wording in optimizer/README for EquivalenceClassesDavid Rowley2025-04-10
| | | | | | | | | | | | | | | | | | d69d45a5a changed how em_is_child members are stored in EquivalenceClasses. Children are no longer stored in the ec_members list. optimizer/README mentioned that most operations "should ignore child members", but that felt a little untrue now since child members are now stored in a separate place, they simply won't be found by the normal means of looking (a foreach loop over ec_members), and if you don't find them, there's technically no need to "ignore" them. Here we tweak the wording slightly to reflect the new storage location for child members. Reported-by: Amit Langote <amitlangote09@gmail.com> Author: Amit Langote <amitlangote09@gmail.com> Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CA+HiwqE8v=EuAP_3F_A2xn8zWx+nG_etW_Fe_DvKO-Fkx=+DdQ@mail.gmail.com
* Cleanup of pg_numa.cTomas Vondra2025-04-09
| | | | | | | | | | | | | | | | | | | | | | | | | This moves/renames some of the functions defined in pg_numa.c: * pg_numa_get_pagesize() is renamed to pg_get_shmem_pagesize(), and moved to src/backend/storage/ipc/shmem.c. The new name better reflects that the page size is not related to NUMA, and it's specifically about the page size used for the main shared memory segment. * move pg_numa_available() to src/backend/storage/ipc/shmem.c, i.e. into the backend (which more appropriate for functions callable from SQL). While at it, improve the comment to explain what page size it returns. * remove unnecessary includes from src/port/pg_numa.c, adding unnecessary dependencies (src/port should be suitable for frontent). These were either leftovers or unnecessary thanks to the other changes in this commit. This eliminates unnecessary dependencies on backend symbols, which we don't want in src/port. Reported-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> https://postgr.es/m/CALdSSPi5fj0a7UG7Fmw2cUD1uWuckU_e8dJ+6x-bJEokcSXzqA@mail.gmail.com
* Fix a few oversights in the longer cancel keys patchHeikki Linnakangas2025-04-09
| | | | | | | | | | | | Change MyCancelKeyLength's type from uint8 to int. While it always fits in a uint8, plain int is less surprising, as there's no particular reason for it to be uint8. Fix one ProcSignalInit caller that passed 'false' instead of NULL for the pointer argument. Author: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/61be9e31-7b7d-49d5-bc11-721800d89d64@eisentraut.org
* Doc: note that two examples in optimizer/README are oversimplified.Tom Lane2025-04-08
| | | | | | | | | | | | These examples fail to account for join clauses generated by EquivalenceClasses, but since we haven't mentioned EquivalenceClasses yet it seems like it'd just add confusion to make them fully accurate. Instead, parenthetically note that they're oversimplified. Reported-by: Zeyuan Hu <ferrishu3886@gmail.com> Co-authored-by: David Rowley <dgrowleyml@gmail.com> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACvHWmYFo+60yMqKJajDDvKN5EM41YHrCT3oxukwXmGAqpWvyw@mail.gmail.com
* Fix uninitialized index information access during apply.Amit Kapila2025-04-08
| | | | | | | | | | | | | | | | | | | | | The issue happens when building conflict information during apply of INSERT or UPDATE operations that violate unique constraints on leaf partitions. The problem was introduced in commit 9ff68679b5, which removed the redundant calls to ExecOpenIndices/ExecCloseIndices. The previous code was relying on the redundant ExecOpenIndices call in apply_handle_tuple_routing() to build the index information required for unique key conflict detection. The fix is to delay building the index information until a conflict is detected instead of relying on ExecOpenIndices to do the same. The additional benefit of this approach is that it avoids building index information when there is no conflict. Author: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by:Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/TYAPR01MB57244ADA33DDA57119B9D26494A62@TYAPR01MB5724.jpnprd01.prod.outlook.com
* Introduce file_copy_method setting.Thomas Munro2025-04-08
| | | | | | | | | | | | | | | | | | | | | It can be set to either COPY (the default) or CLONE if the system supports it. CLONE causes callers of copydir(), currently CREATE DATABASE ... STRATEGY=FILE_COPY and ALTER DATABASE ... SET TABLESPACE = ..., to use copy_file_range (Linux, FreeBSD) or copyfile (macOS) to copy files instead of a read-write loop over the contents. CLONE gives the kernel the opportunity to share block ranges on copy-on-write file systems and push copying down to storage on others, depending on configuration. On some systems CLONE can be used to clone large databases quickly with CREATE DATABASE ... TEMPLATE=source STRATEGY=FILE_COPY. Other operating systems could be supported; patches welcome. Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Ranier Vilela <ranier.vf@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGLM%2Bt%2BSwBU-cHeMUXJCOgBxSHLGZutV5zCwY4qrCcE02w%40mail.gmail.com
* Add function to get memory context stats for processesDaniel Gustafsson2025-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a function for retrieving memory context statistics and information from backends as well as auxiliary processes. The intended usecase is cluster debugging when under memory pressure or unanticipated memory usage characteristics. When calling the function it sends a signal to the specified process to submit statistics regarding its memory contexts into dynamic shared memory. Each memory context is returned in detail, followed by a cumulative total in case the number of contexts exceed the max allocated amount of shared memory. Each process is limited to use at most 1Mb memory for this. A summary can also be explicitly requested by the user, this will return the TopMemoryContext and a cumulative total of all lower contexts. In order to not block on busy processes the caller specifies the number of seconds during which to retry before timing out. In the case where no statistics are published within the set timeout, the last known statistics are returned, or NULL if no previously published statistics exist. This allows dash- board type queries to continually publish even if the target process is temporarily congested. Context records contain a timestamp to indicate when they were submitted. Author: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Discussion: https://postgr.es/m/CAH2L28v8mc9HDt8QoSJ8TRmKau_8FM_HKS41NeO9-6ZAkuZKXw@mail.gmail.com
* Increase BAS_BULKREAD based on effective_io_concurrencyAndres Freund2025-04-08
| | | | | | | | | | | | | | | | | | | | | | | | Before, BAS_BULKREAD was always of size 256kB. With the default io_combine_limit of 16, that only allowed 1-2 IOs to be in flight - insufficient even on very low latency storage. We don't just want to increase the size to a much larger hardcoded value, as very large rings (10s of MBs of of buffers), appear to have negative performance effects when reading in data that the OS has cached (but not when actually needing to do IO). To address this, increase the size of BAS_BULKREAD to allow for io_combine_limit * effective_io_concurrency buffers getting read in. To prevent the ring being much larger than useful, limit the increased size with GetPinLimit(). The formula outlined above keeps the ring size to sizes for which we have not observed performance regressions, unless very large effective_io_concurrency values are used together with large shared_buffers setting. Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/lqwghabtu2ak4wknzycufqjm5ijnxhb4k73vzphlt2a3wsemcd@gtftg44kdim6 Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt
* Add pg_buffercache_evict_{relation,all} functionsAndres Freund2025-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In addition to the added functions, the pg_buffercache_evict() function now shows whether the buffer was flushed. pg_buffercache_evict_relation(): Evicts all shared buffers in a relation at once. pg_buffercache_evict_all(): Evicts all shared buffers at once. Both functions provide mechanism to evict multiple shared buffers at once. They are designed to address the inefficiency of repeatedly calling pg_buffercache_evict() for each individual buffer, which can be time-consuming when dealing with large shared buffer pools. (e.g., ~477ms vs. ~2576ms for 16GB of fully populated shared buffers). These functions are intended for developer testing and debugging purposes and are available to superusers only. Minimal tests for the new functions are included. Also, there was no test for pg_buffercache_evict(), test for this added too. No new extension version is needed, as it was already increased this release by ba2a3c2302f. Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Aidar Imamov <a.imamov@postgrespro.ru> Reviewed-by: Joseph Koshakow <koshy44@gmail.com> Discussion: https://postgr.es/m/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw%40mail.gmail.com
* Speedup child EquivalenceMember lookup in plannerDavid Rowley2025-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When planning queries to partitioned tables, we clone all EquivalenceMembers belonging to the partitioned table into em_is_child EquivalenceMembers for each non-pruned partition. For partitioned tables with large numbers of partitions, this meant the ec_members list could become large and code searching that list would become slow. Effectively, the more partitions which were present, the more searches needed to be performed for operations such as find_ec_member_matching_expr() during create_plan() and the more partitions present, the longer these searches would take, i.e., a quadratic slowdown. To fix this, here we adjust how we store EquivalenceMembers for em_is_child members. Instead of storing these directly in ec_members, these are now stored in a new array of Lists in the EquivalenceClass, which is indexed by the relid. When we want to find EquivalenceMembers belonging to a certain child relation, we can narrow the search to the array element for that relation. To make EquivalenceMember lookup easier and to reduce the amount of code change, this commit provides a pair of functions to allow iteration over the EquivalenceMembers of an EC which also handles finding the child members, if required. Callers that never need to look at child members can remain using the foreach loop over ec_members, which will now often be faster due to only parent-level members being stored there. The actual performance increases here are highly dependent on the number of partitions and the query being planned. Performance increases can be visible with as few as 8 partitions, but the speedup is marginal for such low numbers of partitions. The speedups become much more visible with a few dozen to hundreds of partitions. With some tested queries using 56 partitions, the planner was around 3x faster than before. For use cases with thousands of partitions, these are likely to become significantly faster. Some testing has shown planner speedups of 60x or more with 8192 partitions. Author: Yuya Watari <watari.yuya@gmail.com> Co-authored-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrey Lepikhov <a.lepikhov@postgrespro.ru> Reviewed-by: Alena Rybakina <lena.ribackina@yandex.ru> Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Tested-by: Thom Brown <thom@linux.com> Tested-by: newtglobal postgresql_contributors <postgresql_contributors@newtglobalcorp.com> Discussion: https://postgr.es/m/CAJ2pMkZNCgoUKSE%2B_5LthD%2BKbXKvq6h2hQN8Esxpxd%2Bcxmgomg%40mail.gmail.com
* Stabilize 035_standby_logical_decoding.pl.Amit Kapila2025-04-08
| | | | | | | | | | | | | | | | | | | | Some tests try to invalidate logical slots on the standby server by running VACUUM on the primary. The problem is that xl_running_xacts was getting generated and replayed before the VACUUM command, leading to the advancement of the active slot's catalog_xmin. Due to this, active slots were not getting invalidated, leading to test failures. We fix it by skipping the generation of xl_running_xacts for the required tests with the help of injection points. As the required interface for injection points was not present in back branches, we fixed the failing tests in them by disallowing the slot to become active for the required cases (where rows_removed conflict could be generated). Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 16, where it was introduced Discussion: https://postgr.es/m/Z6oQXc8LmiTLfwLA@ip-10-97-1-34.eu-west-3.compute.internal
* Fix PG 17 [NOT] NULL optimization bug for domainsBruce Momjian2025-04-07
| | | | | | | | | | | | | | | | A PG 17 optimization allowed columns with NOT NULL constraints to skip table scans for IS NULL queries, and to skip IS NOT NULL checks for IS NOT NULL queries. This didn't work for domain types, since domain types don't follow the IS NULL/IS NOT NULL constraint logic. To fix, disable this optimization for domains for PG 17+. Reported-by: Jan Behrens Diagnosed-by: Tom Lane Discussion: https://postgr.es/m/Z37p0paENWWUarj-@momjian.us Backpatch-through: 17
* Flush the IO statistics of active WAL senders more frequentlyMichael Paquier2025-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | WAL senders do not flush their statistics until they exit, limiting the monitoring possible for live processes. This is penalizing when WAL senders are running for a long time, like in streaming or logical replication setups, because it is not possible to know the amount of IO they generate while running. This commit makes WAL senders more aggressive with their statistics flush, using an internal of 1 second, with the flush timing calculated based on the existing GetCurrentTimestamp() done before the sleeps done to wait for some activity. Note that the sleep done for logical and physical WAL senders happens in two different code paths, so the stats flushes need to happen in these two places. One test is added for the physical WAL sender case, and one for the logical WAL sender case. This can be done in a stable fashion by relying on the WAL generated by the TAP tests in combination with a stats reset while a server is running, but only on HEAD as WAL data has been added to pg_stat_io in a051e71e28a1. This issue exists since a9c70b46dbe and the introduction of pg_stat_io, so backpatch down to v16. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: vignesh C <vignesh21@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/Z73IsKBceoVd4t55@ip-10-97-1-34.eu-west-3.compute.internal Backpatch-through: 16
* Introduce pg_shmem_allocations_numa viewTomas Vondra2025-04-07
| | | | | | | | | | | | | | | | | | | | | | Introduce new pg_shmem_alloctions_numa view with information about how shared memory is distributed across NUMA nodes. For each shared memory segment, the view returns one row for each NUMA node backing it, with the total amount of memory allocated from that node. The view may be relatively expensive, especially when executed for the first time in a backend, as it has to touch all memory pages to get reliable information about the NUMA node. This may also force allocation of the shared memory. Unlike pg_shmem_allocations, the view does not show anonymous shared memory allocations. It also does not show memory allocated using the dynamic shared memory infrastructure. Author: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
* Add support for basic NUMA awarenessTomas Vondra2025-04-07
| | | | | | | | | | | | | | | | | | | | | | Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c portability wrapper and an optional build dependency, enabled by --with-libnuma configure option. For now this is Linux-only, other platforms may be supported later. A built-in SQL function pg_numa_available() allows checking NUMA support, i.e. that the server was built/linked with the NUMA library. The main function introduced is pg_numa_query_pages(), which allows determining the NUMA node for individual memory pages. Internally the function uses move_pages(2) syscall, as it allows batching, and is more efficient than get_mempolicy(2). Author: Jakub Wartak <jakub.wartak@enterprisedb.com> Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
* aio: Make AIO more compatible with valgrindAndres Freund2025-04-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In some edge cases valgrind flags issues with the memory referenced by IOs. All of the cases addressed in this change are false positives. Most of the false positives are caused by UnpinBuffer[NoOwner] marking buffer data as inaccessible. This happens even though the AIO subsystem still holds a pin. That's good, there shouldn't be accesses to the buffer outside of AIO related code until it is pinned by "user" code again. But it requires some explicit work - if the buffer is not pinned by the current backend, we need to explicitly mark the buffer data accessible/inaccessible while executing completion callbacks. That however causes a cascading issue in IO workers: After the completion callbacks for a buffer is executed, the page is marked as inaccessible. If subsequently the same worker is executing IO targeting the same buffer, we would get an error, as the memory is still marked inaccessible. To avoid that, we need to explicitly mark the memory as accessible in IO workers. Another issue is that IO executed in workers or via io_uring will not mark memory as DEFINED. In the case of workers that is because valgrind does not track memory definedness across processes. For io_uring that is because valgrind does not understand io_uring, and therefore its IOs never mark memory as defined, whether the completions are processed in the defining process or in another context. It's not entirely clear how to best solve that. The current user of AIO is not affected, as it explicitly marks buffers as DEFINED & NOACCESS anyway. Defer solving this issue until we have a user with different needs. Per buildfarm animal skink. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/3pd4322mogfmdd5nln3zphdwhtmq3rzdldqjwb2sfqzcgs22lf@ok2gletdaoe6