aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
...
* Fix typoPeter Eisentraut2025-03-08
| | | | | | | | Duplicate assignment in commit af4002b381d should have been a different field. (But it didn't affect the outcome.) Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/E1tngY6-0000UL-2n%40gemulon.postgresql.org
* nbtree: refine _bt_readnextpage contract comments.Peter Geoghegan2025-03-07
| | | | | Another minor follow-up commit for commit 1bd4bc85, which changed the _bt_readnextpage contract.
* Include column name in build_attrmap_by_position's error reports.Tom Lane2025-03-07
| | | | | | | | | | | | | | Formerly we only provided the column number, but it's frequently more useful to mention the column name. The input tupdesc often doesn't have useful column names, but the output tupdesc usually contains user-supplied names, so report that one. Author: Marcos Pegoraro <marcos@f10.com.br> Co-authored-by: jian he <jian.universality@gmail.com> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Co-authored-by: Erik Wienhold <ewie@ewie.name> Reviewed-by: Vladlen Popolitov <v.popolitov@postgrespro.ru> Discussion: https://postgr.es/m/CAB-JLwanky28gjAMdnMh1CjyO1b2zLdr6UOA1-oY9G7PVL9KKQ@mail.gmail.com
* Rename amcancrosscomparePeter Eisentraut2025-03-07
| | | | | | | | | | After more discussion about commit ce62f2f2a0a, rename the index AM property amcancrosscompare to two separate properties amconsistentequality and amconsistentordering. Also improve the documentation and update some comments that were previously missed. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/E1tngY6-0000UL-2n%40gemulon.postgresql.org
* Fix some performance issues in GIN query startup.Tom Lane2025-03-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a GIN index search had a lot of search keys (for example, "jsonbcol ?| array[]" with tens of thousands of array elements), both ginFillScanKey() and startScanKey() took O(N^2) time. Worse, those loops were uncancelable for lack of CHECK_FOR_INTERRUPTS. The problem in ginFillScanKey() is the brute-force search key de-duplication done in ginFillScanEntry(). The most expedient solution seems to be to just stop trying to de-duplicate once there are "too many" search keys. We could imagine working harder, say by using a sort-and-unique algorithm instead of brute force compare-all-the-keys. But it seems unlikely to be worth the trouble. There is no correctness issue here, since the code already allowed duplicate keys if any extra_data is present. The problem in startScanKey() is the loop that attempts to identify the first non-required search key. In the submitted test case, that vainly tests all the key positions, and each iteration takes O(N) time. One part of that is that it's reinitializing the entryRes[] array from scratch each time, which is entirely unnecessary given that the triConsistentFn isn't supposed to scribble on its input. We can easily adjust the array contents incrementally instead. The other part of it is that the triConsistentFn may itself take O(N) time (and does in this test case). This is all extremely brute force: in simple cases with AND or OR semantics, we could know without any looping whatever that all or none of the keys are required. But GIN opclasses don't have any API for exposing that knowledge, so at least in the short run there is little to be done about that. Put in a CHECK_FOR_INTERRUPTS so that at least the loop is cancelable. These two changes together resolve the primary complaint that the test query doesn't respond promptly to cancel interrupts. Also, while they don't completely eliminate the O(N^2) behavior, they do provide quite a nice speedup for mid-sized examples. Bug: #18831 Reported-by: Niek <niek.brasa@hitachienergy.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/18831-e845ac44ebc5dd36@postgresql.org Backpatch-through: 13
* Revert "Show index search count in EXPLAIN ANALYZE."Peter Geoghegan2025-03-05
| | | | | | | This reverts commit 5ead85fbc81162ab1594f656b036a22e814f96b3. This commit shows test failures with debug_parallel_query=regress. The underlying issue needs to be debugged, so revert for now.
* Show index search count in EXPLAIN ANALYZE.Peter Geoghegan2025-03-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Expose the count of index searches/index descents in EXPLAIN ANALYZE's output for index scan nodes. This information is particularly useful with scans that use ScalarArrayOp quals, where the number of index scans isn't predictable in advance (at least not with optimizations like the one added to nbtree by Postgres 17 commit 5bf748b8). It will also be useful when EXPLAIN ANALYZE shows details of an nbtree index scan that uses skip scan optimizations set to be introduced by an upcoming patch. The instrumentation works by teaching index AMs to increment a new nsearches counter whenever a new index search begins. The counter is incremented at exactly the same point that index AMs must already increment the index's pg_stat_*_indexes.idx_scan counter (we're counting the same event, but at the scan level rather than the relation level). The new counter is stored in the scan descriptor (IndexScanDescData), which explain.c reaches by going through the scan node's PlanState. This approach doesn't match the approach used when tracking other index scan specific costs (e.g., "Rows Removed by Filter:"). It is similar to the approach used in other cases where we must track costs that are only readily accessible inside an access method, and not from the executor (e.g., "Heap Blocks:" output for a Bitmap Heap Scan). It is inherently necessary to maintain a counter that can be incremented multiple times during a single amgettuple call (or amgetbitmap call), and directly exposing PlanState.instrument to index access methods seems unappealing. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com
* Rename some signal and interrupt handling functions for consistencyHeikki Linnakangas2025-03-05
| | | | | | | | | | | | | | | | | | | | | | | | The usual pattern for handling a signal is that the signal handler sets a flag and calls SetLatch(MyLatch), and CHECK_FOR_INTERRUPTS() or other code that is part of a wait loop calls another function to deal with it. The naming of the functions involved was a bit inconsistent, however. CHECK_FOR_INTERRUPTS() calls ProcessInterrupts() to do the heavy-lifting, but the analogous functions in aux processes were called HandleMainLoopInterrupts(), HandleStartupProcInterrupts(), etc. Similarly, most subroutines of ProcessInterrupts() were called Process*(), but some were called Handle*(). To make things less confusing, rename all the functions that are part of the overall signal/interrupt handling system but are not executed in a signal handler to e.g. ProcessSomething(), rather than HandleSomething(). The "Process" prefix is now consistently used in the non-signal-handler functions, and the "Handle" prefix in functions that are part of signal handlers, except for some completely unrelated functions that clearly have nothing to do with signal or interrupt handling. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://www.postgresql.org/message-id/8a384b26-1499-41f6-be33-64b801fb98b8@iki.fi
* Enforce memory limit during parallel GIN buildsTomas Vondra2025-03-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Index builds are expected to respect maintenance_work_mem, just like other maintenance operations. For serial builds this is done simply by flushing the buffer in ginBuildCallback() into the index. But with parallel builds it's more complicated, because there are multiple places that can allocate memory. ginBuildCallbackParallel() does the same thing as ginBuildCallback(), except that the accumulated items are written into tuplesort. Then the entries with the same key get merged - first in the worker, then in the leader - and the TID lists may get (arbitrarily) long. It's unlikely it would exceed the memory limit, but it's possible. We address this by evicting some of the data if the list gets too long. We can't simply dump the whole in-memory TID list. The GIN index bulk insert code expects to see TIDs in monotonic order; it may fail if the TIDs go backwards. If the TID lists overlap, evicting the whole current TID list would break this (a later entry might add "old" TID values into the already-written part). In the workers this is not an issue, because the lists never overlap. But the leader may see overlapping lists produced by the workers. We can however derive a safe "horizon" TID - the entries (for a given key) are sorted by (key, first TID), which means no future list can add values before the last "first TID" we've seen. This patch tracks the "frozen" part of the TID list, which we know can't change by merging additional TID lists. If needed, we can evict this part of the list. We don't want to do this too often - the smaller lists we evict, the more expensive it'll be to merge them in the next step (especially in the leader). Therefore we only trim the list if we have at least 1024 frozen items, and if the whole list is at least 64kB large. These thresholds are somewhat arbitrary and conservative. We might calculate the values from maintenance_work_mem, but tests show that does not really improve anything (time, compression ratio, ...). So we stick to these conservative values to release memory faster. Author: Tomas Vondra Reviewed-by: Matthias van de Meent, Andy Fan, Kirill Reshke Discussion: https://postgr.es/m/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com
* Compress TID lists when writing GIN tuples to diskTomas Vondra2025-03-04
| | | | | | | | | | | | | | | | | | | | | | | When serializing GIN tuples to tuplesorts during parallel index builds, we can significantly reduce the amount of data by compressing the TID lists. The GIN opclasses may produce a lot of data (depending on how many keys are extracted from each row), and the TID compression is very efficient and effective. If the number of distinct keys is high, the first worker pass (reading data from the table and writing them into a private tuplesort) may not benefit from the compression very much. It is likely to spill data to disk before the TID lists get long enough for the compression to help. The second pass (writing the merged data into the shared tuplesort) is more likely to benefit from compression. The compression can be seen as a way to reduce the amount of disk space needed by the parallel builds, because the data is written twice. First into the per-worker tuplesorts, then into the shared tuplesort. Author: Tomas Vondra Reviewed-by: Matthias van de Meent, Andy Fan, Kirill Reshke Discussion: https://postgr.es/m/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com
* Fix outdated commentHeikki Linnakangas2025-03-04
| | | | | | | Commit bc971f4025 replaced the latch-setting mechanism that the comment talked about with a condition variable. And before that, commit 2258e76f90 moved the code so that the comment got detached from the loop that it talked about, so move the comment closer to the loop.
* Add relallfrozen to pg_classMelanie Plageman2025-03-03
| | | | | | | | | | | | | | | | | | | | | | | | | | Add relallfrozen, an estimate of the number of pages marked all-frozen in the visibility map. pg_class already has relallvisible, an estimate of the number of pages in the relation marked all-visible in the visibility map. This is used primarily for planning. relallfrozen, together with relallvisible, is useful for estimating the outstanding number of all-visible but not all-frozen pages in the relation for the purposes of scheduling manual VACUUMs and tuning vacuum freeze parameters. A future commit will use relallfrozen to trigger more frequent vacuums on insert-focused workloads with significant volume of frozen data. Bump catalog version Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_aj-P7YyBz_cPNwztz6ohP%2BvWis%3Diz3YcomkB3NpYA--w%40mail.gmail.com
* Allow parallel CREATE INDEX for GIN indexesTomas Vondra2025-03-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow using parallel workers to build a GIN index, similarly to BTREE and BRIN. For large tables this may result in significant speedup when the build is CPU-bound. The work is divided so that each worker builds index entries on a subset of the table, determined by the regular parallel scan used to read the data. Each worker uses a local tuplesort to sort and merge the entries for the same key. The TID lists do not overlap (for a given key), which means the merge sort simply concatenates the two lists. The merged entries are written into a shared tuplesort for the leader. The leader needs to merge the sorted entries again, before writing them into the index. But this way a significant part of the work happens in the workers, and the leader is left with merging fewer large entries, which is more efficient. Most of the parallelism infrastructure is a simplified copy of the code used by BTREE indexes, omitting the parts irrelevant for GIN indexes (e.g. uniqueness checks). Original patch by me, with reviews and substantial improvements by Matthias van de Meent, certainly enough to make him a co-author. Author: Tomas Vondra, Matthias van de Meent Reviewed-by: Matthias van de Meent, Andy Fan, Kirill Reshke Discussion: https://postgr.es/m/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com
* Set amcancrosscompare to true for hashPeter Eisentraut2025-03-01
| | | | | | | | | This was missed in the refactoring in patch ce62f2f2a0a, which thus created a regression. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E1tngY6-0000UL-2n%40gemulon.postgresql.org
* Generalize hash and ordering support in amapiPeter Eisentraut2025-02-27
| | | | | | | | | | | | Stop comparing access method OID values against HASH_AM_OID and BTREE_AM_OID, and instead check the IndexAmRoutine for an index to see if it advertises its ability to perform the necessary ordering, hashing, or cross-type comparing functionality. A field amcanorder already existed, this uses it more widely. Fields amcanhash and amcancrosscompare are added for the other purposes. Author: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
* Improve FATAL message for invalid TLI history at recoveryMichael Paquier2025-02-26
| | | | | | | | | | | | | | The original message did not mention where the checkpoint record LSN was found, a control file or a backup_label file. A couple of LOG messages are generated before this FATAL check is reached, providing more details about the way recovery is set up. However, knowing this information in this specific message is useful for debugging. This is also useful for instances where log_min_messages is set to FATAL or more, where LOG messages do not show up. Author: Benoit Lobréau <benoit.lobreau@dalibo.com> Reviewed-by: David Steele <david@pgbackrest.org> Discussion: https://postgr.es/m/4ed10bc8-5513-4d8e-8643-8abcaa08336d@dalibo.com
* Re-add GUC track_wal_io_timingMichael Paquier2025-02-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit is a rework of 2421e9a51d20, about which Andres Freund has raised some concerns as it is valuable to have both track_io_timing and track_wal_io_timing in some cases, as the WAL write and fsync paths can be a major bottleneck for some workloads. Hence, it can be relevant to not calculate the WAL timings in environments where pg_test_timing performs poorly while capturing some IO data under track_io_timing for the non-WAL IO paths. The opposite can be also true: it should be possible to disable the non-WAL timings and enable the WAL timings (the previous GUC setups allowed this possibility). track_wal_io_timing is added back in this commit, controlling if WAL timings should be calculated in pg_stat_io for the read, fsync and write paths, as done previously with pg_stat_wal. pg_stat_wal previously tracked only the sync and write parts (now removed), read stats is new data tracked in pg_stat_io, all three are aggregated if track_wal_io_timing is enabled. The read part matters during recovery or if a XLogReader is used. Extra note: more control over if the types of timings calculated in pg_stat_io could be done with a GUC that lists pairs of (IOObject,IOOp). Reported-by: Andres Freund <andres@anarazel.de> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/3opf2wh2oljco6ldyqf7ukabw3jijnnhno6fjb4mlu6civ5h24@fcwmhsgmlmzu
* Change relpath() et al to return path by valueAndres Freund2025-02-25
| | | | | | | | | | | | | | | | | | | | | | | | | For AIO, and also some other recent patches, we need the ability to call relpath() in a critical section. Until now that was not feasible, as it allocated memory. The fact that relpath() allocated memory also made it awkward to use in log messages because we had to take care to free the memory afterwards. Which we e.g. didn't do for when zeroing out an invalid buffer. We discussed other solutions, e.g. filling a pre-allocated buffer that's passed to relpath(), but they all came with plenty downsides or were larger projects. The easiest fix seems to be to make relpath() return the path by value. To be able to return the path by value we need to determine the maximum length of a relation path. This patch adds a long #define that computes the exact maximum, which is verified to be correct in a regression test. As this change the signature of relpath(), extensions using it will need to adapt their code. We discussed leaving a backward-compat shim in place, but decided it's not worth it given the use of relpath() doesn't seem widespread. Discussion: https://postgr.es/m/xeri5mla4b5syjd5a25nok5iez2kr3bm26j2qn4u7okzof2bmf@kwdh2vf7npra
* Delay extraction of TIDBitmap per page offsetsMelanie Plageman2025-02-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pages from the bitmap created by the TIDBitmap API can be exact or lossy. The TIDBitmap API extracts the tuple offsets from exact pages into an array for the convenience of the caller. This was done in tbm_private|shared_iterate() right after advancing the iterator. However, as long as tbm_private|shared_iterate() set a reference to the PagetableEntry in the TBMIterateResult, the offset extraction can be done later. Waiting to extract the tuple offsets has a few benefits. For the shared iterator case, it allows us to extract the offsets after dropping the shared iterator state lock, reducing time spent holding a contended lock. Separating the iteration step and extracting the offsets later also allows us to avoid extracting the offsets for prefetched blocks. Those offsets were never used, so the overhead of extracting and storing them was wasted. The real motivation for this change, however, is that future commits will make bitmap heap scan use the read stream API. This requires a TBMIterateResult per issued block. By removing the array of tuple offsets from the TBMIterateResult and only extracting the offsets when they are used, we reduce the memory required for per buffer data substantially. Suggested-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGLHbKP3jwJ6_%2BhnGi37Pw3BD5j2amjV3oSk7j-KyCnY7Q%40mail.gmail.com
* Add lossy indicator to TBMIterateResultMelanie Plageman2025-02-24
| | | | | | | | TBMIterateResult->ntuples is -1 when the page in the bitmap is lossy. Add an explicit lossy indicator so that we can move ntuples out of the TBMIterateResult in a future commit. Discussion: https://postgr.es/m/CA%2BhUKGLHbKP3jwJ6_%2BhnGi37Pw3BD5j2amjV3oSk7j-KyCnY7Q%40mail.gmail.com
* Remove read/sync fields from pg_stat_wal and GUC track_wal_io_timingMichael Paquier2025-02-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The four following attributes are removed from pg_stat_wal: * wal_write * wal_sync * wal_write_time * wal_sync_time a051e71e28a1 has added an equivalent of this information in pg_stat_io with more granularity as this now spreads across the backend types, IO context and IO objects. So, keeping the same information in pg_stat_wal has little benefits. Another benefit of this commit is the removal of PendingWalStats, simplifying an upcoming patch to add per-backend WAL statistics, which already support IO statistics and which have access to the write/sync stats data of WAL. The GUC track_wal_io_timing, that was used to enable or disable the aggregation of the write and sync timings for WAL, is also removed. pgstat_prepare_io_time() is simplified. Bump catalog version. Bump PGSTAT_FILE_FORMAT_ID, due to the update of PgStat_WalStats. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/Z7RkQ0EfYaqqjgz/@ip-10-97-1-34.eu-west-3.compute.internal
* Add default_char_signedness field to ControlFileData.Masahiko Sawada2025-02-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The signedness of the 'char' type in C is implementation-dependent. For instance, 'signed char' is used by default on x86 CPUs, while 'unsigned char' is used on aarch CPUs. Previously, we accidentally let C implementation signedness affect persistent data. This led to inconsistent results when comparing char data across different platforms. This commit introduces a new 'default_char_signedness' field in ControlFileData to store the signedness of the 'char' type. While this change does not encourage the use of 'char' without explicitly specifying its signedness, this field can be used as a hint to ensure consistent behavior for pre-v18 data files that store data sorted by the 'char' type on disk (e.g., GIN and GiST indexes), especially in cross-platform replication scenarios. Newly created database clusters unconditionally set the default char signedness to true. pg_upgrade (with an upcoming commit) changes this flag for clusters if the source database cluster has signedness=false. As a result, signedness=false setting will become rare over time. If we had known about the problem during the last development cycle that forced initdb (v8.3), we would have made all clusters signed or all clusters unsigned. Making pg_upgrade the only source of signedness=false will cause the population of database clusters to converge toward that retrospective ideal. Bump catalog version (for the catalog changes) and PG_CONTROL_VERSION (for the additions in ControlFileData). Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com
* Drop opcintype from index AM strategy translation APIPeter Eisentraut2025-02-21
| | | | | | | | | | | | | | | | | | | | | | | | | | The type argument wasn't actually really necessary. It was a remnant of converting the API of the gist strategy translation from using opclass to using opfamily+opcintype (commits c09e5a6a016, 622f678c102). For looking up the gist translation function, we used the convention "amproclefttype = amprocrighttype = opclass's opcintype" (see pg_amproc.h). But each operator family should only have one translation function, and getting the right type for the lookup is sometimes cumbersome and fragile, so this is all unnecessarily complicated. To simplify this, change the gist stategy support procedure to take "any", "any" as argument. (This is arbitrary but seems intuitive. The alternative of using InvalidOid as argument(s) upsets various DDL commands, so it's not practical.) Then we don't need opcintype for the lookup, and we can remove it from all the API layers introduced by commit c09e5a6a016. This also adds some more documentation about the correct signature of the gist support function and adds more checks in gistvalidate(). This was previously underspecified. (It relied implicitly on convention mentioned above.) Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
* Remove various unnecessary (char *) castsPeter Eisentraut2025-02-20
| | | | | | | | Remove a number of (char *) casts that are unnecessary. Or in some cases, rewrite the code to make the purpose of the cast clearer. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org
* Fix FATAL message for invalid recovery timeline at beginning of recoveryMichael Paquier2025-02-20
| | | | | | | | | | | | | | | | | | | | If the requested recovery timeline is not reachable, the logged checkpoint and timeline should to be the values read from the backup_label when it is defined. The message generated used the values from the control file in this case, which is fine when recovering from the control file without a backup_label, but not if there is a backup_label. Issue introduced in ee994272ca50. v15 has introduced xlogrecovery.c and more simplifications in this area (4a92a1c3d1c3, a27048cbcb58), making this change a bit simpler to think about, so backpatch only down to this version. Author: David Steele <david@pgbackrest.org> Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru> Reviewed-by: Benoit Lobréau <benoit.lobreau@dalibo.com> Discussion: https://postgr.es/m/c3d617d4-1696-4aa7-8a4d-5a7d19cc5618@pgbackrest.org Backpatch-through: 15
* Correct relation size estimate with low fillfactorTomas Vondra2025-02-19
| | | | | | | | | | | | | | | | | | | | | Since commit 29cf61ade3, table_block_relation_estimate_size() considers fillfactor when estimating number of rows in a relation before the first ANALYZE. The formula however did not consider tuples may be larger than available space determined by fillfactor, ending with density 0. This ultimately means the relation was estimated to contain a single row. The executor however places at least one tuple per page, even with very low fillfactor values, so the density should be at least 1. Fixed by clamping the density estimate using clamp_row_est(). Reported by Heikki Linnakangas. Fix by me, with regression test inspired by example provided by Heikki. Backpatch to 17, where the issue was introduced. Reported-by: Heikki Linnakangas Backpatch-through: 17 Discussion: https://postgr.es/m/2bf9d973-7789-4937-a7ca-0af9fb49c71e@iki.fi
* Fix crash in brininsertcleanup during logical replication.Tom Lane2025-02-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Logical replication crashes if the subscriber's partitioned table has a BRIN index. There are two independently blamable causes, and this patch fixes both: 1. brininsertcleanup fails if called twice for the same IndexInfo, because it half-destroys its BrinInsertState but leaves it still linked from ii_AmCache. brininsert would also fail in that state, so it's pretty hard to see any advantage to this coding. Fully remove the BrinInsertState, instead, so that a new brininsert call would create a new cache. 2. A logical replication subscriber sometimes does ExecOpenIndices twice on the same ResultRelInfo, followed by doing ExecCloseIndices twice; the second call reaches the brininsertcleanup bug. Quite aside from tickling unexpected cases in aminsertcleanup methods, this seems very wasteful, because the IndexInfos built in the first ExecOpenIndices call are just lost during the second call, and have to be rebuilt at possibly-nontrivial cost. We should establish a coding rule that you don't do that. The problematic coding is that when the target table is partitioned, apply_handle_tuple_routing calls ExecFindPartition which does ExecOpenIndices (and expects that ExecCleanupTupleRouting will close the indexes again). Using the ResultRelInfo made by ExecFindPartition, it calls apply_handle_delete_internal or apply_handle_insert_internal, both of which think they need to do ExecOpenIndices/ExecCloseIndices for themselves. They do in the main non-partitioned code paths, but not here. The simplest fix is to pull their ExecOpenIndices/ExecCloseIndices calls out and put them in the call sites for the non-partitioned cases. (We could have refactored apply_handle_update_internal similarly, but I did not do so today because there's no bug there: the partitioned code path doesn't call it.) Also, remove the always-duplicative open/close calls within apply_handle_tuple_routing itself. Since brininsertcleanup and indeed the whole aminsertcleanup mechanism are new in v17, there's no observable bug in older branches. A case could be made for trying to avoid these duplicative open/close calls in the older branches, but for now it seems not worth the trouble and risk of new bugs. Bug: #18815 Reported-by: Sergey Belyashov <sergey.belyashov@gmail.com> Discussion: https://postgr.es/m/18815-2a0407cc7f40b327@postgresql.org Backpatch-through: 17
* Invalidate inactive replication slots.Amit Kapila2025-02-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit introduces idle_replication_slot_timeout GUC that allows inactive slots to be invalidated at the time of checkpoint. Because checkpoints happen checkpoint_timeout intervals, there can be some lag between when the idle_replication_slot_timeout was exceeded and when the slot invalidation is triggered at the next checkpoint. To avoid such lags, users can force a checkpoint to promptly invalidate inactive slots. Note that the idle timeout invalidation mechanism is not applicable for slots that do not reserve WAL or for slots on the standby server that are synced from the primary server (i.e., standby slots having 'synced' field 'true'). Synced slots are always considered to be inactive because they don't perform logical decoding to produce changes. The slots can become inactive for a long period if a subscriber is down due to a system error or inaccessible because of network issues. If such a situation persists, it might be more practical to recreate the subscriber rather than attempt to recover the node and wait for it to catch up which could be time-consuming. Then, external tools could create replication slots (e.g., for migrations or upgrades) that may fail to remove them if an error occurs, leaving behind unused slots that take up space and resources. Manually cleaning them up can be tedious and error-prone, and without intervention, these lingering slots can cause unnecessary WAL retention and system bloat. As the duration of idle_replication_slot_timeout is in minutes, any test using that would be time-consuming. We are planning to commit a follow up patch for tests by using the injection point framework. Author: Nisha Moond <nisha.moond412@gmail.com> Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Vignesh C <vignesh21@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe+aw@mail.gmail.com Discussion: https://postgr.es/m/OS0PR01MB5716C131A7D80DAE8CB9E88794FC2@OS0PR01MB5716.jpnprd01.prod.outlook.com
* Reduce scope of heap vacuum per_buffer_dataMelanie Plageman2025-02-18
| | | | | | | | | | | | | | | Move lazy_scan_heap()'s per_buffer_data variable into a tighter scope. In lazy_scan_heap()'s phase I heap vacuuming, the read stream API returns a pointer to the next block number to vacuum. As long as read_stream_next_buffer() returns a valid buffer, per_buffer_data should always be valid. Move per_buffer_data into a tighter scope and make sure it is reset to NULL on each iteration so that we get a core dump instead of bogus data from a previous block if something goes wrong in the read stream API. Suggested-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/626104.1739729538%40sss.pgh.pa.us
* Revert: Get rid of WALBufMappingLockAlexander Korotkov2025-02-17
| | | | | This commit reverts 6a2275b895. Buildfarm failure on batta spots some concurrency issue, which requires further investigation.
* Add information about WAL buffers full to VACUUM/ANALYZE (VERBOSE)Michael Paquier2025-02-17
| | | | | | | | | | | | | | | | This commit adds the information about the number of times WAL buffers have been full to the logs generated by VACUUM/ANALYZE (VERBOSE) and in the logs generated by autovacuum, complementing the existing information stored by WalUsage. This is the last part of the backend code where the value of wal_buffers_full can be reported, similarly to all the other fields of WalUsage. 320545bfcfee and ce5bcc4a9f26 have done the same for EXPLAIN and pgss. Author: Bertrand Drouvot Reviewed-by: Ilia Evdokimov Discussion: https://postgr.es/m/Z6SOha5YFFgvpwQY@ip-10-97-1-34.eu-west-3.compute.internal
* Move wal_buffers_full from PgStat_PendingWalStats to WalUsageMichael Paquier2025-02-17
| | | | | | | | | | | | | | | | | | | | | | | | wal_buffers_full has been introduced in pg_stat_wal in 8d9a935965f, as some information providing metrics for the tuning of the GUC wal_buffers. WalUsage has been introduced before that in df3b181499. Moving this field is proving to be beneficial for several reasons: - This information can now be made available in more layers, providing more granularity than just pg_stat_wal, on a per-query basis: EXPLAIN, pgss and VACUUM/ANALYZE logs. - A patch is under discussion to provide statistics for WAL at backend level, and this move simplifies a bit the handling of pending statistics. The remaining data in PgStat_PendingWalStats now relates to write/sync counters and times, with equivalents present in pg_stat_io, that backend statistics are able to already track. So this should cut all the dependencies between PgStat_PendingWalStats and WAL stats at backend level. As of this change, wal_buffers_full only shows in pg_stat_wal. Author: Bertrand Drouvot Reviewed-by: Ilia Evdokimov Discussion: https://postgr.es/m/Z6SOha5YFFgvpwQY@ip-10-97-1-34.eu-west-3.compute.internal
* Get rid of WALBufMappingLockAlexander Korotkov2025-02-17
| | | | | | | | | | | | | | | | | | | | | | | | | Allow multiple backends to initialize WAL buffers concurrently. This way `MemSet((char *) NewPage, 0, XLOG_BLCKSZ);` can run in parallel without taking a single LWLock in exclusive mode. The new algorithm works as follows: * reserve a page for initialization using XLogCtl->InitializeReserved, * ensure the page is written out, * once the page is initialized, try to advance XLogCtl->InitializedUpTo and signal to waiters using XLogCtl->InitializedUpToCondVar condition variable, * repeat previous steps until we reserve initialization up to the target WAL position, * wait until concurrent initialization finishes using a XLogCtl->InitializedUpToCondVar. Now, multiple backends can, in parallel, concurrently reserve pages, initialize them, and advance XLogCtl->InitializedUpTo to point to the latest initialized page. Author: Yura Sokolov <y.sokolov@postgrespro.ru> Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>
* Add delay time to VACUUM/ANALYZE (VERBOSE) and autovacuum logs.Nathan Bossart2025-02-14
| | | | | | | | | | | Commit bb8dff9995 added this information to the pg_stat_progress_vacuum and pg_stat_progress_analyze system views. This commit adds the same information to the output of VACUUM and ANALYZE with the VERBOSE option and to the autovacuum logs. Suggested-by: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
* Use PqMsg_Progress macro in HandleParallelMessage().Nathan Bossart2025-02-14
| | | | | | | Commit a99cc6c6b4 introduced the PqMsg_Progress macro but missed updating HandleParallelMessage() accordingly. Backpatch-through: 17
* Use streaming read I/O in VACUUM's third phaseMelanie Plageman2025-02-14
| | | | | | | | | | | | | Make vacuum's third phase (its second pass over the heap), which reaps dead items collected in the first phase and marks them as reusable, use the read stream API. This commit adds a new read stream callback, vacuum_reap_lp_read_stream_next(), that looks ahead in the TidStore and returns the next block number to read for vacuum. Author: Melanie Plageman <melanieplageman@gmail.com> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGKN3oy0bN_3yv8hd78a4%2BM1tJC9z7mD8%2Bf%2ByA%2BGeoFUwQ%40mail.gmail.com
* Use streaming read I/O in VACUUM's first phaseMelanie Plageman2025-02-14
| | | | | | | | | | Make vacuum's first phase, which prunes and freezes tuples and records dead TIDs, use the read stream API by by converting heap_vac_scan_next_block() to a read stream callback. Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/CAAKRu_aLwANZpxHc0tC-6OT0OQT4TftDGkKAO5yigMUOv_Tcsw%40mail.gmail.com
* Convert heap_vac_scan_next_block() boolean parameters to flagsMelanie Plageman2025-02-14
| | | | | | | | | | | | | | | | The read stream API only allows one piece of extra per block state to be passed back to the API user (per_buffer_data). lazy_scan_heap() needs two pieces of per-buffer data: whether or not the block was all-visible in the visibility map and whether or not it was eagerly scanned. Convert these two pieces of information to flags so that they can be populated by heap_vac_scan_next_block() and returned to lazy_scan_heap(). A future commit will turn heap_vac_scan_next_block() into the read stream callback for heap phase I vacuuming. Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/CAAKRu_bmx33jTqATP5GKNFYwAg02a9dDtk4U_ciEjgBHZSVkOQ%40mail.gmail.com
* Remove unnecessary (char *) casts [xlog]Peter Eisentraut2025-02-13
| | | | | | | | Remove (char *) casts no longer needed after XLogRegisterData() and XLogRegisterBufData() argument type change. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org
* XLogRegisterData, XLogRegisterBufData void * argument for binary dataPeter Eisentraut2025-02-13
| | | | | | | | | Change XLogRegisterData() and XLogRegisterBufData() functions to take void * for binary data instead of char *. This will remove the need for numerous casts (done in a separate commit for clarity). Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org
* Remove unnecessary (char *) casts [mem]Peter Eisentraut2025-02-12
| | | | | | | | | | Remove (char *) casts around memory functions such as memcmp(), memcpy(), or memset() where the cast is useless. Since these functions don't take char * arguments anyway, these casts are at best complicated casts to (void *), about which see commit 7f798aca1d5. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org
* Add is_analyze parameter to vacuum_delay_point().Nathan Bossart2025-02-11
| | | | | | | | | | | | This function is used in both vacuum and analyze code paths, and a follow-up commit will require distinguishing between the two. This commit forces callers to specify whether they are in a vacuum or analyze path, but it does not use that information for anything yet. Author: Nathan Bossart <nathandbossart@gmail.com> Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
* Eagerly scan all-visible pages to amortize aggressive vacuumMelanie Plageman2025-02-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Aggressive vacuums must scan every unfrozen tuple in order to advance the relfrozenxid/relminmxid. Because data is often vacuumed before it is old enough to require freezing, relations may build up a large backlog of pages that are set all-visible but not all-frozen in the visibility map. When an aggressive vacuum is triggered, all of these pages must be scanned. These pages have often been evicted from shared buffers and even from the kernel buffer cache. Thus, aggressive vacuums often incur large amounts of extra I/O at the expense of foreground workloads. To amortize the cost of aggressive vacuums, eagerly scan some all-visible but not all-frozen pages during normal vacuums. All-visible pages that are eagerly scanned and set all-frozen in the visibility map are counted as successful eager freezes and those not frozen are counted as failed eager freezes. If too many eager scans fail in a row, eager scanning is temporarily suspended until a later portion of the relation. The number of failures tolerated is configurable globally and per table. To effectively amortize aggressive vacuums, we cap the number of successes as well. Capping eager freeze successes also limits the amount of potentially wasted work if these pages are modified again before the next aggressive vacuum. Once we reach the maximum number of blocks successfully eager frozen, eager scanning is disabled for the remainder of the vacuum of the relation. Original design idea from Robert Haas, with enhancements from Andres Freund, Tomas Vondra, and me Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_ZF_KCzZuOrPrOqjGVe8iRVWEAJSpzMgRQs%3D5-v84cXUg%40mail.gmail.com
* Virtual generated columnsPeter Eisentraut2025-02-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a new variant of generated columns that are computed on read (like a view, unlike the existing stored generated columns, which are computed on write, like a materialized view). The syntax for the column definition is ... GENERATED ALWAYS AS (...) VIRTUAL and VIRTUAL is also optional. VIRTUAL is the default rather than STORED to match various other SQL products. (The SQL standard makes no specification about this, but it also doesn't know about VIRTUAL or STORED.) (Also, virtual views are the default, rather than materialized views.) Virtual generated columns are stored in tuples as null values. (A very early version of this patch had the ambition to not store them at all. But so much stuff breaks or gets confused if you have tuples where a column in the middle is completely missing. This is a compromise, and it still saves space over being forced to use stored generated columns. If we ever find a way to improve this, a bit of pg_upgrade cleverness could allow for upgrades to a newer scheme.) The capabilities and restrictions of virtual generated columns are mostly the same as for stored generated columns. In some cases, this patch keeps virtual generated columns more restricted than they might technically need to be, to keep the two kinds consistent. Some of that could maybe be relaxed later after separate careful considerations. Some functionality that is currently not supported, but could possibly be added as incremental features, some easier than others: - index on or using a virtual column - hence also no unique constraints on virtual columns - extended statistics on virtual columns - foreign-key constraints on virtual columns - not-null constraints on virtual columns (check constraints are supported) - ALTER TABLE / DROP EXPRESSION - virtual column cannot have domain type - virtual columns are not supported in logical replication The tests in generated_virtual.sql have been copied over from generated_stored.sql with the keyword replaced. This way we can make sure the behavior is mostly aligned, and the differences can be visible. Some tests for currently not supported features are currently commented out. Reviewed-by: Jian He <jian.universality@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Tested-by: Shlok Kyal <shlok.kyal.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/a368248e-69e4-40be-9c07-6c3b5880b0a6@eisentraut.org
* Introduce autovacuum_vacuum_max_threshold.Nathan Bossart2025-02-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One way autovacuum chooses tables to vacuum is by comparing the number of updated or deleted tuples with a value calculated using autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor. The threshold specifies the base value for comparison, and the scale factor specifies the fraction of the table size to add to it. This strategy ensures that smaller tables are vacuumed after fewer updates/deletes than larger tables, which is reasonable in many cases but can result in infrequent vacuums on very large tables. This is undesirable for a couple of reasons, such as very large tables incurring a huge amount of bloat between vacuums. This new parameter provides a way to set a limit on the value calculated with autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor so that very large tables are vacuumed more frequently. By default, it is set to 100,000,000 tuples, but it can be disabled by setting it to -1. It can also be adjusted for individual tables by changing storage parameters. Author: Nathan Bossart <nathandbossart@gmail.com> Co-authored-by: Frédéric Yhuel <frederic.yhuel@dalibo.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Michael Banck <mbanck@gmx.net> Reviewed-by: Joe Conway <mail@joeconway.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Reviewed-by: Vinícius Abrahão <vinnix.bsd@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Alena Rybakina <a.rybakina@postgrespro.ru> Discussion: https://postgr.es/m/956435f8-3b2f-47a6-8756-8c54ded61802%40dalibo.com
* Add data for WAL in pg_stat_io and backend statisticsMichael Paquier2025-02-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds WAL IO stats to both pg_stat_io view and per-backend IO statistics (pg_stat_get_backend_io()). This change is possible since f92c854cf406, as WAL IO is not counted in blocks in some code paths where its stats data is measured (like WAL read in xlogreader.c). IOContext gains IOCONTEXT_INIT and IOObject IOOBJECT_WAL, with the following combinations allowed: - IOOBJECT_WAL/IOCONTEXT_NORMAL is used to track I/O operations done on already-created WAL segments. - IOOBJECT_WAL/IOCONTEXT_INIT is used for tracking I/O operations done when initializing WAL segments. The core changes are done in pg_stat_io.c, backend statistics inherit them. Backend statistics and pg_stat_io are now available for the WAL writer, the WAL receiver and the WAL summarizer processes. I/O timing data is controlled by the GUC track_io_timing, like the existing data of pg_stat_io for consistency. The timings related to IOOBJECT_WAL show up if the GUC is enabled (disabled by default). Bump pgstats file version, due to the additions in IOObject and IOContext, impacting the amount of data written for the fixed-numbered IO stats kind in the pgstats file. Author: Nazir Bilal Yavuz Reviewed-by: Bertrand Drouvot, Nitin Jadhav, Amit Kapila, Michael Paquier, Melanie Plageman, Bharath Rupireddy Discussion: https://postgr.es/m/CAN55FZ3AiQ+ZMxUuXnBpd0Rrh1YhwJ5FudkHg=JU0P+-W8T4Vg@mail.gmail.com
* Integrate GistTranslateCompareType() into IndexAmTranslateCompareType()Peter Eisentraut2025-02-03
| | | | | | | | | | | | | | | | | This turns GistTranslateCompareType() into a callback function of the gist index AM instead of a standalone function. The existing callers are changed to use IndexAmTranslateCompareType(). This then makes that code not hardcoded toward gist. This means in particular that the temporal keys code is now independent of gist. Also, this generalizes commit 74edabce7a3, so other index access methods other than the previously hardcoded ones could now work as REPLICA IDENTITY in a logical replication subscriber. Author: Mark Dilger <mark.dilger@enterprisedb.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
* Fix typo in xlog.cMichael Paquier2025-02-03
| | | | "recovery" is not a verb. Introduced in 68cb5af46cd8.
* Convert strategies to and from compare typesPeter Eisentraut2025-02-02
| | | | | | | | | | | | | | | For each Index AM, provide a mapping between operator strategies and the system-wide generic concept of a comparison type. For example, for btree, BTLessStrategyNumber maps to and from COMPARE_LT. Numerous places in the planner and executor think directly in terms of btree strategy numbers (and a few in terms of hash strategy numbers.) These should be converted over subsequent commits to think in terms of CompareType instead. (This commit doesn't make any use of this API yet.) Author: Mark Dilger <mark.dilger@enterprisedb.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
* Add get_opfamily_name() functionPeter Eisentraut2025-02-01
| | | | | | | | This refactors and simplifies various existing code to make use of the new function. Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com