aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
* Fix brin.c indentation issues introduced by c1ec02be1dTomas Vondra2023-11-26
| | | | Per buildfarm member koel.
* Reuse BrinDesc and BrinRevmap in brininsertTomas Vondra2023-11-25
| | | | | | | | | | | | | | | | The brininsert code used to initialize (and destroy) BrinDesc and BrinRevmap for each tuple, which is not free. This patch initializes these structures only once, and reuses them for all inserts in the same command. The data is passed through indexInfo->ii_AmCache. This also introduces an optional AM callback "aminsertcleanup" that allows performing custom cleanup in case simply pfree-ing ii_AmCache is not sufficient (which is the case when the cache contains TupleDesc, Buffers, and so on). Author: Soumyadeep Chakraborty Reviewed-by: Alvaro Herrera, Matthias van de Meent, Tomas Vondra Discussion: https://postgr.es/m/CAE-ML%2B9r2%3DaO1wwji1sBN9gvPz2xRAtFUGfnffpd0ZqyuzjamA%40mail.gmail.com
* C comment: clarify that WAL files can be _recycled_ or removedBruce Momjian2023-11-25
| | | | | | | | | | Reported-by: Michael Paquier Discussion: https://postgr.es/m/CAB7nPqSDdF0heotQU3gsepgqx+9c+6KjLd3R6aNYH7KKfDd2ig@mail.gmail.com Author: Michael Paquier Backpatch-through: master
* modify segno. for pg_walfile_name() and pg_walfile_name_offset()Bruce Momjian2023-11-24
| | | | | | | | | | | | | | | | | | | | | Previously these functions returned the previous segment number if the LSN was on a segment boundary. We now always return the current segment number for an LSN. Docs updated to reflect this change. Regression tests added, author Andres Freund. Also mentioned in thread https://postgr.es/m/flat/20220204225057.GA1535307%40nathanxps13#d964275c9540d8395e138efc0a75f7e8 BACKWARD INCOMPATIBILITY Reported-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/20190726.172120.101752680.horikyota.ntt@gmail.com Co-authored-by: Kyotaro Horiguchi Backpatch-through: master
* Release lock on heap buffer before vacuuming FSMAndres Freund2023-11-17
| | | | | | | | | | | | | | | | | When there are no indexes on a table, we vacuum each heap block after pruning it and then update the freespace map. Periodically, we also vacuum the freespace map. This was done while unnecessarily holding a lock on the heap page. Release the lock before calling FreeSpaceMapVacuumRange() and, while we're at it, ensure the range includes the heap block we just vacuumed. There are no known deadlocks or other similar issues, therefore don't backpatch. It's certainly not good to do all this work under a lock, but it's not frequently reached, making it not worth the risk of backpatching. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAAKRu_YiL%3D44GvGnt1dpYouDSSoV7wzxVoXs8m3p311rp-TVQQ%40mail.gmail.com
* Retire MemoryContextResetAndDeleteChildren() macro.Nathan Bossart2023-11-15
| | | | | | | | | | | | | | | | | As of commit eaa5808e8e, MemoryContextResetAndDeleteChildren() is just a backwards compatibility macro for MemoryContextReset(). Now that some time has passed, this macro seems more likely to create confusion. This commit removes the macro and replaces all remaining uses with calls to MemoryContextReset(). Any third-party code that use this macro will need to be adjusted to call MemoryContextReset() instead. Since the two have behaved the same way since v9.5, such adjustments won't produce any behavior changes for all currently-supported versions of PostgreSQL. Reviewed-by: Amul Sul, Tom Lane, Alvaro Herrera, Dagfinn Ilmari Mannsåker Discussion: https://postgr.es/m/20231113185950.GA1668018%40nathanxps13
* Clear CurrentResourceOwner earlier in CommitTransaction.Heikki Linnakangas2023-11-15
| | | | | | | | | | | Alexander reported a crash with repeated create + drop database, after the ResourceOwner rewrite (commit b8bff07daa). That was fixed by the previous commit, but it nevertheless seems like a good idea clear CurrentResourceOwner earlier, because you're not supposed to use it for anything after we start releasing it. Reviewed-by: Alexander Lakhin Discussion: https://www.postgresql.org/message-id/11b70743-c5f3-3910-8e5b-dd6c115ff829%40gmail.com
* Don't release index root page pin in ginFindParents().Tom Lane2023-11-13
| | | | | | | | | | | | | | | | | | | | | It's clearly stated in the comments that ginFindParents() must keep the pin on the index's root page that's associated with the topmost GinBtreeStack item. However, the code path for the case that the desired downlink has been pushed down to the next index level ignored this proviso, and would release the pin anyway if we were still examining the root level. That led to an assertion failure or "buffer NNNN is not owned by resource owner" error later, when we try to release the pin again at the end of the insertion. This is quite hard to reproduce, since it can only happen if an index root page split occurs concurrently with our own insertion. Thanks to Jeff Janes for finding a test case that triggers it often enough to allow investigation. This has been there since the beginning of GIN, so back-patch to all supported branches. Discussion: https://postgr.es/m/CAMkU=1yCAKtv86dMrD__Ja-7KzjE=uMeKX8y__cx5W-OEWy2ow@mail.gmail.com
* Use REGBUF_NO_CHANGE at one more place in the hash index.Amit Kapila2023-11-13
| | | | | | | | | | Commit 00d7fb5e2e started to use REGBUF_NO_CHANGE at a few places in the code where we register the buffer before marking it dirty but missed updating one of the code flows in the hash index where we free the overflow page without any live tuples on it. Author: Amit Kapila and Hayato Kuroda Discussion: http://postgr.es/m/f045c8f7-ee24-ead6-3679-c04a43d21351@gmail.com
* Prohibit max_slot_wal_keep_size to value other than -1 during upgrade.Amit Kapila2023-11-10
| | | | | | | | | | | We don't want existing slots in the old cluster to get invalidated during the upgrade. During an upgrade, we set this variable to -1 via the command line in an attempt to prevent such invalidations, but users have ways to override it. This patch ensures the value is not overridden by the user. Author: Kyotaro Horiguchi Reviewed-by: Peter Smith, Alvaro Herrera Discussion: http://postgr.es/m/20231027.115759.2206827438943188717.horikyota.ntt@gmail.com
* Ensure we use the correct spelling of "ensure"David Rowley2023-11-10
| | | | | | | | | We seem to have accidentally used "insure" in a few places. Correct that. Author: Peter Smith Discussion: https://postgr.es/m/CAHut+Pv0biqrhA3pMhu40aDsj343mTsD75khKnHsLqR8P04f=Q@mail.gmail.com Backpatch-through: 12, oldest supported version
* Make ResourceOwners more easily extensible.Heikki Linnakangas2023-11-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of having a separate array/hash for each resource kind, use a single array and hash to hold all kinds of resources. This makes it possible to introduce new resource "kinds" without having to modify the ResourceOwnerData struct. In particular, this makes it possible for extensions to register custom resource kinds. The old approach was to have a small array of resources of each kind, and if it fills up, switch to a hash table. The new approach also uses an array and a hash, but now the array and the hash are used at the same time. The array is used to hold the recently added resources, and when it fills up, they are moved to the hash. This keeps the access to recent entries fast, even when there are a lot of long-held resources. All the resource-specific ResourceOwnerEnlarge*(), ResourceOwnerRemember*(), and ResourceOwnerForget*() functions have been replaced with three generic functions that take resource kind as argument. For convenience, we still define resource-specific wrapper macros around the generic functions with the old names, but they are now defined in the source files that use those resource kinds. The release callback no longer needs to call ResourceOwnerForget on the resource being released. ResourceOwnerRelease unregisters the resource from the owner before calling the callback. That needed some changes in bufmgr.c and some other files, where releasing the resources previously always called ResourceOwnerForget. Each resource kind specifies a release priority, and ResourceOwnerReleaseAll releases the resources in priority order. To make that possible, we have to restrict what you can do between phases. After calling ResourceOwnerRelease(), you are no longer allowed to remember any more resources in it or to forget any previously remembered resources by calling ResourceOwnerForget. There was one case where that was done previously. At subtransaction commit, AtEOSubXact_Inval() would handle the invalidation messages and call RelationFlushRelation(), which temporarily increased the reference count on the relation being flushed. We now switch to the parent subtransaction's resource owner before calling AtEOSubXact_Inval(), so that there is a valid ResourceOwner to temporarily hold that relcache reference. Other end-of-xact routines make similar calls to AtEOXact_Inval() between release phases, but I didn't see any regression test failures from those, so I'm not sure if they could reach a codepath that needs remembering extra resources. There were two exceptions to how the resource leak WARNINGs on commit were printed previously: llvmjit silently released the context without printing the warning, and a leaked buffer io triggered a PANIC. Now everything prints a WARNING, including those cases. Add tests in src/test/modules/test_resowner. Reviewed-by: Aleksander Alekseev, Michael Paquier, Julien Rouhaud Reviewed-by: Kyotaro Horiguchi, Hayato Kuroda, Álvaro Herrera, Zhihong Yu Reviewed-by: Peter Eisentraut, Andres Freund Discussion: https://www.postgresql.org/message-id/cbfabeb0-cd3c-e951-a572-19b365ed314d%40iki.fi
* Enlarge assertion in bloom_init() for false_positive_rateMichael Paquier2023-11-08
| | | | | | | | | | | | | | | | | | false_positive_rate is a parameter that can be set with the bloom opclass in BRIN, and setting it to a value of exactly 0.25 would trigger an assertion in the first INSERT done on the index with value set. The assertion changed here relied on BLOOM_{MIN|MAX}_FALSE_POSITIVE_RATE that are somewhat arbitrary values, and specifying an out-of-range value would also trigger a failure when defining such an index. So, as-is, the assertion was just doubling on the min-max check of the reloption. This is now enlarged to check that it is a correct percentage value, instead, based on a suggestion by Tom Lane. Author: Alexander Lakhin Reviewed-by: Tom Lane, Shihao Zhong Discussion: https://postgr.es/m/17969-a6c54de48026d694@postgresql.org Backpatch-through: 14
* doc: 1-byte varlena headers can be used for user PLAIN storageBruce Momjian2023-10-31
| | | | | | | | | | | | This also updates some C comments. Reported-by: suchithjn22@gmail.com Discussion: https://postgr.es/m/167336599095.2667301.15497893107226841625@wrigleys.postgresql.org Author: Laurenz Albe (doc patch) Backpatch-through: 11
* Diagnose !indisvalid in more SQL functions.Noah Misch2023-10-30
| | | | | | | | | | | | | pgstatindex failed with ERRCODE_DATA_CORRUPTED, of the "can't-happen" class XX. The other functions succeeded on an empty index; they might have malfunctioned if the failed index build left torn I/O or other complex state. Report an ERROR in statistics functions pgstatindex, pgstatginindex, pgstathashindex, and pgstattuple. Report DEBUG1 and skip all index I/O in maintenance functions brin_desummarize_range, brin_summarize_new_values, brin_summarize_range, and gin_clean_pending_list. Back-patch to v11 (all supported versions). Discussion: https://postgr.es/m/20231001195309.a3@google.com
* Delay recovery mode LOG after reading backup_label and/or checkpoint recordMichael Paquier2023-10-30
| | | | | | | | | | | | | | | | | | | | | | When beginning recovery, a LOG is displayed by the startup process to show which recovery mode will be used depending on the .signal file(s) set in the data folder, like "standby mode", recovery up to a given target type and value, or archive recovery. A different patch is under discussion to simplify the startup code by requiring the presence of recovery.signal and/or standby.signal when a backup_label file is read. Delaying a bit this LOG ensures that the correct recovery mode would be reported, and putting it at this position does not make it lose its value. While on it, this commit adds a few comments documenting a bit more the initial recovery steps and their dependencies, and fixes an incorrect comment format. This introduces no behavior changes. Extracted from a larger patch by me. Reviewed-by: David Steele, Bowen Shi Discussion: https://postgr.es/m/ZArVOMifjzE7f8W7@paquier.xyz
* Mention standby.signal in FATALs for checkpoint record missing at recoveryMichael Paquier2023-10-30
| | | | | | | | | | | | | | | When beginning recovery from a base backup by reading a backup_label file, it may be possible that no checkpoint record is available depending on the method used when the case backup was taken, which would prevent recovery from beginning. In this case, the FATAL messages issued, initially added by c900c15269f0f, mentioned recovery.signal as an option to do recovery but not standby.signal. Let's add it as an available option, for clarity. Per suggestion from Bowen Shi, extracted from a larger patch by me. Author: Michael Paquier Discussion: https://postgr.es/m/CAM_vCudkSjr7NsNKSdjwtfAm9dbzepY6beZ5DP177POKy8=2aw@mail.gmail.com
* Introduce pg_stat_checkpointerMichael Paquier2023-10-30
| | | | | | | | | | | | | | | | | | | | | | | | Historically, the statistics of the checkpointer have been always part of pg_stat_bgwriter. This commit removes a few columns from pg_stat_bgwriter, and introduces pg_stat_checkpointer with equivalent, renamed columns (plus a new one for the reset timestamp): - checkpoints_timed -> num_timed - checkpoints_req -> num_requested - checkpoint_write_time -> write_time - checkpoint_sync_time -> sync_time - buffers_checkpoint -> buffers_written The fields of PgStat_CheckpointerStats and its SQL functions are renamed to match with the new field names, for consistency. Note that background writer and checkpointer have been split into two different processes in commits 806a2aee3791 and bf405ba8e460. The pgstat structures were already split, making this change straight-forward. Bump catalog version. Author: Bharath Rupireddy Reviewed-by: Bertrand Drouvot, Andres Freund, Michael Paquier Discussion: https://postgr.es/m/CALj2ACVxX2ii=66RypXRweZe2EsBRiPMj0aHfRfHUeXJcC7kHg@mail.gmail.com
* doc Improve C GUC-related commentsBruce Momjian2023-10-27
| | | | | | | | Discussion: https://postgr.es/m/CAEG8a3LZHTR5S+OPZCbZvECwsqdbx=pBRFZZyDjKaAtgoALOQQ@mail.gmail.com Author: Junwang Zhao Backpatch-through: master
* Fix minmax-multi distance for extreme interval valuesTomas Vondra2023-10-27
| | | | | | | | | | | | | | | | | When calculating distance for interval values, the code mostly mimicked interval_mi, i.e. it built a new interval value for the difference. That however does not work for sufficiently distant interval values, when the difference overflows the interval range. Instead, we can calculate the distance directly, without constructing the intermediate (and unnecessary) interval value. Backpatch to 14, where minmax-multi indexes were introduced. Reported-by: Dean Rasheed Reviewed-by: Ashutosh Bapat, Dean Rasheed Backpatch-through: 14 Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com
* Fix minmax-multi on infinite date/timestamp valuesTomas Vondra2023-10-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure that infinite values in date/timestamp columns are treated as if in infinite distance. Infinite values should not be merged with other values, leaving them as outliers. The code however returned distance 0 in this case, so that infinite values were merged first. While this does not break the index (i.e. it still produces correct query results), it may make it much less efficient. We don't need explicit handling of infinite date/timestamp values when calculating distances, because those values are represented as extreme but regular values (e.g. INT64_MIN/MAX for the timestamp type). We don't need an exact distance, just a value that is much larger than distanced between regular values. With the added cast to double values, we can simply subtract the values. The regression test queries a value in the "gap" and checks the range was properly eliminated by the BRIN index. This only affects minmax-multi indexes on timestamp/date columns with infinite values, which is not very common in practice. The affected indexes may need to be rebuilt. Backpatch to 14, where minmax-multi indexes were introduced. Reported-by: Ashutosh Bapat Reviewed-by: Ashutosh Bapat, Dean Rasheed Backpatch-through: 14 Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com
* Fix calculation in brin_minmax_multi_distance_dateTomas Vondra2023-10-27
| | | | | | | | | | | | | | | | | | | | | When calculating the distance between date values, make sure to subtract them in the right order, i.e. (larger - smaller). The distance is used to determine which values to merge, and is expected to be a positive value. The code unfortunately did the subtraction in the opposite order, i.e. (smaller - larger), thus producing negative values and merging values the most distant values first. The resulting index is correct (i.e. produces correct results), but may be significantly less efficient. This affects all minmax-multi indexes on date columns. Backpatch to 14, where minmax-multi indexes were introduced. Reported-by: Ashutosh Bapat Reviewed-by: Ashutosh Bapat, Dean Rasheed Backpatch-through: 14 Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com
* Fix overflow when calculating timestamp distance in BRINTomas Vondra2023-10-27
| | | | | | | | | | | | | | | | | | | | | | | | | | When calculating distances for timestamp values for BRIN minmax-multi indexes, we need to be careful about overflows for extreme values. If the value overflows into a negative value, the index may be inefficient. The new regression test checks this for the timestamp type by adding a table with enough values to force range compaction/merging. The values are close to min/max, which means a risk of overflow. Fixed by converting the int64 values to double first, before calculating the distance. This prevents the overflow. We may lose some precision, of course, but that's good enough. In the worst case we build a slightly less efficient index, but for large distances this won't matter. This only affects minmax-multi indexes on timestamp columns, with ranges containing values sufficiently distant to cause an overflow. That seems like a fairly rare case in practice. Backpatch to 14, where minmax-multi indexes were introduced. Reported-by: Ashutosh Bapat Reviewed-by: Ashutosh Bapat, Dean Rasheed Backpatch-through: 14 Discussion: https://postgr.es/m/eef0ea8c-4aaa-8d0d-027f-58b1f35dd170@enterprisedb.com
* Add trailing commas to enum definitionsPeter Eisentraut2023-10-26
| | | | | | | | | | | | | | | | | | | | Since C99, there can be a trailing comma after the last value in an enum definition. A lot of new code has been introducing this style on the fly. Some new patches are now taking an inconsistent approach to this. Some add the last comma on the fly if they add a new last value, some are trying to preserve the existing style in each place, some are even dropping the last comma if there was one. We could nudge this all in a consistent direction if we just add the trailing commas everywhere once. I omitted a few places where there was a fixed "last" value that will always stay last. I also skipped the header files of libpq and ecpg, in case people want to use those with older compilers. There were also a small number of cases where the enum type wasn't used anywhere (but the enum values were), which ended up confusing pgindent a bit, so I left those alone. Discussion: https://www.postgresql.org/message-id/flat/386f8c45-c8ac-4681-8add-e3b0852c1620%40eisentraut.org
* Assert that buffers are marked dirty before XLogRegisterBuffer().Jeff Davis2023-10-23
| | | | | | | | | | | Enforce the rule from transam/README in XLogRegisterBuffer(), and update callers to follow the rule. Hash indexes sometimes register clean pages as a part of the locking protocol, so provide a REGBUF_NO_CHANGE flag to support that use. Discussion: https://postgr.es/m/c84114f8-c7f1-5b57-f85a-3adc31e1a904@iki.fi Reviewed-by: Heikki Linnakangas
* Change struct tablespaceinfo's oid member from 'char *' to 'Oid'Robert Haas2023-10-23
| | | | | | | | | | | | | | | | | | | | | | | | | | This shouldn't change behavior except in the unusual case where there are file in the tablespace directory that have entirely numeric names but are nevertheless not possible names for a tablespace directory, either because their names have leading zeroes that shouldn't be there, or the value is actually zero, or because the value is too large to represent as an OID. In those cases, the directory would previously have made it into the list of tablespaceinfo objects and no longer will. Thus, base backups will now ignore such directories, instead of treating them as legitimate tablespace directories. Similarly, if entries for such tablespaces occur in a tablespace_map file, they will now be rejected as erroneous, instead of being honored. This is infrastructure for future work that wants to be able to know the tablespace of each relation that is part of a backup *as an OID*. By strengthening the up-front validation, we don't have to worry about weird cases later, and can more easily avoid repeated string->integer conversions. Patch by me, reviewed by David Steele. Discussion: http://postgr.es/m/CA+TgmoZNVeBzoqDL8xvr-nkaepq815jtDR4nJzPew7=3iEuM1g@mail.gmail.com
* During online checkpoints, insert XLOG_CHECKPOINT_REDO at redo point.Robert Haas2023-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | This allows tools that read the WAL sequentially to identify (possible) redo points when they're reached, rather than only being able to detect them in retrospect when XLOG_CHECKPOINT_ONLINE is found, possibly much later in the WAL stream. There are other possible applications as well; see the discussion links below. Any redo location that precedes the checkpoint location should now point to an XLOG_CHECKPOINT_REDO record, so add a cross-check to verify this. While adjusting the code in CreateCheckPoint() for this patch, I made it call WALInsertLockAcquireExclusive a bit later than before, since there appears to be no need for it to be held while checking whether the system is idle, whether this is an end-of-recovery checkpoint, or what the current timeline is. Bump XLOG_PAGE_MAGIC. Patch by me, based in part on earlier work from Dilip Kumar. Review by Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier. Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com Discussion: http://postgr.es/m/20230614194717.jyuw3okxup4cvtbt%40awork3.anarazel.de Discussion: http://postgr.es/m/CA+hUKG+b2ego8=YNW2Ohe9QmSiReh1-ogrv8V_WZpJTqP3O+2w@mail.gmail.com
* Reword messages about impending (M)XID exhaustion.Robert Haas2023-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | First, we shouldn't recommend switching to single-user mode, because that's terrible advice. Especially on newer versions where VACUUM will enter emergency mode when nearing (M)XID exhaustion, it's perfectly fine to just VACUUM in multi-user mode. Doing it that way is less disruptive and avoids disabling the safeguards that prevent actual wraparound, so recommend that instead. Second, be more precise about what is going to happen (when we're nearing the limits) or what is happening (when we actually hit them). The database doesn't shut down, nor does it refuse all commands. It refuses commands that assign whichever of XIDs and MXIDs are nearly exhausted. No back-patch. The existing hint that advises going to single-user mode is sufficiently awful advice that removing it or changing it might be justifiable even though we normally avoid changing user-facing messages in back-branches, but I (rhaas) felt that it was better to be more conservative and limit this fix to master only. Aside from the usual risk of breaking translations, people might be used to the existing message, or even have monitoring scripts that look for it. Alexander Alekseev, John Naylor, Robert Haas, reviewed at various times by Peter Geoghegan, Hannu Krosing, and Andres Freund. Discussion: http://postgr.es/m/CA+TgmoZBg95FiR9wVQPAXpGPRkacSt2okVge+PKPPFppN7sfnQ@mail.gmail.com
* Talk about assigning, rather than generating, new MultiXactIds.Robert Haas2023-10-17
| | | | | | | | | The word "assign" is used in various places internally to describe what GetNewMultiXactId does, but the user-facing messages have previously said "generate". For consistency, standardize on "assign," which seems (at least to me) to be slightly clearer. Discussion: http://postgr.es/m/CA+TgmoaoE1_i3=4-7GCTtKLVZVQ2Gh6qESW2VG1OprtycxOHMA@mail.gmail.com
* Move extra code out of the Pre/PostRestoreCommand() section.Nathan Bossart2023-10-16
| | | | | | | | | | | | | | | | | | | | If SIGTERM is received within this section, the startup process will immediately proc_exit() in the signal handler, so it is inadvisable to include any more code than is required there (as such code is unlikely to be compatible with doing proc_exit() in a signal handler). This commit moves the code recently added to this section (see 1b06d7bac9 and 7fed801135) to outside of the section. This ensures that the startup process only calls proc_exit() in its SIGTERM handler for the duration of the system() call, which is how this code worked from v8.4 to v14. Reported-by: Michael Paquier, Thomas Munro Analyzed-by: Andres Freund Suggested-by: Tom Lane Reviewed-by: Michael Paquier, Robert Haas, Thomas Munro, Andres Freund Discussion: https://postgr.es/m/Y9nGDSgIm83FHcad%40paquier.xyz Discussion: https://postgr.es/m/20230223231503.GA743455%40nathanxps13 Backpatch-through: 15
* Fix comment from commit 22655aa231.Thomas Munro2023-10-16
| | | | | Per automated complaint from BF animal koel this needed to be re-indented, but there was also a typo. Back-patch to 16.
* Fix bulk table extension when copying into multiple partitionsAndres Freund2023-10-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When COPYing into a partitioned table that does now permit the use of table_multi_insert(), we could error out with ERROR: could not read block NN in file "base/...": read only 0 of 8192 bytes because BulkInsertState->next_free was not reset between partitions. This problem occurred only when not able to use table_multi_insert(), as a dedicated BulkInsertState for each partition is used in that case. The bug was introduced in 00d1e02be24, but it was hard to hit at that point, as commonly bulk relation extension is not used when not using table_multi_insert(). It became more likely after 82a4edabd27, which expanded the use of bulk extension. To fix the bug, reset the bulk relation extension state in BulkInsertState in ReleaseBulkInsertStatePin(). That was added (in b1ecb9b3fcf) to tackle a very similar issue. Obviously the name is not quite correct, but there might be external callers, and bulk insert state needs to be reset in precisely in the situations that ReleaseBulkInsertStatePin() already needed to be called. Medium term the better fix likely is to disallow reusing BulkInsertState across relations. Add a test that, without the fix, reproduces #18130 in most configurations. The test also catches the problem fixed in b1ecb9b3fcf when run with small shared_buffers. Reported-by: Ivan Kolombet <enderstd@gmail.com> Analyzed-by: Tom Lane <tgl@sss.pgh.pa.us> Analyzed-by: Andres Freund <andres@anarazel.de> Bug: #18130 Discussion: https://postgr.es/m/18130-7a86a7356a75209d%40postgresql.org Discussion: https://postgr.es/m/257696.1695670946%40sss.pgh.pa.us Backpatch: 16-
* Improve the naming in wal_sync_method code.Nathan Bossart2023-10-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | * sync_method is renamed to wal_sync_method. * sync_method_options[] is renamed to wal_sync_method_options[]. * assign_xlog_sync_method() is renamed to assign_wal_sync_method(). * The names of the available synchronization methods are now prefixed with "WAL_SYNC_METHOD_" and have been moved into a WalSyncMethod enum. * PLATFORM_DEFAULT_SYNC_METHOD is renamed to PLATFORM_DEFAULT_WAL_SYNC_METHOD, and DEFAULT_SYNC_METHOD is renamed to DEFAULT_WAL_SYNC_METHOD. These more descriptive names help distinguish the code for wal_sync_method from the code for DataDirSyncMethod (e.g., the recovery_init_sync_method configuration parameter and the --sync-method option provided by several frontend utilities). This change also prevents name collisions between the aforementioned sets of code. Since this only improves the naming of internal identifiers, there should be no behavior change. Author: Maxim Orlov Discussion: https://postgr.es/m/CACG%3DezbL1gwE7_K7sr9uqaCGkWhmvRTcTEnm3%2BX1xsRNwbXULQ%40mail.gmail.com
* Add wait events for checkpoint delay mechanism.Thomas Munro2023-10-13
| | | | | | | | | | | When MyProc->delayChkptFlags is set to temporarily block phase transitions in a concurrent checkpoint, the checkpointer enters a sleep-poll loop to wait for the flag to be cleared. We should show that as a wait event in the pg_stat_activity view. Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CA%2BhUKGL7Whi8iwKbzkbn_1fixH3Yy8aAPz7mfq6Hpj7FeJrKMg%40mail.gmail.com
* Unify two isLogSwitch tests in XLogInsertRecord.Robert Haas2023-10-12
| | | | | | | | | | | | | | | | | An upcoming patch wants to introduce an additional special case in this function. To keep that as cheap as possible, minimize the amount of branching that we do based on whether this is an XLOG_SWITCH record. Additionally, and also in the interest of keeping the overhead of special-case code paths as low as possible, apply likely() to the non-XLOG_SWITCH case, since only a very tiny fraction of WAL records will be XLOG_SWITCH records. Patch by me, reviewed by Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier. Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com
* Reindent comment in GenericXLogFinish().Tom Lane2023-10-11
| | | | Restore pgindent cleanliness, per buildfarm member koel.
* Fix bug in GenericXLogFinish().Jeff Davis2023-10-10
| | | | | | | | Mark the buffers dirty before writing WAL. Discussion: https://postgr.es/m/25104133-7df8-cae3-b9a2-1c0aaa1c094a@iki.fi Reviewed-by: Heikki Linnakangas Backpatch-through: 11
* Add const to values and nulls argumentsPeter Eisentraut2023-10-10
| | | | | | | This excludes any changes that would change the external AM APIs. Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://www.postgresql.org/message-id/flat/14c31f4a-0347-0805-dce8-93a9072c05a5%40eisentraut.org
* Remove duplicate words in docs and code comments.Amit Kapila2023-10-09
| | | | | | | Additionally, add a missing "the" in a couple of places. Author: Vignesh C, Dagfinn Ilmari Mannsåker Discussion: http://postgr.es/m/CALDaNm28t+wWyPfuyqEaARS810Je=dRFkaPertaLAEJYY2cWYQ@mail.gmail.com
* Fix another typo in e0b1ee17dcAlexander Korotkov2023-10-07
| | | | | Reported-by: Richard Guo Discussion: https://postgr.es/m/CAMbWs4_kHMJDak75y1kBTirv-drS1-knT-7Mpg5LprAjqRJDVA%40mail.gmail.com
* Fix typos in e0b1ee17dcAlexander Korotkov2023-10-07
| | | | Reported-by: Alexander Lakhin
* Skip checking of scan keys required for directional scan in B-treeAlexander Korotkov2023-10-06
| | | | | | | | | | | | | | | | | | | | | Currently, B-tree code matches every scan key to every item on the page. Imagine the ordered B-tree scan for the query like this. SELECT * FROM tbl WHERE col > 'a' AND col < 'b' ORDER BY col; The (col > 'a') scan key will be always matched once we find the location to start the scan. The (col < 'b') scan key will match every item on the page as long as it matches the last item on the page. This patch implements prechecking of the scan keys required for directional scan on beginning of page scan. If precheck is successful we can skip this scan keys check for the items on the page. That could lead to significant acceleration especially if the comparison operator is expensive. Idea from patch by Konstantin Knizhnik. Discussion: https://postgr.es/m/079c3f8e-3371-abe2-e93c-fc8a0ae3f571%40garret.ru Reviewed-by: Peter Geoghegan, Pavel Borisov
* Move BuildDescForRelation() from tupdesc.c to tablecmds.cPeter Eisentraut2023-10-05
| | | | | | | | | | | | | | BuildDescForRelation() main job is to convert ColumnDef lists to pg_attribute/tuple descriptor arrays, which is really mostly an internal subroutine of DefineRelation() and some related functions, which is more the remit of tablecmds.c and doesn't have much to do with the basic tuple descriptor interfaces in tupdesc.c. This is also supported by observing the header includes we can remove in tupdesc.c. By moving it over, we can also (in the future) make BuildDescForRelation() use more internals of tablecmds.c that are not sensible to be exposed in tupdesc.c. Discussion: https://www.postgresql.org/message-id/flat/52a125e4-ff9a-95f5-9f61-b87cf447e4da@eisentraut.org
* Push attidentity and attgenerated handling into BuildDescForRelation()Peter Eisentraut2023-10-05
| | | | | | | | | Previously, this was handled by the callers separately, but it can be trivially moved into BuildDescForRelation() so that it is handled in a central place. Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://www.postgresql.org/message-id/flat/52a125e4-ff9a-95f5-9f61-b87cf447e4da@eisentraut.org
* Tidy-up some appendStringInfo*() usagesDavid Rowley2023-10-03
| | | | | | | | | | | | Make a few newish calls to appendStringInfo() which have no special formatting use appendStringInfoString() instead. Also, adjust usages of appendStringInfoString() which only append a string containing a single character to make use of appendStringInfoChar() instead. This makes the code marginally faster, but primarily this change is so we use the StringInfo type as it was intended to be used. Discussion: https://postgr.es/m/CAApHDvpXKQmL+r=VDNS98upqhr9yGBhv2Jw3GBFFk_wKHcB39A@mail.gmail.com
* Fail hard on out-of-memory failures in xlogreader.cMichael Paquier2023-10-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit changes the WAL reader routines so as a FATAL for the backend or exit(FAILURE) for the frontend is triggered if an allocation for a WAL record decode fails in walreader.c, rather than treating this case as bogus data, which would be equivalent to the end of WAL. The key is to avoid palloc_extended(MCXT_ALLOC_NO_OOM) in walreader.c, relying on plain palloc() calls. The previous behavior could make WAL replay finish too early than it should. For example, crash recovery finishing earlier may corrupt clusters because not all the WAL available locally was replayed to ensure a consistent state. Out-of-memory failures would show up randomly depending on the memory pressure on the host, but one simple case would be to generate a large record, then replay this record after downsizing a host, as Ethan Mertz originally reported. This relies on bae868caf222, as the WAL reader routines now do the memory allocation required for a record only once its header has been fully read and validated, making xl_tot_len trustable. Making the WAL reader react differently on out-of-memory or bogus record data would require ABI changes, so this is the safest choice for stable branches. Also, it is worth noting that 3f1ce973467a has been using a plain palloc() in this code for some time now. Thanks to Noah Misch and Thomas Munro for the discussion. Like the other commit, backpatch down to 12, leaving out v11 that will be EOL'd soon. The behavior of considering a failed allocation as bogus data comes originally from 0ffe11abd3a0, where the record length retrieved from its header was not entirely trustable. Reported-by: Ethan Mertz Discussion: https://postgr.es/m/ZRKKdI5-RRlta3aF@paquier.xyz Backpatch-through: 12
* Remove retry loop in heap_page_prune().Robert Haas2023-10-02
| | | | | | | | | | | | | | | The retry loop is needed because heap_page_prune() calls HeapTupleSatisfiesVacuum() and then lazy_scan_prune() does the same thing again, and they might get different answers due to concurrent clog updates. But this patch makes heap_page_prune() return the HeapTupleSatisfiesVacuum() results that it computed back to the caller, which allows lazy_scan_prune() to avoid needing to recompute those values in the first place. That's nice both because it eliminates the need for a retry loop and also because it's cheaper. Melanie Plageman, reviewed by David Geier, Andres Freund, and me. Discussion: https://postgr.es/m/CAAKRu_br124qsGJieuYA0nGjywEukhK1dKBfRdby_4yY3E9SXA%40mail.gmail.com
* Add rmgrdesc READMEHeikki Linnakangas2023-10-02
| | | | | | | | | | In the README, briefly explain what rmgrdesc functions are, and why they are in a separate directory. Commit c03c2eae0a added some guidelines on the preferred output format; move that to the README too. Reviewed-by: Melanie Plageman, Peter Geoghegan Discussion: https://www.postgresql.org/message-id/9159daf7-f42d-781b-458f-1b2cf32cb256%40iki.fi
* Correct assertion and comments about XLogRecordMaxSize.Noah Misch2023-10-01
| | | | | | The largest allocation, of xl_tot_len+8192, is in allocate_recordbuf(). Discussion: https://postgr.es/m/20230812211327.GB2326466@rfd.leadboat.com
* Fix btmarkpos/btrestrpos array key wraparound bug.Peter Geoghegan2023-09-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nbtree's mark/restore processing failed to correctly handle an edge case involving array key advancement and related search-type scan key state. Scans with ScalarArrayScalarArrayOpExpr quals requiring mark/restore processing (for a merge join) could incorrectly conclude that an affected array/scan key must not have advanced during the time between marking and restoring the scan's position. As a result of all this, array key handling within btrestrpos could skip a required call to _bt_preprocess_keys(). This confusion allowed later primitive index scans to overlook tuples matching the true current array keys. The scan's search-type scan keys would still have spurious values corresponding to the final array element(s) -- not values matching the first/now-current array element(s). To fix, remember that "array key wraparound" has taken place during the ongoing btrescan in a flag variable stored in the scan's state, and use that information at the point where btrestrpos decides if another call to _bt_preprocess_keys is required. Oversight in commit 70bc5833, which taught nbtree to handle array keys during mark/restore processing, but missed this subtlety. That commit was itself a bug fix for an issue in commit 9e8da0f7, which taught nbtree to handle ScalarArrayOpExpr quals natively. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzkgP3DDRJxw6DgjCxo-cu-DKrvjEv_ArkP2ctBJatDCYg@mail.gmail.com Backpatch: 11- (all supported branches).