path: root/src/backend/storage/ipc
...
* Revert: Implement pg_wal_replay_wait() stored procedure (Alexander Korotkov, 2024-04-11)

  This commit reverts 06c418e163, e37662f221, bf1e650806, 25f42429e2,
  ee79928441, and 74eaf66f98 per review by Heikki Linnakangas.

  Discussion: https://postgr.es/m/b155606b-e744-4218-bda5-29379779da1a%40iki.fi
* Combine freezing and pruning steps in VACUUM (Heikki Linnakangas, 2024-04-03)

  Execute both freezing and pruning of tuples in the same heap_page_prune()
  function, now called heap_page_prune_and_freeze(), and emit a single WAL
  record containing all changes. That reduces the overall amount of WAL
  generated.

  This moves the freezing logic from vacuumlazy.c to the
  heap_page_prune_and_freeze() function. The main difference in the coding is
  that in vacuumlazy.c, we looked at the tuples after the pruning had already
  happened, but in heap_page_prune_and_freeze() we operate on the tuples
  before pruning. The heap_prepare_freeze_tuple() function is now invoked
  after we have determined that a tuple is not going to be pruned away.

  VACUUM no longer needs to loop through the items on the page after pruning.
  heap_page_prune_and_freeze() does all the work. It now returns the list of
  dead offsets, including existing LP_DEAD items, to the caller. Similarly,
  it's now responsible for tracking 'all_visible', 'all_frozen', and 'hastup'
  on the caller's behalf.

  Author: Melanie Plageman <melanieplageman@gmail.com>
  Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
* Move WaitLSNShmemInit() to CreateOrAttachShmemStructs() (Alexander Korotkov, 2024-04-03)

  Thanks to Andres Freund, Thomas Munro, and David Rowley for investigating
  this issue.

  Discussion: https://postgr.es/m/CAPpHfdvap5mMLikt8CUjA0osAvCJHT0qnYeR3f84EJ_Kvse0mg%40mail.gmail.com
* Implement pg_wal_replay_wait() stored procedure (Alexander Korotkov, 2024-04-02)

  pg_wal_replay_wait() is to be used on a standby and waits for the specified
  WAL location to be replayed before starting the transaction. This option is
  useful when the user makes some data changes on the primary and needs a
  guarantee to see these changes on the standby.

  The queue of waiters is stored in a shared memory array sorted by LSN.
  During WAL replay, waiters whose LSNs have already been replayed are
  deleted from the shared memory array and woken up by setting their latches.

  pg_wal_replay_wait() needs to wait without any snapshot held. Otherwise,
  the snapshot could prevent the replay of WAL records, implying a kind of
  self-deadlock. This is why it is only possible to implement
  pg_wal_replay_wait() as a procedure working in a non-atomic context, not as
  a function.

  Catversion is bumped.

  Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
  Author: Kartyshov Ivan, Alexander Korotkov
  Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
  Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
* Fix false reports in pg_visibility (Alexander Korotkov, 2024-03-14)

  Currently, pg_visibility computes its xid horizon using
  GetOldestNonRemovableTransactionId(). The problem is that this horizon can
  sometimes go backward, which can lead to reporting false errors.

  In order to fix that, this commit implements a new function
  GetStrictOldestNonRemovableTransactionId(). This function computes an xid
  horizon that is guaranteed to be newer than or equal to any xid horizon
  computed before. We have to do the following to achieve this.

  1. Ignore processes' xmins, because they consider connections to other
     databases that were ignored before.
  2. Ignore KnownAssignedXids, because they are not database-aware. At the
     same time, the primary could compute its horizons database-aware.
  3. Ignore walsender xmin, because it could go backward if some replication
     connections don't use replication slots.

  As a result, we're using only currently running xids to compute the
  horizon. This significantly sacrifices accuracy, but we have to do so to
  avoid reporting false errors.

  Inspired by an earlier patch by Daniel Shelepanov and the following
  discussion with Robert Haas and Tom Lane.

  Discussion: https://postgr.es/m/1649062270.289865713%40f403.i.mail.ru
  Reviewed-by: Alexander Lakhin, Dmitry Koval
* Make the order of the header file includes consistent (Peter Eisentraut, 2024-03-13)

  Similar to commit 7e735035f20.

  Author: Richard Guo <guofenglinux@gmail.com>
  Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
  Discussion: https://www.postgresql.org/message-id/flat/CAMbWs4-WhpCFMbXCjtJ%2BFzmjfPrp7Hw1pk4p%2BZpU95Kh3ofZ1A%40mail.gmail.com
* Remove the adminpack contrib extension (Daniel Gustafsson, 2024-03-04)

  The adminpack extension was only used to support pgAdmin III, which in turn
  was declared EOL many years ago. Removing the extension also allows us to
  remove functions from core as well, which were only used to support old
  versions of adminpack.

  Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
  Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
  Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
  Discussion: https://postgr.es/m/CALj2ACUmL5TraYBUBqDZBi1C+Re8_=SekqGYqYprj_W8wygQ8w@mail.gmail.com
* Remove unused #include's from backend .c files (Peter Eisentraut, 2024-03-04)

  as determined by include-what-you-use (IWYU)

  While IWYU also suggests to *add* a bunch of #include's (which is its main
  purpose), this patch does not do that. In some cases, a more specific
  #include replaces another less specific one.

  Some manual adjustments of the automatic result:

  - IWYU currently doesn't know about includes that provide global variable
    declarations (like -Wmissing-variable-declarations), so those includes
    are being kept manually.
  - All includes for port(ability) headers are being kept for now, to play
    it safe.
  - No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the patch
    from exploding in size.

  Note that this patch touches just *.c files, so nothing declared in header
  files changes in hidden ways.

  As a small example, in src/backend/access/transam/rmgr.c, some IWYU pragma
  annotations are added to handle a special case there.

  Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
* Use MyBackendType in more places to check what process this is (Heikki Linnakangas, 2024-03-04)

  Remove IsBackgroundWorker, IsAutoVacuumLauncherProcess(),
  IsAutoVacuumWorkerProcess(), and IsLogicalSlotSyncWorker() in favor of new
  Am*Process() macros that use MyBackendType. For consistency with the
  existing Am*Process() macros.

  Reviewed-by: Andres Freund
  Discussion: https://www.postgresql.org/message-id/f3ecd4cb-85ee-4e54-8278-5fabfb3a4ed0@iki.fi
* Replace BackendIds with 0-based ProcNumbers (Heikki Linnakangas, 2024-03-03)

  Now that BackendId was just another index into the proc array, it was
  redundant with the 0-based proc numbers used in other places. Replace all
  usage of backend IDs with proc numbers.

  The only place where the term "backend id" remains is in a few pgstat
  functions that expose backend IDs at the SQL level. Those IDs are now in
  fact 0-based ProcNumbers too, but the documentation still calls them
  "backend ids". That term still seems appropriate to describe what the
  numbers are, so I let it be.

  One user-visible effect is that pg_temp_0 is now a valid temp schema name,
  for the backend with ProcNumber 0.

  Reviewed-by: Andres Freund
  Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
* Redefine backend ID to be an index into the proc array (Heikki Linnakangas, 2024-03-03)

  Previously, backend ID was an index into the ProcState array, in the shared
  cache invalidation manager (sinvaladt.c). The entry in the ProcState array
  was reserved at backend startup by scanning the array for a free entry, and
  that was also when the backend got its backend ID. Things become slightly
  simpler if we redefine backend ID to be the index into the PGPROC array,
  and directly use it also as an index to the ProcState array. This uses a
  little more memory, as we reserve a few extra slots in the ProcState array
  for aux processes that don't need them, but the simplicity is worth it.

  Aux processes now also have a backend ID. This simplifies the reservation
  of BackendStatusArray and ProcSignal slots.

  You can now convert a backend ID into an index into the PGPROC array simply
  by subtracting 1. We still use 0-based "pgprocnos" in various places, for
  indexes into the PGPROC array, but the only difference now is that backend
  IDs start at 1 while pgprocnos start at 0. (The next commit will get rid of
  the term "backend ID" altogether and make everything 0-based.)

  There is still a 'backendId' field in PGPROC, now part of 'vxid' which
  encapsulates the backend ID and local transaction ID together. It's needed
  for prepared xacts. For regular backends, the backendId is always equal to
  pgprocno + 1, but for prepared xact PGPROC entries, it's the ID of the
  original backend that processed the transaction.

  Reviewed-by: Andres Freund, Reid Thompson
  Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
* Add helper functions for dshash tables with string keys. (Nathan Bossart, 2024-02-26)

  Presently, string keys are not well-supported for dshash tables. The dshash
  code always copies key_size bytes into new entries' keys, and dshash.h only
  provides compare and hash functions that forward to memcmp() and
  tag_hash(), both of which do not stop at the first NUL. This means that
  callers must pad string keys so that the data beyond the first NUL does not
  adversely affect the results of copying, comparing, and hashing the keys.

  To better support string keys in dshash tables, this commit does a couple
  of things:

  * A new copy_function field is added to the dshash_parameters struct. This
    function pointer specifies how the key should be copied into new table
    entries. For example, we only want to copy up to the first NUL byte for
    string keys. A dshash_memcpy() helper function is provided and used for
    all existing in-tree dshash tables without string keys.

  * A set of helper functions for string keys is provided. These helper
    functions forward to strcmp(), strcpy(), and string_hash(), all of which
    ignore data beyond the first NUL.

  This commit also adjusts the DSM registry's dshash table to use the new
  helper functions for string keys.

  Reviewed-by: Andy Fan
  Discussion: https://postgr.es/m/20240119215941.GA1322079%40nathanxps13
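
  To illustrate the resulting setup, here is a minimal sketch of a dshash
  table keyed by a NUL-terminated string. The helper and field names follow
  the description above, but the exact dshash_parameters layout and the
  tranche constant are assumptions to be checked against dshash.h.

    #include "lib/dshash.h"

    #define MY_KEY_LEN 64

    /* Hypothetical entry type: a string key followed by a payload. */
    typedef struct MyEntry
    {
        char        name[MY_KEY_LEN];   /* only bytes up to the NUL matter */
        int         value;
    } MyEntry;

    static const dshash_parameters my_params = {
        .key_size = MY_KEY_LEN,
        .entry_size = sizeof(MyEntry),
        .compare_function = dshash_strcmp,      /* stops at the first NUL */
        .hash_function = dshash_string_hash,    /* ignores bytes past the NUL */
        .copy_function = dshash_strcpy,         /* new field from this commit */
        .tranche_id = 0,                        /* use a registered tranche id */
    };

    /* A table would then be created with dshash_create(area, &my_params, NULL). */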
* Remove superfluous 'pgprocno' field from PGPROC (Heikki Linnakangas, 2024-02-22)

  It was always just the index of the PGPROC entry from the beginning of the
  proc array. Introduce a macro to compute it from the pointer instead.

  Reviewed-by: Andres Freund
  Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
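
  The commit message doesn't show the macro itself; the name and definition
  below are an assumption for illustration, based on the description of plain
  pointer arithmetic against the start of the shared proc array.

    #include "storage/proc.h"       /* ProcGlobal, MyProc */

    /* Hypothetical sketch of the pointer-arithmetic macro described above. */
    #define GetNumberFromPGProc(proc)  ((int) ((proc) - &ProcGlobal->allProcs[0]))

    static int
    my_proc_number(void)
    {
        /* Recover the 0-based index of our own PGPROC entry from its pointer. */
        return GetNumberFromPGProc(MyProc);
    }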
* Centralize logic for restoring errno in signal handlers. (Nathan Bossart, 2024-02-14)

  Presently, we rely on each individual signal handler to save the initial
  value of errno and then restore it before returning if needed. This is
  easily forgotten and, if missed, often goes undetected for a long time.

  In commit 3b00fdba9f, we introduced a wrapper signal handler function that
  checks whether MyProcPid matches getpid(). This commit moves the
  aforementioned errno restoration code from the individual signal handlers
  to the new wrapper handler so that we no longer need to worry about missing
  it.

  Reviewed-by: Andres Freund, Noah Misch
  Discussion: https://postgr.es/m/20231121212008.GA3742740%40nathanxps13
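
  As an illustration, here is a minimal sketch of the wrapper pattern
  described above; the registration storage and function names are
  hypothetical, not the actual code in the tree.

    #include <errno.h>
    #include <signal.h>
    #include <unistd.h>

    typedef void (*sighandler_fn) (int signo);

    /* One saved "real" handler per signal number (hypothetical storage). */
    static sighandler_fn real_handlers[NSIG];

    static void
    wrapper_handler(int signo)
    {
        int         save_errno = errno; /* errno saved once, centrally */

        /*
         * The real wrapper also compares MyProcPid against getpid() to catch
         * handlers running in fork()ed children (see commit 3b00fdba9f).
         */
        if (real_handlers[signo] != NULL)
            real_handlers[signo] (signo);

        errno = save_errno;             /* ... and restored once, centrally */
    }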
* Add a slot synchronization function. (Amit Kapila, 2024-02-14)

  This commit introduces a new SQL function pg_sync_replication_slots() which
  is used to synchronize the logical replication slots from the primary
  server to the physical standby so that logical replication can be resumed
  after a failover or planned switchover.

  A new 'synced' flag is introduced in the pg_replication_slots view,
  indicating whether the slot has been synchronized from the primary server.
  On a standby, synced slots cannot be dropped or consumed, and any attempt
  to perform logical decoding on them will result in an error.

  The logical replication slots on the primary can be synchronized to the hot
  standby by using the 'failover' parameter of
  pg_create_logical_replication_slot(), or by using the 'failover' option of
  CREATE SUBSCRIPTION during slot creation, and then calling
  pg_sync_replication_slots() on the standby. For the synchronization to
  work, it is mandatory to have a physical replication slot between the
  primary and the standby (i.e. 'primary_slot_name' should be configured on
  the standby), and 'hot_standby_feedback' must be enabled on the standby. It
  is also necessary to specify a valid 'dbname' in the 'primary_conninfo'.

  If a logical slot is invalidated on the primary, then that slot on the
  standby is also invalidated.

  If a logical slot on the primary is valid but is invalidated on the
  standby, then that slot is dropped but will be recreated on the standby in
  the next pg_sync_replication_slots() call, provided the slot still exists
  on the primary server. It is okay to recreate such slots as long as these
  are not consumable on the standby (which is the case currently). This
  situation may occur due to the following reasons:
  - The 'max_slot_wal_keep_size' on the standby is insufficient to retain
    WAL records from the restart_lsn of the slot.
  - 'primary_slot_name' is temporarily reset to null and the physical slot
    is removed.

  The slot synchronization status on the standby can be monitored using the
  'synced' column of the pg_replication_slots view.

  A functionality to automatically synchronize slots by a background worker
  and to allow logical walsenders to wait for the physical standby will be
  done in subsequent commits.

  Author: Hou Zhijie, Shveta Malik, Ajin Cherian based on an earlier version
  by Peter Eisentraut
  Reviewed-by: Masahiko Sawada, Bertrand Drouvot, Peter Smith, Dilip Kumar,
  Nisha Moond, Kuroda Hayato, Amit Kapila
  Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
* Fix 'mmap' DSM implementation with allocations larger than 4 GB (Heikki Linnakangas, 2024-02-13)

  Fixes bug #18341. Backpatch to all supported versions.

  Discussion: https://www.postgresql.org/message-id/18341-ce16599e7fd6228c@postgresql.org
* Fix possible NULL pointer dereference in GetNamedDSMSegment(). (Nathan Bossart, 2024-01-22)

  GetNamedDSMSegment() doesn't check whether dsm_attach() returns NULL, which
  creates the possibility of a NULL pointer dereference soon after. To fix,
  emit an ERROR if dsm_attach() returns NULL. This shouldn't happen, but it
  would be nice to avoid a segfault if it does.

  In passing, tidy up the surrounding code.

  Reported-by: Tom Lane
  Reviewed-by: Michael Paquier, Bharath Rupireddy
  Discussion: https://postgr.es/m/3348869.1705854106%40sss.pgh.pa.us
* Add backend support for injection points (Michael Paquier, 2024-01-22)

  Injection points are a new facility that makes it possible for developers
  to run custom code in pre-defined code paths. Its goal is to provide ways
  to design and run advanced tests, for cases like:
  - Race conditions, where processes need to do actions in a controlled,
    ordered manner.
  - Forcing a state, like an ERROR, FATAL or even PANIC for OOM, to force
    recovery, etc.
  - Arbitrary sleeps.

  This implements some basics, and there are plans to extend it more in the
  future depending on what's required. Hence, this commit adds a set of
  routines in the backend that allows developers to attach, detach and run
  injection points:
  - A code path calling an injection point can be declared with the macro
    INJECTION_POINT(name).
  - InjectionPointAttach() and InjectionPointDetach() respectively attach
    and detach a callback to/from an injection point. An injection point
    name is registered in a shmem hash table with a library name and a
    function name, which will be used to load the callback attached to an
    injection point when its code path is run.

  Injection point names are just strings, so an injection point can be
  declared and run by out-of-core extensions and modules, with callbacks
  defined in external libraries.

  This facility is hidden behind a dedicated switch for ./configure and
  meson, disabled by default.

  Note that backends use a local cache to store callbacks already loaded,
  cleaning up their cache if a callback is found to have been removed, on a
  best-effort basis. This could be refined further, but what we have here
  was fine with the tests I've written while implementing these backend APIs.

  Author: Michael Paquier, with doc suggestions from Ashutosh Bapat.
  Reviewed-by: Ashutosh Bapat, Nathan Bossart, Álvaro Herrera, Dilip Kumar,
  Amul Sul, Nazir Bilal Yavuz
  Discussion: https://postgr.es/m/ZTiV8tn_MIb_H2rE@paquier.xyz
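
  A sketch of how a test module might use these routines, assuming the
  initial one-argument INJECTION_POINT() macro, a callback that receives the
  point name, and the three-argument InjectionPointAttach(); the point,
  library, and function names below are hypothetical.

    #include "postgres.h"
    #include "utils/injection_point.h"

    /* Callback in a test module: force an ERROR when the point is reached. */
    void
    my_injection_error(const char *name)
    {
        elog(ERROR, "injection point \"%s\" fired", name);
    }

    /* Somewhere in the backend code path being exercised. */
    void
    code_path_under_test(void)
    {
        INJECTION_POINT("my-test-point");   /* runs the attached callback, if any */
    }

    /* Test setup and teardown, e.g. from a SQL-callable test function. */
    void
    arm_my_injection_point(void)
    {
        InjectionPointAttach("my-test-point", "my_module", "my_injection_error");
    }

    void
    disarm_my_injection_point(void)
    {
        InjectionPointDetach("my-test-point");
    }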
* Introduce the dynamic shared memory registry. (Nathan Bossart, 2024-01-19)

  Presently, the most straightforward way for a shared library to use shared
  memory is to request it at server startup via a shmem_request_hook, which
  requires specifying the library in shared_preload_libraries. Alternatively,
  the library can create a dynamic shared memory (DSM) segment, but absent a
  shared location to store the segment's handle, other backends cannot use
  it. This commit introduces a registry for DSM segments so that these other
  backends can look up existing segments with a library-specified string.
  This allows libraries to easily use shared memory without needing to
  request it at server startup.

  The registry is accessed via the new GetNamedDSMSegment() function. This
  function handles allocating the segment and initializing it via a provided
  callback. If another backend already created and initialized the segment,
  it simply attaches the segment. GetNamedDSMSegment() locks the registry
  appropriately to ensure that only one backend initializes the segment and
  that all other backends just attach it.

  The registry itself is comprised of a dshash table that stores the DSM
  segment handles keyed by a library-specified string.

  Reviewed-by: Michael Paquier, Andrei Lepikhov, Nikita Malakhov, Robert
  Haas, Bharath Rupireddy, Zhang Mingli, Amul Sul
  Discussion: https://postgr.es/m/20231205034647.GA2705267%40nathanxps13
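
  A minimal sketch of how an extension might use the registry, assuming
  GetNamedDSMSegment() takes the segment name, its size, an initialization
  callback, and a found flag (check dsm_registry.h for the exact signature);
  the struct and names below are hypothetical.

    #include "postgres.h"
    #include "storage/dsm_registry.h"
    #include "storage/lwlock.h"

    /* Hypothetical shared state for an extension. */
    typedef struct MySharedState
    {
        LWLock      lock;
        int         counter;
    } MySharedState;

    static void
    my_init_shared_state(void *ptr)
    {
        MySharedState *state = (MySharedState *) ptr;

        /* Runs exactly once, in whichever backend creates the segment. */
        LWLockInitialize(&state->lock, LWLockNewTrancheId());
        state->counter = 0;
    }

    static MySharedState *
    my_attach_shared_state(void)
    {
        bool        found;

        /* Creates and initializes the segment on first use, attaches otherwise. */
        return (MySharedState *) GetNamedDSMSegment("my_extension_state",
                                                    sizeof(MySharedState),
                                                    my_init_shared_state,
                                                    &found);
    }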
* Fix name collision in c64086b79dba (Alexander Korotkov, 2024-01-19)

  Reported-by: Erik Rijkers, Tom Lane
  Discussion: https://postgr.es/m/E1rQqeS-002A0s-Qm%40gemulon.postgresql.org
* Reorder actions in ProcArrayApplyRecoveryInfo() (Alexander Korotkov, 2024-01-19)

  Since 5a1dfde8334b, 2PC filenames use FullTransactionId. Thus,
  StandbyTransactionIdIsPrepared() needs to convert a TransactionId to a
  FullTransactionId using TransamVariables->nextXid. However,
  ProcArrayApplyRecoveryInfo() first releases locks using
  StandbyTransactionIdIsPrepared(), then advances TransamVariables->nextXid.
  This sequence of actions could cause errors. This commit makes
  ProcArrayApplyRecoveryInfo() advance TransamVariables->nextXid before
  releasing locks.

  Reported-by: Thomas Munro, Michael Paquier
  Discussion: https://postgr.es/m/CA%2BhUKGLj_ve1_pNAnxwYU9rDcv7GOhsYXJt7jMKSA%3D5-6ss-Cw%40mail.gmail.com
  Discussion: https://postgr.es/m/Zadp9f4E1MYvMJqe%40paquier.xyz
* Update copyright for 2024 (Bruce Momjian, 2024-01-03)

  Reported-by: Michael Paquier
  Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz
  Backpatch-through: 12
* Add support for incremental backup. (Robert Haas, 2023-12-20)

  To take an incremental backup, you use the new replication command
  UPLOAD_MANIFEST to upload the manifest for the prior backup. This prior
  backup could either be a full backup or another incremental backup. You
  then use BASE_BACKUP with the INCREMENTAL option to take the backup.
  pg_basebackup now has an --incremental=PATH_TO_MANIFEST option to trigger
  this behavior.

  An incremental backup is like a regular full backup except that some
  relation files are replaced with files with names like
  INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
  additional lines identifying it as an incremental backup. The new
  pg_combinebackup tool can be used to reconstruct a data directory from a
  full backup and a series of incremental backups.

  Patch by me. Reviewed by Matthias van de Meent, Dilip Kumar, Jakub Wartak,
  Peter Eisentraut, and Álvaro Herrera. Thanks especially to Jakub for
  incredibly helpful and extensive testing.

  Discussion: http://postgr.es/m/CA+TgmoYOYZfMCyOXFyC-P+-mdrZqm5pP2N7S-r0z3_402h9rsA@mail.gmail.com
* Remove trace_recovery_messages (Michael Paquier, 2023-12-11)

  This GUC was intended as a debugging help in the 9.0 area, when hot standby
  and streaming replication were being developed, able to offer more
  information at LOG level rather than DEBUGn. There are more tools available
  these days that are able to offer rather equivalent information, like
  pg_waldump introduced in 9.3. It is not obvious how this facility is useful
  these days, so let's remove it.

  Author: Bharath Rupireddy
  Discussion: https://postgr.es/m/ZXEXEAUVFrvpquSd@paquier.xyz
* Rename ShmemVariableCache to TransamVariables (Heikki Linnakangas, 2023-12-08)

  The old name was misleading: It's not a cache, the values kept in the
  struct are the authoritative source.

  Reviewed-by: Tristan Partin, Richard Guo
  Discussion: https://www.postgresql.org/message-id/6537d63d-4bb5-46f8-9b5d-73a8ba4720ab@iki.fi
* Initialize ShmemVariableCache like other shmem areas (Heikki Linnakangas, 2023-12-08)

  For sake of consistency.

  Reviewed-by: Tristan Partin, Richard Guo
  Discussion: https://www.postgresql.org/message-id/6537d63d-4bb5-46f8-9b5d-73a8ba4720ab@iki.fi
* Refactor CreateSharedMemoryAndSemaphores (Heikki Linnakangas, 2023-12-03)

  For clarity, have separate functions for *creating* the shared memory and
  semaphores at postmaster or single-user backend startup, and for
  *attaching* to existing shared memory structures in the EXEC_BACKEND case.
  CreateSharedMemoryAndSemaphores() is now called only at postmaster startup,
  and a new AttachSharedMemoryStructs() function is called at backend startup
  in EXEC_BACKEND mode.

  Reviewed-by: Tristan Partin, Andres Freund
  Discussion: https://www.postgresql.org/message-id/7a59b073-5b5b-151e-7ed3-8b01ff7ce9ef@iki.fi
* Use ResourceOwner to track WaitEventSets. (Heikki Linnakangas, 2023-11-23)

  A WaitEventSet holds file descriptors or event handles (on Windows). If
  FreeWaitEventSet is not called, those fds or handles are leaked. Use
  ResourceOwners to track WaitEventSets, to clean those up automatically on
  error.

  This was a live bug in async Append nodes, if an FDW's ForeignAsyncRequest
  function failed. (In back branches, I will apply a more localized fix for
  that based on PG_TRY-PG_FINALLY.)

  The added test doesn't check for leaking resources, so it passed even
  before this commit. But at least it covers the code path.

  In passing, fix a misleading comment on what the 'nevents' argument to
  WaitEventSetWait means.

  Report by Alexander Lakhin, analysis and suggestion for the fix by Tom
  Lane. Fixes bug #17828.

  Reviewed-by: Alexander Lakhin, Thomas Munro
  Discussion: https://www.postgresql.org/message-id/472235.1678387869@sss.pgh.pa.us
* Remove incorrect file reference in comment. (Etsuro Fujita, 2023-11-13)

  Commit b7eda3e0e moved XidInMVCCSnapshot() from tqual.c into snapmgr.c, but
  follow-up commit c91560def incorrectly updated this reference. We could fix
  it, but as pointed out by Daniel Gustafsson, 1) the reader can easily find
  the file that contains the definition of that function, e.g. by grepping,
  and 2) this kind of reference is prone to going stale; so let's just remove
  it.

  Back-patch to all supported branches.

  Reviewed by Daniel Gustafsson.
  Discussion: https://postgr.es/m/CAPmGK145VdKkPBLWS2urwhgsfidbSexwY-9zCL6xSUJH%2BBTUUg%40mail.gmail.com
* Make ResourceOwners more easily extensible. (Heikki Linnakangas, 2023-11-08)

  Instead of having a separate array/hash for each resource kind, use a
  single array and hash to hold all kinds of resources. This makes it
  possible to introduce new resource "kinds" without having to modify the
  ResourceOwnerData struct. In particular, this makes it possible for
  extensions to register custom resource kinds.

  The old approach was to have a small array of resources of each kind, and
  if it fills up, switch to a hash table. The new approach also uses an array
  and a hash, but now the array and the hash are used at the same time. The
  array is used to hold the recently added resources, and when it fills up,
  they are moved to the hash. This keeps the access to recent entries fast,
  even when there are a lot of long-held resources.

  All the resource-specific ResourceOwnerEnlarge*(), ResourceOwnerRemember*(),
  and ResourceOwnerForget*() functions have been replaced with three generic
  functions that take resource kind as argument. For convenience, we still
  define resource-specific wrapper macros around the generic functions with
  the old names, but they are now defined in the source files that use those
  resource kinds.

  The release callback no longer needs to call ResourceOwnerForget on the
  resource being released. ResourceOwnerRelease unregisters the resource from
  the owner before calling the callback. That needed some changes in bufmgr.c
  and some other files, where releasing the resources previously always
  called ResourceOwnerForget.

  Each resource kind specifies a release priority, and ResourceOwnerReleaseAll
  releases the resources in priority order. To make that possible, we have to
  restrict what you can do between phases. After calling
  ResourceOwnerRelease(), you are no longer allowed to remember any more
  resources in it or to forget any previously remembered resources by calling
  ResourceOwnerForget. There was one case where that was done previously. At
  subtransaction commit, AtEOSubXact_Inval() would handle the invalidation
  messages and call RelationFlushRelation(), which temporarily increased the
  reference count on the relation being flushed. We now switch to the parent
  subtransaction's resource owner before calling AtEOSubXact_Inval(), so that
  there is a valid ResourceOwner to temporarily hold that relcache reference.

  Other end-of-xact routines make similar calls to AtEOXact_Inval() between
  release phases, but I didn't see any regression test failures from those,
  so I'm not sure if they could reach a codepath that needs remembering extra
  resources.

  There were two exceptions to how the resource leak WARNINGs on commit were
  printed previously: llvmjit silently released the context without printing
  the warning, and a leaked buffer io triggered a PANIC. Now everything
  prints a WARNING, including those cases.

  Add tests in src/test/modules/test_resowner.

  Reviewed-by: Aleksander Alekseev, Michael Paquier, Julien Rouhaud
  Reviewed-by: Kyotaro Horiguchi, Hayato Kuroda, Álvaro Herrera, Zhihong Yu
  Reviewed-by: Peter Eisentraut, Andres Freund
  Discussion: https://www.postgresql.org/message-id/cbfabeb0-cd3c-e951-a572-19b365ed314d%40iki.fi
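
  A sketch of registering a custom resource kind with the generic API
  described above; the ResourceOwnerDesc field names and the release
  phase/priority constants are assumptions to be checked against resowner.h.

    #include "postgres.h"
    #include "utils/resowner.h"

    /* Hypothetical release callback for a custom resource kind. */
    static void
    my_widget_release(Datum res)
    {
        pfree(DatumGetPointer(res));
    }

    /* Descriptor for the custom kind (field names assumed; see resowner.h). */
    static const ResourceOwnerDesc my_widget_resource_kind = {
        .name = "my widget",
        .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
        .release_priority = RELEASE_PRIO_FIRST,
        .ReleaseResource = my_widget_release,
        .DebugPrint = NULL,
    };

    static void
    remember_widget(ResourceOwner owner, void *widget)
    {
        /* Reserve space first, then remember, mirroring the old Enlarge pattern. */
        ResourceOwnerEnlarge(owner);
        ResourceOwnerRemember(owner, PointerGetDatum(widget),
                              &my_widget_resource_kind);
    }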
* Ban role pg_signal_backend from more superuser backend types. (Noah Misch, 2023-11-06)

  Documentation says it cannot signal "a backend owned by a superuser". On
  the contrary, it could signal background workers, including the logical
  replication launcher. It could signal autovacuum workers and the
  autovacuum launcher. Block all that. Signaling autovacuum workers and those
  two launchers doesn't stall progress beyond what one could achieve other
  ways. If a cluster uses a non-core extension with a background worker that
  does not auto-restart, this could create a denial of service with respect
  to that background worker. A background worker with bugs in its code for
  responding to terminations or cancellations could experience those bugs at
  a time the pg_signal_backend member chooses.

  Back-patch to v11 (all supported versions).

  Reviewed by Jelte Fennema-Nio. Reported by Hemanth Sandrana and
  Mahendrakar Srinivasarao.

  Security: CVE-2023-5870
* Add trailing commas to enum definitions (Peter Eisentraut, 2023-10-26)

  Since C99, there can be a trailing comma after the last value in an enum
  definition. A lot of new code has been introducing this style on the fly.
  Some new patches are now taking an inconsistent approach to this. Some add
  the last comma on the fly if they add a new last value, some are trying to
  preserve the existing style in each place, some are even dropping the last
  comma if there was one. We could nudge this all in a consistent direction
  if we just add the trailing commas everywhere once.

  I omitted a few places where there was a fixed "last" value that will
  always stay last. I also skipped the header files of libpq and ecpg, in
  case people want to use those with older compilers. There were also a
  small number of cases where the enum type wasn't used anywhere (but the
  enum values were), which ended up confusing pgindent a bit, so I left
  those alone.

  Discussion: https://www.postgresql.org/message-id/flat/386f8c45-c8ac-4681-8add-e3b0852c1620%40eisentraut.org
* Fix min_dynamic_shared_memory on Windows. (Thomas Munro, 2023-10-22)

  When min_dynamic_shared_memory is set above 0, we try to find space in a
  pre-allocated region of the main shared memory area instead of calling
  dsm_impl_XXX() routines to allocate more. The dsm_pin_segment() and
  dsm_unpin_segment() routines had a bug: they called dsm_impl_XXX() routines
  even for main region segments. Nobody noticed before now because those
  routines do nothing on Unix, but on Windows they'd fail while attempting to
  duplicate an invalid Windows HANDLE. Add the missing gating.

  Back-patch to 14, where commit 84b1c63a added this feature. Fixes
  pgsql-bugs bug #18165.

  Reported-by: Maxime Boyer <maxime.boyer@cra-arc.gc.ca>
  Tested-by: Alexander Lakhin <exclusion@gmail.com>
  Discussion: https://postgr.es/m/18165-bf4f525cea6e51de%40postgresql.org
* Avoid calling proc_exit() in processes forked by system(). (Nathan Bossart, 2023-10-17)

  The SIGTERM handler for the startup process immediately calls proc_exit()
  for the duration of the restore_command, i.e., a call to system(). This
  system() call forks a new process to execute the shell command, and this
  child process inherits the parent's signal handlers. If both the parent
  and child processes receive SIGTERM, both will attempt to call proc_exit().
  This can end badly. For example, both processes will try to remove
  themselves from the PGPROC shared array.

  To fix this problem, this commit adds a check in
  StartupProcShutdownHandler() to see whether MyProcPid == getpid(). If they
  match, this is the parent process, and we can proc_exit() like before. If
  they do not match, this is a child process, and we just emit a message to
  STDERR (in a signal safe manner) and _exit(), thereby skipping any
  problematic exit callbacks.

  This commit also adds checks in proc_exit(), ProcKill(), and
  AuxiliaryProcKill() that verify they are not being called within such
  child processes.

  Suggested-by: Andres Freund
  Reviewed-by: Thomas Munro, Andres Freund
  Discussion: https://postgr.es/m/Y9nGDSgIm83FHcad%40paquier.xyz
  Discussion: https://postgr.es/m/20230223231503.GA743455%40nathanxps13
  Backpatch-through: 11
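
  A simplified sketch of the getpid() check described above, not the literal
  handler from startup.c; the handler and variable names are hypothetical.

    #include <signal.h>
    #include <unistd.h>

    static pid_t my_proc_pid;   /* set at process start, inherited by children */

    static void
    startup_sigterm_handler(int signo)
    {
        (void) signo;

        if (getpid() == my_proc_pid)
        {
            /* Parent (the real startup process): take the normal shutdown path. */
            /* e.g. request proc_exit(1) or set a shutdown flag */
        }
        else
        {
            /*
             * We are a child forked by system() that inherited this handler.
             * Avoid proc_exit() and its callbacks; write() and _exit() are
             * async-signal-safe.
             */
            static const char msg[] = "child process received SIGTERM, exiting\n";

            (void) write(STDERR_FILENO, msg, sizeof(msg) - 1);
            _exit(1);
        }
    }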
* Remove extra parenthesis from comment. (Etsuro Fujita, 2023-10-06)
* Teach WaitEventSetWait() to report multiple events on Windows. (Thomas Munro, 2023-09-08)

  The WAIT_USE_WIN32 implementation of WaitEventSetWait() previously reported
  at most one event per call, because that's what the underlying
  WaitForMultipleObjects() call does. We can make the behavior match the
  three Unix implementations by looping until our output buffer is full, or
  there are no more events available now.

  This makes no difference to most callers including the regular FEBE socket
  code, since they ask for at most one event anyway. A difference in socket
  accept priority might be perceived by end users after commit 7389aad6
  started using WaitEventSet in the postmaster. With this commit, the accept
  order now matches Unix systems, servicing listening sockets in round-robin
  order.

  We decided it wasn't really a bug or worth back-patching, but it seems good
  to align the behavior across platforms.

  Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
  Tested-by: "Wei Wang (Fujitsu)" <wangw.fnst@fujitsu.com>
  Discussion: https://postgr.es/m/CA%2BhUKG%2BA2dk29hr5zRP3HVJQ-_PncNJM6HVQ7aaYLXLRBZU-xw%40mail.gmail.com
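
  For reference, a hedged sketch of a caller that asks WaitEventSetWait() for
  more than one event per call, which is where the Windows behavior change
  becomes visible; the buffer size and wait-event constant are arbitrary
  choices for illustration.

    #include "postgres.h"
    #include "miscadmin.h"
    #include "storage/latch.h"

    static void
    drain_wait_events(WaitEventSet *wes)
    {
        WaitEvent   events[8];
        int         nevents;
        int         i;

        /* Block until something happens, collecting up to 8 events at once. */
        nevents = WaitEventSetWait(wes, -1, events, lengthof(events),
                                   WAIT_EVENT_CLIENT_READ);

        for (i = 0; i < nevents; i++)
        {
            if (events[i].events & WL_LATCH_SET)
                ResetLatch(MyLatch);
            if (events[i].events & WL_SOCKET_READABLE)
            {
                /* handle the readable socket in events[i].fd */
            }
        }
    }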
* Improve BackendXidGetPid() to only access allProcs on matching XID (Michael Paquier, 2023-09-08)

  Compilers are able to optimize that, but it makes the code slightly more
  readable this way.

  Author: Zhao Junwang
  Reviewed-by: Ashutosh Bapat
  Discussion: https://postgr.es/m/CAEG8a3+i9gtqF65B+g_puVaCQuf0rZC-EMqMyEjGFJYOqUUWfA@mail.gmail.com
* Fix recovery conflict SIGUSR1 handling. (Thomas Munro, 2023-09-07)

  We shouldn't be doing non-trivial work in signal handlers in general, and
  in this case the handler could reach unsafe code and corrupt state. It
  also clobbered its own "reason" code.

  Move all recovery conflict decision logic into the next
  CHECK_FOR_INTERRUPTS(), and have the signal handler just set flags and the
  latch, following the standard pattern. Since there are several different
  "reasons", use a separate flag for each.

  With this refactoring, the recovery conflict system no longer piggy-backs
  on top of the regular query cancelation mechanism, but instead raises an
  error directly if it decides that is necessary. It still needs to respect
  QueryCancelHoldoffCount, because otherwise the FEBE protocol might get out
  of sync (see commit 2b3a8b20c2d).

  This fixes one class of intermittent failure in the new
  031_recovery_conflict.pl test added by commit 9f8a050f, though the buggy
  coding is much older. Failures outside contrived testing seem to be very
  rare (or perhaps incorrectly attributed) in the field, based on lack of
  reports.

  No back-patch for now due to complexity and release schedule. We have the
  option to back-patch into 16 later, as 16 has prerequisite commit bea3d7e.

  Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
  Reviewed-by: Michael Paquier <michael@paquier.xyz> (earlier version)
  Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier version)
  Tested-by: Christoph Berg <myon@debian.org>
  Discussion: https://postgr.es/m/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com
  Discussion: https://postgr.es/m/CALj2ACVr8au2J_9D88UfRCi0JdWhyQDDxAcSVav0B0irx9nXEg%40mail.gmail.com
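
  A generic sketch of the pattern described above (flags and latch set in the
  handler, decisions deferred to interrupt processing); the flag and function
  names are hypothetical, not the actual recovery-conflict code.

    #include "postgres.h"
    #include "miscadmin.h"
    #include "storage/latch.h"

    /* Hypothetical per-reason flags; the real reasons live in procsignal.h. */
    static volatile sig_atomic_t conflict_buffer_pin_pending = false;
    static volatile sig_atomic_t conflict_snapshot_pending = false;

    /* Called from the signal handler path: only set flags and the latch. */
    static void
    handle_conflict_buffer_pin(void)
    {
        conflict_buffer_pin_pending = true;
        InterruptPending = true;
        SetLatch(MyLatch);
    }

    /* Called later from the CHECK_FOR_INTERRUPTS() path, outside the handler. */
    static void
    process_pending_recovery_conflicts(void)
    {
        if (conflict_buffer_pin_pending)
        {
            conflict_buffer_pin_pending = false;
            /* decide here whether to raise an ERROR, respecting holdoff counts */
        }
        if (conflict_snapshot_pending)
        {
            conflict_snapshot_pending = false;
            /* likewise for the snapshot conflict reason */
        }
    }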
* Use more consistent names for wait event objects and types (Michael Paquier, 2023-09-06)

  The event names use the same case-insensitive characters, hence applying
  lower() or upper() to the monitoring queries allows the detection of the
  same events as before this change.

  It is possible to cross-check the data with the system view pg_wait_events,
  for instance with a query like the following, which shows no differences:
  SELECT lower(type), lower(name), description FROM pg_wait_events ORDER BY 1, 2;

  This will help in the introduction of more simplifications in the format of
  wait_event_names. Some of the enum values in the code had to be renamed a
  bit to follow the same naming convention across the board.

  Reviewed-by: Bertrand Drouvot
  Discussion: https://postgr.es/m/ZOxVHQwEC/9X/p/z@paquier.xyz
* Replace known_assigned_xids_lck with memory barriers. (Nathan Bossart, 2023-09-05)

  This lock was introduced before memory barrier support was added, and it is
  only used to guarantee proper memory ordering when KnownAssignedXidsAdd()
  appends to the array without a lock. Now that such memory barrier support
  exists, we can remove the lock and use barriers instead.

  Suggested-by: Tom Lane
  Author: Michail Nikolaev
  Reviewed-by: Robert Haas
  Discussion: https://postgr.es/m/CANtu0oh0si%3DjG5z_fLeFtmYcETssQ08kLEa8b6TQqDm_cinroA%40mail.gmail.com
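
  A generic single-writer append/read pattern using the barrier primitives
  that replace the spinlock; the structure below is hypothetical and only
  illustrates the ordering, not the actual KnownAssignedXids code.

    #include "postgres.h"
    #include "port/atomics.h"       /* pg_read_barrier() / pg_write_barrier() */

    /* Hypothetical shared structure: one writer appends, readers scan. */
    typedef struct XidRing
    {
        volatile int head;          /* next free slot, advanced by the writer */
        TransactionId xids[1024];
    } XidRing;

    static void
    ring_append(XidRing *ring, TransactionId xid)
    {
        int         head = ring->head;

        ring->xids[head] = xid;
        pg_write_barrier();         /* publish the entry before advancing head */
        ring->head = head + 1;
    }

    static int
    ring_read(XidRing *ring, TransactionId *dst)
    {
        int         head = ring->head;

        pg_read_barrier();          /* read entries only after reading head */
        memcpy(dst, ring->xids, head * sizeof(TransactionId));
        return head;
    }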
* Remove the "snapshot too old" feature. (Thomas Munro, 2023-09-05)

  Remove the old_snapshot_threshold setting and mechanism for producing the
  error "snapshot too old", originally added by commit 848ef42b.
  Unfortunately it had a number of known problems in terms of correctness and
  performance, mostly reported by Andres in the course of his work on
  snapshot scalability. We agreed to remove it, after a long period without
  an active plan to fix it.

  This is certainly a desirable feature, and someone might propose a new or
  improved implementation in the future.

  Reported-by: Andres Freund <andres@anarazel.de>
  Discussion: https://postgr.es/m/CACG%3DezYV%2BEvO135fLRdVn-ZusfVsTY6cH1OZqWtezuEYH6ciQA%40mail.gmail.com
  Discussion: https://postgr.es/m/20200401064008.qob7bfnnbu4w5cw4%40alap3.anarazel.de
  Discussion: https://postgr.es/m/CA%2BTgmoY%3Daqf0zjTD%2B3dUWYkgMiNDegDLFjo%2B6ze%3DWtpik%2B3XqA%40mail.gmail.com
* Fix wording in comment (Daniel Gustafsson, 2023-08-23)

  The comment for the DSM_OP_CREATE parameter read "the a new handle", which
  is confusing. Fix by rewording to indicate what the parameter means for
  DSM_OP_CREATE.

  Reported-by: Junwang Zhao <zhjwpku@gmail.com>
  Discussion: https://postgr.es/m/CAEG8a3J2bc197ym-M_ykOXb9ox2eNn-QNKNeoSAoHYSw2NCOnw@mail.gmail.com
* Support custom wait events for wait event type "Extension" (Michael Paquier, 2023-07-31)

  Two backend routines are added to allow extensions to allocate and define
  custom wait events, all of these being allocated in the type "Extension":
  * WaitEventExtensionNew(), that allocates a wait event ID computed from a
    counter in shared memory.
  * WaitEventExtensionRegisterName(), to associate a custom string to the
    wait event ID allocated.

  Note that this includes an example of how to use this new facility in
  worker_spi, with TAP tests for various scenarios, and some documentation
  about how to use them.

  Any code in the tree that currently uses WAIT_EVENT_EXTENSION could switch
  to this new facility to define custom wait events. This is left as work for
  future patches.

  Author: Masahiro Ikeda
  Reviewed-by: Andres Freund, Michael Paquier, Tristan Partin, Bharath
  Rupireddy
  Discussion: https://postgr.es/m/b9f5411acda0cf15c8fbb767702ff43e@oss.nttdata.com
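
  A sketch of how an extension might use the two routines, assuming
  WaitEventExtensionNew() returns the allocated wait event ID and
  WaitEventExtensionRegisterName() associates a name with it; the event name
  and the latch-wait usage below are illustrative.

    #include "postgres.h"
    #include "miscadmin.h"
    #include "storage/latch.h"
    #include "utils/wait_event.h"

    /* Cached custom wait event ID for this extension (hypothetical). */
    static uint32 my_wait_event = 0;

    static void
    wait_for_something(void)
    {
        if (my_wait_event == 0)
        {
            /* Allocate an ID once and give it a human-readable name. */
            my_wait_event = WaitEventExtensionNew();
            WaitEventExtensionRegisterName(my_wait_event, "MyExtensionWait");
        }

        /* Advertise the custom wait event while sleeping on our latch. */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         1000L,
                         my_wait_event);
        ResetLatch(MyLatch);
    }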
* Message wording improvements (Peter Eisentraut, 2023-07-10)
* Add GUC parameter "huge_pages_status" (Michael Paquier, 2023-07-06)

  This is useful to show the allocation state of huge pages when setting up a
  server with "huge_pages = try", where allocating huge pages would be
  attempted but the server would continue its startup sequence even if the
  allocation fails. The effective status of huge pages is not easily visible
  without OS-level tools (or, for instance, a lookup at /proc/N/smaps), and
  the environments where Postgres runs may not authorize that. Like the other
  GUCs related to huge pages, this works for Linux and Windows.

  This GUC can report the following values:
  - "on", if huge pages were allocated.
  - "off", if huge pages were not allocated.
  - "unknown", a special state that could only be seen when using, for
    example, postgres -C, because it is only possible to know whether the
    shared memory allocation worked once the GUC values can be checked, even
    for a runtime-computed GUC. This value should never be seen when querying
    the GUC on a running server. An assertion is added to check that.

  The discussion has also turned around having a new function to grab this
  status, but this would have required more tricks for -DEXEC_BACKEND,
  something that GUCs already handle.

  Noriyoshi Shinoda has initiated the thread that has led to the result of
  this commit.

  Author: Justin Pryzby
  Reviewed-by: Nathan Bossart, Kyotaro Horiguchi, Michael Paquier
  Discussion: https://postgr.es/m/TU4PR8401MB1152EBB0D271F827E2E37A01EECC9@TU4PR8401MB1152.NAMPRD84.PROD.OUTLOOK.COM
* Refactor some code related to wait events "BufferPin" and "Extension" (Michael Paquier, 2023-07-03)

  The following changes are done:
  - Addition of WaitEventBufferPin and WaitEventExtension, that hold a list
    of wait events related to each category.
  - Addition of two functions that encapsulate the list of wait events for
    each category.
  - Rename BUFFER_PIN to BUFFERPIN (only this wait event class used an
    underscore, requiring a specific rule in the automation script).

  These changes make the automatic generation of all the code and
  documentation related to wait events a bit easier, as all the wait event
  categories are now controlled by consistent structures and functions.

  Author: Bertrand Drouvot
  Discussion: https://postgr.es/m/c6f35117-4b20-4c78-1df5-d3056010dcf5@gmail.com
  Discussion: https://postgr.es/m/77a86b3a-c4a8-5f5d-69b9-d70bbf2e9b98@gmail.com
* Trust signalfd on illumos, again. (Thomas Munro, 2023-07-02)

  Commit 3ab4fc5d avoided choosing signalfd by default on illumos, because it
  triggered kernel panics. That was fixed, so we can remove a kludge from our
  code. Users/packagers can still override the default choice at compile time
  if desired, and we'll leave the back-branches unchanged so they keep
  choosing self-pipe by default, but we'll default to signalfd (like we do
  for Linux) in 17. Fixed kernels should be everywhere by the time 17 ships.

  The illumos issues were:
  * https://www.illumos.org/issues/13700
  * https://www.illumos.org/issues/14892

  Discussion: https://postgr.es/m/CA+hUKG+NK-K_G_i1H3OpDTwYPEsiwQi_jw58PGcW2H+-N2eVCA@mail.gmail.com
* Error message wording improvements (Peter Eisentraut, 2023-06-29)
* Pre-beta2 mechanical code beautification. (Tom Lane, 2023-06-20)

  Run pgindent and pgperltidy. It seems we're still some ways away from all
  committers doing this automatically. Now that we have a buildfarm animal
  that will whine about poorly-indented code, we'll try to keep the tree more
  tidy.

  Discussion: https://postgr.es/m/3156045.1687208823@sss.pgh.pa.us
* Report stats when replaying XLOG_RUNNING_XACTS (Andres Freund, 2023-06-12)

  Previously, stats in the startup process would only get reported during
  shutdown of the startup process. It has been that way for a long time, but
  it became a lot more noticeable with the new pg_stat_io view, which
  separates out IO done by different backend types...

  While reporting stats after every XLOG_RUNNING_XACTS record isn't the
  prettiest approach, it has the advantage of being quite easy. Given that
  we're well past feature freeze...

  It's not a problem that we don't report stats more frequently with
  wal_level=minimal, since in that case stats can't be read before the stats
  process has shut down.

  Besides the above, this commit also changes pgstat_report_stat() to acquire
  the timestamp with GetCurrentTimestamp() instead of
  GetCurrentTransactionStopTimestamp().

  Thanks to Melih Mutlu and Kyotaro Horiguchi for prototypes of other
  approaches to solving this issue.

  Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com>
  Discussion: https://postgr.es/m/5315aedc-fbca-1556-c5de-dc2e00b23a14@oss.nttdata.com