aboutsummaryrefslogtreecommitdiff
path: root/src/backend/storage/buffer
Commit message (Collapse)AuthorAge
...
* Rename some shared memory initialization routinesHeikki Linnakangas2024-08-29
| | | | | | | | | | | | | | | | To make them follow the usual naming convention where FoobarShmemSize() calculates the amount of shared memory needed by Foobar subsystem, and FoobarShmemInit() performs the initialization. I didn't rename CreateLWLocks() and InitShmmeIndex(), because they are a little special. They need to be called before any of the other ShmemInit() functions, because they set up the shared memory bookkeeping itself. I also didn't rename InitProcGlobal(), because unlike other Shmeminit functions, it's not called by individual backends. Reviewed-by: Andreas Karlsson Discussion: https://www.postgresql.org/message-id/c09694ff-2453-47e5-b26c-32a16cd75ce6@iki.fi
* Use pgBufferUsage for buffer usage tracking in analyze.Masahiko Sawada2024-08-13
| | | | | | | | | | | | | | | | | | | | Previously, (auto)analyze used global variables VacuumPageHit, VacuumPageMiss, and VacuumPageDirty to track buffer usage. However, pgBufferUsage provides a more generic way to track buffer usage with support functions. This change replaces those global variables with pgBufferUsage in analyze. Since analyze was the sole user of those variables, it removes their declarations. Vacuum previously used those variables but replaced them with pgBufferUsage as part of a bug fix, commit 5cd72cc0c. Additionally, it adjusts the buffer usage message in both vacuum and analyze for better consistency. Author: Anthonin Bonnefoy Reviewed-by: Masahiko Sawada, Michael Paquier Discussion: https://postgr.es/m/CAO6_Xqr__kTTCLkftqS0qSCm-J7_xbRG3Ge2rWhucxQJMJhcRA%40mail.gmail.com
* Fix private struct field name to match the code using it.Noah Misch2024-07-23
| | | | | | | | Commit 8720a15e9ab121e49174d889eaeafae8ac89de7b added the wrong name. Nazir Bilal Yavuz Discussion: https://postgr.es/m/20240720181405.5a.nmisch@google.com
* Use read streams in CREATE DATABASE when STRATEGY=WAL_LOG.Noah Misch2024-07-20
| | | | | | | | | | | | While this doesn't significantly change runtime now, it arranges for STRATEGY=WAL_LOG to benefit automatically from future optimizations to the read_stream subsystem. For large tables in the template database, this does read 16x as many bytes per system call. Platforms with high per-call overhead, if any, may see an immediate benefit. Nazir Bilal Yavuz Discussion: https://postgr.es/m/CAN55FZ0JKL6vk1xQp6rfOXiNFV1u1H0tJDPPGHWoiO3ea2Wc=A@mail.gmail.com
* Refactor PinBufferForBlock() to remove checks about persistence.Noah Misch2024-07-20
| | | | | | | | | | There are checks in PinBufferForBlock() function to set persistence of the relation. This function is called for each block in the relation. Instead, set persistence of the relation before PinBufferForBlock(). Nazir Bilal Yavuz Discussion: https://postgr.es/m/CAN55FZ0JKL6vk1xQp6rfOXiNFV1u1H0tJDPPGHWoiO3ea2Wc=A@mail.gmail.com
* Remove "smgr_persistence == 0" dead code.Noah Misch2024-07-20
| | | | | | | | | Reaching that code would have required multiple processes performing relation extension during recovery, which does not happen. That caller has the persistence available, so pass it. This was dead code as soon as commit 210622c60e1a9db2e2730140b8106ab57d259d15 added it. Discussion: https://postgr.es/m/CAN55FZ0JKL6vk1xQp6rfOXiNFV1u1H0tJDPPGHWoiO3ea2Wc=A@mail.gmail.com
* Fix RBM_ZERO_AND_LOCK.Thomas Munro2024-06-10
| | | | | | | | | | | | | | | | | | | | Commit 210622c6 accidentally zeroed out pages even if they were found in the buffer pool. It should always lock the page, but it should only zero pages that were not already valid. Otherwise, concurrent readers that hold only a pin could see corrupted page contents changing under their feet. While here, rename ZeroAndLockBuffer() to match the RBM_ flag name. Also restore a some useful comments lost by 210622c6's refactoring, and add some new ones to clarify why we need to use the BM_IO_IN_PROGRESS infrastructure despite not doing I/O. Reported-by: Noah Misch <noah@leadboat.com> Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> (earlier version) Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier version) Discussion: https://postgr.es/m/20240512171658.7e.nmisch@google.com Discussion: https://postgr.es/m/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com
* Revise GUC names quoting in messages againPeter Eisentraut2024-05-17
| | | | | | | | | | | | | | | After further review, we want to move in the direction of always quoting GUC names in error messages, rather than the previous (PG16) wildly mixed practice or the intermittent (mid-PG17) idea of doing this depending on how possibly confusing the GUC name is. This commit applies appropriate quotes to (almost?) all mentions of GUC names in error messages. It partially supersedes a243569bf65 and 8d9978a7176, which had moved things a bit in the opposite direction but which then were abandoned in a partial state. Author: Peter Smith <smithpb2250@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAHut%2BPv-kSN8SkxSdoHano_wPubqcg5789ejhCDZAcLFceBR-w%40mail.gmail.com
* Fix typos and duplicate wordsDaniel Gustafsson2024-04-18
| | | | | | | | | | | | This fixes various typos, duplicated words, and tiny bits of whitespace mainly in code comments but also in docs. Author: Daniel Gustafsson <daniel@yesql.se> Author: Heikki Linnakangas <hlinnaka@iki.fi> Author: Alexander Lakhin <exclusion@gmail.com> Author: David Rowley <dgrowleyml@gmail.com> Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/3F577953-A29E-4722-98AD-2DA9EFF2CBB8@yesql.se
* Revert indexed and enlargable binary heap implementation.Masahiko Sawada2024-04-11
| | | | | | | | | | | This reverts commit b840508644 and bcb14f4abc. These commits were made for commit 5bec1d6bc5 (Improve eviction algorithm in ReorderBuffer using max-heap for many subtransactions). However, per discussion, commit efb8acc0d0 replaced binary heap + index with pairing heap, and made these commits unnecessary. Reported-by: Jeff Davis Discussion: https://postgr.es/m/12747c15811d94efcc5cda72d6b35c80d7bf3443.camel%40j-davis.com
* Add pg_buffercache_evict() function for testing.Thomas Munro2024-04-08
| | | | | | | | | | | | | | | | | | | | | | | | When testing buffer pool logic, it is useful to be able to evict arbitrary blocks. This function can be used in SQL queries over the pg_buffercache view to set up a wide range of buffer pool states. Of course, buffer mappings might change concurrently so you might evict a block other than the one you had in mind, and another session might bring it back in at any time. That's OK for the intended purpose of setting up developer testing scenarios, and more complicated interlocking schemes to give stronger guararantees about that would likely be less flexible for actual testing work anyway. Superuser-only. Author: Palak Chaturvedi <chaturvedipalak1911@gmail.com> Author: Thomas Munro <thomas.munro@gmail.com> (docs, small tweaks) Reviewed-by: Nitin Jadhav <nitinjadhavpostgres@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Cary Huang <cary.huang@highgo.ca> Reviewed-by: Cédric Villemain <cedric.villemain+pgsql@abcsql.com> Reviewed-by: Jim Nasby <jim.nasby@gmail.com> Reviewed-by: Maxim Orlov <orlovmg@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CALfch19pW48ZwWzUoRSpsaV9hqt0UPyaBPC4bOZ4W+c7FF566A@mail.gmail.com
* Increase default vacuum_buffer_usage_limit to 2MB.Thomas Munro2024-04-06
| | | | | | | | | | | | | | | | | | The BAS_VACUUM ring size has been 256kB since commit d526575f introduced the mechanism 17 years ago. Commit 1cbbee03 recently made it configurable but retained the traditional default. The correct default size has been debated for years, but 256kB is certainly very small. VACUUM soon needs to write back data it dirtied only 32 blocks ago, which usually requires flushing the WAL. New experiments in prefetching pages for VACUUM exacerbated the problem by crashing into dirty data even sooner. Let's make the default 2MB. That's 1.6% of the default toy buffer pool size, and 0.2% of 1GB, which would be a considered a small shared_buffers setting for a real system these days. Users are still free to set the GUC to a different value. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20240403221257.md4gfki3z75cdyf6%40awork3.anarazel.de Discussion: https://postgr.es/m/CA%2BhUKGLY4Q4ZY4f1rvnFtv6%2BPkjNf8MejdPkcju3Qii9DYqqcQ%40mail.gmail.com
* Allow BufferAccessStrategy to limit pin count.Thomas Munro2024-04-06
| | | | | | | | | | | | | | While pinning extra buffers to look ahead, users of strategies are in danger of using too many buffers. For some strategies, that means "escaping" from the ring, and in others it means forcing dirty data to disk very frequently with associated WAL flushing. Since external code has no insight into any of that, allow individual strategy types to expose a clamp that should be applied when deciding how many buffers to pin at once. Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com
* Add functions to binaryheap for efficient key removal and update.Masahiko Sawada2024-04-03
| | | | | | | | | | | | | | | | | | | | | | | Previously, binaryheap didn't support updating a key and removing a node in an efficient way. For example, in order to remove a node from the binaryheap, the caller had to pass the node's position within the array that the binaryheap internally has. Removing a node from the binaryheap is done in O(log n) but searching for the key's position is done in O(n). This commit adds a hash table to binaryheap in order to track the position of each nodes in the binaryheap. That way, by using newly added functions such as binaryheap_update_up() etc., both updating a key and removing a node can be done in O(1) on an average and O(log n) in worst case. This is known as the indexed binary heap. The caller can specify to use the indexed binaryheap by passing indexed = true. The current code does not use the new indexing logic, but it will be used by an upcoming patch. Reviewed-by: Vignesh C, Peter Smith, Hayato Kuroda, Ajin Cherian, Tomas Vondra, Shubham Khanna Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
* Provide vectored variant of ReadBuffer().Thomas Munro2024-04-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Break ReadBuffer() up into two steps. StartReadBuffers() and WaitReadBuffers() give us two main advantages: 1. Multiple consecutive blocks can be read with one system call. 2. Advice (hints of future reads) can optionally be issued to the kernel ahead of time. The traditional ReadBuffer() function is now implemented in terms of those functions, to avoid duplication. A new GUC io_combine_limit is defined, and the functions for limiting per-backend pin counts are made into public APIs. Those are provided for use by callers of StartReadBuffers(), when deciding how many buffers to read at once. The following commit will add a higher level mechanism for doing that automatically with a practical interface. With some more infrastructure in later work, StartReadBuffers() could be extended to start real asynchronous I/O instead of just issuing advice and leaving WaitReadBuffers() to do the work synchronously. Author: Thomas Munro <thomas.munro@gmail.com> Author: Andres Freund <andres@anarazel.de> (some optimization tweaks) Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
* Remove unused #include's from backend .c filesPeter Eisentraut2024-03-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | as determined by include-what-you-use (IWYU) While IWYU also suggests to *add* a bunch of #include's (which is its main purpose), this patch does not do that. In some cases, a more specific #include replaces another less specific one. Some manual adjustments of the automatic result: - IWYU currently doesn't know about includes that provide global variable declarations (like -Wmissing-variable-declarations), so those includes are being kept manually. - All includes for port(ability) headers are being kept for now, to play it safe. - No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the patch from exploding in size. Note that this patch touches just *.c files, so nothing declared in header files changes in hidden ways. As a small example, in src/backend/access/transam/rmgr.c, some IWYU pragma annotations are added to handle a special case there. Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
* Replace BackendIds with 0-based ProcNumbersHeikki Linnakangas2024-03-03
| | | | | | | | | | | | | | | | | | Now that BackendId was just another index into the proc array, it was redundant with the 0-based proc numbers used in other places. Replace all usage of backend IDs with proc numbers. The only place where the term "backend id" remains is in a few pgstat functions that expose backend IDs at the SQL level. Those IDs are now in fact 0-based ProcNumbers too, but the documentation still calls them "backend ids". That term still seems appropriate to describe what the numbers are, so I let it be. One user-visible effect is that pg_temp_0 is now a valid temp schema name, for backend with ProcNumber 0. Reviewed-by: Andres Freund Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
* Remove superfluous 'pgprocno' field from PGPROCHeikki Linnakangas2024-02-22
| | | | | | | | | It was always just the index of the PGPROC entry from the beginning of the proc array. Introduce a macro to compute it from the pointer instead. Reviewed-by: Andres Freund Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
* Replace calls to pg_qsort() with the qsort() macro.Nathan Bossart2024-02-16
| | | | | | | | | Calls to this function might give the impression that pg_qsort() is somehow different than qsort(), when in fact there is a qsort() macro in port.h that expands all in-tree uses to pg_qsort(). Reviewed-by: Mats Kindahl Discussion: https://postgr.es/m/CA%2B14426g2Wa9QuUpmakwPxXFWG_1FaY0AsApkvcTBy-YfS6uaw%40mail.gmail.com
* Fix typo in commentsHeikki Linnakangas2024-02-03
| | | | Backpatch-through: v16
* Fix bug in bulk extending temp relation after failureHeikki Linnakangas2024-02-02
| | | | | | | | | | | | | | | | | | | | | | A ResourceOwnerEnlarge() call was missing. That led to an error: ERROR: ResourceOwnerRemember called but array was full and an assertion failure, if you tried to extend a temp relation again after a failure. Alexander's test case used running out of disk space to trigger the original failure. This bug was introduced in the large ResourceOwner rewrite commit b8bff07daa. Before that, the UnpinLocalBuffer() call guaranteed that the subsequent PinLocalBuffer() will succeed, but after the rewrite, releasing an old resource doesn't guarantee that there is space for a new one. Add a comment explaining why the UnpinBuffer + PinBuffer calls in BufferAlloc(), with no ResourceOwnerEnlarge() in between, are safe. Reported-by: Alexander Lakhin Discussion: https://www.postgresql.org/message-id/dc574fea-c83e-a600-08cd-10881762e4fa@gmail.com
* Give SMgrRelation pointers a well-defined lifetime.Heikki Linnakangas2024-01-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After calling smgropen(), it was not clear how long you could continue to use the result, because various code paths including cache invalidation could call smgrclose(), which freed the memory. Guarantee that the object won't be destroyed until the end of the current transaction, or in recovery, the commit/abort record that destroys the underlying storage. smgrclose() is now just an alias for smgrrelease(). It closes files and forgets all state except the rlocator, but keeps the SMgrRelation object valid. A new smgrdestroy() function is used by rare places that know there should be no other references to the SMgrRelation. The short version: * smgrclose() is now just an alias for smgrrelease(). It releases resources, but doesn't destroy until EOX * smgrdestroy() now frees memory, and should rarely be used. Existing code should be unaffected, but it is now possible for code that has an SMgrRelation object to use it repeatedly during a transaction as long as the storage hasn't been physically dropped. Such code would normally hold a lock on the relation. This also replaces the "ownership" mechanism of SMgrRelations with a pin counter. An SMgrRelation can now be "pinned", which prevents it from being destroyed at end of transaction. There can be multiple pins on the same SMgrRelation. In practice, the pin mechanism is only used by the relcache, so there cannot be more than one pin on the same SMgrRelation. Except with swap_relation_files XXX Author: Thomas Munro, Heikki Linnakangas Reviewed-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA@mail.gmail.com
* Fix corruption of local buffer state during extend of temp relationMichael Paquier2024-01-05
| | | | | | | | | | | | | | | | | | | | A typo has been introduced by 31966b151e6a when updating the state of a local buffer when a temporary relation is extended, for the case of a block included in the relation range extended, when it is already found in the hash table holding the local buffers. In this case, BM_VALID should be cleared, but the buffer state was changed so as BM_VALID remained while clearing the other flags. As reported on the thread, it was possible to corrupt the state of the local buffers on ENOSPC, but the states would be corrupted on any kind of ERROR during the relation extend (like partial writes or some other errno). Reported-by: Alexander Lakhin Author: Tender Wang Reviewed-by: Richard Guo, Alexander Lakhin, Michael Paquier Discussion: https://postgr.es/m/18259-6e256429825dd435@postgresql.org Backpatch-through: 16
* Update copyright for 2024Bruce Momjian2024-01-03
| | | | | | | | Reported-by: Michael Paquier Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz Backpatch-through: 12
* Refactor pgstat_prepare_io_time() with an input argument instead of a GUCMichael Paquier2023-12-16
| | | | | | | | | | | Originally, this routine relied on track_io_timing to check if a time interval for an I/O operation stored in pg_stat_io should be initialized or not. However, the addition of WAL statistics to pg_stat_io requires that the initialization happens when track_wal_io_timing is enabled, which is dependent on the code path where the I/O operation happens. Author: Nazir Bilal Yavuz Discussion: https://postgr.es/m/CAN55FZ3AiQ+ZMxUuXnBpd0Rrh1YhwJ5FudkHg=JU0P+-W8T4Vg@mail.gmail.com
* Provide multi-block smgrprefetch().Thomas Munro2023-12-16
| | | | | | | | | | | | Previously smgrprefetch() could issue POSIX_FADV_WILLNEED advice for a single block at a time. Add an nblocks argument so that we can do the same for a range of blocks. This usually produces a single system call, but might need to loop if it crosses a segment boundary. Initially it is only called with nblocks == 1, but proposed patches will make wider calls. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> (earlier version) Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
* Apply quotes more consistently to GUC names in logsMichael Paquier2023-11-30
| | | | | | | | | | | | | | Quotes are applied to GUCs in a very inconsistent way across the code base, with a mix of double quotes or no quotes used. This commit removes double quotes around all the GUC names that are obviously referred to as parameters with non-English words (use of underscore, mixed case, etc). This is the result of a discussion with Álvaro Herrera, Nathan Bossart, Laurenz Albe, Peter Eisentraut, Tom Lane and Daniel Gustafsson. Author: Peter Smith Discussion: https://postgr.es/m/CAHut+Pv-kSN8SkxSdoHano_wPubqcg5789ejhCDZAcLFceBR-w@mail.gmail.com
* Make ResourceOwners more easily extensible.Heikki Linnakangas2023-11-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of having a separate array/hash for each resource kind, use a single array and hash to hold all kinds of resources. This makes it possible to introduce new resource "kinds" without having to modify the ResourceOwnerData struct. In particular, this makes it possible for extensions to register custom resource kinds. The old approach was to have a small array of resources of each kind, and if it fills up, switch to a hash table. The new approach also uses an array and a hash, but now the array and the hash are used at the same time. The array is used to hold the recently added resources, and when it fills up, they are moved to the hash. This keeps the access to recent entries fast, even when there are a lot of long-held resources. All the resource-specific ResourceOwnerEnlarge*(), ResourceOwnerRemember*(), and ResourceOwnerForget*() functions have been replaced with three generic functions that take resource kind as argument. For convenience, we still define resource-specific wrapper macros around the generic functions with the old names, but they are now defined in the source files that use those resource kinds. The release callback no longer needs to call ResourceOwnerForget on the resource being released. ResourceOwnerRelease unregisters the resource from the owner before calling the callback. That needed some changes in bufmgr.c and some other files, where releasing the resources previously always called ResourceOwnerForget. Each resource kind specifies a release priority, and ResourceOwnerReleaseAll releases the resources in priority order. To make that possible, we have to restrict what you can do between phases. After calling ResourceOwnerRelease(), you are no longer allowed to remember any more resources in it or to forget any previously remembered resources by calling ResourceOwnerForget. There was one case where that was done previously. At subtransaction commit, AtEOSubXact_Inval() would handle the invalidation messages and call RelationFlushRelation(), which temporarily increased the reference count on the relation being flushed. We now switch to the parent subtransaction's resource owner before calling AtEOSubXact_Inval(), so that there is a valid ResourceOwner to temporarily hold that relcache reference. Other end-of-xact routines make similar calls to AtEOXact_Inval() between release phases, but I didn't see any regression test failures from those, so I'm not sure if they could reach a codepath that needs remembering extra resources. There were two exceptions to how the resource leak WARNINGs on commit were printed previously: llvmjit silently released the context without printing the warning, and a leaked buffer io triggered a PANIC. Now everything prints a WARNING, including those cases. Add tests in src/test/modules/test_resowner. Reviewed-by: Aleksander Alekseev, Michael Paquier, Julien Rouhaud Reviewed-by: Kyotaro Horiguchi, Hayato Kuroda, Álvaro Herrera, Zhihong Yu Reviewed-by: Peter Eisentraut, Andres Freund Discussion: https://www.postgresql.org/message-id/cbfabeb0-cd3c-e951-a572-19b365ed314d%40iki.fi
* Move a few ResourceOwnerEnlarge() calls for safety and clarity.Heikki Linnakangas2023-11-08
| | | | | | | | | | | | | | | | | | | | | | These are functions where a lot of things happen between the ResourceOwnerEnlarge and ResourceOwnerRemember calls. It's important that there are no unrelated ResourceOwnerRemember calls in the code in between, otherwise the reserved entry might be used up by the intervening ResourceOwnerRemember and not be available at the intended ResourceOwnerRemember call anymore. I don't see any bugs here, but the longer the code path between the calls is, the harder it is to verify. In bufmgr.c, there is a function similar to ResourceOwnerEnlarge, ReservePrivateRefCountEntry(), to ensure that the private refcount array has enough space. The ReservePrivateRefCountEntry() calls were made at different places than the ResourceOwnerEnlargeBuffers() calls. Move the ResourceOwnerEnlargeBuffers() and ReservePrivateRefCountEntry() calls together for consistency. Reviewed-by: Aleksander Alekseev, Michael Paquier, Julien Rouhaud Reviewed-by: Kyotaro Horiguchi, Hayato Kuroda, Álvaro Herrera, Zhihong Yu Reviewed-by: Peter Eisentraut, Andres Freund Discussion: https://www.postgresql.org/message-id/cbfabeb0-cd3c-e951-a572-19b365ed314d%40iki.fi
* Introduce pg_stat_checkpointerMichael Paquier2023-10-30
| | | | | | | | | | | | | | | | | | | | | | | | Historically, the statistics of the checkpointer have been always part of pg_stat_bgwriter. This commit removes a few columns from pg_stat_bgwriter, and introduces pg_stat_checkpointer with equivalent, renamed columns (plus a new one for the reset timestamp): - checkpoints_timed -> num_timed - checkpoints_req -> num_requested - checkpoint_write_time -> write_time - checkpoint_sync_time -> sync_time - buffers_checkpoint -> buffers_written The fields of PgStat_CheckpointerStats and its SQL functions are renamed to match with the new field names, for consistency. Note that background writer and checkpointer have been split into two different processes in commits 806a2aee3791 and bf405ba8e460. The pgstat structures were already split, making this change straight-forward. Bump catalog version. Author: Bharath Rupireddy Reviewed-by: Bertrand Drouvot, Andres Freund, Michael Paquier Discussion: https://postgr.es/m/CALj2ACVxX2ii=66RypXRweZe2EsBRiPMj0aHfRfHUeXJcC7kHg@mail.gmail.com
* Assert that buffers are marked dirty before XLogRegisterBuffer().Jeff Davis2023-10-23
| | | | | | | | | | | Enforce the rule from transam/README in XLogRegisterBuffer(), and update callers to follow the rule. Hash indexes sometimes register clean pages as a part of the locking protocol, so provide a REGBUF_NO_CHANGE flag to support that use. Discussion: https://postgr.es/m/c84114f8-c7f1-5b57-f85a-3adc31e1a904@iki.fi Reviewed-by: Heikki Linnakangas
* Standardize type of extend_by counterPeter Eisentraut2023-09-19
| | | | | | | | | | | | The counter of extend_by loops is mixed int and uint32. Fix by standardizing from int to uint32, to match the extend_by variable. Fixup for 31966b151e. Author: Ranier Vilela <ranier.vf@gmail.com> Reviewed-by: Gurjeet Singh <gurjeet@singh.im> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAEudQAqHG-JP-YnG54ftL_b7v6-57rMKwET_MSvEoen0UHuPig@mail.gmail.com
* Fix tracking of temp table relation extensions as writesAndres Freund2023-09-13
| | | | | | | | | | | | | | | Karina figured out that I (Andres) confused BufferUsage.temp_blks_written with BufferUsage.local_blks_written in fcdda1e4b5. Tests in core PG can't easily test this, as BufferUsage is just used for EXPLAIN (ANALYZE, BUFFERS) and pg_stat_statements. Thus this commit adds tests for this to pg_stat_statements. Reported-by: Karina Litskevich <litskevichkarina@gmail.com> Author: Karina Litskevich <litskevichkarina@gmail.com> Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CACiT8ibxXA6+0amGikbeFhm8B84XdQVo6D0Qfd1pQ1s8zpsnxQ@mail.gmail.com Backpatch: 16-, where fcdda1e4b5 was merged
* Fix recovery conflict SIGUSR1 handling.Thomas Munro2023-09-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We shouldn't be doing non-trivial work in signal handlers in general, and in this case the handler could reach unsafe code and corrupt state. It also clobbered its own "reason" code. Move all recovery conflict decision logic into the next CHECK_FOR_INTERRUPTS(), and have the signal handler just set flags and the latch, following the standard pattern. Since there are several different "reasons", use a separate flag for each. With this refactoring, the recovery conflict system no longer piggy-backs on top of the regular query cancelation mechanism, but instead raises an error directly if it decides that is necessary. It still needs to respect QueryCancelHoldoffCount, because otherwise the FEBE protocol might get out of sync (see commit 2b3a8b20c2d). This fixes one class of intermittent failure in the new 031_recovery_conflict.pl test added by commit 9f8a050f, though the buggy coding is much older. Failures outside contrived testing seem to be very rare (or perhaps incorrectly attributed) in the field, based on lack of reports. No back-patch for now due to complexity and release schedule. We have the option to back-patch into 16 later, as 16 has prerequisite commit bea3d7e. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version) Reviewed-by: Michael Paquier <michael@paquier.xyz> (earlier version) Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier version) Tested-by: Christoph Berg <myon@debian.org> Discussion: https://postgr.es/m/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com Discussion: https://postgr.es/m/CALj2ACVr8au2J_9D88UfRCi0JdWhyQDDxAcSVav0B0irx9nXEg%40mail.gmail.com
* Remove the "snapshot too old" feature.Thomas Munro2023-09-05
| | | | | | | | | | | | | | | | | Remove the old_snapshot_threshold setting and mechanism for producing the error "snapshot too old", originally added by commit 848ef42b. Unfortunately it had a number of known problems in terms of correctness and performance, mostly reported by Andres in the course of his work on snapshot scalability. We agreed to remove it, after a long period without an active plan to fix it. This is certainly a desirable feature, and someone might propose a new or improved implementation in the future. Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CACG%3DezYV%2BEvO135fLRdVn-ZusfVsTY6cH1OZqWtezuEYH6ciQA%40mail.gmail.com Discussion: https://postgr.es/m/20200401064008.qob7bfnnbu4w5cw4%40alap3.anarazel.de Discussion: https://postgr.es/m/CA%2BTgmoY%3Daqf0zjTD%2B3dUWYkgMiNDegDLFjo%2B6ze%3DWtpik%2B3XqA%40mail.gmail.com
* ExtendBufferedWhat -> BufferManagerRelation.Thomas Munro2023-08-23
| | | | | | | | | | | | | | | | | Commit 31966b15 invented a way for functions dealing with relation extension to accept a Relation in online code and an SMgrRelation in recovery code. It seems highly likely that future bufmgr.c interfaces will face the same problem, and need to do something similar. Generalize the names so that each interface doesn't have to re-invent the wheel. Back-patch to 16. Since extension AM authors might start using the constructor macros once 16 ships, we agreed to do the rename in 16 rather than waiting for 17. Reviewed-by: Peter Geoghegan <pg@bowt.ie> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKG%2B6tLD2BhpRWycEoti6LVLyQq457UL4ticP5xd8LqHySA%40mail.gmail.com
* Fix off-by-one in LimitAdditionalPins()Andres Freund2023-07-24
| | | | | | | | | | | | | | | Due to the bug LimitAdditionalPins() could return 0, violating LimitAdditionalPins()'s API ("One additional pin is always allowed"). This could be hit when setting shared_buffers very low and using a fair amount of concurrency. This bug was introduced in 31966b151e6a. Author: "Anton A. Melnikov" <aamelnikov@inbox.ru> Reported-by: "Anton A. Melnikov" <aamelnikov@inbox.ru> Reported-by: Victoria Shepard Discussion: https://postgr.es/m/ae46f2fb-5586-3de0-b54b-1bb0f6410ebd@inbox.ru Backpatch: 16-
* Refactor some code related to wait events "BufferPin" and "Extension"Michael Paquier2023-07-03
| | | | | | | | | | | | | | | | | | The following changes are done: - Addition of WaitEventBufferPin and WaitEventExtension, that hold a list of wait events related to each category. - Addition of two functions that encapsulate the list of wait events for each category. - Rename BUFFER_PIN to BUFFERPIN (only this wait event class used an underscore, requiring a specific rule in the automation script). These changes make a bit easier the automatic generation of all the code and documentation related to wait events, as all the wait event categories are now controlled by consistent structures and functions. Author: Bertrand Drouvot Discussion: https://postgr.es/m/c6f35117-4b20-4c78-1df5-d3056010dcf5@gmail.com Discussion: https://postgr.es/m/77a86b3a-c4a8-5f5d-69b9-d70bbf2e9b98@gmail.com
* Remove over-eager assertion in ExtendBufferedRelTo()Andres Freund2023-05-21
| | | | | | | | | | | | | | | The assertion checked that the size of the relation is not "too large" - but the code is explicitly dealing with the possibility of another backend extending the relation concurrently. In that case the new relation size could be bigger than what the current backend needs, wrongly triggering an assertion failure. Unfortunately it is hard to write a reliable and affordable regression tests for this, as a lot of concurrency is needed to encounter the bug. Introduced in 31966b151e6a. Reported-by: Melanie Plageman <melanieplageman@gmail.com>
* Pre-beta mechanical code beautification.Tom Lane2023-05-19
| | | | | | | | | | | | | | | Run pgindent, pgperltidy, and reformat-dat-files. This set of diffs is a bit larger than typical. We've updated to pg_bsd_indent 2.1.2, which properly indents variable declarations that have multi-line initialization expressions (the continuation lines are now indented one tab stop). We've also updated to perltidy version 20230309 and changed some of its settings, which reduces its desire to add whitespace to lines to make assignments etc. line up. Going forward, that should make for fewer random-seeming changes to existing code. Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
* Add writeback to pg_stat_ioAndres Freund2023-05-17
| | | | | | | | | | | | | | | 28e626bde00 added the concept of IOOps but neglected to include writeback operations. ac8d53dae5 added time spent doing these I/O operations. Without counting writeback, checkpointer write time in the log often differed substantially from that in pg_stat_io. To fix this, add IOOp IOOP_WRITEBACK and track writeback in pg_stat_io. Bumps catversion. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20230419172326.dhgyo4wrrhulovt6%40awork3.anarazel.de
* Update parameter name context to wb_contextAndres Freund2023-05-17
| | | | | | | | | | For clarity of review, renaming the function parameter "context" in ScheduleBufferTagForWriteback() and IssuePendingWritebacks() to "wb_context" is a separate commit. The next commit adds an "io_context" parameter and "wb_context" makes it more clear which is which. Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_acc6iL4M3hvOTeztf_ZPpsB3Pqio5aVHgZ5q=Pi3BZKg@mail.gmail.com
* Fix various typosDavid Rowley2023-04-18
| | | | | | | | | | | | This fixes many spelling mistakes in comments, but a few references to invalid parameter names, function names and option names too in comments and also some in string constants Also, fix an #undef that was undefining the incorrect definition Author: Alexander Lakhin Reviewed-by: Justin Pryzby Discussion: https://postgr.es/m/d5f68d19-c0fc-91a9-118d-7c6a5a3f5fad@gmail.com
* Support RBM_ZERO_AND_CLEANUP_LOCK in ExtendBufferedRelTo(), add testsAndres Freund2023-04-14
| | | | | | | | | | | | | | | | | | | For some reason I had not implemented RBM_ZERO_AND_CLEANUP_LOCK support in ExtendBufferedRelTo(), likely thinking it not being reachable. But it is reachable, e.g. when replaying a WAL record for a page in a relation that subsequently is truncated (likely only reachable when doing crash recovery or PITR, not during ongoing streaming replication). As now all of the RBM_* modes are supported, remove assertions checking mode. As we had no test coverage for this scenario, add a new TAP test. There's plenty more that ought to be tested in this area... Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/392271.1681238924%40sss.pgh.pa.us Discussion: https://postgr.es/m/0b5eb82b-cb99-e0a4-b932-3dc60e2e3926@gmail.com
* Harmonize some more function parameter names.Peter Geoghegan2023-04-13
| | | | | | | | | | | | | | Make sure that function declarations use names that exactly match the corresponding names from function definitions in a few places. These inconsistencies were all introduced relatively recently, after the code base had parameter name mismatches fixed in bulk (see commits starting with commits 4274dc22 and 035ce1fe). pg_bsd_indent still has a couple of similar inconsistencies, which I (pgeoghegan) have left untouched for now. Like all earlier commits that cleaned up function parameter names, this commit was written with help from clang-tidy.
* Add io_direct setting (developer-only).Thomas Munro2023-04-08
| | | | | | | | | | | | | | | | | | | | | | | Provide a way to ask the kernel to use O_DIRECT (or local equivalent) where available for data and WAL files, to avoid or minimize kernel caching. This hurts performance currently and is not intended for end users yet. Later proposed work would introduce our own I/O clustering, read-ahead, etc to replace the facilities the kernel disables with this option. The only user-visible change, if the developer-only GUC is not used, is that this commit also removes the obscure logic that would activate O_DIRECT for the WAL when wal_sync_method=open_[data]sync and wal_level=minimal (which also requires max_wal_senders=0). Those are non-default and unlikely settings, and this behavior wasn't (correctly) documented. The same effect can be achieved with io_direct=wal. Author: Thomas Munro <thomas.munro@gmail.com> Author: Andres Freund <andres@anarazel.de> Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com
* Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.Thomas Munro2023-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to have the option to use O_DIRECT/FILE_FLAG_NO_BUFFERING in a later commit, we need the addresses of user space buffers to be well aligned. The exact requirements vary by OS and file system (typically sectors and/or memory pages). The address alignment size is set to 4096, which is enough for currently known systems: it matches modern sectors and common memory page size. There is no standard governing O_DIRECT's requirements so we might eventually have to reconsider this with more information from the field or future systems. Aligning I/O buffers on memory pages is also known to improve regular buffered I/O performance. Three classes of I/O buffers for regular data pages are adjusted: (1) Heap buffers are now allocated with the new palloc_aligned() or MemoryContextAllocAligned() functions introduced by commit 439f6175. (2) Stack buffers now use a new struct PGIOAlignedBlock to respect PG_IO_ALIGN_SIZE, if possible with this compiler. (3) The buffer pool is also aligned in shared memory. WAL buffers were already aligned on XLOG_BLCKSZ. It's possible for XLOG_BLCKSZ to be configured smaller than PG_IO_ALIGNED_SIZE and thus for O_DIRECT WAL writes to fail to be well aligned, but that's a pre-existing condition and will be addressed by a later commit. BufFiles are not yet addressed (there's no current plan to use O_DIRECT for those, but they could potentially get some incidental speedup even in plain buffered I/O operations through better alignment). If we can't align stack objects suitably using the compiler extensions we know about, we disable the use of O_DIRECT by setting PG_O_DIRECT to 0. This avoids the need to consider systems that have O_DIRECT but can't align stack objects the way we want; such systems could in theory be supported with more work but we don't currently know of any such machines, so it's easier to pretend there is no O_DIRECT support instead. That's an existing and tested class of system. Add assertions that all buffers passed into smgrread(), smgrwrite() and smgrextend() are correctly aligned, unless PG_O_DIRECT is 0 (= stack alignment tricks may be unavailable) or the block size has been set too small to allow arrays of buffers to be all aligned. Author: Thomas Munro <thomas.munro@gmail.com> Author: Andres Freund <andres@anarazel.de> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
* Track IO times in pg_stat_ioAndres Freund2023-04-07
| | | | | | | | | | | | | | | | a9c70b46dbe and 8aaa04b32S added counting of IO operations to a new view, pg_stat_io. Now, add IO timing for reads, writes, extends, and fsyncs to pg_stat_io as well. This combines the tracking for pgBufferUsage with the tracking for pg_stat_io into a new function pgstat_count_io_op_time(). This should make it a bit easier to avoid the somewhat costly instr_time conversion done for pgBufferUsage. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
* Improve IO accounting for temp relation writesAndres Freund2023-04-07
| | | | | | | | | | | | | | Both pgstat_database and pgBufferUsage count IO timing for reads of temporary relation blocks into local buffers. However, both failed to count write IO timing for flushes of dirty local buffers. Fix. Additionally, FlushRelationBuffers() seems to have omitted counting write IO (both count and timing) stats for both pgstat_database and pgBufferUsage. Fix. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20230321023451.7rzy4kjj2iktrg2r%40awork3.anarazel.de
* Fix copy-paste bug in 12f3867f553 triggering an assert after a write errorAndres Freund2023-04-07
| | | | | | | The same condition accidentally was copied to both branches. Manual testing confirms that otherwise the error recovery path works fine. Found while reviewing the logical-decoding-on-standby patch.