postgresql - postgresql mirror

	Commit message (Collapse)	Author	Age
...
*	Increment xactCompletionCount during subtransaction abort.	Andres Freund	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Snapshot caching, introduced in 623a9ba79b, did not increment xactCompletionCount during subtransaction abort. That could lead to an older snapshot being reused. That is, at least as far as I can see, not a correctness issue (for MVCC snapshots there's no difference between "in progress" and "aborted"). The only difference between the old and new snapshots would be a newer ->xmax. While HeapTupleSatisfiesMVCC makes the same visibility determination, reusing the old snapshot leads HeapTupleSatisfiesMVCC to not set HEAP_XMIN_INVALID. Which subsequently causes the kill_prior_tuple optimization to not kick in (via HeapTupleIsSurelyDead() returning false). The performance effects of doing the same index-lookups over and over again is how the issue was discovered... Fix the issue by incrementing xactCompletionCount in XidCacheRemoveRunningXids. It already acquires ProcArrayLock exclusively, making that an easy proposition. Add a test to ensure that kill_prior_tuple prevents index growth when it involves aborted subtransaction of the current transaction. Author: Andres Freund Discussion: https://postgr.es/m/20210406043521.lopeo7bbigad3n6t@alap3.anarazel.de Discussion: https://postgr.es/m/20210317055718.v6qs3ltzrformqoa%40alap3.anarazel.de
*	Add function to log the memory contexts of specified backend process.	Fujii Masao	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 3e98c0bafb added pg_backend_memory_contexts view to display the memory contexts of the backend process. However its target process is limited to the backend that is accessing to the view. So this is not so convenient when investigating the local memory bloat of other backend process. To improve this situation, this commit adds pg_log_backend_memory_contexts() function that requests to log the memory contexts of the specified backend process. This information can be also collected by calling MemoryContextStats(TopMemoryContext) via a debugger. But this technique cannot be used in some environments because no debugger is available there. So, pg_log_backend_memory_contexts() allows us to see the memory contexts of specified backend more easily. Only superusers are allowed to request to log the memory contexts because allowing any users to issue this request at an unbounded rate would cause lots of log messages and which can lead to denial of service. On receipt of the request, at the next CHECK_FOR_INTERRUPTS(), the target backend logs its memory contexts at LOG_SERVER_ONLY level, so that these memory contexts will appear in the server log but not be sent to the client. It logs one message per memory context. Because if it buffers all memory contexts into StringInfo to log them as one message, which may require the buffer to be enlarged very much and lead to OOM error since there can be a large number of memory contexts in a backend. When a backend process is consuming huge memory, logging all its memory contexts might overrun available disk space. To prevent this, now this patch limits the number of child contexts to log per parent to 100. As with MemoryContextStats(), it supposes that practical cases where the log gets long will typically be huge numbers of siblings under the same parent context; while the additional debugging value from seeing details about individual siblings beyond 100 will not be large. There was another proposed patch to add the function to return the memory contexts of specified backend as the result sets, instead of logging them, in the discussion. However that patch is not included in this commit because it had several issues to address. Thanks to Tatsuhito Kasahara, Andres Freund, Tom Lane, Tomas Vondra, Michael Paquier, Kyotaro Horiguchi and Zhihong Yu for the discussion. Bump catalog version. Author: Atsushi Torikoshi Reviewed-by: Kyotaro Horiguchi, Zhihong Yu, Fujii Masao Discussion: https://postgr.es/m/0271f440ac77f2a4180e0e56ebd944d1@oss.nttdata.com
*	Shut down transaction tracking at startup process exit.	Fujii Masao	2021-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Maxim Orlov reported that the shutdown of standby server could result in the following assertion failure. The cause of this issue was that, when the shutdown caused the startup process to exit, recovery-time transaction tracking was not shut down even if it's already initialized, and some locks the tracked transactions were holding could not be released. At this situation, if other process was invoked and the PGPROC entry that the startup process used was assigned to it, it found such unreleased locks and caused the assertion failure, during the initialization of it. TRAP: FailedAssertion("SHMQueueEmpty(&(MyProc->myProcLocks[i]))" This commit fixes this issue by making the startup process shut down transaction tracking and release all locks, at the exit of it. Back-patch to all supported branches. Reported-by: Maxim Orlov Author: Fujii Masao Reviewed-by: Maxim Orlov Discussion: https://postgr.es/m/ad4ce692cc1d89a093b471ab1d969b0b@postgrespro.ru
*	Improve efficiency of wait event reporting, remove proc.h dependency.	Andres Freund	2021-04-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	pgstat_report_wait_start() and pgstat_report_wait_end() required two conditional branches so far. One to check if MyProc is NULL, the other to check if pgstat_track_activities is set. As wait events are used around comparatively lightweight operations, and are inlined (reducing branch predictor effectiveness), that's not great. The dependency on MyProc has a second disadvantage: Low-level subsystems, like storage/file/fd.c, report wait events, but architecturally it is preferable for them to not depend on inter-process subsystems like proc.h (defining PGPROC). After this change including pgstat.h (nor obviously its sub-components like backend_status.h, wait_event.h, ...) does not pull in IPC related headers anymore. These goals, efficiency and abstraction, are achieved by having pgstat_report_wait_start/end() not interact with MyProc, but instead a new my_wait_event_info variable. At backend startup it points to a local variable, removing the need to check for MyProc being NULL. During process initialization my_wait_event_info is redirected to MyProc->wait_event_info. At shutdown this is reversed. Because wait event reporting now does not need to know about where the wait event is stored, it does not need to know about PGPROC anymore. The removal of the branch for checking pgstat_track_activities is simpler: Don't check anymore. The cost due to the branch are often higher than the store - and even if not, pgstat_track_activities is rarely disabled. The main motivator to commit this work now is that removing the (indirect) pgproc.h include from pgstat.h simplifies a patch to move statistics reporting to shared memory (which still has a chance to get into 14). Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20210402194458.2vu324hkk2djq6ce@alap3.anarazel.de
*	Do not rely on pgstat.h to indirectly include storage/ headers.	Andres Freund	2021-04-02
\| \| \| \| \| \| \| \|	An upcoming patch might remove the (now indirect) proc.h include (which in turn includes other headers), and it's cleaner for the modified files to include their dependencies directly anyway... Discussion: https://postgr.es/m/20210402194458.2vu324hkk2djq6ce@alap3.anarazel.de
*	Rename Default Roles to Predefined Roles	Stephen Frost	2021-04-01
\| \| \| \| \| \| \| \| \| \| \| \| \|	The term 'default roles' wasn't quite apt as these roles aren't able to be modified or removed after installation, so rename them to be 'Predefined Roles' instead, adding an entry into the newly added Obsolete Appendix to help users of current releases find the new documentation. Bruce Momjian and Stephen Frost Discussion: https://postgr.es/m/157742545062.1149.11052653770497832538%40wrigleys.postgresql.org and https://www.postgresql.org/message-id/20201120211304.GG16415@tamriel.snowman.net
*	Add support for asynchronous execution.	Etsuro Fujita	2021-03-31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This implements asynchronous execution, which runs multiple parts of a non-parallel-aware Append concurrently rather than serially to improve performance when possible. Currently, the only node type that can be run concurrently is a ForeignScan that is an immediate child of such an Append. In the case where such ForeignScans access data on different remote servers, this would run those ForeignScans concurrently, and overlap the remote operations to be performed simultaneously, so it'll improve the performance especially when the operations involve time-consuming ones such as remote join and remote aggregation. We may extend this to other node types such as joins or aggregates over ForeignScans in the future. This also adds the support for postgres_fdw, which is enabled by the table-level/server-level option "async_capable". The default is false. Robert Haas, Kyotaro Horiguchi, Thomas Munro, and myself. This commit is mostly based on the patch proposed by Robert Haas, but also uses stuff from the patch proposed by Kyotaro Horiguchi and from the patch proposed by Thomas Munro. Reviewed by Kyotaro Horiguchi, Konstantin Knizhnik, Andrey Lepikhov, Movead Li, Thomas Munro, Justin Pryzby, and others. Discussion: https://postgr.es/m/CA%2BTgmoaXQEt4tZ03FtQhnzeDEMzBck%2BLrni0UWHVVgOTnA6C1w%40mail.gmail.com Discussion: https://postgr.es/m/CA%2BhUKGLBRyu0rHrDCMC4%3DRn3252gogyp1SjOgG8SEKKZv%3DFwfQ%40mail.gmail.com Discussion: https://postgr.es/m/20200228.170650.667613673625155850.horikyota.ntt%40gmail.com
*	PageAddItemExtended(): Add LP_UNUSED assertion.	Peter Geoghegan	2021-03-28
\| \| \| \| \| \|	Assert that LP_UNUSED items have no storage. If it's worth having defensive code in non-assert builds then it's worth having an assertion as well.
*	Fix concurrency issues with WAL segment recycling on Windows	Michael Paquier	2021-03-22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit is mostly a revert of aaa3aed, that switched the routine doing the internal renaming of recycled WAL segments to use on Windows a combination of CreateHardLinkA() plus unlink() instead of rename(). As reported by several users of Postgres 13, this is causing concurrency issues when manipulating WAL segments, mostly in the shape of the following error: LOG: could not rename file "pg_wal/000000XX000000YY000000ZZ": Permission denied This moves back to a logic where a single rename() (well, pgrename() for Windows) is used. This issue has proved to be hard to hit when I tested it, facing it only once with an archive_command that was not able to do its work, so it is environment-sensitive. The reporters of this issue have been able to confirm that the situation improved once we switched back to a single rename(). In order to check things, I have provided to the reporters a patched build based on 13.2 with aaa3aed reverted, to test if the error goes away, and an unpatched build of 13.2 to test if the error still showed up (just to make sure that I did not mess up my build process). Extra thanks to Fujii Masao for pointing out what looked like the culprit commit, and to all the reporters for taking the time to test what I have sent them. Reported-by: Andrus, Guy Burgess, Yaroslav Pashinsky, Thomas Trenz Reviewed-by: Tom Lane, Andres Freund Discussion: https://postgr.es/m/3861ff1e-0923-7838-e826-094cc9bef737@hot.ee Discussion: https://postgr.es/m/16874-c3eecd319e36a2bf@postgresql.org Discussion: https://postgr.es/m/095ccf8d-7f58-d928-427c-b17ace23cae6@burgess.co.nz Discussion: https://postgr.es/m/16927-67c570d968c99567%40postgresql.org Discussion: https://postgr.es/m/YFBcRbnBiPdGZvfW@paquier.xyz Backpatch-through: 13
*	Move lwlock-release probe back where it belongs	Peter Eisentraut	2021-03-21
\| \| \| \| \| \| \| \| \| \|	The documentation specifically states that lwlock-release fires before any released waiters have been awakened. It worked that way until ab5194e6f617a9a9e7aadb3dd1cee948a42d0755, where is seems to have been misplaced accidentally. Move it back where it belongs. Author: Craig Ringer <craig.ringer@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/CAGRY4nwxKUS_RvXFW-ugrZBYxPFFM5kjwKT5O+0+Stuga5b4+Q@mail.gmail.com
*	Provide recovery_init_sync_method=syncfs.	Thomas Munro	2021-03-20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since commit 2ce439f3 we have opened every file in the data directory and called fsync() at the start of crash recovery. This can be very slow if there are many files, leading to field complaints of systems taking minutes or even hours to begin crash recovery. Provide an alternative method, for Linux only, where we call syncfs() on every possibly different filesystem under the data directory. This is equivalent, but avoids faulting in potentially many inodes from potentially slow storage. The new mode comes with some caveats, described in the documentation, so the default value for the new setting is "fsync", preserving the older behavior. Reported-by: Michael Brown <michael.brown@discourse.org> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Paul Guo <guopa@vmware.com> Reviewed-by: Bruce Momjian <bruce@momjian.us> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Reviewed-by: David Steele <david@pgmasters.net> Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
*	Remove temporary files after backend crash	Tomas Vondra	2021-03-18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After a crash of a backend using temporary files, the files used to be left behind, on the basis that it might be useful for debugging. But we don't have any reports of anyone actually doing that, and it means the disk usage may grow over time due to repeated backend failures (possibly even hitting ENOSPC). So this behavior is a bit unfortunate, and fixing it required either manual cleanup (deleting files, which is error-prone) or restart of the instance (i.e. service disruption). This implements automatic cleanup of temporary files, controled by a new GUC remove_temp_files_after_crash. By default the files are removed, but it can be disabled to restore the old behavior if needed. Author: Euler Taveira Reviewed-by: Tomas Vondra, Michael Paquier, Anastasia Lubennikova, Thomas Munro Discussion: https://postgr.es/m/CAH503wDKdYzyq7U-QJqGn%3DGm6XmoK%2B6_6xTJ-Yn5WSvoHLY1Ww%40mail.gmail.com
*	Make archiver process an auxiliary process.	Fujii Masao	2021-03-15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit changes WAL archiver process so that it's treated as an auxiliary process and can use shared memory. This is an infrastructure patch required for upcoming shared-memory based stats collector patch series. These patch series basically need any processes including archiver that can report the statistics to access to shared memory. Since this patch itself is useful to simplify the code and when users monitor the status of archiver, it's committed separately in advance. This commit simplifies the code for WAL archiving. For example, previously backends need to signal to archiver via postmaster when they notify archiver that there are some WAL files to archive. On the other hand, this commit removes that signal to postmaster and enables backends to notify archier directly using shared latch. Also, as the side of this change, the information about archiver process becomes viewable at pg_stat_activity view. Author: Kyotaro Horiguchi Reviewed-by: Andres Freund, Álvaro Herrera, Julien Rouhaud, Tomas Vondra, Arthur Zakirov, Fujii Masao Discussion: https://postgr.es/m/20180629.173418.190173462.horiguchi.kyotaro@lab.ntt.co.jp
*	Specialize checkpointer sort functions.	Thomas Munro	2021-03-12
\| \| \| \| \| \| \| \| \|	When sorting a potentially large number of dirty buffers, the checkpointer can benefit from a faster sort routine. One reported improvement on a large buffer pool system was 1.4s -> 0.6s. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGJ2-eaDqAum5bxhpMNhvuJmRDZxB_Tow0n-gse%2BHG0Yig%40mail.gmail.com
*	Fix size overflow in calculation introduced by commits d6ad34f3 and bea449c6.	Amit Kapila	2021-03-12
\| \| \| \| \| \| \|	Reported-by: Thomas Munro Author: Takayuki Tsunakawa Reviewed-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/CA+hUKG+oPoFizjABt=GXZWTEHx3oev5rAe2scjW2r6F1rguo5w@mail.gmail.com
*	Minor modernization for README.barrier.	Thomas Munro	2021-03-12
\| \| \| \| \|	Itanium is very uncommon and being discontinued. ARM is everywhere. Prefer ARM as an example of an architecture with weak memory ordering.
*	Replace buffer I/O locks with condition variables.	Thomas Munro	2021-03-11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	1. Backends waiting for buffer I/O are now interruptible. 2. If something goes wrong in a backend that is currently performing I/O, waiting backends no longer wake up until that backend reaches AbortBufferIO() and broadcasts on the CV. Previously, any waiters would wake up (because the I/O lock was automatically released) and then busy-loop until AbortBufferIO() cleared BM_IO_IN_PROGRESS. 3. LWLockMinimallyPadded is removed, as it would now be unused. Author: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Julien Rouhaud <rjuju123@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (earlier version, 2016) Discussion: https://postgr.es/m/CA%2BhUKGJ8nBFrjLuCTuqKN0pd2PQOwj9b_jnsiGFFMDvUxahj_A%40mail.gmail.com Discussion: https://postgr.es/m/CA+Tgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr=C56Xng@mail.gmail.com
*	Remove obsolete comment for WaitForProcSignalBarrier().	Thomas Munro	2021-03-02
\| \| \| \| \| \|	Commit 814f1d8b removed the behavior described. Reported-by: Amit Kapila <amit.kapila16@gmail.com>
*	Allow condition variables to be used in interrupt code.	Thomas Munro	2021-03-01
\| \| \| \| \| \| \| \| \| \|	Adjust the condition variable sleep loop to work correctly when code reached by its internal CHECK_FOR_INTERRUPTS() call interacts with another condition variable. There are no such cases currently, but a proposed patch would do this. Discussion: https://postgr.es/m/CA+hUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg@mail.gmail.com
*	Use condition variables for ProcSignalBarriers.	Thomas Munro	2021-03-01
\| \| \| \| \| \| \|	Instead of a poll/sleep loop, use a condition variable for precise wake-up whenever a backend's pss_barrierGeneration advances. Discussion: https://postgr.es/m/CA+hUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg@mail.gmail.com
*	Use EVFILT_SIGNAL for kqueue latches.	Thomas Munro	2021-03-01
\| \| \| \| \| \| \| \| \| \| \|	Cut down on system calls and other overheads by waiting for SIGURG explicitly with kqueue instead of using a signal handler and self-pipe. Affects *BSD and macOS systems. This leaves only the poll implementation with a signal handler and the traditional self-pipe trick. Discussion: https://postgr.es/m/CA+hUKGJjxPDpzBE0a3hyUywBvaZuC89yx3jK9RFZgfv_KHU7gg@mail.gmail.com
*	Use signalfd(2) for epoll latches.	Thomas Munro	2021-03-01
\| \| \| \| \| \| \| \| \| \|	Cut down on system calls and other overheads by reading from a signalfd instead of using a signal handler and self-pipe. Affects Linux sytems, and possibly others including illumos that implement the Linux epoll and signalfd interfaces. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA+hUKGJjxPDpzBE0a3hyUywBvaZuC89yx3jK9RFZgfv_KHU7gg@mail.gmail.com
*	Use SIGURG rather than SIGUSR1 for latches.	Thomas Munro	2021-03-01
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Traditionally, SIGUSR1 has been overloaded for ad-hoc signals, procsignal.c signals and latch.c wakeups. Move that last use over to a new dedicated signal. SIGURG is normally used to report out-of-band socket data, but PostgreSQL doesn't use that facility. The signal handler is now installed in all postmaster children by InitializeLatchSupport(). Those wishing to disconnect from it should call ShutdownLatchSupport(). Future patches will use this separation of signals to avoid the need for a signal handler on some operating systems. Discussion: https://postgr.es/m/CA+hUKGJjxPDpzBE0a3hyUywBvaZuC89yx3jK9RFZgfv_KHU7gg@mail.gmail.com
*	Optimize latches to send fewer signals.	Thomas Munro	2021-03-01
\| \| \| \| \| \| \|	Don't send signals to processes that aren't sleeping. Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA+hUKGJjxPDpzBE0a3hyUywBvaZuC89yx3jK9RFZgfv_KHU7gg@mail.gmail.com
*	Remove latch.c workaround for Linux < 2.6.27.	Thomas Munro	2021-03-01
\| \| \| \| \| \| \| \| \| \| \|	Commit 82ebbeb0 added a workaround for systems with no epoll_create1() and EPOLL_CLOEXEC. Linux < 2.6.27 and glibc < 2.9 are long gone. Now seems like a good time to drop the extra code, because otherwise we'd have to add similar already-dead workaround code to new patches using XXX_CLOEXEC flags that arrived in the same kernel release. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CA%2BhUKGKL_%3DaO%3Dr30N%3Ds9VoDgTqHpRSzePRbA9dkYO7snc7HsxA%40mail.gmail.com
*	Add TID Range Scans to support efficient scanning ranges of TIDs	David Rowley	2021-02-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds a new executor node named TID Range Scan. The query planner will generate paths for TID Range scans when quals are discovered on base relations which search for ranges on the table's ctid column. These ranges may be open at either end. For example, WHERE ctid >= '(10,0)'; will return all tuples on page 10 and over. To support this, two new optional callback functions have been added to table AM. scan_set_tidrange is used to set the scan range to just the given range of TIDs. scan_getnextslot_tidrange fetches the next tuple in the given range. For AMs were scanning ranges of TIDs would not make sense, these functions can be set to NULL in the TableAmRoutine. The query planner won't generate TID Range Scan Paths in that case. Author: Edmund Horner, David Rowley Reviewed-by: David Rowley, Tomas Vondra, Tom Lane, Andres Freund, Zhihong Yu Discussion: https://postgr.es/m/CAMyN-kB-nFTkF=VA_JPwFNo08S0d-Yk0F741S2B7LDmYAi8eyA@mail.gmail.com
*	Use full 64-bit XIDs in deleted nbtree pages.	Peter Geoghegan	2021-02-24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Otherwise we risk "leaking" deleted pages by making them non-recyclable indefinitely. Commit 6655a729 did the same thing for deleted pages in GiST indexes. That work was used as a starting point here. Stop storing an XID indicating the oldest bpto.xact across all deleted though unrecycled pages in nbtree metapages. There is no longer any reason to care about that condition/the oldest XID. It only ever made sense when wraparound was something _bt_vacuum_needs_cleanup() had to consider. The btm_oldest_btpo_xact metapage field has been repurposed and renamed. It is now btm_last_cleanup_num_delpages, which is used to remember how many non-recycled deleted pages remain from the last VACUUM (in practice its value is usually the precise number of pages that were _newly deleted_ during the specific VACUUM operation that last set the field). The general idea behind storing btm_last_cleanup_num_delpages is to use it to give _some_ consideration to non-recycled deleted pages inside _bt_vacuum_needs_cleanup() -- though never too much. We only really need to avoid leaving a truly excessive number of deleted pages in an unrecycled state forever. We only do this to cover certain narrow cases where no other factor makes VACUUM do a full scan, and yet the index continues to grow (and so actually misses out on recycling existing deleted pages). These metapage changes result in a clear user-visible benefit: We no longer trigger full index scans during VACUUM operations solely due to the presence of only 1 or 2 known deleted (though unrecycled) blocks from a very large index. All that matters now is keeping the costs and benefits in balance over time. Fix an issue that has been around since commit 857f9c36, which added the "skip full scan of index" mechanism (i.e. the _bt_vacuum_needs_cleanup() logic). The accuracy of btm_last_cleanup_num_heap_tuples accidentally hinged upon _when_ the source value gets stored. We now always store btm_last_cleanup_num_heap_tuples in btvacuumcleanup(). This fixes the issue because IndexVacuumInfo.num_heap_tuples (the source field) is expected to accurately indicate the state of the table _after_ the VACUUM completes inside btvacuumcleanup(). A backpatchable fix cannot easily be extracted from this commit. A targeted fix for the issue will follow in a later commit, though that won't happen today. I (pgeoghegan) have chosen to remove any mention of deleted pages in the documentation of the vacuum_cleanup_index_scale_factor GUC/param, since the presence of deleted (though unrecycled) pages is no longer of much concern to users. The vacuum_cleanup_index_scale_factor description in the docs now seems rather unclear in any case, and it should probably be rewritten in the near future. Perhaps some passing mention of page deletion will be added back at the same time. Bump XLOG_PAGE_MAGIC due to nbtree WAL records using full XIDs now. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAH2-WznpdHvujGUwYZ8sihX=d5u-tRYhi-F4wnV2uN2zHpMUXw@mail.gmail.com
*	Fix some typos, grammar and style in docs and comments	Michael Paquier	2021-02-24
\| \| \| \| \| \| \| \|	The portions fixing the documentation are backpatched where needed. Author: Justin Pryzby Discussion: https://postgr.es/m/20210210235557.GQ20012@telsasoft.com backpatch-through: 9.6
*	VACUUM: ignore indexing operations with CONCURRENTLY	Alvaro Herrera	2021-02-23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As envisioned in commit c98763bf51bf, it is possible for VACUUM to ignore certain transactions that are executing CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY for the purposes of computing Xmin; that's because we know those transactions are not going to examine any other tables, and are not going to execute anything else in the same transaction. (Only operations on "safe" indexes can be ignored: those on indexes that are neither partial nor expressional). This is extremely useful in cases where CIC/RC can run for a very long time, because that used to be a significant headache for concurrent vacuuming of other tables. Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/20210115133858.GA18931@alvherre.pgsql
*	Simplify printing of LSNs	Peter Eisentraut	2021-02-23
\| \| \| \| \| \| \| \| \| \|	Add a macro LSN_FORMAT_ARGS for use in printf-style printing of LSNs. Convert all applicable code to use it. Reviewed-by: Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://www.postgresql.org/message-id/flat/CAExHW5ub5NaTELZ3hJUCE6amuvqAtsSxc7O+uK7y4t9Rrk23cw@mail.gmail.com
*	Fix typo	Alvaro Herrera	2021-02-22
\|
*	Initialize atomic variable waitStart in PGPROC, at postmaster startup.	Fujii Masao	2021-02-22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 46d6e5f567 added the atomic variable "waitStart" into PGPROC struct, to store the time at which wait for lock acquisition started. Previously this variable was initialized every time each backend started. Instead, this commit makes postmaster initialize it at the startup, to ensure that the variable should be initialized before any use of it. This commit also moves the code to initialize "waitStart" variable for prepare transaction, from TwoPhaseGetDummyProc() to MarkAsPreparingGuts(). Because MarkAsPreparingGuts() is more proper place to do that since it initializes other PGPROC variables. Author: Fujii Masao Reviewed-by: Atsushi Torikoshi Discussion: https://postgr.es/m/1df88660-6f08-cc6e-b7e2-f85296a2bdab@oss.nttdata.com
*	Use errmsg_internal for debug messages	Peter Eisentraut	2021-02-17
\| \| \| \| \| \|	An inconsistent set of debug-level messages was not using errmsg_internal(), thus uselessly exposing the messages to translation work. Fix those.
*	Display the time when the process started waiting for the lock, in pg_locks, ↵	Fujii Masao	2021-02-15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	take 2 This commit adds new column "waitstart" into pg_locks view. This column reports the time when the server process started waiting for the lock if the lock is not held. This information is useful, for example, when examining the amount of time to wait on a lock by subtracting "waitstart" in pg_locks from the current time, and identify the lock that the processes are waiting for very long. This feature uses the current time obtained for the deadlock timeout timer as "waitstart" (i.e., the time when this process started waiting for the lock). Since getting the current time newly can cause overhead, we reuse the already-obtained time to avoid that overhead. Note that "waitstart" is updated without holding the lock table's partition lock, to avoid the overhead by additional lock acquisition. This can cause "waitstart" in pg_locks to become NULL for a very short period of time after the wait started even though "granted" is false. This is OK in practice because we can assume that users are likely to look at "waitstart" when waiting for the lock for a long time. The first attempt of this patch (commit 3b733fcd04) caused the buildfarm member "rorqual" (built with --disable-atomics --disable-spinlocks) to report the failure of the regression test. It was reverted by commit 890d2182a2. The cause of this failure was that the atomic variable for "waitstart" in the dummy process entry created at the end of prepare transaction was not initialized. This second attempt fixes that issue. Bump catalog version. Author: Atsushi Torikoshi Reviewed-by: Ian Lawrence Barwick, Robert Haas, Justin Pryzby, Fujii Masao Discussion: https://postgr.es/m/a96013dc51cdc56b2a2b84fa8a16a993@oss.nttdata.com
*	Hold interrupts while running dsm_detach() callbacks.	Thomas Munro	2021-02-15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	While cleaning up after a parallel query or parallel index creation that created temporary files, we could be interrupted by a statement timeout. The error handling path would then fail to clean up the files when it ran dsm_detach() again, because the callback was already popped off the list. Prevent this hazard by holding interrupts while the cleanup code runs. Thanks to Heikki Linnakangas for this suggestion, and also to Kyotaro Horiguchi, Masahiko Sawada, Justin Pryzby and Tom Lane for discussion of this and earlier ideas on how to fix the problem. Back-patch to all supported releases. Reported-by: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/20191212180506.GR2082@telsasoft.com
*	Revert "Display the time when the process started waiting for the lock, in ↵	Fujii Masao	2021-02-09
\| \| \| \| \| \| \| \|	pg_locks." This reverts commit 3b733fcd04195399db56f73f0616b4f5c6828e18. Per buildfarm members prion and rorqual.
*	Display the time when the process started waiting for the lock, in pg_locks.	Fujii Masao	2021-02-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit adds new column "waitstart" into pg_locks view. This column reports the time when the server process started waiting for the lock if the lock is not held. This information is useful, for example, when examining the amount of time to wait on a lock by subtracting "waitstart" in pg_locks from the current time, and identify the lock that the processes are waiting for very long. This feature uses the current time obtained for the deadlock timeout timer as "waitstart" (i.e., the time when this process started waiting for the lock). Since getting the current time newly can cause overhead, we reuse the already-obtained time to avoid that overhead. Note that "waitstart" is updated without holding the lock table's partition lock, to avoid the overhead by additional lock acquisition. This can cause "waitstart" in pg_locks to become NULL for a very short period of time after the wait started even though "granted" is false. This is OK in practice because we can assume that users are likely to look at "waitstart" when waiting for the lock for a long time. Bump catalog version. Author: Atsushi Torikoshi Reviewed-by: Ian Lawrence Barwick, Robert Haas, Justin Pryzby, Fujii Masao Discussion: https://postgr.es/m/a96013dc51cdc56b2a2b84fa8a16a993@oss.nttdata.com
*	Rename removable xid function for consistency.	Peter Geoghegan	2021-02-07
\| \| \| \| \| \| \| \| \| \| \| \|	GlobalVisIsRemovableFullXid() is now GlobalVisCheckRemovableFullXid(). This is consistent with the general convention for FullTransactionId equivalents of functions that deal with TransactionId values. It now matches the nearby GlobalVisCheckRemovableXid() function, which performs the same check for callers that use TransactionId values. Oversight in commit dc7420c2c92. Discussion: https://postgr.es/m/CAH2-Wzmes12jFNDcVgpU89Vp=r6uLFrE-MT0fjSWGsE70UiNaA@mail.gmail.com
*	Fix CREATE INDEX CONCURRENTLY for simultaneous prepared transactions.	Noah Misch	2021-01-30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a cluster having used CREATE INDEX CONCURRENTLY while having enabled prepared transactions, queries that use the resulting index can silently fail to find rows. Fix this for future CREATE INDEX CONCURRENTLY by making it wait for prepared transactions like it waits for ordinary transactions. This expands the VirtualTransactionId structure domain to admit prepared transactions. It may be necessary to reindex to recover from past occurrences. Back-patch to 9.5 (all supported versions). Andrey Borodin, reviewed (in earlier versions) by Tom Lane and Michael Paquier. Discussion: https://postgr.es/m/2E712143-97F7-4890-B470-4A35142ABC82@yandex-team.ru
*	Remove CheckpointLock.	Robert Haas	2021-01-25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Up until now, we've held this lock when performing a checkpoint or restartpoint, but commit 076a055acf3c55314de267c62b03191586d79cf6 back in 2004 and commit 7e48b77b1cebb9a43f9fdd6b17128a0ba36132f9 from 2009, taken together, have removed all need for this. In the present code, there's only ever one process entitled to attempt a checkpoint: either the checkpointer, during normal operation, or the postmaster, during single-user operation. So, we don't need the lock. One possible concern in making this change is that it means that a substantial amount of code where HOLD_INTERRUPTS() was previously in effect due to the preceding LWLockAcquire() will now be running without that. This could mean that ProcessInterrupts() gets called in places from which it didn't before. However, this seems unlikely to do very much, because the checkpointer doesn't have any signal mapped to die(), so it's not clear how, for example, ProcDiePending = true could happen in the first place. Similarly with ClientConnectionLost and recovery conflicts. Also, if there are any such problems, we might want to fix them rather than reverting this, since running lots of code with interrupt handling suspended is generally bad. Patch by me, per an inquiry by Amul Sul. Review by Tom Lane and Michael Paquier. Discussion: http://postgr.es/m/CAAJ_b97XnBBfYeSREDJorFsyoD1sHgqnNuCi=02mNQBUMnA=FA@mail.gmail.com
*	Remove bogus tracepoint	Peter Eisentraut	2021-01-22
\| \| \| \| \| \| \| \| \|	Calls to LWLockWaitForVar() fired the TRACE_POSTGRESQL_LWLOCK_ACQUIRE tracepoint, but LWLockWaitForVar() never actually acquires the LWLock. (Probably a copy/paste bug in 68a2e52bbaf.) Remove it. Author: Craig Ringer <craig.ringer@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/CAGRY4nxJo+-HCC2i5H93ttSZ4gZO-FSddCwvkb-qAfQ1zdXd1w@mail.gmail.com
*	Allow for error or refusal while absorbing a ProcSignalBarrier.	Robert Haas	2021-01-18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Previously, the per-barrier-type functions tasked with absorbing them were expected to always succeed and never throw an error. However, that's a bit inconvenient. Further study has revealed that there are realistic cases where it might not be possible to absorb a ProcSignalBarrier without terminating the transaction, or even the whole backend. Similarly, for some barrier types, there might be other reasons where it's not reasonably possible to absorb the barrier at certain points in the code, so provide a way for a per-barrier-type function to reject absorbing the barrier. Unfortunately, there's still no committed code making use of this infrastructure; hopefully, we'll get there. :-( Patch by me, reviewed by Andres Freund and Amul Sul. Discussion: http://postgr.es/m/20200908182005.xya7wetdh3pndzim@alap3.anarazel.de Discussion: http://postgr.es/m/CA+Tgmob56Pk1-5aTJdVPCWFHon7me4M96ENpGe9n_R4JUjjhZA@mail.gmail.com
*	Prevent excess SimpleLruTruncate() deletion.	Noah Misch	2021-01-16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Every core SLRU wraps around. With the exception of pg_notify, the wrap point can fall in the middle of a page. Account for this in the PagePrecedes callback specification and in SimpleLruTruncate()'s use of said callback. Update each callback implementation to fit the new specification. This changes SerialPagePrecedesLogically() from the style of asyncQueuePagePrecedes() to the style of CLOGPagePrecedes(). (Whereas pg_clog and pg_serial share a key space, pg_serial is nothing like pg_notify.) The bug fixed here has the same symptoms and user followup steps as 592a589a04bd456410b853d86bd05faa9432cbbb. Back-patch to 9.5 (all supported versions). Reviewed by Andrey Borodin and (in earlier versions) by Tom Lane. Discussion: https://postgr.es/m/20190202083822.GC32531@gust.leadboat.com
*	Fix calculation of how much shared memory is required to store a TOC.	Fujii Masao	2021-01-15
\| \| \| \| \| \| \| \| \| \| \|	Commit ac883ac453 refactored shm_toc_estimate() but changed its calculation of shared memory size for TOC incorrectly. Previously this could cause too large memory to be allocated. Back-patch to v11 where the bug was introduced. Author: Takayuki Tsunakawa Discussion: https://postgr.es/m/TYAPR01MB2990BFB73170E2C4921E2C4DFEA80@TYAPR01MB2990.jpnprd01.prod.outlook.com
*	Move our p{read,write}v replacements into their own files.	Thomas Munro	2021-01-14
\| \| \| \| \| \| \| \| \| \| \| \| \|	macOS's ranlib issued a warning about an empty pread.o file with the previous arrangement, on systems new enough to require no replacement functions. Let's go back to using configure's AC_REPLACE_FUNCS system to build and include each .o in the library only if it's needed, which requires moving the *v() functions to their own files. Also move the _with_retry() wrapper to a more permanent home. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1283127.1610554395%40sss.pgh.pa.us
*	Log long wait time on recovery conflict when it's resolved.	Fujii Masao	2021-01-13
\| \| \| \| \| \| \| \| \| \| \| \| \|	This is a follow-up of the work done in commit 0650ff2303. This commit extends log_recovery_conflict_waits so that a log message is produced also when recovery conflict has already been resolved after deadlock_timeout passes, i.e., when the startup process finishes waiting for recovery conflict after deadlock_timeout. This is useful in investigating how long recovery conflicts prevented the recovery from applying WAL. Author: Fujii Masao Reviewed-by: Kyotaro Horiguchi, Bertrand Drouvot Discussion: https://postgr.es/m/9a60178c-a853-1440-2cdc-c3af916cff59@amazon.com
*	Optimize DropRelFileNodesAllBuffers() for recovery.	Amit Kapila	2021-01-13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Similar to commit d6ad34f341, this patch optimizes DropRelFileNodesAllBuffers() by avoiding the complete buffer pool scan and instead find the buffers to be invalidated by doing lookups in the BufMapping table. This optimization helps operations where the relation files need to be removed like Truncate, Drop, Abort of Create Table, etc. Author: Kirk Jamison Reviewed-by: Kyotaro Horiguchi, Takayuki Tsunakawa, and Amit Kapila Tested-By: Haiying Tang Discussion: https://postgr.es/m/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660@OSBPR01MB3207.jpnprd01.prod.outlook.com
*	Optimize DropRelFileNodeBuffers() for recovery.	Amit Kapila	2021-01-12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The recovery path of DropRelFileNodeBuffers() is optimized so that scanning of the whole buffer pool can be avoided when the number of blocks to be truncated in a relation is below a certain threshold. For such cases, we find the buffers by doing lookups in BufMapping table. This improves the performance by more than 100 times in many cases when several small tables (tested with 1000 relations) are truncated and where the server is configured with a large value of shared buffers (greater than equal to 100GB). This optimization helps cases (a) when vacuum or autovacuum truncated off any of the empty pages at the end of a relation, or (b) when the relation is truncated in the same transaction in which it was created. This commit introduces a new API smgrnblocks_cached which returns a cached value for the number of blocks in a relation fork. This helps us to determine the exact size of relation which is required to apply this optimization. The exact size is required to ensure that we don't leave any buffer for the relation being dropped as otherwise the background writer or checkpointer can lead to a PANIC error while flushing buffers corresponding to files that don't exist. Author: Kirk Jamison based on ideas by Amit Kapila Reviewed-by: Kyotaro Horiguchi, Takayuki Tsunakawa, and Amit Kapila Tested-By: Haiying Tang Discussion: https://postgr.es/m/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660@OSBPR01MB3207.jpnprd01.prod.outlook.com
*	Add GUC to log long wait times on recovery conflicts.	Fujii Masao	2021-01-08
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit adds GUC log_recovery_conflict_waits that controls whether a log message is produced when the startup process is waiting longer than deadlock_timeout for recovery conflicts. This is useful in determining if recovery conflicts prevent the recovery from applying WAL. Note that currently a log message is produced only when recovery conflict has not been resolved yet even after deadlock_timeout passes, i.e., only when the startup process is still waiting for recovery conflict even after deadlock_timeout. Author: Bertrand Drouvot, Masahiko Sawada Reviewed-by: Alvaro Herrera, Kyotaro Horiguchi, Fujii Masao Discussion: https://postgr.es/m/9a60178c-a853-1440-2cdc-c3af916cff59@amazon.com
*	Add idle_session_timeout.	Tom Lane	2021-01-06
\| \| \| \| \| \| \| \| \| \| \| \|	This GUC variable works much like idle_in_transaction_session_timeout, in that it kills sessions that have waited too long for a new client query. But it applies when we're not in a transaction, rather than when we are. Li Japin, reviewed by David Johnston and Hayato Kuroda, some fixes by me Discussion: https://postgr.es/m/763A0689-F189-459E-946F-F0EC4458980B@hotmail.com