path: root/src/backend/access
...
* Fill CommonRdOptions with default values in extract_autovac_opts() (Alexander Korotkov, 2024-04-08)

  Reported-by: Thomas Munro
  Reported-by: Pavel Borisov
  Discussion: https://postgr.es/m/CA%2BhUKGLZzLR50RBvuqOO3MZ%3DF54ETz-rTp1PDX9uDGP_GqyYqA%40mail.gmail.com
* Custom reloptions for table AM (Alexander Korotkov, 2024-04-08)

  Let table AM define custom reloptions for its tables.  This allows
  specifying AM-specific parameters via the WITH clause when creating a
  table.  The reloptions that could be used outside of table AM are now
  extracted into the CommonRdOptions data structure.  At the table AM's
  discretion, these options can either be specified directly by the user
  or calculated in some way.

  The new test module test_tam_options evaluates the ability to set up
  custom reloptions and calculate fields of CommonRdOptions based on
  them.

  The code may use some parts from prior work by Hao Wu.

  Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
  Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn
  Reviewed-by: Pavel Borisov, Matthias van de Meent, Jess Davis
* Use bump context for TID bitmaps stored by vacuum (John Naylor, 2024-04-08)

  Vacuum does not pfree individual entries, and only frees the entire
  storage space when finished with it.  This allows using a bump context,
  eliminating the chunk header in each leaf allocation.  Most leaf
  allocations will be 16 to 32 bytes, so that's a significant savings.

  TidStoreCreateLocal gets a boolean parameter to indicate that the
  created store is insert-only.  This requires a separate tree context
  for iteration, since we free the iteration state after iteration
  completes.

  Discussion: https://postgr.es/m/CANWCAZac%3DpBePg3rhX8nXkUuaLoiAJJLtmnCfZsPEAS4EtJ%3Dkg%40mail.gmail.com
  Discussion: https://postgr.es/m/CANWCAZZQFfxvzO8yZHFWtQV+Z2gAMv1ku16Vu7KWmb5kZQyd1w@mail.gmail.com
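  A minimal sketch of the bump-context pattern described above, assuming
  PostgreSQL 17's BumpContextCreate() takes allocset-style block-size
  arguments (treat the exact signature as an assumption, not a
  reference):

      #include "postgres.h"
      #include "utils/memutils.h"

      static void
      tidstore_bump_sketch(void)
      {
          /* bump contexts never free individual chunks, so no chunk headers */
          MemoryContext leaf_ctx = BumpContextCreate(CurrentMemoryContext,
                                                     "tidstore leaves",
                                                     ALLOCSET_DEFAULT_SIZES);

          /* typical 16- to 32-byte leaf allocations, header-free */
          for (int i = 0; i < 1000; i++)
              (void) MemoryContextAlloc(leaf_ctx, 24);

          /* pfree() of individual chunks is unsupported; free everything at once */
          MemoryContextDelete(leaf_ctx);
      }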
* Remove references to old function name (Andres Freund, 2024-04-07)

  In a97bbe1f1df I accidentally referenced heapgetpage(), both in a
  function name and a comment.  But since 44086b09753 the relevant
  function is named heap_prepare_pagescan().  Rename the new function to
  page_collect_tuples().

  Reported-by: Melanie Plageman <melanieplageman@gmail.com>
  Reported-by: David Rowley <dgrowleyml@gmail.com>
  Discussion: https://postgr.es/m/20240407172615.cocrsvboqm3ttqe4@awork3.anarazel.de
  Discussion: https://postgr.es/m/CAApHDvp4SniHopTrVeKWcEvNXFtdki0utAvO=5R7H6TNhtULRQ@mail.gmail.com
* Fix alignment of stack variable (John Naylor, 2024-04-08)

  Declare with union similar to PGAlignedBlock.

  Report and fix by Andres Freund

  Discussion: https://postgr.es/m/20240407190731.izm3mdazednrsiqk%40awork3.anarazel.de
* Remove redundant nbtree preprocessing assertions. (Peter Geoghegan, 2024-04-07)

  One of the assertions was the subject of a false positive complaint
  from Coverity, but none of the assertions added much, so get rid of
  them.

  Reported-By: Tom Lane <tgl@sss.pgh.pa.us>
  Discussion: https://postgr.es/m/3000247.1712537309@sss.pgh.pa.us
* Use streaming I/O in ANALYZE. (Thomas Munro, 2024-04-08)

  The ANALYZE command prefetches and reads sample blocks chosen by a
  BlockSampler algorithm.  Instead of calling [Prefetch|Read]Buffer() for
  each block, ANALYZE now uses the streaming API introduced in
  b5a9b18cd0.

  Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
  Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
  Reviewed-by: Andres Freund <andres@anarazel.de>
  Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
  Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
  Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
  Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
* Use condition variable to wait for next MultiXact offset (Alvaro Herrera, 2024-04-07)

  In one multixact.c edge case, we need a mechanism to wait for one
  multixact offset to be written before being allowed to read the next
  one.  We used to handle this case by sleeping for one millisecond and
  retrying, but such sleeps have been reported as problematic in
  production cases.  We can avoid the problem by using a condition
  variable: readers sleep on it and then every creator of multixacts
  broadcasts into the CV when creation is sufficiently far along.

  Author: Kyotaro Horiguchi <horikyotajntt@gmail.com>
  Reviewed-by: Andrey Borodin <amborodin@acm.org>
  Discussion: https://postgr.es/m/47A598F4-B4E7-4029-8FEC-A06A6C3CB4B5@yandex-team.ru
  Discussion: https://postgr.es/m/20200515.090333.24867479329066911.horikyota.ntt
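  A minimal sketch of the wait/broadcast pattern this describes, using
  PostgreSQL's condition-variable primitives; the offset_known()
  predicate and the choice of wait-event class are hypothetical
  placeholders:

      #include "postgres.h"
      #include "storage/condition_variable.h"
      #include "utils/wait_event.h"

      static ConditionVariable mxact_cv;   /* ConditionVariableInit() at startup */

      static bool offset_known(void);      /* hypothetical predicate */

      /* Reader: sleep until the offset we need has been written. */
      static void
      wait_for_offset(void)
      {
          ConditionVariablePrepareToSleep(&mxact_cv);
          while (!offset_known())
              ConditionVariableSleep(&mxact_cv, PG_WAIT_IPC);
          ConditionVariableCancelSleep();
      }

      /* Creator: wake all readers once creation is sufficiently far along. */
      static void
      offset_written(void)
      {
          ConditionVariableBroadcast(&mxact_cv);
      }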
* Avoid extra lookups with nbtree array inequalities. (Peter Geoghegan, 2024-04-07)

  nbtree index scans with SAOP inequalities (but no SAOP equalities)
  performed extra ORDER proc lookups for any remaining equality strategy
  scan keys.  This could waste cycles, and caused assertion failures.

  Keeping around a separate ORDER proc is only necessary for a scan's
  non-array/non-SAOP equality scan keys when the scan has at least one
  other SAOP equality strategy key (a SAOP inequality shouldn't count).
  To fix, replace _bt_preprocess_array_keys_final's assertion with a test
  that makes the function return early when the scan has no SAOP equality
  scan keys.

  Oversight in commit 1b134ca5, which enhanced nbtree ScalarArrayOp
  execution.

  Reported-By: Alexander Lakhin <exclusion@gmail.com>
  Discussion: https://postgr.es/m/0539d3d3-a402-0a49-ed5e-26429dffc4bd@gmail.com
* Use streaming I/O in sequential scans. (Thomas Munro, 2024-04-08)

  Instead of calling ReadBuffer() for each block, heap sequential scans
  and TID range scans now use the streaming API introduced in b5a9b18cd0.

  Author: Melanie Plageman <melanieplageman@gmail.com>
  Reviewed-by: Andres Freund <andres@anarazel.de>
  Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
  Discussion: https://postgr.es/m/flat/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg%3DgEQ%40mail.gmail.com
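  A hedged sketch of what a consumer of the b5a9b18cd0 streaming API
  looks like (this applies to both this commit and the ANALYZE commit
  above); the callback and block-range logic here are illustrative only:

      #include "postgres.h"
      #include "storage/bufmgr.h"
      #include "storage/read_stream.h"
      #include "utils/rel.h"

      typedef struct DemoScanState
      {
          BlockNumber next;
          BlockNumber nblocks;
      } DemoScanState;

      /* Tell the stream which block to read next; InvalidBlockNumber ends it. */
      static BlockNumber
      demo_next_block(ReadStream *stream, void *callback_private_data,
                      void *per_buffer_data)
      {
          DemoScanState *state = callback_private_data;

          if (state->next >= state->nblocks)
              return InvalidBlockNumber;
          return state->next++;
      }

      static void
      demo_scan(Relation rel)
      {
          DemoScanState state = {0, RelationGetNumberOfBlocks(rel)};
          ReadStream *stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL,
                                                          NULL, rel, MAIN_FORKNUM,
                                                          demo_next_block, &state, 0);
          Buffer      buf;

          /* The stream issues I/O ahead of the point being consumed. */
          while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
          {
              /* ... process the page ... */
              ReleaseBuffer(buf);
          }
          read_stream_end(stream);
      }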
* Add XLogCtl->logInsertResult (Alvaro Herrera, 2024-04-07)

  This tracks the position of WAL that's been fully copied into WAL
  buffers by all processes emitting WAL.  (For some reason we call that
  "WAL insertion".)  This is updated using atomic monotonic advance
  during WaitXLogInsertionsToFinish, which is not when the insertions
  actually occur, but it's the only place where we know that all of them
  have completed.

  This value is useful in WALReadFromBuffers, which can verify that
  callers don't try to read past what has been inserted.  (However, more
  infrastructure is needed in order to actually use WAL after the flush
  point, since it could be lost.)  The value is also useful in
  WaitXLogInsertionsToFinish() itself, since we can now exit quickly when
  all WAL has already been inserted, without even having to take any
  locks.
* Reduce branches in heapgetpage()'s per-tuple loop (Andres Freund, 2024-04-06)

  Until now, heapgetpage()'s loop over all tuples performed some
  conditional checks for each tuple, even though the conditions did not
  change across the loop.

  This commit fixes that by moving the loop into an inline function.  By
  calling it with different constant arguments, the compiler can generate
  an optimized loop for the different conditions, at the price of two
  per-page checks.

  For cases of all-visible tables and an isolation level other than
  serializable, speedups of up to 25% have been measured.

  Reviewed-by: John Naylor <johncnaylorls@gmail.com>
  Reviewed-by: Zhang Mingli <zmlpostgres@gmail.com>
  Tested-by: Quan Zongliang <quanzongliang@yeah.net>
  Discussion: https://postgr.es/m/20230716015656.xjvemfbp5fysjiea@awork3.anarazel.de
  Discussion: https://postgr.es/m/2ef7ff1b-3d18-2283-61b1-bbd25fc6c7ce@yeah.net
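  The branch removal is ordinary constant propagation; an illustrative
  sketch (not the committed code) of the pattern:

      #include "postgres.h"
      #include "storage/bufpage.h"

      /* Always-inline helper; with constant arguments the ifs fold away. */
      static pg_attribute_always_inline int
      collect_tuples(Page page, int maxoff,
                     const bool all_visible, const bool check_serializable)
      {
          int     ntup = 0;

          for (OffsetNumber off = FirstOffsetNumber; off <= maxoff; off++)
          {
              if (!all_visible)
              {
                  /* ... per-tuple visibility check ... */
              }
              if (check_serializable)
              {
                  /* ... predicate-lock bookkeeping ... */
              }
              ntup++;
          }
          return ntup;
      }

      /* Two per-page checks select among four specialized loop bodies. */
      static int
      collect_tuples_dispatch(Page page, int maxoff,
                              bool all_visible, bool check_serializable)
      {
          if (all_visible)
              return check_serializable
                  ? collect_tuples(page, maxoff, true, true)
                  : collect_tuples(page, maxoff, true, false);
          else
              return check_serializable
                  ? collect_tuples(page, maxoff, false, true)
                  : collect_tuples(page, maxoff, false, false);
      }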
* Optimize visibilitymap_count() with AVX-512 instructions. (Nathan Bossart, 2024-04-06)

  Commit 792752af4e added infrastructure for using AVX-512 intrinsic
  functions, and this commit uses that infrastructure to optimize
  visibilitymap_count().  Specifically, a new pg_popcount_masked()
  function is introduced that applies a bitmask to every byte in the
  buffer prior to calculating the population count, which is used to
  filter out the all-visible or all-frozen bits as needed.  Platforms
  without AVX-512 support should also see a nice speedup due to the
  reduced number of calls to a function pointer.

  Co-authored-by: Ants Aasma
  Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
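  A portable sketch of what a masked population count computes; this is a
  hypothetical standalone version, not the committed implementation
  (which adds a runtime-selected AVX-512 path):

      #include <stddef.h>
      #include <stdint.h>

      static uint64_t
      popcount_masked(const unsigned char *buf, size_t bytes, unsigned char mask)
      {
          uint64_t    count = 0;

          /* Mask each byte first, then count the surviving bits. */
          for (size_t i = 0; i < bytes; i++)
              count += __builtin_popcount(buf[i] & mask);

          return count;
      }

  With two visibility-map bits per heap block, a mask keeping only the
  all-visible bits (or only the all-frozen bits) turns this into exactly
  the counting that visibilitymap_count() needs; the actual bit layout
  and masks live in visibilitymap.c.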
* BitmapHeapScan: Push skip_fetch optimization into table AM (Tomas Vondra, 2024-04-07)

  Commit 7c70996ebf0949b142 introduced an optimization to allow bitmap
  scans to operate like index-only scans by not fetching a block from the
  heap if none of the underlying data is needed and the block is marked
  all-visible in the visibility map.

  With the introduction of table AMs, a FIXME was added to this code
  indicating that the skip_fetch logic should be pushed into the table
  AM-specific code, as not all table AMs may use a visibility map in the
  same way.

  This commit resolves this FIXME for the current block.  The layering
  violation is still present in BitmapHeapScan's prefetching code, which
  uses the visibility map to decide whether or not to prefetch a block.
  However, this can be addressed independently.

  Author: Melanie Plageman
  Reviewed-by: Andres Freund, Heikki Linnakangas, Tomas Vondra, Mark Dilger
  Discussion: https://postgr.es/m/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com
* Call WaitLSNCleanup() in AbortTransaction() (Alexander Korotkov, 2024-04-07)

  Even though waiting for a replay LSN happens without an explicit
  transaction, AbortTransaction() is responsible for the cleanup of the
  shared memory if an error is thrown in a stored procedure.  So, we need
  to do WaitLSNCleanup() there to clean up after an unexpected error that
  happened while waiting for the replay LSN.

  Discussion: https://postgr.es/m/202404051815.eri4u5q6oj26%40alvherre.pgsql
  Author: Alvaro Herrera
* Enhance nbtree ScalarArrayOp execution. (Peter Geoghegan, 2024-04-06)

  Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
  natively.  This works by pushing down the full context (the array keys)
  to the nbtree index AM, enabling it to execute multiple primitive index
  scans that the planner treats as one continuous index scan/index path.
  This earlier enhancement enabled nbtree ScalarArrayOp index-only scans.
  It also allowed scans with ScalarArrayOp quals to return ordered
  results (with some notable restrictions, described further down).

  Take this general approach a lot further: teach nbtree SAOP index scans
  to decide how to execute ScalarArrayOp scans (when and where to start
  the next primitive index scan) based on physical index characteristics.
  This can be far more efficient.  All SAOP scans will now reliably avoid
  duplicative leaf page accesses (just like any other nbtree index scan).
  SAOP scans whose array keys are naturally clustered together now
  require far fewer index descents, since we'll reliably avoid starting a
  new primitive scan just to get to a later offset from the same leaf
  page.

  The scan's arrays now advance using binary searches for the array
  element that best matches the next tuple's attribute value.  Required
  scan key arrays (i.e. arrays from scan keys that can terminate the
  scan) ratchet forward in lockstep with the index scan.  Non-required
  arrays (i.e. arrays from scan keys that can only exclude non-matching
  tuples) "advance" without the process ever rolling over to a
  higher-order array.

  Naturally, only required SAOP scan keys trigger skipping over leaf
  pages (non-required arrays cannot safely end or start primitive index
  scans).  Consequently, even index scans of a composite index with a
  high-order inequality scan key (which we'll mark required) and a
  low-order SAOP scan key (which we won't mark required) now avoid
  repeating leaf page accesses -- that benefit isn't limited to simpler
  equality-only cases.

  In general, all nbtree index scans now output tuples as if they were
  one continuous index scan -- even scans that mix a high-order
  inequality with lower-order SAOP equalities reliably output tuples in
  index order.  This allows us to remove a couple of special cases that
  were applied when building index paths with SAOP clauses during
  planning.

  Bugfix commit 807a40c5 taught the planner to avoid generating unsafe
  path keys: path keys on a multicolumn index path, with a SAOP clause on
  any attribute beyond the first/most significant attribute.  These cases
  are now all safe, so we go back to generating path keys without regard
  for the presence of SAOP clauses (just like with any other clause
  type).  Affected queries can now exploit scan output order in all the
  usual ways (e.g., certain "ORDER BY ... LIMIT n" queries can now
  terminate early).

  Also undo changes from follow-up bugfix commit a4523c5a, which taught
  the planner to produce alternative index paths, with path keys, but
  without low-order SAOP index quals (filter quals were used instead).
  We'll no longer generate these alternative paths, since they can no
  longer offer any meaningful advantages over standard index qual paths.
  Affected queries thereby avoid all of the disadvantages that come from
  using filter quals within index scan nodes.  They can avoid extra heap
  page accesses from using filter quals to exclude non-matching tuples
  (index quals will never have that problem).  They can also skip over
  irrelevant sections of the index in more cases (though only when nbtree
  determines that starting another primitive scan actually makes sense).

  There is a theoretical risk that removing restrictions on SAOP index
  paths from the planner will break compatibility with amcanorder-based
  index AMs maintained as extensions.  Such an index AM could have the
  same limitations around ordered SAOP scans as nbtree had up until now.
  Adding a pro forma incompatibility item about the issue to the Postgres
  17 release notes seems like a good idea.

  Author: Peter Geoghegan <pg@bowt.ie>
  Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
  Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
  Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
  Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com>
  Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
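  An illustrative sketch of the binary-search advancement idea (plain C
  over an int array; the real code works with datums and scan keys): find
  the least array element >= the next tuple's attribute value, so that a
  required array ratchets forward in lockstep with the scan:

      /* Returns nelems when no remaining element matches (scan can end). */
      static int
      advance_array_key(const int *elems, int nelems, int tupval)
      {
          int     lo = 0;
          int     hi = nelems;

          while (lo < hi)
          {
              int     mid = lo + (hi - lo) / 2;

              if (elems[mid] < tupval)
                  lo = mid + 1;
              else
                  hi = mid;
          }
          return lo;
      }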
* Operate XLogCtl->log{Write,Flush}Result with atomics (Alvaro Herrera, 2024-04-05)

  This removes the need to hold both the info_lck spinlock and
  WALWriteLock to update them.  We use stock atomic write instead, with
  WALWriteLock held.  Readers can use atomic read, without any locking.

  This allows for some code to be reordered: some places were a bit
  contorted to avoid repeated spinlock acquisition, but that's no longer
  a concern, so we can turn them into more natural coding.  Some further
  changes are possible (maybe leading to performance wins), but in this
  commit I did rather minimal ones only, to avoid increasing the blast
  radius.

  Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
  Reviewed-by: Jeff Davis <pgsql@j-davis.com>
  Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions)
  Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
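  A hedged sketch of the resulting concurrency pattern, using the
  primitives from port/atomics.h; the struct here is illustrative, not
  the actual XLogCtl layout:

      #include "postgres.h"
      #include "port/atomics.h"

      typedef struct DemoWalCtl
      {
          pg_atomic_uint64 logWriteResult;  /* last byte + 1 written */
          pg_atomic_uint64 logFlushResult;  /* last byte + 1 flushed */
      } DemoWalCtl;

      /* Writer: still runs with WALWriteLock held, but needs no spinlock. */
      static void
      demo_update(DemoWalCtl *ctl, uint64 write_upto, uint64 flush_upto)
      {
          pg_atomic_write_u64(&ctl->logWriteResult, write_upto);
          pg_atomic_write_u64(&ctl->logFlushResult, flush_upto);
      }

      /* Reader: no locking at all. */
      static uint64
      demo_read_flush(DemoWalCtl *ctl)
      {
          return pg_atomic_read_u64(&ctl->logFlushResult);
      }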
* Secondary refactor of heap scanning functions (David Rowley, 2024-04-04)

  Similar to 44086b097, refactor heap scanning functions to be more
  suitable for the read stream API.

  Author: Melanie Plageman
  Discussion: https://postgr.es/m/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg=gEQ@mail.gmail.com
* Preliminary refactor of heap scanning functions (David Rowley, 2024-04-04)

  To allow the use of the read stream API added in b5a9b18cd for
  sequential scans on heap tables, here we make some adjustments to make
  that change less invasive and perhaps make the code easier to follow in
  the process.

  Here heapgetpage() gets broken into two functions:

  1) The part which reads the block has now been moved into a function
     named heapfetchbuf().

  2) The part which performed pruning and populated the scan's
     rs_vistuples[] array is now moved into a new function named
     heap_prepare_pagescan().

  The functionality provided by heap_prepare_pagescan() was only ever
  required by SO_ALLOW_PAGEMODE scans, so the branching that was
  previously done in heapgetpage() is no longer needed, as we simply
  don't call heap_prepare_pagescan() from heapgettup() in the refactored
  code.

  Author: Melanie Plageman
  Discussion: https://postgr.es/m/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg=gEQ@mail.gmail.com
* Invent SERIALIZE option for EXPLAIN. (Tom Lane, 2024-04-03)

  EXPLAIN (ANALYZE, SERIALIZE) allows collection of statistics about the
  volume of data emitted by a query, as well as the time taken to convert
  the data to the on-the-wire format.  Previously there was no way to
  investigate this without actually sending the data to the client, in
  which case network transmission costs might swamp what you wanted to
  see.  In particular this feature allows investigating the costs of
  de-TOASTing compressed or out-of-line data during formatting.

  Stepan Rutz and Matthias van de Meent, reviewed by Tomas Vondra and
  myself

  Discussion: https://postgr.es/m/ca0adb0e-fa4e-c37e-1cd7-91170b18cae1@gmx.de
* Split XLogCtl->LogwrtResult into separate struct members (Alvaro Herrera, 2024-04-03)

  After this change we have XLogCtl->logWriteResult and ->logFlushResult.

  There's no functional change, other than the fact that the assignment
  from shared memory to local is no longer done via struct assignment,
  but instead using a macro that copies each member separately.

  The current representation is inconvenient going forward; notably, we
  would like to add a new member "Copy" (to keep track of the last
  position copied into WAL buffers), so the symmetry between the values
  in shared memory vs. those in local would be lost.

  This also gives us freedom to later change the concurrency model for
  the values in shared memory: we can make them use atomics instead of
  relying on the info_lck spinlock.

  Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
  Discussion: https://postgr.es/m/202404031119.cd2kugjk2vho@alvherre.pgsql
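  A sketch of the member-by-member copy this describes; the macro and
  field names are illustrative placeholders, not necessarily what the
  commit itself uses:

      #include "access/xlogdefs.h"

      typedef struct DemoLogwrtResult
      {
          XLogRecPtr  Write;
          XLogRecPtr  Flush;
      } DemoLogwrtResult;

      /*
       * Copy each shared value into the local cache separately.  Unlike a
       * struct assignment, this keeps working if the shared-memory side
       * later switches to atomics while the local copy stays plain.
       */
      #define DEMO_REFRESH_RESULT(dst) \
          do { \
              (dst).Write = XLogCtl->logWriteResult; \
              (dst).Flush = XLogCtl->logFlushResult; \
          } while (0)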
* Combine freezing and pruning steps in VACUUM (Heikki Linnakangas, 2024-04-03)

  Execute both freezing and pruning of tuples in the same
  heap_page_prune() function, now called heap_page_prune_and_freeze(),
  and emit a single WAL record containing all changes.  That reduces the
  overall amount of WAL generated.

  This moves the freezing logic from vacuumlazy.c to the
  heap_page_prune_and_freeze() function.  The main difference in the
  coding is that in vacuumlazy.c, we looked at the tuples after the
  pruning had already happened, but in heap_page_prune_and_freeze() we
  operate on the tuples before pruning.  The heap_prepare_freeze_tuple()
  function is now invoked after we have determined that a tuple is not
  going to be pruned away.

  VACUUM no longer needs to loop through the items on the page after
  pruning.  heap_page_prune_and_freeze() does all the work.  It now
  returns the list of dead offsets, including existing LP_DEAD items, to
  the caller.  Similarly it's now responsible for tracking 'all_visible',
  'all_frozen', and 'hastup' on the caller's behalf.

  Author: Melanie Plageman <melanieplageman@gmail.com>
  Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
* Refactor how heap_prune_chain() updates prunable_xid (Heikki Linnakangas, 2024-04-03)

  In preparation for freezing and counting tuples which are not
  candidates for pruning, split heap_prune_record_unchanged() into
  multiple functions, depending on the kind of line pointer.  That's not
  too interesting right now, but makes the next commit smaller.

  Recording the lowest soon-to-be prunable xid is one of the actions we
  take for unchanged LP_NORMAL item pointers but not for others, so move
  that to the new heap_prune_record_unchanged_lp_normal() function.  The
  next commit will add more actions to these functions.

  Author: Melanie Plageman <melanieplageman@gmail.com>
  Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
* Use the pairing heap instead of a flat array for LSN replay waiters (Alexander Korotkov, 2024-04-03)

  06c418e163 introduced the pg_wal_replay_wait() procedure, which allows
  waiting for a particular LSN to be replayed on standby.  The waiters
  were stored in a flat array.  Even though scanning small arrays is
  fast, that might be a problem at scale (a lot of waiting processes).

  This commit replaces the flat shared memory array with a pairing heap,
  which holds the waiter with the least LSN at the top.  This gives us
  O(log N) complexity for both inserting and removing waiters.

  Reported-by: Alvaro Herrera
  Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
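  A minimal sketch of tracking waiters with lib/pairingheap.h as
  described above; the WaiterInfo struct and comparator are illustrative
  (the heap itself would come from pairingheap_allocate(waiter_cmp,
  NULL)):

      #include "postgres.h"
      #include "access/xlogdefs.h"
      #include "lib/pairingheap.h"

      typedef struct WaiterInfo
      {
          pairingheap_node ph_node;
          XLogRecPtr  waited_lsn;
      } WaiterInfo;

      /* pairingheap is a max-heap; report "greater" for the smaller LSN. */
      static int
      waiter_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
      {
          const WaiterInfo *wa = pairingheap_const_container(WaiterInfo, ph_node, a);
          const WaiterInfo *wb = pairingheap_const_container(WaiterInfo, ph_node, b);

          if (wa->waited_lsn < wb->waited_lsn)
              return 1;
          if (wa->waited_lsn > wb->waited_lsn)
              return -1;
          return 0;
      }

      /* On replay progress: pop everyone whose LSN is now replayed. */
      static void
      wake_waiters(pairingheap *heap, XLogRecPtr replayed_upto)
      {
          while (!pairingheap_is_empty(heap))
          {
              WaiterInfo *top = pairingheap_container(WaiterInfo, ph_node,
                                                      pairingheap_first(heap));

              if (top->waited_lsn > replayed_upto)
                  break;          /* least-LSN waiter not replayed yet */
              pairingheap_remove_first(heap);
              /* ... SetLatch() for this waiter in the real code ... */
          }
      }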
* Add error codes to some PANIC/FATAL error reports (Daniel Gustafsson, 2024-04-03)

  This adds errcodes to a set of PANIC and FATAL errors in xlog.c and
  relcache.c, which previously had no errcode at all set, in order to
  make fleet-wide analysis of error logs easier.  There are many more
  ereports/elogs left which could benefit from having an errcode, but
  this at least makes a dent in the issue.

  Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
  Discussion: https://postgr.es/m/CAN55FZ1k8LgLEqncPGmz_fWnrobV6bjABOTH4tOWta6xNcPQig@mail.gmail.com
* Implement pg_wal_replay_wait() stored procedure (Alexander Korotkov, 2024-04-02)

  pg_wal_replay_wait() is to be used on standby and specifies waiting for
  a specific WAL location to be replayed before starting the transaction.
  This option is useful when the user makes some data changes on the
  primary and needs a guarantee to see these changes on the standby.

  The queue of waiters is stored in a shared memory array sorted by LSN.
  During WAL replay, waiters whose LSNs have already been replayed are
  deleted from the shared memory array and woken up by setting their
  latches.

  pg_wal_replay_wait() needs to wait without any snapshot held.
  Otherwise, the snapshot could prevent the replay of WAL records,
  implying a kind of self-deadlock.  This is why it is only possible to
  implement pg_wal_replay_wait() as a procedure working in a non-atomic
  context, not a function.

  Catversion is bumped.

  Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
  Author: Kartyshov Ivan, Alexander Korotkov
  Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
  Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
* Revert "Custom reloptions for table AM"Alexander Korotkov2024-04-02
| | | | | | | | This reverts commit c95c25f9af4bc77f2f66a587735c50da08c12b37 due to multiple design issues spotted after commit. Reported-by: Jeff Davis Discussion: https://postgr.es/m/11550b536211d5748bb2865ed6cb3502ff073bf7.camel%40j-davis.com
* Use TidStore for dead tuple TIDs storage during lazy vacuum. (Masahiko Sawada, 2024-04-02)

  Previously, we used a simple array for storing dead tuple IDs during
  lazy vacuum, which had a number of problems:

  * The array used a single allocation and so was limited to 1GB.
  * The allocation was pessimistically sized according to table size.
  * Lookup with binary search was slow because of poor CPU cache and
    branch prediction behavior.

  This commit replaces that array with the TID store from commit
  30e144287a.

  Since the backing radix tree makes small allocations as needed, the 1GB
  limit is now gone.  Further, the total memory used is now often smaller
  by an order of magnitude or more, depending on the distribution of
  blocks and offsets.  These two features should make multiple rounds of
  heap scanning and index cleanup an extremely rare event.  TID lookup
  during index cleanup is also several times faster, even more so when
  index order is correlated with heap tuple order.

  Since there is no longer a predictable relationship between the number
  of dead tuples vacuumed and the space taken up by their TIDs, the
  number of tuples no longer provides any meaningful insights for users,
  nor is the maximum number predictable.  For that reason this commit
  also changes to byte-based progress reporting, with the relevant
  columns of pg_stat_progress_vacuum renamed accordingly to
  max_dead_tuple_bytes and dead_tuple_bytes.

  For parallel vacuum, both the TID store and supplemental information
  specific to vacuum are shared among the parallel vacuum workers.  As
  with the previous array, we don't take any locks on TidStore during
  parallel vacuum since writes are still only done by the leader process.

  Bump catalog version.

  Reviewed-by: John Naylor, (in an earlier version) Dilip Kumar
  Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
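  A hedged sketch of the TidStore usage pattern this implies; the
  signatures follow my reading of PostgreSQL 17's tidstore.h and should
  be treated as assumptions:

      #include "postgres.h"
      #include "access/tidstore.h"

      static void
      demo_tidstore(void)
      {
          /* no 1GB cap: the limit is just a byte budget now */
          TidStore   *ts = TidStoreCreateLocal(256 * 1024 * 1024, true);
          OffsetNumber offs[] = {1, 5, 9};
          ItemPointerData tid;

          /* record the dead offsets of block 10 in one call */
          TidStoreSetBlockOffsets(ts, (BlockNumber) 10, offs, lengthof(offs));

          /* index cleanup asks: does this index entry point to a dead tuple? */
          ItemPointerSet(&tid, 10, 5);
          if (TidStoreIsMember(ts, &tid))
          {
              /* ... delete the index entry ... */
          }

          TidStoreDestroy(ts);
      }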
* Introduce 'options' argument to heap_page_prune() (Heikki Linnakangas, 2024-04-02)

  Currently there is only one option, HEAP_PAGE_PRUNE_MARK_UNUSED_NOW,
  which replaces the old boolean argument, but upcoming patches will
  introduce at least one more.  Having a lot of boolean arguments makes
  it hard to see at the call sites what the arguments mean, so prefer a
  bitmask of options with human-readable names.

  Author: Melanie Plageman <melanieplageman@gmail.com>
  Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
  Discussion: https://www.postgresql.org/message-id/20240401172219.fngjosaqdgqqvg4e@liskov
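  An illustrative sketch of the options-bitmask pattern (the names here
  are demo placeholders; only HEAP_PAGE_PRUNE_MARK_UNUSED_NOW exists as
  of this commit):

      #define DEMO_PRUNE_MARK_UNUSED_NOW  (1 << 0)
      #define DEMO_PRUNE_SOMETHING_ELSE   (1 << 1)   /* hypothetical */

      static void
      demo_prune(int options)
      {
          if (options & DEMO_PRUNE_MARK_UNUSED_NOW)
          {
              /* ... mark dead items LP_UNUSED immediately ... */
          }
      }

  A call site now reads demo_prune(DEMO_PRUNE_MARK_UNUSED_NOW) rather
  than a bare demo_prune(true), which is the readability win the commit
  is after.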
* Handle non-chain tuples outside of heap_prune_chain() (Heikki Linnakangas, 2024-04-01)

  Handle dead branches of aborted HOT chains outside heap_prune_chain()
  as a separate phase.  This simplifies the logic in heap_prune_chain(),
  as well as allowing us to clean up more RECENTLY_DEAD -> DEAD chains.

  To accomplish this efficiently, partition tuples into HOT and non-HOT
  while first collecting visibility information for each tuple in
  heap_page_prune().  Then call heap_prune_chain() only on potential
  chain members.  Then mop up the leftover HOT tuples afterwards.

  As part of this, keep track of which items on the page have already
  been processed, in a 'processed' array.  This replaces the 'marked'
  array, which was only set for tuples marked for removal or redirection.
  The 'processed' array is updated also for items that are left
  unchanged, when we conclude that an item can be left unchanged.  At the
  end of pruning, every item on the page should be marked as processed in
  the array; an assertion is added for that.

  Author: Melanie Plageman <melanieplageman@gmail.com>
  Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
  Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
* Refactor heap_prune_chain() (Heikki Linnakangas, 2024-04-01)

  Keep track of the number of deleted tuples in PruneState and record
  this information when recording a tuple dead, unused or redirected.
  This removes a special case from the traversal and chain processing
  logic, as well as setting a precedent of recording the impact of prune
  actions in the record functions themselves.  This paradigm will be used
  in future commits which move tracking of additional statistics on
  pruning actions from lazy_scan_prune() to heap_prune_chain().

  Simplify heap_prune_chain()'s chain traversal logic by handling each
  case explicitly.  That is, do not attempt to share code when processing
  different types of chains.  For each category of chain, process it
  specifically and procedurally: first handling the root, then any
  intervening tuples, and, finally, the end of the chain.

  While we are at it, add a few new comments to heap_prune_chain()
  clarifying some special cases involving RECENTLY_DEAD tuples.

  Author: Melanie Plageman <melanieplageman@gmail.com>
  Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
* Minor refactoring in heap_page_prune (Heikki Linnakangas, 2024-04-01)

  Pass 'page', 'blockno' and 'maxoff' to heap_prune_chain() as arguments,
  so that it doesn't need to fetch them from the buffer.  This saves a
  few cycles per chain.

  Remove the "if (off_loc != NULL)" checks, and require the caller to
  pass a non-NULL 'off_loc'.  Pass a pointer to a dummy local variable
  when it's not needed.  Those checks are cheap, but it's still better to
  avoid them in the per-chain loops when we can do so easily.

  The CPU time savings from these changes are hardly measurable, but
  fewer instructions is good anyway, so why not.  I spotted the potential
  for these while reviewing Melanie Plageman's patch set to combine prune
  and freeze records.

  Discussion: https://www.postgresql.org/message-id/CAAKRu_abm2tHhrc0QSQa%3D%3DsHe%3DVA1%3Doz1dJMQYUOKuHmu%2B9Xrg%40mail.gmail.com
* Let table AM insertion methods control index insertion (Alexander Korotkov, 2024-03-30)

  Previously, the executor did index inserts unconditionally after
  calling the table AM interface methods tuple_insert() and
  multi_insert().  This commit introduces the new parameter
  insert_indexes for these two methods.  Setting '*insert_indexes' to
  true preserves the current logic.  Setting it to false indicates that
  the table AM cares about index inserts itself and doesn't want the
  caller to do that.

  Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
  Reviewed-by: Pavel Borisov, Matthias van de Meent, Mark Dilger
* Custom reloptions for table AM (Alexander Korotkov, 2024-03-30)

  Let table AM define custom reloptions for its tables.  This allows
  specifying AM-specific parameters via the WITH clause when creating a
  table.  The code may use some parts from prior work by Hao Wu.

  Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
  Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn
  Reviewed-by: Pavel Borisov, Matthias van de Meent
* Generalize relation analyze in table AM interface (Alexander Korotkov, 2024-03-30)

  Currently, there is just one algorithm for sampling tuples from a
  table, written in acquire_sample_rows().  A custom table AM can only
  redefine the way to get the next block/tuple by implementing the
  scan_analyze_next_block() and scan_analyze_next_tuple() API functions.

  This approach doesn't seem general enough.  For instance, it's unclear
  how to sample index-organized tables this way.  This commit allows a
  table AM to encapsulate the whole sampling algorithm (currently
  implemented in acquire_sample_rows()) into the relation_analyze() API
  function.

  Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
  Reviewed-by: Pavel Borisov, Matthias van de Meent
* Allow "internal" subtransactions in parallel mode.Tom Lane2024-03-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow use of BeginInternalSubTransaction() in parallel mode, so long as the subtransaction doesn't attempt to acquire an XID or increment the command counter. Given those restrictions, the other parallel processes don't need to know about the subtransaction at all, so this should be safe. The benefit is that it allows subtransactions intended for error recovery, such as pl/pgsql exception blocks, to be used in PARALLEL SAFE functions. Another reason for doing this is that the API of BeginInternalSubTransaction() doesn't allow reporting failure. pl/python for one, and perhaps other PLs, copes very poorly with an error longjmp out of BeginInternalSubTransaction(). The headline feature of this patch removes the only easily-triggerable failure case within that function. There remain some resource-exhaustion and similar cases, which we now deal with by promoting them to FATAL errors, so that callers need not try to clean up. (It is likely that such errors would leave us with corrupted transaction state inside xact.c, making recovery difficult if not impossible anyway.) Although this work started because of a report of a pl/python crash, we're not going to do anything about that in the back branches. Back-patching this particular fix is obviously not very wise. While we could contemplate some narrower band-aid, pl/python is already an untrusted language, so it seems okay to classify this as a "so don't do that" case. Patch by me, per report from Hao Zhang. Thanks to Robert Haas for review. Discussion: https://postgr.es/m/CALY6Dr-2yLVeVPhNMhuBnRgOZo1UjoTETgtKBx1B2gUi8yy+3g@mail.gmail.com
* Remove obsolete comment about VACUUM retrying pruning (Heikki Linnakangas, 2024-03-28)

  Commit 1ccc1e05ae removed the retry logic that the comment talked
  about.

  Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
  Discussion: https://www.postgresql.org/message-id/20240328015326.x5gnzsohl6j23b42@liskov
* Rethink create and attach APIs of shared TidStore. (Masahiko Sawada, 2024-03-28)

  Previously, the behavior of TidStoreCreate() was inconsistent between
  local and shared TidStore instances in terms of memory limitation.  For
  a local TidStore, a memory context was created with initial and maximum
  memory block sizes, as well as a minimum memory context size, based on
  the specified max_bytes values.  However, for a shared TidStore, the
  provided DSA area was used for TID storage.  Although commit bb952c8c8b
  allowed specifying the initial and maximum DSA segment sizes, callers
  would have needed to clamp their own limits, which was not consistent
  and user-friendly.

  With this commit, when creating a shared TidStore, a dedicated DSA area
  is created for TID storage instead of using a provided DSA area.  The
  initial and maximum DSA segment sizes are chosen based on the specified
  max_bytes.  Other processes can attach to the shared TidStore using the
  DSA area returned by the new TidStoreGetDSA() function and the handle
  returned by TidStoreGetHandle().  The created DSA has the same lifetime
  as the shared TidStore and is deleted when all processes detach from
  it.

  To improve clarity, the TidStoreCreate() function has been divided into
  two separate functions: TidStoreCreateLocal() and
  TidStoreCreateShared().

  Reviewed-by: John Naylor
  Discussion: https://postgr.es/m/CAD21AoAyc1j%3DBCdUqZfk6qbdjZ68UgRx1Gkpk0oah4K7S0Ri9g%40mail.gmail.com
* Fix some typos and grammar issues from commit 87985cc92522 (Alexander Korotkov, 2024-03-27)

  Reported-by: Alexander Lakhin
* Fix a calculation in TidStoreCreate(). (Masahiko Sawada, 2024-03-26)

  Since max_bytes is expected to be in bytes, not kilobytes, it should
  not be multiplied by 1024.  Introduced by 30e144287a.

  Reported-by: John Naylor, David Rowley
  Reviewed-by: John Naylor
  Discussion: https://postgr.es/m/CANWCAZZTE-14ofsucofTuhFsfuDGBNf%3DNZb22TMYT8bxA41oQQ%40mail.gmail.com
  Discussion: https://postgr.es/m/CAApHDvojg82NDaDEpj1WEZSbVTafj%3DDRmW%2BFrkBdW8ScL4OFxA%40mail.gmail.com
* Allow locking updated tuples in tuple_update() and tuple_delete() (Alexander Korotkov, 2024-03-26)

  Currently, in read committed transaction isolation mode (the default),
  we have the following sequence of actions when
  tuple_update()/tuple_delete() finds the tuple updated by a concurrent
  transaction.

  1. Attempt to update/delete the tuple with
     tuple_update()/tuple_delete(), which returns TM_Updated.
  2. Lock the tuple with tuple_lock().
  3. Re-evaluate the plan qual (recheck if we still need to update/delete
     and calculate the new tuple for update).
  4. Second attempt to update/delete the tuple with
     tuple_update()/tuple_delete().  This attempt should be successful,
     since the tuple was previously locked.

  This commit eliminates step 2 by taking the lock during the first
  tuple_update()/tuple_delete() call.  The heap table access method saves
  some effort by checking the updated tuple once instead of twice.
  Future undo-based table access methods, which will start from the
  latest row version, can immediately place a lock there.

  Also, this commit makes tuple_update()/tuple_delete() optionally save
  the old tuple into a dedicated slot.  That saves the effort of
  re-fetching tuples in certain cases.  The code in nodeModifyTable.c is
  simplified by removing the nested switch/case.

  Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
  Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
  Reviewed-by: Andres Freund, Chris Travers
* Merge prune, freeze and vacuum WAL record formats (Heikki Linnakangas, 2024-03-25)

  The new combined WAL record is now used for pruning, freezing, and the
  2nd pass of vacuum.  This is in preparation for changing VACUUM to
  write a combined prune+freeze record per page, instead of two separate
  records.  The new WAL record format now supports that, but the code
  still always writes separate records for pruning and freezing.

  This reserves separate XLOG_HEAP2_* info codes for when the pruning
  record is emitted for on-access pruning or VACUUM, per Peter
  Geoghegan's suggestion.  The record format is identical, but having
  separate info codes makes it easier to analyze pruning and vacuuming
  with pg_waldump.

  The function to emit the new WAL record, log_heap_prune_and_freeze(),
  is in pruneheap.c.  The existing heap_log_freeze_plan() and its
  subroutines are moved to pruneheap.c without changes, to keep them
  together with log_heap_prune_and_freeze().

  Author: Melanie Plageman <melanieplageman@gmail.com>
  Discussion: https://www.postgresql.org/message-id/CAAKRu_azf-zH%3DDgVbquZ3tFWjMY1w5pO8m-TXJaMdri8z3933g@mail.gmail.com
  Discussion: https://www.postgresql.org/message-id/CAAKRu_b2oE4GL%3Dq4g9mcByS9yT7wTQvEH9OLpabj28e%2BWKFi2A@mail.gmail.com
* Fix an oversight in refactoring in 06b10f80ba4. (Alexander Korotkov, 2024-03-22)

  It went against the intended optimization of skipping the prechecking
  of keys on the first page of range queries, whose purpose is to avoid
  affecting the performance of point queries.

  Reported-by: Anton Melnikov
  Discussion: https://postgr.es/m/30cd7524-b9f1-4cf8-9c4a-223eb2e34441%40postgrespro.ru
  Author: Pavel Borisov
* Allow table AM tuple_insert() method to return the different slot (Alexander Korotkov, 2024-03-21)

  This allows a table AM to return a native tuple slot even if
  VirtualTupleTableSlot is given as input.  Native tuple slots have
  knowledge about system attributes, which could be accessed in the
  future.  The table_multi_insert() method can already modify the input
  'slots' array.

  Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
  Reviewed-by: Matthias van de Meent, Mark Dilger, Pavel Borisov
  Reviewed-by: Nikita Malakhov, Japin Li
* Allow table AM to store complex data structures in rd_amcache (Alexander Korotkov, 2024-03-21)

  The new table AM method free_rd_amcache is responsible for freeing all
  the memory related to rd_amcache and setting rd_amcache to NULL.  If
  the new method is not specified, we still assume rd_amcache to be a
  single chunk of memory, which can simply be pfree'd.

  Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
  Reviewed-by: Matthias van de Meent, Mark Dilger, Pavel Borisov
  Reviewed-by: Nikita Malakhov, Japin Li
* Add TIDStore, to store sets of TIDs (ItemPointerData) efficiently. (Masahiko Sawada, 2024-03-21)

  TIDStore is a data structure designed to efficiently store large sets
  of TIDs.  For TID storage, it employs a radix tree, where the key is a
  block number, and the value is a bitmap representing offset numbers.
  The TIDStore can be created on a DSA area and used by multiple backend
  processes simultaneously.

  There are potential future users such as tidbitmap.c, though it's very
  likely the interface will need to evolve as we come to understand the
  needs of different kinds of users.  For example, we can support
  updating the offset bitmap of existing values.

  Currently, the TIDStore is not used for anything yet, aside from the
  test code.  But an upcoming patch will use it.

  This includes a unit test module, in src/test/modules/test_tidstore.

  Co-authored-by: John Naylor
  Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
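  A conceptual sketch of the encoding described above -- radix-tree key =
  block number, value = offset bitmap; a standalone illustration, not the
  actual radix-tree code:

      #include <stdint.h>

      #define DEMO_MAX_OFFSET  291      /* cap on offsets per 8kB heap page */

      typedef struct OffsetBitmap
      {
          uint64_t    words[DEMO_MAX_OFFSET / 64 + 1];
      } OffsetBitmap;

      /* Storing TID (block, offset) sets one bit in the block's value. */
      static void
      bitmap_set(OffsetBitmap *bm, unsigned offset)
      {
          bm->words[offset / 64] |= UINT64_C(1) << (offset % 64);
      }

      static int
      bitmap_test(const OffsetBitmap *bm, unsigned offset)
      {
          return (bm->words[offset / 64] >> (offset % 64)) & 1;
      }

  The radix tree then maps BlockNumber -> OffsetBitmap, so TID membership
  is one tree descent plus one bit test, with no binary search.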
* Remove unused PruneState member rel (Heikki Linnakangas, 2024-03-20)

  PruneState->rel is no longer being used, so just remove it.

  Author: Melanie Plageman <melanieplageman@gmail.com>
  Discussion: https://www.postgresql.org/message-id/20240320013602.6sypr4cx6sefpemg@liskov
* Reorganize heap_page_prune() function comment (Heikki Linnakangas, 2024-03-20)

  heap_page_prune()'s function header comment didn't explain the
  parameters in the same order they appear in the function.  Fix that.

  Author: Melanie Plageman <melanieplageman@gmail.com>
  Discussion: https://www.postgresql.org/message-id/20240320013602.6sypr4cx6sefpemg@liskov
* Separate equalRowTypes() from equalTupleDescs() (Peter Eisentraut, 2024-03-17)

  This introduces a new function equalRowTypes() that is effectively a
  subset of equalTupleDescs() but only compares the number of attributes
  and attribute name, type, typmod, and collation.  This is enough for
  most existing uses of equalTupleDescs(), which are changed to use the
  new function.  The only remaining callers of equalTupleDescs() are
  those that really want to check the full tuple descriptor as such,
  without concern about record or row or record type semantics.

  The existing function hashTupleDesc() is renamed to hashRowType(),
  because it now corresponds more to equalRowTypes().

  The purpose of this change is to be clearer about the semantics of the
  equality asked for by each caller.  (At least one caller had a comment
  that questioned whether equalTupleDescs() was too restrictive.)  For
  example, 4f622503d6d removed attstattarget from the tuple descriptor
  structure.  It was not fully clear at the time how this should affect
  equalTupleDescs().  Now the answer is clear: By their own definitions,
  equalRowTypes() does not care, and equalTupleDescs() just compares
  whatever is in the tuple descriptor but does not care why it is in
  there.

  Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
  Discussion: https://www.postgresql.org/message-id/flat/f656d6d9-6660-4518-a006-2f65cafbebd1%40eisentraut.org
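  A hedged sketch of the semantics equalRowTypes() is described as having
  -- attribute count plus each attribute's name, type, typmod, and
  collation -- as an illustration, not the committed function body:

      #include "postgres.h"
      #include "access/tupdesc.h"

      static bool
      demo_equal_row_types(TupleDesc d1, TupleDesc d2)
      {
          if (d1->natts != d2->natts)
              return false;

          for (int i = 0; i < d1->natts; i++)
          {
              Form_pg_attribute a1 = TupleDescAttr(d1, i);
              Form_pg_attribute a2 = TupleDescAttr(d2, i);

              if (strcmp(NameStr(a1->attname), NameStr(a2->attname)) != 0 ||
                  a1->atttypid != a2->atttypid ||
                  a1->atttypmod != a2->atttypmod ||
                  a1->attcollation != a2->attcollation)
                  return false;
          }
          /* unlike equalTupleDescs(), constraints, defaults, etc. are ignored */
          return true;
      }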
* Remove redundant snapshot copying from parallel leader to workers (Heikki Linnakangas, 2024-03-14)

  The parallel query infrastructure copies the leader backend's active
  snapshot to the worker processes.  But the BitmapHeapScan node also had
  bespoke code to pass the snapshot from leader to worker.  That was
  redundant, so remove it.

  The removed code was analogous to the snapshot serialization in
  table_parallelscan_initialize(), but that was the wrong role model.  A
  parallel bitmap heap scan is more like an independent non-parallel
  bitmap heap scan in each parallel worker as far as the table AM is
  concerned, because the coordination is done in nodeBitmapHeapscan.c,
  and the table AM doesn't need to know anything about it.

  This relies on the assumption that es_snapshot == GetActiveSnapshot().
  That's not a new assumption; things would get weird if you used the
  QueryDesc's snapshot for visibility checks in the scans, but the active
  snapshot for evaluating quals, for example.  This could use some
  refactoring and cleanup, but for now, just add some assertions.

  Reviewed-by: Dilip Kumar, Robert Haas
  Discussion: https://www.postgresql.org/message-id/5f3b9d59-0f43-419d-80ca-6d04c07cf61a@iki.fi