path: root/src/backend/access
Commit message (Author, Age)
* Fix an assortment of typos (David Rowley, 2024-05-04)
| | | | | Author: Alexander Lakhin Discussion: https://postgr.es/m/ae9f2fcb-4b24-5bb0-4240-efbbbd944ca1@gmail.com
* Fix parallel vacuum buffer usage reporting. (Masahiko Sawada, 2024-05-01)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | A parallel worker's buffer usage is accumulated to its pgBufferUsage and then is accumulated into the leader's one at the end of the parallel vacuum. However, since the leader process used to use dedicated VacuumPage{Hit, Miss, Dirty} globals for the buffer usage reporting, the worker's buffer usage was not included, leading to an incorrect buffer usage report. To fix the problem, this commit makes vacuum use pgBufferUsage instruments for buffer usage reporting instead of VacuumPage{Hit, Miss, Dirty} globals. These global variables are still used by ANALYZE command and autoanalyze. This also fixes the buffer usage report of vacuuming on temporary tables, since the buffers dirtied by MarkLocalBufferDirty() were not tracked by the VacuumPageDirty variable. Parallel vacuum was introduced in 13, but the buffer usage reporting for VACUUM command with the VERBOSE option was implemented in 15. So backpatch to 15. Reported-by: Anthonin Bonnefoy Author: Anthonin Bonnefoy Reviewed-by: Alena Rybakina, Masahiko Sawada Discussion: https://postgr.es/m/CAO6_XqrQk+QZQcYs_C6nk0cMfHuUWk85vT9CrcA1NffFbAVE2A@mail.gmail.com Backpatch-through: 15
* Avoid repeating loads of frozen ID values. (Noah Misch, 2024-04-29)
| | | | | | | | | Repeating loads of inplace-updated fields tends to cause bugs like the one from the previous commit. While there's no bug to fix in these code sites, adopt the load-once style. This improves the chance of future copy/paste finding the safe style. Discussion: https://postgr.es/m/20240423003956.e7.nmisch@google.com
* Fix duplicated consecutive words in comments (David Rowley, 2024-04-28)
| | | | | | | Also, fix a comment incorrectly referencing the "streaming read API". This was renamed to "read stream" shortly before being committed. Discussion: https://postgr.es/m/CAApHDvq-2Zdqytm_Hf3RmVf0qg5PS9jTFAJ5QTc9xH9pwvwDTA@mail.gmail.com
* Remove unneeded nbtree array preprocessing assert. (Peter Geoghegan, 2024-04-22)
| | | | | | | | | | | | | | | | | | | | | | | | Certain cases involving the use of cursors had assertion failures within _bt_preprocess_keys's recently added no-op return path. The assertion in question made the faulty assumption that a second or third call to _bt_preprocess_keys (within the same btrescan) could only happen when another scheduled primitive index scan was just about to begin. It would be possible to address the problem by only allowing scans that have array keys to take the new no-op path, forcing affected cases to perform redundant preprocessing work. It seems simpler to just remove the assertion, and reframe the no-op path as a more general mechanism. Take this simpler approach. The important underlying principle is that we only need to perform preprocessing once per btrescan (at most). This is expected regardless of whether or not the scan happens to have array keys. Oversight in commit 1b134ca5, which enhanced nbtree ScalarArrayOp execution. Reported-By: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/ef0f7c8b-a6fa-362e-6fd6-054950f947ca@gmail.com
* Remove overzealous array element type assertion. (Peter Geoghegan, 2024-04-21)
| | | | | | | | | | | This led to spurious assertion failures in certain scenarios involving pseudo types. Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution. Reported-By: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAMbWs48f5rDOwxaT76Zd40m7n9iGZQcjEk7vG_5p3YWNh6oPfA@mail.gmail.com
* Add missing index_insert_cleanup calls (Tomas Vondra, 2024-04-19)
| | | | | | | | | | | | | | | | | | | | | | | The optimization for inserts into BRIN indexes added by c1ec02be1d79 relies on a cache that needs to be explicitly released after calling index_insert(). The commit however failed to invoke the cleanup in validate_index(), which calls index_insert() indirectly through table_index_validate_scan(). After inspecting index_insert() callers, it seems unique_key_recheck() is missing the call too. Fixed by adding the two missing index_insert_cleanup() calls. The commit also makes two additional improvements. The aminsertcleanup() signature is modified to have the index as the first argument, to make it more like the other AM callbacks. And the aminsertcleanup() callback is invoked even if the ii_AmCache is NULL, so that it can decide if the cleanup is necessary. Author: Alvaro Herrera, Tomas Vondra Reported-by: Alexander Lakhin Discussion: https://postgr.es/m/202401091043.e3nrqiad6gb7@alvherre.pgsql
* Fix a couple typos in BRIN code (Tomas Vondra, 2024-04-19)
| | | | | | | | Typos introduced by commits c1ec02be1d79, b43757171470 and dae761a87eda. Author: Alvaro Herrera Reported-by: Alexander Lakhin Discussion: https://postgr.es/m/202401091043.e3nrqiad6gb7@alvherre.pgsql
* Fix typos and duplicate words (Daniel Gustafsson, 2024-04-18)
| | | | | | | | | | | | This fixes various typos, duplicated words, and tiny bits of whitespace mainly in code comments but also in docs. Author: Daniel Gustafsson <daniel@yesql.se> Author: Heikki Linnakangas <hlinnaka@iki.fi> Author: Alexander Lakhin <exclusion@gmail.com> Author: David Rowley <dgrowleyml@gmail.com> Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/3F577953-A29E-4722-98AD-2DA9EFF2CBB8@yesql.se
* Don't try to fix eliminated nbtree array scan keys. (Peter Geoghegan, 2024-04-18)
| | | | | | | | | | | | | | | | | Preprocessing for nbtree index scans allowed array "input" scan keys already marked eliminated during array-specific preprocessing to be "fixed up" during preprocessing proper. This allowed eliminated scan keys on DESC index columns to spuriously have their strategy commuted, causing assertion failures. To fix, teach _bt_fix_scankey_strategy to ignore these scan keys. This brings it in line with its only caller, _bt_preprocess_keys. Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution. Reported-By: Donghang Lin <donghanglin@gmail.com> Discussion: https://postgr.es/m/CAA=D8a2sHK6CAzZ=0CeafC-Y-MFXbYxnRSHvZTi=+JHu6kAa8Q@mail.gmail.com
* Refactoring for CommitTransactionCommand()/AbortCurrentTransaction() (Alexander Korotkov, 2024-04-18)
| | | | | | | | | | | | | | | fefd9a3fed turned tail recursion of CommitTransactionCommand() and AbortCurrentTransaction() into iteration. However, it splits the handling of cases between different functions. This commit puts the handling of all the cases into AbortCurrentTransactionInternal() and CommitTransactionCommandInternal(). Now CommitTransactionCommand() and AbortCurrentTransaction() are just doing the repeated calls of internal functions. Reported-by: Andres Freund Discussion: https://postgr.es/m/20240415224834.w6piwtefskoh32mv%40awork3.anarazel.de Author: Andres Freund
* Cleanup parallel BRIN index build code (Tomas Vondra, 2024-04-17)
| | | | | | | | | | | | | | | | | | Commit b43757171470 added support for parallel builds of BRIN indexes, using code similar to BTREE. But there were a couple of unnecessary differences, particularly in how the leader waits for the workers, and merges the results. So remove these, to make the code more similar. The leader never waited on the workersdonecv condition variable, but simply called WaitForParallelWorkersToFinish() in _brin_end_parallel() and then merged the per-worker results. This worked correctly, but it seems better to do the wait and merge before _brin_end_parallel(). This commit moves the relevant code to _brin_parallel_heapscan/merge(), which means _brin_end_parallel() remains responsible only for exiting the parallel mode and accumulating WAL usage data. Discussion: https://postgr.es/m/3733d042-71e1-6ae6-5fac-00c12db62db6@enterprisedb.com
* Fix nbtree "deduce NOT NULL" scan key comment. (Peter Geoghegan, 2024-04-16)
| | | | Oversight in commit c9c0589fda.
* revert: Generalize relation analyze in table AM interface (Alexander Korotkov, 2024-04-16)
| | | | | | This commit reverts 27bc1772fc and dd1f6b0c17. Per review by Andres Freund. Discussion: https://postgr.es/m/20240415201057.khoyxbwwxfgzomeo%40awork3.anarazel.de
* Use the correct PG_DETOAST_DATUM macro in BRIN (Tomas Vondra, 2024-04-14)
| | | | | | | | | | | | Commit 6bcda4a721 replaced PG_DETOAST_DATUM with PG_DETOAST_DATUM_PACKED in two BRIN output functions, for minmax-multi and bloom opclasses. But this is incorrect - the code is accessing the data through structs that already include a 4B header, so the detoast needs to match that. But the PACKED macro may keep the 1B header, which means the struct fields will point to incorrect data. Backpatch-through: 16 Discussion: https://postgr.es/m/1df00a66-db5a-4e66-809a-99b386a06d86%40enterprisedb.com
* Update nbits_set in brin_bloom_union (Tomas Vondra, 2024-04-14)
| | | | | | | | | | | | | | | | | | | Properly update the number of bits set in the bitmap after merging the filters in brin_bloom_union. This is mostly harmless, as the counter is used only in the output function, which means pageinspect may show incorrect information about the BRIN summary. The counter does not affect correctness. Discovered while adding a regression test comparing indexes built with and without parallelism. The parallel index builds exercise the union procedure when merging results from workers, which is otherwise very hard to do in a test. Which is why this went unnoticed until now. Backpatch through 14, where the BRIN bloom opclasses were introduced. Backpatch-through: 14 Discussion: https://postgr.es/m/1df00a66-db5a-4e66-809a-99b386a06d86%40enterprisedb.com
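Why the counter has to be recomputed after a union, rather than summed, can be seen in a small sketch. This assumes a toy 64-bit filter; `DemoBloom` and `demo_bloom_union()` are hypothetical and do not reflect the BRIN bloom summary layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy bloom filter: a 64-bit bitmap plus a cached count of set bits. */
typedef struct
{
    uint32_t nbits_set;
    uint8_t  bits[8];
} DemoBloom;

/* Count the set bits in a byte array. */
static uint32_t
count_bits(const uint8_t *data, size_t nbytes)
{
    uint32_t total = 0;

    for (size_t i = 0; i < nbytes; i++)
    {
        uint8_t b = data[i];

        while (b)
        {
            b &= (uint8_t) (b - 1);   /* clear the lowest set bit */
            total++;
        }
    }
    return total;
}

/*
 * Merge src into dst. Bits set in both filters overlap, so the cached
 * counter cannot be obtained by adding the two counters; it is recounted
 * from the merged bitmap instead.
 */
void
demo_bloom_union(DemoBloom *dst, const DemoBloom *src)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < sizeof(dst->bits); i++)
        dst->bits[i] |= src->bits[i];
    dst->nbits_set = count_bits(dst->bits, sizeof(dst->bits));
}
```

In the sketch, leaving `nbits_set` untouched after the OR would not change which lookups match; only the reported count would be stale, which mirrors the "mostly harmless" description above.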
* Revert: Implement pg_wal_replay_wait() stored procedure (Alexander Korotkov, 2024-04-11)
| | | | | | | This commit reverts 06c418e163, e37662f221, bf1e650806, 25f42429e2, ee79928441, and 74eaf66f98 per review by Heikki Linnakangas. Discussion: https://postgr.es/m/b155606b-e744-4218-bda5-29379779da1a%40iki.fi
* Revert: Allow table AM to store complex data structures in rd_amcache (Alexander Korotkov, 2024-04-11)
| | | | | | This commit reverts 02eb07ea89 per review by Andres Freund. Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
* Revert: Allow table AM tuple_insert() method to return the different slot (Alexander Korotkov, 2024-04-11)
| | | | | | This commit reverts c35a3fb5e0 per review by Andres Freund. Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
* Revert: Allow locking updated tuples in tuple_update() and tuple_delete() (Alexander Korotkov, 2024-04-11)
| | | | | | This commit reverts 87985cc925 and 818861eb57 per review by Andres Freund. Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
* Revert: Let table AM insertion methods control index insertion (Alexander Korotkov, 2024-04-11)
| | | | | | This commit reverts b1484a3f19 per review by Andres Freund. Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
* Revert: Custom reloptions for table AM (Alexander Korotkov, 2024-04-11)
| | | | | | This commit reverts 9bd99f4c26 and 422041542f per review by Andres Freund. Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
* Fix inconsistency with replay of hash squeeze record for clean buffers (Michael Paquier, 2024-04-11)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | aa5edbe379d6 tweaked _hash_freeovflpage() so that the write buffer's LSN is updated only when necessary, when REGBUF_NO_CHANGE is not used. The replay code was not consistent with that, causing the write buffer's LSN to be updated and its page to be marked as dirty even if the buffer was registered in a "clean" state. This was possible for the case of a squeeze record when there are no tuples to add to the write buffer, for (is_prim_bucket_same_wrt && !is_prev_bucket_same_wrt). I have performed some validation of this commit with wal_consistency_checking and a change in WAL that logs REGBUF_NO_CHANGE to a new BKPIMAGE_*. Thanks to that, it is possible to know at replay if a buffer was clean when it was registered, then cross-checked the LSN of the "clean" page copy coming from WAL with the LSN of the block once the record has been replayed. This eats one bit in bimg_info, which is not acceptable to be integrated as-is, but it could become handy in the future. I didn't spot other areas than the one fixed by this commit to the extent of what the main regression test suite covers. As this is an oversight in aa5edbe379d6, no backpatch is required. Reported-by: Zubeyr Eryilmaz Author: Hayato Kuroda Reviewed-by: Amit Kapila, Michael Paquier Discussion: https://postgr.es/m/ZbyVVG_7eW3YD5-A@paquier.xyz
* Get rid of anonymous struct (John Naylor, 2024-04-09)
| | | | | | | | | | | | | | This is a C11 feature, and we require C99. While at it, go the further step and get rid of the surrounding union (with uintptr_t) entirely, as there is currently no use case for this file to access the header of BlocktableEntry as a uintptr_t, and there are no additional alignment requirements. The least invasive way seems to be to transfer the old union name to this struct. Reported by Pavel Borisov and Andres Freund, per buildfarm member mylodon Reviewed by Pavel Borisov Discussion: https://postgr.es/m/CALT9ZEH11NYV8AOzKb1bWhCf6J0H=H31f0MgT9xX+HdqvcA1rw@mail.gmail.com
* Teach radix tree to embed values at runtime (John Naylor, 2024-04-08)
| | | | | | | | | | | | | | | | | | | Previously, the decision to store values in leaves or within the child pointer was made at compile time, with variable length values using leaves by necessity. This commit allows introspecting the length of variable length values at runtime for that decision. This requires the ability to tell whether the last-level child pointer is actually a value, so we use a pointer tag in the lowest level bit. Use this in TID store. This entails adding a byte to the header to reserve space for the tag. Commit f35bd9bf3 stores up to three offsets within the header with no bitmap, and now the header can be embedded as above. This reduces worst-case memory usage when TIDs are sparse. Reviewed (in an earlier version) by Masahiko Sawada Discussion: https://postgr.es/m/CANWCAZYw+_KAaUNruhJfE=h6WgtBKeDG32St8vBJBEY82bGVRQ@mail.gmail.com Discussion: https://postgr.es/m/CAD21AoBci3Hujzijubomo1tdwH3XtQ9F89cTNQ4bsQijOmqnEw@mail.gmail.com
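The idea of using the lowest bit of a child pointer to tell an embedded value from a real pointer can be sketched as below. This is an illustrative simplification, not the radix tree's layout: the `Slot` type and `slot_*` helpers are hypothetical, and the sketch shifts the value left by one bit instead of reserving a tag byte in a header as the commit describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A slot holds either an aligned pointer or a small embedded value. */
typedef uintptr_t Slot;

#define SLOT_VALUE_TAG ((uintptr_t) 1)

/* Store a pointer; aligned allocations always have a zero low bit. */
static inline Slot
slot_from_pointer(void *leaf)
{
    assert(((uintptr_t) leaf & SLOT_VALUE_TAG) == 0);
    return (uintptr_t) leaf;
}

/* Embed a value: shift it up and set the tag bit. */
static inline Slot
slot_from_value(uintptr_t value)
{
    /* the top bit would be lost by the shift, so it must be clear */
    assert((value >> (sizeof(uintptr_t) * 8 - 1)) == 0);
    return (value << 1) | SLOT_VALUE_TAG;
}

static inline bool
slot_is_value(Slot slot)
{
    return (slot & SLOT_VALUE_TAG) != 0;
}

static inline uintptr_t
slot_get_value(Slot slot)
{
    assert(slot_is_value(slot));
    return slot >> 1;
}

static inline void *
slot_get_pointer(Slot slot)
{
    assert(!slot_is_value(slot));
    return (void *) slot;
}
```

The check costs one bit test at the last level, in exchange for skipping a separate leaf allocation whenever the value is small enough to embed.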
* Teach TID store to skip bitmap for small numbers of offsets (John Naylor, 2024-04-08)
| | | | | | | | | | | | The header portion of BlocktableEntry has enough padding space for an array of 3 offsets (1 on 32-bit platforms). Use this space instead of having a sparse bitmap array. This will take up a constant amount of space no matter what the offsets are. Reviewed (in an earlier version) by Masahiko Sawada Discussion: https://postgr.es/m/CANWCAZYw+_KAaUNruhJfE=h6WgtBKeDG32St8vBJBEY82bGVRQ@mail.gmail.com Discussion: https://postgr.es/m/CAD21AoBci3Hujzijubomo1tdwH3XtQ9F89cTNQ4bsQijOmqnEw@mail.gmail.com
* Provide a way block-level table AMs could re-use acquire_sample_rows() (Alexander Korotkov, 2024-04-08)
| | | | | | | | | | While keeping the API the same, this commit provides a way for block-level table AMs to re-use existing acquire_sample_rows() by providing custom callbacks for getting the next block and the next tuple. Reported-by: Andres Freund Discussion: https://postgr.es/m/20240407214001.jgpg5q3yv33ve6y3%40awork3.anarazel.de Reviewed-by: Pavel Borisov
* Fill CommonRdOptions with default values in extract_autovac_opts() (Alexander Korotkov, 2024-04-08)
| | | | | | Reported-by: Thomas Munro Reported-by: Pavel Borisov Discussion: https://postgr.es/m/CA%2BhUKGLZzLR50RBvuqOO3MZ%3DF54ETz-rTp1PDX9uDGP_GqyYqA%40mail.gmail.com
* Custom reloptions for table AM (Alexander Korotkov, 2024-04-08)
| | | | | | | | | | | | | | | | | | Let table AM define custom reloptions for its tables. This allows specifying AM-specific parameters by the WITH clause when creating a table. The reloptions, which could be used outside of table AM, are now extracted into the CommonRdOptions data structure. At the table AM's discretion, these options can be specified directly by a user or calculated in some way. The new test module test_tam_options evaluates the ability to set up custom reloptions and calculate fields of CommonRdOptions based on them. The code may use some parts from prior work by Hao Wu. Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn Reviewed-by: Pavel Borisov, Matthias van de Meent, Jess Davis
* Use bump context for TID bitmaps stored by vacuum (John Naylor, 2024-04-08)
| | | | | | | | | | | | | | | Vacuum does not pfree individual entries, and only frees the entire storage space when finished with it. This allows using a bump context, eliminating the chunk header in each leaf allocation. Most leaf allocations will be 16 to 32 bytes, so that's a significant savings. TidStoreCreateLocal gets a boolean parameter to indicate that the created store is insert-only. This requires a separate tree context for iteration, since we free the iteration state after iteration completes. Discussion: https://postgr.es/m/CANWCAZac%3DpBePg3rhX8nXkUuaLoiAJJLtmnCfZsPEAS4EtJ%3Dkg%40mail.gmail.com Discussion: https://postgr.es/m/CANWCAZZQFfxvzO8yZHFWtQV+Z2gAMv1ku16Vu7KWmb5kZQyd1w@mail.gmail.com
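The saving comes from the allocation pattern: if nothing is ever freed individually, an allocator can hand out consecutive slices of one buffer and keep no per-chunk header. A minimal sketch of that pattern follows, assuming a caller-supplied, suitably aligned buffer; `BumpArena` and the `bump_*` functions are hypothetical and are not PostgreSQL's memory context API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A bump arena: one buffer and a high-water mark, no per-chunk headers. */
typedef struct
{
    unsigned char *base;
    size_t         capacity;
    size_t         used;
} BumpArena;

void
bump_init(BumpArena *arena, void *buffer, size_t capacity)
{
    assert(arena != NULL && buffer != NULL);
    arena->base = buffer;
    arena->capacity = capacity;
    arena->used = 0;
}

/*
 * Allocate size bytes, rounded up to a multiple of 8 so the next chunk
 * stays aligned. Returns NULL when the arena is full. There is no
 * matching free function for single chunks.
 */
void *
bump_alloc(BumpArena *arena, size_t size)
{
    size_t rounded = (size + 7) & ~(size_t) 7;
    void  *chunk;

    assert(arena != NULL);
    if (rounded < size || rounded > arena->capacity - arena->used)
        return NULL;
    chunk = arena->base + arena->used;
    arena->used += rounded;
    return chunk;
}

/* Release everything at once by resetting the high-water mark. */
void
bump_reset(BumpArena *arena)
{
    assert(arena != NULL);
    arena->used = 0;
}
```

With leaf allocations of 16 to 32 bytes, as the message notes, even a small per-chunk header would be a large fraction of each allocation, which is what this layout avoids.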
* Remove references to old function name (Andres Freund, 2024-04-07)
| | | | | | | | | | | In a97bbe1f1df I accidentally referenced heapgetpage(), both in a function name and a comment. But since 44086b09753 the relevant function is named heap_prepare_pagescan(). Rename the new function to page_collect_tuples(). Reported-by: Melanie Plageman <melanieplageman@gmail.com> Reported-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20240407172615.cocrsvboqm3ttqe4@awork3.anarazel.de Discussion: https://postgr.es/m/CAApHDvp4SniHopTrVeKWcEvNXFtdki0utAvO=5R7H6TNhtULRQ@mail.gmail.com
* Fix alignment of stack variable (John Naylor, 2024-04-08)
| | | | | | | | Declare with union similar to PGAlignedBlock. Report and fix by Andres Freund Discussion: https://postgr.es/m/20240407190731.izm3mdazednrsiqk%40awork3.anarazel.de
* Remove redundant nbtree preprocessing assertions. (Peter Geoghegan, 2024-04-07)
| | | | | | | | One of the assertions was the subject of a false positive complaint from Coverity, but none of the assertions added much, so get rid of them. Reported-By: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/3000247.1712537309@sss.pgh.pa.us
* Use streaming I/O in ANALYZE. (Thomas Munro, 2024-04-08)
| | | | | | | | | | | | | | The ANALYZE command prefetches and reads sample blocks chosen by a BlockSampler algorithm. Instead of calling [Prefetch|Read]Buffer() for each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0. Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
* Use conditional variable to wait for next MultiXact offset (Alvaro Herrera, 2024-04-07)
| | | | | | | | | | | | | | | In one multixact.c edge case, we need a mechanism to wait for one multixact offset to be written before being allowed to read the next one. We used to handle this case by sleeping for one millisecond and retrying, but such sleeps have been reported as problematic in production cases. We can avoid the problem by using a condition variable: readers sleep on it and then every creator of multixacts broadcasts into the CV when creation is sufficiently far along. Author: Kyotaro Horiguchi <horikyotajntt@gmail.com> Reviewed-by: Andrey Borodin <amborodin@acm.org> Discussion: https://postgr.es/m/47A598F4-B4E7-4029-8FEC-A06A6C3CB4B5@yandex-team.ru Discussion: https://postgr.es/m/20200515.090333.24867479329066911.horikyota.ntt
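The difference between "sleep one millisecond and retry" and waiting on a condition variable can be sketched as follows. The sketch uses POSIX threads as a stand-in and is not PostgreSQL's own condition variable machinery; `OffsetSlot` and the `offset_slot_*` functions are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical shared state: has the next offset been written yet? */
typedef struct
{
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    bool            offset_written;
} OffsetSlot;

void
offset_slot_init(OffsetSlot *slot)
{
    assert(slot != NULL);
    pthread_mutex_init(&slot->lock, NULL);
    pthread_cond_init(&slot->cv, NULL);
    slot->offset_written = false;
}

/*
 * Reader side: block until the writer has published the offset. The
 * predicate is rechecked in a loop because wakeups can be spurious.
 */
void
offset_slot_wait(OffsetSlot *slot)
{
    pthread_mutex_lock(&slot->lock);
    while (!slot->offset_written)
        pthread_cond_wait(&slot->cv, &slot->lock);
    pthread_mutex_unlock(&slot->lock);
}

/* Writer side: publish the offset and wake every waiting reader. */
void
offset_slot_publish(OffsetSlot *slot)
{
    pthread_mutex_lock(&slot->lock);
    slot->offset_written = true;
    pthread_cond_broadcast(&slot->cv);
    pthread_mutex_unlock(&slot->lock);
}
```

A reader that arrives after the offset was published returns immediately, and one that arrives earlier is woken by the broadcast instead of polling on a timer.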
* Avoid extra lookups with nbtree array inequalities. (Peter Geoghegan, 2024-04-07)
| | | | | | | | | | | | | | | | | | | nbtree index scans with SAOP inequalities (but no SAOP equalities) performed extra ORDER proc lookups for any remaining equality strategy scan keys. This could waste cycles, and caused assertion failures. Keeping around a separate ORDER proc is only necessary for a scan's non-array/non-SAOP equality scan keys when the scan has at least one other SAOP equality strategy key (a SAOP inequality shouldn't count). To fix, replace _bt_preprocess_array_keys_final's assertion with a test that makes the function return early when the scan has no SAOP equality scan keys. Oversight in commit 1b134ca5, which enhanced nbtree ScalarArrayOp execution. Reported-By: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/0539d3d3-a402-0a49-ed5e-26429dffc4bd@gmail.com
* Use streaming I/O in sequential scans. (Thomas Munro, 2024-04-08)
| | | | | | | | | | Instead of calling ReadBuffer() for each block, heap sequential scans and TID range scans now use the streaming API introduced in b5a9b18cd0. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg%3DgEQ%40mail.gmail.com
* Add XLogCtl->logInsertResult (Alvaro Herrera, 2024-04-07)
| | | | | | | | | | | | | | | | | | This tracks the position of WAL that's been fully copied into WAL buffers by all processes emitting WAL. (For some reason we call that "WAL insertion"). This is updated using atomic monotonic advance during WaitXLogInsertionsToFinish, which is not when the insertions actually occur, but it's the only place where we know that all the insertions have completed. This value is useful in WALReadFromBuffers, which can verify that callers don't try to read past what has been inserted. (However, more infrastructure is needed in order to actually use WAL after the flush point, since it could be lost.) The value is also useful in WaitXLogInsertionsToFinish() itself, since we can now exit quickly when all WAL has been already inserted, without even having to take any locks.
* Reduce branches in heapgetpage()'s per-tuple loop (Andres Freund, 2024-04-06)
| | | | | | | | | | | | | | | | | | Until now, heapgetpage()'s loop over all tuples performed some conditional checks for each tuple, even though the condition did not change across the loop. This commit fixes that by moving the loop into an inline function. By calling it with different constant arguments, the compiler can generate an optimized loop for the different conditions, at the price of two per-page checks. For cases of all-visible tables and an isolation level other than serializable, speedups of up to 25% have been measured. Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Zhang Mingli <zmlpostgres@gmail.com> Tested-by: Quan Zongliang <quanzongliang@yeah.net> Discussion: https://postgr.es/m/20230716015656.xjvemfbp5fysjiea@awork3.anarazel.de Discussion: https://postgr.es/m/2ef7ff1b-3d18-2283-61b1-bbd25fc6c7ce@yeah.net
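The technique of hoisting a loop-invariant condition out of a per-tuple loop by calling an inline function with constant arguments can be sketched like this. It is a reduced illustration with a single flag and one per-page check; `DemoTuple`, `collect_tuples()` and `scan_page()` are hypothetical and do not match the heap access method's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-tuple record; not the PostgreSQL tuple layout. */
typedef struct
{
    int  value;
    bool visible;
} DemoTuple;

/*
 * The loop lives in an inline helper. all_visible is a parameter, but
 * every call site passes a literal, so after inlining the compiler can
 * drop the per-tuple test in the copy where it is always true.
 */
static inline size_t
collect_tuples(const DemoTuple *tuples, size_t ntuples,
               int *out, const bool all_visible)
{
    size_t nout = 0;

    assert(tuples != NULL && out != NULL);
    for (size_t i = 0; i < ntuples; i++)
    {
        if (all_visible || tuples[i].visible)
            out[nout++] = tuples[i].value;
    }
    return nout;
}

/* One check per page selects which specialized loop runs. */
size_t
scan_page(const DemoTuple *tuples, size_t ntuples,
          int *out, bool page_all_visible)
{
    if (page_all_visible)
        return collect_tuples(tuples, ntuples, out, true);
    else
        return collect_tuples(tuples, ntuples, out, false);
}
```

Whether the compiler actually emits two specialized loops depends on optimization settings; the sketch only shows the source-level arrangement that makes it possible.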
* Optimize visibilitymap_count() with AVX-512 instructions. (Nathan Bossart, 2024-04-06)
| | | | | | | | | | | | | | Commit 792752af4e added infrastructure for using AVX-512 intrinsic functions, and this commit uses that infrastructure to optimize visibilitymap_count(). Specifically, a new pg_popcount_masked() function is introduced that applies a bitmask to every byte in the buffer prior to calculating the population count, which is used to filter out the all-visible or all-frozen bits as needed. Platforms without AVX-512 support should also see a nice speedup due to the reduced number of calls to a function pointer. Co-authored-by: Ants Aasma Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
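What a masked population count computes can be shown with a portable, byte-at-a-time sketch. This is not the AVX-512 implementation, and `popcount_masked()` here is a hypothetical stand-in whose signature is an assumption. The masks in the usage note assume two bits per heap page, with the low bit of each pair meaning all-visible and the high bit meaning all-frozen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in one byte. */
static unsigned
popcount8(uint8_t b)
{
    unsigned n = 0;

    while (b)
    {
        b &= (uint8_t) (b - 1);   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Apply mask to every byte of buf, then sum the set bits that remain.
 * A single call covers the whole buffer, so the caller does not pay for
 * one function call per byte or word.
 */
uint64_t
popcount_masked(const uint8_t *buf, size_t nbytes, uint8_t mask)
{
    uint64_t total = 0;

    assert(buf != NULL || nbytes == 0);
    for (size_t i = 0; i < nbytes; i++)
        total += popcount8(buf[i] & mask);
    return total;
}
```

Under the bit-layout assumption above, `popcount_masked(map, len, 0x55)` would count all-visible pages and `popcount_masked(map, len, 0xAA)` all-frozen pages.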
* BitmapHeapScan: Push skip_fetch optimization into table AM (Tomas Vondra, 2024-04-07)
| | | | | | | | | | | | | | | | | | | | | Commit 7c70996ebf0949b142 introduced an optimization to allow bitmap scans to operate like index-only scans by not fetching a block from the heap if none of the underlying data is needed and the block is marked all visible in the visibility map. With the introduction of table AMs, a FIXME was added to this code indicating that the skip_fetch logic should be pushed into the table AM-specific code, as not all table AMs may use a visibility map in the same way. This commit resolves this FIXME for the current block. The layering violation is still present in BitmapHeapScan's prefetching code, which uses the visibility map to decide whether or not to prefetch a block. However, this can be addressed independently. Author: Melanie Plageman Reviewed-by: Andres Freund, Heikki Linnakangas, Tomas Vondra, Mark Dilger Discussion: https://postgr.es/m/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com
* Call WaitLSNCleanup() in AbortTransaction() (Alexander Korotkov, 2024-04-07)
| | | | | | | | | | | Even though waiting for replay LSN happens without explicit transaction, AbortTransaction() is responsible for the cleanup of the shared memory if the error is thrown in a stored procedure. So, we need to do WaitLSNCleanup() there to clean up after some unexpected error happened while waiting for replay LSN. Discussion: https://postgr.es/m/202404051815.eri4u5q6oj26%40alvherre.pgsql Author: Alvaro Herrera
* Enhance nbtree ScalarArrayOp execution. (Peter Geoghegan, 2024-04-06)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals natively. This works by pushing down the full context (the array keys) to the nbtree index AM, enabling it to execute multiple primitive index scans that the planner treats as one continuous index scan/index path. This earlier enhancement enabled nbtree ScalarArrayOp index-only scans. It also allowed scans with ScalarArrayOp quals to return ordered results (with some notable restrictions, described further down). Take this general approach a lot further: teach nbtree SAOP index scans to decide how to execute ScalarArrayOp scans (when and where to start the next primitive index scan) based on physical index characteristics. This can be far more efficient. All SAOP scans will now reliably avoid duplicative leaf page accesses (just like any other nbtree index scan). SAOP scans whose array keys are naturally clustered together now require far fewer index descents, since we'll reliably avoid starting a new primitive scan just to get to a later offset from the same leaf page. The scan's arrays now advance using binary searches for the array element that best matches the next tuple's attribute value. Required scan key arrays (i.e. arrays from scan keys that can terminate the scan) ratchet forward in lockstep with the index scan. Non-required arrays (i.e. arrays from scan keys that can only exclude non-matching tuples) "advance" without the process ever rolling over to a higher-order array. Naturally, only required SAOP scan keys trigger skipping over leaf pages (non-required arrays cannot safely end or start primitive index scans). 
Consequently, even index scans of a composite index with a high-order inequality scan key (which we'll mark required) and a low-order SAOP scan key (which we won't mark required) now avoid repeating leaf page accesses -- that benefit isn't limited to simpler equality-only cases. In general, all nbtree index scans now output tuples as if they were one continuous index scan -- even scans that mix a high-order inequality with lower-order SAOP equalities reliably output tuples in index order. This allows us to remove a couple of special cases that were applied when building index paths with SAOP clauses during planning. Bugfix commit 807a40c5 taught the planner to avoid generating unsafe path keys: path keys on a multicolumn index path, with a SAOP clause on any attribute beyond the first/most significant attribute. These cases are now all safe, so we go back to generating path keys without regard for the presence of SAOP clauses (just like with any other clause type). Affected queries can now exploit scan output order in all the usual ways (e.g., certain "ORDER BY ... LIMIT n" queries can now terminate early). Also undo changes from follow-up bugfix commit a4523c5a, which taught the planner to produce alternative index paths, with path keys, but without low-order SAOP index quals (filter quals were used instead). We'll no longer generate these alternative paths, since they can no longer offer any meaningful advantages over standard index qual paths. Affected queries thereby avoid all of the disadvantages that come from using filter quals within index scan nodes. They can avoid extra heap page accesses from using filter quals to exclude non-matching tuples (index quals will never have that problem). They can also skip over irrelevant sections of the index in more cases (though only when nbtree determines that starting another primitive scan actually makes sense). 
There is a theoretical risk that removing restrictions on SAOP index paths from the planner will break compatibility with amcanorder-based index AMs maintained as extensions. Such an index AM could have the same limitations around ordered SAOP scans as nbtree had up until now. Adding a pro forma incompatibility item about the issue to the Postgres 17 release notes seems like a good idea. Author: Peter Geoghegan <pg@bowt.ie> Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
* Operate XLogCtl->log{Write,Flush}Result with atomics (Alvaro Herrera, 2024-04-05)
| | | | | | | | | | | | | | | | | This removes the need to hold both the info_lck spinlock and WALWriteLock to update them. We use stock atomic write instead, with WALWriteLock held. Readers can use atomic read, without any locking. This allows for some code to be reordered: some places were a bit contorted to avoid repeated spinlock acquisition, but that's no longer a concern, so we can turn them into more natural coding. Some further changes are possible (maybe for performance wins), but in this commit I did rather minimal ones only, to avoid increasing the blast radius. Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Jeff Davis <pgsql@j-davis.com> Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
* Secondary refactor of heap scanning functions (David Rowley, 2024-04-04)
| | | | | | | | Similar to 44086b097, refactor heap scanning functions to be more suitable for the read stream API. Author: Melanie Plageman Discussion: https://postgr.es/m/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg=gEQ@mail.gmail.com
* Preliminary refactor of heap scanning functions (David Rowley, 2024-04-04)
| | | | | | | | | | | | | | | | | | | | | | | | To allow the use of the read stream API added in b5a9b18cd for sequential scans on heap tables, here we make some adjustments to make that change less invasive and perhaps make the code easier to follow in the process. Here heapgetpage() gets broken into two functions: 1) The part which reads the block has now been moved into a function named heapfetchbuf(). 2) The part which performed pruning and populated the scan's rs_vistuples[] array is now moved into a new function named heap_prepare_pagescan(). The functionality provided by heap_prepare_pagescan() was only ever required by SO_ALLOW_PAGEMODE scans, so the branching that was previously done in heapgetpage() is no longer needed as we simply just don't call heap_prepare_pagescan() from heapgettup() in the refactored code. Author: Melanie Plageman Discussion: https://postgr.es/m/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg=gEQ@mail.gmail.com
* Invent SERIALIZE option for EXPLAIN. | Tom Lane | 2024-04-03
| | | | | | | | | | | | | | | | EXPLAIN (ANALYZE, SERIALIZE) allows collection of statistics about the volume of data emitted by a query, as well as the time taken to convert the data to the on-the-wire format. Previously there was no way to investigate this without actually sending the data to the client, in which case network transmission costs might swamp what you wanted to see. In particular this feature allows investigating the costs of de-TOASTing compressed or out-of-line data during formatting. Stepan Rutz and Matthias van de Meent, reviewed by Tomas Vondra and myself Discussion: https://postgr.es/m/ca0adb0e-fa4e-c37e-1cd7-91170b18cae1@gmx.de
* Split XLogCtl->LogwrtResult into separate struct members | Alvaro Herrera | 2024-04-03
| | | | | | | | | | | | | | | | | | | After this change we have XLogCtl->logWriteResult and ->logFlushResult. There's no functional change, other than the fact that the assignment from shared memory to local is no longer done via struct assignment, but instead using a macro that copies each member separately. The current representation is inconvenient going forward; notably, we would like to add a new member "Copy" (to keep track of the last position copied into WAL buffers), so the symmetry between the values in shared memory vs. those in local would be lost. This also gives us freedom to later change the concurrency model for the values in shared memory: we can make them use atomics instead of relying on the info_lck spinlock. Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://postgr.es/m/202404031119.cd2kugjk2vho@alvherre.pgsql
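A minimal sketch of the "macro that copies each member separately" idea follows. The struct and macro names here are hypothetical and the fields are plain integers; the point is only that the shared side no longer has to share a struct type with the local copy, so its members can later change representation independently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical local copy, still a two-member struct. */
typedef struct ToyLogwrtResult
{
    uint64_t Write;
    uint64_t Flush;
} ToyLogwrtResult;

/* Hypothetical shared state: two separate members instead of one struct. */
typedef struct ToyXLogCtl
{
    uint64_t logWriteResult;
    uint64_t logFlushResult;
} ToyXLogCtl;

/* Copy member by member rather than by struct assignment. */
#define TOY_REFRESH_RESULT(local, ctl) \
    do { \
        (local).Write = (ctl)->logWriteResult; \
        (local).Flush = (ctl)->logFlushResult; \
    } while (0)

/* Example caller: take a local snapshot of the shared values. */
static ToyLogwrtResult
toy_snapshot_result(const ToyXLogCtl *ctl)
{
    ToyLogwrtResult local;

    TOY_REFRESH_RESULT(local, ctl);
    assert(local.Flush <= local.Write); /* flushed data was written first */
    return local;
}
```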
* Combine freezing and pruning steps in VACUUM | Heikki Linnakangas | 2024-04-03
| | | | | | | | | | | | | | | | | | | | | | | | Execute both freezing and pruning of tuples in the same heap_page_prune() function, now called heap_page_prune_and_freeze(), and emit a single WAL record containing all changes. That reduces the overall amount of WAL generated. This moves the freezing logic from vacuumlazy.c to the heap_page_prune_and_freeze() function. The main difference in the coding is that in vacuumlazy.c, we looked at the tuples after the pruning had already happened, but in heap_page_prune_and_freeze() we operate on the tuples before pruning. The heap_prepare_freeze_tuple() function is now invoked after we have determined that a tuple is not going to be pruned away. VACUUM no longer needs to loop through the items on the page after pruning. heap_page_prune_and_freeze() does all the work. It now returns the list of dead offsets, including existing LP_DEAD items, to the caller. Similarly it's now responsible for tracking 'all_visible', 'all_frozen', and 'hastup' on the caller's behalf. Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
* Refactor how heap_prune_chain() updates prunable_xid | Heikki Linnakangas | 2024-04-03
| | | | | | | | | | | | | | | In preparation for freezing and counting tuples which are not candidates for pruning, split heap_prune_record_unchanged() into multiple functions, depending on the kind of line pointer. That's not too interesting right now, but makes the next commit smaller. Recording the lowest soon-to-be prunable xid is one of the actions we take for unchanged LP_NORMAL item pointers but not for others, so move that to the new heap_prune_record_unchanged_lp_normal() function. The next commit will add more actions to these functions. Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov