aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access/table
Commit message (Collapse)AuthorAge
* Fix varatt versus Datum type confusionsHEADmasterPeter Eisentraut4 hours
| | | | | | | | | | | | | | | | Macros like VARDATA() and VARSIZE() should be thought of as taking values of type pointer to struct varlena or some other related struct. The way they are implemented, you can pass anything to it and it will cast it right. But this is in principle incorrect. To fix, add the required DatumGetPointer() calls. Or in a couple of cases, remove superfluous PointerGetDatum() calls. It is planned in a subsequent patch to change macros like VARDATA() and VARSIZE() to inline functions, which will enforce stricter typing. This is in preparation for that. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/928ea48f-77c6-417b-897c-621ef16685a6%40eisentraut.org
* Remove table AM callback scan_bitmap_next_blockMelanie Plageman2025-03-15
| | | | | | | | | | | | | | | After pushing the bitmap iterator into table-AM specific code (as part of making bitmap heap scan use the read stream API in 2b73a8cd33b7), scan_bitmap_next_block() no longer returns the current block number. Since scan_bitmap_next_block() isn't returning any relevant information to bitmap table scan code, it makes more sense to get rid of it. Now, bitmap table scan code only calls table_scan_bitmap_next_tuple(), and the heap AM implementation of scan_bitmap_next_block() is a local helper in heapam_handler.c. Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/flat/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com
* Correct relation size estimate with low fillfactorTomas Vondra2025-02-19
| | | | | | | | | | | | | | | | | | | | | Since commit 29cf61ade3, table_block_relation_estimate_size() considers fillfactor when estimating number of rows in a relation before the first ANALYZE. The formula however did not consider tuples may be larger than available space determined by fillfactor, ending with density 0. This ultimately means the relation was estimated to contain a single row. The executor however places at least one tuple per page, even with very low fillfactor values, so the density should be at least 1. Fixed by clamping the density estimate using clamp_row_est(). Reported by Heikki Linnakangas. Fix by me, with regression test inspired by example provided by Heikki. Backpatch to 17, where the issue was introduced. Reported-by: Heikki Linnakangas Backpatch-through: 17 Discussion: https://postgr.es/m/2bf9d973-7789-4937-a7ca-0af9fb49c71e@iki.fi
* Remove unnecessary (char *) casts [mem]Peter Eisentraut2025-02-12
| | | | | | | | | | Remove (char *) casts around memory functions such as memcmp(), memcpy(), or memset() where the cast is useless. Since these functions don't take char * arguments anyway, these casts are at best complicated casts to (void *), about which see commit 7f798aca1d5. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org
* Update copyright for 2025Bruce Momjian2025-01-01
| | | | Backpatch-through: 13
* Introduce CompactAttribute array in TupleDesc, take 2David Rowley2024-12-20
| | | | | | | | | | | | | | | | | | | | | | | | The new compact_attrs array stores a few select fields from FormData_pg_attribute in a more compact way, using only 16 bytes per column instead of the 104 bytes that FormData_pg_attribute uses. Using CompactAttribute allows performance-critical operations such as tuple deformation to be performed without looking at the FormData_pg_attribute element in TupleDesc which means fewer cacheline accesses. For some workloads, tuple deformation can be the most CPU intensive part of processing the query. Some testing with 16 columns on a table where the first column is variable length showed around a 10% increase in transactions per second for an OLAP type query performing aggregation on the 16th column. However, in certain cases, the increases were much higher, up to ~25% on one AMD Zen4 machine. This also makes pg_attribute.attcacheoff redundant. A follow-on commit will remove it, thus shrinking the FormData_pg_attribute struct by 4 bytes. Author: David Rowley Reviewed-by: Andres Freund, Victor Yegorov Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
* Improve Asserts checking relation matching in parallel scans.Tom Lane2024-09-20
| | | | | | | | | | | | | | | | table_beginscan_parallel and index_beginscan_parallel contain Asserts checking that the relation a worker will use in a parallel scan is the same one the leader intended. However, they were checking for relation OID match, which was not strong enough to detect the mismatch problem fixed in 126ec0bc7. What would be strong enough is to compare relfilenodes instead. Arguably, that's a saner definition anyway, since a scan surely operates on a physical relation not a logical one. Hence, store and compare RelFileLocators not relation OIDs. Also ensure that index_beginscan_parallel checks the index identity not just the table identity. Discussion: https://postgr.es/m/2127254.1726789524@sss.pgh.pa.us
* Revise GUC names quoting in messages againPeter Eisentraut2024-05-17
| | | | | | | | | | | | | | | After further review, we want to move in the direction of always quoting GUC names in error messages, rather than the previous (PG16) wildly mixed practice or the intermittent (mid-PG17) idea of doing this depending on how possibly confusing the GUC name is. This commit applies appropriate quotes to (almost?) all mentions of GUC names in error messages. It partially supersedes a243569bf65 and 8d9978a7176, which had moved things a bit in the opposite direction but which then were abandoned in a partial state. Author: Peter Smith <smithpb2250@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAHut%2BPv-kSN8SkxSdoHano_wPubqcg5789ejhCDZAcLFceBR-w%40mail.gmail.com
* revert: Generalize relation analyze in table AM interfaceAlexander Korotkov2024-04-16
| | | | | | This commit reverts 27bc1772fc and dd1f6b0c17. Per review by Andres Freund. Discussion: https://postgr.es/m/20240415201057.khoyxbwwxfgzomeo%40awork3.anarazel.de
* Revert: Allow locking updated tuples in tuple_update() and tuple_delete()Alexander Korotkov2024-04-11
| | | | | | This commit reverts 87985cc925 and 818861eb57 per review by Andres Freund. Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
* Revert: Let table AM insertion methods control index insertionAlexander Korotkov2024-04-11
| | | | | | This commit reverts b1484a3f19 per review by Andres Freund. Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
* Revert: Custom reloptions for table AMAlexander Korotkov2024-04-11
| | | | | | This commit reverts 9bd99f4c26 and 422041542f per review by Andres Freund. Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
* Custom reloptions for table AMAlexander Korotkov2024-04-08
| | | | | | | | | | | | | | | | | | Let table AM define custom reloptions for its tables. This allows specifying AM-specific parameters by the WITH clause when creating a table. The reloptions, which could be used outside of table AM, are now extracted into the CommonRdOptions data structure. These options could be by decision of table AM directly specified by a user or calculated in some way. The new test module test_tam_options evaluates the ability to set up custom reloptions and calculate fields of CommonRdOptions on their base. The code may use some parts from prior work by Hao Wu. Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn Reviewed-by: Reviewed-by: Pavel Borisov, Matthias van de Meent, Jess Davis
* Revert "Custom reloptions for table AM"Alexander Korotkov2024-04-02
| | | | | | | | This reverts commit c95c25f9af4bc77f2f66a587735c50da08c12b37 due to multiple design issues spotted after commit. Reported-by: Jeff Davis Discussion: https://postgr.es/m/11550b536211d5748bb2865ed6cb3502ff073bf7.camel%40j-davis.com
* Let table AM insertion methods control index insertionAlexander Korotkov2024-03-30
| | | | | | | | | | | | Previously, the executor did index insert unconditionally after calling table AM interface methods tuple_insert() and multi_insert(). This commit introduces the new parameter insert_indexes for these two methods. Setting '*insert_indexes' to true saves the current logic. Setting it to false indicates that table AM cares about index inserts itself and doesn't want the caller to do that. Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com Reviewed-by: Pavel Borisov, Matthias van de Meent, Mark Dilger
* Custom reloptions for table AMAlexander Korotkov2024-03-30
| | | | | | | | | | | Let table AM define custom reloptions for its tables. This allows to specify AM-specific parameters by WITH clause when creating a table. The code may use some parts from prior work by Hao Wu. Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn Reviewed-by: Reviewed-by: Pavel Borisov, Matthias van de Meent
* Generalize relation analyze in table AM interfaceAlexander Korotkov2024-03-30
| | | | | | | | | | | | | | | Currently, there is just one algorithm for sampling tuples from a table written in acquire_sample_rows(). Custom table AM can just redefine the way to get the next block/tuple by implementing scan_analyze_next_block() and scan_analyze_next_tuple() API functions. This approach doesn't seem general enough. For instance, it's unclear how to sample this way index-organized tables. This commit allows table AM to encapsulate the whole sampling algorithm (currently implemented in acquire_sample_rows()) into the relation_analyze() API function. Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com Reviewed-by: Pavel Borisov, Matthias van de Meent
* Allow locking updated tuples in tuple_update() and tuple_delete()Alexander Korotkov2024-03-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, in read committed transaction isolation mode (default), we have the following sequence of actions when tuple_update()/tuple_delete() finds the tuple updated by the concurrent transaction. 1. Attempt to update/delete tuple with tuple_update()/tuple_delete(), which returns TM_Updated. 2. Lock tuple with tuple_lock(). 3. Re-evaluate plan qual (recheck if we still need to update/delete and calculate the new tuple for update). 4. Second attempt to update/delete tuple with tuple_update()/tuple_delete(). This attempt should be successful, since the tuple was previously locked. This commit eliminates step 2 by taking the lock during the first tuple_update()/tuple_delete() call. The heap table access method saves some effort by checking the updated tuple once instead of twice. Future undo-based table access methods, which will start from the latest row version, can immediately place a lock there. Also, this commit makes tuple_update()/tuple_delete() optionally save the old tuple into the dedicated slot. That saves efforts on re-fetching tuples in certain cases. The code in nodeModifyTable.c is simplified by removing the nested switch/case. Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp Reviewed-by: Andres Freund, Chris Travers
* Remove redundant snapshot copying from parallel leader to workersHeikki Linnakangas2024-03-14
| | | | | | | | | | | | | | | | | | | | | | | | The parallel query infrastructure copies the leader backend's active snapshot to the worker processes. But BitmapHeapScan node also had bespoken code to pass the snapshot from leader to the worker. That was redundant, so remove it. The removed code was analogous to the snapshot serialization in table_parallelscan_initialize(), but that was the wrong role model. A parallel bitmap heap scan is more like an independent non-parallel bitmap heap scan in each parallel worker as far as the table AM is concerned, because the coordination is done in nodeBitmapHeapscan.c, and the table AM doesn't need to know anything about it. This relies on the assumption that es_snapshot == GetActiveSnapshot(). That's not a new assumption, things would get weird if you used the QueryDesc's snapshot for visibility checks in the scans, but the active snapshot for evaluating quals, for example. This could use some refactoring and cleanup, but for now, just add some assertions. Reviewed-by: Dilip Kumar, Robert Haas Discussion: https://www.postgresql.org/message-id/5f3b9d59-0f43-419d-80ca-6d04c07cf61a@iki.fi
* Remove unused #include's from backend .c filesPeter Eisentraut2024-03-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | as determined by include-what-you-use (IWYU) While IWYU also suggests to *add* a bunch of #include's (which is its main purpose), this patch does not do that. In some cases, a more specific #include replaces another less specific one. Some manual adjustments of the automatic result: - IWYU currently doesn't know about includes that provide global variable declarations (like -Wmissing-variable-declarations), so those includes are being kept manually. - All includes for port(ability) headers are being kept for now, to play it safe. - No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the patch from exploding in size. Note that this patch touches just *.c files, so nothing declared in header files changes in hidden ways. As a small example, in src/backend/access/transam/rmgr.c, some IWYU pragma annotations are added to handle a special case there. Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
* Update copyright for 2024Bruce Momjian2024-01-03
| | | | | | | | Reported-by: Michael Paquier Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz Backpatch-through: 12
* Add const to values and nulls argumentsPeter Eisentraut2023-10-10
| | | | | | | This excludes any changes that would change the external AM APIs. Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://www.postgresql.org/message-id/flat/14c31f4a-0347-0805-dce8-93a9072c05a5%40eisentraut.org
* Consider fillfactor when estimating relation sizeTomas Vondra2023-07-03
| | | | | | | | | | | | | | | | | When table_block_relation_estimate_size() estimated the number of tuples in a relation without statistics (e.g. right after load), it did not consider fillfactor when calculating density. With non-default fillfactor values, this may result in significant overestimate of the number of tuples - up to 10x with the minimum 10% fillfactor. This may have unexpected consequences, e.g. when creating hash indexes. This considers the current fillfactor value in the "no statistics" code path. If the fillfactor changes after loading data into the table, the estimate may be off. But that seems much less likely than changing the fillfactor before the data load. Reviewed-by: Corey Huinker, Peter Eisentraut Discussion: https://postgr.es/m/cf154ef9-6bac-d268-b735-67a3443debba@enterprisedb.com
* Pre-beta mechanical code beautification.Tom Lane2023-05-19
| | | | | | | | | | | | | | | Run pgindent, pgperltidy, and reformat-dat-files. This set of diffs is a bit larger than typical. We've updated to pg_bsd_indent 2.1.2, which properly indents variable declarations that have multi-line initialization expressions (the continuation lines are now indented one tab stop). We've also updated to perltidy version 20230309 and changed some of its settings, which reduces its desire to add whitespace to lines to make assignments etc. line up. Going forward, that should make for fewer random-seeming changes to existing code. Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
* Revert 11470f544eAlexander Korotkov2023-04-03
| | | | Discussion: https://postgr.es/m/20230323003003.plgaxjqahjgkuxrk%40awork3.anarazel.de
* Allow locking updated tuples in tuple_update() and tuple_delete()Alexander Korotkov2023-03-23
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently, in read committed transaction isolation mode (default), we have the following sequence of actions when tuple_update()/tuple_delete() finds the tuple updated by concurrent transaction. 1. Attempt to update/delete tuple with tuple_update()/tuple_delete(), which returns TM_Updated. 2. Lock tuple with tuple_lock(). 3. Re-evaluate plan qual (recheck if we still need to update/delete and calculate the new tuple for update). 4. Second attempt to update/delete tuple with tuple_update()/tuple_delete(). This attempt should be successful, since the tuple was previously locked. This patch eliminates step 2 by taking the lock during first tuple_update()/tuple_delete() call. Heap table access method saves some efforts by checking the updated tuple once instead of twice. Future undo-based table access methods, which will start from the latest row version, can immediately place a lock there. The code in nodeModifyTable.c is simplified by removing the nested switch/case. Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp Reviewed-by: Andres Freund, Chris Travers
* Ignore BRIN indexes when checking for HOT updatesTomas Vondra2023-03-20
| | | | | | | | | | | | | | | | | | | | | | | | When determining whether an index update may be skipped by using HOT, we can ignore attributes indexed by block summarizing indexes without references to individual tuples that need to be cleaned up. A new type TU_UpdateIndexes provides a signal to the executor to determine which indexes to update - no indexes, all indexes, or only the summarizing indexes. This also removes rd_indexattr list, and replaces it with rd_attrsvalid flag. The list was not used anywhere, and a simple flag is sufficient. This was originally committed as 5753d4ee32, but then got reverted by e3fcca0d0d because of correctness issues. Original patch by Josef Simanek, various fixes and improvements by Tomas Vondra and me. Authors: Matthias van de Meent, Josef Simanek, Tomas Vondra Reviewed-by: Tomas Vondra, Alvaro Herrera Discussion: https://postgr.es/m/05ebcb44-f383-86e3-4f31-0a97a55634cf@enterprisedb.com Discussion: https://postgr.es/m/CAFp7QwpMRGcDAQumN7onN9HjrJ3u4X3ZRXdGFT0K5G2JWvnbWg%40mail.gmail.com
* New header varatt.h split off from postgres.hPeter Eisentraut2023-01-10
| | | | | | | | | This new header contains all the variable-length data types support (TOAST support) from postgres.h, which isn't needed by large parts of the backend code. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/ddcce239-0f29-6e62-4b47-1f8ca742addf%40enterprisedb.com
* Update copyright for 2023Bruce Momjian2023-01-02
| | | | Backpatch-through: 11
* Add copyright notices to meson filesAndrew Dunstan2022-12-20
| | | | Discussion: https://postgr.es/m/222b43a5-2fb3-2c1b-9cd0-375d376c8246@dunslane.net
* meson: Add initial version of meson based build systemAndres Freund2022-09-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Autoconf is showing its age, fewer and fewer contributors know how to wrangle it. Recursive make has a lot of hard to resolve dependency issues and slow incremental rebuilds. Our home-grown MSVC build system is hard to maintain for developers not using Windows and runs tests serially. While these and other issues could individually be addressed with incremental improvements, together they seem best addressed by moving to a more modern build system. After evaluating different build system choices, we chose to use meson, to a good degree based on the adoption by other open source projects. We decided that it's more realistic to commit a relatively early version of the new build system and mature it in tree. This commit adds an initial version of a meson based build system. It supports building postgres on at least AIX, FreeBSD, Linux, macOS, NetBSD, OpenBSD, Solaris and Windows (however only gcc is supported on aix, solaris). For Windows/MSVC postgres can now be built with ninja (faster, particularly for incremental builds) and msbuild (supporting the visual studio GUI, but building slower). Several aspects (e.g. Windows rc file generation, PGXS compatibility, LLVM bitcode generation, documentation adjustments) are done in subsequent commits requiring further review. Other aspects (e.g. not installing test-only extensions) are not yet addressed. When building on Windows with msbuild, builds are slower when using a visual studio version older than 2019, because those versions do not support MultiToolTask, required by meson for intra-target parallelism. The plan is to remove the MSVC specific build system in src/tools/msvc soon after reaching feature parity. However, we're not planning to remove the autoconf/make build system in the near future. Likely we're going to keep at least the parts required for PGXS to keep working around until all supported versions build with meson. Some initial help for postgres developers is at https://wiki.postgresql.org/wiki/Meson With contributions from Thomas Munro, John Naylor, Stone Tickle and others. Author: Andres Freund <andres@anarazel.de> Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Author: Peter Eisentraut <peter@eisentraut.org> Reviewed-By: Peter Eisentraut <peter.eisentraut@enterprisedb.com> Discussion: https://postgr.es/m/20211012083721.hvixq4pnh2pixr3j@alap3.anarazel.de
* Harmonize heapam and tableam parameter names.Peter Geoghegan2022-09-19
| | | | | | | | | | | | | | | | Make sure that function declarations use names that exactly match the corresponding names from function definitions. Having parameter names that are reliably consistent in this way will make it easier to reason about groups of related C functions from the same translation unit as a module. It will also make certain refactoring tasks easier. Like other recent commits that cleaned up function parameter names, this commit was written with help from clang-tidy. Later commits will do the same for other parts of the codebase. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com
* Split up guc.c for better build speed and ease of maintenance.Tom Lane2022-09-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | guc.c has grown to be one of our largest .c files, making it a bottleneck for compilation. It's also acquired a bunch of knowledge that'd be better kept elsewhere, because of our not very good habit of putting variable-specific check hooks here. Hence, split it up along these lines: * guc.c itself retains just the core GUC housekeeping mechanisms. * New file guc_funcs.c contains the SET/SHOW interfaces and some SQL-accessible functions for GUC manipulation. * New file guc_tables.c contains the data arrays that define the built-in GUC variables, along with some already-exported constant tables. * GUC check/assign/show hook functions are moved to the variable's home module, whenever that's clearly identifiable. A few hard- to-classify hooks ended up in commands/variable.c, which was already a home for miscellaneous GUC hook functions. To avoid cluttering a lot more header files with #include "guc.h", I also invented a new header file utils/guc_hooks.h and put all the GUC hook functions' declarations there, regardless of their originating module. That allowed removal of #include "guc.h" from some existing headers. The fallout from that (hopefully all caught here) demonstrates clearly why such inclusions are best minimized: there are a lot of files that, for example, were getting array.h at two or more levels of remove, despite not having any connection at all to GUCs in themselves. There is some very minor code beautification here, such as renaming a couple of inconsistently-named hook functions and improving some comments. But mostly this just moves code from point A to point B and deals with the ensuing needs for #include adjustments and exporting a few functions that previously weren't exported. Patch by me, per a suggestion from Andres Freund; thanks also to Michael Paquier for the idea to invent guc_funcs.c. Discussion: https://postgr.es/m/587607.1662836699@sss.pgh.pa.us
* Fix incorrect uses of Datum conversion macrosPeter Eisentraut2022-09-05
| | | | | | | | | | | Since these macros just cast whatever you give them to the designated output type, and many normal uses also cast the output type further, a number of incorrect uses go undiscovered. The fixes in this patch have been discovered by changing these macros to inline functions, which is the subject of a future patch. Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://www.postgresql.org/message-id/flat/8528fb7e-0aa2-6b54-85fb-0c0886dbd6ed%40enterprisedb.com
* Eliminate duplicate code in table.c.Amit Kapila2022-07-26
| | | | | | | | | Additionally improve the error message similar to how it was done in 2ed532ee8c. Author: Junwang Zhao, Aleksander Alekseev Reviewed-by: Amit Kapila, Alvaro Herrera, Kyotaro Horiguchi Discussion: https://postgr.es/m/CAEG8a3KbVtBm_BYf5tGsKHvmMieQVsq_jBPOg75VViQB7ACL8Q%40mail.gmail.com
* Change internal RelFileNode references to RelFileNumber or RelFileLocator.Robert Haas2022-07-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
* Remove extraneous blank lines before block-closing bracesAlvaro Herrera2022-04-13
| | | | | | | | | These are useless and distracting. We wouldn't have written the code with them to begin with, so there's no reason to keep them. Author: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/20220411020336.GB26620@telsasoft.com Discussion: https://postgr.es/m/attachment/133167/0016-Extraneous-blank-lines.patch
* Update copyright for 2022Bruce Momjian2022-01-07
| | | | Backpatch-through: 10
* Fix some typos with {a,an}Michael Paquier2021-12-09
| | | | | | | | One of the changes impacts the documentation, so backpatch. Author: Peter Smith Discussion: https://postgr.es/m/CAHut+Pu6+c+r3mY24VT7u+H+E_s6vMr5OdRiZ8NT3EOa-E5Lmw@mail.gmail.com Backpatch-through: 14
* Replace RelationOpenSmgr() with RelationGetSmgr().Tom Lane2021-07-12
| | | | | | | | | | | | | | | | | | | | The idea behind this patch is to design out bugs like the one fixed by commit 9d523119f. Previously, once one did RelationOpenSmgr(rel), it was considered okay to access rel->rd_smgr directly for some not-very-clear interval. But since that pointer will be cleared by relcache flushes, we had bugs arising from overreliance on a previous RelationOpenSmgr call still being effective. Now, very little code except that in rel.h and relcache.c should ever touch the rd_smgr field directly. The normal coding rule is to use RelationGetSmgr(rel) and not expect the result to be valid for longer than one smgr function call. There are a couple of places where using the function every single time seemed like overkill, but they are now annotated with large warning comments. Amul Sul, after an idea of mine. Discussion: https://postgr.es/m/CANiYTQsU7yMFpQYnv=BrcRVqK_3U3mtAzAsJCaqtzsDHfsUbdQ@mail.gmail.com
* Allow configurable LZ4 TOAST compression.Robert Haas2021-03-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is now a per-column COMPRESSION option which can be set to pglz (the default, and the only option in up until now) or lz4. Or, if you like, you can set the new default_toast_compression GUC to lz4, and then that will be the default for new table columns for which no value is specified. We don't have lz4 support in the PostgreSQL code, so to use lz4 compression, PostgreSQL must be built --with-lz4. In general, TOAST compression means compression of individual column values, not the whole tuple, and those values can either be compressed inline within the tuple or compressed and then stored externally in the TOAST table, so those properties also apply to this feature. Prior to this commit, a TOAST pointer has two unused bits as part of the va_extsize field, and a compessed datum has two unused bits as part of the va_rawsize field. These bits are unused because the length of a varlena is limited to 1GB; we now use them to indicate the compression type that was used. This means we only have bit space for 2 more built-in compresison types, but we could work around that problem, if necessary, by introducing a new vartag_external value for any further types we end up wanting to add. Hopefully, it won't be too important to offer a wide selection of algorithms here, since each one we add not only takes more coding but also adds a build dependency for every packager. Nevertheless, it seems worth doing at least this much, because LZ4 gets better compression than PGLZ with less CPU usage. It's possible for LZ4-compressed datums to leak into composite type values stored on disk, just as it is for PGLZ. It's also possible for LZ4-compressed attributes to be copied into a different table via SQL commands such as CREATE TABLE AS or INSERT .. SELECT. It would be expensive to force such values to be decompressed, so PostgreSQL has never done so. For the same reasons, we also don't force recompression of already-compressed values even if the target table prefers a different compression method than was used for the source data. These architectural decisions are perhaps arguable but revisiting them is well beyond the scope of what seemed possible to do as part of this project. However, it's relatively cheap to recompress as part of VACUUM FULL or CLUSTER, so this commit adjusts those commands to do so, if the configured compression method of the table happens not to match what was used for some column value stored therein. Dilip Kumar. The original patches on which this work was based were written by Ildus Kurbangaliev, and those were patches were based on even earlier work by Nikita Glukhov, but the design has since changed very substantially, since allow a potentially large number of compression methods that could be added and dropped on a running system proved too problematic given some of the architectural issues mentioned above; the choice of which specific compression method to add first is now different; and a lot of the code has been heavily refactored. More recently, Justin Przyby helped quite a bit with testing and reviewing and this version also includes some code contributions from him. Other design input and review from Tomas Vondra, Álvaro Herrera, Andres Freund, Oleg Bartunov, Alexander Korotkov, and me. Discussion: http://postgr.es/m/20170907194236.4cefce96%40wp.localdomain Discussion: http://postgr.es/m/CAFiTN-uUpX3ck%3DK0mLEk-G_kUQY%3DSNOTeqdaNRR9FMdQrHKebw%40mail.gmail.com
* Enhance nbtree index tuple deletion.Peter Geoghegan2021-01-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Teach nbtree and heapam to cooperate in order to eagerly remove duplicate tuples representing dead MVCC versions. This is "bottom-up deletion". Each bottom-up deletion pass is triggered lazily in response to a flood of versions on an nbtree leaf page. This usually involves a "logically unchanged index" hint (these are produced by the executor mechanism added by commit 9dc718bd). The immediate goal of bottom-up index deletion is to avoid "unnecessary" page splits caused entirely by version duplicates. It naturally has an even more useful effect, though: it acts as a backstop against accumulating an excessive number of index tuple versions for any given _logical row_. Bottom-up index deletion complements what we might now call "top-down index deletion": index vacuuming performed by VACUUM. Bottom-up index deletion responds to the immediate local needs of queries, while leaving it up to autovacuum to perform infrequent clean sweeps of the index. The overall effect is to avoid certain pathological performance issues related to "version churn" from UPDATEs. The previous tableam interface used by index AMs to perform tuple deletion (the table_compute_xid_horizon_for_tuples() function) has been replaced with a new interface that supports certain new requirements. Many (perhaps all) of the capabilities added to nbtree by this commit could also be extended to other index AMs. That is left as work for a later commit. Extend deletion of LP_DEAD-marked index tuples in nbtree by adding logic to consider extra index tuples (that are not LP_DEAD-marked) for deletion in passing. This increases the number of index tuples deleted significantly in many cases. The LP_DEAD deletion process (which is now called "simple deletion" to clearly distinguish it from bottom-up deletion) won't usually need to visit any extra table blocks to check these extra tuples. We have to visit the same table blocks anyway to generate a latestRemovedXid value (at least in the common case where the index deletion operation's WAL record needs such a value). Testing has shown that the "extra tuples" simple deletion enhancement increases the number of index tuples deleted with almost any workload that has LP_DEAD bits set in leaf pages. That is, it almost never fails to delete at least a few extra index tuples. It helps most of all in cases that happen to naturally have a lot of delete-safe tuples. It's not uncommon for an individual deletion operation to end up deleting an order of magnitude more index tuples compared to the old naive approach (e.g., custom instrumentation of the patch shows that this happens fairly often when the regression tests are run). Add a further enhancement that augments simple deletion and bottom-up deletion in indexes that make use of deduplication: Teach nbtree's _bt_delitems_delete() function to support granular TID deletion in posting list tuples. It is now possible to delete individual TIDs from posting list tuples provided the TIDs have a tableam block number of a table block that gets visited as part of the deletion process (visiting the table block can be triggered directly or indirectly). Setting the LP_DEAD bit of a posting list tuple is still an all-or-nothing thing, but that matters much less now that deletion only needs to start out with the right _general_ idea about which index tuples are deletable. Bump XLOG_PAGE_MAGIC because xl_btree_delete changed. No bump in BTREE_VERSION, since there are no changes to the on-disk representation of nbtree indexes. Indexes built on PostgreSQL 12 or PostgreSQL 13 will automatically benefit from bottom-up index deletion (i.e. no reindexing required) following a pg_upgrade. The enhancement to simple deletion is available with all B-Tree indexes following a pg_upgrade, no matter what PostgreSQL version the user upgrades from. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-By: Victor Yegorov <vyegorov@gmail.com> Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com
* Update copyright for 2021Bruce Momjian2021-01-02
| | | | Backpatch-through: 9.5
* Improve handling of dropped relations for REINDEX DATABASE/SCHEMA/SYSTEMMichael Paquier2020-09-02
| | | | | | | | | | | | | | | | | | | When multiple relations are reindexed, a scan of pg_class is done first to build the list of relations to work on. However the REINDEX logic has never checked if a relation listed still exists when beginning the work on it, causing for example sudden cache lookup failures. This commit adds safeguards against dropped relations for REINDEX, similarly to VACUUM or CLUSTER where we try to open the relation, ignoring it if it is missing. A new option is added to the REINDEX routines to control if a missed relation is OK to ignore or not. An isolation test, based on REINDEX SCHEMA, is added for the concurrent and non-concurrent cases. Author: Michael Paquier Reviewed-by: Anastasia Lubennikova Discussion: https://postgr.es/m/20200813043805.GE11663@paquier.xyz
* Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.Tom Lane2020-08-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Historically, we've considered the state with relpages and reltuples both zero as indicating that we do not know the table's tuple density. This is problematic because it's impossible to distinguish "never yet vacuumed" from "vacuumed and seen to be empty". In particular, a user cannot use VACUUM or ANALYZE to override the planner's normal heuristic that an empty table should not be believed to be empty because it is probably about to get populated. That heuristic is a good safety measure, so I don't care to abandon it, but there should be a way to override it if the table is indeed intended to stay empty. Hence, represent the initial state of ignorance by setting reltuples to -1 (relpages is still set to zero), and apply the minimum-ten-pages heuristic only when reltuples is still -1. If the table is empty, VACUUM or ANALYZE (but not CREATE INDEX) will override that to reltuples = relpages = 0, and then we'll plan on that basis. This requires a bunch of fiddly little changes, but we can get rid of some ugly kluges that were formerly needed to maintain the old definition. One notable point is that FDWs' GetForeignRelSize methods will see baserel->tuples = -1 when no ANALYZE has been done on the foreign table. That seems like a net improvement, since those methods were formerly also in the dark about what baserel->tuples = 0 really meant. Still, it is an API change. I bumped catversion because code predating this change would get confused by seeing reltuples = -1. Discussion: https://postgr.es/m/F02298E0-6EF4-49A1-BCB6-C484794D9ACC@thebuild.com
* Implement streaming mode in ReorderBuffer.Amit Kapila2020-08-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of serializing the transaction to disk after reaching the logical_decoding_work_mem limit in memory, we consume the changes we have in memory and invoke stream API methods added by commit 45fdc9738b. However, sometimes if we have incomplete toast or speculative insert we spill to the disk because we can't generate the complete tuple and stream. And, as soon as we get the complete tuple we stream the transaction including the serialized changes. We can do this incremental processing thanks to having assignments (associating subxact with toplevel xacts) in WAL right away, and thanks to logging the invalidation messages at each command end. These features are added by commits 0bead9af48 and c55040ccd0 respectively. Now that we can stream in-progress transactions, the concurrent aborts may cause failures when the output plugin consults catalogs (both system and user-defined). We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table scan APIs to the backend or WALSender decoding a specific uncommitted transaction. The decoding logic on the receipt of such a sqlerrcode aborts the decoding of the current transaction and continue with the decoding of other transactions. We have ReorderBufferTXN pointer in each ReorderBufferChange by which we know which xact it belongs to. The output plugin can use this to decide which changes to discard in case of stream_abort_cb (e.g. when a subxact gets discarded). We also provide a new option via SQL APIs to fetch the changes being streamed. Author: Dilip Kumar, Tomas Vondra, Amit Kapila, Nikhil Sontakke Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
* Move syncscan.c to src/backend/access/common.Thomas Munro2020-07-29
| | | | | | | | | Since the tableam.c code needs to make use of the syncscan.c routines itself, and since other block-oriented AMs might also want to use it one day, it didn't make sense for it to live under src/backend/access/heap. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGLCnG%3DNEAByg6bk%2BCT9JZD97Y%3DAxKhh27Su9FeGWOKvDg%40mail.gmail.com
* Allocate consecutive blocks during parallel seqscansDavid Rowley2020-07-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously we would allocate blocks to parallel workers during a parallel sequential scan 1 block at a time. Since other workers were likely to request a block before a worker returns for another block number to work on, this could lead to non-sequential I/O patterns in each worker which could cause the operating system's readahead to perform poorly or not at all. Here we change things so that we allocate consecutive "chunks" of blocks to workers and have them work on those until they're done, at which time we allocate another chunk for the worker. The size of these chunks is based on the size of the relation. Initial patch here was by Thomas Munro which showed some good improvements just having a fixed chunk size of 64 blocks with a simple ramp-down near the end of the scan. The revisions of the patch to make the chunk size based on the relation size and the adjusted ramp-down in powers of two was done by me, along with quite extensive benchmarking to determine the optimal chunk sizes. For the most part, benchmarks have shown significant performance improvements for large parallel sequential scans on Linux, FreeBSD and Windows using SSDs. It's less clear how this affects the performance of cloud providers. Tests done so far are unable to obtain stable enough performance to provide meaningful benchmark results. It is possible that this could cause some performance regressions on more obscure filesystems, so we may need to later provide users with some ability to get something closer to the old behavior. For now, let's leave that until we see that it's really required. Author: Thomas Munro, David Rowley Reviewed-by: Ranier Vilela, Soumyadeep Chakraborty, Robert Haas Reviewed-by: Amit Kapila, Kirk Jamison Discussion: https://postgr.es/m/CA+hUKGJ_EErDv41YycXcbMbCBkztA34+z1ts9VQH+ACRuvpxig@mail.gmail.com
* Fix misuse of table_index_fetch_tuple_check().Peter Geoghegan2020-06-25
| | | | | | | | | | | | | Commit 0d861bbb, which added deduplication to nbtree, had _bt_check_unique() pass a TID to table_index_fetch_tuple_check() that isn't safe to mutate. table_index_fetch_tuple_check()'s tid argument is modified when the TID in question is not the latest visible tuple in a hot chain, though this wasn't documented. To fix, go back to using a local copy of the TID in _bt_check_unique(), and update comments above table_index_fetch_tuple_check(). Backpatch: 13-, where B-Tree deduplication was introduced.
* Introduce macros for typalign and typstorage constants.Tom Lane2020-03-04
| | | | | | | | | | | | | | | | | | | | | Our usual practice for "poor man's enum" catalog columns is to define macros for the possible values and use those, not literal constants, in C code. But for some reason lost in the mists of time, this was never done for typalign/attalign or typstorage/attstorage. It's never too late to make it better though, so let's do that. The reason I got interested in this right now is the need to duplicate some uses of the TYPSTORAGE constants in an upcoming ALTER TYPE patch. But in general, this sort of change aids greppability and readability, so it's a good idea even without any specific motivation. I may have missed a few places that could be converted, and it's even more likely that pending patches will re-introduce some hard-coded references. But that's not fatal --- there's no expectation that we'd actually change any of these values. We can clean up stragglers over time. Discussion: https://postgr.es/m/16457.1583189537@sss.pgh.pa.us