aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access/gist
Commit message (Collapse)AuthorAge
* Fix typo in comment of gistdoinsert().Masahiko Sawada2024-11-04
| | | | | | Author: Tender Wang Reviewed-by: Masahiko Sawada Discussion: https://postgr.es/m/CAHewXN%3D3sH2sNw4nC3QGCEVw1Lftmw9m5y1Xje0bXK6ApDrsPQ%40mail.gmail.com
* Add temporal PRIMARY KEY and UNIQUE constraintsPeter Eisentraut2024-09-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add WITHOUT OVERLAPS clause to PRIMARY KEY and UNIQUE constraints. These are backed by GiST indexes instead of B-tree indexes, since they are essentially exclusion constraints with = for the scalar parts of the key and && for the temporal part. (previously committed as 46a0cd4cefb, reverted by 46a0cd4cefb; the new part is this:) Because 'empty' && 'empty' is false, the temporal PK/UQ constraint allowed duplicates, which is confusing to users and breaks internal expectations. For instance, when GROUP BY checks functional dependencies on the PK, it allows selecting other columns from the table, but in the presence of duplicate keys you could get the value from any of their rows. So we need to forbid empties. This all means that at the moment we can only support ranges and multiranges for temporal PK/UQs, unlike the original patch (above). Documentation and tests for this are added. But this could conceivably be extended by introducing some more general support for the notion of "empty" for other types. Author: Paul A. Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
* Add stratnum GiST support functionPeter Eisentraut2024-09-17
| | | | | | | | | | | | | | | | | | | | | This is support function 12 for the GiST AM and translates "well-known" RT*StrategyNumber values into whatever strategy number is used by the opclass (since no particular numbers are actually required). We will use this to support temporal PRIMARY KEY/UNIQUE/FOREIGN KEY/FOR PORTION OF functionality. This commit adds two implementations, one for internal GiST opclasses (just an identity function) and another for btree_gist opclasses. It updates btree_gist from 1.7 to 1.8, adding the support function for all its opclasses. (previously committed as 6db4598fcb8, reverted by 8aee330af55; this is essentially unchanged from those) Author: Paul A. Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
* Replace gratuitous memmove() with memcpy()Peter Eisentraut2024-09-11
| | | | | | | | | | The index access methods all had similar code that copied the passed-in scan keys to local storage. They all used memmove() for that, which is not wrong, but it seems confusing not to use memcpy() when that would work. Presumably, this was all once copied from ancient code and never adjusted. Discussion: https://www.postgresql.org/message-id/flat/f8c739d9-f48d-4187-b214-df3391ba41ab@eisentraut.org
* Add amgettreeheight index AM API routinePeter Eisentraut2024-09-10
| | | | | | | | | | | The only current implementation is for btree where it calls _bt_getrootheight(). Other index types can now also use this to pass information to their amcostestimate routine. Previously, btree was hardcoded and other index types could not hook into the optimizer at this point. Author: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
* Revert temporal primary keys and foreign keysPeter Eisentraut2024-05-16
| | | | | | | | | | | | | | | | | | | | This feature set did not handle empty ranges correctly, and it's now too late for PostgreSQL 17 to fix it. The following commits are reverted: 6db4598fcb8 Add stratnum GiST support function 46a0cd4cefb Add temporal PRIMARY KEY and UNIQUE constraints 86232a49a43 Fix comment on gist_stratnum_btree 030e10ff1a3 Rename pg_constraint.conwithoutoverlaps to conperiod a88c800deb6 Use daterange and YMD in without_overlaps tests instead of tsrange. 5577a71fb0c Use half-open interval notation in without_overlaps tests 34768ee3616 Add temporal FOREIGN KEY contraints 482e108cd38 Add test for REPLICA IDENTITY with a temporal key c3db1f30cba doc: clarify PERIOD and WITHOUT OVERLAPS in CREATE TABLE 144c2ce0cc7 Fix ON CONFLICT DO NOTHING/UPDATE for temporal indexes Discussion: https://www.postgresql.org/message-id/d0b64a7a-dfe4-4b84-a906-c7dedfa40a3e@eisentraut.org
* Merge prune, freeze and vacuum WAL record formatsHeikki Linnakangas2024-03-25
| | | | | | | | | | | | | | | | | | | | | | | The new combined WAL record is now used for pruning, freezing and 2nd pass of vacuum. This is in preparation for changing VACUUM to write a combined prune+freeze record per page, instead of separate two records. The new WAL record format now supports that, but the code still always writes separate records for pruning and freezing. This reserves separate XLOG_HEAP2_* info codes for when the pruning record is emitted for on-access pruning or VACUUM, per Peter Geoghegan's suggestion. The record format is identical, but having separate info codes makes it easier analyze pruning and vacuuming with pg_waldump. The function to emit the new WAL record, log_heap_prune_and_freeze(), is in pruneheap.c. The existing heap_log_freeze_plan() and its subroutines are moved to pruneheap.c without changes, to keep them together with log_heap_prune_and_freeze(). Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://www.postgresql.org/message-id/CAAKRu_azf-zH%3DDgVbquZ3tFWjMY1w5pO8m-TXJaMdri8z3933g@mail.gmail.com Discussion: https://www.postgresql.org/message-id/CAAKRu_b2oE4GL%3Dq4g9mcByS9yT7wTQvEH9OLpabj28e%2BWKFi2A@mail.gmail.com
* Remove unused #include's from backend .c filesPeter Eisentraut2024-03-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | as determined by include-what-you-use (IWYU) While IWYU also suggests to *add* a bunch of #include's (which is its main purpose), this patch does not do that. In some cases, a more specific #include replaces another less specific one. Some manual adjustments of the automatic result: - IWYU currently doesn't know about includes that provide global variable declarations (like -Wmissing-variable-declarations), so those includes are being kept manually. - All includes for port(ability) headers are being kept for now, to play it safe. - No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the patch from exploding in size. Note that this patch touches just *.c files, so nothing declared in header files changes in hidden ways. As a small example, in src/backend/access/transam/rmgr.c, some IWYU pragma annotations are added to handle a special case there. Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
* Introduce a new smgr bulk loading facility.Heikki Linnakangas2024-02-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new facility makes it easier to optimize bulk loading, as the logic for buffering, WAL-logging, and syncing the relation only needs to be implemented once. It's also less error-prone: We have had a number of bugs in how a relation is fsync'd - or not - at the end of a bulk loading operation. By centralizing that logic to one place, we only need to write it correctly once. The new facility is faster for small relations: Instead of of calling smgrimmedsync(), we register the fsync to happen at next checkpoint, which avoids the fsync latency. That can make a big difference if you are e.g. restoring a schema-only dump with lots of relations. It is also slightly more efficient with large relations, as the WAL logging is performed multiple pages at a time. That avoids some WAL header overhead. The sorted GiST index build did that already, this moves the buffering to the new facility. The changes to pageinspect GiST test needs an explanation: Before this patch, the sorted GiST index build set the LSN on every page to the special GistBuildLSN value, not the LSN of the WAL record, even though they were WAL-logged. There was no particular need for it, it just happened naturally when we wrote out the pages before WAL-logging them. Now we WAL-log the pages first, like in B-tree build, so the pages are stamped with the record's real LSN. When the build is not WAL-logged, we still use GistBuildLSN. To make the test output predictable, use an unlogged index. Reviewed-by: Andres Freund Discussion: https://www.postgresql.org/message-id/30e8f366-58b3-b239-c521-422122dd5150%40iki.fi
* Add temporal PRIMARY KEY and UNIQUE constraintsPeter Eisentraut2024-01-24
| | | | | | | | | | | | Add WITHOUT OVERLAPS clause to PRIMARY KEY and UNIQUE constraints. These are backed by GiST indexes instead of B-tree indexes, since they are essentially exclusion constraints with = for the scalar parts of the key and && for the temporal part. Author: Paul A. Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
* Add stratnum GiST support functionPeter Eisentraut2024-01-19
| | | | | | | | | | | | | | | | | | This is support function 12 for the GiST AM and translates "well-known" RT*StrategyNumber values into whatever strategy number is used by the opclass (since no particular numbers are actually required). We will use this to support temporal PRIMARY KEY/UNIQUE/FOREIGN KEY/FOR PORTION OF functionality. This commit adds two implementations, one for internal GiST opclasses (just an identity function) and another for btree_gist opclasses. It updates btree_gist from 1.7 to 1.8, adding the support function for all its opclasses. Author: Paul A. Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
* Update copyright for 2024Bruce Momjian2024-01-03
| | | | | | | | Reported-by: Michael Paquier Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz Backpatch-through: 12
* gist: fix typo "split(t)ed" -> "split"Robert Haas2024-01-02
| | | | | | Dagfinn Ilmari Mannsåker, reviewed by Shubham Khanna. Discussion: http://postgr.es/m/87le9fmi01.fsf@wibble.ilmari.org
* Allow parallel CREATE INDEX for BRIN indexesTomas Vondra2023-12-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow using multiple worker processes to build BRIN index, which until now was supported only for BTREE indexes. For large tables this often results in significant speedup when the build is CPU-bound. The work is split in a simple way - each worker builds BRIN summaries on a subset of the table, determined by the regular parallel scan used to read the data, and feeds them into a shared tuplesort which sorts them by blkno (start of the range). The leader then reads this sorted stream of ranges, merges duplicates (which may happen if the parallel scan does not align with BRIN pages_per_range), and adds the resulting ranges into the index. The number of duplicate results produced by workers (requiring merging in the leader process) should be fairly small, thanks to how parallel scans assign chunks to workers. The likelihood of duplicate results may increase for higher pages_per_range values, but then there are fewer page ranges in total. In any case, we expect the merging to be much cheaper than summarization, so this should be a win. Most of the parallelism infrastructure is a simplified copy of the code used by BTREE indexes, omitting the parts irrelevant for BRIN indexes (e.g. uniqueness checks). This also introduces a new index AM flag amcanbuildparallel, determining whether to attempt to start parallel workers for the index build. Original patch by me, with reviews and substantial reworks by Matthias van de Meent, certainly enough to make him a co-author. Author: Tomas Vondra, Matthias van de Meent Reviewed-by: Matthias van de Meent Discussion: https://postgr.es/m/c2ee7d69-ce17-43f2-d1a0-9811edbda6e6%40enterprisedb.com
* Reuse BrinDesc and BrinRevmap in brininsertTomas Vondra2023-11-25
| | | | | | | | | | | | | | | | The brininsert code used to initialize (and destroy) BrinDesc and BrinRevmap for each tuple, which is not free. This patch initializes these structures only once, and reuses them for all inserts in the same command. The data is passed through indexInfo->ii_AmCache. This also introduces an optional AM callback "aminsertcleanup" that allows performing custom cleanup in case simply pfree-ing ii_AmCache is not sufficient (which is the case when the cache contains TupleDesc, Buffers, and so on). Author: Soumyadeep Chakraborty Reviewed-by: Alvaro Herrera, Matthias van de Meent, Tomas Vondra Discussion: https://postgr.es/m/CAE-ML%2B9r2%3DaO1wwji1sBN9gvPz2xRAtFUGfnffpd0ZqyuzjamA%40mail.gmail.com
* Add trailing commas to enum definitionsPeter Eisentraut2023-10-26
| | | | | | | | | | | | | | | | | | | | Since C99, there can be a trailing comma after the last value in an enum definition. A lot of new code has been introducing this style on the fly. Some new patches are now taking an inconsistent approach to this. Some add the last comma on the fly if they add a new last value, some are trying to preserve the existing style in each place, some are even dropping the last comma if there was one. We could nudge this all in a consistent direction if we just add the trailing commas everywhere once. I omitted a few places where there was a fixed "last" value that will always stay last. I also skipped the header files of libpq and ecpg, in case people want to use those with older compilers. There were also a small number of cases where the enum type wasn't used anywhere (but the enum values were), which ended up confusing pgindent a bit, so I left those alone. Discussion: https://www.postgresql.org/message-id/flat/386f8c45-c8ac-4681-8add-e3b0852c1620%40eisentraut.org
* Add const to values and nulls argumentsPeter Eisentraut2023-10-10
| | | | | | | This excludes any changes that would change the external AM APIs. Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://www.postgresql.org/message-id/flat/14c31f4a-0347-0805-dce8-93a9072c05a5%40eisentraut.org
* Fix another bug in parent page splitting during GiST index build.Heikki Linnakangas2023-09-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Yet another bug in the ilk of commits a7ee7c851 and 741b88435. In 741b88435, we took care to clear the memorized location of the downlink when we split the parent page, because splitting the parent page can move the downlink. But we missed that even *updating* a tuple on the parent can move it, because updating a tuple on a gist page is implemented as a delete+insert, so the updated tuple gets moved to the end of the page. This commit fixes the bug in two different ways (belt and suspenders): 1. Clear the downlink when we update a tuple on the parent page, even if it's not split. This the same approach as in commits a7ee7c851 and 741b88435. I also noticed that gistFindCorrectParent did not clear the 'downlinkoffnum' when it stepped to the right sibling. Fix that too, as it seems like a clear bug even though I haven't been able to find a test case to hit that. 2. Change gistFindCorrectParent so that it treats 'downlinkoffnum' merely as a hint. It now always first checks if the downlink is still at that location, and if not, it scans the page like before. That's more robust if there are still more cases where we fail to clear 'downlinkoffnum' that we haven't yet uncovered. With this, it's no longer necessary to meticulously clear 'downlinkoffnum', so this makes the previous fixes unnecessary, but I didn't revert them because it still seems nice to clear it when we know that the downlink has moved. Also add the test case using the same test data that Alexander posted. I tried to reduce it to a smaller test, and I also tried to reproduce this with different test data, but I was not able to, so let's just include what we have. Backpatch to v12, like the previous fixes. Reported-by: Alexander Lakhin Discussion: https://www.postgresql.org/message-id/18129-caca016eaf0c3702@postgresql.org
* Fix GiST README's explanation of the NSN cross-check.Heikki Linnakangas2023-09-19
| | | | | | | | The text got the condition backwards, it's "NSN > LSN", not "NSN < LSN". While we're at it, expand it a little for clarity. Reviewed-by: Daniel Gustafsson Discussion: https://www.postgresql.org/message-id/4cb46e18-e688-524a-0f73-b1f03ed5d6ee@iki.fi
* Remove the "snapshot too old" feature.Thomas Munro2023-09-05
| | | | | | | | | | | | | | | | | Remove the old_snapshot_threshold setting and mechanism for producing the error "snapshot too old", originally added by commit 848ef42b. Unfortunately it had a number of known problems in terms of correctness and performance, mostly reported by Andres in the course of his work on snapshot scalability. We agreed to remove it, after a long period without an active plan to fix it. This is certainly a desirable feature, and someone might propose a new or improved implementation in the future. Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CACG%3DezYV%2BEvO135fLRdVn-ZusfVsTY6cH1OZqWtezuEYH6ciQA%40mail.gmail.com Discussion: https://postgr.es/m/20200401064008.qob7bfnnbu4w5cw4%40alap3.anarazel.de Discussion: https://postgr.es/m/CA%2BTgmoY%3Daqf0zjTD%2B3dUWYkgMiNDegDLFjo%2B6ze%3DWtpik%2B3XqA%40mail.gmail.com
* ExtendBufferedWhat -> BufferManagerRelation.Thomas Munro2023-08-23
| | | | | | | | | | | | | | | | | Commit 31966b15 invented a way for functions dealing with relation extension to accept a Relation in online code and an SMgrRelation in recovery code. It seems highly likely that future bufmgr.c interfaces will face the same problem, and need to do something similar. Generalize the names so that each interface doesn't have to re-invent the wheel. Back-patch to 16. Since extension AM authors might start using the constructor macros once 16 ships, we agreed to do the rename in 16 rather than waiting for 17. Reviewed-by: Peter Geoghegan <pg@bowt.ie> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKG%2B6tLD2BhpRWycEoti6LVLyQq457UL4ticP5xd8LqHySA%40mail.gmail.com
* Pre-beta mechanical code beautification.Tom Lane2023-05-19
| | | | | | | | | | | | | | | Run pgindent, pgperltidy, and reformat-dat-files. This set of diffs is a bit larger than typical. We've updated to pg_bsd_indent 2.1.2, which properly indents variable declarations that have multi-line initialization expressions (the continuation lines are now indented one tab stop). We've also updated to perltidy version 20230309 and changed some of its settings, which reduces its desire to add whitespace to lines to make assignments etc. line up. Going forward, that should make for fewer random-seeming changes to existing code. Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
* Fix typos in commentsMichael Paquier2023-05-02
| | | | | | | | | The changes done in this commit impact comments with no direct user-visible changes, with fixes for incorrect function, variable or structure names. Author: Alexander Lakhin Discussion: https://postgr.es/m/e8c38840-596a-83d6-bd8d-cebc51111572@gmail.com
* Handle logical slot conflicts on standbyAndres Freund2023-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During WAL replay on the standby, when a conflict with a logical slot is identified, invalidate such slots. There are two sources of conflicts: 1) Using the information added in 6af1793954e, logical slots are invalidated if required rows are removed 2) wal_level on the primary server is reduced to below logical Uses the infrastructure introduced in the prior commit. FIXME: add commit reference. Change InvalidatePossiblyObsoleteSlot() to use a recovery conflict to interrupt use of a slot, if called in the startup process. The new recovery conflict is added to pg_stat_database_conflicts, as confl_active_logicalslot. See 6af1793954e for an overall design of logical decoding on a standby. Bumps catversion for the addition of the pg_stat_database_conflicts column. Bumps PGSTAT_FILE_FORMAT_ID for the same reason. Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Author: Andres Freund <andres@anarazel.de> Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version) Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20230407075009.igg7be27ha2htkbt@awork3.anarazel.de
* Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.Thomas Munro2023-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to have the option to use O_DIRECT/FILE_FLAG_NO_BUFFERING in a later commit, we need the addresses of user space buffers to be well aligned. The exact requirements vary by OS and file system (typically sectors and/or memory pages). The address alignment size is set to 4096, which is enough for currently known systems: it matches modern sectors and common memory page size. There is no standard governing O_DIRECT's requirements so we might eventually have to reconsider this with more information from the field or future systems. Aligning I/O buffers on memory pages is also known to improve regular buffered I/O performance. Three classes of I/O buffers for regular data pages are adjusted: (1) Heap buffers are now allocated with the new palloc_aligned() or MemoryContextAllocAligned() functions introduced by commit 439f6175. (2) Stack buffers now use a new struct PGIOAlignedBlock to respect PG_IO_ALIGN_SIZE, if possible with this compiler. (3) The buffer pool is also aligned in shared memory. WAL buffers were already aligned on XLOG_BLCKSZ. It's possible for XLOG_BLCKSZ to be configured smaller than PG_IO_ALIGNED_SIZE and thus for O_DIRECT WAL writes to fail to be well aligned, but that's a pre-existing condition and will be addressed by a later commit. BufFiles are not yet addressed (there's no current plan to use O_DIRECT for those, but they could potentially get some incidental speedup even in plain buffered I/O operations through better alignment). If we can't align stack objects suitably using the compiler extensions we know about, we disable the use of O_DIRECT by setting PG_O_DIRECT to 0. This avoids the need to consider systems that have O_DIRECT but can't align stack objects the way we want; such systems could in theory be supported with more work but we don't currently know of any such machines, so it's easier to pretend there is no O_DIRECT support instead. That's an existing and tested class of system. Add assertions that all buffers passed into smgrread(), smgrwrite() and smgrextend() are correctly aligned, unless PG_O_DIRECT is 0 (= stack alignment tricks may be unavailable) or the block size has been set too small to allow arrays of buffers to be all aligned. Author: Thomas Munro <thomas.munro@gmail.com> Author: Andres Freund <andres@anarazel.de> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
* Convert many uses of ReadBuffer[Extended](P_NEW) to ExtendBufferedRel()Andres Freund2023-04-05
| | | | | | | | | A few places are not converted. Some because they are tackled in later commits (e.g. hio.c, xlogutils.c), some because they are more complicated (e.g. brin_pageops.c). Having a few users of ReadBuffer(P_NEW) is good anyway, to ensure the backward compat path stays working. Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
* Add info in WAL records in preparation for logical slot conflict handlingAndres Freund2023-04-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit only implements one prerequisite part for allowing logical decoding. The commit message contains an explanation of the overall design, which later commits will refer back to. Overall design: 1. We want to enable logical decoding on standbys, but replay of WAL from the primary might remove data that is needed by logical decoding, causing error(s) on the standby. To prevent those errors, a new replication conflict scenario needs to be addressed (as much as hot standby does). 2. Our chosen strategy for dealing with this type of replication slot is to invalidate logical slots for which needed data has been removed. 3. To do this we need the latestRemovedXid for each change, just as we do for physical replication conflicts, but we also need to know whether any particular change was to data that logical replication might access. That way, during WAL replay, we know when there is a risk of conflict and, if so, if there is a conflict. 4. We can't rely on the standby's relcache entries for this purpose in any way, because the startup process can't access catalog contents. 5. Therefore every WAL record that potentially removes data from the index or heap must carry a flag indicating whether or not it is one that might be accessed during logical decoding. Why do we need this for logical decoding on standby? First, let's forget about logical decoding on standby and recall that on a primary database, any catalog rows that may be needed by a logical decoding replication slot are not removed. This is done thanks to the catalog_xmin associated with the logical replication slot. But, with logical decoding on standby, in the following cases: - hot_standby_feedback is off - hot_standby_feedback is on but there is no a physical slot between the primary and the standby. Then, hot_standby_feedback will work, but only while the connection is alive (for example a node restart would break it) Then, the primary may delete system catalog rows that could be needed by the logical decoding on the standby (as it does not know about the catalog_xmin on the standby). So, it’s mandatory to identify those rows and invalidate the slots that may need them if any. Identifying those rows is the purpose of this commit. Implementation: When a WAL replay on standby indicates that a catalog table tuple is to be deleted by an xid that is greater than a logical slot's catalog_xmin, then that means the slot's catalog_xmin conflicts with the xid, and we need to handle the conflict. While subsequent commits will do the actual conflict handling, this commit adds a new field isCatalogRel in such WAL records (and a new bit set in the xl_heap_visible flags field), that is true for catalog tables, so as to arrange for conflict handling. The affected WAL records are the ones that already contain the snapshotConflictHorizon field, namely: - gistxlogDelete - gistxlogPageReuse - xl_hash_vacuum_one_page - xl_heap_prune - xl_heap_freeze_page - xl_heap_visible - xl_btree_reuse_page - xl_btree_delete - spgxlogVacuumRedirect Due to this new field being added, xl_hash_vacuum_one_page and gistxlogDelete do now contain the offsets to be deleted as a FLEXIBLE_ARRAY_MEMBER. This is needed to ensure correct alignment. It's not needed on the others struct where isCatalogRel has been added. This commit just introduces the WAL format changes mentioned above. Handling the actual conflicts will follow in future commits. Bumps XLOG_PAGE_MAGIC as the several WAL records are changed. Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Author: Andres Freund <andres@anarazel.de> (in an older version) Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version) Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
* Pass down table relation into more index relation functionsAndres Freund2023-04-01
| | | | | | | | | | | | This is done in preparation for logical decoding on standby, which needs to include whether visibility affecting WAL records are about a (user) catalog table. Which is only known for the table, not the indexes. It's also nice to be able to pass the heap relation to GlobalVisTestFor() in vacuumRedirectAndPlaceholder(). Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/21b700c3-eecf-2e05-a699-f8c78dd31ec7@gmail.com
* Fix dereference of dangling pointer in GiST index buffering build.Tom Lane2023-03-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | gistBuildCallback tried to fetch the size of an index tuple that might have already been freed by gistProcessEmptyingQueue. While this seems to usually be harmless in production builds, in principle it could result in a SIGSEGV, or more likely a bogus value for indtuplesSize leading to poor page-split decisions later in the build. The memory management here is confusing and could stand to be refactored, but for the moment it seems to be enough to fetch the tuple size sooner. AFAICT the indtuples[Size] totals aren't used in between these places; even if they were, the updated values shouldn't be any worse to use. So just move the incrementing of the totals up. It's not very clear why our valgrind-using buildfarm animals haven't noticed this problem, because the relevant code path does seem to be exercised according to the code coverage report. I think the reason that we didn't fix this bug after the first report is that I'd wanted to try to understand that better. However, now that it's been re-discovered let's just be pragmatic and fix it already. Original report by Alexander Lakhin (bug #16329), later rediscovered by Egor Chindyaskin (bug #17874). Patch by Alexander Lakhin (commentary by Pavel Borisov and me). Back-patch to all supported branches. Discussion: https://postgr.es/m/16329-7a6aa9b6fa1118a1@postgresql.org Discussion: https://postgr.es/m/17874-63ca6c7ce42d2103@postgresql.org
* Ignore BRIN indexes when checking for HOT updatesTomas Vondra2023-03-20
| | | | | | | | | | | | | | | | | | | | | | | | When determining whether an index update may be skipped by using HOT, we can ignore attributes indexed by block summarizing indexes without references to individual tuples that need to be cleaned up. A new type TU_UpdateIndexes provides a signal to the executor to determine which indexes to update - no indexes, all indexes, or only the summarizing indexes. This also removes rd_indexattr list, and replaces it with rd_attrsvalid flag. The list was not used anywhere, and a simple flag is sufficient. This was originally committed as 5753d4ee32, but then got reverted by e3fcca0d0d because of correctness issues. Original patch by Josef Simanek, various fixes and improvements by Tomas Vondra and me. Authors: Matthias van de Meent, Josef Simanek, Tomas Vondra Reviewed-by: Tomas Vondra, Alvaro Herrera Discussion: https://postgr.es/m/05ebcb44-f383-86e3-4f31-0a97a55634cf@enterprisedb.com Discussion: https://postgr.es/m/CAFp7QwpMRGcDAQumN7onN9HjrJ3u4X3ZRXdGFT0K5G2JWvnbWg%40mail.gmail.com
* Remove useless casts to (void *) in arguments of some system functionsPeter Eisentraut2023-02-07
| | | | | | | | The affected functions are: bsearch, memcmp, memcpy, memset, memmove, qsort, repalloc Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/fd9adf5d-b1aa-e82f-e4c7-263c30145807%40enterprisedb.com
* Remove useless casts to (void *) in hash_search() callsPeter Eisentraut2023-02-06
| | | | | | | | | | | Some of these appear to be leftovers from when hash_search() took a char * argument (changed in 5999e78fc45dcb91784b64b6e9ae43f4e4f68ca2). Since after this there is some more horizontal space available, do some light reformatting where suitable. Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/fd9adf5d-b1aa-e82f-e4c7-263c30145807%40enterprisedb.com
* Add BufFileRead variants with short read and EOF detectionPeter Eisentraut2023-01-16
| | | | | | | | | | | | | | | Most callers of BufFileRead() want to check whether they read the full specified length. Checking this at every call site is very tedious. This patch provides additional variants BufFileReadExact() and BufFileReadMaybeEOF() that include the length checks. I considered changing BufFileRead() itself, but this function is also used in extensions, and so changing the behavior like this would create a lot of problems there. The new names are analogous to the existing LogicalTapeReadExact(). Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/f3501945-c591-8cc3-5ef0-b72a2e0eaa9c@enterprisedb.com
* Update copyright for 2023Bruce Momjian2023-01-02
| | | | Backpatch-through: 11
* Add const to BufFileWritePeter Eisentraut2022-12-30
| | | | | | | | Make data buffer argument to BufFileWrite a const pointer and bubble this up to various callers and related APIs. This makes the APIs clearer and more consistent. Discussion: https://www.postgresql.org/message-id/flat/11dda853-bb5b-59ba-a746-e168b1ce4bdb%40enterprisedb.com
* Add copyright notices to meson filesAndrew Dunstan2022-12-20
| | | | Discussion: https://postgr.es/m/222b43a5-2fb3-2c1b-9cd0-375d376c8246@dunslane.net
* Standardize rmgrdesc recovery conflict XID output.Peter Geoghegan2022-11-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Standardize on the name snapshotConflictHorizon for all XID fields from WAL records that generate recovery conflicts when in hot standby mode. This supersedes the previous latestRemovedXid naming convention. The new naming convention places emphasis on how the values are actually used by REDO routines. How the values are generated during original execution (details of which vary by record type) is deemphasized. Users of tools like pg_waldump can now grep for snapshotConflictHorizon to see all potential sources of recovery conflicts in a standardized way, without necessarily having to consider which specific record types might be involved. Also bring a couple of WAL record types that didn't follow any kind of naming convention into line. These are heapam's VISIBLE record type and SP-GiST's VACUUM_REDIRECT record type. Now every WAL record whose REDO routine calls ResolveRecoveryConflictWithSnapshot() passes through the snapshotConflictHorizon field from its WAL record. This is follow-up work to the refactoring from commit 9e540599 that made FREEZE_PAGE WAL records use a standard snapshotConflictHorizon style XID cutoff. No bump in XLOG_PAGE_MAGIC, since the underlying format of affected WAL records doesn't change. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAH2-Wzm2CQUmViUq7Opgk=McVREHSOorYaAjR1ZpLYkRN7_dPw@mail.gmail.com
* Remove unnecessary uses of Abs()Peter Eisentraut2022-10-07
| | | | | | | | Use C standard abs() or fabs() instead. Reviewed-by: Zhang Mingli <zmlpostgres@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/4beb42b5-216b-bce8-d452-d924d5794c63%40enterprisedb.com
* Convert *GetDatum() and DatumGet*() macros to inline functionsPeter Eisentraut2022-09-27
| | | | | | | | | | | | | | The previous macro implementations just cast the argument to a target type but did not check whether the input type was appropriate. The function implementation can do better type checking of the input type. For the *GetDatumFast() macros, converting to an inline function doesn't work in the !USE_FLOAT8_BYVAL case, but we can use AssertVariableIsOfTypeMacro() to get a similar level of type checking. Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/8528fb7e-0aa2-6b54-85fb-0c0886dbd6ed%40enterprisedb.com
* meson: Add initial version of meson based build systemAndres Freund2022-09-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Autoconf is showing its age, fewer and fewer contributors know how to wrangle it. Recursive make has a lot of hard to resolve dependency issues and slow incremental rebuilds. Our home-grown MSVC build system is hard to maintain for developers not using Windows and runs tests serially. While these and other issues could individually be addressed with incremental improvements, together they seem best addressed by moving to a more modern build system. After evaluating different build system choices, we chose to use meson, to a good degree based on the adoption by other open source projects. We decided that it's more realistic to commit a relatively early version of the new build system and mature it in tree. This commit adds an initial version of a meson based build system. It supports building postgres on at least AIX, FreeBSD, Linux, macOS, NetBSD, OpenBSD, Solaris and Windows (however only gcc is supported on aix, solaris). For Windows/MSVC postgres can now be built with ninja (faster, particularly for incremental builds) and msbuild (supporting the visual studio GUI, but building slower). Several aspects (e.g. Windows rc file generation, PGXS compatibility, LLVM bitcode generation, documentation adjustments) are done in subsequent commits requiring further review. Other aspects (e.g. not installing test-only extensions) are not yet addressed. When building on Windows with msbuild, builds are slower when using a visual studio version older than 2019, because those versions do not support MultiToolTask, required by meson for intra-target parallelism. The plan is to remove the MSVC specific build system in src/tools/msvc soon after reaching feature parity. However, we're not planning to remove the autoconf/make build system in the near future. Likely we're going to keep at least the parts required for PGXS to keep working around until all supported versions build with meson. Some initial help for postgres developers is at https://wiki.postgresql.org/wiki/Meson With contributions from Thomas Munro, John Naylor, Stone Tickle and others. Author: Andres Freund <andres@anarazel.de> Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Author: Peter Eisentraut <peter@eisentraut.org> Reviewed-By: Peter Eisentraut <peter.eisentraut@enterprisedb.com> Discussion: https://postgr.es/m/20211012083721.hvixq4pnh2pixr3j@alap3.anarazel.de
* Suppress variable-set-but-not-used warnings from clang 15.Tom Lane2022-09-20
| | | | | | | | | | | | | | | | | | | | | | | | clang 15+ will issue a set-but-not-used warning when the only use of a variable is in autoincrements (e.g., "foo++;"). That's perfectly sensible, but it detects a few more cases that we'd not noticed before. Silence the warnings with our usual methods, such as PG_USED_FOR_ASSERTS_ONLY, or in one case by actually removing a useless variable. One thing that we can't nicely get rid of is that with %pure-parser, Bison emits "yynerrs" as a local variable that falls foul of this warning. To silence those, I inserted "(void) yynerrs;" in the top-level productions of affected grammars. Per recently-established project policy, this is a candidate for back-patching into out-of-support branches: it suppresses annoying compiler warnings but changes no behavior. Hence, back-patch to 9.5, which is as far as these patches go without issues. (A preliminary check shows that the prior branches need some other set-but-not-used cleanups too, so I'll leave them for another day.) Discussion: https://postgr.es/m/514615.1663615243@sss.pgh.pa.us
* Harmonize parameter names in storage and AM code.Peter Geoghegan2022-09-19
| | | | | | | | | | | | | | | Make sure that function declarations use names that exactly match the corresponding names from function definitions in storage, catalog, access method, executor, and logical replication code, as well as in miscellaneous utility/library code. Like other recent commits that cleaned up function parameter names, this commit was written with help from clang-tidy. Later commits will do the same for other parts of the codebase. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com
* Revert "Convert *GetDatum() and DatumGet*() macros to inline functions"Peter Eisentraut2022-09-12
| | | | | | This reverts commit 595836e99bf1ee6d43405b885fb69bb8c6d3ee23. It has problems when USE_FLOAT8_BYVAL is off.
* Convert *GetDatum() and DatumGet*() macros to inline functionsPeter Eisentraut2022-09-12
| | | | | | | | | The previous macro implementations just cast the argument to a target type but did not check whether the input type was appropriate. The function implementation can do better type checking of the input type. Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://www.postgresql.org/message-id/flat/8528fb7e-0aa2-6b54-85fb-0c0886dbd6ed%40enterprisedb.com
* Small refactor to get rid of -Wshadow=compatible-local warningDavid Rowley2022-08-26
| | | | | | | | | | Further reduce -Wshadow=compatible-local warnings by 1 by refactoring the code in gistRelocateBuildBuffersOnSplit() to make use of foreach_current_index() instead of manually incrementing a variable on each loop. Author: David Rowley Discussion: https://postgr.es/m/CAApHDvpGZX-X=Bn4moyXgfFa0CdSUwoa04d3isit3=1qo8F8Bw@mail.gmail.com
* Further reduce warnings with -Wshadow=compatible-localDavid Rowley2022-08-24
| | | | | | | | | | | | | | | | | | | | | | | In a similar effort to f01592f91, here we're targetting fixing the warnings that -Wshadow=compatible-local produces that we can fix by moving a variable to an inner scope to stop that variable from being shadowed by another variable declared somewhere later in the function. All of the warnings being fixed here are changing the scope of variables which are being used as an iterator for a "for" loop. In each instance, the fix happens to be changing the for loop to use the C99 type initialization. Much of this code likely pre-dates our use of C99. Reducing the scope of the outer scoped variable seems like the safest way to fix these. Renaming seems more likely to risk patches using the wrong variable. Reducing the scope is more likely to result in a compilation failure after applying some future patch rather than introducing bugs with it. By my count, this takes the warning count from 129 down to 114. Author: Justin Pryzby Discussion: https://postgr.es/m/CAApHDvrwLGBP%2BYw9vriayyf%3DXR4uPWP5jr6cQhP9au_kaDUhbA%40mail.gmail.com
* Change internal RelFileNode references to RelFileNumber or RelFileLocator.Robert Haas2022-07-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
* Revert changes in HOT handling of BRIN indexesTomas Vondra2022-06-16
| | | | | | | | | | | | | | | | | | | | | | | | This reverts commits 5753d4ee32 and fe60b67250 that modified HOT to ignore BRIN indexes. The commit message for 5753d4ee32 claims that: When determining whether an index update may be skipped by using HOT, we can ignore attributes indexed only by BRIN indexes. There are no index pointers to individual tuples in BRIN, and the page range summary will be updated anyway as it relies on visibility info. This is partially incorrect - it's true BRIN indexes don't point to individual tuples, so HOT chains are not an issue, but the visibitlity info is not sufficient to keep the index up to date. This can easily result in corrupted indexes, as demonstrated in the hackers thread. This does not mean relaxing the HOT restrictions for BRIN is a lost cause, but it needs to handle the two aspects (allowing HOT chains and updating the page range summaries) as separate. But that requires a major changes, and it's too late for that in the current dev cycle. Reported-by: Tomas Vondra Discussion: https://postgr.es/m/05ebcb44-f383-86e3-4f31-0a97a55634cf@enterprisedb.com
* Adjust tuplesort API to have bitwise option flagsDavid Rowley2022-04-04
| | | | | | | | | | | | | | | | This replaces the bool flag for randomAccess. An upcoming patch requires adding another option, so instead of breaking the API for that, then breaking it again one day if we add more options, let's just break it once. Any boolean options we add in the future will just make use of an unused bit in the flags. Any extensions making use of tuplesorts will need to update their code to pass TUPLESORT_RANDOMACCESS instead of true for randomAccess. TUPLESORT_NONE can be used for a set of empty options. Author: David Rowley Reviewed-by: Justin Pryzby Discussion: https://postgr.es/m/CAApHDvoH4ASzsAOyHcxkuY01Qf%2B%2B8JJ0paw%2B03dk%2BW25tQEcNQ%40mail.gmail.com
* Improve the generation memory allocatorDavid Rowley2022-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Here we make a series of improvements to the generation memory allocator, namely: 1. Allow generation contexts to have a minimum, initial and maximum block sizes. The standard allocator allows this already but when the generation context was added, it only allowed fixed-sized blocks. The problem with fixed-sized blocks is that it's difficult to choose how large to make the blocks. If the chosen size is too small then we'd end up with a large number of blocks and a large number of malloc calls. If the block size is made too large, then memory is wasted. 2. Add support for "keeper" blocks. This is a special block that is allocated along with the context itself but is never freed. Instead, when the last chunk in the keeper block is freed, we simply mark the block as empty to allow new allocations to make use of it. 3. Add facility to "recycle" newly empty blocks instead of freeing them and having to later malloc an entire new block again. We do this by recording a single GenerationBlock which has become empty of any chunks. When we run out of space in the current block, we check to see if there is a "freeblock" and use that if it contains enough space for the allocation. Author: David Rowley, Tomas Vondra Reviewed-by: Andy Fan Discussion: https://postgr.es/m/d987fd54-01f8-0f73-af6c-519f799a0ab8@enterprisedb.com