aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access/heap/heapam_handler.c
Commit message (Collapse)AuthorAge
* Report tuple address in data-corruption error messageAlvaro Herrera2021-08-30
| | | | | | | | | | | Most data-corruption reports mention the location of the problem, but this one failed to. Add it. Backpatch all the way back. In 12 and older, also assign the ERRCODE_DATA_CORRUPTED error code as was done in commit fd6ec93bf890 for 13 and later. Discussion: https://postgr.es/m/202108191637.oqyzrdtnheir@alvherre.pgsql
* Fix CLUSTER progress reporting of number of blocks scanned.Fujii Masao2020-11-27
| | | | | | | | | | | | | | | | | | | | | | | | | Previously pg_stat_progress_cluster view reported the current block number in heap scan as the number of heap blocks scanned (i.e., heap_blks_scanned). This reported number could be incorrect when synchronize_seqscans is enabled, because it allowed the heap scan to start at block in middle. This could result in wraparounds in the heap_blks_scanned column when the heap scan wrapped around. This commit fixes the bug by calculating the number of blocks from the block that the heap scan starts at to the current block in scan, and reporting that number in the heap_blks_scanned column. Also, in pg_stat_progress_cluster view, previously heap_blks_scanned could not reach heap_blks_total at the end of heap scan phase if the last pages scanned were empty. This commit fixes the bug by manually updating heap_blks_scanned to the same value as heap_blks_total when the heap scan phase finishes. Back-patch to v12 where pg_stat_progress_cluster view was introduced. Reported-by: Matthias van de Meent Author: Matthias van de Meent Reviewed-by: Fujii Masao Discussion: https://postgr.es/m/CAEze2WjCBWSGkVfYag001Rc4+-nNLDpWM7QbyD6yPvuhKs-gYQ@mail.gmail.com
* Handle new HOT chains in index-build table scansAlvaro Herrera2020-08-13
| | | | | | | | | | | | | | | | | | | | | | | | | | When a table is scanned by heapam_index_build_range_scan (née IndexBuildHeapScan) and the table lock being held allows concurrent data changes, it is possible for new HOT chains to sprout in a page that were unknown when the scan of a page happened. This leads to an error such as ERROR: failed to find parent tuple for heap-only tuple at (X,Y) in table "tbl" because the root tuple was not present when we first obtained the list of the page's root tuples. This can be fixed by re-obtaining the list of root tuples, if we see that a heap-only tuple appears to point to a non-existing root. This was reported by Anastasia as occurring for BRIN summarization (which exists since 9.5), but I think it could theoretically also happen with CREATE INDEX CONCURRENTLY (much older) or REINDEX CONCURRENTLY (very recent). It seems a happy coincidence that BRIN forces us to backpatch this all the way to 9.5. Reported-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru> Diagnosed-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru> Co-authored-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru> Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/602d8487-f0b2-5486-0088-0f372b2549fa@postgrespro.ru Backpatch: 9.5 - master
* Dial back -Wimplicit-fallthrough to level 3Alvaro Herrera2020-05-13
| | | | | | | | | The additional pain from level 4 is excessive for the gain. Also revert all the source annotation changes to their original wordings, to avoid back-patching pain. Discussion: https://postgr.es/m/31166.1589378554@sss.pgh.pa.us
* Add -Wimplicit-fallthrough to CFLAGS and CXXFLAGSAlvaro Herrera2020-05-12
| | | | | | | | | | | | | | | Use it at level 4, a bit more restrictive than the default level, and tweak our commanding comments to FALLTHROUGH. (However, leave zic.c alone, since it's external code; to avoid the warnings that would appear there, change CFLAGS for that file in the Makefile.) Author: Julien Rouhaud <rjuju123@gmail.com> Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/20200412081825.qyo5vwwco3fv4gdo@nol Discussion: https://postgr.es/m/flat/E1fDenm-0000C8-IJ@gemulon.postgresql.org
* Skip WAL for new relfilenodes, under wal_level=minimal.Noah Misch2020-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Until now, only selected bulk operations (e.g. COPY) did this. If a given relfilenode received both a WAL-skipping COPY and a WAL-logged operation (e.g. INSERT), recovery could lose tuples from the COPY. See src/backend/access/transam/README section "Skipping WAL for New RelFileNode" for the new coding rules. Maintainers of table access methods should examine that section. To maintain data durability, just before commit, we choose between an fsync of the relfilenode and copying its contents to WAL. A new GUC, wal_skip_threshold, guides that choice. If this change slows a workload that creates small, permanent relfilenodes under wal_level=minimal, try adjusting wal_skip_threshold. Users setting a timeout on COMMIT may need to adjust that timeout, and log_min_duration_statement analysis will reflect time consumption moving to COMMIT from commands like COPY. Internally, this requires a reliable determination of whether RollbackAndReleaseCurrentSubTransaction() would unlink a relation's current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the specification of rd_createSubid such that the field is zero when a new rel has an old rd_node. Make relcache.c retain entries for certain dropped relations until end of transaction. Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN. Future servers accept older WAL, so this bump is discretionary. Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert Haas. Heikki Linnakangas and Michael Paquier implemented earlier designs that materially clarified the problem. Reviewed, in earlier designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane, Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout. Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
* Revert "Skip WAL for new relfilenodes, under wal_level=minimal."Noah Misch2020-03-22
| | | | | | | | This reverts commit cb2fd7eac285b1b0a24eeb2b8ed4456b66c5a09f. Per numerous buildfarm members, it was incompatible with parallel query, and a test case assumed LP64. Back-patch to 9.5 (all supported versions). Discussion: https://postgr.es/m/20200321224920.GB1763544@rfd.leadboat.com
* Skip WAL for new relfilenodes, under wal_level=minimal.Noah Misch2020-03-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Until now, only selected bulk operations (e.g. COPY) did this. If a given relfilenode received both a WAL-skipping COPY and a WAL-logged operation (e.g. INSERT), recovery could lose tuples from the COPY. See src/backend/access/transam/README section "Skipping WAL for New RelFileNode" for the new coding rules. Maintainers of table access methods should examine that section. To maintain data durability, just before commit, we choose between an fsync of the relfilenode and copying its contents to WAL. A new GUC, wal_skip_threshold, guides that choice. If this change slows a workload that creates small, permanent relfilenodes under wal_level=minimal, try adjusting wal_skip_threshold. Users setting a timeout on COMMIT may need to adjust that timeout, and log_min_duration_statement analysis will reflect time consumption moving to COMMIT from commands like COPY. Internally, this requires a reliable determination of whether RollbackAndReleaseCurrentSubTransaction() would unlink a relation's current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the specification of rd_createSubid such that the field is zero when a new rel has an old rd_node. Make relcache.c retain entries for certain dropped relations until end of transaction. Back-patch to 9.5 (all supported versions). This introduces a new WAL record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC. As always, update standby systems before master systems. This changes sizeof(RelationData) and sizeof(IndexStmt), breaking binary compatibility for affected extensions. (The most recent commit to affect the same class of extensions was 089e4d405d0f3b94c74a2c6a54357a84a681754b.) Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert Haas. Heikki Linnakangas and Michael Paquier implemented earlier designs that materially clarified the problem. Reviewed, in earlier designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane, Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout. Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
* Introduce macros for typalign and typstorage constants.Tom Lane2020-03-04
| | | | | | | | | | | | | | | | | | | | | Our usual practice for "poor man's enum" catalog columns is to define macros for the possible values and use those, not literal constants, in C code. But for some reason lost in the mists of time, this was never done for typalign/attalign or typstorage/attstorage. It's never too late to make it better though, so let's do that. The reason I got interested in this right now is the need to duplicate some uses of the TYPSTORAGE constants in an upcoming ALTER TYPE patch. But in general, this sort of change aids greppability and readability, so it's a good idea even without any specific motivation. I may have missed a few places that could be converted, and it's even more likely that pending patches will re-introduce some hard-coded references. But that's not fatal --- there's no expectation that we'd actually change any of these values. We can clean up stragglers over time. Discussion: https://postgr.es/m/16457.1583189537@sss.pgh.pa.us
* Remove dependency on HeapTuple from predicate locking functions.Thomas Munro2020-01-28
| | | | | | | | | | | | | | | | The following changes make the predicate locking functions more generic and suitable for use by future access methods: - PredicateLockTuple() is renamed to PredicateLockTID(). It takes ItemPointer and inserting transaction ID instead of HeapTuple. - CheckForSerializableConflictIn() takes blocknum instead of buffer. - CheckForSerializableConflictOut() no longer takes HeapTuple or buffer. Author: Ashwin Agrawal Reviewed-by: Andres Freund, Kuntal Ghosh, Thomas Munro Discussion: https://postgr.es/m/CALfoeiv0k3hkEb3Oqk%3DziWqtyk2Jys1UOK5hwRBNeANT_yX%2Bng%40mail.gmail.com
* tableam: New callback relation_fetch_toast_slice.Robert Haas2020-01-07
| | | | | | | | | | | | | | | | Instead of always calling heap_fetch_toast_slice during detoasting, invoke a table AM callback which, when the toast table is a heap table, will be heap_fetch_toast_slice. This makes it possible for a table AM other than heap to be used as a TOAST table. It also completes the series of commits intended to improve the interaction of tableam with TOAST that began with commit 8b94dab06617ef80a0901ab103ebd8754427ef5a; detoast.c is now, hopefully, fully AM-independent. Patch by me, reviewed by Andres Freund and Peter Eisentraut. Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
* tableam: Allow choice of toast AM.Robert Haas2020-01-07
| | | | | | | | | | | | Previously, the toast table had to be implemented by the same AM that was used for the main table, which was bad, because the detoasting code won't work with anything but heap. This commit doesn't fix the latter problem, although there's another patch coming which does, but it does let you pick something that works (i.e. heap, right now). Patch by me, reviewed by Andres Freund. Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
* Update copyrights for 2020Bruce Momjian2020-01-01
| | | | Backpatch-through: update all files in master, backpatch legal files through 9.4
* Revert "Rename files and headers related to index AM"Michael Paquier2019-12-27
| | | | | | | | This follows multiple complains from Peter Geoghegan, Andres Freund and Alvaro Herrera that this issue ought to be dug more before actually happening, if it happens. Discussion: https://postgr.es/m/20191226144606.GA5659@alvherre.pgsql
* Rename files and headers related to index AMMichael Paquier2019-12-25
| | | | | | | | | | | | | | | | | | | | | The following renaming is done so as source files related to index access methods are more consistent with table access methods (the original names used for index AMs ware too generic, and could be confused as including features related to table AMs): - amapi.h -> indexam.h. - amapi.c -> indexamapi.c. Here we have an equivalent with backend/access/table/tableamapi.c. - amvalidate.c -> indexamvalidate.c. - amvalidate.h -> indexamvalidate.h. - genam.c -> indexgenam.c. - genam.h -> indexgenam.h. This has been discussed during the development of v12 when table AM was worked on, but the renaming never happened. Author: Michael Paquier Reviewed-by: Fabien Coelho, Julien Rouhaud Discussion: https://postgr.es/m/20191223053434.GF34339@paquier.xyz
* Make the order of the header file includes consistent in backend modules.Amit Kapila2019-11-12
| | | | | | | | | | | Similar to commits 7e735035f2 and dddf4cdc33, this commit makes the order of header file inclusion consistent for backend modules. In the passing, removed a couple of duplicate inclusions. Author: Vignesh C Reviewed-by: Kuntal Ghosh and Amit Kapila Discussion: https://postgr.es/m/CALDaNm2Sznv8RR6Ex-iJO6xAdsxgWhCoETkaYX=+9DW3q0QCfA@mail.gmail.com
* Pass ItemPointer not HeapTuple to IndexBuildCallback.Andres Freund2019-11-08
| | | | | | | | | | Not all AMs use HeapTuples internally, making it inconvenient to pass a HeapTuple. As the index callbacks really only need the TID, not the full tuple, modify callback to only take ItemPointer. Author: Ashwin Agrawal Reviewed-By: Andres Freund Discussion: https://postgr.es/m/CALfoeis6=8ehuR=VNtHvj3z16cYfCwPdTcpaxU+sfSUJ5QgR3g@mail.gmail.com
* Split tuptoaster.c into three separate files.Robert Haas2019-09-05
| | | | | | | | | | | | | | | | | | | | | | | | detoast.c/h contain functions required to detoast a datum, partially or completely, plus a few other utility functions for examining the size of toasted datums. toast_internals.c/h contain functions that are used internally to the TOAST subsystem but which (mostly) do not need to be accessed from outside. heaptoast.c/h contains code that is intrinsically specific to the heap AM, either because it operates on HeapTuples or is based on the layout of a heap page. detoast.c and toast_internals.c are placed in src/backend/access/common rather than src/backend/access/heap. At present, both files still have dependencies on the heap, but that will be improved in a future commit. Patch by me, reviewed and tested by Prabhat Sabu, Thomas Munro, Andres Freund, and Álvaro Herrera. Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
* Add error codes to some corruption log messagesPeter Eisentraut2019-08-01
| | | | | | | | | In some cases we have elog(ERROR) while corruption is certain and we can give a clear error code ERRCODE_DATA_CORRUPTED or ERRCODE_INDEX_CORRUPTED. Author: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://www.postgresql.org/message-id/flat/25F6C686-6442-4A6B-BAF8-A6F7B84B16DE@yandex-team.ru
* tableam: Provide helper functions for relation sizing.Robert Haas2019-07-08
| | | | | | | | | | | | | Most block-based table AMs will need the exact same implementation of the relation_size callback as the heap, and if they use a standard page layout, they will likely need an implementation of the relation_estimate_size callback that is very similar to that of the heap. Rearrange to facilitate code reuse. Patch by me, reviewed by Michael Paquier, Daniel Gustafsson, and Álvaro Herrera. Discussion: http://postgr.es/m/CA+TgmoZ6DBPnP1E-vRpQZUJQijJFD54F+SR_pxGiAAS-MyrigA@mail.gmail.com
* Fix memory corruption/crash in ANALYZE.Andres Freund2019-06-18
| | | | | | | | | | | | | | | | This fixes an embarrassing oversight I (Andres) made in 737a292b, namely missing two place where liverows/deadrows were used when converting those variables to pointers, leading to incrementing the pointer, rather than the value. It's not that actually that easy to trigger a crash: One needs tuples deleted by the current transaction, followed by a tuple deleted in another session, all in one page. Which is presumably why this hasn't been noticed before. Reported-By: Steve Singer Author: Steve Singer Discussion: https://postgr.es/m/c7988239-d42c-ddc4-41db-171b23b35e4f@ssinger.info
* Fix assorted inconsistencies.Amit Kapila2019-06-08
| | | | | | | | | | There were a number of issues in the recent commits which include typos, code and comments mismatch, leftover function declarations. Fix them. Reported-by: Alexander Lakhin Author: Alexander Lakhin, Amit Kapila and Amit Langote Reviewed-by: Amit Kapila Discussion: https://postgr.es/m/ef0c0232-0c1d-3a35-63d4-0ebd06e31387@gmail.com
* Fix typos in various placesMichael Paquier2019-06-03
| | | | | | Author: Andrea Gelmini Reviewed-by: Michael Paquier, Justin Pryzby Discussion: https://postgr.es/m/20190528181718.GA39034@glet
* Phase 2 pgindent run for v12.Tom Lane2019-05-22
| | | | | | | | | Switch to 2.1 version of pg_bsd_indent. This formats multiline function declarations "correctly", that is with additional lines of parameter declarations indented to match where the first line's left parenthesis is. Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
* Initial pgindent run for v12.Tom Lane2019-05-22
| | | | | | | | This is still using the 2.0 version of pg_bsd_indent. I thought it would be good to commit this separately, so as to document the differences between 2.0 and 2.1 behavior. Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
* tableam: Move heap-specific logic from needs_toast_table below tableam.Robert Haas2019-05-21
| | | | | | | | | This allows table AMs to completely suppress TOAST table creation, or to modify the conditions under which they are created. Patch by me. Reviewed by Andres Freund. Discussion: http://postgr.es/m/CA+Tgmoa4O2n=yphqD2pERUnYmUO84bH1SqMsA-nSxBGsZ7gWfA@mail.gmail.com
* Don't to predicate lock for analyze scans, refactor scan option passing.Andres Freund2019-05-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before this commit, when ANALYZE was run on a table and serializable was used (either by virtue of an explicit BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE, or default_transaction_isolation being set to serializable) a null pointer dereference lead to a crash. The analyze scan doesn't need a snapshot (nor predicate locking), but before this commit a scan only contained information about being a bitmap or sample scan. Refactor the option passing to the scan_begin callback to use a bitmask instead. Alternatively we could have added a new boolean parameter, but that seems harder to read. Even before this issue various people (Heikki, Tom, Robert) suggested doing so. These changes don't change the scan APIs outside of tableam. The flags argument could be exposed, it's not necessary to fix this problem. Also the wrapper table_beginscan* functions encapsulate most of that complexity. After these changes fixing the bug is trivial, just don't acquire predicate lock for analyze style scans. That was already done for bitmap heap scans. Add an assert that a snapshot is passed when acquiring the predicate lock, so this kind of bug doesn't require running with serializable. Also add a comment about sample scans currently requiring predicate locking the entire relation, that previously wasn't remarked upon. Reported-By: Joe Wildish Author: Andres Freund Discussion: https://postgr.es/m/4EA80A20-E9BF-49F1-9F01-5B66CAB21453@elusive.cx https://postgr.es/m/20190411164947.nkii4gaeilt4bui7@alap3.anarazel.de https://postgr.es/m/20190518203102.g7peu2fianukjuxm@alap3.anarazel.de
* tableam: Avoid relying on relation size to determine validity of tids.Andres Freund2019-05-17
| | | | | | | | | | | Instead add a tableam callback to do so. To avoid adding per validation overhead, pass a scan to tuple_tid_valid. In heap's case we'd otherwise incurred a RelationGetNumberOfBlocks() call for each tid - which'd have added noticable overhead to nodeTidscan.c. Author: Andres Freund Reviewed-By: Ashwin Agrawal Discussion: https://postgr.es/m/20190515185447.gno2jtqxyktylyvs@alap3.anarazel.de
* tableam: Don't assume that every AM uses md.c style storage.Andres Freund2019-05-17
| | | | | | | | | | | | | | | | | | | | | | Previously various parts of the code routed size requests through RelationGetNumberOfBlocks[InFork]. That works if md.c is used by the AM, but not otherwise. Add a tableam callback to return the size of the table. As not every AM will use postgres' BLCKSZ, have it return bytes, and have RelationGetNumberOfBlocksInFork() round the byte size up into blocks. To allow code outside of the AM to determine the actual relation size map InvalidForkNumber the total size of a relation, as not every AM might just need the postgres defined forks. A few users of RelationGetNumberOfBlocks() ought to be converted away from that. One case, the use of it to determine whether a tid is valid, will be fixed in a follow up commit. Others will have to wait for v13. Author: Andres Freund Discussion: https://postgr.es/m/20190423225201.3bbv6tbqzkb5w7cw@alap3.anarazel.de
* Handle table_complete_speculative's succeeded argument as documented.Andres Freund2019-05-14
| | | | | | | | | | | | | | | | For some reason both callsite and the implementation for heapam had the meaning inverted (i.e. succeeded == true was passed in case of conflict). That's confusing. I (Andres) briefly pondered whether it'd be better to rename table_complete_speculative's argument to 'bool specConflict' or such, but decided not to. The 'complete' in the function name for me makes `succeeded` sound a bit better. Reported-By: Ashwin Agrawal, Melanie Plageman, Heikki Linnakangas Discussion: https://postgr.es/m/CALfoeitk7-TACwYv3hCw45FNPjkA86RfXg4iQ5kAOPhR+F1Y4w@mail.gmail.com https://postgr.es/m/97673451-339f-b21e-a781-998d06b1067c@iki.fi
* Standardize ItemIdData terminology.Peter Geoghegan2019-05-13
| | | | | | | | | | | | | The term "item pointer" should not be used to refer to ItemIdData variables, since that is needlessly ambiguous. Only ItemPointerData/ItemPointer variables should be called item pointers. To fix, establish the convention that ItemIdData variables should always be referred to either as "item identifiers" or "line pointers". The term "item identifier" already predominates in docs and translatable messages, and so should be the preferred alternative there. Discussion: https://postgr.es/m/CAH2-Wz=c=MZQjUzde3o9+2PLAPuHTpVZPPdYxN=E4ndQ2--8ew@mail.gmail.com
* Fix several recently introduced issues around handling new relation forks.Andres Freund2019-04-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most of these stem from d25f519107 "tableam: relation creation, VACUUM FULL/CLUSTER, SET TABLESPACE.". 1) To pass data to the relation_set_new_filenode() RelationSetNewRelfilenode() was made to update RelationData.rd_rel directly. That's not OK however, as it makes the relcache entries temporarily inconsistent. Which among other scenarios is a problem if a REINDEX targets an index on pg_class - the CatalogTupleUpdate() in RelationSetNewRelfilenode(). Presumably that was introduced because other places in the code do so - while those aren't "good practice" they don't appear to be actively buggy (e.g. because system tables may not be targeted). I (Andres) should have caught this while reviewing and signficantly evolving the code in that commit, mea culpa. Fix that by instead passing in the new RelFileNode as separate argument to relation_set_new_filenode() and rely on the relcache to update the catalog entry. Also revert that the RelationMapUpdateMap() call was changed to immediate, and undo some other more unnecessary changes. 2) Document that the relation_set_new_filenode cannot rely on the whole relcache entry to be valid. It might be worthwhile to refactor the code to never have to rely on that, but given the way heap_create() is currently coded, that'd be a large change. 3) ATExecSetTableSpace() shouldn't do FlushRelationBuffers() itself. A table AM might not use shared buffers at all. Move to index_copy_data() and heapam_relation_copy_data(). 4) heapam_relation_set_new_filenode() previously sometimes accessed rel->rd_rel->relpersistence rather than the `persistence` argument. Code movement mistake. 5) Previously heapam_relation_set_new_filenode() re-opened the smgr relation to create the init for, if necesary. Instead have RelationCreateStorage() return the SMgrRelation and use it to create the init fork. 6) Add a note about the danger of modifying the relcache directly to ATExecSetTableSpace() - it's currently not a bug because there's a check ERRORing for catalog tables. Regression tests and assertion improvements that together trigger the bug described in 1) will be added in a later commit, as there is a related bug on all branches. Reported-By: Michael Paquier Diagnosed-By: Tom Lane and Andres Freund Author: Andres Freund Reviewed-By: Tom Lane Discussion: https://postgr.es/m/20190418011430.GA19133@paquier.xyz
* Allow pg_class xid & multixid horizons to not be set.Andres Freund2019-04-23
| | | | | | | | | | | | | | | | | | | | | | | This allows table AMs that don't need these horizons. This was already documented in the tableam relation_set_new_filenode callback, but an assert prevented if from actually working (the test AM code contained the change itself). Defang the asserts in the general code, and move the stronger ones into heap AM. Relatedly, after CLUSTER/VACUUM, we'd always assign a relfrozenxid / relminmxid. Change the table_relation_copy_for_cluster() interface to allow the AM to overwrite the horizons that get set on the pg_class entry. This'd also in the future allow AMs like heap to compute a relfrozenxid during rewrite that's the table's actual minimum rather than a pre-determined value. Arguably it'd have been better to move the whole computation / setting of those values into the callback, but it seems likely that for other reasons it'd be better to be able to use one value to vacuum/cluster multiple tables (e.g. a toast's horizon shouldn't be different than the table's). Reported-By: Heikki Linnakangas Author: Andres Freund Discussion: https://postgr.es/m/9a7fb9cc-2419-5db7-8840-ddc10c93f122@iki.fi
* Fix a number of issues around modifying a previously updated row.Andres Freund2019-04-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit fixes three, unfortunately related, issues: 1) Since 5db6df0c01, the introduction of DML via tableam, it was possible to trigger "ERROR: unexpected table_lock_tuple status: 1" when updating a row that was previously updated in the same transaction - but only when the previously updated row was before updated in a concurrent transaction (and READ COMMITTED was used). The reason for that was that that case simply wasn't expected. Fixing that lead to: 2) Even before the above commit, there were error checks (introduced in 6868ed7491b7) preventing a row being updated by different commands within the same statement (say in a function called by an UPDATE) - but that check wasn't performed when the row was first updated in a concurrent transaction - instead the second update was silently skipped in that case. After this change we throw the same error as we'd without the concurrent transaction. 3) The error messages (introduced in 6868ed7491b7) preventing such updates emitted the same error message for both DELETE and UPDATE ("tuple to be updated was already modified by an operation triggered by the current command"). While that could be changed separately, it made it hard to write tests that verify the correct correct behavior of the code. This commit changes heap's implementation of table_lock_tuple() to return TM_SelfModified instead of TM_Invisible (previously loosely modeled after EvalPlanQualFetch), and teaches nodeModifyTable.c to handle that in response to table_lock_tuple() and not just in response to table_(delete|update). Additionally it fixes the wrong error message (see 3 above). The comment for table_lock_tuple() is also adjusted to state that TM_Deleted won't return information in TM_FailureData - it'll not always be available. This also adds tests to ensure that DELETE/UPDATE correctly error out when affecting a row that concurrently was modified by another transaction. Author: Andres Freund Reported-By: Tom Lane, when investigating a bug bug fix to another bug by Amit Langote Discussion: https://postgr.es/m/19321.1554567786@sss.pgh.pa.us
* tableam: Add table_multi_insert() and revamp/speed-up COPY FROM buffering.Andres Freund2019-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | This adds table_multi_insert(), and converts COPY FROM, the only user of heap_multi_insert, to it. A simple conversion of COPY FROM use slots would have yielded a slowdown when inserting into a partitioned table for some workloads. Different partitions might need different slots (both slot types and their descriptors), and dropping / creating slots when there's constant partition changes is measurable. Thus instead revamp the COPY FROM buffering for partitioned tables to allow to buffer inserts into multiple tables, flushing only when limits are reached across all partition buffers. By only dropping slots when there've been inserts into too many different partitions, the aforementioned overhead is gone. By allowing larger batches, even when there are frequent partition changes, we actuall speed such cases up significantly. By using slots COPY of very narrow rows into unlogged / temporary might slow down very slightly (due to the indirect function calls). Author: David Rowley, Andres Freund, Haribabu Kommi Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de https://postgr.es/m/20190327054923.t3epfuewxfqdt22e@alap3.anarazel.de
* Report progress of CREATE INDEX operationsAlvaro Herrera2019-04-02
| | | | | | | | | | | | | | | | | | | | This uses the progress reporting infrastructure added by c16dc1aca5e0, adding support for CREATE INDEX and CREATE INDEX CONCURRENTLY. There are two pieces to this: one is index-AM-agnostic, and the other is AM-specific. The latter is fairly elaborate for btrees, including reportage for parallel index builds and the separate phases that btree index creation uses; other index AMs, which are much simpler in their building procedures, have simplistic reporting only, but that seems sufficient, at least for non-concurrent builds. The index-AM-agnostic part is fairly complete, providing insight into the CONCURRENTLY wait phases as well as block-based progress during the index validation table scan. (The index validation index scan requires patching each AM, which has not been included here.) Reviewers: Rahila Syed, Pavan Deolasee, Tatsuro Yamada Discussion: https://postgr.es/m/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql
* tableam: Add table_finish_bulk_insert().Andres Freund2019-04-01
| | | | | | | | | | | | | | | | | This replaces the previous calls of heap_sync() in places using bulk-insert. By passing in the flags used for bulk-insert the AM can decide (first at insert time and then during the finish call) which of the optimizations apply to it, and what operations are necessary to finish a bulk insert operation. Also change HEAP_INSERT_* flags to TABLE_INSERT, and rename hi_options to ti_options. These changes are made even in copy.c, which hasn't yet been converted to tableam. There's no harm in doing so. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* tableam: bitmap table scan.Andres Freund2019-03-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | This moves bitmap heap scan support to below an optional tableam callback. It's optional as the whole concept of bitmap heapscans is fairly block specific. This basically moves the work previously done in bitgetpage() into the new scan_bitmap_next_block callback, and the direct poking into the buffer done in BitmapHeapNext() into the new scan_bitmap_next_tuple() callback. The abstraction is currently somewhat leaky because nodeBitmapHeapscan.c's prefetching and visibilitymap based logic remains - it's likely that we'll later have to move more into the AM. But it's not trivial to do so without introducing a significant amount of code duplication between the AMs, so that's a project for later. Note that now nodeBitmapHeapscan.c and the associated node types are a bit misnamed. But it's not clear whether renaming wouldn't be a cure worse than the disease. Either way, that'd be best done in a separate commit. Author: Andres Freund Reviewed-By: Robert Haas (in an older version) Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* tableam: sample scan.Andres Freund2019-03-31
| | | | | | | | | | | | | | | | | | | | This moves sample scan support to below tableam. It's not optional as there is, in contrast to e.g. bitmap heap scans, no alternative way to perform tablesample queries. If an AM can't deal with the block based API, it will have to throw an ERROR. The tableam callbacks for this are block based, but given the current TsmRoutine interface, that seems to be required. The new interface doesn't require TsmRoutines to perform visibility checks anymore - that requires the TsmRoutine to know details about the AM, which we want to avoid. To continue to allow taking the returned number of tuples account SampleScanState now has a donetuples field (which previously e.g. existed in SystemRowsSamplerData), which is only incremented after the visibility check succeeds. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* tableam: Formatting and other minor cleanups.Andres Freund2019-03-31
| | | | | The superflous heapam_xlog.h includes were reported by Peter Geoghegan.
* tableam: Move heap specific logic from estimate_rel_size below tableam.Andres Freund2019-03-30
| | | | | | | | | | | | | This just moves the table/matview[/toast] determination of relation size to a callback, and uses a copy of the existing logic to implement that callback for heap. It probably would make sense to also move the index specific logic into a callback, so the metapage handling (and probably more) can be index specific. But that's a separate task. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* tableam: VACUUM and ANALYZE support.Andres Freund2019-03-30
| | | | | | | | | | | | This is a relatively straightforward move of the current implementation to sit below tableam. As the current analyze sampling implementation is pretty inherently block based, the tableam analyze interface is as well. It might make sense to generalize that at some point, but that seems like a larger project that shouldn't be undertaken at the same time as the introduction of tableam. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* tableam: Comment fixes.Andres Freund2019-03-29
| | | | | Author: Haribabu Kommi Discussion: CAJrrPGeeYOqP3hkZyohDx_8dot4zvPuPMDBmhJ=iC85cTBNeYw@mail.gmail.com
* tableam: relation creation, VACUUM FULL/CLUSTER, SET TABLESPACE.Andres Freund2019-03-28
| | | | | | | | | | | | | | | | This moves the responsibility for: - creating the storage necessary for a relation, including creating a new relfilenode for a relation with existing storage - non-transactional truncation of a relation - VACUUM FULL / CLUSTER's rewrite of a table below tableam. This is fairly straight forward, with a bit of complexity smattered in to move the computation of xid / multixid horizons below the AM, as they don't make sense for every table AM. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* tableam: Support for an index build's initial table scan(s).Andres Freund2019-03-27
| | | | | | | | | | | | | To support building indexes over tables of different AMs, the scans to do so need to be routed through the table AM. While moving a fair amount of code, nearly all the changes are just moving code to below a callback. Currently the range based interface wouldn't make much sense for non block based table AMs. But that seems aceptable for now. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* Compute XID horizon for page level index vacuum on primary.Andres Freund2019-03-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously the xid horizon was only computed during WAL replay. That had two major problems: 1) It relied on knowing what the table pointed to looks like. That was easy enough before the introducing of tableam (we knew it had to be heap, although some trickery around logging the heap relfilenodes was required). But to properly handle table AMs we need per-database catalog access to look up the AM handler, which recovery doesn't allow. 2) Not knowing the xid horizon also makes it hard to support logical decoding on standbys. When on a catalog table, we need to be able to conflict with slots that have an xid horizon that's too old. But computing the horizon by visiting the heap only works once consistency is reached, but we always need to be able to detect conflicts. There's also a secondary problem, in that the current method performs redundant work on every standby. But that's counterbalanced by potentially computing the value when not necessary (either because there's no standby, or because there's no connected backends). Solve 1) and 2) by moving computation of the xid horizon to the primary and by involving tableam in the computation of the horizon. To address the potentially increased overhead, increase the efficiency of the xid horizon computation for heap by sorting the tids, and eliminating redundant buffer accesses. When prefetching is available, additionally perform prefetching of buffers. As this is more of a maintenance task, rather than something routinely done in every read only query, we add an arbitrary 10 to the effective concurrency - thereby using IO concurrency, when not globally enabled. That's possibly not the perfect formula, but seems good enough for now. Bumps WAL format, as latestRemovedXid is now part of the records, and the heap's relfilenode isn't anymore. Author: Andres Freund, Amit Khandekar, Robert Haas Reviewed-By: Robert Haas Discussion: https://postgr.es/m/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de https://postgr.es/m/20181214014235.dal5ogljs3bmlq44@alap3.anarazel.de https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* tableam: Add table_get_latest_tid, to wrap heap_get_latest_tid.Andres Freund2019-03-25
| | | | | | | | | | This primarily is to allow WHERE CURRENT OF to continue to work as it currently does. It's not clear to me that these semantics make sense for every AM, but it works for the in-core heap, and the out of core zheap. We can refine it further at a later point if necessary. Author: Andres Freund Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* tableam: Add and use table_fetch_row_version().Andres Freund2019-03-25
| | | | | | | | | | | | | | | | | | This is essentially the tableam version of heapam_fetch(), i.e. fetching a tuple identified by a tid, performing visibility checks. Note that this different from table_index_fetch_tuple(), which is for index lookups. It therefore has to handle a tid pointing to an earlier version of a tuple if the AM uses an optimization like heap's HOT. Add comments to that end. This commit removes the stats_relation argument from heap_fetch, as it's been unused for a long time. Author: Andres Freund Reviewed-By: Haribabu Kommi Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
* tableam: Add tuple_{insert, delete, update, lock} and use.Andres Freund2019-03-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds new, required, table AM callbacks for insert/delete/update and lock_tuple. To be able to reasonably use those, the EvalPlanQual mechanism had to be adapted, moving more logic into the AM. Previously both delete/update/lock call-sites and the EPQ mechanism had to have awareness of the specific tuple format to be able to fetch the latest version of a tuple. Obviously that needs to be abstracted away. To do so, move the logic that find the latest row version into the AM. lock_tuple has a new flag argument, TUPLE_LOCK_FLAG_FIND_LAST_VERSION, that forces it to lock the last version, rather than the current one. It'd have been possible to do so via a separate callback as well, but finding the last version usually also necessitates locking the newest version, making it sensible to combine the two. This replaces the previous use of EvalPlanQualFetch(). Additionally HeapTupleUpdated, which previously signaled either a concurrent update or delete, is now split into two, to avoid callers needing AM specific knowledge to differentiate. The move of finding the latest row version into tuple_lock means that encountering a row concurrently moved into another partition will now raise an error about "tuple to be locked" rather than "tuple to be updated/deleted" - which is accurate, as that always happens when locking rows. While possible slightly less helpful for users, it seems like an acceptable trade-off. As part of this commit HTSU_Result has been renamed to TM_Result, and its members been expanded to differentiated between updating and deleting. HeapUpdateFailureData has been renamed to TM_FailureData. The interface to speculative insertion is changed so nodeModifyTable.c does not have to set the speculative token itself anymore. Instead there's a version of tuple_insert, tuple_insert_speculative, that performs the speculative insertion (without requiring a flag to signal that fact), and the speculative insertion is either made permanent with table_complete_speculative(succeeded = true) or aborted with succeeded = false). Note that multi_insert is not yet routed through tableam, nor is COPY. Changing multi_insert requires changes to copy.c that are large enough to better be done separately. Similarly, although simpler, CREATE TABLE AS and CREATE MATERIALIZED VIEW are also only going to be adjusted in a later commit. Author: Andres Freund and Haribabu Kommi Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de https://postgr.es/m/20190313003903.nwvrxi7rw3ywhdel@alap3.anarazel.de https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
* tableam: Add and use scan APIs.Andres Freund2019-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Too allow table accesses to be not directly dependent on heap, several new abstractions are needed. Specifically: 1) Heap scans need to be generalized into table scans. Do this by introducing TableScanDesc, which will be the "base class" for individual AMs. This contains the AM independent fields from HeapScanDesc. The previous heap_{beginscan,rescan,endscan} et al. have been replaced with a table_ version. There's no direct replacement for heap_getnext(), as that returned a HeapTuple, which is undesirable for a other AMs. Instead there's table_scan_getnextslot(). But note that heap_getnext() lives on, it's still used widely to access catalog tables. This is achieved by new scan_begin, scan_end, scan_rescan, scan_getnextslot callbacks. 2) The portion of parallel scans that's shared between backends need to be able to do so without the user doing per-AM work. To achieve that new parallelscan_{estimate, initialize, reinitialize} callbacks are introduced, which operate on a new ParallelTableScanDesc, which again can be subclassed by AMs. As it is likely that several AMs are going to be block oriented, block oriented callbacks that can be shared between such AMs are provided and used by heap. table_block_parallelscan_{estimate, intiialize, reinitialize} as callbacks, and table_block_parallelscan_{nextpage, init} for use in AMs. These operate on a ParallelBlockTableScanDesc. 3) Index scans need to be able to access tables to return a tuple, and there needs to be state across individual accesses to the heap to store state like buffers. That's now handled by introducing a sort-of-scan IndexFetchTable, which again is intended to be subclassed by individual AMs (for heap IndexFetchHeap). The relevant callbacks for an AM are index_fetch_{end, begin, reset} to create the necessary state, and index_fetch_tuple to retrieve an indexed tuple. Note that index_fetch_tuple implementations need to be smarter than just blindly fetching the tuples for AMs that have optimizations similar to heap's HOT - the currently alive tuple in the update chain needs to be fetched if appropriate. Similar to table_scan_getnextslot(), it's undesirable to continue to return HeapTuples. Thus index_fetch_heap (might want to rename that later) now accepts a slot as an argument. Core code doesn't have a lot of call sites performing index scans without going through the systable_* API (in contrast to loads of heap_getnext calls and working directly with HeapTuples). Index scans now store the result of a search in IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the target is not generally a HeapTuple anymore that seems cleaner. To be able to sensible adapt code to use the above, two further callbacks have been introduced: a) slot_callbacks returns a TupleTableSlotOps* suitable for creating slots capable of holding a tuple of the AMs type. table_slot_callbacks() and table_slot_create() are based upon that, but have additional logic to deal with views, foreign tables, etc. While this change could have been done separately, nearly all the call sites that needed to be adapted for the rest of this commit also would have been needed to be adapted for table_slot_callbacks(), making separation not worthwhile. b) tuple_satisfies_snapshot checks whether the tuple in a slot is currently visible according to a snapshot. That's required as a few places now don't have a buffer + HeapTuple around, but a slot (which in heap's case internally has that information). Additionally a few infrastructure changes were needed: I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now internally uses a slot to keep track of tuples. While systable_getnext() still returns HeapTuples, and will so for the foreseeable future, the index API (see 1) above) now only deals with slots. The remainder, and largest part, of this commit is then adjusting all scans in postgres to use the new APIs. Author: Andres Freund, Haribabu Kommi, Alvaro Herrera Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql