aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
* Don't export basebackup.c's sendTablespace().Robert Haas2020-06-17
| | | | | | | | | | | | | Commit 72d422a5227ef6f76f412486a395aba9f53bf3f0 made xlog.c call sendTablespace() with the 'sizeonly' argument set to true, which required basebackup.c to export sendTablespace(). However, that's kind of ugly, so instead defer the call to sendTablespace() until basebackup.c regains control. That way, it can still be a static function. Patch by me, reviewed by Amit Kapila and Kyotaro Horiguchi. Discussion: http://postgr.es/m/CA+TgmoYq+59SJ2zBbP891ngWPA9fymOqntqYcweSDYXS2a620A@mail.gmail.com
* Remove STATUS_WAITINGPeter Eisentraut2020-06-17
| | | | | | | Add a separate enum for use in the locking APIs, which were the only user. Discussion: https://www.postgresql.org/message-id/flat/a6f91ead-0ce4-2a34-062b-7ab9813ea308%402ndquadrant.com
* Fix buffile.c error handling.Thomas Munro2020-06-16
| | | | | | | | | | | | | | | | | | Convert buffile.c error handling to use ereport. This fixes cases where I/O errors were indistinguishable from EOF or not reported. Also remove "%m" from error messages where errno would be bogus. While we're modifying those strings, add block numbers and short read byte counts where appropriate. Back-patch to all supported releases. Reported-by: Amit Khandekar <amitdkhan.pg@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Ibrar Ahmed <ibrar.ahmad@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CA%2BhUKGJE04G%3D8TLK0DLypT_27D9dR8F1RQgNp0jK6qR0tZGWOw%40mail.gmail.com
* Fix some comments referring to past featuresMichael Paquier2020-06-15
| | | | | | | | Timestamp can only be an int64 since b9d092c, and support for WITH OIDS has been removed as of 578b229. Author: Justin Pryzby Discussion: https://postgr.es/m/20200612023709.GC14879@telsasoft.com
* Silence _bt_check_unique compiler warning.Peter Geoghegan2020-06-13
| | | | | Reported-By: Tom Lane Discussion: https://postgr.es/m/841649.1592065060@sss.pgh.pa.us
* Have pg_itoa, pg_ltoa and pg_lltoa return the length of the stringDavid Rowley2020-06-13
| | | | | | | | | | | Core by no means makes excessive use of these functions, but quite a large number of those usages do require the caller to call strlen() on the returned string. This is quite wasteful since these functions do already have a good idea of the length of the string, so we might as well just have them return that. Reviewed-by: Andrew Gierth Discussion: https://postgr.es/m/CAApHDvrm2A5x2uHYxsqriO2cUaGcFvND%2BksC9e7Tjep0t2RK_A%40mail.gmail.com
* Improve comments for [Heap]CheckForSerializableConflictOut().Thomas Munro2020-06-12
| | | | | | | | | | | | Rewrite the documentation of these functions, in light of recent bug fix commit 5940ffb2. Back-patch to 13 where the check-for-conflict-out code was split up into AM-specific and generic parts, and new documentation was added that now looked wrong. Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/db7b729d-0226-d162-a126-8a8ab2dc4443%40jepsen.io
* Avoid update conflict out serialization anomalies.Peter Geoghegan2020-06-11
| | | | | | | | | | | | | | | | | | | | | | | | | | SSI's HeapCheckForSerializableConflictOut() test failed to correctly handle conditions involving a concurrently inserted tuple which is later concurrently updated by a separate transaction . A SELECT statement that called HeapCheckForSerializableConflictOut() could end up using the same XID (updater's XID) for both the original tuple, and the successor tuple, missing the XID of the xact that created the original tuple entirely. This only happened when neither tuple from the chain was visible to the transaction's MVCC snapshot. The observable symptoms of this bug were subtle. A pair of transactions could commit, with the later transaction failing to observe the effects of the earlier transaction (because of the confusion created by the update to the non-visible row). This bug dates all the way back to commit dafaa3ef, which added SSI. To fix, make sure that we check the xmin of concurrently inserted tuples that happen to also have been updated concurrently. Author: Peter Geoghegan Reported-By: Kyle Kingsbury Reviewed-By: Thomas Munro Discussion: https://postgr.es/m/db7b729d-0226-d162-a126-8a8ab2dc4443@jepsen.io Backpatch: All supported versions
* Fix locking bugs that could corrupt pg_control.Thomas Munro2020-06-08
| | | | | | | | | | | | | | | | | | The redo routines for XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} must acquire ControlFileLock before modifying ControlFile->checkPointCopy, or the checkpointer could write out a control file with a bad checksum. Likewise, XLogReportParameters() must acquire ControlFileLock before modifying ControlFile and calling UpdateControlFile(). Back-patch to all supported releases. Author: Nathan Bossart <bossartn@amazon.com> Author: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/70BF24D6-DC51-443F-B55A-95735803842A%40amazon.com
* Fix crash in WAL sender when starting physical replicationMichael Paquier2020-06-08
| | | | | | | | | | | | | | | | | | | | | Since database connections can be used with WAL senders in 9.4, it is possible to use physical replication. This commit fixes a crash when starting physical replication with a WAL sender using a database connection, caused by the refactoring done in 850196b. There have been discussions about forbidding the use of physical replication in a database connection, but this is left for later, taking care only of the crash new to 13. While on it, add a test to check for a failure when attempting logical replication if the WAL sender does not have a database connection. This part is extracted from a larger patch by Kyotaro Horiguchi. Reported-by: Vladimir Sitnikov Author: Michael Paquier, Kyotaro Horiguchi Reviewed-by: Kyotaro Horiguchi, Álvaro Herrera Discussion: https://postgr.es/m/CAB=Je-GOWMj1PTPkeUhjqQp-4W3=nW-pXe2Hjax6rJFffB5_Aw@mail.gmail.com Backpatch-through: 13
* Reconsider nbtree page deletion assertion.Peter Geoghegan2020-05-19
| | | | | | | | | | | | | | | | Commit 624686abcf8 added an assertion that verified that _bt_search successfully relocated the leaf page undergoing deletion. Page deletion cannot deal with the case where the descent stack is to the right of the page, so this seemed critical (deletion can only handle the case where the descent stack is to the left of the leaf/target page). However, the assertion went a bit too far. Since only a buffer pin is held on the leaf page throughout the call to _bt_search, nothing guarantees that it can't have split during this small window. And if does actually split, _bt_search may end up "relocating" a page to the right of the original target leaf page. This scenario seems extremely unlikely, but it must still be considered. Remove the assertion, and document how we cope in this scenario.
* Mop-up for wait event naming issues.Tom Lane2020-05-16
| | | | | | | | | | | | Synchronize the event names for parallel hash join waits with other event names, by getting rid of the slashes and dropping "-ing" suffixes. Rename ClogGroupUpdate to XactGroupUpdate, to match the new SLRU name. Move the ProcSignalBarrier event to the IPC category; it doesn't belong under IO. Also a bit more wordsmithing in the wait event documentation tables. Discussion: https://postgr.es/m/4505.1589640417@sss.pgh.pa.us
* Final pgindent run with pg_bsd_indent version 2.1.Tom Lane2020-05-16
| | | | | | | This is just to provide a clean basis for comparison of the results of the new version. I did fix a typo that crept into 242dfcbaf. Discussion: https://postgr.es/m/20200114221814.GA19630@alvherre.pgsql
* Rename assorted LWLock tranches.Tom Lane2020-05-15
| | | | | | | | | | | | | | | | Choose names that fit into the conventions for wait event names (particularly, that multi-word names are in the style MultiWordName) and hopefully convey more information to non-hacker users than the previous names did. Also rename SerializablePredicateLockListLock to SerializablePredicateListLock; the old name was long enough to cause table formatting problems, plus the double occurrence of "Lock" seems confusing/error-prone. Also change a couple of particularly opaque LWLock field names. Discussion: https://postgr.es/m/28683.1589405363@sss.pgh.pa.us
* Avoid killing btree items that are already deadAlvaro Herrera2020-05-15
| | | | | | | | | | | | | | | | | | | | | | _bt_killitems marks btree items dead when a scan leaves the page where they live, but it does so with only share lock (to improve concurrency). This was historicall okay, since killing a dead item has no consequences. However, with the advent of data checksums and wal_log_hints, this action incurs a WAL full-page-image record of the page. Multiple concurrent processes would write the same page several times, leading to WAL bloat. The probability of this happening can be reduced by only killing items if they're not already dead, so change the code to do that. The problem could eliminated completely by having _bt_killitems upgrade to exclusive lock upon seeing a killable item, but that would reduce concurrency so it's considered a cure worse than the disease. Backpatch all the way back to 9.5, since wal_log_hints was introduced in 9.4. Author: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> Discussion: https://postgr.es/m/CA+fd4k6PeRj2CkzapWNrERkja5G0-6D-YQiKfbukJV+qZGFZ_Q@mail.gmail.com
* Rename SLRU structures and associated LWLocks.Tom Lane2020-05-15
| | | | | | | | | | | | | | | | | | | | | | | | | | Originally, the names assigned to SLRUs had no purpose other than being shmem lookup keys, so not a lot of thought went into them. As of v13, though, we're exposing them in the pg_stat_slru view and the pg_stat_reset_slru function, so it seems advisable to take a bit more care. Rename them to names based on the associated on-disk storage directories (which fortunately we *did* think about, to some extent; since those are also visible to DBAs, consistency seems like a good thing). Also rename the associated LWLocks, since those names are likewise user-exposed now as wait event names. For the most part I only touched symbols used in the respective modules' SimpleLruInit() calls, not the names of other related objects. This renaming could have been taken further, and maybe someday we will do so. But for now it seems undesirable to change the names of any globally visible functions or structs, so some inconsistency is unavoidable. (But I *did* terminate "oldserxid" with prejudice, as I found that name both unreadable and not descriptive of the SLRU's contents.) Table 27.12 needs re-alphabetization now, but I'll leave that till after the other LWLock renamings I have in mind. Discussion: https://postgr.es/m/28683.1589405363@sss.pgh.pa.us
* Initial pgindent and pgperltidy run for v13.Tom Lane2020-05-14
| | | | | | | | | | | Includes some manual cleanup of places that pgindent messed up, most of which weren't per project style anyway. Notably, it seems some people didn't absorb the style rules of commit c9d297751, because there were a bunch of new occurrences of function calls with a newline just after the left paren, all with faulty expectations about how the rest of the call would get indented.
* Collect built-in LWLock tranche names statically, not dynamically.Tom Lane2020-05-14
| | | | | | | | | | | | | | | | | | There is little point in using the LWLockRegisterTranche mechanism for built-in tranche names. It wastes cycles, it creates opportunities for bugs (since failing to register a tranche name is a very hard-to-detect problem), and the lack of any centralized list of names encourages sloppy nonconformity in name choices. Moreover, since we have a centralized list of the tranches anyway in enum BuiltinTrancheIds, we're certainly not buying any flexibility in return for these disadvantages. Hence, nuke all the backend-internal LWLockRegisterTranche calls, and instead provide a const array of the builtin tranche names. (I have in mind to change a bunch of these names shortly, but this patch is just about getting them into one place.) Discussion: https://postgr.es/m/9056.1589419765@sss.pgh.pa.us
* Dial back -Wimplicit-fallthrough to level 3Alvaro Herrera2020-05-13
| | | | | | | | | The additional pain from level 4 is excessive for the gain. Also revert all the source annotation changes to their original wordings, to avoid back-patching pain. Discussion: https://postgr.es/m/31166.1589378554@sss.pgh.pa.us
* Improve management of SLRU statistics collection.Tom Lane2020-05-13
| | | | | | | | | | | | | | | | | | | | | | | Instead of re-identifying which statistics bucket to use for a given SLRU on every counter increment, do it once during shmem initialization. This saves a fair number of cycles, and there's no real cost because we could not have a bucket assignment that varies over time or across backends anyway. Also, get rid of the ill-considered decision to let pgstat.c pry directly into SLRU's shared state; it's cleaner just to have slru.c pass the stats bucket number. In consequence of these changes, there's no longer any need to store an SLRU's LWLock tranche info in shared memory, so get rid of that, making this a net reduction in shmem consumption. (That partly reverts fe702a7b3.) This is basically code review for 28cac71bd, so I also cleaned up some comments, removed a dangling extern declaration, fixed some things that should be static and/or const, etc. Discussion: https://postgr.es/m/3618.1589313035@sss.pgh.pa.us
* Adjust walsender usage of xlogreader, simplify APIsAlvaro Herrera2020-05-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Have both physical and logical walsender share a 'xlogreader' state struct for tracking state. This replaces the existing globals sendSeg and sendCxt. * Change WALRead not to receive XLogReaderState->seg and ->segcxt as separate arguments anymore; just use the ones from 'state'. This is made possible by the above change. * have the XLogReader segment_open contract require the callbacks to install the file descriptor in the state struct themselves instead of returning it. xlogreader was already ignoring any possible failed return from the callbacks, relying solely on them never returning. (This point is not altogether excellent, as it means the callbacks have to know more of XLogReaderState; but to really improve on that we would have to pass back error info from the callbacks to xlogreader. And the complexity would not be saved but instead just transferred to the callers of WALRead, which would have to learn how to throw errors from the open_segment callback in addition of, as currently, from pg_pread.) * segment_open no longer receives the 'segcxt' as a separate argument, since it's part of the XLogReaderState argument. Per comments from Kyotaro Horiguchi. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20200511203336.GA9913@alvherre.pgsql
* Add -Wimplicit-fallthrough to CFLAGS and CXXFLAGSAlvaro Herrera2020-05-12
| | | | | | | | | | | | | | | Use it at level 4, a bit more restrictive than the default level, and tweak our commanding comments to FALLTHROUGH. (However, leave zic.c alone, since it's external code; to avoid the warnings that would appear there, change CFLAGS for that file in the Makefile.) Author: Julien Rouhaud <rjuju123@gmail.com> Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/20200412081825.qyo5vwwco3fv4gdo@nol Discussion: https://postgr.es/m/flat/E1fDenm-0000C8-IJ@gemulon.postgresql.org
* Fix comment in xlogutils.cMichael Paquier2020-05-12
| | | | | | | | | | | The existing callers of XLogReadDetermineTimeline() performing recovery need to check a replay LSN position when determining on which timeline to read a WAL page. A portion of the comment describing this function said exactly that, while referring to a routine for fetching a write LSN, something not available in recovery. Author: Kyotaro Horiguchi Discussion: https://postgr.es/m/20200511.101619.2043820539323292957.horikyota.ntt@gmail.com
* Adjust "root of to-be-deleted subtree" function.Peter Geoghegan2020-05-11
| | | | | | | | | | | | | | | | | | | | Restructure the function that locates the root of the to-be-deleted subtree during nbtree page deletion. Handle the conditions that make page deletion unsafe in a slightly more uniform way, and acknowledge the fact that the behavior with incomplete splits on internal pages is different (as pointed out in the nbtree README as of commit 35bc0ec7). Also invent new terminology that avoids ambiguity around which pages are about to be deleted. Consistently use the term "to-be-deleted subtree", not the ambiguous term "branch". We were calling the subtree parent page the "top parent page", but that was quite misleading. The top parent page usually refers to a page unlinked from its siblings and marked deleted (during the second stage of page deletion). There was one kind of top parent page that we merely removed a downlink from, and another kind of top parent page that we actually marked deleted. Eliminate the ambiguity by inventing a new term ("subtree parent page") that refers to the former kind of page only.
* Rework XLogReader callback systemAlvaro Herrera2020-05-08
| | | | | | | | | | | | | | | | | | | Code review for 0dc8ead46363, prompted by a bug closed by 91c40548d5f7. XLogReader's system for opening and closing segments had gotten too complicated, with callbacks being passed at both the XLogReaderAllocate level (read_page) as well as at the WALRead level (segment_open). This was confusing and hard to follow, so restructure things so that these callbacks are passed together at XLogReaderAllocate time, and add another callback to the set (segment_close) to make it a coherent whole. Also, ensure XLogReaderState is an argument to all the callbacks, so that they can grab at the ->private data if necessary. Document the whole arrangement more clearly. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20200422175754.GA19858@alvherre.pgsql
* Report missing wait event for timeline history file.Fujii Masao2020-05-08
| | | | | | | | | | | | | | | | TimelineHistoryRead and TimelineHistoryWrite wait events are reported during waiting for a read and write of a timeline history file, respectively. However, previously, TimelineHistoryRead wait event was not reported while readTimeLineHistory() was reading a timeline history file. Also TimelineHistoryWrite was not reported while writeTimeLineHistory() was writing one line with the details of the timeline split, at the end. This commit fixes these issues. Back-patch to v10 where wait events for a timeline history file was added. Author: Masahiro Ikeda Reviewed-by: Michael Paquier, Fujii Masao Discussion: https://postgr.es/m/d11b0c910b63684424e06772eb844ab5@oss.nttdata.com
* Refactor nbtree deletion INCOMPLETE_SPLIT check.Peter Geoghegan2020-05-07
| | | | | | | | | | | | | | | | | | | | Factor out code common to _bt_lock_branch_parent() and _bt_pagedel() into a new utility function. This new function is used to check that the left sibling of a deletion target page does not have the INCOMPLETE_SPLIT page flag set. If it is set then deletion is unsafe; there won't be a usable pivot tuple (with a downlink) in the parent page that points to the deletion target page. The page deletion algorithm is not prepared to deal with that. Also restructure an existing, related utility function that checks if the right sibling of the target page has the ISHALFDEAD page flag set. This organization highlights the symmetry between the two cases. The goal is to make the design of page deletion clearer. Both functions involve a sibling page with a flag that indicates that there was an interrupted operation (a page split or a page deletion) that resulted in a page pointed to by sibling pages, but not pointed to in the parent. And, both functions indicate if page deletion is unsafe due to the absence of a particular downlink in the parent page.
* Change the display of WAL usage statistics in Explain.Amit Kapila2020-05-05
| | | | | | | | | | | | | | | | In commit 33e05f89c5, we have added the option to display WAL usage statistics in Explain and auto_explain. The display format used two spaces between each field which is inconsistent with Buffer usage statistics which is using one space between each field. Change the format to make WAL usage statistics consistent with Buffer usage statistics. This commit also changed the usage of "full page writes" to "full page images" for WAL usage statistics to make it consistent with other parts of code and docs. Author: Julien Rouhaud, Amit Kapila Reviewed-by: Justin Pryzby, Kyotaro Horiguchi and Amit Kapila Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
* Refactor btvacuumpage().Peter Geoghegan2020-05-02
| | | | | | | | | | | | | | | | | | | | | | | | | | Remove one of the arguments to btvacuumpage(), and give up on the idea that it's a recursive function. We now use the term "backtracking" to refer to the case where an earlier block must be visited to make sure no tuples that need to be removed were missed. Advertising btvacuumpage() as a recursive function was unhelpful. In reality the function always simulates recursion with a loop (it doesn't actually call itself). This wasn't just necessary as a precaution (per the comments mentioning tail recursion), though. There is no reliable natural limit on the number of times we can backtrack. There are important behavioral difference when "recursing"/backtracking, mostly related to page deletion. We don't perform page deletion when backtracking due to the extra complexity. And when we recurse, we're not performing a physical order scan anymore, so we expect fairly different conditions to hold for the page. Structuring the code like this makes it clearer how _bt_pagedel() cooperates with btvacuumpage() and btvacuumscan() (as established in commit b0229f26 and commit 73a076b0). Author: Peter Geoghegan Reviewed-By: Masahiko Sawada Discussion: https://postgr.es/m/CAH2-WzmRGMDWiLMcb+zagG9652PboNN4Gfcq1Gc_wJL6A716MA@mail.gmail.com
* Clear up issue with FSM and oldest bpto.xact.Peter Geoghegan2020-05-01
| | | | | | | | | | | | | | | | | On further reflection, code comments added by commit b0229f26 slightly misrepresented how we determine the oldest bpto.xact for the index. btvacuumpage() does not treat the bpto.xact of a page that it put in the FSM as a candidate to be the oldest deleted page (the delete-marked page that has the oldest bpto.xact XID among all pages encountered). The definition of a deleted page for the purposes of the bpto.xact calculation is different from the definition used by the bulk delete statistics. The bulk delete statistics don't distinguish between pages that were deleted by the current VACUUM, pages deleted by a previous VACUUM operation but not yet recyclable/reusable, and pages that are reusable (though reusable pages are counted separately). Backpatch: 11-, just like commit b0229f26.
* Reorder function prototypes for consistency.Peter Geoghegan2020-05-01
|
* Fix undercounting in VACUUM VERBOSE output.Peter Geoghegan2020-05-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The logic for determining how many nbtree pages in an index are deleted pages sometimes undercounted pages. Pages that were deleted by the current VACUUM operation (as opposed to some previous VACUUM operation whose deleted pages have yet to be reused) were sometimes overlooked. The final count is exposed to users through VACUUM VERBOSE's "%u index pages have been deleted" output. btvacuumpage() avoided double-counting when _bt_pagedel() deleted more than one page by assuming that only one page was deleted, and that the additional deleted pages would get picked up during a future call to btvacuumpage() by the same VACUUM operation. _bt_pagedel() can legitimately delete pages that the btvacuumscan() scan will not visit again, though, so that assumption was slightly faulty. Fix the accounting by teaching _bt_pagedel() about its caller's requirements. It now only reports on pages that it knows btvacuumscan() won't visit again (including the current btvacuumpage() page), so everything works out in the end. This bug has been around forever. Only backpatch to v11, though, to keep _bt_pagedel() is sync on the branches that have today's bugfix commit b0229f26da. Note that this commit changes the signature of _bt_pagedel(), just like commit b0229f26da. Author: Peter Geoghegan Reviewed-By: Masahiko Sawada Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com Backpatch: 11-
* Fix bug in nbtree VACUUM "skip full scan" feature.Peter Geoghegan2020-05-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 857f9c36cda (which taught nbtree VACUUM to skip a scan of the index from btcleanup in situations where it doesn't seem worth it) made VACUUM maintain the oldest btpo.xact among all deleted pages for the index as a whole. It failed to handle all the details surrounding pages that are deleted by the current VACUUM operation correctly (though pages deleted by some previous VACUUM operation were processed correctly). The most immediate problem was that the special area of the page was examined without a buffer pin at one point. More fundamentally, the handling failed to account for the full range of _bt_pagedel() behaviors. For example, _bt_pagedel() sometimes deletes internal pages in passing, as part of deleting an entire subtree with btvacuumpage() caller's page as the leaf level page. The original leaf page passed to _bt_pagedel() might not be the page that it deletes first in cases where deletion can take place. It's unclear how disruptive this bug may have been, or what symptoms users might want to look out for. The issue was spotted during unrelated code review. To fix, push down the logic for maintaining the oldest btpo.xact to _bt_pagedel(). btvacuumpage() is now responsible for pages that were fully deleted by a previous VACUUM operation, while _bt_pagedel() is now responsible for pages that were deleted by the current VACUUM operation (this includes half-dead pages from a previous interrupted VACUUM operation that become fully deleted in _bt_pagedel()). Note that _bt_pagedel() should never encounter an existing deleted page. This commit theoretically breaks the ABI of a stable release by changing the signature of _bt_pagedel(). However, if any third party extension is actually affected by this, then it must already be completely broken (since there are numerous assumptions made in _bt_pagedel() that cannot be met outside of VACUUM). It seems highly unlikely that such an extension actually exists, in any case. Author: Peter Geoghegan Reviewed-By: Masahiko Sawada Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com Backpatch: 11-, where the "skip full scan" feature was introduced.
* Fix AddressSanitizer use-after-scope complaint.Peter Geoghegan2020-04-30
| | | | | | | | | | | XLogRegisterBufData() does not copy data pointed to by caller's pointer argument. Oversight in commit 0d861bbb702. Author: Peter Eisentraut Reported-By: Peter Eisentraut Discussion: https://postgr.es/m/21800dbe-a13e-22f7-d423-b81db9d249f5@2ndquadrant.com
* Remove redundant _bt_killitems() buffer check.Peter Geoghegan2020-04-29
| | | | | | _bt_getbuf() cannot return an invalid buffer. Oversight in commit 2ed5b87f96d.
* Fix some typosMichael Paquier2020-04-27
| | | | | Author: Justin Pryzby Discussion: https://postgr.es/m/20200408165653.GF2228@telsasoft.com
* Fix typoPeter Eisentraut2020-04-26
| | | | from 303640199d0
* Fix another minor page deletion buffer lock issue.Peter Geoghegan2020-04-25
| | | | | | | | | | | | | | | | | Avoid accessing the leaf page's top parent tuple without a buffer lock held during the second phase of nbtree page deletion. The old approach was safe, though only because VACUUM never drops its buffer pin (and because only VACUUM itself can modify a half-dead page). Even still, it seems like a good idea to be strict here. Tighten things up by copying the top parent page's block number to a local variable before releasing the buffer lock on the leaf page -- not after. This is a follow-up to commit fa7ff642, which fixed a similar issue in the first phase of nbtree page deletion. Update some related comments in passing. Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
* Fix minor nbtree page deletion buffer lock issue.Peter Geoghegan2020-04-25
| | | | | | | | | | | | | | | | | | Avoid accessing the deletion target page's special area during nbtree page deletion at a point where there is no buffer lock held. This issue was detected by a patch that extends Valgrind's memcheck tool to mark nbtree pages that are unsafe to access (due to not having a buffer lock or buffer pin) as NOACCESS. We do hold a buffer pin at this point, and only access the special area, so the old approach was safe. Even still, it seems like a good idea to tighten up the rules in this area. There is no reason to not simply insist on always holding a buffer lock (not just a pin) when accessing nbtree pages. Update some related comments in passing. Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
* Fix handling of WAL segments ready to be archived during crash recoveryMichael Paquier2020-04-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | 78ea8b5 has fixed an issue related to the recycling of WAL segments on standbys depending on archive_mode. However, it has introduced a regression with the handling of WAL segments ready to be archived during crash recovery, causing those files to be recycled without getting archived. This commit fixes the regression by tracking in shared memory if a live cluster is either in crash recovery or archive recovery as the handling of WAL segments ready to be archived is different in both cases (those WAL segments should not be removed during crash recovery), and by using this new shared memory state to decide if a segment can be recycled or not. Previously, it was not possible to know if a cluster was in crash recovery or archive recovery as the shared state was able to track only if recovery was happening or not, leading to the problem. A set of TAP tests is added to close the gap here, making sure that WAL segments ready to be archived are correctly handled when a cluster is in archive or crash recovery with archive_mode set to "on" or "always", for both standby and primary. Reported-by: Benoît Lobréau Author: Jehan-Guillaume de Rorthais Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost Backpatch-through: 9.5
* Remove HEAPDEBUGALLPeter Eisentraut2020-04-22
| | | | | | | | | This has been broken since PostgreSQL 12 and was probably never really used. PostgreSQL 12 added an analogous HEAPAMSLOTDEBUGALL, which still works right now, but it's also not very useful, so remove that as well. Discussion: https://www.postgresql.org/message-id/flat/645c0646-4218-d4c3-409a-a7003a0c108d%402ndquadrant.com
* Fix possible crash during FATAL exit from reindexing.Tom Lane2020-04-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | index.c supposed that it could just use a PG_TRY block to clean up the state associated with an active REINDEX operation. However, that code doesn't run if we do a FATAL exit --- for example, due to a SIGTERM shutdown signal --- while the REINDEX is happening. And that state does get consulted during catalog accesses, which makes it problematic if we do any catalog accesses during shutdown --- for example, to clean up any temp tables created in the session. If this combination of circumstances occurred, we could find ourselves trying to access already-freed memory. In debug builds that'd fairly reliably cause an assertion failure. In production we might often get away with it, but with some bad luck it could cause a core dump. Another possible bad outcome is an erroneous conclusion that an index-to-be-accessed is being reindexed; but it looks like that would be unlikely to have any consequences worse than failing to drop temp tables right away. (They'd still get dropped by the next session that uses that temp schema.) To fix, get rid of the use of PG_TRY here, and instead hook into the transaction abort mechanisms to clean up reindex state. Per bug #16378 from Alexander Lakhin. This has been wrong for a very long time, so back-patch to all supported branches. Discussion: https://postgr.es/m/16378-7a70ca41b3ec2009@postgresql.org
* Consider outliers in split interval calculation.Peter Geoghegan2020-04-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0d861bbb, which introduced deduplication to nbtree, added some logic to take large posting list tuples into account when choosing a split point. We subtract firstright posting list overhead from the projected new high key size when calculating leftfree/rightfree values for an affected candidate split point. Posting list tuples aren't special to nbtsplitloc.c, but taking them into account like this makes a huge difference in practice. Posting list tuples are frequently tuple size outliers. However, commit 0d861bbb missed a closely related issue: split interval itself is calculated based on the assumption that tuples on the page being split are roughly equisized. That assumption was acceptable back when commit fab25024 taught the logic for choosing a split point about suffix truncation, but it's pretty questionable now that very large tuple sizes are common. This oversight led to unbalanced page splits in low cardinality multi-column indexes when deduplication was used: page splits that don't give sufficient weight to how unbalanced the split is when the interval happens to include some large posting list tuples (and when most other tuples on the page are not so large). Nail this down by calculating an initial split interval in a way that's attuned to the actual cost that we want to keep under control (not a fuzzy proxy for the cost): apply a leftfree + rightfree evenness test to each candidate split point that actually gets included in the split interval (for the default strategy). This replaces logic that used a percentage of all legal split points for the page as the basis of the initial split interval. Discussion: https://postgr.es/m/CAH2-WznJt5aT2uUB2Bs+JBLdwe0XTX67+xeLFcaNvCKxO=QBVQ@mail.gmail.com
* Slightly simplify nbtree split point choice loop.Peter Geoghegan2020-04-15
| | | | | Spotted during post-commit review of the nbtree deduplication commit (commit 0d861bbb).
* Remove obsolete "hole in center of page" comment.Peter Geoghegan2020-04-14
| | | | | | | | | A comment from the Berkeley days incorrectly claimed that the page management code cares about the contents of the hole in the center of the page (at least in the case of the left half of an nbtree page split). Commit 8fa30f906be added an addendum that stated that the original comment was "probably obsolete". It's definitely obsolete, though, so remove the original comment plus the addendum.
* Rearrange _bt_insertonpg() "update metapage" code.Peter Geoghegan2020-04-14
| | | | | | Nest the "update metapage as part of insert into root-like page" branch inside the broader "insert into internal page" branch. This improves readability.
* Add defensive "split_only_page" nbtree assertion.Peter Geoghegan2020-04-13
| | | | | | | | | | | | | | Clearly it's not okay for nbtree to split a page that is the only page on its level, and then find that it has to split the parent one level up in turn. There is simply no code to handle the split_only_page case in the _bt_insertonpg() "newitem won't fit" branch (only the "newitem fits" branch handles split_only_page). Add a defensive assertion that will fail if a split_only_page call to _bt_insertonpg() somehow ends up splitting the target/parent page. I (pgeoghegan) believe that we don't need split_only_page handling for the "newitem won't fit" branch because anybody calling _bt_insertonpg() like this would have to hold a lock on the same one and only child page.
* Comments and doc fixes for commit 40d964ec99.Amit Kapila2020-04-14
| | | | | | | Reported-by: Justin Pryzby Author: Justin Pryzby, with few changes by me Reviewed-by: Amit Kapila and Sawada Masahiko Discussion: https://postgr.es/m/20200322021801.GB2563@telsasoft.com
* Make _bt_insertonpg() more like _bt_split().Peter Geoghegan2020-04-13
| | | | | | | | | | It seems like a good idea for nbtree's retail insert code to be absolutely consistent with nbtree's page split code for anything that naturally requires equivalent handling. Anything that concerns inserting newitem (which is handled as part of the page split atomic action when a page split is required) should work in exactly the same way. With that in mind, make _bt_insertonpg() handle 'cbuf' in a way that matches _bt_split().
* Harmonize nbtree page split point code.Peter Geoghegan2020-04-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An nbtree split point can be thought of as a point between two adjoining tuples from an imaginary version of the page being split that includes the incoming/new item (in addition to the items that really are on the page). These adjoining tuples are called the lastleft and firstright tuples. The variables that represent split points contained a field called firstright, which is an offset number of the first data item from the original page that goes on the new right page. The corresponding tuple from origpage was usually the same thing as the actual firstright tuple, but not always: the firstright tuple is sometimes the new/incoming item instead. This situation seems unnecessarily confusing. Make things clearer by renaming the origpage offset returned by _bt_findsplitloc() to "firstrightoff". We now have a firstright tuple and a firstrightoff offset number which are comparable to the newitem/lastleft tuples and the newitemoff/lastleftoff offset numbers respectively. Also make sure that we are consistent about how we describe nbtree page split point state. Push the responsibility for dealing with pg_upgrade'd !heapkeyspace indexes down to lower level code, relieving _bt_split() from dealing with it directly. This means that we always have a palloc'd left page high key on the leaf level, no matter what. This enables simplifying some of the code (and code comments) within _bt_split(). Finally, restructure the page split code to make it clearer why suffix truncation (which only takes place during leaf page splits) is completely different to the first data item truncation that takes place during internal page splits. Tuples are marked as having fewer attributes stored in both cases, and the firstright tuple is truncated in both cases, so it's easy to imagine somebody missing the distinction.