aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
...
* Adjust VACUUM's removable cutoff log message.Peter Geoghegan2022-04-15
| | | | | | | | | | | | | | | | | | | | | The age of OldestXmin (a.k.a. "removable cutoff") when VACUUM ends often indicates the approximate number of XIDs consumed while VACUUM ran. However, there is at least one important exception: the cutoff could be held back by a snapshot that was acquired before our VACUUM even began. Successive VACUUM operations may even use exactly the same old cutoff in extreme cases involving long held snapshots. The log messages that described how removable cutoff aged (which were added by commit 872770fd) created the impression that we were reporting on how VACUUM's usable cutoff advanced while VACUUM ran, which was misleading in these extreme cases. Fix by using a more general wording. Per gripe from Tom Lane. In passing, relocate related instrumentation code for clarity. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/1643035.1650035653@sss.pgh.pa.us
* Prevent access to no-longer-pinned buffer in heapam_tuple_lock().Tom Lane2022-04-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | heap_fetch() used to have a "keep_buf" parameter that told it to return ownership of the buffer pin to the caller after finding that the requested tuple TID exists but is invisible to the specified snapshot. This was thoughtlessly removed in commit 5db6df0c0, which broke heapam_tuple_lock() (formerly EvalPlanQualFetch) because that function needs to do more accesses to the tuple even if it's invisible. The net effect is that we would continue to touch the page for a microsecond or two after releasing pin on the buffer. Usually no harm would result; but if a different session decided to defragment the page concurrently, we could see garbage data and mistakenly conclude that there's no newer tuple version to chain up to. (It's hard to say whether this has happened in the field. The bug was actually found thanks to a later change that allowed valgrind to detect accesses to non-pinned buffers.) The most reasonable way to fix this is to reintroduce keep_buf, although I made it behave slightly differently: buffer ownership is passed back only if there is a valid tuple at the requested TID. In HEAD, we can just add the parameter back to heap_fetch(). To avoid an API break in the back branches, introduce an additional function heap_fetch_extended() in those branches. In HEAD there is an additional, less obvious API change: tuple->t_data will be set to NULL in all cases where buffer ownership is not returned, in particular when the tuple exists but fails the time qual (and !keep_buf). This is to defend against any other callers attempting to access non-pinned buffers. We concluded that making that change in back branches would be more likely to introduce problems than cure any. In passing, remove a comment about heap_fetch that was obsoleted by 9a8ee1dc6. Per bug #17462 from Daniil Anisimov. Back-patch to v12 where the bug was introduced. Discussion: https://postgr.es/m/17462-9c98a0f00df9bd36@postgresql.org
* Remove extraneous blank lines before block-closing bracesAlvaro Herrera2022-04-13
| | | | | | | | | These are useless and distracting. We wouldn't have written the code with them to begin with, so there's no reason to keep them. Author: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/20220411020336.GB26620@telsasoft.com Discussion: https://postgr.es/m/attachment/133167/0016-Extraneous-blank-lines.patch
* Revert the addition of GetMaxBackends() and related stuff.Robert Haas2022-04-12
| | | | | | | | | | | | This reverts commits 0147fc7, 4567596, aa64f23, and 5ecd018. There is no longer agreement that introducing this function was the right way to address the problem. The consensus now seems to favor trying to make a correct value for MaxBackends available to mdules executing their _PG_init() functions. Nathan Bossart Discussion: http://postgr.es/m/20220323045229.i23skfscdbvrsuxa@jrouhaud
* Make XLogRecGetBlockTag() throw error if there's no such block.Tom Lane2022-04-11
| | | | | | | | | | | | | | | | | | | | | | | | All but a few existing callers assume without checking that this function succeeds. While it probably will, that's a poor excuse for not checking. Let's make it return void and instead throw an error if it doesn't find the block reference. Callers that actually need to handle the no-such-block case must now use the underlying function XLogRecGetBlockTagExtended. In addition to being a bit less error-prone, this should also serve to suppress some Coverity complaints about XLogRecGetBlockRefInfo. While at it, clean up some inconsistency about use of the XLogRecHasBlockRef macro: make XLogRecGetBlockTagExtended use that instead of open-coding the same condition, and avoid calling XLogRecHasBlockRef twice in relevant code paths. (That is, calling XLogRecHasBlockRef followed by XLogRecGetBlockTag is now deprecated: use XLogRecGetBlockTagExtended instead.) Patch HEAD only; this doesn't seem to have enough value to consider a back-branch API break. Discussion: https://postgr.es/m/425039.1649701221@sss.pgh.pa.us
* Remove comment about historic heap vacuuming issue.Peter Geoghegan2022-04-11
| | | | | | | | | Remove comment block about how heap page vacuuming used to set tuples with storage to LP_UNUSED in a rare edge case that can no longer happen following commit 8523492d4e. The comments seem unnecessary now, since it's now generally clear that heap vacuuming only applies to LP_DEAD items from VACUUM's first heap pass following more recent work from commits 12b5ade902 and 4f8d9d1217.
* Remove dead code in do_pg_backup_start().Tom Lane2022-04-11
| | | | | | | | | | As of commit 39969e2a1, no caller of do_pg_backup_start() passes NULL for labelfile or tblspcmapfile, nor is it plausible that any would do so in the future. Remove the code that coped with that case, as (a) it's dead and (b) it causes Coverity to bleat about possibly leaked storage. While here, do some janitorial work on the function's header comment.
* Fix various typos and spelling mistakes in code commentsDavid Rowley2022-04-11
| | | | | Author: Justin Pryzby Discussion: https://postgr.es/m/20220411020336.GB26620@telsasoft.com
* Rename delayChkpt to delayChkptFlags.Robert Haas2022-04-08
| | | | | | | | | | | | | | | | | | Before commit 412ad7a55639516f284cd0ef9757d6ae5c7abd43, delayChkpt was a Boolean. Now it's an integer. Extensions using it need to be appropriately updated, so let's rename the field to make sure that a hard compilation failure occurs. Replacing delayChkpt with delayChkptFlags made a few comments extend past 80 characters, so I reflowed them and changed some wording very slightly. The back-branches will need a different change to restore compatibility with existing minor releases; this is just for master. Per suggestion from Tom Lane. Discussion: http://postgr.es/m/a7880f4d-1d74-582a-ada7-dad168d046d1@enterprisedb.com
* Check XLogRecHasBlockRef() before XLogRecHasBlockImage().Jeff Davis2022-04-08
| | | | Trial fix of buildfarm failures on kestrel and tamandua.
* Add contrib/pg_walinspect.Jeff Davis2022-04-08
| | | | | | | | | Provides similar functionality to pg_waldump, but from a SQL interface rather than a separate utility. Author: Bharath Rupireddy Reviewed-by: Greg Stark, Kyotaro Horiguchi, Andres Freund, Ashutosh Sharma, Nitin Jadhav, RKN Sai Krishna Discussion: https://postgr.es/m/CALj2ACUGUYXsEQdKhEdsBzhGEyF3xggvLdD8C0VT72TNEfOiog%40mail.gmail.com
* Remove error message hints mentioning configure optionsPeter Eisentraut2022-04-08
| | | | | | | | | | | | These are usually not useful since users will use packaged distributions and won't be interested in rebuilding their installation from source. Also, we have only used these kinds of hints for some features and in some places, not consistently throughout. Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/2552aed7-d0e9-280a-54aa-2dc7073f371d%40enterprisedb.com
* Truncate line pointer array during heap pruning.Peter Geoghegan2022-04-07
| | | | | | | | | | | | | | | | | | | Reclaim space from the line pointer array when heap pruning leaves behind a contiguous group of LP_UNUSED items at the end of the array. This happens during subsequent page defragmentation. Certain kinds of heap line pointer bloat are ameliorated by this new optimization. Follow-up work to commit 3c3b8a4b26, which taught VACUUM to truncate the line pointer array in about the same way during VACUUM's second pass over the heap. We now apply line pointer array truncation during both the first and the second pass over the heap made by VACUUM. We can also perform line pointer array truncation during opportunistic pruning. Matthias van de Meent, with small tweaks by me. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com Discussion: https://postgr.es/m/CAEze2Wg36%2B4at2eWJNcYNiW2FJmht34x3YeX54ctUSs7kKoNcA%40mail.gmail.com
* Fix typo in xlogrecovery.c code commentDaniel Gustafsson2022-04-07
| | | | | Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://postgr.es/m/CALj2ACUoPtnReT=yAQMcWLtcCpk7p83xjeA8tiRX8Q0_sjh8kw@mail.gmail.com
* Prefetch data referenced by the WAL, take II.Thomas Munro2022-04-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a new GUC recovery_prefetch. When enabled, look ahead in the WAL and try to initiate asynchronous reading of referenced data blocks that are not yet cached in our buffer pool. For now, this is done with posix_fadvise(), which has several caveats. Since not all OSes have that system call, "try" is provided so that it can be enabled where available. Better mechanisms for asynchronous I/O are possible in later work. Set to "try" for now for test coverage. Default setting to be finalized before release. The GUC wal_decode_buffer_size limits the distance we can look ahead in bytes of decoded data. The existing GUC maintenance_io_concurrency is used to limit the number of concurrent I/Os allowed, based on pessimistic heuristics used to infer that I/Os have begun and completed. We'll also not look more than maintenance_io_concurrency * 4 block references ahead. Reviewed-by: Julien Rouhaud <rjuju123@gmail.com> Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (earlier version) Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version) Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> (earlier version) Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (earlier version) Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> (earlier version) Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> (earlier version) Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> (earlier version) Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
* Fix warning introduced in 5c279a6d350.Jeff Davis2022-04-07
| | | | | | | | | Change two macros to be static inline functions instead to keep the data type consistent. This avoids a "comparison is always true" warning that was occurring with -Wtype-limits. In the process, change the names to look less like macros. Discussion: https://postgr.es/m/20220407063505.njnnrmbn4sxqfsts@alap3.anarazel.de
* Fix compilation with WAL_DEBUG.Andres Freund2022-04-06
| | | | | | Broke with 5c279a6d350. But looks like it had been half-broken since 70e81861fad, because 'rmid' didn't refer to the current record's rmid anymore, but to rmid from "Initialize resource managers" - a constant.
* Custom WAL Resource Managers.Jeff Davis2022-04-06
| | | | | | | | | | | | | Allow extensions to specify a new custom resource manager (rmgr), which allows specialized WAL. This is meant to be used by a Table Access Method or Index Access Method. Prior to this commit, only Generic WAL was available, which offers support for recovery and physical replication but not logical replication. Reviewed-by: Julien Rouhaud, Bharath Rupireddy, Andres Freund Discussion: https://postgr.es/m/ed1fb2e22d15d3563ae0eb610f7b61bb15999c0a.camel%40j-davis.com
* Add single-item cache when looking at topmost XID of a subtrans XIDMichael Paquier2022-04-07
| | | | | | | | | | | | This change affects SubTransGetTopmostTransaction(), used to find the topmost transaction ID of a given transaction ID. The cache is able to store one value, so as we can save the backend from unnecessary lookups at pg_subtrans/ on repetitive calls of this routine. There is a similar practice in transam.c, for example. Author: Simon Riggs Reviewed-by: Andrey Borodin, Julien Rouhaud Discussion: https://postgr.es/m/CANbhV-G8Co=yq4v4BkW7MJDqVt68K_8A48nAZ_+8UQS7LrwLEQ@mail.gmail.com
* pgstat: store statistics in shared memory.Andres Freund2022-04-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously the statistics collector received statistics updates via UDP and shared statistics data by writing them out to temporary files regularly. These files can reach tens of megabytes and are written out up to twice a second. This has repeatedly prevented us from adding additional useful statistics. Now statistics are stored in shared memory. Statistics for variable-numbered objects are stored in a dshash hashtable (backed by dynamic shared memory). Fixed-numbered stats are stored in plain shared memory. The header for pgstat.c contains an overview of the architecture. The stats collector is not needed anymore, remove it. By utilizing the transactional statistics drop infrastructure introduced in a prior commit statistics entries cannot "leak" anymore. Previously leaked statistics were dropped by pgstat_vacuum_stat(), called from [auto-]vacuum. On systems with many small relations pgstat_vacuum_stat() could be quite expensive. Now that replicas drop statistics entries for dropped objects, it is not necessary anymore to reset stats when starting from a cleanly shut down replica. Subsequent commits will perform some further code cleanup, adapt docs and add tests. Bumps PGSTAT_FILE_FORMAT_ID. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Author: Andres Freund <andres@anarazel.de> Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: Justin Pryzby <pryzby@telsasoft.com> Reviewed-By: "David G. Johnston" <david.g.johnston@gmail.com> Reviewed-By: Tomas Vondra <tomas.vondra@2ndquadrant.com> (in a much earlier version) Reviewed-By: Arthur Zakirov <a.zakirov@postgrespro.ru> (in a much earlier version) Reviewed-By: Antonin Houska <ah@cybertec.at> (in a much earlier version) Discussion: https://postgr.es/m/20220303021600.hs34ghqcw6zcokdh@alap3.anarazel.de Discussion: https://postgr.es/m/20220308205351.2xcn6k4x5yivcxyd@alap3.anarazel.de Discussion: https://postgr.es/m/20210319235115.y3wz7hpnnrshdyv6@alap3.anarazel.de
* pgstat: normalize function naming.Andres Freund2022-04-06
| | | | | | Most of pgstat uses pgstat_<verb>_<subject>() or just <verb>_<subject>(). But not all (some introduced fairly recently by me). Rename ones that aren't intentionally following a different scheme (e.g. AtEOXact_*).
* pgstat: scaffolding for transactional stats creation / drop.Andres Freund2022-04-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One problematic part of the current statistics collector design is that there is no reliable way of getting rid of statistics entries. Because of that pgstat_vacuum_stat() (called by [auto-]vacuum) matches all stats for the current database with the catalog contents and tries to drop now-superfluous entries. That's quite expensive. What's worse, it doesn't work on physical replicas, despite physical replicas collection statistics entries. This commit introduces infrastructure to create / drop statistics entries transactionally, together with the underlying catalog objects (functions, relations, subscriptions). pgstat_xact.c maintains a list of stats entries created / dropped transactionally in the current transaction. To ensure the removal of statistics entries is durable dropped statistics entries are included in commit / abort (and prepare) records, which also ensures that stats entries are dropped on standbys. Statistics entries created separately from creating the underlying catalog object (e.g. when stats were previously lost due to an immediate restart) are *not* WAL logged. However that can only happen outside of the transaction creating the catalog object, so it does not lead to "leaked" statistics entries. For this to work, functions creating / dropping functions / relations / subscriptions need to call into pgstat. For subscriptions this was already done when dropping subscriptions, via pgstat_report_subscription_drop() (now renamed to pgstat_drop_subscription()). This commit does not actually drop stats yet, it just provides the infrastructure. It is however a largely independent piece of infrastructure, so committing it separately makes sense. Bumps XLOG_PAGE_MAGIC. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20220303021600.hs34ghqcw6zcokdh@alap3.anarazel.de
* Suppress "variable 'pagesaving' set but not used" warning.Tom Lane2022-04-06
| | | | | | | With asserts disabled, late-model clang notices that this variable is incremented but never otherwise read. Discussion: https://postgr.es/m/3171401.1649275153@sss.pgh.pa.us
* pgstat: stats collector references in comments.Andres Freund2022-04-06
| | | | | | | | | | | | | | | | | | Soon the stats collector will be no more, with statistics instead getting stored in shared memory. There are a lot of references to the stats collector in comments. This commit replaces most of these references with "cumulative statistics system", with the remaining ones getting replaced as part of subsequent commits. This is done separately from the - quite large - shared memory statistics patch to make review easier. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Justin Pryzby <pryzby@telsasoft.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20220303021600.hs34ghqcw6zcokdh@alap3.anarazel.de Discussion: https://postgr.es/m/20220308205351.2xcn6k4x5yivcxyd@alap3.anarazel.de
* Remove exclusive backup modeStephen Frost2022-04-06
| | | | | | | | | | | | | | | | | | | | | | Exclusive-mode backups have been deprecated since 9.6 (when non-exclusive backups were introduced) due to the issues they can cause should the system crash while one is running and generally because non-exclusive provides a much better interface. Further, exclusive backup mode wasn't really being tested (nor was most of the related code- like being able to log in just to stop an exclusive backup and the bits of the state machine related to that) and having to possibly deal with an exclusive backup and the backup_label file existing during pg_basebackup, pg_rewind, etc, added other complexities that we are better off without. This patch removes the exclusive backup mode, the various special cases for dealing with it, and greatly simplifies the online backup code and documentation. Authors: David Steele, Nathan Bossart Reviewed-by: Chapman Flack Discussion: https://postgr.es/m/ac7339ca-3718-3c93-929f-99e725d1172c@pgmasters.net https://postgr.es/m/CAHg+QDfiM+WU61tF6=nPZocMZvHDzCK47Kneyb0ZRULYzV5sKQ@mail.gmail.com
* Fix unsigned output format in SLRU error reportingPeter Eisentraut2022-04-06
| | | | | | | | Avoid printing signed values as unsigned. (No impact in practice expected.) Author: Pavel Borisov <pashkin.elfe@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CALT9ZEHN7hWJo6MgJKqoDMGj%3DGOzQU50wTvOYZXDj7x%3DsUK-kw%40mail.gmail.com
* vacuumlazy.c: Further consolidate resource allocation.Peter Geoghegan2022-04-04
| | | | | | | | | Move remaining VACUUM resource allocation and deallocation code from lazy_scan_heap() to its caller, heap_vacuum_rel(). This finishes off work started by commit 73f6ec3d. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wzk3fNBa_S3Ngi+16GQiyJ=AmUu3oUY99syMDTMRxitfyQ@mail.gmail.com
* Adjust tuplesort API to have bitwise option flagsDavid Rowley2022-04-04
| | | | | | | | | | | | | | | | This replaces the bool flag for randomAccess. An upcoming patch requires adding another option, so instead of breaking the API for that, then breaking it again one day if we add more options, let's just break it once. Any boolean options we add in the future will just make use of an unused bit in the flags. Any extensions making use of tuplesorts will need to update their code to pass TUPLESORT_RANDOMACCESS instead of true for randomAccess. TUPLESORT_NONE can be used for a set of empty options. Author: David Rowley Reviewed-by: Justin Pryzby Discussion: https://postgr.es/m/CAApHDvoH4ASzsAOyHcxkuY01Qf%2B%2B8JJ0paw%2B03dk%2BW25tQEcNQ%40mail.gmail.com
* Improve the generation memory allocatorDavid Rowley2022-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Here we make a series of improvements to the generation memory allocator, namely: 1. Allow generation contexts to have a minimum, initial and maximum block sizes. The standard allocator allows this already but when the generation context was added, it only allowed fixed-sized blocks. The problem with fixed-sized blocks is that it's difficult to choose how large to make the blocks. If the chosen size is too small then we'd end up with a large number of blocks and a large number of malloc calls. If the block size is made too large, then memory is wasted. 2. Add support for "keeper" blocks. This is a special block that is allocated along with the context itself but is never freed. Instead, when the last chunk in the keeper block is freed, we simply mark the block as empty to allow new allocations to make use of it. 3. Add facility to "recycle" newly empty blocks instead of freeing them and having to later malloc an entire new block again. We do this by recording a single GenerationBlock which has become empty of any chunks. When we run out of space in the current block, we check to see if there is a "freeblock" and use that if it contains enough space for the allocation. Author: David Rowley, Tomas Vondra Reviewed-by: Andy Fan Discussion: https://postgr.es/m/d987fd54-01f8-0f73-af6c-519f799a0ab8@enterprisedb.com
* Generalize how VACUUM skips all-frozen pages.Peter Geoghegan2022-04-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Non-aggressive VACUUMs were at a gratuitous disadvantage (relative to aggressive VACUUMs) around advancing relfrozenxid and relminmxid before now. The issue only came up when concurrent activity unset some heap page's visibility map bit right as VACUUM was considering if the page should get counted in frozenskipped_pages. The non-aggressive case would recheck the all-frozen bit at this point. The aggressive case reasoned that the page (a skippable page) must have at least been all-frozen in the recent past, so skipping it won't make relfrozenxid advancement unsafe (which is never okay for aggressive VACUUMs). The recheck created a window for some other backend to confuse matters for VACUUM. If the page's VM bit turned out to be unset, VACUUM would conclude that the page was _never_ all-frozen. frozenskipped_pages was not incremented, and yet VACUUM couldn't back out of skipping at this late stage (it couldn't choose to scan the page instead). This made it unsafe to advance relfrozenxid later on. Consistently avoid the issue by generalizing how we skip frozen pages during aggressive VACUUMs: take the same approach when skipping any skippable page range during aggressive and non-aggressive VACUUMs alike. The new approach makes ranges (not individual pages) the fundamental unit of skipping using the visibility map. frozenskipped_pages is replaced with a boolean flag that represents whether some skippable range with one or more all-visible pages was actually skipped. It is safe for VACUUM to treat a page as all-frozen provided it at least had its all-frozen bit set after the OldestXmin cutoff was established. VACUUM is only required to scan pages that might have XIDs < OldestXmin (unfrozen XIDs) to be able to safely advance relfrozenxid. Tuples concurrently inserted on "skipped" pages can be thought of as equivalent to tuples concurrently inserted on a block >= rel_pages. It's possible that the issue this commit fixes hardly ever came up in practice. But we only had to be unlucky once to lose out on advancing relfrozenxid -- a single affected heap page was enough to throw VACUUM off. That seems like something to avoid on general principle. This is similar to an issue fixed by commit 44fa8488, which taught vacuumlazy.c to not give up on non-aggressive relfrozenxid advancement just because a cleanup lock wasn't immediately available on some heap page. Skipping an all-visible range is now explicitly structured as a choice made by non-aggressive VACUUMs, by weighing known costs (scanning extra skippable pages to freeze their tuples early) against known benefits (advancing relfrozenxid early). This works in essentially the same way as it always has (don't skip ranges < SKIP_PAGES_THRESHOLD). We could do much better here in the future by considering other relevant factors. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/CAH2-Wzn6bGJGfOy3zSTJicKLw99PHJeSOQBOViKjSCinaxUKDQ@mail.gmail.com Discussion: https://postgr.es/m/CA%2BTgmoZiSOY6H7aadw5ZZGm7zYmfDzL6nwmL5V7GL4HgJgLF_w%40mail.gmail.com
* Set relfrozenxid to oldest extant XID seen by VACUUM.Peter Geoghegan2022-04-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When VACUUM set relfrozenxid before now, it set it to whatever value was used to determine which tuples to freeze -- the FreezeLimit cutoff. This approach was very naive. The relfrozenxid invariant only requires that new relfrozenxid values be <= the oldest extant XID remaining in the table (at the point that the VACUUM operation ends), which in general might be much more recent than FreezeLimit. VACUUM now carefully tracks the oldest remaining XID/MultiXactId as it goes (the oldest remaining values _after_ lazy_scan_prune processing). The final values are set as the table's new relfrozenxid and new relminmxid in pg_class at the end of each VACUUM. The oldest XID might come from a tuple's xmin, xmax, or xvac fields. It might even come from one of the table's remaining MultiXacts. Final relfrozenxid values must still be >= FreezeLimit in an aggressive VACUUM (FreezeLimit still acts as a lower bound on the final value that aggressive VACUUM can set relfrozenxid to). Since standard VACUUMs still make no guarantees about advancing relfrozenxid, they might as well set relfrozenxid to a value from well before FreezeLimit when the opportunity presents itself. In general standard VACUUMs may now set relfrozenxid to any value > the original relfrozenxid and <= OldestXmin. Credit for the general idea of using the oldest extant XID to set pg_class.relfrozenxid at the end of VACUUM goes to Andres Freund. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
* vacuumlazy.c: Clean up variable declarations.Peter Geoghegan2022-04-02
| | | | | | | Move some of the heap_vacuum_rel() instrumentation related variables to the scope where they're actually needed. Also reorder some of the variable declarations at the start of heap_vacuum_rel() so that related variables appear together.
* Specialize tuplesort routines for different kinds of abbreviated keysJohn Naylor2022-04-02
| | | | | | | | | | | | | | | | | | | Previously, the specialized tuplesort routine inlined handling for reverse-sort and NULLs-ordering but called the datum comparator via a pointer in the SortSupport struct parameter. Testing has showed that we can get a useful performance gain by specializing datum comparison for the different representations of abbreviated keys -- signed and unsigned 64-bit integers and signed 32-bit integers. Almost all abbreviatable data types will benefit -- the only exception for now is numeric, since the datum comparison is more complex. The performance gain depends on data type and input distribution, but often falls in the range of 10-20% faster. Thomas Munro Reviewed by Peter Geoghegan, review and performance testing by me Discussion: https://www.postgresql.org/message-id/CA%2BhUKGKKYttZZk-JMRQSVak%3DCXSJ5fiwtirFf%3Dn%3DPAbumvn1Ww%40mail.gmail.com
* Add macros in hash and btree AMs to get the special area of their pagesMichael Paquier2022-04-01
| | | | | | | | | | | This makes the code more consistent with SpGiST, GiST and GIN, that already use this style, and the idea is to make easier the introduction of more sanity checks for each of these AM-specific macros. BRIN uses a different set of macros to get a page's type and flags, so it has no need for something similar. Author: Matthias van de Meent Discussion: https://postgr.es/m/CAEze2WjE3+tGO9Fs9+iZMU+z6mMZKo54W1Zt98WKqbEUHbHOBg@mail.gmail.com
* Add new block-by-block strategy for CREATE DATABASE.Robert Haas2022-03-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because this strategy logs changes on a block-by-block basis, it avoids the need to checkpoint before and after the operation. However, because it logs each changed block individually, it might generate a lot of extra write-ahead logging if the template database is large. Therefore, the older strategy remains available via a new STRATEGY parameter to CREATE DATABASE, and a corresponding --strategy option to createdb. Somewhat controversially, this patch assembles the list of relations to be copied to the new database by reading the pg_class relation of the template database. Cross-database access like this isn't normally possible, but it can be made to work here because there can't be any connections to the database being copied, nor can it contain any in-doubt transactions. Even so, we have to use lower-level interfaces than normal, since the table scan and relcache interfaces will not work for a database to which we're not connected. The advantage of this approach is that we do not need to rely on the filesystem to determine what ought to be copied, but instead on PostgreSQL's own knowledge of the database structure. This avoids, for example, copying stray files that happen to be located in the source database directory. Dilip Kumar, with a fairly large number of cosmetic changes by me. Reviewed and tested by Ashutosh Sharma, Andres Freund, John Naylor, Greg Nancarrow, Neha Sharma. Additional feedback from Bruce Momjian, Heikki Linnakangas, Julien Rouhaud, Adam Brusselback, Kyotaro Horiguchi, Tomas Vondra, Andrew Dunstan, Álvaro Herrera, and others. Discussion: http://postgr.es/m/CA+TgmoYtcdxBjLh31DLxUXHxFVMPGzrU5_T=CYCvRyFHywSBUQ@mail.gmail.com
* Revert "Fix replay of create database records on standby"Alvaro Herrera2022-03-29
| | | | | | | This reverts commit 49d9cfc68bf4. The approach taken by this patch has problems, so we'll come up with a radically different fix. Discussion: https://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com
* Fix replay of create database records on standbyAlvaro Herrera2022-03-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Crash recovery on standby may encounter missing directories when replaying create database WAL records. Prior to this patch, the standby would fail to recover in such a case. However, the directories could be legitimately missing. Consider a sequence of WAL records as follows: CREATE DATABASE DROP DATABASE DROP TABLESPACE If, after replaying the last WAL record and removing the tablespace directory, the standby crashes and has to replay the create database record again, the crash recovery must be able to move on. This patch adds a mechanism similar to invalid-page tracking, to keep a tally of missing directories during crash recovery. If all the missing directory references are matched with corresponding drop records at the end of crash recovery, the standby can safely continue following the primary. Backpatch to 13, at least for now. The bug is older, but fixing it in older branches requires more careful study of the interactions with commit e6d8069522c8, which appeared in 13. A new TAP test file is added to verify the condition. However, because it depends on commit d6d317dbf615, it can only be added to branch master. I (Álvaro) manually verified that the code behaves as expected in branch 14. It's a bit nervous-making to leave the code uncovered by tests in older branches, but leaving the bug unfixed is even worse. Also, the main reason this fix took so long is precisely that we couldn't agree on a good strategy to approach testing for the bug, so perhaps this is the best we can do. Diagnosed-by: Paul Guo <paulguo@gmail.com> Author: Paul Guo <paulguo@gmail.com> Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Author: Asim R Praveen <apraveen@pivotal.io> Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
* Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.Robert Haas2022-03-24
| | | | | | | | | | | | | | | | If TRUNCATE causes some buffers to be invalidated and thus the checkpoint does not flush them, TRUNCATE must also ensure that the corresponding files are truncated on disk. Otherwise, a replay from the checkpoint might find that the buffers exist but have the wrong contents, which may cause replay to fail. Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design suggestion from Heikki Linnakangas, with some changes to the comments by me. Review of this and a prior patch that approached the issue differently by Heikki Linnakangas, Andres Freund, Álvaro Herrera, Masahiko Sawada, and Tom Lane. Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
* Change fastgetattr and heap_getattr to inline functionsAlvaro Herrera2022-03-24
| | | | | | | | | | | | They were macros previously, but recent callsite additions made Coverity complain about one of the assertions being always true. This change could have been made a long time ago, but the Coverity complain broke the inertia. Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com> Discussion: https://postgr.es/m/202203241021.uts52sczx3al@alvherre.pgsql
* Fix "missing continuation record" after standby promotionAlvaro Herrera2022-03-23
| | | | | | | | | | | | | | Invalidate abortedRecPtr and missingContrecPtr after a missing continuation record is successfully skipped on a standby. This fixes a PANIC caused when a recently promoted standby attempts to write an OVERWRITE_RECORD with an LSN of the previously read aborted record. Backpatch to 10 (all stable versions). Author: Sami Imseih <simseih@amazon.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/44D259DE-7542-49C4-8A52-2AB01534DCA9@amazon.com
* Add support for security invoker views.Dean Rasheed2022-03-22
| | | | | | | | | | | | | | | | | | | | A security invoker view checks permissions for accessing its underlying base relations using the privileges of the user of the view, rather than the privileges of the view owner. Additionally, if any of the base relations are tables with RLS enabled, the policies of the user of the view are applied, rather than those of the view owner. This allows views to be defined without giving away additional privileges on the underlying base relations, and matches a similar feature available in other database systems. It also allows views to operate more naturally with RLS, without affecting the assignments of policies to users. Christoph Heiss, with some additional hacking by me. Reviewed by Laurenz Albe and Wolfgang Walther. Discussion: https://postgr.es/m/b66dd6d6-ad3e-c6f2-8b90-47be773da240%40cybertec.at
* Remove workarounds for avoiding [U]INT64_FORMAT in translatable strings.Tom Lane2022-03-21
| | | | | | | | | Further code simplification along the same lines as d914eb347 and earlier patches. Aleksander Alekseev, Japin Li Discussion: https://postgr.es/m/CAJ7c6TMSKi3Xs8h5MP38XOnQQpBLazJvVxVfPn++roitDJcR7g@mail.gmail.com
* pgstat: rename pgstat_initstats() to pgstat_relation_init().Andres Freund2022-03-20
| | | | | | | | | The old name was overly generic. An upcoming commit moves relation stats handling into its own file, making pgstat_initstats() look even more out of place. Author: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20220303021600.hs34ghqcw6zcokdh@alap3.anarazel.de
* Add circular WAL decoding buffer, take II.Thomas Munro2022-03-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Teach xlogreader.c to decode the WAL into a circular buffer. This will support optimizations based on looking ahead, to follow in a later commit. * XLogReadRecord() works as before, decoding records one by one, and allowing them to be examined via the traditional XLogRecGetXXX() macros and certain traditional members like xlogreader->ReadRecPtr. * An alternative new interface XLogReadAhead()/XLogNextRecord() is added that returns pointers to DecodedXLogRecord objects so that it's now possible to look ahead in the WAL stream while replaying. * In order to be able to use the new interface effectively while streaming data, support is added for the page_read() callback to respond to a new nonblocking mode with XLREAD_WOULDBLOCK instead of waiting for more data to arrive. No direct user of the new interface is included in this commit, though XLogReadRecord() uses it internally. Existing code doesn't need to change, except in a few places where it was accessing reader internals directly and now needs to go through accessor macros. Reviewed-by: Julien Rouhaud <rjuju123@gmail.com> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
* Fix race between DROP TABLESPACE and checkpointing.Thomas Munro2022-03-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Commands like ALTER TABLE SET TABLESPACE may leave files for the next checkpoint to clean up. If such files are not removed by the time DROP TABLESPACE is called, we request a checkpoint so that they are deleted. However, there is presently a window before checkpoint start where new unlink requests won't be scheduled until the following checkpoint. This means that the checkpoint forced by DROP TABLESPACE might not remove the files we expect it to remove, and the following ERROR will be emitted: ERROR: tablespace "mytblspc" is not empty To fix, add a call to AbsorbSyncRequests() just before advancing the unlink cycle counter. This ensures that any unlink requests forwarded prior to checkpoint start (i.e., when ckpt_started is incremented) will be processed by the current checkpoint. Since AbsorbSyncRequests() performs memory allocations, it cannot be called within a critical section, so we also need to move SyncPreCheckpoint() to before CreateCheckPoint()'s critical section. This is an old bug, so back-patch to all supported versions. Author: Nathan Bossart <nathandbossart@gmail.com> Reported-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20220215235845.GA2665318%40nathanxps13
* Fix collection of typos in the code and the documentationMichael Paquier2022-03-15
| | | | | | | | Some words were duplicated while other places were grammatically incorrect, including one variable name in the code. Author: Otto Kekalainen, Justin Pryzby Discussion: https://postgr.es/m/7DDBEFC5-09B6-4325-B942-B563D1A24BDC@amazon.com
* Fix pg_basebackup with in-place tablespaces.Thomas Munro2022-03-15
| | | | | | | | | | | Previously, pg_basebackup from a cluster that contained an 'in-place' tablespace, as introduced by commit 7170f215, would produce a harmless warning on Unix and fail completely on Windows. Reported-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/20220304.165449.1200020258723305904.horikyota.ntt%40gmail.com
* VACUUM VERBOSE: tweak scanned_pages logic.Peter Geoghegan2022-03-13
| | | | | | | | | | | | Commit 872770fd6c taught VACUUM VERBOSE and autovacuum logging to display the total number of pages scanned by VACUUM. This information was also displayed as a percentage of rel_pages in parenthesis, which makes it easy to spot trends over time and across tables. The instrumentation displayed "0 scanned (0.00% of total)" for totally empty tables. Tweak the instrumentation: have it show "0 scanned (100.00% of total)" for empty tables instead. This approach is clearer and more consistent.
* vacuumlazy.c: Standardize rel_pages terminology.Peter Geoghegan2022-03-12
| | | | | | | | | | | | | | | | | VACUUM's rel_pages field indicates the size of the target heap rel just after the table_relation_vacuum() operation began. There are specific expectations around how rel_pages can be related to other nearby state. In particular, the range of rel_pages must contain every tuple in the relation whose tuple headers might contain an XID < OldestXmin. Consistently refer to the field as rel_pages to make this clearer and more discoverable. This is follow-up work to commit 73f6ec3d from earlier today. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20220311031351.sbge5m2bpvy2ttxg@alap3.anarazel.de
* vacuumlazy.c: document vistest and OldestXmin.Peter Geoghegan2022-03-12
| | | | | | | | | | | | | | | | | | | | | | | Explain the relationship between vacuumlazy.c's vistest and OldestXmin cutoffs. These closely related cutoffs are different in subtle but important ways. Also document a closely related rule: we must establish rel_pages _after_ OldestXmin to ensure that no XID < OldestXmin can be missed by lazy_scan_heap(). It's easier to explain these issues by initializing everything together, so consolidate initialization of vacrel state. Now almost every vacrel field is initialized by heap_vacuum_rel(). The only remaining exception is the dead_items array, which is still managed by lazy_scan_heap() due to interactions with how we initialize parallel VACUUM. Also move the process that updates pg_class entries for each index into heap_vacuum_rel(), and adjust related assertions. All pg_class updates now take place after lazy_scan_heap() returns, which seems clearer. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20211211045710.ljtuu4gfloh754rs@alap3.anarazel.de Discussion: https://postgr.es/m/CAH2-WznYsUxVT156rCQ+q=YD4S4=1M37hWvvHLz-H1pwSM8-Ew@mail.gmail.com