aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access/transam
Commit message (Collapse)AuthorAge
* Fix thinko in commentAlvaro Herrera2021-01-12
| | | | | | | | This comment has been wrong since its introduction in commit 2c03216d8311. Author: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAD21AoAzz6qipFJBbGEaHmyWxvvNDp8httbwLR9tUQWaTjUs2Q@mail.gmail.com
* Use vectored I/O to fill new WAL segments.Thomas Munro2021-01-11
| | | | | | | | | | | | | Instead of making many block-sized write() calls to fill a new WAL file with zeroes, make a smaller number of pwritev() calls (or various emulations). The actual number depends on the OS's IOV_MAX, which PG_IOV_MAX currently caps at 32. That means we'll write 256kB per call on typical systems. We may want to tune the number later with more experience. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGJA%2Bu-220VONeoREBXJ9P3S94Y7J%2BkqCnTYmahvZJwM%3Dg%40mail.gmail.com
* Rename "enum blacklist" to "uncommitted enums".Thomas Munro2021-01-05
| | | | | | We agreed to remove this terminology and use something more descriptive. Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue%40alap3.anarazel.de
* Update copyright for 2021Bruce Momjian2021-01-02
| | | | Backpatch-through: 9.5
* Revert "Add key management system" (978f869b99) & later commitsBruce Momjian2020-12-27
| | | | | | | | | | The patch needs test cases, reorganization, and cfbot testing. Technically reverts commits 5c31afc49d..e35b2bad1a (exclusive/inclusive) and 08db7c63f3..ccbe34139b. Reported-by: Tom Lane, Michael Paquier Discussion: https://postgr.es/m/E1ktAAG-0002V2-VB@gemulon.postgresql.org
* Add key management systemBruce Momjian2020-12-25
| | | | | | | | | | | | | | | This adds a key management system that stores (currently) two data encryption keys of length 128, 192, or 256 bits. The data keys are AES256 encrypted using a key encryption key, and validated via GCM cipher mode. A command to obtain the key encryption key must be specified at initdb time, and will be run at every database server start. New parameters allow a file descriptor open to the terminal to be passed. pg_upgrade support has also been added. Discussion: https://postgr.es/m/CA+fd4k7q5o6Nc_AaX6BcYM9yqTbC6_pnH-6nSD=54Zp6NBQTCQ@mail.gmail.com Discussion: https://postgr.es/m/20201202213814.GG20285@momjian.us Author: Masahiko Sawada, me, Stephen Frost
* Fix typos and grammar in docs and commentsMichael Paquier2020-12-24
| | | | | | | | This fixes several areas of the documentation and some comments in matters of style, grammar, or even format. Author: Justin Pryzby Discussion: https://postgr.es/m/20201222041153.GK30237@telsasoft.com
* Revert "Get rid of the dedicated latch for signaling the startup process".Fujii Masao2020-12-17
| | | | | | | | | | | | | | | | | Revert ac22929a26, as well as the followup fix 113d3591b8. Because it broke the assumption that the startup process waiting for the recovery conflict on buffer pin should be waken up only by buffer unpin or the timeout enabled in ResolveRecoveryConflictWithBufferPin(). It caused, for example, SIGHUP signal handler or walreceiver process to wake that startup process up unnecessarily frequently. Additionally, add the comments about why that dedicated latch that the reverted patch tried to get rid of should not be removed. Thanks to Kyotaro Horiguchi for the discussion. Author: Fujii Masao Discussion: https://postgr.es/m/d8c0c608-021b-3c73-fffd-3240829ee986@oss.nttdata.com
* Improve hash_create()'s API for some added robustness.Tom Lane2020-12-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Invent a new flag bit HASH_STRINGS to specify C-string hashing, which was formerly the default; and add assertions insisting that exactly one of the bits HASH_STRINGS, HASH_BLOBS, and HASH_FUNCTION be set. This is in hopes of preventing recurrences of the type of oversight fixed in commit a1b8aa1e4 (i.e., mistakenly omitting HASH_BLOBS). Also, when HASH_STRINGS is specified, insist that the keysize be more than 8 bytes. This is a heuristic, but it should catch accidental use of HASH_STRINGS for integer or pointer keys. (Nearly all existing use-cases set the keysize to NAMEDATALEN or more, so there's little reason to think this restriction should be problematic.) Tweak hash_create() to insist that the HASH_ELEM flag be set, and remove the defaults it had for keysize and entrysize. Since those defaults were undocumented and basically useless, no callers omitted HASH_ELEM anyway. Also, remove memset's zeroing the HASHCTL parameter struct from those callers that had one. This has never been really necessary, and while it wasn't a bad coding convention it was confusing that some callers did it and some did not. We might as well save a few cycles by standardizing on "not". Also improve the documentation for hash_create(). In passing, improve reinit.c's usage of a hash table by storing the key as a binary Oid rather than a string; and, since that's a temporary hash table, allocate it in CurrentMemoryContext for neatness. Discussion: https://postgr.es/m/590625.1607878171@sss.pgh.pa.us
* Add some checkpoint/restartpoint status to ps displayMichael Paquier2020-12-14
| | | | | | | | | | | | | | | | | This is done for end-of-recovery and shutdown checkpoints/restartpoints (end-of-recovery restartpoints don't exist) rather than all types of checkpoints, in cases where it may not be possible to rely on pg_stat_activity to get a status from the startup or checkpointer processes. For example, at the end of a crash recovery, this is useful to know if a checkpoint is running in the startup process, while previously the ps display may only show some information about "recovering" something, that can be confusing while a checkpoint runs. Author: Justin Pryzby Reviewed-by: Nathan Bossart, Kirk Jamison, Fujii Masao, Michael Paquier Discussion: https://postgr.es/m/20200818225238.GP17022@telsasoft.com
* Convert elog(LOG) calls to ereport() where appropriatePeter Eisentraut2020-12-04
| | | | | | | | | | | User-visible log messages should go through ereport(), so they are subject to translation. Many remaining elog(LOG) calls are really debugging calls. Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://www.postgresql.org/message-id/flat/92d6f545-5102-65d8-3c87-489f71ea0a37%40enterprisedb.com
* Centralize logic for skipping useless ereport/elog calls.Tom Lane2020-11-23
| | | | | | | | | | | | | | | | | | | | | While ereport() and elog() themselves are quite cheap when the error message level is too low to be printed, some places need to do substantial work before they can call those macros at all. To allow optimizing away such setup work when nothing is to be printed, make elog.c export a new function message_level_is_interesting(elevel) that reports whether ereport/elog will do anything. Make use of that in various places that had ad-hoc direct tests of log_min_messages etc. Also teach ProcSleep to use it to avoid some work. (There may well be other places that could usefully use this; I didn't search hard.) Within elog.c, refactor a little bit to avoid having duplicate copies of the policy-setting logic. When that code was written, we weren't relying on the availability of inline functions; so it had some duplications in the name of efficiency, which I got rid of. Alvaro Herrera and Tom Lane Discussion: https://postgr.es/m/129515.1606166429@sss.pgh.pa.us
* Replace a macro by a functionPeter Eisentraut2020-11-20
| | | | | | Using a macro is ugly and not justified here. Discussion: https://www.postgresql.org/message-id/flat/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b@2ndquadrant.com
* Emit log when restore_command succeeds but archived file faills to be restored.Fujii Masao2020-11-20
| | | | | | | | | | | | | | | Previously, when restore_command claimed to succeed but failed to restore the file with the right name, for example, due to mis-configuration of restore_command, no log message was reported. Then the recovery failed later with an error message not directly related to the issue. This commit changes the recovery so that a log message is emitted in this error case. This would enable us to investigate what happened in this case more easily. Author: Jeff Janes, Fujii Masao Reviewed-by: Pavel Borisov, Kyotaro Horiguchi Discussion: https://postgr.es/m/CAMkU=1xkFs3Omp4JR4wMYWdam_KLuj6LXnTYfU8u3T0h=PLLMQ@mail.gmail.com
* Rename PGPROC->vacuumFlags to statusFlagsAlvaro Herrera2020-11-16
| | | | | | | | | | | | | | | | | | | | With more flags associated to a PGPROC entry that are not related to vacuum (currently existing or planned), the name "statusFlags" describes its purpose better. (The same is done to the mirroring PROC_HDR->vacuumFlags.) No functional changes in this commit. This was suggested first by Hari Babu Kommi in [1] and then by Michael Paquier at [2]. [1] https://postgr.es/m/CAJrrPGcsDC-oy1AhqH0JkXYa0Z2AgbuXzHPpByLoBGMxfOZMEQ@mail.gmail.com [2] https://postgr.es/m/20200820060929.GB3730@paquier.xyz Author: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20201116182446.qcg3o6szo2zookyr@localhost
* Make the standby server promptly handle interrupt signals.Fujii Masao2020-11-16
| | | | | | | | | | | | | | | | This commit changes the startup process in the standby server so that it handles the interrupt signals after waiting for wal_retrieve_retry_interval on the latch and resetting it, before entering another wait on the latch. This change causes the standby server to promptly handle interrupt signals. Otherwise, previously, there was the case where the standby needs to wait extra five seconds to shutdown when the shutdown request arrived while the startup process was waiting for wal_retrieve_retry_interval on the latch. Author: Fujii Masao, but implementation idea is from Soumyadeep Chakraborty Reviewed-by: Soumyadeep Chakraborty Discussion: https://postgr.es/m/9d7e6ab0-8a53-ddb9-63cd-289bcb25fe0e@oss.nttdata.com
* Fix and simplify some usages of TimestampDifference().Tom Lane2020-11-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce TimestampDifferenceMilliseconds() to simplify callers that would rather have the difference in milliseconds, instead of the select()-oriented seconds-and-microseconds format. This gets rid of at least one integer division per call, and it eliminates some apparently-easy-to-mess-up arithmetic. Two of these call sites were in fact wrong: * pg_prewarm's autoprewarm_main() forgot to multiply the seconds by 1000, thus ending up with a delay 1000X shorter than intended. That doesn't quite make it a busy-wait, but close. * postgres_fdw's pgfdw_get_cleanup_result() thought it needed to compute microseconds not milliseconds, thus ending up with a delay 1000X longer than intended. Somebody along the way had noticed this problem but misdiagnosed the cause, and imposed an ad-hoc 60-second limit rather than fixing the units. This was relatively harmless in context, because we don't care that much about exactly how long this delay is; still, it's wrong. There are a few more callers of TimestampDifference() that don't have a direct need for seconds-and-microseconds, but can't use TimestampDifferenceMilliseconds() either because they do need microsecond precision or because they might possibly deal with intervals long enough to overflow 32-bit milliseconds. It might be worth inventing another API to improve that, but that seems outside the scope of this patch; so those callers are untouched here. Given the fact that we are fixing some bugs, and the likelihood that future patches might want to back-patch code that uses this new API, back-patch to all supported branches. Alexey Kondratov and Tom Lane Discussion: https://postgr.es/m/3b1c053a21c07c1ed5e00be3b2b855ef@postgrespro.ru
* In security-restricted operations, block enqueue of at-commit user code.Noah Misch2020-11-09
| | | | | | | | | | | | | | | | | | Specifically, this blocks DECLARE ... WITH HOLD and firing of deferred triggers within index expressions and materialized view queries. An attacker having permission to create non-temp objects in at least one schema could execute arbitrary SQL functions under the identity of the bootstrap superuser. One can work around the vulnerability by disabling autovacuum and not manually running ANALYZE, CLUSTER, REINDEX, CREATE INDEX, VACUUM FULL, or REFRESH MATERIALIZED VIEW. (Don't restore from pg_dump, since it runs some of those commands.) Plain VACUUM (without FULL) is safe, and all commands are fine when a trusted user owns the target object. Performance may degrade quickly under this workaround, however. Back-patch to 9.5 (all supported versions). Reviewed by Robert Haas. Reported by Etienne Stalmans. Security: CVE-2020-25695
* Fix unlinking of SLRU segments.Thomas Munro2020-11-05
| | | | | | | | | Commit dee663f7 intended to drop any queued up fsync requests before unlinking segment files, but missed a code path. Fix, by centralizing the forget-and-unlink code into a single function. Reported-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> Discussion: https://postgr.es/m/20201104013205.icogbi773przyny5%40development
* Fix segmentation fault that commit ac22929a26 caused.Fujii Masao2020-11-04
| | | | | | | | | | | | | | | | | Commit ac22929a26 changed recoveryWakeupLatch so that it's reset to NULL at the end of recovery. This change could cause a segmentation fault in the buildfarm member 'elver'. Previously the latch was reset to NULL after calling ShutdownWalRcv(). But there could be a window between ShutdownWalRcv() and the actual exit of walreceiver. If walreceiver set the latch during that window, the segmentation fault could happen. To fix the issue, this commit changes walreceiver so that it sets the latch only when the latch has not been reset to NULL yet. Author: Fujii Masao Discussion: https://postgr.es/m/5c1f8a85-747c-7bf9-241e-dd467d8a3586@iki.fi
* Get rid of the dedicated latch for signaling the startup process.Fujii Masao2020-11-04
| | | | | | | | | | | | | | This commit gets rid of the dedicated latch for signaling the startup process in favor of using its procLatch, since that comports better with possible generic signal handlers using that latch. Commit 1e53fe0e70 changed background processes so that they use standard SIGHUP handler. Like that, this commit also makes the startup process use standard SIGHUP handler to simplify the code. Author: Fujii Masao Reviewed-by: Bharath Rupireddy, Michael Paquier Discussion: https://postgr.es/m/CALj2ACXPorUqePswDtOeM_s82v9RW32E1fYmOPZ5NuE+TWKj_A@mail.gmail.com
* Use PG_GETARG_TRANSACTIONID where appropriatePeter Eisentraut2020-11-02
| | | | | | | | | | | | Some places were using PG_GETARG_UINT32 where PG_GETARG_TRANSACTIONID would be more appropriate. (Of course, they are the same internally, so there is no externally visible effect.) To do that, export PG_GETARG_TRANSACTIONID outside of xid.c. We also export PG_RETURN_TRANSACTIONID for symmetry, even though there are currently no external users. Author: Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com> Discussion: https://www.postgresql.org/message-id/flat/d8f6bdd536df403b9b33816e9f7e0b9d@G08CNEXMBPEKD05.g08.fujitsu.local
* Extend amcheck to check heap pages.Robert Haas2020-10-22
| | | | | | | | Mark Dilger, reviewed by Peter Geoghegan, Andres Freund, Álvaro Herrera, Michael Paquier, Amul Sul, and by me. Some last-minute cosmetic revisions by me. Discussion: http://postgr.es/m/12ED3DA8-25F0-4B68-937D-D907CFBF08E7@enterprisedb.com
* Fix typo in multixact.cMichael Paquier2020-10-08
| | | | | | | | AtEOXact_MultiXact() was referenced in two places with an incorrect routine name. Author: Hou Zhijie Discussion: https://postgr.es/m/1b41e9311e8f474cb5a360292f0b3cb1@G08CNEXMBPEKD05.g08.fujitsu.local
* Fix compilation warning in xlog.cMichael Paquier2020-10-06
| | | | | | | Oversight in 9d0bd95. Reported-by: Andres Freund Discussion: https://postgr.es/m/20201006023802.qqfi6m5bw5y77zql@alap3.anarazel.de
* Add pg_stat_wal statistics view.Fujii Masao2020-10-02
| | | | | | | | | | | | | | | | | | | | This view shows the statistics about WAL activity. Currently it has only two columns: wal_buffers_full and stats_reset. wal_buffers_full column indicates the number of times WAL data was written to the disk because WAL buffers got full. This information is useful when tuning wal_buffers. stats_reset column indicates the time at which these statistics were last reset. pg_stat_wal view is also the basic infrastructure to expose other various statistics about WAL activity later. Bump PGSTAT_FILE_FORMAT_ID due to the change in pgstat format. Bump catalog version. Author: Masahiro Ikeda Reviewed-by: Takayuki Tsunakawa, Kyotaro Horiguchi, Amit Kapila, Fujii Masao Discussion: https://postgr.es/m/188bd3f2d2233cf97753b5ced02bb050@oss.nttdata.com
* Add block information in error context of WAL REDO apply loopMichael Paquier2020-10-02
| | | | | | | | | | | | | | | | Providing this information can be useful for example when diagnosing problems related to recovery conflicts or for recovery issues without having to go through the output generated by pg_waldump to get some information about the blocks a WAL record works on. The block information is printed in the same format as pg_waldump. This already existed in xlog.c for debugging purposes with -DWAL_DEBUG, so adding the block information in the callback has required just a small refactoring. Author: Bertrand Drouvot Reviewed-by: Michael Paquier, Masahiko Sawada Discussion: https://postgr.es/m/c31e2cba-efda-762c-f4ad-5c25e5dac3d0@amazon.com
* Defer flushing of SLRU files.Thomas Munro2020-09-25
| | | | | | | | | | | | | | | | | | | | | | | | Previously, we called fsync() after writing out individual pg_xact, pg_multixact and pg_commit_ts pages due to cache pressure, leading to regular I/O stalls in user backends and recovery. Collapse requests for the same file into a single system call as part of the next checkpoint, as we already did for relation files, using the infrastructure developed by commit 3eb77eba. This can cause a significant improvement to recovery performance, especially when it's otherwise CPU-bound. Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer that it applies to all the SLRU mini-buffer-pools as well as the main buffer pool. Rearrange things so that data collected in CheckpointStats includes SLRU activity. Also remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions, because they were redundant after the shutdown checkpoint that immediately precedes them. (I'm not sure if they were ever needed, but they aren't now.) Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (parts) Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
* Standardize the printf format for st_sizePeter Eisentraut2020-09-24
| | | | | | Existing code used various inconsistent ways to printf struct stat's st_size member. The type of that is off_t, which is in most cases a signed 64-bit integer, so use the long long int format for it.
* Fix missing fsync of SLRU directories.Thomas Munro2020-09-24
| | | | | | | | | | | | | Harmonize behavior by moving reponsibility for fsyncing directories down into slru.c. In 10 and later, only the multixact directories were missed (see commit 1b02be21), and in older branches all SLRUs were missed. Back-patch to all supported releases. Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CA%2BhUKGLtsTUOScnNoSMZ-2ZLv%2BwGh01J6kAo_DM8mTRq1sKdSQ%40mail.gmail.com
* Add support for building GiST index by sorting.Heikki Linnakangas2020-09-17
| | | | | | | | | | | | | | | | | | | | | | This adds a new optional support function to the GiST access method: sortsupport. If it is defined, the GiST index is built by sorting all data to the order defined by the sortsupport's comparator function, and packing the tuples in that order to GiST pages. This is similar to how B-tree index build works, and is much faster than inserting the tuples one by one. The resulting index is smaller too, because the pages are packed more tightly, upto 'fillfactor'. The normal build method works by splitting pages, which tends to lead to more wasted space. The quality of the resulting index depends on how good the opclass-defined sort order is. A good order preserves locality of the input data. As the first user of this facility, add 'sortsupport' function to the point_ops opclass. It sorts the points in Z-order (aka Morton Code), by interleaving the bits of the X and Y coordinates. Author: Andrey Borodin Reviewed-by: Pavel Borisov, Thomas Munro Discussion: https://www.postgresql.org/message-id/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru
* Report resource usage at the end of recoveryDavid Rowley2020-09-16
| | | | | | | | | | Reporting this has been rather useful in some recent recovery speedup work. It also seems like something that will be useful to the average DBA too. Author: David Rowley Reviewed-by: Thomas Munro Discussion: https://postgr.es/m/CAApHDvqYVORiZxq2xPvP6_ndmmsTkvr6jSYv4UTNaFa5i1kd%3DQ%40mail.gmail.com
* Message fixes and style improvementsPeter Eisentraut2020-09-14
|
* Report expected contrecord length on mismatchAlvaro Herrera2020-09-04
| | | | | | | | When reading a WAL record fails to find continuation record(s) of the proper length, report what it expects, for clarity. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/20200903212152.GA15319@alvherre.pgsql
* Mark commit and abort WAL records with XLR_SPECIAL_REL_UPDATE.Heikki Linnakangas2020-08-17
| | | | | | | | | | | | | | | | If a commit or abort record includes "dropped relfilenodes", then replaying the record will remove data files. That is surely a "special rel update", but the records were not marked as such. Fix that, teach pg_rewind to expect and ignore them, and add a test case to cover it. It's always been like this, but no backporting for fear of breaking existing applications. If an application parsed the WAL but was not handling commit/abort records, it would stop working. That might be a good thing if it really needed to handle the dropped rels, but it will be caught when the application is updated to work with PostgreSQL v14 anyway. Discussion: https://www.postgresql.org/message-id/07b33e2c-46a6-86a1-5f9e-a7da73fddb95%40iki.fi Reviewed-by: Amit Kapila, Michael Paquier
* Correct several behavior descriptions in comments.Noah Misch2020-08-15
| | | | | | | | | Reuse cautionary language from src/test/ssl/README in src/test/kerberos/README. SLRUs have had access to six-character segments names since commit 73c986adde5d73a5e2555da9b5c8facedb146dcd, and recovery stopped calling HeapTupleHeaderAdvanceLatestRemovedXid() in commit 558a9165e081d1936573e5a7d576f5febd7fb55a. The other corrections are more self-evident.
* Prevent concurrent SimpleLruTruncate() for any given SLRU.Noah Misch2020-08-15
| | | | | | | | | | | | | | | | | The SimpleLruTruncate() header comment states the new coding rule. To achieve this, add locktype "frozenid" and two LWLocks. This closes a rare opportunity for data loss, which manifested as "apparent wraparound" or "could not access status of transaction" errors. Data loss is more likely in pg_multixact, due to released branches' thin margin between multiStopLimit and multiWrapLimit. If a user's physical replication primary logged ": apparent wraparound" messages, the user should rebuild standbys of that primary regardless of symptoms. At less risk is a cluster having emitted "not accepting commands" errors or "must be vacuumed" warnings at some point. One can test a cluster for this data loss by running VACUUM FREEZE in every database. Back-patch to 9.5 (all supported versions). Discussion: https://postgr.es/m/20190218073103.GA1434723@rfd.leadboat.com
* snapshot scalability: Move subxact info to ProcGlobal, remove PGXACT.Andres Freund2020-08-14
| | | | | | | | | | | | | | | | | | | | Similar to the previous changes this increases the chance that data frequently needed by GetSnapshotData() stays in l2 cache. In many workloads subtransactions are very rare, and this makes the check for that considerably cheaper. As this removes the last member of PGXACT, there is no need to keep it around anymore. On a larger 2 socket machine this and the two preceding commits result in a ~1.07x performance increase in read-only pgbench. For read-heavy mixed r/w workloads without row level contention, I see about 1.1x. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
* snapshot scalability: Move PGXACT->vacuumFlags to ProcGlobal->vacuumFlags.Andres Freund2020-08-14
| | | | | | | | | | | | | Similar to the previous commit this increases the chance that data frequently needed by GetSnapshotData() stays in l2 cache. As we now take care to not unnecessarily write to ProcGlobal->vacuumFlags, there should be very few modifications to the ProcGlobal->vacuumFlags array. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
* snapshot scalability: Introduce dense array of in-progress xids.Andres Freund2020-08-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new array contains the xids for all connected backends / in-use PGPROC entries in a dense manner (in contrast to the PGPROC/PGXACT arrays which can have unused entries interspersed). This improves performance because GetSnapshotData() always needs to scan the xids of all live procarray entries and now there's no need to go through the procArray->pgprocnos indirection anymore. As the set of running top-level xids changes rarely, compared to the number of snapshots taken, this substantially increases the likelihood of most data required for a snapshot being in l2 cache. In read-mostly workloads scanning the xids[] array will sufficient to build a snapshot, as most backends will not have an xid assigned. To keep the xid array dense ProcArrayRemove() needs to move entries behind the to-be-removed proc's one further up in the array. Obviously moving array entries cannot happen while a backend sets it xid. I.e. locking needs to prevent that array entries are moved while a backend modifies its xid. To avoid locking ProcArrayLock in GetNewTransactionId() - a fairly hot spot already - ProcArrayAdd() / ProcArrayRemove() now needs to hold XidGenLock in addition to ProcArrayLock. Adding / Removing a procarray entry is not a very frequent operation, even taking 2PC into account. Due to the above, the dense array entries can only be read or modified while holding ProcArrayLock and/or XidGenLock. This prevents a concurrent ProcArrayRemove() from shifting the dense array while it is accessed concurrently. While the new dense array is very good when needing to look at all xids it is less suitable when accessing a single backend's xid. In particular it would be problematic to have to acquire a lock to access a backend's own xid. Therefore a backend's xid is not just stored in the dense array, but also in PGPROC. This also allows a backend to only access the shared xid value when the backend had acquired an xid. The infrastructure added in this commit will be used for the remaining PGXACT fields in subsequent commits. They are kept separate to make review easier. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
* Fix obsolete comment in xlogutils.c.Peter Geoghegan2020-08-14
| | | | Oversight in commit 2c03216d831.
* snapshot scalability: Move PGXACT->xmin back to PGPROC.Andres Freund2020-08-13
| | | | | | | | | | | | | | | | | | | | | | Now that xmin isn't needed for GetSnapshotData() anymore, it leads to unnecessary cacheline ping-pong to have it in PGXACT, as it is updated considerably more frequently than the other PGXACT members. After the changes in dc7420c2c92, this is a very straight-forward change. For highly concurrent, snapshot acquisition heavy, workloads this change alone can significantly increase scalability. E.g. plain pgbench on a smaller 2 socket machine gains 1.07x for read-only pgbench, 1.22x for read-only pgbench when submitting queries in batches of 100, and 2.85x for batches of 100 'SELECT';. The latter numbers are obviously not to be expected in the real-world, but micro-benchmark the snapshot computation scalability (previously spending ~80% of the time in GetSnapshotData()). Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
* Fix out-of-date version reference, grammar.Andres Freund2020-08-12
| | | | | | Time appears to be passing fast. Reported-By: Peter Geoghegan <pg@bowt.ie>
* snapshot scalability: Don't compute global horizons while building snapshots.Andres Freund2020-08-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To make GetSnapshotData() more scalable, it cannot not look at at each proc's xmin: While snapshot contents do not need to change whenever a read-only transaction commits or a snapshot is released, a proc's xmin is modified in those cases. The frequency of xmin modifications leads to, particularly on higher core count systems, many cache misses inside GetSnapshotData(), despite the data underlying a snapshot not changing. That is the most significant source of GetSnapshotData() scaling poorly on larger systems. Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons / thresholds as it has so far. But we don't really have to: The horizons don't actually change that much between GetSnapshotData() calls. Nor are the horizons actually used every time a snapshot is built. The trick this commit introduces is to delay computation of accurate horizons until there use and using horizon boundaries to determine whether accurate horizons need to be computed. The use of RecentGlobal[Data]Xmin to decide whether a row version could be removed has been replaces with new GlobalVisTest* functions. These use two thresholds to determine whether a row can be pruned: 1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed are definitely still visible. 2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can definitely be removed GetSnapshotData() updates definitely_needed to be the xmin of the computed snapshot. When testing whether a row can be removed (with GlobalVisTestIsRemovableXid()) and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID < definitely_needed) the boundaries can be recomputed to be more accurate. As it is not cheap to compute accurate boundaries, we limit the number of times that happens in short succession. As the boundaries used by GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by GetSnapshotData()), it is likely that further test can benefit from an earlier computation of accurate horizons. To avoid regressing performance when old_snapshot_threshold is set (as that requires an accurate horizon to be computed), heap_page_prune_opt() doesn't unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the computation of the limited horizon, and the triggering of errors (with SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove tuples. This commit just removes the accesses to PGXACT->xmin from GetSnapshotData(), but other members of PGXACT residing in the same cache line are accessed. Therefore this in itself does not result in a significant improvement. Subsequent commits will take advantage of the fact that GetSnapshotData() now does not need to access xmins anymore. Note: This contains a workaround in heap_page_prune_opt() to keep the snapshot_too_old tests working. While that workaround is ugly, the tests currently are not meaningful, and it seems best to address them separately. Author: Andres Freund <andres@anarazel.de> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Thomas Munro <thomas.munro@gmail.com> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
* Track latest completed xid as a FullTransactionId.Andres Freund2020-08-11
| | | | | | | | | | | | | | | The reason for doing so is that a subsequent commit will need that to avoid wraparound issues. As the subsequent change is large this was split out for easier review. The reason this is not a perfect straight-forward change is that we do not want track 64bit xids in the procarray or the WAL. Therefore we need to advance lastestCompletedXid in relation to 32 bit xids. The code for that is now centralized in MaintainLatestCompletedXid*. Author: Andres Freund Reviewed-By: Thomas Munro, Robert Haas, David Rowley Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
* Rename VariableCacheData.nextFullXid to nextXid.Andres Freund2020-08-11
| | | | | | | | | Including Full in variable names duplicates the type information and leads to overly long names. As FullTransactionId cannot accidentally be casted to TransactionId that does not seem necessary. Author: Andres Freund Discussion: https://postgr.es/m/20200724011143.jccsyvsvymuiqfxu@alap3.anarazel.de
* Replace remaining StrNCpy() by strlcpy()Peter Eisentraut2020-08-10
| | | | | | | | | | | | | | | | | They are equivalent, except that StrNCpy() zero-fills the entire destination buffer instead of providing just one trailing zero. For all but a tiny number of callers, that's just overhead rather than being desirable. Remove StrNCpy() as it is now unused. In some cases, namestrcpy() is the more appropriate function to use. While we're here, simplify the API of namestrcpy(): Remove the return value, don't check for NULL input. Nothing was using that anyway. Also, remove a few unused name-related functions. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/44f5e198-36f6-6cdb-7fa9-60e34784daae%402ndquadrant.com
* Implement streaming mode in ReorderBuffer.Amit Kapila2020-08-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of serializing the transaction to disk after reaching the logical_decoding_work_mem limit in memory, we consume the changes we have in memory and invoke stream API methods added by commit 45fdc9738b. However, sometimes if we have incomplete toast or speculative insert we spill to the disk because we can't generate the complete tuple and stream. And, as soon as we get the complete tuple we stream the transaction including the serialized changes. We can do this incremental processing thanks to having assignments (associating subxact with toplevel xacts) in WAL right away, and thanks to logging the invalidation messages at each command end. These features are added by commits 0bead9af48 and c55040ccd0 respectively. Now that we can stream in-progress transactions, the concurrent aborts may cause failures when the output plugin consults catalogs (both system and user-defined). We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table scan APIs to the backend or WALSender decoding a specific uncommitted transaction. The decoding logic on the receipt of such a sqlerrcode aborts the decoding of the current transaction and continue with the decoding of other transactions. We have ReorderBufferTXN pointer in each ReorderBufferChange by which we know which xact it belongs to. The output plugin can use this to decide which changes to discard in case of stream_abort_cb (e.g. when a subxact gets discarded). We also provide a new option via SQL APIs to fetch the changes being streamed. Author: Dilip Kumar, Tomas Vondra, Amit Kapila, Nikhil Sontakke Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
* Change XID and mxact limits to warn at 40M and stop at 3M.Noah Misch2020-08-01
| | | | | | | | | | | | | | | We have edge-case bugs when assigning values in the last few dozen pages before the wrap limit. We may introduce similar bugs in the future. At default BLCKSZ, this makes such bugs unreachable outside of single-user mode. Also, when VACUUM began to consume mxacts, multiStopLimit did not change to compensate. pg_upgrade may fail on a cluster that was already printing "must be vacuumed" warnings. Follow the warning's instructions to clear the warning, then run pg_upgrade again. One can still, peacefully consume 98% of XIDs or mxacts, so DBAs need not change routine VACUUM settings. Discussion: https://postgr.es/m/20200621083513.GA3074645@rfd.leadboat.com
* Use pg_pread() and pg_pwrite() in slru.c.Thomas Munro2020-08-02
| | | | | | | | | | This avoids lseek() system calls at every SLRU I/O, as was done for relation files in commit c24dcd0c. Reviewed-by: Ashwin Agrawal <aagrawal@pivotal.io> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKG%2Biqke4uTRFj8D8uEUUgj%2BRokPSp%2BCWM6YYzaaamG9Wvg%40mail.gmail.com Discussion: https://postgr.es/m/CA%2BhUKGJ%2BoHhnvqjn3%3DHro7xu-YDR8FPr0FL6LF35kHRX%3D_bUzg%40mail.gmail.com